LiteLLM AI Gateway

Deploy LiteLLM as an AI Gateway proxy for 100+ LLMs on Clore.ai GPUs

LiteLLM is an open-source AI Gateway that provides a unified OpenAI-compatible API for 100+ language model providers — including OpenAI, Anthropic, Azure, Bedrock, HuggingFace, and locally-hosted models. Deploy it on CLORE.AI to route, load-balance, and manage all your LLM API calls through a single endpoint with built-in cost tracking, rate limiting, and fallback logic.

The real power of LiteLLM shows up at scale: teams running mixed local+cloud stacks can hot-swap models without touching application code. Replace gpt-4o with mistral-7b-local in config, restart — done.


Server Requirements

| Parameter | Minimum | Recommended |
|---|---|---|
| RAM | 4 GB | 8 GB+ |
| VRAM | N/A (proxy only) | N/A |
| Disk | 10 GB | 20 GB+ |
| GPU | Not required | Optional (for local models) |


LiteLLM itself is a CPU-based proxy and doesn't require a GPU. However, deploying it on a CLORE.AI GPU server makes sense when you want to run local models (via Ollama, TGI, vLLM) alongside LiteLLM as a unified gateway on the same machine.

Quick Deploy on CLORE.AI

Docker Image: ghcr.io/berriai/litellm:main-latest

Ports: 22/tcp, 4000/http

Environment Variables:

| Variable | Example | Description |
|---|---|---|
| OPENAI_API_KEY | sk-xxx... | OpenAI API key |
| ANTHROPIC_API_KEY | sk-ant-xxx... | Anthropic API key |
| AZURE_API_KEY | xxx... | Azure OpenAI key |
| LITELLM_MASTER_KEY | sk-my-master-key | Master auth key for the proxy |
| DATABASE_URL | postgresql://... | PostgreSQL for cost tracking |
| STORE_MODEL_IN_DB | True | Persist model config to DB |

Step-by-Step Setup

1. Rent a Server on CLORE.AI

LiteLLM works well even on CPU-only servers. Go to the CLORE.AI Marketplace and filter for:

  • Lowest price CPU servers for a pure proxy setup

  • GPU servers (RTX 3090+) if you want to run local models too

2. SSH into Your Server
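
Connect with the SSH command shown on your server card in the CLORE.AI dashboard; the host and port below are placeholders:

```shell
ssh -p <SSH_PORT> root@<SERVER_IP>
```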

3. Create a Config File

LiteLLM uses a YAML config file to define models:
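
A minimal `config.yaml` sketch. The model aliases and the local Ollama entry are illustrative; the `os.environ/` prefix tells LiteLLM to read each key from the container's environment instead of storing it in the file:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: mistral-7b-local        # optional: local model served by Ollama
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434
```

Clients then request models by `model_name`; the `litellm_params.model` value decides which provider the call is routed to.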

4. Launch LiteLLM

Basic launch:
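
A sketch assuming `config.yaml` sits in the current directory; the container entrypoint accepts `--config`:

```shell
docker run -d --name litellm \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e LITELLM_MASTER_KEY=sk-my-master-key \
  -e OPENAI_API_KEY=sk-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```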

With PostgreSQL for cost tracking:

First, start a PostgreSQL container:
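
One way to wire the two containers together; the network name and database credentials are placeholders:

```shell
docker network create llm-net

# PostgreSQL for spend logs
docker run -d --name litellm-db --network llm-net \
  -e POSTGRES_USER=llmproxy \
  -e POSTGRES_PASSWORD=dbpassword \
  -e POSTGRES_DB=litellm \
  postgres:16

# LiteLLM pointed at the database over the shared network
docker run -d --name litellm --network llm-net \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e LITELLM_MASTER_KEY=sk-my-master-key \
  -e DATABASE_URL=postgresql://llmproxy:dbpassword@litellm-db:5432/litellm \
  -e STORE_MODEL_IN_DB=True \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```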

Using Docker Compose (recommended):
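
A `docker-compose.yml` sketch covering both services (credentials are placeholders); bring it up with `docker compose up -d`:

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml"]
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      LITELLM_MASTER_KEY: sk-my-master-key
      DATABASE_URL: postgresql://llmproxy:dbpassword@db:5432/litellm
      STORE_MODEL_IN_DB: "True"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: llmproxy
      POSTGRES_PASSWORD: dbpassword
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```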

5. Verify the Server
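
Two quick checks, assuming the master key from the launch step; the liveness probe needs no auth, listing models does:

```shell
curl http://localhost:4000/health/liveliness

curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-my-master-key"
```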

6. Access via CLORE.AI HTTP Proxy

Find the http_pub URL that CLORE.AI assigns to port 4000 on your server's dashboard card, and use it as the api_base in any OpenAI-compatible client.


Usage Examples

Example 1: Direct API Call via Proxy
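
A plain curl against the proxy's OpenAI-compatible endpoint; the URL is a placeholder for your http_pub address, and the master key is the one set at launch:

```shell
curl https://<your-http-pub-url>/v1/chat/completions \
  -H "Authorization: Bearer sk-my-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello from Clore!"}]
  }'
```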

Example 2: OpenAI Python SDK with LiteLLM Proxy
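
The stock OpenAI SDK works unchanged; only `base_url` and the key differ (URL is a placeholder):

```python
from openai import OpenAI

# Point the client at the LiteLLM proxy; the master key (or a
# virtual key) replaces the provider API key.
client = OpenAI(
    base_url="https://<your-http-pub-url>",
    api_key="sk-my-master-key",
)

resp = client.chat.completions.create(
    model="gpt-4o",  # any model_name from config.yaml
    messages=[{"role": "user", "content": "Summarize LiteLLM in one sentence."}],
)
print(resp.choices[0].message.content)
```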

Example 3: LiteLLM Python SDK (Direct)
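
The `litellm` package can also be used as a library, with no proxy in between; it translates the call into the provider's native API:

```python
import os
from litellm import completion

os.environ["OPENAI_API_KEY"] = "sk-xxx"  # placeholder

resp = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```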

Example 4: Fallback Configuration

Configure automatic fallbacks between models:
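
A sketch of the config-level fallback list; each alias named here must exist in `model_list`:

```yaml
litellm_settings:
  # if a gpt-4o call errors out, retry it on these aliases in order
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet", "mistral-7b-local"]}]
```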

Example 5: Cost Tracking Dashboard

After enabling PostgreSQL, access spend analytics:
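
Two entry points, assuming the master key from the launch step; the proxy ships an admin UI, and spend records are also queryable over the API:

```shell
# open https://<your-http-pub-url>/ui in a browser and log in with the master key

# or pull raw spend logs over the API
curl http://localhost:4000/spend/logs \
  -H "Authorization: Bearer sk-my-master-key"
```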


Configuration

Virtual Keys (Per-User API Keys)

Create separate keys with rate limits and budgets:
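
A sketch using the proxy's `/key/generate` endpoint (field values are illustrative); the response contains a new `sk-...` key to hand to the user:

```shell
curl http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-my-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "mistral-7b-local"],
    "max_budget": 10.0,
    "duration": "30d",
    "rpm_limit": 60
  }'
```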

Load Balancing
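
Give several deployments the same `model_name` and LiteLLM spreads requests across them; a sketch with placeholder Azure values:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o              # same public name, second deployment
    litellm_params:
      model: azure/<your-deployment-name>
      api_key: os.environ/AZURE_API_KEY
      api_base: https://<your-resource>.openai.azure.com

router_settings:
  routing_strategy: simple-shuffle  # random spread across deployments
```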

Caching
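
A Redis-backed response cache sketch, assuming a Redis instance reachable at the host/port shown; identical requests within the TTL are answered from cache:

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 600          # seconds a cached response stays valid
```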

Rate Limiting
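
Per-deployment limits can be set directly in `model_list` (per-user limits go on virtual keys instead); the numbers here are illustrative:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 100        # max requests/min for this deployment
      tpm: 100000     # max tokens/min
```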


Performance Tips

1. Enable Caching for Repeated Prompts

For RAG or chatbot applications with common questions, Redis caching cuts costs by 30–70% and drops P50 latency to <5ms on cache hits:
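
A rough way to see the effect once caching is enabled: send the same request twice and compare timings. This assumes a running proxy at the URL shown; the second call should return in milliseconds.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-my-master-key")

def timed(prompt: str) -> float:
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return time.perf_counter() - t0

first = timed("What is LiteLLM?")   # hits the provider
second = timed("What is LiteLLM?")  # identical request: served from cache
print(f"first={first:.2f}s second={second:.3f}s")
```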

2. Use Async Requests
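
Concurrent requests keep the proxy's throughput up instead of serializing round-trips; a sketch with the async OpenAI client (proxy URL and key as before):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:4000", api_key="sk-my-master-key")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def main():
    prompts = ["Define RAG.", "Define LoRA.", "Define KV cache."]
    # fire all requests concurrently instead of one at a time
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for a in answers:
        print(a)

asyncio.run(main())
```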

3. Local Model Routing

Route cheap/simple requests to local models on Clore.ai GPUs, complex ones to GPT-4:
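
One simple pattern: expose two aliases and let the application pick per request. The alias names are illustrative:

```yaml
model_list:
  - model_name: cheap               # local Mistral via Ollama on the same machine
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434
  - model_name: smart               # escalation path for hard requests
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```

The application then sends `model: "cheap"` for routine traffic and `model: "smart"` only when needed, with no other code changes.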

A typical setup: run Mistral 7B or Llama 3 8B locally on a Clore.ai RTX 3090 ($0.10–0.15/hr), handle 80% of traffic there, escalate complex tasks to GPT-4o. Cost savings of 3–5× vs cloud-only are common.

4. Set Timeouts and Retries
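
A global sketch via `litellm_settings` (values are illustrative; per-model overrides go in `litellm_params`):

```yaml
litellm_settings:
  request_timeout: 60   # seconds before a provider call is abandoned
  num_retries: 3        # retries before the request fails over or errors
```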


Clore.ai GPU Recommendations

LiteLLM itself needs no GPU — it's a proxy. The GPU choice only matters when you co-deploy local inference alongside it.

| Local Model | GPU | Why |
|---|---|---|
| Mistral 7B / Llama 3 8B (bf16) | RTX 3090 24 GB | Fits comfortably, ~200 tok/s throughput |
| Mixtral 8×7B (AWQ 4-bit) | RTX 4090 24 GB | Higher memory bandwidth than the 3090; 4-bit Mixtral fits in 24 GB |
| Llama 3 70B (quantized) or multi-model serving | A100 80 GB | Even 4-bit 70B weights need ~40 GB; HBM2e for low latency, with room to run several 7–13B models side by side |

Recommended stack for a solo developer: RTX 3090 + Mistral 7B + LiteLLM gateway. Total cost on Clore.ai: ~$0.12/hr. Handles ~50 req/min easily, with GPT-4o fallback for complex tasks.

Team / production stack: A100 80GB, run Llama 3 70B + LiteLLM + PostgreSQL. Serves 20+ concurrent users, full cost tracking, zero cloud LLM spend for most requests.


Troubleshooting

Problem: "model not found"

Ensure the model name in your request matches exactly what's in config.yaml:
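
The request's `model` field must equal a `model_name` alias, not the underlying provider path. Listing what the proxy actually serves is the quickest check (master key assumed from setup):

```shell
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-my-master-key"
```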

Problem: "authentication failed"

Check your LITELLM_MASTER_KEY environment variable and use it as the Bearer token.

Problem: Config changes not reflected

Restart the container after config changes:
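
Assuming the container was named `litellm` at launch:

```shell
docker restart litellm
```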

Problem: High latency on first request

LiteLLM loads model configs on startup. The first few requests may be slower as connections are established.

Problem: Database connection errors
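
Check that the proxy can actually reach PostgreSQL with the credentials in DATABASE_URL; container and user names below match the earlier setup sketch and are placeholders otherwise:

```shell
# look for connection errors in the proxy's own logs
docker logs litellm --tail 50

# verify PostgreSQL accepts connections with the same credentials
docker exec litellm-db psql -U llmproxy -d litellm -c "SELECT 1;"
```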

Problem: 429 rate limit errors from providers

Configure fallbacks:
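
A sketch combining retries, a fallback alias, and deployment cooldown so a rate-limited provider is rested instead of hammered (aliases must exist in `model_list`; numbers are illustrative):

```yaml
litellm_settings:
  num_retries: 3
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet"]}]

router_settings:
  allowed_fails: 3      # failures tolerated per deployment...
  cooldown_time: 30     # ...before it is benched for 30 seconds
```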


Clore.ai Setup and Pricing Overview

LiteLLM is an API gateway/proxy — it doesn't do inference itself. GPU selection depends on whether you're routing to cloud APIs or local models.

| Setup | GPU | Clore.ai Price | Use Case |
|---|---|---|---|
| Cloud API proxy only | CPU-only | ~$0.02/hr | Route to OpenAI, Anthropic, Gemini — no GPU needed |
| Local vLLM backend | RTX 3090 (24 GB) | ~$0.12/hr | Self-hosted 7B–13B models with LiteLLM as frontend |
| Local vLLM backend | RTX 4090 (24 GB) | ~$0.70/hr | Higher-throughput 7B–34B local models |
| Local vLLM backend | A100 40 GB | ~$1.20/hr | Quantized 70B models, production local serving |


Most common setup: Run LiteLLM as a unified proxy in front of your Clore.ai-hosted vLLM/Ollama instances. This gives you provider fallbacks, rate limiting, cost tracking, and OpenAI-compatible routing — while keeping all inference local and cheap.

Example cost: Run LiteLLM proxy on a CPU-only instance ($0.02/hr) and point it at a vLLM server on RTX 3090 ($0.12/hr). Total cost ~$0.14/hr for a production-ready, self-hosted LLM API with fallbacks, logging, and rate limiting.

