# LiteLLM AI Gateway

LiteLLM is an open-source AI Gateway that provides a unified OpenAI-compatible API for 100+ language model providers — including OpenAI, Anthropic, Azure, Bedrock, HuggingFace, and locally-hosted models. Deploy it on CLORE.AI to route, load-balance, and manage all your LLM API calls through a single endpoint with built-in cost tracking, rate limiting, and fallback logic.

The real power of LiteLLM shows up at scale: teams running mixed local+cloud stacks can hot-swap models without touching application code. Replace `gpt-4o` with `mistral-7b-local` in config, restart — done.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum          | Recommended                 |
| --------- | ---------------- | --------------------------- |
| RAM       | 4 GB             | 8 GB+                       |
| VRAM      | N/A (proxy only) | N/A                         |
| Disk      | 10 GB            | 20 GB+                      |
| GPU       | Not required     | Optional (for local models) |

{% hint style="info" %}
LiteLLM itself is a CPU-based proxy and doesn't require a GPU. However, deploying it on a CLORE.AI GPU server makes sense when you want to run local models (via Ollama, TGI, vLLM) alongside LiteLLM as a unified gateway on the same machine.
{% endhint %}

## Quick Deploy on CLORE.AI

**Docker Image:** `ghcr.io/berriai/litellm:main-latest`

**Ports:** `22/tcp`, `4000/http`

**Environment Variables:**

| Variable             | Example            | Description                   |
| -------------------- | ------------------ | ----------------------------- |
| `OPENAI_API_KEY`     | `sk-xxx...`        | OpenAI API key                |
| `ANTHROPIC_API_KEY`  | `sk-ant-xxx...`    | Anthropic API key             |
| `AZURE_API_KEY`      | `xxx...`           | Azure OpenAI key              |
| `LITELLM_MASTER_KEY` | `sk-my-master-key` | Master auth key for the proxy |
| `DATABASE_URL`       | `postgresql://...` | PostgreSQL for cost tracking  |
| `STORE_MODEL_IN_DB`  | `True`             | Persist model config to DB    |

## Step-by-Step Setup

### 1. Rent a Server on CLORE.AI

LiteLLM works great even on CPU-only servers. Go to [CLORE.AI Marketplace](https://clore.ai/marketplace) and filter for:

* Lowest price CPU servers for a pure proxy setup
* GPU servers (RTX 3090+) if you want to run local models too

### 2. SSH into Your Server

```bash
ssh -p <PORT> root@<SERVER_IP>
```

### 3. Create a Config File

LiteLLM uses a YAML config file to define models:

```bash
mkdir -p /root/litellm
cat > /root/litellm/config.yaml << 'EOF'
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"

  # Anthropic models
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: "os.environ/ANTHROPIC_API_KEY"

  # Local model via TGI (on same server, port 8080)
  - model_name: mistral-7b-local
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: "http://localhost:8080/v1"
      api_key: "none"

  # Alias for a cheap default. To load-balance, add multiple
  # entries that share the same model_name (see Load Balancing below)
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"
    model_info:
      mode: chat

litellm_settings:
  drop_params: True
  set_verbose: False
  num_retries: 3
  request_timeout: 60

general_settings:
  master_key: "sk-my-secret-master-key"  # Change this!
  alerting: []
EOF
```
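
Before launching, a quick sanity check that the YAML parses (this assumes `python3` with PyYAML is available on the host; any YAML linter does the job):

```bash
python3 -c "import yaml; yaml.safe_load(open('/root/litellm/config.yaml'))" \
  && echo "config.yaml parses OK"
```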

### 4. Launch LiteLLM

**Basic launch:**

```bash
docker run -d \
  --name litellm \
  --network host \
  -v /root/litellm/config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=sk-your-openai-key \
  -e ANTHROPIC_API_KEY=sk-ant-your-anthropic-key \
  -e LITELLM_MASTER_KEY=sk-my-secret-master-key \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml \
  --port 4000 \
  --host 0.0.0.0
```

**With PostgreSQL for cost tracking:**

First, start a PostgreSQL container:

```bash
docker run -d \
  --name postgres \
  -e POSTGRES_PASSWORD=litellm_pass \
  -e POSTGRES_DB=litellm \
  -p 5432:5432 \
  postgres:15

# Then launch LiteLLM with DB
docker run -d \
  --name litellm \
  --network host \
  -v /root/litellm/config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=sk-your-openai-key \
  -e ANTHROPIC_API_KEY=sk-ant-your-anthropic-key \
  -e LITELLM_MASTER_KEY=sk-my-secret-master-key \
  -e DATABASE_URL="postgresql://postgres:litellm_pass@localhost:5432/litellm" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml \
  --port 4000 \
  --host 0.0.0.0
```

**Using Docker Compose (recommended):**

```bash
cat > /root/litellm/docker-compose.yml << 'EOF'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=sk-my-secret-master-key
      - DATABASE_URL=postgresql://postgres:litellm_pass@db:5432/litellm
    command: --config /app/config.yaml --port 4000 --host 0.0.0.0
    depends_on:
      - db

  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: litellm_pass
      POSTGRES_DB: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
EOF

cd /root/litellm && docker compose up -d
```
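
Compose substitutes `${OPENAI_API_KEY}` and `${ANTHROPIC_API_KEY}` from your shell environment or from a `.env` file next to the compose file. If they aren't exported already, create the file before (re)running `docker compose up`:

```bash
cat > /root/litellm/.env << 'EOF'
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
EOF
```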

### 5. Verify the Server

```bash
# Check health
curl http://localhost:4000/health

# List available models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-my-secret-master-key"
```

### 6. Access via CLORE.AI HTTP Proxy

Your CLORE.AI http\_pub URL for port 4000:

```
https://<order-id>-4000.clore.ai/v1
```

Use this as your `api_base` in any OpenAI-compatible client.
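
For example, with the OpenAI Python SDK (substitute your own order ID and master key):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<order-id>-4000.clore.ai/v1",  # your http_pub URL
    api_key="sk-my-secret-master-key",               # your LITELLM_MASTER_KEY
)
```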

***

## Usage Examples

### Example 1: Direct API Call via Proxy

```bash
curl http://localhost:4000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-secret-master-key" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of Germany?"}
    ]
  }'
```

### Example 2: OpenAI Python SDK with LiteLLM Proxy

```python
from openai import OpenAI

# Just change base_url and api_key — everything else is identical
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-my-secret-master-key",
)

# Use any model from your config
response = client.chat.completions.create(
    model="gpt-4o-mini",  # or "claude-3-5-sonnet", "mistral-7b-local"
    messages=[{"role": "user", "content": "Summarize the benefits of GPU computing."}],
)
print(response.choices[0].message.content)

# Switch models with zero code changes
response2 = client.chat.completions.create(
    model="claude-3-5-sonnet",
    messages=[{"role": "user", "content": "Same question, different model."}],
)
print(response2.choices[0].message.content)
```

### Example 3: LiteLLM Python SDK (Direct)

```python
import litellm

# Use directly without proxy
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="your-openai-key",
)

# Or route through your proxy
response = litellm.completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:4000",
    api_key="sk-my-secret-master-key",
)
```

### Example 4: Fallback Configuration

Configure automatic fallbacks between models:

```yaml
# In config.yaml
# Fallback targets must be model_name groups defined in model_list
# (gpt-4o, claude-3-5-sonnet, mistral-7b-local come from the config above)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"

router_settings:
  routing_strategy: least-busy
  num_retries: 3
  model_group_alias:
    gpt-4-fallback: gpt-4o  # "gpt-4-fallback" resolves to the gpt-4o group
  fallbacks:
    - gpt-4o:
        - claude-3-5-sonnet
        - mistral-7b-local
```
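
To exercise the group, call the alias: the router resolves it to `gpt-4o`, and on provider errors retries against the listed fallbacks in order:

```bash
curl http://localhost:4000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-secret-master-key" \
  -d '{"model": "gpt-4-fallback", "messages": [{"role": "user", "content": "ping"}]}'
```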

### Example 5: Cost Tracking Dashboard

After enabling PostgreSQL, access spend analytics:

```bash
# Get spend by user
curl http://localhost:4000/global/spend/users \
  -H "Authorization: Bearer sk-my-secret-master-key"

# Get spend by model
curl http://localhost:4000/global/spend/models \
  -H "Authorization: Bearer sk-my-secret-master-key"

# Generate spend report
curl "http://localhost:4000/global/spend?start_date=2024-01-01&end_date=2024-12-31" \
  -H "Authorization: Bearer sk-my-secret-master-key"
```

***

## Configuration

### Virtual Keys (Per-User API Keys)

Create separate keys with rate limits and budgets:

```bash
# Create a key with budget
curl http://localhost:4000/key/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-secret-master-key" \
  -d '{
    "models": ["gpt-4o-mini", "claude-3-5-sonnet"],
    "duration": "30d",
    "max_budget": 10.0,
    "metadata": {"user_id": "user_123"}
  }'
```
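
The JSON response includes a generated `key`; end users authenticate with it in place of the master key (the key below is a placeholder for whatever `/key/generate` returned):

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-generated-key-from-response" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'
```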

### Load Balancing

```yaml
model_list:
  # Round-robin between multiple OpenAI API keys
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2

router_settings:
  routing_strategy: least-busy  # or: simple-shuffle, latency-based-routing
```

### Caching

```yaml
litellm_settings:
  cache: True
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour
```
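
This config assumes Redis is reachable on `localhost:6379`. If LiteLLM runs with `--network host` as in the basic launch above, a host-networked Redis container on the same server works:

```bash
docker run -d --name redis --network host redis:7
```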

### Rate Limiting

```yaml
general_settings:
  default_team_settings:
    tpm_limit: 100000   # tokens per minute
    rpm_limit: 1000     # requests per minute
```
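
Limits can also be attached to individual virtual keys at creation time via `/key/generate` (see Virtual Keys above); here `tpm_limit` and `rpm_limit` apply per key:

```bash
curl http://localhost:4000/key/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-secret-master-key" \
  -d '{"models": ["gpt-4o-mini"], "tpm_limit": 20000, "rpm_limit": 100}'
```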

***

## Performance Tips

### 1. Enable Caching for Repeated Prompts

For RAG or chatbot applications with common questions, Redis caching can cut costs by 30–70% and drop P50 latency below 5 ms on cache hits:

```yaml
litellm_settings:
  cache: True
  cache_params:
    type: redis
    host: localhost
    port: 6379
```

### 2. Use Async Requests

```python
import asyncio
import litellm

async def batch_complete(prompts):
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": p}],
        )
        for p in prompts
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(batch_complete(["Hello", "World", "Test"]))
```

### 3. Local Model Routing

Route cheap/simple requests to local models on Clore.ai GPUs, complex ones to GPT-4o. Expose both under distinct model names and pick per request (a client-side sketch follows the cost note below):

```yaml
model_list:
  - model_name: local-cheap
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: "http://localhost:8080/v1"
      api_key: "none"

  - model_name: cloud-smart
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
```

A typical setup: run Mistral 7B or Llama 3 8B locally on a Clore.ai RTX 3090 ($0.10–0.15/hr), handle 80% of traffic there, escalate complex tasks to GPT-4o. Cost savings of 3–5× vs cloud-only are common.
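
LiteLLM's router balances load and handles fallbacks, but it doesn't classify prompts by difficulty; that decision lives in your client. A minimal sketch, assuming the `local-cheap`/`cloud-smart` names from the config above and a deliberately crude length heuristic:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-my-secret-master-key")

def complete(prompt: str):
    # Crude heuristic: short prompts go to the local model, long ones to the cloud
    model = "local-cheap" if len(prompt) < 500 else "cloud-smart"
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

print(complete("What's 2+2?").choices[0].message.content)
```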

### 4. Set Timeouts and Retries

```yaml
litellm_settings:
  request_timeout: 30
  num_retries: 3
  retry_after: 5
```

***

## Clore.ai GPU Recommendations

LiteLLM itself needs no GPU — it's a proxy. The GPU choice only matters when you co-deploy local inference alongside it.

| Local Model                               | GPU                | Why                                                             |
| ----------------------------------------- | ------------------ | --------------------------------------------------------------- |
| Mistral 7B / Llama 3 8B (bf16)             | **RTX 3090** 24 GB | Fits comfortably, \~200 tok/s batched throughput                 |
| Mixtral 8×7B (AWQ 4-bit)                   | **RTX 4090** 24 GB | Higher memory bandwidth than the 3090; 4-bit Mixtral just fits   |
| Llama 3 70B (AWQ/8-bit) or multi-model     | **A100 80 GB**     | Quantized 70B needs \~40–70 GB; also fits several 7–13B at once  |

**Recommended stack for a solo developer:** RTX 3090 + Mistral 7B + LiteLLM gateway. Total cost on Clore.ai: \~$0.12/hr. Handles \~50 req/min easily, with GPT-4o fallback for complex tasks.

**Team / production stack:** A100 80GB, run Llama 3 70B + LiteLLM + PostgreSQL. Serves 20+ concurrent users, full cost tracking, zero cloud LLM spend for most requests.
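
As a concrete starting point for the solo-developer stack, a vLLM backend might be launched like this (the port mapping matches the `localhost:8080` api_base in the config above; the HF token is only needed for gated models; verify flags against current vLLM docs):

```bash
docker run -d --name vllm --gpus all \
  -p 8080:8000 \
  -e HUGGING_FACE_HUB_TOKEN=hf_your_token \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3
```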

***

## Troubleshooting

### Problem: "model not found"

Ensure the model name in your request matches exactly what's in `config.yaml`:

```bash
curl http://localhost:4000/v1/models -H "Authorization: Bearer sk-my-secret-master-key"
```

### Problem: "authentication failed"

Check your `LITELLM_MASTER_KEY` environment variable and use it as the Bearer token.

### Problem: Config changes not reflected

Restart the container after config changes:

```bash
docker restart litellm
```

### Problem: High latency on first request

LiteLLM loads model configs on startup. The first few requests may be slower as connections are established.

### Problem: Database connection errors

```bash
# Check PostgreSQL is running
docker logs postgres

# Verify connection string format
DATABASE_URL="postgresql://user:password@host:5432/dbname"
```
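
A direct connectivity test from inside the container (assumes the container name `postgres` from the setup above):

```bash
docker exec postgres psql -U postgres -d litellm -c "SELECT 1;"
```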

### Problem: 429 rate limit errors from providers

Configure fallbacks:

```yaml
litellm_settings:
  num_retries: 5
  fallbacks:
    - gpt-4o: [claude-3-5-sonnet]
```

***

## Deployment Setups and Pricing

Which server to rent depends on whether you route to cloud APIs, local models, or both:

| Setup                | GPU             | Clore.ai Price | Use Case                                           |
| -------------------- | --------------- | -------------- | -------------------------------------------------- |
| Cloud API proxy only | CPU-only        | \~$0.02/hr     | Route to OpenAI, Anthropic, Gemini — no GPU needed |
| Local vLLM backend   | RTX 3090 (24GB) | \~$0.12/hr     | Self-hosted 7B–13B models with LiteLLM as frontend |
| Local vLLM backend   | RTX 4090 (24GB) | \~$0.70/hr     | Higher throughput 7B–34B local models              |
| Local vLLM backend   | A100 40GB       | \~$1.20/hr     | 70B models, production local serving               |

{% hint style="info" %}
**Most common setup:** Run LiteLLM as a unified proxy in front of your Clore.ai-hosted vLLM/Ollama instances. This gives you provider fallbacks, rate limiting, cost tracking, and OpenAI-compatible routing — while keeping all inference local and cheap.

**Example cost:** Run LiteLLM proxy on a CPU-only instance (\~$0.02/hr) and point it at a vLLM server on RTX 3090 (\~$0.12/hr). Total cost \~$0.14/hr for a production-ready, self-hosted LLM API with fallbacks, logging, and rate limiting.
{% endhint %}

***

## Links

* [GitHub](https://github.com/BerriAI/litellm)
* [Documentation](https://docs.litellm.ai)
* [Docker Hub / GHCR](https://github.com/BerriAI/litellm/pkgs/container/litellm)
* [Supported Providers](https://docs.litellm.ai/docs/providers)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)
