# LiteLLM AI Gateway

LiteLLM is an open-source AI Gateway that provides a unified OpenAI-compatible API for 100+ language model providers — including OpenAI, Anthropic, Azure, Bedrock, HuggingFace, and locally-hosted models. Deploy it on CLORE.AI to route, load-balance, and manage all your LLM API calls through a single endpoint with built-in cost tracking, rate limiting, and fallback logic.

The real power of LiteLLM shows up at scale: teams running mixed local+cloud stacks can hot-swap models without touching application code. Replace `gpt-4o` with `mistral-7b-local` in config, restart — done.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum          | Recommended                 |
| --------- | ---------------- | --------------------------- |
| RAM       | 4 GB             | 8 GB+                       |
| VRAM      | N/A (proxy only) | N/A                         |
| Disk      | 10 GB            | 20 GB+                      |
| GPU       | Not required     | Optional (for local models) |

{% hint style="info" %}
LiteLLM itself is a CPU-based proxy and doesn't require a GPU. However, deploying it on a CLORE.AI GPU server makes sense when you want to run local models (via Ollama, TGI, vLLM) alongside LiteLLM as a unified gateway on the same machine.
{% endhint %}

## Quick Deploy on CLORE.AI

**Docker Image:** `ghcr.io/berriai/litellm:main-latest`

**Ports:** `22/tcp`, `4000/http`

**Environment Variables:**

| Variable             | Example            | Description                   |
| -------------------- | ------------------ | ----------------------------- |
| `OPENAI_API_KEY`     | `sk-xxx...`        | OpenAI API key                |
| `ANTHROPIC_API_KEY`  | `sk-ant-xxx...`    | Anthropic API key             |
| `AZURE_API_KEY`      | `xxx...`           | Azure OpenAI key              |
| `LITELLM_MASTER_KEY` | `sk-my-master-key` | Master auth key for the proxy |
| `DATABASE_URL`       | `postgresql://...` | PostgreSQL for cost tracking  |
| `STORE_MODEL_IN_DB`  | `True`             | Persist model config to DB    |

## Step-by-Step Setup

### 1. Rent a Server on CLORE.AI

LiteLLM works great even on CPU-only servers. Go to [CLORE.AI Marketplace](https://clore.ai/marketplace) and filter for:

* Lowest price CPU servers for a pure proxy setup
* GPU servers (RTX 3090+) if you want to run local models too

### 2. SSH into Your Server

```bash
ssh -p <PORT> root@<SERVER_IP>
```

### 3. Create a Config File

LiteLLM uses a YAML config file to define models:

```bash
mkdir -p /root/litellm
cat > /root/litellm/config.yaml << 'EOF'
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"

  # Anthropic models
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: "os.environ/ANTHROPIC_API_KEY"

  # Local model via TGI (on same server, port 8080)
  - model_name: mistral-7b-local
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: "http://localhost:8080/v1"
      api_key: "none"

  # Alias entry (to load-balance, define multiple entries that share
  # one model_name; see the Load Balancing section below)
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"
    model_info:
      mode: chat

litellm_settings:
  drop_params: True
  set_verbose: False
  num_retries: 3
  request_timeout: 60

general_settings:
  master_key: "sk-my-secret-master-key"  # Change this!
  alerting: []
EOF
```

### 4. Launch LiteLLM

**Basic launch:**

```bash
docker run -d \
  --name litellm \
  --network host \
  -v /root/litellm/config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=sk-your-openai-key \
  -e ANTHROPIC_API_KEY=sk-ant-your-anthropic-key \
  -e LITELLM_MASTER_KEY=sk-my-secret-master-key \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml \
  --port 4000 \
  --host 0.0.0.0
```

**With PostgreSQL for cost tracking:**

First, start a PostgreSQL container:

```bash
docker run -d \
  --name postgres \
  -e POSTGRES_PASSWORD=litellm_pass \
  -e POSTGRES_DB=litellm \
  -p 5432:5432 \
  postgres:15

# Then launch LiteLLM with DB
docker run -d \
  --name litellm \
  -v /root/litellm/config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=sk-your-openai-key \
  -e ANTHROPIC_API_KEY=sk-ant-your-anthropic-key \
  -e LITELLM_MASTER_KEY=sk-my-secret-master-key \
  -e DATABASE_URL="postgresql://postgres:litellm_pass@localhost:5432/litellm" \
  --network host \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml \
  --port 4000 \
  --host 0.0.0.0
```

**Using Docker Compose (recommended):**

```bash
cat > /root/litellm/docker-compose.yml << 'EOF'
version: "3.8"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=sk-my-secret-master-key
      - DATABASE_URL=postgresql://postgres:litellm_pass@db:5432/litellm
    command: --config /app/config.yaml --port 4000 --host 0.0.0.0
    depends_on:
      - db

  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: litellm_pass
      POSTGRES_DB: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
EOF

cd /root/litellm && docker compose up -d
```

### 5. Verify the Server

```bash
# Check health
curl http://localhost:4000/health

# List available models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-my-secret-master-key"
```

### 6. Access via CLORE.AI HTTP Proxy

Your CLORE.AI http\_pub URL for port 4000:

```
https://<order-id>-4000.clore.ai/v1
```

Use this as your `api_base` in any OpenAI-compatible client.
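
For example, with the OpenAI Python SDK (substitute your actual order ID and master key):

```python
from openai import OpenAI

# <order-id> is a placeholder; use the ID from your CLORE.AI order
client = OpenAI(
    base_url="https://<order-id>-4000.clore.ai/v1",
    api_key="sk-my-secret-master-key",  # your LITELLM_MASTER_KEY
)
```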

***

## Usage Examples

### Example 1: Direct API Call via Proxy

```bash
curl http://localhost:4000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-secret-master-key" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of Germany?"}
    ]
  }'
```

### Example 2: OpenAI Python SDK with LiteLLM Proxy

```python
from openai import OpenAI

# Just change base_url and api_key — everything else is identical
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-my-secret-master-key",
)

# Use any model from your config
response = client.chat.completions.create(
    model="gpt-4o-mini",  # or "claude-3-5-sonnet", "mistral-7b-local"
    messages=[{"role": "user", "content": "Summarize the benefits of GPU computing."}],
)
print(response.choices[0].message.content)

# Switch models with zero code changes
response2 = client.chat.completions.create(
    model="claude-3-5-sonnet",
    messages=[{"role": "user", "content": "Same question, different model."}],
)
print(response2.choices[0].message.content)
```

### Example 3: LiteLLM Python SDK (Direct)

```python
import litellm

# Use directly without proxy
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="your-openai-key",
)

# Or route through your proxy
response = litellm.completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:4000",
    api_key="sk-my-secret-master-key",
)
```

### Example 4: Fallback Configuration

Configure automatic fallbacks between models:

```yaml
# In config.yaml
model_list:
  - model_name: smart-fallback
    litellm_params:
      model: gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"

router_settings:
  routing_strategy: least-busy
  # model_group_alias maps an alias to a single existing model group;
  # the fallback chain itself lives under `fallbacks` below
  model_group_alias:
    "gpt-4-fallback": "gpt-4o"
  num_retries: 3
  fallbacks:
    - gpt-4o:
        - claude-3-5-sonnet
        - mistral-7b-local
```
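
Fallbacks can also be supplied per request through the LiteLLM SDK instead of config (a sketch; assumes the model names defined above):

```python
import litellm

# If gpt-4o fails, LiteLLM retries the fallback models in order
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    fallbacks=["claude-3-5-sonnet", "mistral-7b-local"],
)
```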

### Example 5: Cost Tracking Dashboard

After enabling PostgreSQL, access spend analytics:

```bash
# Get spend by user
curl http://localhost:4000/global/spend/users \
  -H "Authorization: Bearer sk-my-secret-master-key"

# Get spend by model
curl http://localhost:4000/global/spend/models \
  -H "Authorization: Bearer sk-my-secret-master-key"

# Generate spend report
curl "http://localhost:4000/global/spend?start_date=2024-01-01&end_date=2024-12-31" \
  -H "Authorization: Bearer sk-my-secret-master-key"
```

***

## Configuration

### Virtual Keys (Per-User API Keys)

Create separate keys with rate limits and budgets:

```bash
# Create a key with budget
curl http://localhost:4000/key/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-secret-master-key" \
  -d '{
    "models": ["gpt-4o-mini", "claude-3-5-sonnet"],
    "duration": "30d",
    "max_budget": 10.0,
    "metadata": {"user_id": "user_123"}
  }'
```
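
The JSON response contains the generated key, which the user then presents as their Bearer token. The same call from Python (a sketch using `requests`; fields follow the request above):

```python
import requests

# Generate a scoped, budgeted key via the proxy's admin API
resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-my-secret-master-key"},
    json={"models": ["gpt-4o-mini"], "duration": "30d", "max_budget": 10.0},
)
user_key = resp.json()["key"]  # hand this key to the end user
print(user_key)
```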

### Load Balancing

```yaml
model_list:
  # Spread load across multiple OpenAI API keys
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2

router_settings:
  routing_strategy: least-busy  # or: simple-shuffle, latency-based-routing
```

### Caching

```yaml
litellm_settings:
  cache: True
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour
```

### Rate Limiting

```yaml
general_settings:
  default_team_settings:
    tpm_limit: 100000   # tokens per minute
    rpm_limit: 1000     # requests per minute
```
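
Limits can also be attached to individual virtual keys at creation time (a sketch; `tpm_limit` and `rpm_limit` are per-key fields of the key-generation endpoint):

```python
import requests

# Key restricted to 10,000 tokens/min and 60 requests/min
requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-my-secret-master-key"},
    json={"models": ["gpt-4o-mini"], "tpm_limit": 10000, "rpm_limit": 60},
)
```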

***

## Performance Tips

### 1. Enable Caching for Repeated Prompts

For RAG or chatbot applications that see repeated questions, Redis caching can cut provider spend substantially, since cache hits never leave the proxy and return in milliseconds:

```yaml
litellm_settings:
  cache: True
  cache_params:
    type: redis
    host: localhost
    port: 6379
```
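
A quick way to see the effect, assuming the proxy and Redis from this guide are running locally (the second identical request should return almost instantly):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1",
                api_key="sk-my-secret-master-key")

def timed(prompt: str) -> float:
    start = time.time()
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return time.time() - start

print(f"first call:  {timed('What is LiteLLM?'):.2f}s")  # provider round-trip
print(f"second call: {timed('What is LiteLLM?'):.2f}s")  # served from Redis cache
```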

### 2. Use Async Requests

```python
import asyncio
import litellm

async def batch_complete(prompts):
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": p}],
        )
        for p in prompts
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(batch_complete(["Hello", "World", "Test"]))
```

### 3. Local Model Routing

Route cheap/simple requests to a local model on your Clore.ai GPU and escalate complex ones to GPT-4o. Define the local model alongside the cloud models from Step 3; your application then selects by model name:

```yaml
model_list:
  # Cheap local default (gpt-4o from Step 3 remains the escalation target)
  - model_name: local-default
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: "http://localhost:8080/v1"
      api_key: "none"
```

A typical setup: run Mistral 7B or Llama 3 8B locally on a Clore.ai RTX 3090 ($0.10–0.15/hr), handle 80% of traffic there, escalate complex tasks to GPT-4o. Cost savings of 3–5× vs cloud-only are common.
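
Since the gateway routes purely by model name, the "local first, escalate when needed" policy fits in a few lines of application code (a sketch; the `complex_task` flag is an application-level heuristic, not a LiteLLM feature):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1",
                api_key="sk-my-secret-master-key")

def complete(prompt: str, complex_task: bool = False):
    # Default to the cheap local 7B model; escalate flagged tasks to GPT-4o
    model = "gpt-4o" if complex_task else "local-default"
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

print(complete("Translate 'hello' to German").choices[0].message.content)
```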

### 4. Set Timeouts and Retries

```yaml
litellm_settings:
  request_timeout: 30
  num_retries: 3
  retry_after: 5
```

***

## Clore.ai GPU Recommendations

LiteLLM itself needs no GPU — it's a proxy. The GPU choice only matters when you co-deploy local inference alongside it.

| Local Model                               | GPU                | Why                                                             |
| ----------------------------------------- | ------------------ | --------------------------------------------------------------- |
| Mistral 7B / Llama 3 8B (bf16)            | **RTX 3090** 24 GB | Fits comfortably, \~200 tok/s throughput                        |
| Mixtral 8×7B (AWQ 4-bit)                  | **RTX 4090** 24 GB | Higher memory bandwidth than the 3090; 8×7B at 4-bit just fits in 24 GB |
| Llama 3 70B (bf16) or multi-model serving | **A100 80 GB**     | Run multiple 7–13B models simultaneously; HBM2e for low latency |

**Recommended stack for a solo developer:** RTX 3090 + Mistral 7B + LiteLLM gateway. Total cost on Clore.ai: \~$0.12/hr. Handles \~50 req/min easily, with GPT-4o fallback for complex tasks.

**Team / production stack:** A100 80GB, run Llama 3 70B + LiteLLM + PostgreSQL. Serves 20+ concurrent users, full cost tracking, zero cloud LLM spend for most requests.

***

## Troubleshooting

### Problem: "model not found"

Ensure the model name in your request matches exactly what's in `config.yaml`:

```bash
curl http://localhost:4000/v1/models -H "Authorization: Bearer sk-my-secret-master-key"
```

### Problem: "authentication failed"

Check your `LITELLM_MASTER_KEY` environment variable and use it as the Bearer token.

### Problem: Config changes not reflected

Restart the container after config changes:

```bash
docker restart litellm
```

### Problem: High latency on first request

LiteLLM loads model configs on startup. The first few requests may be slower as connections are established.
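
A warm-up request right after deploy absorbs that cost before real traffic arrives (a sketch):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1",
                api_key="sk-my-secret-master-key")

# Cheap one-token request to establish provider connections
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)
```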

### Problem: Database connection errors

```bash
# Check PostgreSQL is running
docker logs postgres

# Verify connection string format
DATABASE_URL="postgresql://user:password@host:5432/dbname"
```

### Problem: 429 rate limit errors from providers

Configure fallbacks:

```yaml
litellm_settings:
  num_retries: 5
  fallbacks:
    - gpt-4o: [claude-3-5-sonnet]
```

***

## Clore.ai Setup Costs

Approximate Clore.ai pricing by deployment pattern: cloud-API-only proxying needs no GPU at all, while local backends scale with model size.

| Setup                | GPU             | Clore.ai Price | Use Case                                           |
| -------------------- | --------------- | -------------- | -------------------------------------------------- |
| Cloud API proxy only | CPU-only        | \~$0.02/hr     | Route to OpenAI, Anthropic, Gemini — no GPU needed |
| Local vLLM backend   | RTX 3090 (24GB) | \~$0.12/hr     | Self-hosted 7B–13B models with LiteLLM as frontend |
| Local vLLM backend   | RTX 4090 (24GB) | \~$0.70/hr     | Higher throughput 7B–34B local models              |
| Local vLLM backend   | A100 40GB       | \~$1.20/hr     | 70B models, production local serving               |

{% hint style="info" %}
**Most common setup:** Run LiteLLM as a unified proxy in front of your Clore.ai-hosted vLLM/Ollama instances. This gives you provider fallbacks, rate limiting, cost tracking, and OpenAI-compatible routing — while keeping all inference local and cheap.

**Example cost:** Run LiteLLM proxy on a CPU-only instance (\~$0.02/hr) and point it at a vLLM server on RTX 3090 (\~$0.12/hr). Total cost \~$0.14/hr for a production-ready, self-hosted LLM API with fallbacks, logging, and rate limiting.
{% endhint %}

***

## Links

* [GitHub](https://github.com/BerriAI/litellm)
* [Documentation](https://docs.litellm.ai)
* [Docker Hub / GHCR](https://github.com/BerriAI/litellm/pkgs/container/litellm)
* [Supported Providers](https://docs.litellm.ai/docs/providers)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

