LiteLLM AI Gateway

Deploy LiteLLM as an AI Gateway proxy for 100+ LLMs on Clore.ai GPUs

LiteLLM is an open-source AI Gateway that provides a unified OpenAI-compatible API for 100+ language model providers — including OpenAI, Anthropic, Azure, Bedrock, HuggingFace, and locally-hosted models. Deploy it on CLORE.AI to route, load-balance, and manage all your LLM API calls through a single endpoint with built-in cost tracking, rate limiting, and fallback logic.

The real power of LiteLLM shows up at scale: teams running mixed local+cloud stacks can hot-swap models without touching application code. Replace gpt-4o with mistral-7b-local in config, restart — done.


Server Requirements

| Parameter | Minimum | Recommended |
|---|---|---|
| RAM | 4 GB | 8 GB+ |
| VRAM | N/A (proxy only) | N/A |
| Disk | 10 GB | 20 GB+ |
| GPU | Not required | Optional (for local models) |


LiteLLM itself is a CPU-based proxy and doesn't require a GPU. However, deploying it on a CLORE.AI GPU server makes sense when you want to run local models (via Ollama, TGI, vLLM) alongside LiteLLM as a unified gateway on the same machine.

Quick Deploy on CLORE.AI

Docker Image: ghcr.io/berriai/litellm:main-latest

Ports: 22/tcp, 4000/http

Environment Variables:

| Variable | Example | Description |
|---|---|---|
| OPENAI_API_KEY | sk-xxx... | OpenAI API key |
| ANTHROPIC_API_KEY | sk-ant-xxx... | Anthropic API key |
| AZURE_API_KEY | xxx... | Azure OpenAI key |
| LITELLM_MASTER_KEY | sk-my-master-key | Master auth key for the proxy |
| DATABASE_URL | postgresql://... | PostgreSQL for cost tracking |
| STORE_MODEL_IN_DB | True | Persist model config to DB |

Step-by-Step Setup

1. Rent a Server on CLORE.AI

LiteLLM works well even on CPU-only servers. Go to the CLORE.AI Marketplace and filter for:

  • Lowest price CPU servers for a pure proxy setup

  • GPU servers (RTX 3090+) if you want to run local models too

2. SSH into Your Server
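
Connect with the SSH command shown on your server card in the CLORE.AI dashboard; the host and port below are placeholders:

```shell
ssh -p <SSH_PORT> root@<SERVER_IP>
```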

3. Create a Config File

LiteLLM uses a YAML config file to define models:
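
A minimal `config.yaml` sketch. The model aliases and the local Ollama entry are illustrative; the `os.environ/` prefix tells LiteLLM to read each key from the container's environment instead of storing it in the file:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: mistral-7b-local        # optional: local model served by Ollama
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434
```

Clients then request models by `model_name`; the `litellm_params.model` value decides which provider the call is routed to.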

4. Launch LiteLLM

Basic launch:
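
A sketch assuming `config.yaml` sits in the current directory; the container entrypoint accepts `--config`:

```shell
docker run -d --name litellm \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e LITELLM_MASTER_KEY=sk-my-master-key \
  -e OPENAI_API_KEY=sk-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```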

With PostgreSQL for cost tracking:

First, start a PostgreSQL container:
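
One way to wire the two containers together; the network name and database credentials are placeholders:

```shell
docker network create llm-net

# PostgreSQL for spend logs
docker run -d --name litellm-db --network llm-net \
  -e POSTGRES_USER=llmproxy \
  -e POSTGRES_PASSWORD=dbpassword \
  -e POSTGRES_DB=litellm \
  postgres:16

# LiteLLM pointed at the database over the shared network
docker run -d --name litellm --network llm-net \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e LITELLM_MASTER_KEY=sk-my-master-key \
  -e DATABASE_URL=postgresql://llmproxy:dbpassword@litellm-db:5432/litellm \
  -e STORE_MODEL_IN_DB=True \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```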

Using Docker Compose (recommended):
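
A `docker-compose.yml` sketch covering both services (credentials are placeholders); bring it up with `docker compose up -d`:

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml"]
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      LITELLM_MASTER_KEY: sk-my-master-key
      DATABASE_URL: postgresql://llmproxy:dbpassword@db:5432/litellm
      STORE_MODEL_IN_DB: "True"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: llmproxy
      POSTGRES_PASSWORD: dbpassword
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```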

5. Verify the Server
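
Two quick checks, assuming the master key from the launch step; the liveness probe needs no auth, listing models does:

```shell
curl http://localhost:4000/health/liveliness

curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-my-master-key"
```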

6. Access via CLORE.AI HTTP Proxy

Find the http_pub URL that CLORE.AI assigns to port 4000 on your server's dashboard card, and use it as the api_base in any OpenAI-compatible client.


Usage Examples

Example 1: Direct API Call via Proxy
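
A plain curl against the proxy's OpenAI-compatible endpoint; the URL is a placeholder for your http_pub address, and the master key is the one set at launch:

```shell
curl https://<your-http-pub-url>/v1/chat/completions \
  -H "Authorization: Bearer sk-my-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello from Clore!"}]
  }'
```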

Example 2: OpenAI Python SDK with LiteLLM Proxy
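
The stock OpenAI SDK works unchanged; only `base_url` and the key differ (URL is a placeholder):

```python
from openai import OpenAI

# Point the client at the LiteLLM proxy; the master key (or a
# virtual key) replaces the provider API key.
client = OpenAI(
    base_url="https://<your-http-pub-url>",
    api_key="sk-my-master-key",
)

resp = client.chat.completions.create(
    model="gpt-4o",  # any model_name from config.yaml
    messages=[{"role": "user", "content": "Summarize LiteLLM in one sentence."}],
)
print(resp.choices[0].message.content)
```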

Example 3: LiteLLM Python SDK (Direct)
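
The `litellm` package can also be used as a library, with no proxy in between; it translates the call into the provider's native API:

```python
import os
from litellm import completion

os.environ["OPENAI_API_KEY"] = "sk-xxx"  # placeholder

resp = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```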

Example 4: Fallback Configuration

Configure automatic fallbacks between models:
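
A sketch of the config-level fallback list; each alias named here must exist in `model_list`:

```yaml
litellm_settings:
  # if a gpt-4o call errors out, retry it on these aliases in order
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet", "mistral-7b-local"]}]
```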

Example 5: Cost Tracking Dashboard

After enabling PostgreSQL, access spend analytics:
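
Two entry points, assuming the master key from the launch step; the proxy ships an admin UI, and spend records are also queryable over the API:

```shell
# open https://<your-http-pub-url>/ui in a browser and log in with the master key

# or pull raw spend logs over the API
curl http://localhost:4000/spend/logs \
  -H "Authorization: Bearer sk-my-master-key"
```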


Configuration

Virtual Keys (Per-User API Keys)

Create separate keys with rate limits and budgets:
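
A sketch using the proxy's `/key/generate` endpoint (field values are illustrative); the response contains a new `sk-...` key to hand to the user:

```shell
curl http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-my-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "mistral-7b-local"],
    "max_budget": 10.0,
    "duration": "30d",
    "rpm_limit": 60
  }'
```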

Load Balancing
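
Give several deployments the same `model_name` and LiteLLM spreads requests across them; a sketch with placeholder Azure values:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o              # same public name, second deployment
    litellm_params:
      model: azure/<your-deployment-name>
      api_key: os.environ/AZURE_API_KEY
      api_base: https://<your-resource>.openai.azure.com

router_settings:
  routing_strategy: simple-shuffle  # random spread across deployments
```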

Caching
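
A Redis-backed response cache sketch, assuming a Redis instance reachable at the host/port shown; identical requests within the TTL are answered from cache:

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 600          # seconds a cached response stays valid
```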

Rate Limiting
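
Per-deployment limits can be set directly in `model_list` (per-user limits go on virtual keys instead); the numbers here are illustrative:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 100        # max requests/min for this deployment
      tpm: 100000     # max tokens/min
```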


Performance Tips

1. Enable Caching for Repeated Prompts

For RAG or chatbot applications with common questions, Redis caching cuts costs by 30–70% and drops P50 latency to <5ms on cache hits:
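
A rough way to see the effect once caching is enabled: send the same request twice and compare timings. This assumes a running proxy at the URL shown; the second call should return in milliseconds.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-my-master-key")

def timed(prompt: str) -> float:
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return time.perf_counter() - t0

first = timed("What is LiteLLM?")   # hits the provider
second = timed("What is LiteLLM?")  # identical request: served from cache
print(f"first={first:.2f}s second={second:.3f}s")
```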

2. Use Async Requests
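
Concurrent requests keep the proxy's throughput up instead of serializing round-trips; a sketch with the async OpenAI client (proxy URL and key as before):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:4000", api_key="sk-my-master-key")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def main():
    prompts = ["Define RAG.", "Define LoRA.", "Define KV cache."]
    # fire all requests concurrently instead of one at a time
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for a in answers:
        print(a)

asyncio.run(main())
```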

3. Local Model Routing

Route cheap/simple requests to local models on Clore.ai GPUs, complex ones to GPT-4:
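
One simple pattern: expose two aliases and let the application pick per request. The alias names are illustrative:

```yaml
model_list:
  - model_name: cheap               # local Mistral via Ollama on the same machine
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434
  - model_name: smart               # escalation path for hard requests
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```

The application then sends `model: "cheap"` for routine traffic and `model: "smart"` only when needed, with no other code changes.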

A typical setup: run Mistral 7B or Llama 3 8B locally on a Clore.ai RTX 3090 ($0.10–0.15/hr), handle 80% of traffic there, escalate complex tasks to GPT-4o. Cost savings of 3–5× vs cloud-only are common.

4. Set Timeouts and Retries
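
A global sketch via `litellm_settings` (values are illustrative; per-model overrides go in `litellm_params`):

```yaml
litellm_settings:
  request_timeout: 60   # seconds before a provider call is abandoned
  num_retries: 3        # retries before the request fails over or errors
```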


Clore.ai GPU Recommendations

LiteLLM itself needs no GPU — it's a proxy. The GPU choice only matters when you co-deploy local inference alongside it.

| Local Model | GPU | Why |
|---|---|---|
| Mistral 7B / Llama 3 8B (bf16) | RTX 3090 24 GB | Fits comfortably, ~200 tok/s throughput |
| Mixtral 8×7B (AWQ 4-bit) | RTX 4090 24 GB | Higher memory bandwidth than the 3090; 4-bit Mixtral fits in 24 GB |
| Llama 3 70B (quantized) or multi-model serving | A100 80 GB | Even 4-bit 70B weights need ~40 GB; HBM2e for low latency, with room to run several 7–13B models side by side |

Recommended stack for a solo developer: RTX 3090 + Mistral 7B + LiteLLM gateway. Total cost on Clore.ai: ~$0.12/hr. Handles ~50 req/min easily, with GPT-4o fallback for complex tasks.

Team / production stack: A100 80GB, run Llama 3 70B + LiteLLM + PostgreSQL. Serves 20+ concurrent users, full cost tracking, zero cloud LLM spend for most requests.


Troubleshooting

Problem: "model not found"

Ensure the model name in your request matches exactly what's in config.yaml:
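
The request's `model` field must equal a `model_name` alias, not the underlying provider path. Listing what the proxy actually serves is the quickest check (master key assumed from setup):

```shell
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-my-master-key"
```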

Problem: "authentication failed"

Check your LITELLM_MASTER_KEY environment variable and use it as the Bearer token.

Problem: Config changes not reflected

Restart the container after config changes:
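
Assuming the container was named `litellm` at launch:

```shell
docker restart litellm
```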

Problem: High latency on first request

LiteLLM loads model configs on startup. The first few requests may be slower as connections are established.

Problem: Database connection errors
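
Check that the proxy can actually reach PostgreSQL with the credentials in DATABASE_URL; container and user names below match the earlier setup sketch and are placeholders otherwise:

```shell
# look for connection errors in the proxy's own logs
docker logs litellm --tail 50

# verify PostgreSQL accepts connections with the same credentials
docker exec litellm-db psql -U llmproxy -d litellm -c "SELECT 1;"
```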

Problem: 429 rate limit errors from providers

Configure fallbacks:
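
A sketch combining retries, a fallback alias, and deployment cooldown so a rate-limited provider is rested instead of hammered (aliases must exist in `model_list`; numbers are illustrative):

```yaml
litellm_settings:
  num_retries: 3
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet"]}]

router_settings:
  allowed_fails: 3      # failures tolerated per deployment...
  cooldown_time: 30     # ...before it is benched for 30 seconds
```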


Clore.ai Setup and Pricing Overview

LiteLLM is an API gateway/proxy — it doesn't do inference itself. GPU selection depends on whether you're routing to cloud APIs or local models.

| Setup | GPU | Clore.ai Price | Use Case |
|---|---|---|---|
| Cloud API proxy only | CPU-only | ~$0.02/hr | Route to OpenAI, Anthropic, Gemini — no GPU needed |
| Local vLLM backend | RTX 3090 (24 GB) | ~$0.12/hr | Self-hosted 7B–13B models with LiteLLM as frontend |
| Local vLLM backend | RTX 4090 (24 GB) | ~$0.70/hr | Higher-throughput 7B–34B local models |
| Local vLLM backend | A100 40 GB | ~$1.20/hr | Quantized 70B models, production local serving |


Most common setup: Run LiteLLM as a unified proxy in front of your Clore.ai-hosted vLLM/Ollama instances. This gives you provider fallbacks, rate limiting, cost tracking, and OpenAI-compatible routing — while keeping all inference local and cheap.

Example cost: Run LiteLLM proxy on a CPU-only instance ($0.02/hr) and point it at a vLLM server on RTX 3090 ($0.12/hr). Total cost ~$0.14/hr for a production-ready, self-hosted LLM API with fallbacks, logging, and rate limiting.

