# LLM Serving: Ollama vs vLLM vs TGI

Choose the right LLM serving solution for your needs on CLORE.AI.

{% hint style="success" %}
All options available on [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**2025 Update:** SGLang has emerged as a top-tier framework, often **outperforming vLLM** in throughput and TTFT benchmarks. Both vLLM v0.7 and SGLang v0.4 are recommended for production workloads.
{% endhint %}

## Quick Decision Guide

| Use Case                           | Best Choice            | Why                               |
| ---------------------------------- | ---------------------- | --------------------------------- |
| Quick testing & chat               | **Ollama**             | Easiest setup, fastest startup    |
| Production API (max throughput)    | **SGLang** or **vLLM** | Highest throughput in 2025        |
| Reasoning models (DeepSeek-R1)     | **SGLang**             | Best support for reasoning chains |
| HuggingFace integration            | **TGI**                | Native HF support                 |
| Local development                  | **Ollama**             | Works everywhere                  |
| High concurrency                   | **SGLang** or **vLLM** | Continuous batching               |
| Multi-modal (TTS, STT, Embeddings) | **LocalAI**            | All-in-one solution               |
| Streaming apps                     | **vLLM** or **SGLang** | Both excellent                    |

## Startup Time Comparison

| Solution | Typical Startup | Notes                     |
| -------- | --------------- | ------------------------- |
| Ollama   | 30-60 seconds   | Fastest, lightweight      |
| SGLang   | 3-8 minutes     | Downloads model from HF   |
| vLLM     | 5-15 minutes    | Downloads model from HF   |
| TGI      | 3-10 minutes    | Downloads model from HF   |
| LocalAI  | 5-10 minutes    | Pre-loads multiple models |

{% hint style="info" %}
HTTP 502 errors during startup are normal: the service is still loading the model. Retry once initialization completes.
{% endhint %}
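
A client can treat those 502s as "still loading" and poll until the endpoint answers. A minimal sketch (the URL, timeout, and poll interval below are placeholders; adjust them for your instance):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url, timeout_s=900, interval_s=10,
                     opener=urllib.request.urlopen):
    """Poll a health or /v1/models endpoint until it returns HTTP 200.

    502s and connection errors during model load are expected and retried.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet, or still returning 5xx
        time.sleep(interval_s)
    return False

# Example: wait_until_ready('http://localhost:8000/v1/models')
```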

***

## Overview Comparison

| Feature               | Ollama          | vLLM        | SGLang                 | TGI             | LocalAI         |
| --------------------- | --------------- | ----------- | ---------------------- | --------------- | --------------- |
| **Ease of Setup**     | ⭐⭐⭐⭐⭐           | ⭐⭐⭐         | ⭐⭐⭐                    | ⭐⭐⭐             | ⭐⭐⭐⭐            |
| **Performance**       | ⭐⭐⭐             | ⭐⭐⭐⭐⭐       | ⭐⭐⭐⭐⭐                  | ⭐⭐⭐⭐            | ⭐⭐⭐             |
| **Model Support**     | ⭐⭐⭐⭐            | ⭐⭐⭐⭐⭐       | ⭐⭐⭐⭐⭐                  | ⭐⭐⭐⭐            | ⭐⭐⭐⭐            |
| **API Compatibility** | Custom + OpenAI | OpenAI      | OpenAI                 | Custom + OpenAI | OpenAI          |
| **Multi-GPU**         | Limited         | Excellent   | Excellent              | Good            | Limited         |
| **Memory Efficiency** | Good            | Excellent   | Excellent              | Very Good       | Good            |
| **Multi-Modal**       | Vision only     | Vision only | Vision only            | No              | TTS, STT, Embed |
| **Startup Time**      | 30-60 sec       | 5-15 min    | 3-8 min                | 3-10 min        | 5-10 min        |
| **Reasoning Models**  | Limited         | Good        | Excellent              | Good            | Limited         |
| **Best For**          | Development     | Production  | Production + Reasoning | HF Ecosystem    | Multi-modal     |

***

## 2025 Benchmarks: DeepSeek-R1-32B

### TTFT, TPOT & Throughput (A100 80GB, batch=32, input=512, output=512)

| Framework       | TTFT (ms) | TPOT (ms/tok) | Throughput (tok/s) | Notes                      |
| --------------- | --------- | ------------- | ------------------ | -------------------------- |
| **SGLang v0.4** | **180**   | **14**        | **2,850**          | Best overall 2025          |
| **vLLM v0.7**   | 240       | 17            | 2,400              | Excellent, close to SGLang |
| llama.cpp       | 420       | 28            | 1,100              | CPU+GPU, quantized         |
| Ollama          | 510       | 35            | 820                | Ease of use priority       |

> **TTFT** = Time to First Token (latency). **TPOT** = Time Per Output Token. Lower is better for both.
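
The two metrics combine into a back-of-envelope end-to-end latency: latency ≈ TTFT + TPOT × (output_tokens − 1). Applied to the table above:

```python
def generation_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Time to first token, then per-token decode time for the remaining tokens."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

print(generation_latency_ms(180, 14, 512))  # SGLang v0.4 -> 7334 ms (~7.3 s)
print(generation_latency_ms(240, 17, 512))  # vLLM v0.7  -> 8927 ms (~8.9 s)
```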

### Throughput Comparison (RTX 4090, Llama 3.1 8B, 10 concurrent users)

| Framework   | Tokens/sec | Concurrent Users | Notes                   |
| ----------- | ---------- | ---------------- | ----------------------- |
| SGLang v0.4 | 920        | 20-30            | Radix attention caching |
| vLLM v0.7   | 870        | 20-30            | PagedAttention          |
| TGI         | 550        | 10-20            |                         |
| Ollama      | 160\*      | —                | Sequential by default   |

\*Ollama serves requests sequentially by default
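
Recent Ollama versions can serve requests concurrently via environment variables (variable names per the Ollama FAQ; defaults vary by version, so treat these values as a starting point, not a recommendation):

```shell
# Allow 4 in-flight requests per loaded model
export OLLAMA_NUM_PARALLEL=4
# Keep up to 2 models resident in VRAM at once
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve
```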

***

## SGLang

### Overview

SGLang (Structured Generation Language) is a high-throughput LLM serving framework developed by researchers from UC Berkeley and LMSYS. In 2025 benchmarks it frequently matches or exceeds vLLM, especially for reasoning models like DeepSeek-R1.

### Pros

* ✅ Often fastest TTFT and throughput in 2025 benchmarks
* ✅ Radix attention for efficient KV-cache reuse
* ✅ Excellent support for reasoning models (DeepSeek-R1, QwQ)
* ✅ OpenAI-compatible API
* ✅ Continuous batching and prefix caching
* ✅ Speculative decoding support
* ✅ Multi-GPU tensor parallelism

### Cons

* ❌ Newer ecosystem, fewer community resources than vLLM
* ❌ More complex setup than Ollama
* ❌ Linux-only

### Quick Start

```bash
pip install "sglang[all]"  # quotes keep zsh from expanding the brackets

# Serve a model
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### DeepSeek-R1 with SGLang

```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 2 \
    --reasoning-parser deepseek-r1
```

### API Usage

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

response = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[
        {'role': 'user', 'content': 'Explain quantum entanglement'}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
```

### Multi-GPU

```bash
# 2 GPUs (tensor parallel)
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 2
```

### Best For

* 🎯 Maximum throughput production APIs
* 🎯 Reasoning models (DeepSeek-R1, QwQ, o1-style)
* 🎯 Low-latency (TTFT) applications
* 🎯 Prefix-heavy workloads (high KV-cache reuse)

***

## Ollama

### Overview

Ollama is the easiest way to run LLMs locally. Perfect for development, testing, and personal use.

### Pros

* ✅ One-command install and run
* ✅ Built-in model library
* ✅ Great CLI experience
* ✅ Works on Mac, Linux, Windows
* ✅ Automatic quantization
* ✅ Low resource overhead

### Cons

* ❌ Lower throughput than alternatives
* ❌ Limited multi-GPU support
* ❌ Less production-ready
* ❌ Fewer optimization options

### Quick Start

```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run any model
ollama run llama3.2
ollama run mistral
ollama run codellama

# Serve API
ollama serve
```

### API Usage

```python
import requests

# Generate
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Explain quantum computing',
    'stream': False
})
print(response.json()['response'])

# Chat
response = requests.post('http://localhost:11434/api/chat', json={
    'model': 'llama3.2',
    'messages': [
        {'role': 'user', 'content': 'Hello!'}
    ]
})
```

### OpenAI Compatibility

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
```

### Performance

| Model         | GPU       | Tokens/sec |
| ------------- | --------- | ---------- |
| Llama 3.2 3B  | RTX 3060  | 45-55      |
| Llama 3.1 8B  | RTX 3090  | 35-45      |
| Llama 3.1 70B | A100 40GB | 15-20      |

### Best For

* 🎯 Quick prototyping
* 🎯 Personal AI assistant
* 🎯 Learning and experimentation
* 🎯 Simple deployments

***

## vLLM

### Overview

vLLM is a battle-tested high-throughput LLM inference engine for production. v0.7 (2025) brings improved performance, better quantization support, and new speculative decoding options.

### Pros

* ✅ Highest throughput (continuous batching + PagedAttention)
* ✅ PagedAttention for efficient memory
* ✅ Excellent multi-GPU support
* ✅ OpenAI-compatible API
* ✅ Production-ready, large community
* ✅ Supports many quantization formats (AWQ, GPTQ, FP8)
* ✅ Speculative decoding in v0.7

### Cons

* ❌ More complex setup
* ❌ Higher memory overhead at start
* ❌ Linux-only (no native Windows/Mac)
* ❌ Requires more configuration
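
The memory overhead at start comes from vLLM pre-allocating its KV cache; it can be tuned with launch flags (a sketch — exact defaults depend on the vLLM version):

```shell
# --gpu-memory-utilization: fraction of VRAM vLLM pre-allocates (default ~0.90)
# --max-model-len: cap context length to shrink the KV-cache reservation
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192
```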

### Quick Start

```bash
pip install vllm

# Serve model (vLLM v0.7; `vllm serve <model>` is an equivalent shortcut)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Docker Deploy

```bash
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.7.0 \
    --model meta-llama/Llama-3.1-8B-Instruct
```

### API Usage

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

# Chat completion
response = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[
        {'role': 'system', 'content': 'You are helpful.'},
        {'role': 'user', 'content': 'Write a haiku about coding'}
    ],
    temperature=0.7,
    max_tokens=100
)

# Streaming
stream = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:  # first chunk may carry only the role
        print(chunk.choices[0].delta.content, end='')
```

### Multi-GPU

```bash
# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

# 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4
```

### Performance

| Model         | GPU       | Tokens/sec | Concurrent Users |
| ------------- | --------- | ---------- | ---------------- |
| Llama 3.1 8B  | RTX 3090  | 80-100     | 10-20            |
| Llama 3.1 8B  | RTX 4090  | 120-150    | 20-30            |
| Llama 3.1 70B | A100 40GB | 25-35      | 5-10             |
| Llama 3.1 70B | 2x A100   | 50-70      | 15-25            |

### Best For

* 🎯 Production APIs with large community
* 🎯 High-traffic applications
* 🎯 Multi-user chat services
* 🎯 Maximum throughput needs

***

## Text Generation Inference (TGI)

### Overview

HuggingFace's production server, tightly integrated with the HF ecosystem.

### Pros

* ✅ Native HuggingFace integration
* ✅ Great for HF models
* ✅ Good multi-GPU support
* ✅ Built-in safety features
* ✅ Prometheus metrics
* ✅ Well-documented

### Cons

* ❌ Slightly lower throughput than vLLM/SGLang
* ❌ More resource intensive
* ❌ Complex configuration
* ❌ Longer startup times

### Quick Start

```bash
# Docker (recommended)
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct

# With HF token for gated models
docker run --gpus all -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct
```
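
Besides the OpenAI-compatible route, TGI exposes a native `/generate` endpoint that takes an `inputs` string plus a `parameters` object. A minimal payload builder (prompt and parameter values are illustrative):

```python
import json

def tgi_generate_payload(prompt, max_new_tokens=200, temperature=0.7):
    """Request body for TGI's native /generate endpoint."""
    return {
        'inputs': prompt,
        'parameters': {
            'max_new_tokens': max_new_tokens,
            'temperature': temperature,
        },
    }

payload = tgi_generate_payload('Explain quantum computing', max_new_tokens=100)
print(json.dumps(payload))
# POST it with any HTTP client, e.g.:
# requests.post('http://localhost:8080/generate', json=payload)
```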

### Performance

| Model         | GPU       | Tokens/sec | Concurrent Users |
| ------------- | --------- | ---------- | ---------------- |
| Llama 3.1 8B  | RTX 3090  | 60-80      | 8-15             |
| Llama 3.1 8B  | RTX 4090  | 90-120     | 15-25            |
| Llama 3.1 70B | A100 40GB | 20-30      | 3-8              |

### Best For

* 🎯 HuggingFace model users
* 🎯 Research environments
* 🎯 Need built-in safety features
* 🎯 Prometheus monitoring needs

***

## LocalAI

### Overview

LocalAI is an OpenAI-compatible API that supports multiple modalities: LLMs, TTS, STT, embeddings, and image generation.

### Pros

* ✅ Multi-modal support (LLM, TTS, STT, embeddings)
* ✅ Drop-in OpenAI replacement
* ✅ Pre-built models available
* ✅ Supports GGUF models
* ✅ Reranking support
* ✅ Swagger UI documentation

### Cons

* ❌ Longer startup time (5-10 minutes)
* ❌ Lower LLM throughput than vLLM/SGLang
* ❌ Image generation may have CUDA issues
* ❌ More complex for pure LLM use

### Quick Start

```bash
docker run --gpus all -p 8080:8080 localai/localai:master-aio-gpu-nvidia-cuda-12
```

### API Usage

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8080/v1', api_key='dummy')

# Chat
response = client.chat.completions.create(
    model='gpt-4',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)

# TTS
audio = client.audio.speech.create(model='tts-1', input='Hello world', voice='alloy')

# STT
transcript = client.audio.transcriptions.create(model='whisper-1', file=open('audio.mp3', 'rb'))

# Embeddings
embeddings = client.embeddings.create(model='text-embedding-ada-002', input='Hello world')
```

### Best For

* 🎯 Need multiple modalities (TTS, STT, LLM)
* 🎯 Want OpenAI API compatibility
* 🎯 Running GGUF models
* 🎯 Document reranking workflows

***

## Performance Comparison (2025)

### Throughput (tokens/second) — Single User

| Model                     | Ollama | vLLM v0.7 | SGLang v0.4 | TGI |
| ------------------------- | ------ | --------- | ----------- | --- |
| Llama 3.1 8B (RTX 3090)   | 40     | 90        | 100         | 70  |
| Llama 3.1 8B (RTX 4090)   | 65     | 140       | 160         | 110 |
| Llama 3.1 70B (A100 40GB) | 18     | 30        | 35          | 25  |

### Throughput — Multiple Users (10 concurrent)

| Model                     | Ollama | vLLM v0.7 | SGLang v0.4 | TGI |
| ------------------------- | ------ | --------- | ----------- | --- |
| Llama 3.1 8B (RTX 4090)   | 150\*  | 800       | 920         | 500 |
| Llama 3.1 70B (A100 40GB) | 50\*   | 200       | 240         | 150 |

\*Ollama serves sequentially by default

### Memory Usage

| Model              | Ollama | vLLM v0.7 | SGLang v0.4 | TGI  |
| ------------------ | ------ | --------- | ----------- | ---- |
| Llama 3.1 8B       | 5GB    | 6GB       | 6GB         | 7GB  |
| Llama 3.1 70B (Q4) | 38GB   | 40GB      | 39GB        | 42GB |
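
These figures roughly track a weight-only rule of thumb: one billion parameters costs about 1 GB per 8 bits of precision, with KV cache and runtime overhead on top. For example, a 70B model at 4-bit is ~35 GB of weights, consistent with the ~38-42 GB totals above. A quick estimator:

```python
def vram_weights_gb(params_billion, bits_per_weight):
    """Weight-only VRAM estimate in GB; KV cache and activations add more."""
    return params_billion * bits_per_weight / 8

print(vram_weights_gb(8, 16))  # 16.0 -> fp16 weights for an 8B model
print(vram_weights_gb(70, 4))  # 35.0 -> 70B model at 4-bit
```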

### Time to First Token (TTFT) — DeepSeek-R1-32B

| Framework   | TTFT (A100 80GB) | TPOT (ms/tok) |
| ----------- | ---------------- | ------------- |
| SGLang v0.4 | **180ms**        | **14ms**      |
| vLLM v0.7   | 240ms            | 17ms          |
| llama.cpp   | 420ms            | 28ms          |
| Ollama      | 510ms            | 35ms          |

***

## Feature Comparison

| Feature              | Ollama  | vLLM v0.7      | SGLang v0.4    | TGI               | LocalAI    |
| -------------------- | ------- | -------------- | -------------- | ----------------- | ---------- |
| OpenAI API           | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Streaming            | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Batching             | Basic   | Continuous     | Continuous     | Dynamic           | Basic      |
| Multi-GPU            | Limited | Excellent      | Excellent      | Good              | Limited    |
| Quantization         | GGUF    | AWQ, GPTQ, FP8 | AWQ, GPTQ, FP8 | bitsandbytes, AWQ | GGUF       |
| LoRA                 | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Speculative Decoding | ❌       | ✅              | ✅              | ✅                 | ❌          |
| Prefix Caching       | ❌       | ✅              | ✅ (Radix)      | ✅                 | ❌          |
| Reasoning Models     | Limited | Good           | Excellent      | Good              | Limited    |
| Metrics              | Basic   | Prometheus     | Prometheus     | Prometheus        | Prometheus |
| Function Calling     | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Vision Models        | ✅       | ✅              | ✅              | ✅                 | Limited    |
| TTS                  | ❌       | ❌              | ❌              | ❌                 | ✅          |
| STT                  | ❌       | ❌              | ❌              | ❌                 | ✅          |
| Embeddings           | ✅       | Limited        | Limited        | Limited           | ✅          |
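
All five servers accept OpenAI-style tool definitions, so the same `tools` schema works across their endpoints (how reliably the model actually calls tools depends on the model; `get_weather` here is a made-up example function):

```python
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Get the current weather for a city',
        'parameters': {
            'type': 'object',
            'properties': {'city': {'type': 'string'}},
            'required': ['city'],
        },
    },
}]

# Pass to any OpenAI-compatible endpoint:
# client.chat.completions.create(model=..., messages=..., tools=tools)
```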

***

## When to Use What

### Use Ollama When:

* You want to get started in 5 minutes
* You're prototyping or learning
* You need a personal AI assistant
* You're on Mac or Windows
* Simplicity matters more than speed

### Use SGLang When:

* You need the **absolute lowest latency** (TTFT)
* You're serving **reasoning models** (DeepSeek-R1, QwQ, o1-style)
* You have workloads with heavy **prefix sharing** (RAG, system prompts)
* You need top-tier throughput in 2025 benchmarks
* You want cutting-edge optimizations (Radix attention)

### Use vLLM When:

* You need maximum throughput with a **mature, well-supported** framework
* You're serving many users at scale
* You need production reliability with a large community
* You want OpenAI drop-in replacement
* You have multi-GPU setups
* You need broad model format support (AWQ, GPTQ, FP8)

### Use TGI When:

* You're in the HuggingFace ecosystem
* You need built-in safety features
* You want detailed Prometheus metrics
* You need to serve HF models directly
* You're in a research environment

### Use LocalAI When:

* You need TTS and STT alongside LLM
* You want embeddings for RAG
* You need document reranking
* You want a single all-in-one solution
* You're building voice-enabled apps

***

## Migration Guide

### From Ollama to SGLang

```python
from openai import OpenAI

# Ollama
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)

# SGLang - just change the base URL and model name
client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')
response = client.chat.completions.create(
    model='meta-llama/Llama-3.2-3B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
```

### From vLLM to SGLang

Both expose the OpenAI-compatible chat completions API, so client code only needs a new base URL. The server launch commands differ slightly:

```bash
# vLLM
python -m vllm.entrypoints.openai.api_server --model ... --port 8000

# SGLang (equivalent)
python -m sglang.launch_server --model-path ... --port 8000
```

***

## Recommendations by GPU

| GPU           | Single User | Multi User  | Reasoning Models |
| ------------- | ----------- | ----------- | ---------------- |
| RTX 3060 12GB | Ollama      | Ollama      | Ollama           |
| RTX 3090 24GB | Ollama      | vLLM        | SGLang           |
| RTX 4090 24GB | SGLang/vLLM | SGLang/vLLM | SGLang           |
| A100 40GB+    | SGLang      | SGLang      | SGLang           |

***

## Next Steps

* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) - Easiest setup
* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) - Highest throughput
* [LocalAI Guide](https://docs.clore.ai/guides/language-models/localai-openai-compatible) - Multi-modal support
* [DeepSeek-R1 Guide](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning models
* [Multi-GPU Setup](https://docs.clore.ai/guides/advanced/multi-gpu-setup) - Scale to larger models
* [API Integration](https://docs.clore.ai/guides/advanced/api-integration) - Build applications
