# LLM Serving: Ollama vs vLLM vs TGI

Choose the right LLM serving solution for your needs on CLORE.AI.

{% hint style="success" %}
All options available on [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**2025 Update:** SGLang has emerged as a top-tier framework, often **outperforming vLLM** in throughput and TTFT benchmarks. Both vLLM v0.7 and SGLang v0.4 are recommended for production workloads.
{% endhint %}

## Quick Decision Guide

| Use Case                           | Best Choice            | Why                               |
| ---------------------------------- | ---------------------- | --------------------------------- |
| Quick testing & chat               | **Ollama**             | Easiest setup, fastest startup    |
| Production API (max throughput)    | **SGLang** or **vLLM** | Highest throughput in 2025        |
| Reasoning models (DeepSeek-R1)     | **SGLang**             | Best support for reasoning chains |
| HuggingFace integration            | **TGI**                | Native HF support                 |
| Local development                  | **Ollama**             | Works everywhere                  |
| High concurrency                   | **SGLang** or **vLLM** | Continuous batching               |
| Multi-modal (TTS, STT, Embeddings) | **LocalAI**            | All-in-one solution               |
| Streaming apps                     | **vLLM** or **SGLang** | Both excellent                    |

## Startup Time Comparison

| Solution | Typical Startup | Notes                     |
| -------- | --------------- | ------------------------- |
| Ollama   | 30-60 seconds   | Fastest, lightweight      |
| SGLang   | 3-8 minutes     | Downloads model from HF   |
| vLLM     | 5-15 minutes    | Downloads model from HF   |
| TGI      | 3-10 minutes    | Downloads model from HF   |
| LocalAI  | 5-10 minutes    | Pre-loads multiple models |

{% hint style="info" %}
HTTP 502 errors during startup are normal - the service is still initializing.
{% endhint %}
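
Rather than retrying by hand, a client can poll until the server answers. A minimal sketch, assuming the OpenAI-compatible `/v1/models` route that vLLM, SGLang, and LocalAI expose (adjust the base URL and port to your instance):

```python
import time

import requests

BASE_URL = 'http://localhost:8000'  # adjust to your instance

def wait_until_ready(timeout_s=1200, interval_s=10):
    """Poll /v1/models until the server answers 200 (model loaded)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f'{BASE_URL}/v1/models', timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # 502 / connection refused while weights are loading
        time.sleep(interval_s)
    return False

if wait_until_ready():
    print('Server ready')
```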

***

## Overview Comparison

| Feature               | Ollama          | vLLM        | SGLang                 | TGI             | LocalAI         |
| --------------------- | --------------- | ----------- | ---------------------- | --------------- | --------------- |
| **Ease of Setup**     | ⭐⭐⭐⭐⭐           | ⭐⭐⭐         | ⭐⭐⭐                    | ⭐⭐⭐             | ⭐⭐⭐⭐            |
| **Performance**       | ⭐⭐⭐             | ⭐⭐⭐⭐⭐       | ⭐⭐⭐⭐⭐                  | ⭐⭐⭐⭐            | ⭐⭐⭐             |
| **Model Support**     | ⭐⭐⭐⭐            | ⭐⭐⭐⭐⭐       | ⭐⭐⭐⭐⭐                  | ⭐⭐⭐⭐            | ⭐⭐⭐⭐            |
| **API Compatibility** | Custom + OpenAI | OpenAI      | OpenAI                 | Custom + OpenAI | OpenAI          |
| **Multi-GPU**         | Limited         | Excellent   | Excellent              | Good            | Limited         |
| **Memory Efficiency** | Good            | Excellent   | Excellent              | Very Good       | Good            |
| **Multi-Modal**       | Vision only     | Vision only | Vision only            | Vision only     | TTS, STT, Embed |
| **Startup Time**      | 30-60 sec       | 5-15 min    | 3-8 min                | 3-10 min        | 5-10 min        |
| **Reasoning Models**  | Limited         | Good        | Excellent              | Good            | Limited         |
| **Best For**          | Development     | Production  | Production + Reasoning | HF Ecosystem    | Multi-modal     |

***

## 2025 Benchmarks: DeepSeek-R1-32B

### TTFT, TPOT & Throughput (A100 80GB, batch=32, input=512, output=512)

| Framework       | TTFT (ms) | TPOT (ms/tok) | Throughput (tok/s) | Notes                      |
| --------------- | --------- | ------------- | ------------------ | -------------------------- |
| **SGLang v0.4** | **180**   | **14**        | **2,850**          | Best overall 2025          |
| **vLLM v0.7**   | 240       | 17            | 2,400              | Excellent, close to SGLang |
| llama.cpp       | 420       | 28            | 1,100              | CPU+GPU, quantized         |
| Ollama          | 510       | 35            | 820                | Ease of use priority       |

> **TTFT** = Time to First Token (latency). **TPOT** = Time Per Output Token. Lower is better for both.
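
These two numbers give a quick estimate of end-to-end request latency: roughly TTFT + TPOT × output tokens. A worked example using the SGLang row above:

```python
# Rough end-to-end latency: TTFT + TPOT * output_tokens
ttft_ms, tpot_ms, output_tokens = 180, 14, 512  # SGLang v0.4 row above

total_ms = ttft_ms + tpot_ms * output_tokens
print(f'{total_ms / 1000:.1f} s per request')  # ~7.3 s
```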

### Throughput Comparison (RTX 4090, Llama 3.1 8B, 10 concurrent users)

| Framework   | Tokens/sec | Concurrent Users | Notes                   |
| ----------- | ---------- | ---------------- | ----------------------- |
| SGLang v0.4 | 920        | 20-30            | Radix attention caching |
| vLLM v0.7   | 870        | 20-30            | PagedAttention          |
| TGI         | 550        | 10-20            |                         |
| Ollama      | 160\*      | —                | Sequential by default   |

\*Ollama serves requests sequentially by default; concurrency can be enabled via the `OLLAMA_NUM_PARALLEL` environment variable

***

## SGLang

### Overview

SGLang (Structured Generation Language) is a high-throughput LLM serving framework developed by researchers from UC Berkeley and LMSYS. In 2025 benchmarks it frequently matches or exceeds vLLM — especially for reasoning models like DeepSeek-R1.

### Pros

* ✅ Often fastest TTFT and throughput in 2025 benchmarks
* ✅ Radix attention for efficient KV-cache reuse
* ✅ Excellent support for reasoning models (DeepSeek-R1, QwQ)
* ✅ OpenAI-compatible API
* ✅ Continuous batching and prefix caching
* ✅ Speculative decoding support
* ✅ Multi-GPU tensor parallelism

### Cons

* ❌ Newer ecosystem, fewer community resources than vLLM
* ❌ More complex setup than Ollama
* ❌ Linux-only

### Quick Start

```bash
pip install "sglang[all]"

# Serve a model
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### DeepSeek-R1 with SGLang

```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 2 \
    --reasoning-parser deepseek-r1
```
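
With a reasoning parser enabled, SGLang separates the chain of thought from the final answer in its OpenAI-compatible responses. A minimal sketch, assuming the parser exposes the thinking trace as a `reasoning_content` field on the message:

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

response = client.chat.completions.create(
    model='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
    messages=[{'role': 'user', 'content': 'What is 17 * 24?'}],
)

message = response.choices[0].message
# Thinking trace (field name assumed; present when --reasoning-parser is set)
print(getattr(message, 'reasoning_content', None))
print(message.content)  # final answer
```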

### API Usage

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

response = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[
        {'role': 'user', 'content': 'Explain quantum entanglement'}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
```

### Multi-GPU

```bash
# 2 GPUs (tensor parallel)
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tp 2
```

### Best For

* 🎯 Maximum throughput production APIs
* 🎯 Reasoning models (DeepSeek-R1, QwQ, o1-style)
* 🎯 Low-latency (TTFT) applications
* 🎯 Prefix-heavy workloads (high KV-cache reuse)

***

## Ollama

### Overview

Ollama is the easiest way to run LLMs locally. Perfect for development, testing, and personal use.

### Pros

* ✅ One-command install and run
* ✅ Built-in model library
* ✅ Great CLI experience
* ✅ Works on Mac, Linux, Windows
* ✅ Automatic quantization
* ✅ Low resource overhead

### Cons

* ❌ Lower throughput than alternatives
* ❌ Limited multi-GPU support
* ❌ Less production-ready
* ❌ Fewer optimization options

### Quick Start

```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run any model
ollama run llama3.2
ollama run mistral
ollama run codellama

# Serve API
ollama serve
```

### API Usage

```python
import requests

# Generate
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Explain quantum computing',
    'stream': False
})
print(response.json()['response'])

# Chat
response = requests.post('http://localhost:11434/api/chat', json={
    'model': 'llama3.2',
    'messages': [
        {'role': 'user', 'content': 'Hello!'}
    ],
    'stream': False  # chat also streams NDJSON by default
})
print(response.json()['message']['content'])
```
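
Ollama streams newline-delimited JSON by default, which is why the examples above pass `'stream': False`. A streaming sketch against the same `/api/generate` endpoint:

```python
import json

import requests

with requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Explain quantum computing',
}, stream=True) as response:
    for line in response.iter_lines():
        if line:  # each line is one JSON chunk
            chunk = json.loads(line)
            print(chunk.get('response', ''), end='', flush=True)
```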

### OpenAI Compatibility

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
```

### Performance

| Model         | GPU       | Tokens/sec |
| ------------- | --------- | ---------- |
| Llama 3.2 3B  | RTX 3060  | 45-55      |
| Llama 3.1 8B  | RTX 3090  | 35-45      |
| Llama 3.1 70B | A100 40GB | 15-20      |

### Best For

* 🎯 Quick prototyping
* 🎯 Personal AI assistant
* 🎯 Learning and experimentation
* 🎯 Simple deployments

***

## vLLM

### Overview

vLLM is a battle-tested high-throughput LLM inference engine for production. v0.7 (2025) brings improved performance, better quantization support, and new speculative decoding options.

### Pros

* ✅ Highest throughput (continuous batching + PagedAttention)
* ✅ PagedAttention for efficient memory
* ✅ Excellent multi-GPU support
* ✅ OpenAI-compatible API
* ✅ Production-ready, large community
* ✅ Supports many quantization formats (AWQ, GPTQ, FP8)
* ✅ Speculative decoding in v0.7

### Cons

* ❌ More complex setup
* ❌ Higher memory overhead at start
* ❌ Linux-only (no native Windows/Mac)
* ❌ Requires more configuration

### Quick Start

```bash
pip install vllm

# Serve model (vLLM v0.7)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Docker Deploy

```bash
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.7.0 \
    --model meta-llama/Llama-3.1-8B-Instruct
```

### API Usage

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

# Chat completion
response = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[
        {'role': 'system', 'content': 'You are helpful.'},
        {'role': 'user', 'content': 'Write a haiku about coding'}
    ],
    temperature=0.7,
    max_tokens=100
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
)
for chunk in stream:
    # The final chunk carries no content, so guard against None
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
```

### Multi-GPU

```bash
# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

# 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4
```
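
The tensor-parallel sizes above follow from a quick back-of-the-envelope weight count (weights only; the KV cache and activations need extra headroom):

```python
params_b = 70        # Llama 3.1 70B
bytes_per_param = 2  # FP16/BF16

weights_gb = params_b * bytes_per_param  # ~140 GB of weights alone
print(f'~{weights_gb} GB -> 2x 80GB GPUs (TP=2), or quantize to fit fewer')
```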

### Performance

| Model         | GPU       | Tokens/sec | Concurrent Users |
| ------------- | --------- | ---------- | ---------------- |
| Llama 3.1 8B  | RTX 3090  | 80-100     | 10-20            |
| Llama 3.1 8B  | RTX 4090  | 120-150    | 20-30            |
| Llama 3.1 70B | A100 40GB | 25-35      | 5-10             |
| Llama 3.1 70B | 2x A100   | 50-70      | 15-25            |

### Best For

* 🎯 Production APIs with large community
* 🎯 High-traffic applications
* 🎯 Multi-user chat services
* 🎯 Maximum throughput needs

***

## Text Generation Inference (TGI)

### Overview

Text Generation Inference (TGI) is Hugging Face's production inference server, tightly integrated with the HF ecosystem.

### Pros

* ✅ Native HuggingFace integration
* ✅ Great for HF models
* ✅ Good multi-GPU support
* ✅ Built-in safety features
* ✅ Prometheus metrics
* ✅ Well-documented

### Cons

* ❌ Slightly lower throughput than vLLM/SGLang
* ❌ More resource intensive
* ❌ Complex configuration
* ❌ Longer startup times

### Quick Start

```bash
# Docker (recommended)
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct

# With HF token for gated models
docker run --gpus all -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct
```
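
Besides the OpenAI-compatible route available in recent versions, TGI has a native `/generate` endpoint (reachable on port 8080 as mapped above):

```python
import requests

response = requests.post('http://localhost:8080/generate', json={
    'inputs': 'What is deep learning?',
    'parameters': {'max_new_tokens': 200, 'temperature': 0.7},
})
print(response.json()['generated_text'])
```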

### Performance

| Model         | GPU       | Tokens/sec | Concurrent Users |
| ------------- | --------- | ---------- | ---------------- |
| Llama 3.1 8B  | RTX 3090  | 60-80      | 8-15             |
| Llama 3.1 8B  | RTX 4090  | 90-120     | 15-25            |
| Llama 3.1 70B | A100 40GB | 20-30      | 3-8              |

### Best For

* 🎯 HuggingFace model users
* 🎯 Research environments
* 🎯 Need built-in safety features
* 🎯 Prometheus monitoring needs

***

## LocalAI

### Overview

LocalAI is an OpenAI-compatible API that supports multiple modalities: LLMs, TTS, STT, embeddings, and image generation.

### Pros

* ✅ Multi-modal support (LLM, TTS, STT, embeddings)
* ✅ Drop-in OpenAI replacement
* ✅ Pre-built models available
* ✅ Supports GGUF models
* ✅ Reranking support
* ✅ Swagger UI documentation

### Cons

* ❌ Longer startup time (5-10 minutes)
* ❌ Lower LLM throughput than vLLM/SGLang
* ❌ Image generation may have CUDA issues
* ❌ More complex for pure LLM use

### Quick Start

```bash
docker run --gpus all -p 8080:8080 localai/localai:master-aio-gpu-nvidia-cuda-12
```

### API Usage

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8080/v1', api_key='dummy')

# Chat
response = client.chat.completions.create(
    model='gpt-4',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)

# TTS (returns binary audio; save it to a file)
audio = client.audio.speech.create(model='tts-1', input='Hello world', voice='alloy')
audio.write_to_file('hello.mp3')

# STT
with open('audio.mp3', 'rb') as f:
    transcript = client.audio.transcriptions.create(model='whisper-1', file=f)
print(transcript.text)

# Embeddings
embeddings = client.embeddings.create(model='text-embedding-ada-002', input='Hello world')
print(len(embeddings.data[0].embedding))
```

### Best For

* 🎯 Need multiple modalities (TTS, STT, LLM)
* 🎯 Want OpenAI API compatibility
* 🎯 Running GGUF models
* 🎯 Document reranking workflows

***

## Performance Comparison (2025)

### Throughput (tokens/second) — Single User

| Model                     | Ollama | vLLM v0.7 | SGLang v0.4 | TGI |
| ------------------------- | ------ | --------- | ----------- | --- |
| Llama 3.1 8B (RTX 3090)   | 40     | 90        | 100         | 70  |
| Llama 3.1 8B (RTX 4090)   | 65     | 140       | 160         | 110 |
| Llama 3.1 70B (A100 40GB) | 18     | 30        | 35          | 25  |

### Throughput — Multiple Users (10 concurrent)

| Model                     | Ollama | vLLM v0.7 | SGLang v0.4 | TGI |
| ------------------------- | ------ | --------- | ----------- | --- |
| Llama 3.1 8B (RTX 4090)   | 150\*  | 800       | 920         | 500 |
| Llama 3.1 70B (A100 40GB) | 50\*   | 200       | 240         | 150 |

\*Ollama serves sequentially by default (see the `OLLAMA_NUM_PARALLEL` note above)

### Memory Usage

| Model              | Ollama | vLLM v0.7 | SGLang v0.4 | TGI  |
| ------------------ | ------ | --------- | ----------- | ---- |
| Llama 3.1 8B       | 5GB    | 6GB       | 6GB         | 7GB  |
| Llama 3.1 70B (Q4) | 38GB   | 40GB      | 39GB        | 42GB |

### Time to First Token (TTFT) — DeepSeek-R1-32B

| Framework   | TTFT (A100 80GB) | TPOT (ms/tok) |
| ----------- | ---------------- | ------------- |
| SGLang v0.4 | **180ms**        | **14ms**      |
| vLLM v0.7   | 240ms            | 17ms          |
| llama.cpp   | 420ms            | 28ms          |
| Ollama      | 510ms            | 35ms          |

***

## Feature Comparison

| Feature              | Ollama  | vLLM v0.7      | SGLang v0.4    | TGI               | LocalAI    |
| -------------------- | ------- | -------------- | -------------- | ----------------- | ---------- |
| OpenAI API           | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Streaming            | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Batching             | Basic   | Continuous     | Continuous     | Continuous        | Basic      |
| Multi-GPU            | Limited | Excellent      | Excellent      | Good              | Limited    |
| Quantization         | GGUF    | AWQ, GPTQ, FP8 | AWQ, GPTQ, FP8 | bitsandbytes, AWQ | GGUF       |
| LoRA                 | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Speculative Decoding | ❌       | ✅              | ✅              | ✅                 | ❌          |
| Prefix Caching       | ❌       | ✅              | ✅ (Radix)      | ✅                 | ❌          |
| Reasoning Models     | Limited | Good           | Excellent      | Good              | Limited    |
| Metrics              | Basic   | Prometheus     | Prometheus     | Prometheus        | Prometheus |
| Function Calling     | ✅       | ✅              | ✅              | ✅                 | ✅          |
| Vision Models        | ✅       | ✅              | ✅              | ✅                 | Limited    |
| TTS                  | ❌       | ❌              | ❌              | ❌                 | ✅          |
| STT                  | ❌       | ❌              | ❌              | ❌                 | ✅          |
| Embeddings           | ✅       | Limited        | Limited        | Limited           | ✅          |
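
Function calling uses the standard OpenAI `tools` parameter on these servers, though support varies by model and some engines need extra launch flags (vLLM, for example, has tool-parsing options). A minimal sketch with a hypothetical `get_weather` tool:

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',  # hypothetical tool, for illustration only
        'description': 'Get the current weather for a city',
        'parameters': {
            'type': 'object',
            'properties': {'city': {'type': 'string'}},
            'required': ['city'],
        },
    },
}]

response = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[{'role': 'user', 'content': "What's the weather in Berlin?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```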

***

## When to Use What

### Use Ollama When:

* You want to get started in 5 minutes
* You're prototyping or learning
* You need a personal AI assistant
* You're on Mac or Windows
* Simplicity matters more than speed

### Use SGLang When:

* You need the **absolute lowest latency** (TTFT)
* You're serving **reasoning models** (DeepSeek-R1, QwQ, o1-style)
* You have workloads with heavy **prefix sharing** (RAG, system prompts; see the sketch after this list)
* You need top-tier throughput in 2025 benchmarks
* You want cutting-edge optimizations (Radix attention)
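
A minimal sketch of a prefix-heavy workload: every request repeats the same long system prompt, so a radix/prefix cache computes that prefix once and reuses its KV cache across requests:

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

# Long shared prefix (e.g. RAG context or a detailed system prompt)
SYSTEM = 'You are a support agent. Answer using this documentation: ...'

for question in ['How do I reset my password?', 'How do I cancel my plan?']:
    response = client.chat.completions.create(
        model='meta-llama/Llama-3.1-8B-Instruct',
        messages=[
            {'role': 'system', 'content': SYSTEM},
            {'role': 'user', 'content': question},
        ],
    )
    print(response.choices[0].message.content[:80])
```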

### Use vLLM When:

* You need maximum throughput with a **mature, well-supported** framework
* You're serving many users at scale
* You need production reliability with a large community
* You want OpenAI drop-in replacement
* You have multi-GPU setups
* You need broad model format support (AWQ, GPTQ, FP8)

### Use TGI When:

* You're in the HuggingFace ecosystem
* You need built-in safety features
* You want detailed Prometheus metrics
* You need to serve HF models directly
* You're in a research environment

### Use LocalAI When:

* You need TTS and STT alongside LLM
* You want embeddings for RAG
* You need document reranking
* You want a single all-in-one solution
* You're building voice-enabled apps

***

## Migration Guide

### From Ollama to SGLang

```python
# Ollama
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(model='llama3.2', ...)

# SGLang - just change URL and model name
client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')
response = client.chat.completions.create(model='meta-llama/Llama-3.2-3B-Instruct', ...)
```

### From vLLM to SGLang

Both expose an OpenAI-compatible API, so in most cases you only change the endpoint URL. Core chat/completions calls carry over directly, though some engine-specific parameters differ.

```bash
# vLLM
python -m vllm.entrypoints.openai.api_server --model ... --port 8000

# SGLang (equivalent)
python -m sglang.launch_server --model-path ... --port 8000
```

***

## Recommendations by GPU

| GPU           | Single User | Multi User  | Reasoning Models |
| ------------- | ----------- | ----------- | ---------------- |
| RTX 3060 12GB | Ollama      | Ollama      | Ollama           |
| RTX 3090 24GB | Ollama      | vLLM        | SGLang           |
| RTX 4090 24GB | SGLang/vLLM | SGLang/vLLM | SGLang           |
| A100 40GB+    | SGLang      | SGLang      | SGLang           |

***

## Next Steps

* [Ollama Guide](/guides/language-models/ollama.md) - Easiest setup
* [vLLM Guide](/guides/language-models/vllm.md) - Highest throughput
* [LocalAI Guide](/guides/language-models/localai-openai-compatible.md) - Multi-modal support
* [DeepSeek-R1 Guide](/guides/language-models/deepseek-r1.md) - Reasoning models
* [Multi-GPU Setup](/guides/advanced/multi-gpu-setup.md) - Scale to larger models
* [API Integration](/guides/advanced/api-integration.md) - Build applications

