# vLLM

High-throughput LLM inference server for production workloads on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Current Version: v0.7.x** — This guide covers vLLM v0.7.3+. New features include DeepSeek-R1 support, structured outputs with automatic tool choice, multi-LoRA serving, and improved memory efficiency.
{% endhint %}

## Server Requirements

| Parameter    | Minimum      | Recommended |
| ------------ | ------------ | ----------- |
| RAM          | **16GB**     | 32GB+       |
| VRAM         | 16GB (7B)    | 24GB+       |
| Network      | 500Mbps      | 1Gbps+      |
| Startup Time | 5-15 minutes | -           |

{% hint style="danger" %}
**Important:** vLLM requires significant RAM and VRAM. Servers with less than 16GB RAM will fail to run even 7B models.
{% endhint %}

{% hint style="warning" %}
**Startup Time:** The first launch downloads the model from HuggingFace (5-15 minutes depending on model size and network speed). HTTP 502 during this time is normal.
{% endhint %}

## Why vLLM?

* **Fastest throughput** - PagedAttention delivers up to 24x higher throughput than naive Hugging Face Transformers serving
* **Production ready** - OpenAI-compatible API out of the box
* **Continuous batching** - Efficient multi-user serving
* **Streaming** - Real-time token generation
* **Multi-GPU** - Tensor parallelism for large models
* **Multi-LoRA** - Serve multiple fine-tuned adapters simultaneously (v0.7+)
* **Structured outputs** - JSON schema enforcement and tool calling (v0.7+)

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:v0.7.3
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 --host 0.0.0.0 --port 8000
```

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders**:

```bash
# Check health (may take 5-15 min on first run)
curl https://your-http-pub.clorecloud.net/health

# List models (only works after model loads)
curl https://your-http-pub.clorecloud.net/v1/models
```
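Since a 502 during model download is expected, scripts should poll `/health` rather than fail on the first error. A minimal standard-library sketch (the URL is a placeholder for your `http_pub` address; timeouts are illustrative):

```python
import time
import urllib.error
import urllib.request


def wait_for_ready(url, timeout_s=1200, interval_s=15, fetch=None):
    """Poll a /health URL until it returns HTTP 200 or the timeout expires."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.status
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # 502 / connection refused while the model is still downloading
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# wait_for_ready("https://your-http-pub.clorecloud.net/health")
```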

{% hint style="warning" %}
If you get HTTP 502 for more than 15 minutes, check:

1. Server has 16GB+ RAM
2. Server has enough VRAM for the model
3. HuggingFace token is set for gated models
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access vLLM via the `http_pub` URL:

```bash
# Chat completion
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

{% hint style="info" %}
All `localhost:8000` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Installation

### Using Docker (Recommended)

```bash
docker run -d --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.7.3 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0
```

### Using pip

```bash
pip install vllm==0.7.3

# Run server
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2
```

## Supported Models

| Model                         | Parameters | VRAM Required     | RAM Required |
| ----------------------------- | ---------- | ----------------- | ------------ |
| Mistral 7B                    | 7B         | 14GB              | 16GB+        |
| Llama 3.1 8B                  | 8B         | 16GB              | 16GB+        |
| Llama 3.1 70B                 | 70B        | 140GB (or 2x80GB) | 64GB+        |
| Mixtral 8x7B                  | 47B        | 90GB              | 32GB+        |
| Qwen2.5 7B                    | 7B         | 14GB              | 16GB+        |
| Qwen2.5 72B                   | 72B        | 145GB             | 64GB+        |
| DeepSeek-V3                   | 671B MoE   | Multi-GPU         | 128GB+       |
| DeepSeek-R1-Distill-Qwen-7B   | 7B         | 14GB              | 16GB+        |
| DeepSeek-R1-Distill-Qwen-32B  | 32B        | 64GB              | 32GB+        |
| DeepSeek-R1-Distill-Llama-70B | 70B        | 140GB             | 64GB+        |
| Phi-4                         | 14B        | 28GB              | 32GB+        |
| Gemma 2 9B                    | 9B         | 18GB              | 16GB+        |
| CodeLlama 34B                 | 34B        | 68GB              | 32GB+        |

## Server Options

### Basic Server

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000
```

### Production Server

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --enable-prefix-caching
```
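`--gpu-memory-utilization 0.9` caps vLLM's total VRAM allocation; whatever remains after the model weights is what PagedAttention can use for KV cache. A rough back-of-envelope helper (this ignores activation and CUDA-graph overhead, so treat the result as an upper bound):

```python
def kv_cache_budget_gb(vram_gb: float, gpu_mem_util: float, weights_gb: float) -> float:
    """Approximate VRAM left for the KV cache under --gpu-memory-utilization."""
    return vram_gb * gpu_mem_util - weights_gb


# A 24GB card serving Mistral 7B in FP16 (~14GB of weights) at 0.9 utilization
# leaves roughly 7.6GB for KV cache.
budget = kv_cache_budget_gb(24, 0.9, 14)
```

If the budget comes out near zero, lower `--max-model-len` or switch to a quantized checkpoint before lowering `--gpu-memory-utilization` further.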

### With Quantization (Lower VRAM)

```bash
# AWQ quantized model (uses less VRAM)
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --host 0.0.0.0 \
    --quantization awq
```

### Structured Outputs and Tool Calling (v0.7+)

Enable automatic tool choice and structured JSON outputs:

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```

Use in Python:

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"
)

# Parse tool call
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Tool: {tool_call.function.name}, Args: {args}")
```

Structured JSON output via response format:

```python
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Extract: John Smith, 30 years old, software engineer"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "occupation": {"type": "string"}
                },
                "required": ["name", "age", "occupation"]
            }
        }
    }
)
print(response.choices[0].message.content)
```
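Even with schema enforcement, it is prudent to validate the parsed output before passing it downstream. A small sketch whose key names mirror the `person` schema above (`parse_person` is a hypothetical helper, not a vLLM API):

```python
import json


def parse_person(raw: str) -> dict:
    """Parse the structured response and check the schema's required keys."""
    person = json.loads(raw)
    for key in ("name", "age", "occupation"):
        if key not in person:
            raise ValueError(f"missing required key: {key}")
    if not isinstance(person["age"], int):
        raise ValueError("age must be an integer")
    return person


# person = parse_person(response.choices[0].message.content)
```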

### Multi-LoRA Serving (v0.7+)

Serve a base model with multiple LoRA adapters simultaneously:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --enable-lora \
    --lora-modules \
        sql-adapter=path/to/sql-lora \
        code-adapter=path/to/code-lora \
        chat-adapter=path/to/chat-lora \
    --max-lora-rank 64
```

Query a specific LoRA adapter by model name:

```python
# Use the SQL adapter
response = client.chat.completions.create(
    model="sql-adapter",
    messages=[{"role": "user", "content": "Write a SQL query to find top 10 customers"}]
)

# Use the code adapter
response = client.chat.completions.create(
    model="code-adapter",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)
```

## DeepSeek-R1 Support (v0.7+)

vLLM v0.7+ has native support for DeepSeek-R1 distill models. These reasoning models wrap their chain of thought in `<think>` tags before the final answer.

### DeepSeek-R1-Distill-Qwen-7B (Single GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384
```

### DeepSeek-R1-Distill-Qwen-32B (Dual GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```

### DeepSeek-R1-Distill-Llama-70B (Quad GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768
```

### Querying DeepSeek-R1

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {
            "role": "user",
            "content": "Solve: If a train travels 120km in 1.5 hours, what is its speed in m/s?"
        }
    ],
    max_tokens=2048,
    temperature=0.6
)

content = response.choices[0].message.content
# Response includes <think>...</think> reasoning block followed by the answer
print(content)
```

Parsing think tags:

```python
import re

def parse_deepseek_r1_response(content: str) -> dict:
    """Extract thinking and answer from DeepSeek-R1 response."""
    think_match = re.search(r'<think>(.*?)</think>', content, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""
    answer = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    return {"thinking": thinking, "answer": answer}

result = parse_deepseek_r1_response(content)
print("Thinking:", result["thinking"][:200], "...")
print("Answer:", result["answer"])
```

## API Usage

### Chat Completions (OpenAI Compatible)

```python
from openai import OpenAI

# For external access, use your http_pub URL:
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed"
)

# Or via SSH tunnel:
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### cURL

```bash
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

### Text Completions

```bash
curl https://your-http-pub.clorecloud.net/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'
```

## Complete API Reference

vLLM provides OpenAI-compatible endpoints plus additional utility endpoints.

### Standard Endpoints

| Endpoint               | Method | Description                     |
| ---------------------- | ------ | ------------------------------- |
| `/v1/models`           | GET    | List available models           |
| `/v1/chat/completions` | POST   | Chat completion                 |
| `/v1/completions`      | POST   | Text completion                 |
| `/health`              | GET    | Health check (may return empty) |

### Additional Endpoints

| Endpoint      | Method | Description              |
| ------------- | ------ | ------------------------ |
| `/tokenize`   | POST   | Tokenize text            |
| `/detokenize` | POST   | Convert tokens to text   |
| `/version`    | GET    | Get vLLM version         |
| `/docs`       | GET    | Swagger UI documentation |
| `/metrics`    | GET    | Prometheus metrics       |

#### Tokenize Text

Useful for counting tokens before sending requests:

```bash
curl https://your-http-pub.clorecloud.net/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Hello world"
  }'
```

Example response (token IDs and counts depend on the model's tokenizer):

```json
{"count": 2, "max_model_len": 32768, "tokens": [9707, 1879]}
```
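The `count` and `max_model_len` fields from `/tokenize` are enough for a pre-flight check that a prompt plus its requested completion will fit the context window:

```python
def fits_in_context(prompt_tokens: int, max_model_len: int, max_new_tokens: int) -> bool:
    """True if the prompt plus the requested completion fits the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len


# Using the example response above: a 2-token prompt with a 500-token
# completion budget fits easily in a 32768-token window.
ok = fits_in_context(2, 32768, 500)
```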

#### Detokenize

Convert token IDs back to text:

```bash
curl https://your-http-pub.clorecloud.net/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "tokens": [9707, 1879]
  }'
```

Response:

```json
{"prompt": "Hello world"}
```

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/version
```

Response:

```json
{"version": "0.7.3"}
```

#### Swagger Documentation

Open in browser for interactive API documentation:

```
https://your-http-pub.clorecloud.net/docs
```

#### Prometheus Metrics

For monitoring:

```bash
curl https://your-http-pub.clorecloud.net/metrics
```

{% hint style="info" %}
**Reasoning Models:** DeepSeek-R1 and similar models include `<think>` tags in responses showing the model's reasoning process before the final answer.
{% endhint %}

## Benchmarks

### Throughput (tokens/sec per user)

| Model              | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| ------------------ | -------- | -------- | --------- | --------- |
| Mistral 7B         | 100      | 170      | 210       | 230       |
| Llama 3.1 8B       | 95       | 150      | 200       | 220       |
| Llama 3.1 8B (AWQ) | 130      | 190      | 260       | 280       |
| Mixtral 8x7B       | -        | 45       | 70        | 85        |
| Llama 3.1 70B      | -        | -        | 25 (2x)   | 45 (2x)   |
| DeepSeek-R1 7B     | 90       | 145      | 190       | 210       |
| DeepSeek-R1 32B    | -        | -        | 40        | 70 (2x)   |

*Benchmarks updated January 2026.*

### Context Length vs VRAM

| Model    | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
| -------- | ------ | ------ | ------- | ------- |
| 8B FP16  | 18GB   | 22GB   | 30GB    | 46GB    |
| 8B AWQ   | 8GB    | 10GB   | 14GB    | 22GB    |
| 70B FP16 | 145GB  | 160GB  | 190GB   | 250GB   |
| 70B AWQ  | 42GB   | 50GB   | 66GB    | 98GB    |
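The growth with context length in the table is driven by the KV cache, which scales linearly with sequence length. A sketch of the standard estimate, using Llama-3.1-8B's published shape (32 layers, 8 KV heads via GQA, head dim 128) as the worked example:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   dtype_bytes: int, seq_len: int) -> int:
    """KV-cache size for one sequence: a K and a V tensor per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len


# Llama-3.1-8B in FP16 at 8K context: exactly 1 GiB per concurrent sequence
gib = kv_cache_bytes(32, 8, 128, 2, 8192) / 2**30  # -> 1.0
```

Multiply by `--max-num-seqs` for a worst-case batch estimate; the table's figures also include model weights, which this formula deliberately excludes.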

## Hugging Face Authentication

For gated models (Llama, etc.):

```bash
# Pass the token as an environment variable (vllm serve has no --env flag)
HUGGING_FACE_HUB_TOKEN=hf_xxxxx vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0
```

In Docker, pass it with `-e HUGGING_FACE_HUB_TOKEN=hf_xxxxx` on the `docker run` command line.

Or set as environment variable:

```bash
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx
```

## GPU Requirements

| Model | Min VRAM | Min RAM  | Recommended         |
| ----- | -------- | -------- | ------------------- |
| 7-8B  | 16GB     | **16GB** | 24GB VRAM, 32GB RAM |
| 13B   | 26GB     | 32GB     | 40GB VRAM           |
| 34B   | 70GB     | 32GB     | 80GB VRAM           |
| 70B   | 140GB    | 64GB     | 2x80GB              |

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Best For      |
| -------- | ---- | ---------- | ------------- |
| RTX 3090 | 24GB | $0.30–1.00 | 7-8B models   |
| RTX 4090 | 24GB | $0.50–2.00 | 7-13B, fast   |
| A100     | 40GB | $1.50–3.00 | 13-34B models |
| A100     | 80GB | $2.00–4.00 | 34-70B models |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
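Daily prices translate into cost per generated token once you know the throughput. A rough calculator, assuming full continuous utilization (real workloads rarely achieve this, so actual cost per token will be higher):

```python
def cost_per_million_tokens(price_per_day_usd: float, tokens_per_sec: float) -> float:
    """USD per one million generated tokens at 100% utilization."""
    tokens_per_day = tokens_per_sec * 86_400  # seconds in a day
    return price_per_day_usd / tokens_per_day * 1_000_000


# An RTX 3090 at $0.50/day serving Mistral 7B at ~100 tok/s:
# 0.50 / 8.64M tokens/day, roughly $0.058 per million tokens
cost = cost_per_million_tokens(0.50, 100)
```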

## Troubleshooting

### HTTP 502 for a long time

1. **Check RAM:** Server must have 16GB+ RAM
2. **Check VRAM:** Must fit the model
3. **Model downloading:** First run downloads from HuggingFace (5-15 min)
4. **HF Token:** Gated models require authentication

### Out of Memory

```bash
# Reduce memory usage
--gpu-memory-utilization 0.8
--max-model-len 4096
--max-num-seqs 64

# Or use quantization
--quantization awq
```

### Model Download Fails

```bash
# Check HF token
echo $HUGGING_FACE_HUB_TOKEN

# Pre-download model
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2
```

## vLLM vs Others

| Feature      | vLLM        | llama.cpp | Ollama  |
| ------------ | ----------- | --------- | ------- |
| Throughput   | Best        | Good      | Good    |
| VRAM Usage   | High        | Low       | Medium  |
| Ease of Use  | Medium      | Medium    | Easy    |
| Startup Time | 5-15 min    | 1-2 min   | 30 sec  |
| Multi-GPU    | Native      | Limited   | Limited |
| Tool Calling | Yes (v0.7+) | Limited   | Limited |
| Multi-LoRA   | Yes (v0.7+) | No        | No      |

**Use vLLM when:**

* High throughput is priority
* Serving multiple users
* Have enough VRAM and RAM
* Production deployment
* Need tool calling / structured outputs

**Use Ollama when:**

* Quick setup needed
* Single user
* Less resources available

## Next Steps

* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Simpler alternative with faster startup
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning model guide
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Best general model
* [Qwen2.5](https://docs.clore.ai/guides/language-models/qwen25) - Multilingual models
* [Llama.cpp](https://docs.clore.ai/guides/language-models/llamacpp-server) - Lower VRAM option
