# vLLM

High-throughput LLM inference server for production workloads on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Current Version: v0.7.x** — This guide covers vLLM v0.7.3+. New features include DeepSeek-R1 support, structured outputs with automatic tool choice, multi-LoRA serving, and improved memory efficiency.
{% endhint %}

## Server Requirements

| Parameter    | Minimum      | Recommended |
| ------------ | ------------ | ----------- |
| RAM          | **16GB**     | 32GB+       |
| VRAM         | 16GB (7B)    | 24GB+       |
| Network      | 500Mbps      | 1Gbps+      |
| Startup Time | 5-15 minutes | -           |

{% hint style="danger" %}
**Important:** vLLM requires significant RAM and VRAM. Servers with less than 16GB RAM will fail to run even 7B models.
{% endhint %}

{% hint style="warning" %}
**Startup Time:** The first launch downloads the model from HuggingFace (5-15 minutes depending on model size and network speed). HTTP 502 during this time is normal.
{% endhint %}

## Why vLLM?

* **Fastest throughput** - PagedAttention delivers up to 24x higher throughput than HuggingFace Transformers
* **Production ready** - OpenAI-compatible API out of the box
* **Continuous batching** - Efficient multi-user serving
* **Streaming** - Real-time token generation
* **Multi-GPU** - Tensor parallelism for large models
* **Multi-LoRA** - Serve multiple fine-tuned adapters simultaneously (v0.7+)
* **Structured outputs** - JSON schema enforcement and tool calling (v0.7+)

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:v0.7.3
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 --host 0.0.0.0 --port 8000
```

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders**:

```bash
# Check health (may take 5-15 min on first run)
curl https://your-http-pub.clorecloud.net/health

# List models (only works after model loads)
curl https://your-http-pub.clorecloud.net/v1/models
```
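
While the model is still downloading, `/health` returns HTTP 502. A small polling loop saves repeated manual checks (a minimal sketch using only the Python standard library; replace `BASE_URL` with your own `http_pub` URL):

```python
import time
import urllib.request

BASE_URL = "https://your-http-pub.clorecloud.net"  # your http_pub URL from My Orders

def wait_until_ready(timeout_s: int = 1200, interval_s: int = 30) -> bool:
    """Poll /health until vLLM answers with HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/health", timeout=10) as resp:
                if resp.status == 200:
                    return True
        except Exception:
            pass  # 502 or connection errors are expected while the model downloads
        time.sleep(interval_s)
    return False

print("vLLM is ready" if wait_until_ready() else "Still not ready - see the checklist below")
```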

{% hint style="warning" %}
If you get HTTP 502 for more than 15 minutes, check:

1. Server has 16GB+ RAM
2. Server has enough VRAM for the model
3. HuggingFace token is set for gated models
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access vLLM via the `http_pub` URL:

```bash
# Chat completion
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

{% hint style="info" %}
All `localhost:8000` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Installation

### Using Docker (Recommended)

```bash
docker run -d --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.7.3 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0
```

### Using pip

```bash
pip install vllm==0.7.3

# Run server
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2
```

## Supported Models

| Model                         | Parameters | VRAM Required     | RAM Required |
| ----------------------------- | ---------- | ----------------- | ------------ |
| Mistral 7B                    | 7B         | 14GB              | 16GB+        |
| Llama 3.1 8B                  | 8B         | 16GB              | 16GB+        |
| Llama 3.1 70B                 | 70B        | 140GB (or 2x80GB) | 64GB+        |
| Mixtral 8x7B                  | 47B        | 90GB              | 32GB+        |
| Qwen2.5 7B                    | 7B         | 14GB              | 16GB+        |
| Qwen2.5 72B                   | 72B        | 145GB             | 64GB+        |
| DeepSeek-V3                   | 671B MoE   | Multi-GPU         | 128GB+       |
| DeepSeek-R1-Distill-Qwen-7B   | 7B         | 14GB              | 16GB+        |
| DeepSeek-R1-Distill-Qwen-32B  | 32B        | 64GB              | 32GB+        |
| DeepSeek-R1-Distill-Llama-70B | 70B        | 140GB             | 64GB+        |
| Phi-4                         | 14B        | 28GB              | 32GB+        |
| Gemma 2 9B                    | 9B         | 18GB              | 16GB+        |
| CodeLlama 34B                 | 34B        | 68GB              | 32GB+        |
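
The VRAM column roughly follows 2 GB per billion parameters for FP16 weights (about a quarter of that for 4-bit AWQ), with the KV cache and activations on top. A rough rule-of-thumb sketch, not an official sizing tool:

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only VRAM: FP16 stores ~2 bytes per parameter, 4-bit AWQ roughly 0.55."""
    return params_billion * bytes_per_param

for name, size in [("Mistral 7B", 7), ("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70)]:
    fp16 = weight_vram_gb(size)
    awq = weight_vram_gb(size, bytes_per_param=0.55)
    print(f"{name}: ~{fp16:.0f} GB FP16, ~{awq:.0f} GB AWQ (plus KV cache)")
```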

## Server Options

### Basic Server

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000
```

### Production Server

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --enable-prefix-caching
```

### With Quantization (Lower VRAM)

```bash
# AWQ quantized model (uses less VRAM)
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --host 0.0.0.0 \
    --quantization awq
```

### Structured Outputs and Tool Calling (v0.7+)

Enable automatic tool choice and structured JSON outputs:

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```

Use in Python:

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"
)

# Parse tool call
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Tool: {tool_call.function.name}, Args: {args}")
```
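
To close the loop, run the tool yourself and send the result back as a `tool` message so the model can compose a final answer. This follows the standard OpenAI tool-calling pattern; `get_weather_from_api` is a hypothetical stand-in for your own implementation:

```python
# Hypothetical local implementation of the declared tool
def get_weather_from_api(city: str, unit: str = "celsius") -> str:
    return f"18 degrees {unit}, cloudy in {city}"  # stub result for illustration

result = get_weather_from_api(**args)

followup = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "What is the weather in Paris?"},
        response.choices[0].message,  # the assistant message that contains the tool call
        {"role": "tool", "tool_call_id": tool_call.id, "content": result},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)
```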

Structured JSON output via response format:

```python
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Extract: John Smith, 30 years old, software engineer"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "occupation": {"type": "string"}
                },
                "required": ["name", "age", "occupation"]
            }
        }
    }
)
print(response.choices[0].message.content)
```
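
If you already model your data with Pydantic, you can generate the schema instead of writing it by hand (a sketch assuming Pydantic v2 is installed; the fields mirror the example above):

```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    occupation: str

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Extract: John Smith, 30 years old, software engineer"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": Person.model_json_schema()},
    },
)

# Validate the model's JSON straight into the Pydantic object
person = Person.model_validate_json(response.choices[0].message.content)
print(person)
```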

### Multi-LoRA Serving (v0.7+)

Serve a base model with multiple LoRA adapters simultaneously:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --enable-lora \
    --lora-modules \
        sql-adapter=path/to/sql-lora \
        code-adapter=path/to/code-lora \
        chat-adapter=path/to/chat-lora \
    --max-lora-rank 64
```

Query a specific LoRA adapter by model name:

```python
# Use the SQL adapter
response = client.chat.completions.create(
    model="sql-adapter",
    messages=[{"role": "user", "content": "Write a SQL query to find top 10 customers"}]
)

# Use the code adapter
response = client.chat.completions.create(
    model="code-adapter",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)
```
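
The base model and every adapter appear as separate entries under `/v1/models`, so you can confirm they loaded:

```python
# List the base model and all LoRA adapters registered with the server
for model in client.models.list().data:
    print(model.id)
# Expected entries (names match the --lora-modules flags above):
#   meta-llama/Meta-Llama-3.1-8B-Instruct, sql-adapter, code-adapter, chat-adapter
```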

## DeepSeek-R1 Support (v0.7+)

vLLM v0.7+ has native support for the DeepSeek-R1 distill models. These reasoning models wrap their chain of thought in `<think>...</think>` tags before the final answer.

### DeepSeek-R1-Distill-Qwen-7B (Single GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384
```

### DeepSeek-R1-Distill-Qwen-32B (Dual GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```

### DeepSeek-R1-Distill-Llama-70B (Quad GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768
```

### Querying DeepSeek-R1

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {
            "role": "user",
            "content": "Solve: If a train travels 120km in 1.5 hours, what is its speed in m/s?"
        }
    ],
    max_tokens=2048,
    temperature=0.6
)

content = response.choices[0].message.content
# Response includes <think>...</think> reasoning block followed by the answer
print(content)
```

Parsing think tags:

```python
import re

def parse_deepseek_r1_response(content: str) -> dict:
    """Extract thinking and answer from DeepSeek-R1 response."""
    think_match = re.search(r'<think>(.*?)</think>', content, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""
    answer = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    return {"thinking": thinking, "answer": answer}

result = parse_deepseek_r1_response(content)
print("Thinking:", result["thinking"][:200], "...")
print("Answer:", result["answer"])
```

## API Usage

### Chat Completions (OpenAI Compatible)

```python
from openai import OpenAI

# For external access, use your http_pub URL:
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed"
)

# Or via SSH tunnel:
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### cURL

```bash
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

### Text Completions

```bash
curl https://your-http-pub.clorecloud.net/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'
```

## Complete API Reference

vLLM provides OpenAI-compatible endpoints plus additional utility endpoints.

### Standard Endpoints

| Endpoint               | Method | Description                     |
| ---------------------- | ------ | ------------------------------- |
| `/v1/models`           | GET    | List available models           |
| `/v1/chat/completions` | POST   | Chat completion                 |
| `/v1/completions`      | POST   | Text completion                 |
| `/health`              | GET    | Health check (may return empty) |

### Additional Endpoints

| Endpoint      | Method | Description              |
| ------------- | ------ | ------------------------ |
| `/tokenize`   | POST   | Tokenize text            |
| `/detokenize` | POST   | Convert tokens to text   |
| `/version`    | GET    | Get vLLM version         |
| `/docs`       | GET    | Swagger UI documentation |
| `/metrics`    | GET    | Prometheus metrics       |

#### Tokenize Text

Useful for counting tokens before sending requests:

```bash
curl https://your-http-pub.clorecloud.net/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Hello world"
  }'
```

Response:

```json
{"count": 2, "max_model_len": 32768, "tokens": [9707, 1879]}
```
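
A small pre-flight check keeps prompts under `max_model_len` before sending a request that would be rejected (a sketch using `requests` and the response fields shown above):

```python
import requests

BASE_URL = "https://your-http-pub.clorecloud.net"  # or http://localhost:8000 over SSH

def count_tokens(prompt: str, model: str = "mistralai/Mistral-7B-Instruct-v0.2") -> int:
    """Ask the server's /tokenize endpoint how many tokens the prompt uses."""
    resp = requests.post(f"{BASE_URL}/tokenize", json={"model": model, "prompt": prompt})
    resp.raise_for_status()
    data = resp.json()
    print(f"{data['count']} tokens of {data['max_model_len']} max")
    return data["count"]

count_tokens("Hello world")
```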

#### Detokenize

Convert token IDs back to text:

```bash
curl https://your-http-pub.clorecloud.net/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "tokens": [9707, 1879]
  }'
```

Response:

```json
{"prompt": "Hello world"}
```

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/version
```

Response:

```json
{"version": "0.7.3"}
```

#### Swagger Documentation

Open in browser for interactive API documentation:

```
https://your-http-pub.clorecloud.net/docs
```

#### Prometheus Metrics

For monitoring:

```bash
curl https://your-http-pub.clorecloud.net/metrics
```
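
vLLM's own metrics are prefixed with `vllm:` (queue depth, KV-cache usage, token counters). A quick filter shows them without a full Prometheus setup (a sketch; exact metric names vary between vLLM versions):

```python
import requests

metrics = requests.get("https://your-http-pub.clorecloud.net/metrics").text

# Print only vLLM's own metrics, skipping Prometheus HELP/TYPE comment lines
for line in metrics.splitlines():
    if line.startswith("vllm:"):
        print(line)
```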

{% hint style="info" %}
**Reasoning Models:** DeepSeek-R1 and similar models include `<think>` tags in responses showing the model's reasoning process before the final answer.
{% endhint %}

## Benchmarks

### Throughput (tokens/sec per user)

| Model              | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| ------------------ | -------- | -------- | --------- | --------- |
| Mistral 7B         | 100      | 170      | 210       | 230       |
| Llama 3.1 8B       | 95       | 150      | 200       | 220       |
| Llama 3.1 8B (AWQ) | 130      | 190      | 260       | 280       |
| Mixtral 8x7B       | -        | 45       | 70        | 85        |
| Llama 3.1 70B      | -        | -        | 25 (2x)   | 45 (2x)   |
| DeepSeek-R1 7B     | 90       | 145      | 190       | 210       |
| DeepSeek-R1 32B    | -        | -        | 40        | 70 (2x)   |

*Benchmarks updated January 2026. Entries marked (2x) use two GPUs with tensor parallelism.*

### Context Length vs VRAM

| Model    | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
| -------- | ------ | ------ | ------- | ------- |
| 8B FP16  | 18GB   | 22GB   | 30GB    | 46GB    |
| 8B AWQ   | 8GB    | 10GB   | 14GB    | 22GB    |
| 70B FP16 | 145GB  | 160GB  | 190GB   | 250GB   |
| 70B AWQ  | 42GB   | 50GB   | 66GB    | 98GB    |
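
The growth with context length is almost entirely KV cache: each token stores keys and values for every layer and KV head. A back-of-the-envelope sketch using Llama-3.1-8B's published config (32 layers, 8 KV heads, head dim 128, FP16 cache); vLLM also pre-allocates cache for many concurrent sequences via `--gpu-memory-utilization`, so the serving totals above are higher than one sequence's cache:

```python
def kv_cache_gb(context_len: int, num_layers: int = 32, num_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: 2 (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_len * per_token_bytes / 1024**3

for ctx in (4096, 8192, 16384, 32768):
    print(f"{ctx:>6} ctx: ~{kv_cache_gb(ctx):.1f} GB KV cache per sequence")
```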

## Hugging Face Authentication

For gated models (Llama, etc.):

```bash
# Pass the token as an environment variable on the same line
HUGGING_FACE_HUB_TOKEN=hf_xxxxx vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0
```

Or export it before launching (for Docker, add `-e HUGGING_FACE_HUB_TOKEN=hf_xxxxx` to the `docker run` command):

```bash
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx
```

## GPU Requirements

| Model | Min VRAM | Min RAM  | Recommended         |
| ----- | -------- | -------- | ------------------- |
| 7-8B  | 16GB     | **16GB** | 24GB VRAM, 32GB RAM |
| 13B   | 26GB     | 32GB     | 40GB VRAM           |
| 34B   | 70GB     | 32GB     | 80GB VRAM           |
| 70B   | 140GB    | 64GB     | 2x80GB              |

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Best For      |
| -------- | ---- | ---------- | ------------- |
| RTX 3090 | 24GB | $0.30–1.00 | 7-8B models   |
| RTX 4090 | 24GB | $0.50–2.00 | 7-13B, fast   |
| A100     | 40GB | $1.50–3.00 | 13-34B models |
| A100     | 80GB | $2.00–4.00 | 34-70B models |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
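
Combining the throughput and price tables gives a rough cost per million generated tokens (a sketch; real costs depend on utilization, batching, and prompt length):

```python
def cost_per_million_tokens(price_per_day_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M output tokens assuming full, continuous utilization."""
    tokens_per_day = tokens_per_sec * 86_400
    return price_per_day_usd / tokens_per_day * 1_000_000

# Example: Mistral 7B on an RTX 4090 at ~170 tok/s, rented at $1.00/day
print(f"~${cost_per_million_tokens(1.00, 170):.3f} per 1M tokens")
```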

## Troubleshooting

### HTTP 502 for a long time

1. **Check RAM:** Server must have 16GB+ RAM
2. **Check VRAM:** Must fit the model
3. **Model downloading:** First run downloads from HuggingFace (5-15 min)
4. **HF Token:** Gated models require authentication

### Out of Memory

```bash
# Reduce memory usage
--gpu-memory-utilization 0.8
--max-model-len 4096
--max-num-seqs 64

# Or use quantization
--quantization awq
```

### Model Download Fails

```bash
# Check HF token
echo $HUGGING_FACE_HUB_TOKEN

# Pre-download model
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2
```

## vLLM vs Others

| Feature      | vLLM        | llama.cpp | Ollama  |
| ------------ | ----------- | --------- | ------- |
| Throughput   | Best        | Good      | Good    |
| VRAM Usage   | High        | Low       | Medium  |
| Ease of Use  | Medium      | Medium    | Easy    |
| Startup Time | 5-15 min    | 1-2 min   | 30 sec  |
| Multi-GPU    | Native      | Limited   | Limited |
| Tool Calling | Yes (v0.7+) | Limited   | Limited |
| Multi-LoRA   | Yes (v0.7+) | No        | No      |

**Use vLLM when:**

* High throughput is priority
* Serving multiple users
* Have enough VRAM and RAM
* Production deployment
* Need tool calling / structured outputs

**Use Ollama when:**

* Quick setup needed
* Single user
* Less resources available

## Next Steps

* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Simpler alternative with faster startup
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning model guide
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Best general model
* [Qwen2.5](https://docs.clore.ai/guides/language-models/qwen25) - Multilingual models
* [Llama.cpp](https://docs.clore.ai/guides/language-models/llamacpp-server) - Lower VRAM option

