# vLLM

High-throughput LLM inference server for production workloads on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Current Version: v0.7.x** — This guide covers vLLM v0.7.3+. New features include DeepSeek-R1 support, structured outputs with automatic tool choice, multi-LoRA serving, and improved memory efficiency.
{% endhint %}

## Server Requirements

| Parameter    | Minimum      | Recommended |
| ------------ | ------------ | ----------- |
| RAM          | **16GB**     | 32GB+       |
| VRAM         | 16GB (7B)    | 24GB+       |
| Network      | 500Mbps      | 1Gbps+      |
| Startup Time | 5-15 minutes | -           |

{% hint style="danger" %}
**Important:** vLLM requires significant RAM and VRAM. Servers with less than 16GB RAM will fail to run even 7B models.
{% endhint %}

{% hint style="warning" %}
**Startup Time:** The first launch downloads the model from HuggingFace (5-15 minutes depending on model size and network speed). HTTP 502 during this time is normal.
{% endhint %}

## Why vLLM?

* **Fastest throughput** - PagedAttention delivers up to 24x higher throughput than naive Hugging Face Transformers serving
* **Production ready** - OpenAI-compatible API out of the box
* **Continuous batching** - Efficient multi-user serving
* **Streaming** - Real-time token generation
* **Multi-GPU** - Tensor parallelism for large models
* **Multi-LoRA** - Serve multiple fine-tuned adapters simultaneously (v0.7+)
* **Structured outputs** - JSON schema enforcement and tool calling (v0.7+)

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:v0.7.3
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 --host 0.0.0.0 --port 8000
```

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders**:

```bash
# Check health (may take 5-15 min on first run)
curl https://your-http-pub.clorecloud.net/health

# List models (only works after model loads)
curl https://your-http-pub.clorecloud.net/v1/models
```
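Since a 502 during model download is expected, scripts should poll `/health` rather than fail on the first error. A minimal standard-library sketch (the URL is a placeholder for your `http_pub` address; timeouts are illustrative):

```python
import time
import urllib.error
import urllib.request


def wait_for_ready(url, timeout_s=1200, interval_s=15, fetch=None):
    """Poll a /health URL until it returns HTTP 200 or the timeout expires."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.status
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # 502 / connection refused while the model is still downloading
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# wait_for_ready("https://your-http-pub.clorecloud.net/health")
```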

{% hint style="warning" %}
If you get HTTP 502 for more than 15 minutes, check:

1. Server has 16GB+ RAM
2. Server has enough VRAM for the model
3. HuggingFace token is set for gated models
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access vLLM via the `http_pub` URL:

```bash
# Chat completion
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

{% hint style="info" %}
All `localhost:8000` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Installation

### Using Docker (Recommended)

```bash
docker run -d --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.7.3 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0
```

### Using pip

```bash
pip install vllm==0.7.3

# Run server
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2
```

## Supported Models

| Model                         | Parameters | VRAM Required     | RAM Required |
| ----------------------------- | ---------- | ----------------- | ------------ |
| Mistral 7B                    | 7B         | 14GB              | 16GB+        |
| Llama 3.1 8B                  | 8B         | 16GB              | 16GB+        |
| Llama 3.1 70B                 | 70B        | 140GB (or 2x80GB) | 64GB+        |
| Mixtral 8x7B                  | 47B        | 90GB              | 32GB+        |
| Qwen2.5 7B                    | 7B         | 14GB              | 16GB+        |
| Qwen2.5 72B                   | 72B        | 145GB             | 64GB+        |
| DeepSeek-V3                   | 671B MoE   | Multi-GPU         | 128GB+       |
| DeepSeek-R1-Distill-Qwen-7B   | 7B         | 14GB              | 16GB+        |
| DeepSeek-R1-Distill-Qwen-32B  | 32B        | 64GB              | 32GB+        |
| DeepSeek-R1-Distill-Llama-70B | 70B        | 140GB             | 64GB+        |
| Phi-4                         | 14B        | 28GB              | 32GB+        |
| Gemma 2 9B                    | 9B         | 18GB              | 16GB+        |
| CodeLlama 34B                 | 34B        | 68GB              | 32GB+        |

## Server Options

### Basic Server

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000
```

### Production Server

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --enable-prefix-caching
```
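`--gpu-memory-utilization 0.9` caps vLLM's total VRAM allocation; whatever remains after the model weights is what PagedAttention can use for KV cache. A rough back-of-envelope helper (this ignores activation and CUDA-graph overhead, so treat the result as an upper bound):

```python
def kv_cache_budget_gb(vram_gb: float, gpu_mem_util: float, weights_gb: float) -> float:
    """Approximate VRAM left for the KV cache under --gpu-memory-utilization."""
    return vram_gb * gpu_mem_util - weights_gb


# A 24GB card serving Mistral 7B in FP16 (~14GB of weights) at 0.9 utilization
# leaves roughly 7.6GB for KV cache.
budget = kv_cache_budget_gb(24, 0.9, 14)
```

If the budget comes out near zero, lower `--max-model-len` or switch to a quantized checkpoint before lowering `--gpu-memory-utilization` further.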

### With Quantization (Lower VRAM)

```bash
# AWQ quantized model (uses less VRAM)
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --host 0.0.0.0 \
    --quantization awq
```

### Structured Outputs and Tool Calling (v0.7+)

Enable automatic tool choice and structured JSON outputs:

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```

Use in Python:

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"
)

# Parse tool call
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Tool: {tool_call.function.name}, Args: {args}")
```

Structured JSON output via response format:

```python
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Extract: John Smith, 30 years old, software engineer"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "occupation": {"type": "string"}
                },
                "required": ["name", "age", "occupation"]
            }
        }
    }
)
print(response.choices[0].message.content)
```
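Even with schema enforcement, it is prudent to validate the parsed output before passing it downstream. A small sketch whose key names mirror the `person` schema above (`parse_person` is a hypothetical helper, not a vLLM API):

```python
import json


def parse_person(raw: str) -> dict:
    """Parse the structured response and check the schema's required keys."""
    person = json.loads(raw)
    for key in ("name", "age", "occupation"):
        if key not in person:
            raise ValueError(f"missing required key: {key}")
    if not isinstance(person["age"], int):
        raise ValueError("age must be an integer")
    return person


# person = parse_person(response.choices[0].message.content)
```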

### Multi-LoRA Serving (v0.7+)

Serve a base model with multiple LoRA adapters simultaneously:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --enable-lora \
    --lora-modules \
        sql-adapter=path/to/sql-lora \
        code-adapter=path/to/code-lora \
        chat-adapter=path/to/chat-lora \
    --max-lora-rank 64
```

Query a specific LoRA adapter by model name:

```python
# Use the SQL adapter
response = client.chat.completions.create(
    model="sql-adapter",
    messages=[{"role": "user", "content": "Write a SQL query to find top 10 customers"}]
)

# Use the code adapter
response = client.chat.completions.create(
    model="code-adapter",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)
```

## DeepSeek-R1 Support (v0.7+)

vLLM v0.7+ has native support for DeepSeek-R1 distill models. These reasoning models wrap their chain of thought in `<think>` tags before the final answer.

### DeepSeek-R1-Distill-Qwen-7B (Single GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384
```

### DeepSeek-R1-Distill-Qwen-32B (Dual GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```

### DeepSeek-R1-Distill-Llama-70B (Quad GPU)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768
```

### Querying DeepSeek-R1

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {
            "role": "user",
            "content": "Solve: If a train travels 120km in 1.5 hours, what is its speed in m/s?"
        }
    ],
    max_tokens=2048,
    temperature=0.6
)

content = response.choices[0].message.content
# Response includes <think>...</think> reasoning block followed by the answer
print(content)
```

Parsing think tags:

```python
import re

def parse_deepseek_r1_response(content: str) -> dict:
    """Extract thinking and answer from DeepSeek-R1 response."""
    think_match = re.search(r'<think>(.*?)</think>', content, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""
    answer = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    return {"thinking": thinking, "answer": answer}

result = parse_deepseek_r1_response(content)
print("Thinking:", result["thinking"][:200], "...")
print("Answer:", result["answer"])
```

## API Usage

### Chat Completions (OpenAI Compatible)

```python
from openai import OpenAI

# For external access, use your http_pub URL:
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed"
)

# Or via SSH tunnel:
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### cURL

```bash
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

### Text Completions

```bash
curl https://your-http-pub.clorecloud.net/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'
```

## Complete API Reference

vLLM provides OpenAI-compatible endpoints plus additional utility endpoints.

### Standard Endpoints

| Endpoint               | Method | Description                     |
| ---------------------- | ------ | ------------------------------- |
| `/v1/models`           | GET    | List available models           |
| `/v1/chat/completions` | POST   | Chat completion                 |
| `/v1/completions`      | POST   | Text completion                 |
| `/health`              | GET    | Health check (may return empty) |

### Additional Endpoints

| Endpoint      | Method | Description              |
| ------------- | ------ | ------------------------ |
| `/tokenize`   | POST   | Tokenize text            |
| `/detokenize` | POST   | Convert tokens to text   |
| `/version`    | GET    | Get vLLM version         |
| `/docs`       | GET    | Swagger UI documentation |
| `/metrics`    | GET    | Prometheus metrics       |

#### Tokenize Text

Useful for counting tokens before sending requests:

```bash
curl https://your-http-pub.clorecloud.net/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Hello world"
  }'
```

Example response (token IDs and counts depend on the model's tokenizer):

```json
{"count": 2, "max_model_len": 32768, "tokens": [9707, 1879]}
```
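The `count` and `max_model_len` fields from `/tokenize` are enough for a pre-flight check that a prompt plus its requested completion will fit the context window:

```python
def fits_in_context(prompt_tokens: int, max_model_len: int, max_new_tokens: int) -> bool:
    """True if the prompt plus the requested completion fits the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len


# Using the example response above: a 2-token prompt with a 500-token
# completion budget fits easily in a 32768-token window.
ok = fits_in_context(2, 32768, 500)
```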

#### Detokenize

Convert token IDs back to text:

```bash
curl https://your-http-pub.clorecloud.net/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "tokens": [9707, 1879]
  }'
```

Response:

```json
{"prompt": "Hello world"}
```

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/version
```

Response:

```json
{"version": "0.7.3"}
```

#### Swagger Documentation

Open in browser for interactive API documentation:

```
https://your-http-pub.clorecloud.net/docs
```

#### Prometheus Metrics

For monitoring:

```bash
curl https://your-http-pub.clorecloud.net/metrics
```

{% hint style="info" %}
**Reasoning Models:** DeepSeek-R1 and similar models include `<think>` tags in responses showing the model's reasoning process before the final answer.
{% endhint %}

## Benchmarks

### Throughput (tokens/sec per user)

| Model              | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| ------------------ | -------- | -------- | --------- | --------- |
| Mistral 7B         | 100      | 170      | 210       | 230       |
| Llama 3.1 8B       | 95       | 150      | 200       | 220       |
| Llama 3.1 8B (AWQ) | 130      | 190      | 260       | 280       |
| Mixtral 8x7B       | -        | 45       | 70        | 85        |
| Llama 3.1 70B      | -        | -        | 25 (2x)   | 45 (2x)   |
| DeepSeek-R1 7B     | 90       | 145      | 190       | 210       |
| DeepSeek-R1 32B    | -        | -        | 40        | 70 (2x)   |

*Benchmarks updated January 2026.*

### Context Length vs VRAM

| Model    | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
| -------- | ------ | ------ | ------- | ------- |
| 8B FP16  | 18GB   | 22GB   | 30GB    | 46GB    |
| 8B AWQ   | 8GB    | 10GB   | 14GB    | 22GB    |
| 70B FP16 | 145GB  | 160GB  | 190GB   | 250GB   |
| 70B AWQ  | 42GB   | 50GB   | 66GB    | 98GB    |
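The growth with context length in the table is driven by the KV cache, which scales linearly with sequence length. A sketch of the standard estimate, using Llama-3.1-8B's published shape (32 layers, 8 KV heads via GQA, head dim 128) as the worked example:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   dtype_bytes: int, seq_len: int) -> int:
    """KV-cache size for one sequence: a K and a V tensor per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len


# Llama-3.1-8B in FP16 at 8K context: exactly 1 GiB per concurrent sequence
gib = kv_cache_bytes(32, 8, 128, 2, 8192) / 2**30  # -> 1.0
```

Multiply by `--max-num-seqs` for a worst-case batch estimate; the table's figures also include model weights, which this formula deliberately excludes.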

## Hugging Face Authentication

For gated models (Llama, etc.):

```bash
# Pass the token as an environment variable (vllm serve has no --env flag)
HUGGING_FACE_HUB_TOKEN=hf_xxxxx vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0
```

In Docker, pass it with `-e HUGGING_FACE_HUB_TOKEN=hf_xxxxx` on the `docker run` command line.

Or set as environment variable:

```bash
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx
```

## GPU Requirements

| Model | Min VRAM | Min RAM  | Recommended         |
| ----- | -------- | -------- | ------------------- |
| 7-8B  | 16GB     | **16GB** | 24GB VRAM, 32GB RAM |
| 13B   | 26GB     | 32GB     | 40GB VRAM           |
| 34B   | 70GB     | 32GB     | 80GB VRAM           |
| 70B   | 140GB    | 64GB     | 2x80GB              |

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Best For      |
| -------- | ---- | ---------- | ------------- |
| RTX 3090 | 24GB | $0.30–1.00 | 7-8B models   |
| RTX 4090 | 24GB | $0.50–2.00 | 7-13B, fast   |
| A100     | 40GB | $1.50–3.00 | 13-34B models |
| A100     | 80GB | $2.00–4.00 | 34-70B models |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
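Daily prices translate into cost per generated token once you know the throughput. A rough calculator, assuming full continuous utilization (real workloads rarely achieve this, so actual cost per token will be higher):

```python
def cost_per_million_tokens(price_per_day_usd: float, tokens_per_sec: float) -> float:
    """USD per one million generated tokens at 100% utilization."""
    tokens_per_day = tokens_per_sec * 86_400  # seconds in a day
    return price_per_day_usd / tokens_per_day * 1_000_000


# An RTX 3090 at $0.50/day serving Mistral 7B at ~100 tok/s:
# 0.50 / 8.64M tokens/day, roughly $0.058 per million tokens
cost = cost_per_million_tokens(0.50, 100)
```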

## Troubleshooting

### HTTP 502 for a long time

1. **Check RAM:** Server must have 16GB+ RAM
2. **Check VRAM:** Must fit the model
3. **Model downloading:** First run downloads from HuggingFace (5-15 min)
4. **HF Token:** Gated models require authentication

### Out of Memory

```bash
# Reduce memory usage
--gpu-memory-utilization 0.8
--max-model-len 4096
--max-num-seqs 64

# Or use quantization
--quantization awq
```

### Model Download Fails

```bash
# Check HF token
echo $HUGGING_FACE_HUB_TOKEN

# Pre-download model
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2
```

## vLLM vs Others

| Feature      | vLLM        | llama.cpp | Ollama  |
| ------------ | ----------- | --------- | ------- |
| Throughput   | Best        | Good      | Good    |
| VRAM Usage   | High        | Low       | Medium  |
| Ease of Use  | Medium      | Medium    | Easy    |
| Startup Time | 5-15 min    | 1-2 min   | 30 sec  |
| Multi-GPU    | Native      | Limited   | Limited |
| Tool Calling | Yes (v0.7+) | Limited   | Limited |
| Multi-LoRA   | Yes (v0.7+) | No        | No      |

**Use vLLM when:**

* High throughput is priority
* Serving multiple users
* Have enough VRAM and RAM
* Production deployment
* Need tool calling / structured outputs

**Use Ollama when:**

* Quick setup needed
* Single user
* Less resources available

## Next Steps

* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Simpler alternative with faster startup
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning model guide
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Best general model
* [Qwen2.5](https://docs.clore.ai/guides/language-models/qwen25) - Multilingual models
* [Llama.cpp](https://docs.clore.ai/guides/language-models/llamacpp-server) - Lower VRAM option
