vLLM

High-throughput LLM inference server for production workloads on CLORE.AI GPUs.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 16GB | 32GB+ |
| VRAM | 16GB (7B) | 24GB+ |
| Network | 500Mbps | 1Gbps+ |
| Startup Time | 5-15 minutes | - |


Why vLLM?

  • Fastest throughput - PagedAttention delivers up to 24x higher throughput than standard Hugging Face Transformers serving

  • Production ready - OpenAI-compatible API out of the box

  • Continuous batching - Efficient multi-user serving

  • Streaming - Real-time token generation

  • Multi-GPU - Tensor parallelism for large models

Quick Deploy on CLORE.AI

Docker Image:
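The exact template value is not reproduced here; a common choice is vLLM's official OpenAI-compatible server image (an assumption, check the CLORE.AI template for the exact tag):

```
vllm/vllm-openai:latest
```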

Ports:
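vLLM's API server listens on port 8000 by default, so that is the port to expose as HTTP (assumed here):

```
8000
```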

Command:
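A minimal sketch of the container command, assuming Mistral 7B Instruct as the model; the vllm/vllm-openai image already launches the API server, so only the flags are passed:

```
--model mistralai/Mistral-7B-Instruct-v0.3 --host 0.0.0.0 --port 8000
```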

Verify It's Working

After deployment, find your http_pub URL in My Orders:
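A quick check is to list the served models; the hostname below is a placeholder for your actual http_pub URL:

```bash
# A JSON object describing the loaded model means the server is up
curl https://your-http-pub.clorecloud.net/v1/models
```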


Accessing Your Service

When deployed on CLORE.AI, access vLLM via the http_pub URL assigned to your order.


All localhost:8000 examples below work when connected via SSH. For external access, replace with your https://your-http-pub.clorecloud.net/ URL.

Installation

Using pip
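If you prefer to run vLLM directly on a rented GPU server (Linux with a CUDA-enabled PyTorch), the package is available from PyPI:

```bash
pip install vllm
```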

Supported Models

| Model | Parameters | VRAM Required | RAM Required |
| --- | --- | --- | --- |
| Mistral 7B | 7B | 14GB | 16GB+ |
| Llama 3.1 8B | 8B | 16GB | 16GB+ |
| Llama 3.1 70B | 70B | 140GB (or 2x80GB) | 64GB+ |
| Mixtral 8x7B | 47B | 90GB | 32GB+ |
| Qwen2.5 7B | 7B | 14GB | 16GB+ |
| Qwen2.5 72B | 72B | 145GB | 64GB+ |
| DeepSeek-V2 | 236B | Multi-GPU | 128GB+ |
| Phi-4 | 14B | 28GB | 32GB+ |
| Gemma 2 9B | 9B | 18GB | 16GB+ |
| CodeLlama 34B | 34B | 68GB | 32GB+ |

Server Options

Basic Server
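A minimal sketch, assuming Mistral 7B Instruct; `vllm serve` starts the OpenAI-compatible server on port 8000:

```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --host 0.0.0.0 --port 8000
```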

Production Server
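A sketch of a more production-oriented launch; the flags are standard vLLM options, the values are illustrative:

```bash
# Cap context length to control KV-cache size, limit the VRAM fraction vLLM claims,
# and require an API key on incoming requests
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --api-key "$VLLM_API_KEY"
```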

With Quantization (Lower VRAM)
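A sketch using an AWQ-quantized checkpoint to roughly halve VRAM use; the model name is an example of a community AWQ build, not a requirement:

```bash
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --max-model-len 8192
```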

API Usage

Chat Completions (OpenAI Compatible)
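A minimal sketch using the official openai Python client pointed at the local server; the model name must match what the server was launched with:

```python
from openai import OpenAI

# For external access, replace localhost with your http_pub URL
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```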

Streaming
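The same client can stream tokens as they are generated; a sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
```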

cURL
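The equivalent request with curl against the local server:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```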

Text Completions
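For plain text completions, a sketch against the /v1/completions endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    prompt="The three laws of robotics are",
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].text)
```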

Complete API Reference

vLLM provides OpenAI-compatible endpoints plus additional utility endpoints.

Standard Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat completion |
| /v1/completions | POST | Text completion |
| /health | GET | Health check (may return empty) |

Additional Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /tokenize | POST | Tokenize text |
| /detokenize | POST | Convert tokens to text |
| /version | GET | Get vLLM version |
| /docs | GET | Swagger UI documentation |
| /metrics | GET | Prometheus metrics |

Tokenize Text

Useful for counting tokens before sending requests:
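A sketch of a tokenize request; the model field must match the served model:

```bash
curl http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "Hello, world!"}'
```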

Response:
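The response has roughly this shape (values are illustrative):

```json
{
  "count": 4,
  "max_model_len": 32768,
  "tokens": [22557, 1044, 1526, 1033]
}
```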

Detokenize

Convert token IDs back to text:
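A sketch, passing back the token IDs returned by /tokenize:

```bash
curl http://localhost:8000/detokenize \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "tokens": [22557, 1044, 1526, 1033]}'
```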

Response:
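The decoded text comes back in a prompt field, roughly:

```json
{"prompt": "Hello, world!"}
```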

Get Version
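The version endpoint takes a simple GET:

```bash
curl http://localhost:8000/version
```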

Response:
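The version number below is illustrative; yours will reflect the installed release:

```json
{"version": "0.6.6"}
```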

Swagger Documentation

Open in browser for interactive API documentation:
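Using the public endpoint (the hostname is a placeholder for your deployment):

```
https://your-http-pub.clorecloud.net/docs
```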

Prometheus Metrics

For monitoring:
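The endpoint returns Prometheus text-format metrics (request counts, latency histograms, KV-cache usage and similar), which a Prometheus scraper can ingest directly:

```bash
curl http://localhost:8000/metrics
```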


Reasoning Models: Some models like Qwen3 support reasoning mode and may include <think> tags in responses showing the model's reasoning process.

Benchmarks

Throughput (tokens/sec per user)

| Model | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| --- | --- | --- | --- | --- |
| Mistral 7B | 100 | 170 | 210 | 230 |
| Llama 3.1 8B | 95 | 150 | 200 | 220 |
| Llama 3.1 8B (AWQ) | 130 | 190 | 260 | 280 |
| Mixtral 8x7B | - | 45 | 70 | 85 |
| Llama 3.1 70B | - | - | 25 (2x) | 45 (2x) |

Benchmarks updated January 2026.

Context Length vs VRAM

| Model | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
| --- | --- | --- | --- | --- |
| 8B FP16 | 18GB | 22GB | 30GB | 46GB |
| 8B AWQ | 8GB | 10GB | 14GB | 22GB |
| 70B FP16 | 145GB | 160GB | 190GB | 250GB |
| 70B AWQ | 42GB | 50GB | 66GB | 98GB |

Hugging Face Authentication

For gated models (Llama, etc.):
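When running vLLM directly on the host, one option is to log in with the Hugging Face CLI before starting the server:

```bash
# Paste a token created at https://huggingface.co/settings/tokens
huggingface-cli login
```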

Or set as environment variable:
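For container deployments it is usually more convenient to pass the token via HF_TOKEN, which the Hugging Face Hub client reads; the value below is a placeholder:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```

On CLORE.AI, add the same variable to the container's environment when creating the order.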

GPU Requirements

| Model | Min VRAM | Min RAM | Recommended |
| --- | --- | --- | --- |
| 7-8B | 16GB | 16GB | 24GB VRAM, 32GB RAM |
| 13B | 26GB | 32GB | 40GB VRAM |
| 34B | 70GB | 32GB | 80GB VRAM |
| 70B | 140GB | 64GB | 2x80GB |

Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU | CLORE/day | Approx USD/hr | Best For |
| --- | --- | --- | --- |
| RTX 3090 24GB | ~150 | ~$0.03 | 7-8B models |
| RTX 4090 24GB | ~200 | ~$0.04 | 7-13B, fast |
| A100 40GB | ~400 | ~$0.08 | 13-34B models |
| A100 80GB | ~600 | ~$0.12 | 34-70B models |

Prices vary by provider. Check the CLORE.AI Marketplace for current rates.

Troubleshooting

HTTP 502 for a long time

  1. Check RAM: Server must have 16GB+ RAM

  2. Check VRAM: Must fit the model

  3. Model downloading: First run downloads from HuggingFace (5-15 min)

  4. HF Token: Gated models require authentication

Out of Memory
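Common mitigations, as a sketch; all flags are standard vLLM options and the values are illustrative:

```bash
# Reduce the maximum context length to shrink the KV cache
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --max-model-len 4096

# Lower the fraction of VRAM vLLM tries to claim
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --gpu-memory-utilization 0.85

# Otherwise: switch to a quantized (e.g. AWQ) checkpoint,
# or spread the model across GPUs with --tensor-parallel-size
```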

Model Download Fails
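Typical causes are a missing Hugging Face token, a full disk, or a flaky connection. A few checks, assuming shell access to the server:

```bash
echo "$HF_TOKEN"                # is the token set for gated models?
df -h ~/.cache/huggingface      # is there enough disk space for the weights?

# Pre-download (or resume) the weights manually into the cache
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3
```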

vLLM vs Others

| Feature | vLLM | llama.cpp | Ollama |
| --- | --- | --- | --- |
| Throughput | Best | Good | Good |
| VRAM Usage | High | Low | Medium |
| Ease of Use | Medium | Medium | Easy |
| Startup Time | 5-15 min | 1-2 min | 30 sec |
| Multi-GPU | Native | Limited | Limited |

Use vLLM when:

  • High throughput is priority

  • Serving multiple users

  • Have enough VRAM and RAM

  • Production deployment

Use Ollama when:

  • Quick setup needed

  • Single user

  • Less resources available

Next Steps
