vLLM
Server Requirements
| Parameter | Minimum | Recommended |
| --- | --- | --- |
Why vLLM?
Quick Deploy on CLORE.AI
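The exact flow depends on the CLORE marketplace UI, but a typical deployment points the order at the official `vllm/vllm-openai` image. A minimal sketch, assuming a single GPU; the model name is an example:

```bash
# Minimal launch of the OpenAI-compatible server.
# --ipc=host is recommended by vLLM for PyTorch shared-memory use.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2
```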
Verify It's Working
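Once the container is up, two quick checks confirm the server is healthy (port 8000 assumed):

```bash
# Liveness probe: returns HTTP 200 once the model has finished loading.
curl http://localhost:8000/health

# Lists the model(s) the server is serving.
curl http://localhost:8000/v1/models
```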
Accessing Your Service
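From outside the server, use the forwarded address shown on your CLORE order page; the host and port below are placeholders:

```bash
# Substitute the forwarded host and port from your order details.
curl http://<forwarded-host>:<forwarded-port>/v1/models
```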
Installation
Using Docker (Recommended)
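A fuller invocation mounts the Hugging Face cache so downloaded weights survive container restarts; the model name is an example:

```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000
```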
Using pip
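vLLM ships CUDA wheels on PyPI, so a plain pip install works on a machine with a supported GPU; the served model is an example:

```bash
pip install vllm

# Start the OpenAI-compatible server from the installed package.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000
```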
Supported Models
| Model | Parameters | VRAM Required | RAM Required |
| --- | --- | --- | --- |
Server Options
Basic Server
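A basic single-GPU launch with default settings, as a sketch (model name is an example):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --host 0.0.0.0 \
  --port 8000
```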
Production Server
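For production you typically pin down memory use, context length, and authentication. The flag values below are illustrative, not tuned recommendations:

```bash
# --tensor-parallel-size shards the model across GPUs,
# --gpu-memory-utilization caps the fraction of VRAM vLLM claims,
# --max-model-len bounds KV-cache growth per request,
# --api-key requires "Authorization: Bearer <key>" on every call.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-13b-chat-hf \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --api-key YOUR_SECRET_KEY
```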
With Quantization (Lower VRAM)
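Serving AWQ- or GPTQ-quantized weights roughly halves VRAM versus fp16 at some quality cost; the model repo below is an example AWQ build:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --port 8000
```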
API Usage
Chat Completions (OpenAI Compatible)
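Any OpenAI SDK can talk to vLLM by overriding `base_url`. A sketch with the official Python client; the `api_key` value only matters if the server was started with `--api-key`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain continuous batching in one sentence."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```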
Streaming
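Setting `stream=True` returns tokens as they are generated instead of one final body:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the first and last chunks may carry no text
        print(delta, end="", flush=True)
print()
```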
cURL
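The same request over plain curl:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```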
Text Completions
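The legacy `/v1/completions` endpoint takes a raw prompt and applies no chat template:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0
      }'
```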
Complete API Reference
Standard Endpoints
| Endpoint | Method | Description |
| --- | --- | --- |
| /health | GET | Liveness check; returns 200 once the model is loaded |
| /v1/models | GET | List served models |
| /v1/chat/completions | POST | OpenAI-compatible chat completions |
| /v1/completions | POST | OpenAI-compatible text completions |
Additional Endpoints
| Endpoint | Method | Description |
| --- | --- | --- |
| /tokenize | POST | Tokenize a prompt with the served model's tokenizer |
| /detokenize | POST | Convert token IDs back into text |
| /version | GET | Report the running vLLM version |
| /docs | GET | Interactive Swagger (OpenAPI) documentation |
| /metrics | GET | Prometheus metrics |
Tokenize Text
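A sketch of a tokenize request; the exact response fields (token IDs, count) can vary across vLLM versions:

```bash
curl http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "prompt": "Hello, world!"}'
```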
Detokenize
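The inverse operation; the token IDs below are placeholders, not real IDs for any particular tokenizer:

```bash
curl http://localhost:8000/detokenize \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "tokens": [1, 2, 3]}'
```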
Get Version
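To check which vLLM release is running:

```bash
# Reports the running vLLM version as JSON.
curl http://localhost:8000/version
```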
Swagger Documentation
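The server is built on FastAPI, so interactive Swagger documentation is typically browsable at `http://localhost:8000/docs`.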
Prometheus Metrics
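Metrics are exposed in Prometheus text format; in recent releases the counters and gauges are prefixed with `vllm:`:

```bash
curl http://localhost:8000/metrics
```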
Benchmarks
Throughput (tokens/sec per user)
| Model | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| --- | --- | --- | --- | --- |
Context Length vs VRAM
| Model | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
| --- | --- | --- | --- | --- |
Hugging Face Authentication
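Gated models (the Llama family, for example) require a Hugging Face token with access granted to the repo. Pass it into the container as an environment variable; the token value is a placeholder:

```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxx \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf
```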
GPU Requirements
| Model | Min VRAM | Min RAM | Recommended |
| --- | --- | --- | --- |
Cost Estimate
| GPU | Price (CLORE/day) | Approx. USD/hr | Best For |
| --- | --- | --- | --- |
Troubleshooting
HTTP 502 for a Long Time
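A lingering 502 usually just means the model is still downloading or loading; multi-gigabyte weights can take several minutes. Tail the container logs to watch progress (the container ID is a placeholder):

```bash
# Startup is complete once the server logs that it is listening on its port.
docker logs -f <container-id>
```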
Out of Memory
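Lower the VRAM fraction, cap the context length, or switch to quantized weights; the values below are illustrative starting points:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 4096
```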
Model Download Fails
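Common causes are a missing or unauthorized Hugging Face token (for gated models) and a full disk. Two quick checks, assuming the default cache location:

```bash
# Is there room for the weights?
df -h ~/.cache/huggingface

# Is the token valid and logged in?
huggingface-cli whoami
```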
vLLM vs Others
| Feature | vLLM | llama.cpp | Ollama |
| --- | --- | --- | --- |
Next Steps