vLLM

High-performance LLM inference with vLLM on Clore.ai GPUs

High-throughput LLM inference server for production workloads on CLORE.AI GPUs.

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.

Current version: v0.7.x. This guide covers vLLM v0.7.3+. New features include DeepSeek-R1 support, structured outputs with automatic tool choice, multi-LoRA serving, and improved memory efficiency.

Server requirements

| Parameter | Minimum | Recommended |
|---|---|---|
| RAM | 16GB | 32GB+ |
| VRAM | 16GB (7B) | 24GB+ |
| Network | 500Mbps | 1Gbps+ |
| Startup time | 5-15 minutes | |

Important: vLLM needs substantial RAM and VRAM. Servers with less than 16GB of RAM will fail to run even 7B models.

Startup time: The first launch downloads the model from HuggingFace (5-15 minutes, depending on model size and network speed). HTTP 502 responses during this time are normal.

Why vLLM?

Higher throughput - PagedAttention, with up to 24x the throughput of Hugging Face Transformers according to the vLLM project's own benchmarks
Production-ready - OpenAI-compatible API out of the box
Continuous batching - Efficient multi-user serving
Streaming - Real-time token generation
Multi-GPU - Tensor parallelism for large models
Multi-LoRA - Serve multiple fine-tuned adapters simultaneously (v0.7+)
Structured outputs - JSON schema enforcement and tool calling (v0.7+)

Quick deployment on CLORE.AI

Docker image:

vllm/vllm-openai:v0.7.3

Ports:

22/tcp
8000/http

Command:

vllm serve mistralai/Mistral-7B-Instruct-v0.2 --host 0.0.0.0 --port 8000

Verify it works

After deployment, find your http_pub URL in My Orders:

# Check health (may take 5-15 min on the first run)
curl https://your-http-pub.clorecloud.net/health

# List models (only works once the model has loaded)
curl https://your-http-pub.clorecloud.net/v1/models

If you still get HTTP 502 after 15 minutes, check that:

The server has 16GB+ of RAM
The server has enough VRAM for the model
The HuggingFace token is set for gated models
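The wait-and-check routine above can be automated. The following is a minimal sketch using only the Python standard library; the function names, the 15-second interval, and the 900-second timeout (matching the 15-minute guidance) are illustrative assumptions, not part of vLLM or CLORE.AI.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url: str) -> int:
    """Return the HTTP status of GET {base_url}/health (0 if unreachable)."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 while the model is still loading
    except (urllib.error.URLError, TimeoutError):
        return 0


def wait_until_healthy(base_url, timeout_s=900, interval_s=15,
                       probe=probe_health, sleep=time.sleep):
    """Poll until /health returns 200; give up after timeout_s seconds."""
    waited = 0
    while waited <= timeout_s:
        if probe(base_url) == 200:
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

Calling `wait_until_healthy("https://your-http-pub.clorecloud.net")` returns `True` once the server answers with 200 and `False` if it is still failing after the timeout, which is the point at which the checklist above applies.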

Accessing your service

When deployed on CLORE.AI, access vLLM through the http_pub URL:

# Chat completion
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

All localhost:8000 examples below work when you are connected over SSH. For external access, replace localhost:8000 with your https://your-http-pub.clorecloud.net/ URL.

Installation

Using Docker (Recommended)

docker run -d --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.7.3 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0

Using pip

pip install vllm==0.7.3

# Run the server
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2

Supported models

| Model | Parameters | VRAM required | RAM required |
|---|---|---|---|
| Mistral 7B | 7B | 14GB | 16GB+ |
| Llama 3.1 8B | 8B | 16GB | 16GB+ |
| Llama 3.1 70B | 70B | 140GB (or 2x80GB) | 64GB+ |
| Mixtral 8x7B | 47B | 90GB | 32GB+ |
| Qwen2.5 7B | 7B | 14GB | 16GB+ |
| Qwen2.5 72B | 72B | 145GB | 64GB+ |
| DeepSeek-V3 | 671B MoE | Multi-GPU | 128GB+ |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 14GB | 16GB+ |
| DeepSeek-R1-Distill-Qwen-32B | 32B | 64GB | 32GB+ |
| DeepSeek-R1-Distill-Llama-70B | 70B | 140GB | 64GB+ |
| Phi-4 | 14B | 28GB | 32GB+ |
| Gemma 2 9B | 9B | 18GB | 16GB+ |
| CodeLlama 34B | 34B | 68GB | 32GB+ |
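Most VRAM figures in this table follow from the size of the weights at FP16, which is two bytes per parameter. The sketch below reproduces that arithmetic; it covers the weights only, so the KV cache and activations come on top, and the function name is an illustrative choice.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param


for name, size in [("Mistral 7B", 7), ("Phi-4", 14), ("Llama 3.1 70B", 70)]:
    print(f"{name}: ~{weight_memory_gb(size):.0f}GB at FP16, "
          f"~{weight_memory_gb(size, 4):.1f}GB at 4-bit")
```

Passing `bits_per_param=4` gives a rough lower bound for 4-bit quantized checkpoints such as AWQ; real quantized files are somewhat larger because some tensors stay in higher precision.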

Server options

Basic server

vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000

Production server

vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --enable-prefix-caching

With quantization (less VRAM)

# AWQ-quantized model (uses less VRAM)
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --host 0.0.0.0 \
    --quantization awq

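Structured outputs and tool calling (v0.7+)

Tool calling only works with models trained for it. vLLM's documentation lists Mistral-7B-Instruct-v0.3 for the mistral parser, so the v0.2 checkpoint used here may answer in plain text instead of returning tool calls. Enable automatic tool choice and structured JSON outputs: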

vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral

Use from Python:

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"
)

# Parse the tool call (tool_calls is None when the model replies in plain text)
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    print(f"Tool: {tool_call.function.name}, Args: {args}")
else:
    print("No tool call:", message.content)
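Once the tool name and arguments are parsed, the application still has to run the tool itself. Here is one possible way to dispatch the call to a local function; `get_weather` returns canned data and `TOOL_REGISTRY` is a hypothetical name, both used purely for illustration.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool implementation; returns canned data for the demo."""
    return {"city": city, "temperature": 21, "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Look up the tool by name, call it with the decoded arguments, return JSON."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json)
        result = TOOL_REGISTRY[name](**kwargs)
    except (json.JSONDecodeError, TypeError) as err:
        # Malformed JSON or arguments that do not match the function signature
        return json.dumps({"error": str(err)})
    return json.dumps(result)


print(run_tool_call("get_weather", '{"city": "Paris"}'))
```

In the OpenAI chat format, the returned string is sent back as a message with role "tool" and the matching `tool_call_id`, so the model can write its final answer from the result.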

Structured JSON output via response_format:

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Extract: John Smith, 30 years old, software engineer"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "occupation": {"type": "string"}
                },
                "required": ["name", "age", "occupation"]
            }
        }
    }
)
print(response.choices[0].message.content)
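The returned content is a JSON string, so it is worth decoding and checking it before use, for instance because a low `max_tokens` could cut the output short. A minimal sketch with the standard library, assuming the same three fields as the schema above (`parse_person` and `REQUIRED_FIELDS` are illustrative names):

```python
import json

REQUIRED_FIELDS = {"name": str, "age": int, "occupation": str}


def parse_person(content: str) -> dict:
    """Decode the model output and check the fields required by the schema."""
    data = json.loads(content)  # raises json.JSONDecodeError on truncated output
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be {expected_type.__name__}")
    return data


sample = '{"name": "John Smith", "age": 30, "occupation": "software engineer"}'
print(parse_person(sample))
```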

Multi-LoRA serving (v0.7+)

Serve one base model with multiple LoRA adapters simultaneously:

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --enable-lora \
    --lora-modules \
        sql-adapter=path/to/sql-lora \
        code-adapter=path/to/code-lora \
        chat-adapter=path/to/chat-lora \
    --max-lora-rank 64

Query a specific LoRA adapter by using its name as the model:

# Use the SQL adapter
response = client.chat.completions.create(
    model="sql-adapter",
    messages=[{"role": "user", "content": "Write a SQL query to find the top 10 customers"}]
)

# Use the code adapter
response = client.chat.completions.create(
    model="code-adapter",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}]
)
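Because the adapter is chosen through the `model` field, routing can live in a small lookup on the client side. The sketch below assumes the three adapter names from the serve command in this section; the task labels and the fallback to the base model name are illustrative choices.

```python
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Hypothetical task labels mapped to the adapter names registered at startup
ADAPTER_BY_TASK = {
    "sql": "sql-adapter",
    "code": "code-adapter",
    "chat": "chat-adapter",
}


def model_for_task(task: str) -> str:
    """Return the adapter name for a task, or the base model if none matches."""
    return ADAPTER_BY_TASK.get(task, BASE_MODEL)


print(model_for_task("sql"))
print(model_for_task("translation"))
```

The returned string is what goes into `model=` in `client.chat.completions.create(...)`.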

DeepSeek-R1 support (v0.7+)

vLLM v0.7+ natively supports the DeepSeek-R1 distilled models. These reasoning models emit <think> tags that expose their reasoning process.

DeepSeek-R1-Distill-Qwen-7B (single GPU)

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384

DeepSeek-R1-Distill-Qwen-32B (2 GPUs)

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90

DeepSeek-R1-Distill-Llama-70B (4 GPUs)

vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768
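The GPU counts in these three commands can be approximated from the FP16 weight size. The following sketch assumes 80GB GPUs and reserves 30% of the usable memory for the KV cache; both numbers, and the function itself, are assumptions chosen for illustration rather than vLLM defaults. It rounds up to a power of two because vLLM requires the number of attention heads to be divisible by the tensor-parallel size.

```python
import math


def min_tensor_parallel_size(weights_gb, gpu_vram_gb=80.0,
                             gpu_memory_utilization=0.90, kv_reserve=0.30):
    """Smallest power-of-two GPU count whose usable memory holds the weights."""
    usable_per_gpu = gpu_vram_gb * gpu_memory_utilization * (1 - kv_reserve)
    needed = math.ceil(weights_gb / usable_per_gpu)
    return 1 << (needed - 1).bit_length()  # round up to 1, 2, 4, 8, ...


for name, weights_gb in [("R1-Distill-Qwen-7B", 14),
                         ("R1-Distill-Qwen-32B", 64),
                         ("R1-Distill-Llama-70B", 140)]:
    print(name, "->", min_tensor_parallel_size(weights_gb), "GPU(s)")
```

Under these assumptions the sketch gives 1, 2 and 4 GPUs, matching the `--tensor-parallel-size` values used above; smaller GPUs or longer contexts shift the result upward.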

Querying DeepSeek-R1

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {
            "role": "user",
            "content": "Solve: If a train travels 120km in 1.5 hours, what is its speed in m/s?"
        }
    ],
    max_tokens=2048,
    temperature=0.6
)

content = response.choices[0].message.content
# The response contains a <think>...</think> reasoning block followed by the answer
print(content)

Parsing the think tags:

import re

def parse_deepseek_r1_response(content: str) -> dict:
    """Extract the reasoning and the final answer from a DeepSeek-R1 response."""
    think_match = re.search(r'<think>(.*?)</think>', content, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""
    answer = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    return {"thinking": thinking, "answer": answer}

result = parse_deepseek_r1_response(content)
print("Thinking:", result["thinking"][:200], "...")
print("Answer:", result["answer"])
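The regex approach expects a complete `<think>...</think>` pair. As a hedged alternative, the sketch below splits on the closing tag only, on the assumption that the output may start without an opening tag (some chat templates place it in the prompt) or may stop before the closing tag when `max_tokens` is reached. The function name and the `complete` flag are illustrative.

```python
def split_reasoning(content: str) -> dict:
    """Split on the closing tag so a missing opening tag is tolerated."""
    head, sep, tail = content.partition("</think>")
    if not sep:
        # No closing tag: either there was no reasoning block, or generation
        # stopped before the model finished thinking.
        if content.lstrip().startswith("<think>"):
            return {"thinking": content.replace("<think>", "", 1).strip(),
                    "answer": "", "complete": False}
        return {"thinking": "", "answer": content.strip(), "complete": True}
    thinking = head.replace("<think>", "", 1).strip()
    return {"thinking": thinking, "answer": tail.strip(), "complete": True}


print(split_reasoning("<think>120 km / 1.5 h = 80 km/h</think>About 22.2 m/s"))
```

When `complete` is `False`, raising `max_tokens` and retrying is usually the simplest remedy.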

API usage

Chat Completions (OpenAI-compatible)

from openai import OpenAI

# For external access, use your http_pub URL:
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed"
)

# Or through an SSH tunnel:
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
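If the full text is needed after streaming finishes, the deltas can be accumulated while they arrive. The helper below is a small sketch; `fake_chunk` builds stand-in objects with the same attribute layout as streamed chunks so the example runs without a server, and both names are illustrative.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate delta.content from streamed chunks, skipping empty ones."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute layout as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Roses "), fake_chunk(None), fake_chunk("are red")]
print(collect_stream(demo))
```

With a live server, `collect_stream(stream)` can be called on the `stream` object returned by `client.chat.completions.create(..., stream=True)`.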

cURL

curl https://your-http-pub.clorecloud.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
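The same request can be made from Python without installing the `openai` package. This sketch builds the request with the standard library only; `build_chat_request` is an illustrative helper, not part of any SDK.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=100):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("https://your-http-pub.clorecloud.net",
                         "mistralai/Mistral-7B-Instruct-v0.2", "Hello!")
print(req.full_url)
```

To send it, open the request with `urllib.request.urlopen(req)` and decode the body with `json.load`; the generated text is under `choices[0].message.content` in the decoded response.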

Text completions

curl https://your-http-pub.clorecloud.net/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'

Full API reference

vLLM exposes OpenAI-compatible endpoints plus additional utility endpoints.

Standard endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat completion |
| /v1/completions | POST | Text completion |
| /health | GET | Health check (may return an empty body) |

Additional endpoints

| Endpoint | Method | Description |
|---|---|---|
| /tokenize | POST | Tokenize text |
| /detokenize | POST | Convert tokens back to text |
| /version | GET | Get the vLLM version |
| /docs | GET | Swagger UI documentation |
| /metrics | GET | Prometheus metrics |

Tokenize text

Useful for counting tokens before sending requests:

curl https://your-http-pub.clorecloud.net/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Hello world"
  }'

Response:

{"count": 2, "max_model_len": 32768, "tokens": [9707, 1879]}
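The `count` and `max_model_len` fields are enough to check whether a prompt leaves room for the requested output. A minimal sketch, using the example response above as input (`fits_context` is an illustrative name):

```python
import json


def fits_context(tokenize_response: str, max_tokens: int) -> bool:
    """True if prompt tokens plus requested output tokens fit in the context."""
    info = json.loads(tokenize_response)
    return info["count"] + max_tokens <= info["max_model_len"]


example = '{"count": 2, "max_model_len": 32768, "tokens": [9707, 1879]}'
print(fits_context(example, max_tokens=500))
print(fits_context(example, max_tokens=32767))
```

The first call prints `True` (2 + 500 tokens fit in 32768) and the second prints `False` (2 + 32767 exceeds it by one token).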

Detokenize

Convert token IDs back to text:

curl https://your-http-pub.clorecloud.net/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "tokens": [9707, 1879]
  }'

Response:

{"prompt": "Hello world"}

Get version

curl https://your-http-pub.clorecloud.net/version

Response:

{"version": "0.7.3"}

Swagger documentation

Open this URL in a browser for interactive API documentation:

https://your-http-pub.clorecloud.net/docs

Prometheus metrics

For monitoring:

curl https://your-http-pub.clorecloud.net/metrics
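The endpoint returns the Prometheus text format, which is simple enough to read without a full Prometheus setup. The parser below is a rough sketch that assumes each sample line ends with its value (no trailing timestamp); the sample text is illustrative and much shorter than real output.

```python
def parse_metrics(text: str) -> dict:
    """Map each sample line ('name{labels} value') to a float, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


# Illustrative sample; real output has many more series and labels.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 1.0
"""

metrics = parse_metrics(SAMPLE)
print(metrics['vllm:num_requests_running{model_name="demo"}'])
```

A steadily growing waiting count is a reasonable signal that the server is saturated and that `--max-num-seqs` or the GPU count should be revisited.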

Reasoning models: DeepSeek-R1 and similar models include <think> tags in their responses, showing the model's reasoning process before the final answer.

Benchmarks

Throughput (tokens/sec per user)

Modelo

RTX 3090

RTX 4090

A100 40GB

A100 80GB

Mistral 7B

100

170

210

230

Llama 3.1 8B

150

200

220

Llama 3.1 8B (AWQ)

130

190

260

280

Mixtral 8x7B

Llama 3.1 70B

25 (2x)

45 (2x)

DeepSeek-R1 7B

145

190

210

DeepSeek-R1 32B

70 (2x)

Benchmarks updated January 2026.

Context length vs VRAM

| Model | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
|---|---|---|---|---|
| 8B FP16 | 18GB | 22GB | 30GB | 46GB |
| 8B AWQ | 8GB | 10GB | 14GB | 22GB |
| 70B FP16 | 145GB | 160GB | 190GB | 250GB |
| 70B AWQ | 42GB | 50GB | 66GB | 98GB |
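VRAM grows linearly with context length because every token stores a key and a value vector per layer in the KV cache. The sketch below computes that per-token cost for Llama 3.1 8B from its published configuration (32 layers, 8 KV heads, head dimension 128) at FP16. It covers a single sequence only; the totals in the table are higher, presumably because they include the weights and several concurrent sequences.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128, FP16 cache
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(f"{per_token} bytes per token")

for ctx in (4096, 8192, 16384, 32768):
    gib = per_token * ctx / 2**30
    print(f"{ctx} tokens: {gib:.1f} GiB per sequence")
```

Lowering `--max-model-len` or `--max-num-seqs` shrinks the cache that vLLM has to fit into the remaining VRAM.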

Hugging Face authentication

For gated models (Llama, etc.):

# Set the token for a single command
# (with Docker, pass --env HUGGING_FACE_HUB_TOKEN=hf_xxxxx to docker run)
HUGGING_FACE_HUB_TOKEN=hf_xxxxx vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0

Or export it as an environment variable:

export HUGGING_FACE_HUB_TOKEN=hf_xxxxx

GPU requirements

| Model size | Minimum VRAM | Minimum RAM | Recommended |
|---|---|---|---|
| 7-8B | 16GB | 16GB | 24GB VRAM, 32GB RAM |
| 13B | 26GB | 32GB | 40GB VRAM |
| 34B | 70GB | 32GB | 80GB VRAM |
| 70B | 140GB | 64GB | 2x80GB |

Cost estimate

Typical CLORE.AI marketplace rates:

| GPU | VRAM | Price/day | Best for |
|---|---|---|---|
| RTX 3090 | 24GB | $0.30–1.00 | 7-8B models |
| RTX 4090 | 24GB | $0.50–2.00 | 7-13B, fast |
| A100 | 40GB | $1.50–3.00 | 13-34B models |
| A100 | 80GB | $2.00–4.00 | 34-70B models |

Prices are in USD/day. Rates vary by provider: check the CLORE.AI Marketplace for current rates.
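Daily prices are easier to compare with hosted APIs when converted into a cost per million generated tokens. The sketch below does that conversion for a single continuous stream; the $1.00/day rate is a hypothetical value inside the RTX 4090 range above, and 170 tokens/sec is the RTX 4090 figure for Mistral 7B from the benchmark section.

```python
def usd_per_million_tokens(price_per_day_usd: float, tokens_per_sec: float) -> float:
    """Cost of generating one million tokens at a sustained rate."""
    tokens_per_day = tokens_per_sec * 86_400  # seconds in a day
    return price_per_day_usd / tokens_per_day * 1_000_000


# RTX 4090 at a hypothetical $1.00/day, 170 tokens/sec for one user
cost = usd_per_million_tokens(1.00, 170)
print(f"${cost:.3f} per million tokens")
```

This is an upper bound per stream: with continuous batching serving several users at once, aggregate throughput is higher and the cost per token falls accordingly, while idle time pushes it back up.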

Troubleshooting

HTTP 502 for a long time

Check RAM: The server must have 16GB+ of RAM
Check VRAM: The model must fit
Model download: The first run downloads from HuggingFace (5-15 min)
HF token: Gated models require authentication

Out of memory

# Reduce memory usage
--gpu-memory-utilization 0.8
--max-model-len 4096
--max-num-seqs 64

# Or use quantization (the checkpoint itself must be AWQ-quantized)
--quantization awq

Model download fails

# Check the HF token
echo $HUGGING_FACE_HUB_TOKEN

# Pre-download the model
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2

vLLM vs Others

Feature

vLLM

llama.cpp

Ollama

Performance

Best

Good

VRAM usage

High

Low

Medium

Ease of use

Medium

Easy

Startup time

5-15 min

1-2 min

30 sec

Multi-GPU

Native

Limited

Tool calling

Yes (v0.7+)

Limited

Multi-LoRA

Yes (v0.7+)

Use vLLM when:

High throughput is the priority
You serve multiple users
You have enough VRAM and RAM
You are deploying to production
You need tool calling / structured outputs

Use Ollama when:

You need a quick setup
There is a single user
Fewer resources are available

Next steps

Ollama - Simpler alternative with faster startup
DeepSeek-R1 - Reasoning models guide
DeepSeek-V3 - Best general-purpose model
Qwen2.5 - Multilingual models
Llama.cpp - Lower-VRAM option
