TensorRT-LLM

Maximum LLM inference throughput with NVIDIA TensorRT optimization, deployed through Triton Inference Server

TensorRT-LLM is NVIDIA's open-source library for optimizing large language model inference on NVIDIA GPUs. It delivers state-of-the-art performance through kernel fusion, quantization (INT4, INT8, FP8), in-flight batching, and a paged KV cache. Combined with Triton Inference Server, it gives you production-grade serving infrastructure.

GitHub: NVIDIA/TensorRT-LLM (10K+ ⭐)

Why TensorRT-LLM?

| Feature | vLLM | TensorRT-LLM |
| --- | --- | --- |
| Throughput | Excellent | Best in class |
| Latency | Good | Excellent |
| INT4/INT8 quantization | Partial | Native support |
| FP8 support | Limited | Full |
| Multi-GPU tensor parallelism | Yes | Yes |
| Setup complexity | Low | Medium-High |

TensorRT-LLM typically delivers 2–4x higher throughput than standard HuggingFace transformers inference, and 30–50% higher throughput than vLLM in batch-serving scenarios.

Prerequisites

A Clore.ai account with a GPU rental
An NVIDIA GPU with Ampere or newer architecture (RTX 3090, A100, RTX 4090, H100)
Basic Linux and Docker knowledge
Enough VRAM for your chosen model

VRAM requirements per model

| Model | FP16 | INT8 | INT4 |
| --- | --- | --- | --- |
| Llama-3.1 8B | 16GB | 8GB | 4GB |
| Llama-3.1 70B | 140GB | 70GB | 35GB |
| Mistral 7B | 14GB | 7GB | 4GB |
| Mixtral 8x7B | 90GB | 45GB | 24GB |
| Qwen2.5 72B | 144GB | 72GB | 36GB |
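
The figures in this table track a simple rule of thumb: weight memory is roughly the parameter count times the bytes per weight. A minimal sketch of that arithmetic follows. The function name `weight_memory_gb` is hypothetical, and the estimate covers weights only (KV cache, activations, and runtime overhead are not included).

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


for name, size_b in [("Llama-3.1 8B", 8), ("Llama-3.1 70B", 70), ("Qwen2.5 72B", 72)]:
    estimates = {f"{bits}-bit": weight_memory_gb(size_b, bits) for bits in (16, 8, 4)}
    print(name, estimates)
```

Mixture-of-experts models such as Mixtral 8x7B do not follow the name literally, because the experts share some layers; use the real parameter count rather than 8 × 7B.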

Step 1: Choose your GPU on Clore.ai

Log in to clore.ai → Marketplace
For single-GPU serving (7B–13B models): RTX 4090 24GB or RTX 3090 24GB
For large models (70B+): multiple A100 80GB or H100 GPUs

Multi-GPU strategy:

2x A100 80GB → Llama 3.1 70B or Qwen2.5 72B in FP16 (tight: 140–144GB of weights in 160GB leaves little room for the KV cache)
4x A100 80GB → Llama 3.1 405B in INT4 (about 203GB of weights; INT8 needs about 405GB, more than the 320GB available)
Pick servers that list multiple GPUs in the Clore.ai marketplace
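
The multi-GPU choices above can be sanity-checked by comparing weight size with total VRAM. The sketch below is illustrative only: `gpus_needed` and its 10% headroom default are assumptions made for this example, not values taken from TensorRT-LLM, and real deployments need further room for the KV cache.

```python
import math


def gpus_needed(weights_gb: float, vram_per_gpu_gb: float, headroom: float = 0.10) -> int:
    """Smallest GPU count whose usable VRAM (after headroom) holds the weights."""
    usable_per_gpu = vram_per_gpu_gb * (1.0 - headroom)
    return math.ceil(weights_gb / usable_per_gpu)


print("Llama 3.1 70B FP16 on A100 80GB:", gpus_needed(140, 80))
print("Llama 3.1 405B INT8 on A100 80GB:", gpus_needed(405, 80))
print("Llama 3.1 405B INT4 on A100 80GB:", gpus_needed(202.5, 80))
```

Tensor parallelism generally wants a GPU count that evenly divides the model's attention heads, so an odd result is usually rounded up to the next power of two in practice.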

Step 2: Deploy Triton Inference Server with the TRT-LLM backend

Docker image:

nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3

Use the -trtllm-python-py3 variant, which ships with the TensorRT-LLM backend preinstalled. The tag corresponds to the NVIDIA container release (24.01 = January 2024). Check NGC for the latest tag.

Exposed ports:

22
8000

Add 8001 (gRPC), 8002 (metrics), or 8080 (the wrapper from Step 9) if you need to reach them from outside the server.

Environment variables:

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
TRANSFORMERS_CACHE=/workspace/hf_cache
HF_HOME=/workspace/hf_cache

Volume/Disk: at least 100GB recommended

Step 3: Connect and verify the installation

ssh root@<server-ip> -p <ssh-port>

# Check the GPU
nvidia-smi

# Check the TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Check that Triton is available
tritonserver --version
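
To check the Ampere-or-newer prerequisite programmatically, one option is to parse `nvidia-smi` output. This is a sketch: it assumes a driver recent enough to support the `compute_cap` query field, and the helper names are made up for this example. Ampere corresponds to compute capability 8.0 and above.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total,compute_cap",
         "--format=csv,noheader,nounits"]


def parse_gpus(csv_text: str) -> list[dict]:
    """Parse 'name, memory_mib, compute_cap' CSV lines into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem_mib, cap = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory_mib": int(mem_mib),
                     "compute_cap": float(cap)})
    return gpus


def is_ampere_or_newer(gpu: dict) -> bool:
    """Ampere starts at compute capability 8.0."""
    return gpu["compute_cap"] >= 8.0


try:
    raw = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
except (FileNotFoundError, subprocess.CalledProcessError):
    raw = ""  # nvidia-smi missing or failed: nothing to report

for gpu in parse_gpus(raw):
    status = "OK" if is_ampere_or_newer(gpu) else "too old for this guide"
    print(f'{gpu["name"]}: {gpu["memory_mib"]} MiB, cc {gpu["compute_cap"]} ({status})')
```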

Step 4: Download and prepare the model

We'll use Llama 3.1 8B as the example. Adjust the paths for the model you choose.

Install the HuggingFace CLI

pip install huggingface_hub
huggingface-cli login
# Enter your HuggingFace token when prompted (Llama 3.1 is gated, so the account needs approved access)

Download the model weights

mkdir -p /workspace/models/llama-3.1-8b
huggingface-cli download \
    meta-llama/Llama-3.1-8B-Instruct \
    --local-dir /workspace/models/llama-3.1-8b \
    --local-dir-use-symlinks False

# Or use snapshot_download
python3 << 'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/workspace/models/llama-3.1-8b",
    local_dir_use_symlinks=False
)
EOF
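
Before spending time on an engine build, it can help to confirm the download is complete. The check below is a hypothetical helper, not part of `huggingface_hub`; the file names it looks for are the usual ones in a HuggingFace model snapshot and may differ for other models.

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks complete."""
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    if not any(root.glob("*.safetensors")) and not any(root.glob("*.bin")):
        problems.append("no weight files (*.safetensors or *.bin)")
    if not any((root / name).is_file() for name in ("tokenizer.json", "tokenizer.model")):
        problems.append("no tokenizer file")
    return problems


issues = check_model_dir("/workspace/models/llama-3.1-8b")
print(issues if issues else "model directory looks complete")
```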

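Step 5: Build the TensorRT engine

This is the key step: compiling the model into an optimized TensorRT engine.

FP16 engine (best quality)

cd /workspace

# Convert the HuggingFace weights to the TRT-LLM checkpoint format.
# If this script path does not exist in your container, clone NVIDIA/TensorRT-LLM at the
# version matching tensorrt_llm.__version__ and run examples/llama/convert_checkpoint.py from there.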
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --dtype float16 \
    --tp_size 1

# Build the TensorRT engine
trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
    --gemm_plugin float16 \
    --max_batch_size 32 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --max_num_tokens 16384 \
    --use_paged_context_fmha enable

INT8 SmoothQuant engine (higher throughput)

# Convert with SmoothQuant quantization
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_channel \
    --per_token

trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-int8 \
    --gemm_plugin float16 \
    --smoothquant_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192

INT4 AWQ engine (maximum throughput / minimum memory)

# Install AutoAWQ for quantization
pip install autoawq

# Quantize to INT4 AWQ
python3 << 'EOF'
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/workspace/models/llama-3.1-8b"
quant_path = "/workspace/models/llama-3.1-8b-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
EOF

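# Convert the AWQ checkpoint to TRT-LLM format
# (flag support varies by TensorRT-LLM version; check convert_checkpoint.py --help)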
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b-awq-int4 \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
    --dtype float16 \
    --quant_ckpt_path /workspace/models/llama-3.1-8b-awq-int4 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group

trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-int4 \
    --gemm_plugin float16 \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 8192

Engine build time: 10–30 minutes depending on the GPU and model size. This is a one-time operation: once built, the engine loads in seconds.

Step 6: Quick test with the TRT-LLM Python API

Before configuring Triton, verify that the engine works:

python3 << 'EOF'
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

engine_dir = "/workspace/trt_engines/llama-3.1-8b-fp16"
tokenizer_dir = "/workspace/models/llama-3.1-8b"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
runner = ModelRunner.from_dir(
    engine_dir=engine_dir,
    rank=0
)

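prompt = "What is the capital of France?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# ModelRunner.generate expects a list of 1-D int32 tensors, plus explicit end/pad token IDs
output = runner.generate(
    batch_input_ids=[input_ids[0].int()],
    max_new_tokens=200,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # Llama defines no pad token, so reuse EOS
    temperature=0.7,
    top_p=0.9
)

# Output shape is [batch, beams, tokens]; drop the prompt tokens
output_ids = output[0][0][len(input_ids[0]):]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Response: {response}")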
EOF

Step 7: Configure Triton Inference Server

Create the model repository structure

mkdir -p /workspace/triton_model_repo/llama/1

# Create the model configuration
cat > /workspace/triton_model_repo/llama/config.pbtxt << 'EOF'
backend: "tensorrtllm"
name: "llama"
max_batch_size: 64
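model_transaction_policy {
  # Keep false for plain request/response over HTTP (as used in Steps 8 and 9);
  # set true only for token streaming
  decoupled: false
}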

dynamic_batching {
  preferred_batch_size: [1, 2, 4, 8, 16, 32, 64]
  max_queue_delay_microseconds: 1000
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [-1]
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [1]
    reshape: { shape: [] }
    optional: true
  }
]

output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [-1, -1]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [1]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]

parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}

parameters: {
  key: "gpt_model_path"
  value: { string_value: "/workspace/trt_engines/llama-3.1-8b-fp16" }
}

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "8192" }
}

parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "guaranteed_no_evict" }
}
EOF

Create the engine symlink

ln -s /workspace/trt_engines/llama-3.1-8b-fp16 \
    /workspace/triton_model_repo/llama/1/

Start Triton Server

# Send server output to a log file so it can be inspected later
tritonserver \
    --model-repository=/workspace/triton_model_repo \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002 \
    --log-verbose=0 > /workspace/triton.log 2>&1 &

# Wait for the server to start
sleep 30

# Check server health (the body is empty; HTTP 200 means ready)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
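
A fixed sleep can be too short for large engines and wastefully long for small ones. As an alternative, the readiness endpoint can be polled until it answers. This is a minimal sketch using only the standard library; the function names and default timeouts are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

READY_URL = "http://localhost:8000/v2/health/ready"


def http_ready(url: str = READY_URL) -> bool:
    """True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or non-2xx status


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready(http_ready)` returns `True` as soon as Triton reports ready, or `False` after five minutes. The probe is passed in as a function so the same loop can be reused for other endpoints.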

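Step 8: Query the API

Client for Triton's generate endpoint

import requests

# Note: "text_input" and "text_output" are the tensor names exposed by the tensorrtllm_backend
# ensemble (tokenizer + engine + detokenizer). The bare "llama" model from Step 7 only accepts
# token IDs, so point this URL at your ensemble model, or use the wrapper in Step 9.
def generate(prompt: str, max_tokens: int = 200) -> str:
    url = "http://localhost:8000/v2/models/llama/generate"

    # Fields other than "parameters" are mapped to model inputs by name
    payload = {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()  # surface HTTP errors instead of returning an empty string
    result = response.json()
    return result.get("text_output", "")

# Test
print(generate("Explain quantum computing in simple terms:"))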

Measure throughput

# Install tritonclient
pip install tritonclient[all]

# Run the throughput benchmark
perf_analyzer \
    -m llama \
    -u localhost:8001 \
    --protocol grpc \
    --input-data /workspace/sample_inputs.json \
    --concurrency-range 1:32:2 \
    --measurement-interval 10000 \
    --shape input_ids:512 \
    --shape input_lengths:1 \
    --shape request_output_len:1
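
The `perf_analyzer` command above reads `/workspace/sample_inputs.json`, which this guide has not created. The sketch below builds one in perf_analyzer's JSON input format: a top-level `data` list in which each entry maps an input name to its `content` and `shape`. The token IDs are random placeholders, and the vocabulary bound and output length are assumptions chosen for illustration; prompts tokenized with the model's own tokenizer give more representative numbers.

```python
import json
import random


def make_perf_inputs(num_requests: int = 8, prompt_len: int = 512,
                     output_len: int = 128, vocab_size: int = 32000,
                     seed: int = 0) -> dict:
    """Build perf_analyzer input data with random token IDs of a fixed length."""
    rng = random.Random(seed)
    entries = []
    for _ in range(num_requests):
        token_ids = [rng.randrange(1, vocab_size) for _ in range(prompt_len)]
        entries.append({
            "input_ids": {"content": token_ids, "shape": [prompt_len]},
            "input_lengths": {"content": [prompt_len], "shape": [1]},
            "request_output_len": {"content": [output_len], "shape": [1]},
        })
    return {"data": entries}


def write_perf_inputs(path: str, **kwargs) -> None:
    """Serialize the generated inputs to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(make_perf_inputs(**kwargs), fh)
```

Run `write_perf_inputs("/workspace/sample_inputs.json")` once before starting the benchmark. The default `prompt_len` of 512 matches the `--shape input_ids:512` argument used above.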

Step 9: Add an OpenAI-compatible API wrapper

For easier integration, add a FastAPI wrapper:

pip install fastapi uvicorn tritonclient[all]

cat > /workspace/openai_server.py << 'EOF'
from fastapi import FastAPI
from pydantic import BaseModel
import tritonclient.http as httpclient
import numpy as np
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("/workspace/models/llama-3.1-8b")
client = httpclient.InferenceServerClient("localhost:8000")

class ChatRequest(BaseModel):
    model: str = "llama"
    messages: list
    max_tokens: int = 512
    temperature: float = 0.7

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    prompt = tokenizer.apply_chat_template(
        req.messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # The chat template already inserts BOS, so do not add special tokens again
    input_ids = tokenizer.encode(prompt, add_special_tokens=False)
    n = len(input_ids)

    # max_batch_size > 0 in config.pbtxt, so every tensor needs a leading batch dimension
    inputs = [
        httpclient.InferInput("input_ids", [1, n], "INT32"),
        httpclient.InferInput("input_lengths", [1, 1], "INT32"),
        httpclient.InferInput("request_output_len", [1, 1], "INT32"),
        httpclient.InferInput("temperature", [1, 1], "FP32"),
    ]
    inputs[0].set_data_from_numpy(np.array([input_ids], dtype=np.int32))
    inputs[1].set_data_from_numpy(np.array([[n]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[req.max_tokens]], dtype=np.int32))
    inputs[3].set_data_from_numpy(np.array([[req.temperature]], dtype=np.float32))

    result = client.infer("llama", inputs)
    # output_ids has shape [batch, beams, tokens] and includes the prompt tokens
    output_ids = result.as_numpy("output_ids")[0][0][n:]
    text = tokenizer.decode(output_ids, skip_special_tokens=True)

    return {
        "choices": [{"message": {"role": "assistant", "content": text}}]
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

python3 /workspace/openai_server.py &
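
Once the wrapper is running, it can be called like any chat-completions endpoint. The client below is a sketch using only the standard library; the helper names are hypothetical, and the base URL assumes the wrapper's default port 8080 on the same machine.

```python
import json
import urllib.request


def build_chat_request(user_message: str, max_tokens: int = 256,
                       temperature: float = 0.7) -> dict:
    """Build a request body matching the wrapper's ChatRequest model."""
    return {
        "model": "llama",
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat-completions style response."""
    return response_body["choices"][0]["message"]["content"]


def chat(user_message: str, base_url: str = "http://localhost:8080") -> str:
    """POST one user message to the wrapper and return the assistant's reply."""
    body = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

The wrapper returns only the `choices` field, so clients that validate the full OpenAI response schema may need it extended with fields such as `id`, `object`, and `usage`.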

Troubleshooting

OOM during engine build

# Reduce max_batch_size (from 32), max_input_len (from 4096), and max_seq_len (from 8192).
# Keep comments off the continuation lines: text after a trailing backslash breaks the command.
trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096

Triton Server does not start

# Check the logs (if you redirected Triton's output to this file when starting it)
cat /workspace/triton.log

# Verify that the engine files exist
ls -la /workspace/trt_engines/llama-3.1-8b-fp16/

# Check GPU memory
nvidia-smi

Low throughput

# Enable in-flight batching and increase client concurrency
# Tune max_tokens_in_paged_kv_cache to the available VRAM
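
As a rough guide for tuning `max_tokens_in_paged_kv_cache`, the KV cache cost per token is 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The sketch below applies that formula to Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128, FP16 cache). The function names and the 2GB reserve are assumptions for illustration; the memory actually available depends on the engine and runtime.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (keys + values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_kv_tokens(vram_gb: float, weights_gb: float, per_token_bytes: int,
                  reserve_gb: float = 2.0) -> int:
    """Tokens that fit in the VRAM left after weights and a fixed reserve."""
    free_bytes = (vram_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print("bytes per token:", per_token)
print("tokens on a 24GB GPU with 16GB of weights:", max_kv_tokens(24, 16, per_token))
```

For a 24GB card holding 16GB of FP16 weights, this estimate comes to roughly 45,000 tokens, well above the 8192 configured in Step 7, which suggests that value can be raised when throughput is limited by cache capacity.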

Performance benchmarks on Clore.ai GPUs

| Model | GPU | Quantization | Throughput (tokens/sec) |
| --- | --- | --- | --- |
| Llama 3.1 8B | RTX 4090 | FP16 | ~3,500 |
| Llama 3.1 8B | RTX 4090 | INT4 AWQ | ~6,200 |
| Llama 3.1 70B | 2x A100 80G | FP16 | ~1,800 |
| Mixtral 8x7B | 2x RTX 4090 | INT8 | ~2,400 |

Additional resources

TensorRT-LLM on Clore.ai is the best fit for production LLM serving where throughput and latency are critical. For simpler setups, consider the vLLM guide.

GPU recommendations on Clore.ai

| Use case | Recommended GPU | Estimated cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production inference | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large models (70B+) | A100 80GB | ~$1.20/gpu/hr |
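
Combining these hourly prices with the throughput figures from the benchmark section gives an approximate cost per million generated tokens. The helper below is a hypothetical sketch of that arithmetic; it assumes the GPU is kept fully busy at the benchmarked throughput, which real traffic rarely achieves.

```python
def cost_per_million_tokens(tokens_per_sec: float, price_per_gpu_hour: float,
                            num_gpus: int = 1) -> float:
    """USD per one million tokens at sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    hourly_cost = price_per_gpu_hour * num_gpus
    return hourly_cost / tokens_per_hour * 1_000_000


scenarios = [
    ("Llama 3.1 8B FP16, 1x RTX 4090", 3500, 0.70, 1),
    ("Llama 3.1 8B INT4 AWQ, 1x RTX 4090", 6200, 0.70, 1),
    ("Llama 3.1 70B FP16, 2x A100 80GB", 1800, 1.20, 2),
]
for label, tps, price, gpus in scenarios:
    print(f"{label}: ${cost_per_million_tokens(tps, price, gpus):.3f} per 1M tokens")
```

At the quoted figures, the FP16 8B setup works out to about $0.056 per million tokens and the two-GPU 70B setup to about $0.37.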

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse the available GPUs and rent by the hour, with no commitments and full root access.
