# TensorRT-LLM

> **Maximum LLM inference performance with NVIDIA TensorRT optimization, deployed through Triton Inference Server**

TensorRT-LLM is NVIDIA's open-source library for optimizing large language model inference on NVIDIA GPUs. It delivers state-of-the-art performance through kernel fusion, quantization (INT4, INT8, FP8), in-flight batching, and a paged KV cache. Combined with Triton Inference Server, it gives you production-grade serving infrastructure.

**GitHub:** [NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) (10K+ ⭐)

***

## Why TensorRT-LLM?

| Feature                      | vLLM      | TensorRT-LLM   |
| ---------------------------- | --------- | -------------- |
| Throughput                   | Excellent | Best-in-class  |
| Latency                      | Good      | Excellent      |
| INT4/INT8 quantization       | Partial   | Native support |
| FP8 support                  | Limited   | Full           |
| Multi-GPU tensor parallelism | Yes       | Yes            |
| Setup complexity             | Low       | Medium-High    |

{% hint style="success" %}
**TensorRT-LLM typically delivers 2–4x higher throughput** than standard HuggingFace Transformers inference, and 30–50% higher throughput than vLLM in batched serving scenarios.
{% endhint %}

***

## Prerequisites

* A Clore.ai account with a GPU rental
* **NVIDIA GPU with Ampere architecture or newer** (RTX 3090, A100, RTX 4090, H100)
* Basic Linux and Docker knowledge
* Enough VRAM for your chosen model

***

## VRAM requirements by model

| Model         | FP16  | INT8 | INT4 |
| ------------- | ----- | ---- | ---- |
| Llama-3.1 8B  | 16GB  | 8GB  | 4GB  |
| Llama-3.1 70B | 140GB | 70GB | 35GB |
| Mistral 7B    | 14GB  | 7GB  | 4GB  |
| Mixtral 8x7B  | 90GB  | 45GB | 24GB |
| Qwen2.5 72B   | 144GB | 72GB | 36GB |

These figures approximate the model weights alone; the KV cache and activations need additional memory.

***

## Step 1: Choose your GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai) → **Marketplace**
2. **For single-GPU serving (7B–13B models):** RTX 4090 24GB or RTX 3090 24GB
3. **For large models (70B+):** multiple A100 80GB or H100

{% hint style="info" %}
**Multi-GPU strategy:**

* 2x A100 80GB → Llama 3.1 70B in FP16 or Qwen2.5 72B
* 4x A100 80GB (320GB total) → Llama 3.1 405B in INT4 (about 203GB of weights; INT8 would need about 405GB, which does not fit)
* Select servers listed with multiple GPUs in the Clore.ai marketplace
{% endhint %}

***

## Step 2: Deploy Triton Inference Server with the TRT-LLM backend

**Docker image:**

```
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
```

{% hint style="warning" %}
Use the `-trtllm-python-py3` variant, which ships with the TensorRT-LLM backend preinstalled. The tag corresponds to the NVIDIA container release (24.01 = January 2024). Check [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) for the latest tag.
{% endhint %}

**Exposed ports:**

```
22
8000
```

**Environment variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
TRANSFORMERS_CACHE=/workspace/hf_cache
HF_HOME=/workspace/hf_cache
```

**Volume/Disk:** at least 100GB recommended

***

## Step 3: Connect and verify the installation

```bash
# Connect with the host and SSH port shown for your Clore.ai rental
ssh root@ -p

# Check the GPU
nvidia-smi

# Check the TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Check that Triton is available
tritonserver --version
```

***

## Step 4: Download and prepare the model

We'll use Llama 3.1 8B as the example. Adjust the paths for the model you choose.

### Install the HuggingFace CLI

```bash
pip install huggingface_hub
huggingface-cli login
# Enter your HuggingFace token when prompted
```

### Download the model weights

```bash
mkdir -p /workspace/models/llama-3.1-8b

huggingface-cli download \
  meta-llama/Llama-3.1-8B-Instruct \
  --local-dir /workspace/models/llama-3.1-8b \
  --local-dir-use-symlinks False

# Or use snapshot_download
python3 << 'EOF'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/workspace/models/llama-3.1-8b",
    local_dir_use_symlinks=False
)
EOF
```

***

## Step 5: Build the TensorRT engine

This is the key step: compiling the model into an optimized TensorRT engine.
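
Before picking an engine variant, it can help to check which precisions are likely to fit on the rented GPU. The figures in the VRAM table follow roughly from parameter count times bytes per weight, and the sketch below reproduces that arithmetic. The helper names and the 10% headroom are assumptions made for illustration, not part of TensorRT-LLM, and the estimate covers weights only (no KV cache or activations).

```python
# Hypothetical helpers, not a TensorRT-LLM API: weights-only memory estimate.

# Bytes stored per weight for each precision used in this guide.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_WEIGHT[precision]


def precisions_that_fit(params_billion: float, vram_gb: float,
                        headroom: float = 0.9) -> list[str]:
    """Precisions whose weights fit within `headroom` of total VRAM.

    The 0.9 default is an assumed safety margin; tune it for your workload.
    """
    budget_gb = vram_gb * headroom
    return [
        precision
        for precision in BYTES_PER_WEIGHT
        if weight_memory_gb(params_billion, precision) <= budget_gb
    ]


if __name__ == "__main__":
    # Llama 3.1 8B on a single 24GB card, then 70B on 2x 80GB.
    print(precisions_that_fit(8, 24))
    print(precisions_that_fit(70, 160))
```

If a precision is missing from the returned list, either choose a lower precision from the engine variants that follow or rent more GPUs and raise `--tp_size` accordingly.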
### FP16 engine (best quality)

```bash
cd /workspace

# Convert the HuggingFace weights to the TRT-LLM checkpoint format
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir /workspace/models/llama-3.1-8b \
  --output_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
  --dtype float16 \
  --tp_size 1

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
  --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
  --gemm_plugin float16 \
  --max_batch_size 32 \
  --max_input_len 4096 \
  --max_seq_len 8192 \
  --max_num_tokens 16384 \
  --use_paged_context_fmha enable
```

### INT8 SmoothQuant engine (higher throughput)

```bash
# Convert with SmoothQuant quantization
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir /workspace/models/llama-3.1-8b \
  --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
  --dtype float16 \
  --smoothquant 0.5 \
  --per_channel \
  --per_token

trtllm-build \
  --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
  --output_dir /workspace/trt_engines/llama-3.1-8b-int8 \
  --gemm_plugin float16 \
  --smoothquant_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 4096 \
  --max_seq_len 8192
```

### INT4 AWQ engine (maximum throughput / minimum memory)

```bash
# Install AutoAWQ for quantization
pip install autoawq

# Quantize to INT4 AWQ
python3 << 'EOF'
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/workspace/models/llama-3.1-8b"
quant_path = "/workspace/models/llama-3.1-8b-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit weights, group size 128, zero-point quantization
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
EOF

# Convert the AWQ checkpoint to the TRT-LLM format
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir /workspace/models/llama-3.1-8b-awq-int4 \
  --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
  --dtype float16 \
  --quant_ckpt_path /workspace/models/llama-3.1-8b-awq-int4 \
  --use_weight_only \
  --weight_only_precision int4_awq \
  --per_group

trtllm-build \
  --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
  --output_dir /workspace/trt_engines/llama-3.1-8b-int4 \
  --gemm_plugin float16 \
  --max_batch_size 128 \
  --max_input_len 4096 \
  --max_seq_len 8192
```

{% hint style="info" %}
**Engine build time:** 10–30 minutes depending on the GPU and the model size. This is a one-time operation: once built, the engine loads in seconds.
{% endhint %}

***

## Step 6: Quick test with the TRT-LLM Python API

Before configuring Triton, verify that the engine works:

```bash
python3 << 'EOF'
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

engine_dir = "/workspace/trt_engines/llama-3.1-8b-fp16"
tokenizer_dir = "/workspace/models/llama-3.1-8b"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)

runner = ModelRunner.from_dir(
    engine_dir=engine_dir,
    rank=0
)

prompt = "What is the capital of France?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Fall back to the EOS token when the tokenizer defines no pad token
pad_id = tokenizer.pad_token_id
if pad_id is None:
    pad_id = tokenizer.eos_token_id

# generate() expects a list of 1-D int32 tensors, one per sequence
output = runner.generate(
    batch_input_ids=[input_ids[0].int()],
    max_new_tokens=200,
    end_id=tokenizer.eos_token_id,
    pad_id=pad_id,
    temperature=0.7,
    top_p=0.9
)

# output has shape [batch, beams, sequence]; drop the prompt tokens
output_ids = output[0][0][len(input_ids[0]):]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Response: {response}")
EOF
```

***

## Step 7: Configure Triton Inference Server

### Create the model repository structure

```bash
mkdir -p /workspace/triton_model_repo/llama/1

# Create the model configuration
cat > /workspace/triton_model_repo/llama/config.pbtxt << 'EOF'
backend: "tensorrtllm"
name: "llama"
max_batch_size: 64

model_transaction_policy {
  decoupled: true
}

dynamic_batching {
  preferred_batch_size: [1, 2, 4, 8, 16, 32, 64]
  max_queue_delay_microseconds: 1000
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [-1]
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [1]
    reshape: { shape: [] }
    optional: true
  }
]

output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [-1, -1]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [1]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]

parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/workspace/trt_engines/llama-3.1-8b-fp16" }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "8192" }
}
parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "guaranteed_no_evict" }
}
EOF
```

### Symlink the engine

```bash
ln -s /workspace/trt_engines/llama-3.1-8b-fp16 \
  /workspace/triton_model_repo/llama/1/
```

### Start Triton Server

```bash
# Run in the background and keep the log for troubleshooting
tritonserver \
  --model-repository=/workspace/triton_model_repo \
  --http-port=8000 \
  --grpc-port=8001 \
  --metrics-port=8002 \
  --log-verbose=0 > /workspace/triton.log 2>&1 &

# Wait for the server to start
sleep 30

# Check server health (the endpoint returns an empty body, so print the status code; 200 means ready)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```

***

## Step 8: Query the API

### HTTP client (Triton generate endpoint)

```python
import requests


def generate(prompt: str, max_tokens: int = 200) -> str:
    """Send a prompt to Triton's generate endpoint and return the text output."""
    url = "http://localhost:8000/v2/models/llama/generate"
    payload = {
        "text_input": prompt,
        "parameters": {
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "top_p": 0.9
        }
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()  # surface HTTP errors instead of returning ""
    result = response.json()
    return result.get("text_output", "")


# Test
print(generate("Explain quantum computing in simple terms:"))
```

### Measure throughput

```bash
# Install tritonclient
pip install "tritonclient[all]"

# Run the throughput benchmark
perf_analyzer \
  -m llama \
  -u localhost:8001 \
  --protocol grpc \
  --input-data /workspace/sample_inputs.json \
  --concurrency-range 1:32:2 \
  --measurement-interval 10000 \
  --shape input_ids:512 \
  --shape input_lengths:1 \
  --shape request_output_len:1
```

***

## Step 9: Add an OpenAI-compatible API wrapper

For easier integration, add a FastAPI wrapper:

```bash
pip install fastapi uvicorn "tritonclient[all]"

cat > /workspace/openai_server.py << 'EOF'
from fastapi import FastAPI
from pydantic import BaseModel
import tritonclient.http as httpclient
import numpy as np
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("/workspace/models/llama-3.1-8b")
client = httpclient.InferenceServerClient("localhost:8000")


class ChatRequest(BaseModel):
    model: str = "llama"
    messages: list
    max_tokens: int = 512
    temperature: float = 0.7


@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    prompt = tokenizer.apply_chat_template(
        req.messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # The chat template already adds the special tokens, so do not add them again
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)
    prompt_len = len(token_ids)

    # max_batch_size > 0 in config.pbtxt, so every input needs a leading batch dimension
    inputs = [
        httpclient.InferInput("input_ids", [1, prompt_len], "INT32"),
        httpclient.InferInput("input_lengths", [1, 1], "INT32"),
        httpclient.InferInput("request_output_len", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.array([token_ids], dtype=np.int32))
    inputs[1].set_data_from_numpy(np.array([[prompt_len]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[req.max_tokens]], dtype=np.int32))

    result = client.infer("llama", inputs)
    # output_ids has shape [batch, beams, sequence]; drop the prompt tokens
    output_ids = result.as_numpy("output_ids")[0][0][prompt_len:]
    text = tokenizer.decode(output_ids, skip_special_tokens=True)

    return {
        "choices": [{"message": {"role": "assistant", "content": text}}]
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

python3 /workspace/openai_server.py &
```

***

## Troubleshooting

### OOM during the engine build

```bash
# Reduce the build limits:
#   max_batch_size: 32 -> 8
#   max_input_len:  4096 -> 2048
#   max_seq_len:    8192 -> 4096
trtllm-build \
  --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
  --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096
```

### Triton Server does not start

```bash
# Check the logs
cat /workspace/triton.log

# Check that the engine files exist
ls -la /workspace/trt_engines/llama-3.1-8b-fp16/

# Check GPU memory
nvidia-smi
```

### Low throughput

```bash
# Enable in-flight batching and increase concurrency
# Tune max_tokens_in_paged_kv_cache to the available VRAM
```

***

## Performance benchmarks on Clore.ai GPUs

| Model         | GPU         | Quantization | Throughput (tokens/s) |
| ------------- | ----------- | ------------ | --------------------- |
| Llama 3.1 8B  | RTX 4090    | FP16         | \~3,500               |
| Llama 3.1 8B  | RTX 4090    | INT4 AWQ     | \~6,200               |
| Llama 3.1 70B | 2x A100 80G | FP16         | \~1,800               |
| Mixtral 8x7B  | 2x RTX 4090 | INT8         | \~2,400               |

***

## Additional resources

* [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)
* [Triton Inference Server](https://github.com/triton-inference-server/server)
* [NGC Container Registry](https://catalog.ngc.nvidia.com/)
* [TRT-LLM documentation](https://nvidia.github.io/TensorRT-LLM/)
* [AWQ quantization](https://github.com/mit-han-lab/llm-awq)

***

*TensorRT-LLM on Clore.ai is the optimal choice for production LLM serving where throughput and latency are critical. For simpler setups, consider the vLLM guide.*

***

## GPU recommendations on Clore.ai

| Use case               | Recommended GPU | Estimated cost on Clore.ai |
| ---------------------- | --------------- | -------------------------- |
| Development/Testing    | RTX 3090 (24GB) | \~$0.12/gpu/hr             |
| Production inference   | RTX 4090 (24GB) | \~$0.70/gpu/hr             |
| Large models (70B+)    | A100 80GB       | \~$1.20/gpu/hr             |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse the available GPUs and rent by the hour: no commitments, full root access.
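
The hourly prices in the GPU recommendations table and the throughput figures in the benchmark table can be combined into a rough cost per million generated tokens. The sketch below shows that arithmetic; the function name is hypothetical (it is not a Clore.ai or TensorRT-LLM API), the inputs are the approximate values quoted in this guide, and it assumes the GPU runs at the quoted throughput for the whole rental.

```python
# Hypothetical budgeting helper; inputs are the approximate figures from this guide.

def cost_per_million_tokens(price_per_gpu_hour: float,
                            tokens_per_second: float,
                            num_gpus: int = 1) -> float:
    """USD per one million generated tokens at sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    hourly_cost = price_per_gpu_hour * num_gpus
    return hourly_cost / tokens_per_hour * 1_000_000


# (label, USD per GPU-hour, tokens/s, number of GPUs)
SCENARIOS = [
    ("Llama 3.1 8B, RTX 4090, FP16", 0.70, 3500, 1),
    ("Llama 3.1 8B, RTX 4090, INT4 AWQ", 0.70, 6200, 1),
    ("Llama 3.1 70B, 2x A100 80G, FP16", 1.20, 1800, 2),
]

if __name__ == "__main__":
    for label, price, tps, gpus in SCENARIOS:
        cost = cost_per_million_tokens(price, tps, gpus)
        print(f"{label}: ${cost:.3f} per 1M tokens")
```

Real costs are higher whenever the server sits idle or runs below peak batch size, so treat the result as a lower bound when comparing configurations.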