# TensorRT-LLM

> **Maximum LLM inference throughput with NVIDIA TensorRT optimization — deployed via Triton Inference Server**

TensorRT-LLM is NVIDIA's open-source library for optimizing Large Language Model inference on NVIDIA GPUs. It delivers state-of-the-art performance through kernel fusion, quantization (INT4, INT8, FP8), in-flight batching, and paged KV-caching. Combined with Triton Inference Server, you get a production-grade serving infrastructure.

**GitHub:** [NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) — 10K+ ⭐

***

## Why TensorRT-LLM?

| Feature                   | vLLM      | TensorRT-LLM  |
| ------------------------- | --------- | ------------- |
| Throughput                | Excellent | Best-in-class |
| Latency                   | Good      | Excellent     |
| INT4/INT8 quantization    | Partial   | Native        |
| FP8 support               | Limited   | Full          |
| Multi-GPU tensor parallel | Yes       | Yes           |
| Setup complexity          | Low       | Medium-High   |

{% hint style="success" %}
**TensorRT-LLM typically delivers 2–4x higher throughput** compared to standard HuggingFace transformers inference, and 30–50% better throughput than vLLM for batch serving scenarios.
{% endhint %}

***

## Prerequisites

* Clore.ai account with GPU rental
* **NVIDIA GPU with Ampere architecture or newer** (RTX 3090, A100, RTX 4090, H100)
* Basic Linux and Docker knowledge
* Sufficient VRAM for your chosen model

***

## VRAM Requirements by Model

| Model         | FP16  | INT8 | INT4 |
| ------------- | ----- | ---- | ---- |
| Llama-3.1 8B  | 16GB  | 8GB  | 4GB  |
| Llama-3.1 70B | 140GB | 70GB | 35GB |
| Mistral 7B    | 14GB  | 7GB  | 4GB  |
| Mixtral 8x7B  | 90GB  | 45GB | 24GB |
| Qwen2.5 72B   | 144GB | 72GB | 36GB |
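
These figures follow directly from parameter count times bytes per weight; the table lists weight memory alone, so leave extra headroom for activations and the KV cache at runtime. A quick sanity check (illustrative helper, not part of TensorRT-LLM):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight memory alone: params x bits / 8. Budget ~20% extra VRAM
    for activations, CUDA workspace, and the KV cache at runtime."""
    return params_billion * bits_per_weight / 8

# Llama-3.1 8B: 16 GB (FP16), 8 GB (INT8), 4 GB (INT4), matching the table
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_vram_gb(8, bits):.0f} GB")
```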

***

## Step 1 — Choose Your GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai) → **Marketplace**
2. **For single GPU serving (7B–13B models):** RTX 4090 24GB or RTX 3090 24GB
3. **For large models (70B+):** Multiple A100 80GB or H100

{% hint style="info" %}
**Multi-GPU Strategy:**

* 2x A100 80GB → Llama 3.1 70B in FP16 or Qwen2.5 72B
* 4x A100 80GB → Llama 3.1 405B in INT4 (INT8 weights alone exceed 320GB total VRAM)
* Select servers with multiple GPUs listed in the Clore.ai marketplace
{% endhint %}

***

## Step 2 — Deploy Triton Inference Server with TRT-LLM Backend

**Docker Image:**

```
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
```

{% hint style="warning" %}
Use the `-trtllm-python-py3` variant — this includes the TensorRT-LLM backend pre-installed. The tag corresponds to the NVIDIA container release (24.01 = January 2024). Check [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) for the latest tag.
{% endhint %}

**Exposed Ports:**

```
22
8000
```

**Environment Variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
TRANSFORMERS_CACHE=/workspace/hf_cache
HF_HOME=/workspace/hf_cache
```

**Volume/Disk:** Minimum 100GB recommended

***

## Step 3 — Connect and Verify Installation

```bash
ssh root@<server-ip> -p <ssh-port>

# Check GPU
nvidia-smi

# Check TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Check Triton is available
tritonserver --version
```

***

## Step 4 — Download and Prepare Model

We'll use Llama 3.1 8B as the example. Adjust paths for your chosen model.

### Install HuggingFace CLI

```bash
pip install huggingface_hub
huggingface-cli login
# Enter your HuggingFace token when prompted
```

### Download Model Weights

```bash
mkdir -p /workspace/models/llama-3.1-8b
huggingface-cli download \
    meta-llama/Llama-3.1-8B-Instruct \
    --local-dir /workspace/models/llama-3.1-8b \
    --local-dir-use-symlinks False

# Or use snapshot_download
python3 << 'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/workspace/models/llama-3.1-8b",
    local_dir_use_symlinks=False
)
EOF
```
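
Before converting, it is worth a quick sanity check that the download is complete: the config should parse and the shard sizes should add up (paths match the download above):

```python
import json
from pathlib import Path

model_dir = Path("/workspace/models/llama-3.1-8b")

# config.json should parse and describe the expected architecture
cfg = json.loads((model_dir / "config.json").read_text())
print(f"layers={cfg['num_hidden_layers']}, hidden={cfg['hidden_size']}")

# FP16 weights for an 8B model should total roughly 16 GB on disk
total_gb = sum(f.stat().st_size for f in model_dir.glob("*.safetensors")) / 1e9
print(f"weights on disk: {total_gb:.1f} GB")
```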

***

## Step 5 — Build TensorRT Engine

This is the key step — compiling the model into an optimized TensorRT engine.

### FP16 Engine (Best Quality)

```bash
cd /workspace

# Convert HuggingFace weights to TRT-LLM format
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --dtype float16 \
    --tp_size 1

# Build TensorRT engine
trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
    --gemm_plugin float16 \
    --max_batch_size 32 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --max_num_tokens 16384 \
    --use_paged_context_fmha enable
```
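
`--max_num_tokens` caps the tokens in flight across all batched requests, and the paged KV cache for those tokens is the main runtime memory cost beyond the weights. A back-of-the-envelope sketch, assuming Llama 3.1 8B's architecture (32 layers, 8 KV heads with GQA, head dimension 128):

```python
def kv_cache_gb(num_tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Per-token KV bytes = 2 (K and V) x layers x kv_heads x head_dim x dtype."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return num_tokens * per_token_bytes / 1024**3

# 16384 in-flight tokens at FP16 -> ~2 GB of KV cache on top of the weights
print(f"{kv_cache_gb(16_384):.1f} GB")
```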

### INT8 SmoothQuant Engine (Higher Throughput)

```bash
# Convert with SmoothQuant quantization
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_channel \
    --per_token

trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-int8 \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192
```

### INT4 AWQ Engine (Maximum Throughput / Minimum Memory)

```bash
# Install AutoAWQ for quantization
pip install autoawq

# Quantize to INT4 AWQ
python3 << 'EOF'
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/workspace/models/llama-3.1-8b"
quant_path = "/workspace/models/llama-3.1-8b-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
EOF

# Convert AWQ to TRT-LLM
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b-awq-int4 \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
    --dtype float16 \
    --quant_ckpt_path /workspace/models/llama-3.1-8b-awq-int4 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group

trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-int4 \
    --gemm_plugin float16 \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 8192
```

{% hint style="info" %}
**Engine build time:** 10–30 minutes depending on GPU and model size. This is a one-time operation — once built, the engine loads in seconds.
{% endhint %}

***

## Step 6 — Quick Test with TRT-LLM Python API

Before setting up Triton, verify the engine works:

```bash
python3 << 'EOF'
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

engine_dir = "/workspace/trt_engines/llama-3.1-8b-fp16"
tokenizer_dir = "/workspace/models/llama-3.1-8b"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
runner = ModelRunner.from_dir(
    engine_dir=engine_dir,
    rank=0
)

prompt = "What is the capital of France?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# generate() expects a list of token-ID tensors, one per batch entry
output = runner.generate(
    batch_input_ids=[input_ids[0]],
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9
)

output_ids = output[0][0][len(input_ids[0]):]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Response: {response}")
EOF
```
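
The same runner gives a rough throughput number before Triton is involved. A minimal sketch, continuing the session above (`runner` and `tokenizer` already loaded; batch size and prompt are arbitrary):

```python
import time

# 16 identical prompts; stays under the engine's --max_batch_size of 32
prompts = ["Summarize the history of GPU computing."] * 16
batch = [tokenizer.encode(p, return_tensors="pt")[0] for p in prompts]

start = time.time()
runner.generate(batch_input_ids=batch, max_new_tokens=128,
                temperature=0.7, top_p=0.9)
elapsed = time.time() - start

# Upper bound: sequences that hit EOS early generate fewer tokens
print(f"~{128 * len(prompts) / elapsed:.0f} tokens/sec at batch size {len(prompts)}")
```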

***

## Step 7 — Set Up Triton Inference Server

### Create Model Repository Structure

```bash
mkdir -p /workspace/triton_model_repo/llama/1

# Create model configuration
cat > /workspace/triton_model_repo/llama/config.pbtxt << 'EOF'
backend: "tensorrtllm"
name: "llama"
max_batch_size: 64
model_transaction_policy {
  decoupled: false  # set to true only for token streaming over gRPC
}

dynamic_batching {
  preferred_batch_size: [1, 2, 4, 8, 16, 32, 64]
  max_queue_delay_microseconds: 1000
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [-1]
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [1]
    reshape: { shape: [] }
    optional: true
  }
]

output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [-1, -1]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [1]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]

parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}

parameters: {
  key: "gpt_model_path"
  value: { string_value: "/workspace/trt_engines/llama-3.1-8b-fp16" }
}

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "8192" }
}

parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "guaranteed_no_evict" }
}
EOF
```

### Create Engine Symlink

```bash
ln -s /workspace/trt_engines/llama-3.1-8b-fp16 \
    /workspace/triton_model_repo/llama/1/
```

### Start Triton Server

```bash
tritonserver \
    --model-repository=/workspace/triton_model_repo \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002 \
    --log-verbose=0 > /workspace/triton.log 2>&1 &

# Wait for server to start
sleep 30

# Check server health
curl -s http://localhost:8000/v2/health/ready
```
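
Rather than a fixed `sleep`, you can poll the health endpoint until the engine finishes loading. A small sketch:

```python
import time
import requests

for _ in range(60):
    try:
        if requests.get("http://localhost:8000/v2/health/ready", timeout=2).ok:
            print("Triton is ready")
            break
    except requests.ConnectionError:
        pass  # server still starting up
    time.sleep(5)
else:
    raise RuntimeError("Triton never became ready; check /workspace/triton.log")
```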

***

## Step 8 — Query the API

### Triton Generate Endpoint

```python
import requests

def generate(prompt: str, max_tokens: int = 200) -> str:
    # NOTE: `text_input` is handled by the tokenizer pre/post-processing
    # ensemble from the tensorrtllm_backend repo. The bare engine config in
    # Step 7 expects token IDs instead; see the tritonclient example in Step 9.
    url = "http://localhost:8000/v2/models/llama/generate"

    payload = {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json().get("text_output", "")

# Test
print(generate("Explain quantum computing in simple terms:"))
```
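
With the helper above you can also take a quick end-to-end latency reading (illustrative, single-request numbers only):

```python
import time

latencies = []
for _ in range(5):
    start = time.time()
    generate("Write a haiku about GPUs.", max_tokens=64)
    latencies.append(time.time() - start)

print(f"mean latency: {1000 * sum(latencies) / len(latencies):.0f} ms")
```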

### Benchmark Throughput

```bash
# Install tritonclient
pip install tritonclient[all]

# Run performance benchmark
perf_analyzer \
    -m llama \
    -u localhost:8001 \
    --protocol grpc \
    --input-data /workspace/sample_inputs.json \
    --concurrency-range 1:32:2 \
    --measurement-interval 10000 \
    --shape input_ids:512 \
    --shape input_lengths:1 \
    --shape request_output_len:1
```
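
The benchmark expects `/workspace/sample_inputs.json` to exist. A minimal sketch that writes one request in perf_analyzer's JSON input-data format (random token IDs are fine for load testing, as long as they stay within the vocabulary):

```python
import json
import random

sample = {
    "data": [{
        # 512 random token IDs standing in for a real tokenized prompt
        "input_ids": {"content": [random.randint(1, 32000) for _ in range(512)],
                      "shape": [512]},
        "input_lengths": {"content": [512], "shape": [1]},
        "request_output_len": {"content": [128], "shape": [1]},
    }]
}

with open("/workspace/sample_inputs.json", "w") as f:
    json.dump(sample, f)
```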

***

## Step 9 — Add OpenAI-Compatible API Wrapper

For easier integration, add a FastAPI wrapper:

```bash
pip install fastapi uvicorn tritonclient[all]

cat > /workspace/openai_server.py << 'EOF'
from fastapi import FastAPI
from pydantic import BaseModel
import tritonclient.http as httpclient
import numpy as np
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("/workspace/models/llama-3.1-8b")
client = httpclient.InferenceServerClient("localhost:8000")

class ChatRequest(BaseModel):
    model: str = "llama"
    messages: list
    max_tokens: int = 512
    temperature: float = 0.7

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    prompt = tokenizer.apply_chat_template(
        req.messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    input_ids = tokenizer.encode(prompt)
    
    # config.pbtxt sets max_batch_size > 0, so every tensor carries a
    # leading batch dimension of 1
    inputs = [
        httpclient.InferInput("input_ids", [1, len(input_ids)], "INT32"),
        httpclient.InferInput("input_lengths", [1, 1], "INT32"),
        httpclient.InferInput("request_output_len", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.array([input_ids], dtype=np.int32))
    inputs[1].set_data_from_numpy(np.array([[len(input_ids)]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[req.max_tokens]], dtype=np.int32))
    
    result = client.infer("llama", inputs)
    # output_ids shape: [batch, beam, seq]; the sequence includes the prompt
    output_ids = result.as_numpy("output_ids")[0][0][len(input_ids):]
    text = tokenizer.decode(output_ids, skip_special_tokens=True)
    
    return {
        "choices": [{"message": {"role": "assistant", "content": text}}]
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

python3 /workspace/openai_server.py &
```
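
Then query it like any OpenAI-style chat endpoint (plain `requests` here to avoid client-library version assumptions):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```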

***

## Troubleshooting

### Engine Build OOM

```bash
# Reduce max_batch_size, max_input_len, and max_seq_len
# (here: 32 -> 8, 4096 -> 2048, 8192 -> 4096)
trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096
```

### Triton Server Not Starting

```bash
# Check logs
cat /workspace/triton.log

# Verify engine files exist
ls -la /workspace/trt_engines/llama-3.1-8b-fp16/

# Check GPU memory
nvidia-smi
```

### Low Throughput

* Verify `gpt_model_type` is set to `inflight_fused_batching` so requests are batched continuously.
* Increase client concurrency; throughput scales with the number of in-flight requests.
* Raise `max_tokens_in_paged_kv_cache` based on available VRAM.
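
Triton's Prometheus endpoint (port 8002, set by `--metrics-port` in Step 7) shows whether requests are queuing rather than computing. A quick filter sketch:

```python
import requests

metrics = requests.get("http://localhost:8002/metrics").text
for line in metrics.splitlines():
    # High queue time relative to compute time means the scheduler, not the
    # GPU, is the bottleneck
    if line.startswith(("nv_inference_queue_duration_us",
                        "nv_inference_compute_infer_duration_us",
                        "nv_inference_request_success")):
        print(line)
```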

***

## Performance Benchmarks on Clore.ai GPUs

| Model         | GPU         | Quantization | Throughput (tokens/sec) |
| ------------- | ----------- | ------------ | ----------------------- |
| Llama 3.1 8B  | RTX 4090    | FP16         | \~3,500                 |
| Llama 3.1 8B  | RTX 4090    | INT4 AWQ     | \~6,200                 |
| Llama 3.1 70B | 2x A100 80G | FP16         | \~1,800                 |
| Mixtral 8x7B  | 2x RTX 4090 | INT8         | \~2,400                 |

***

## Additional Resources

* [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)
* [Triton Inference Server](https://github.com/triton-inference-server/server)
* [NGC Container Registry](https://catalog.ngc.nvidia.com/)
* [TRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
* [AWQ Quantization](https://github.com/mit-han-lab/llm-awq)

***

*TensorRT-LLM on Clore.ai is the optimal choice for production LLM serving where throughput and latency are critical. For simpler setups, consider the vLLM guide.*

***

## Clore.ai GPU Recommendations

| Use Case             | Recommended GPU | Est. Cost on Clore.ai |
| -------------------- | --------------- | --------------------- |
| Development/Testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Inference | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Models (70B+)  | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

