# TensorRT-LLM

> **Maximum LLM inference throughput with NVIDIA TensorRT optimization — deployed via Triton Inference Server**

TensorRT-LLM is NVIDIA's open-source library for optimizing Large Language Model inference on NVIDIA GPUs. It delivers state-of-the-art performance through kernel fusion, quantization (INT4, INT8, FP8), in-flight batching, and paged KV-caching. Combined with Triton Inference Server, you get a production-grade serving infrastructure.

**GitHub:** [NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) — 10K+ ⭐

***

## Why TensorRT-LLM?

| Feature                   | vLLM      | TensorRT-LLM  |
| ------------------------- | --------- | ------------- |
| Throughput                | Excellent | Best-in-class |
| Latency                   | Good      | Excellent     |
| INT4/INT8 quantization    | Partial   | Native        |
| FP8 support               | Limited   | Full          |
| Multi-GPU tensor parallel | Yes       | Yes           |
| Setup complexity          | Low       | Medium-High   |

{% hint style="success" %}
**TensorRT-LLM typically delivers 2–4x higher throughput** compared to standard HuggingFace transformers inference, and 30–50% better throughput than vLLM for batch serving scenarios.
{% endhint %}

***

## Prerequisites

* Clore.ai account with GPU rental
* **NVIDIA GPU with Ampere architecture or newer** (RTX 3090, A100, RTX 4090, H100)
* Basic Linux and Docker knowledge
* Sufficient VRAM for your chosen model

***

## VRAM Requirements by Model

| Model         | FP16  | INT8 | INT4 |
| ------------- | ----- | ---- | ---- |
| Llama-3.1 8B  | 16GB  | 8GB  | 4GB  |
| Llama-3.1 70B | 140GB | 70GB | 35GB |
| Mistral 7B    | 14GB  | 7GB  | 4GB  |
| Mixtral 8x7B  | 90GB  | 45GB | 24GB |
| Qwen2.5 72B   | 144GB | 72GB | 36GB |
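
These figures roughly cover model weights only; the KV cache grows on top of that with batch size and sequence length. For a quick back-of-the-envelope estimate before renting a GPU, here is a minimal sketch (the Llama 3.1 8B shape constants are illustrative; check your model's `config.json`):

```python
def estimate_vram_gb(params_b, bits_per_weight, n_layers, hidden,
                     kv_frac=1.0, batch=1, seq_len=4096):
    """Rough VRAM estimate in GB: weights + FP16 KV cache."""
    weights = params_b * 1e9 * bits_per_weight / 8           # bytes for weights
    # K and V caches: 2 * layers * hidden * 2 bytes per token,
    # scaled by the GQA ratio (kv_heads / attention_heads)
    kv_cache = 2 * n_layers * hidden * kv_frac * 2 * batch * seq_len
    return (weights + kv_cache) / 1e9

# Llama 3.1 8B: 32 layers, hidden size 4096, 8 of 32 KV heads (GQA)
print(f"FP16: ~{estimate_vram_gb(8, 16, 32, 4096, kv_frac=8/32):.1f} GB")
print(f"INT4: ~{estimate_vram_gb(8, 4, 32, 4096, kv_frac=8/32):.1f} GB")
```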

***

## Step 1 — Choose Your GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai) → **Marketplace**
2. **For single GPU serving (7B–13B models):** RTX 4090 24GB or RTX 3090 24GB
3. **For large models (70B+):** Multiple A100 80GB or H100

{% hint style="info" %}
**Multi-GPU Strategy:**

* 2x A100 80GB → Llama 3.1 70B in FP16 or Qwen2.5 72B
* 4x A100 80GB → Llama 3.1 405B in INT4
* Select servers with multiple GPUs listed in the Clore.ai marketplace
  {% endhint %}

***

## Step 2 — Deploy Triton Inference Server with TRT-LLM Backend

**Docker Image:**

```
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
```

{% hint style="warning" %}
Use the `-trtllm-python-py3` variant — this includes the TensorRT-LLM backend pre-installed. The tag corresponds to the NVIDIA container release (24.01 = January 2024). Check [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) for the latest tag.
{% endhint %}

**Exposed Ports:**

```
22
8000
8080
```

Port 8000 is Triton's HTTP endpoint; 8080 is only needed if you add the OpenAI-compatible wrapper from Step 9. The gRPC (8001) and metrics (8002) ports are used locally in this guide and don't need to be exposed.

**Environment Variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
TRANSFORMERS_CACHE=/workspace/hf_cache
HF_HOME=/workspace/hf_cache
```

**Volume/Disk:** Minimum 100GB recommended

***

## Step 3 — Connect and Verify Installation

```bash
ssh root@<server-ip> -p <ssh-port>

# Check GPU
nvidia-smi

# Check TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Check Triton is available
tritonserver --version
```

***

## Step 4 — Download and Prepare Model

We'll use Llama 3.1 8B as the example. Adjust paths for your chosen model.

### Install HuggingFace CLI

```bash
pip install huggingface_hub
huggingface-cli login
# Enter your HuggingFace token when prompted
```

### Download Model Weights

```bash
mkdir -p /workspace/models/llama-3.1-8b
huggingface-cli download \
    meta-llama/Llama-3.1-8B-Instruct \
    --local-dir /workspace/models/llama-3.1-8b \
    --local-dir-use-symlinks False

# Or use snapshot_download
python3 << 'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/workspace/models/llama-3.1-8b",
    local_dir_use_symlinks=False
)
EOF
```
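
Downloads of gated models occasionally stop partway; before spending time on an engine build it is worth confirming the snapshot is complete. A quick sanity check of the local directory (the expected file list is an assumption based on typical HF Llama snapshots):

```python
from pathlib import Path

model_dir = Path("/workspace/models/llama-3.1-8b")

# Expect a model config, tokenizer files and one or more safetensors shards
expected = ["config.json", "tokenizer.json", "tokenizer_config.json"]
missing = [name for name in expected if not (model_dir / name).exists()]
shards = sorted(model_dir.glob("*.safetensors"))

print("missing:", missing or "none")
print(f"shards: {len(shards)}, "
      f"{sum(p.stat().st_size for p in shards) / 1e9:.1f} GB total")
```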

***

## Step 5 — Build TensorRT Engine

This is the key step — compiling the model into an optimized TensorRT engine.

### FP16 Engine (Best Quality)

```bash
cd /workspace

# Convert HuggingFace weights to TRT-LLM format
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --dtype float16 \
    --tp_size 1

# Build TensorRT engine
trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
    --gemm_plugin float16 \
    --max_batch_size 32 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --max_num_tokens 16384 \
    --use_paged_context_fmha enable
```
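
`trtllm-build` writes a `config.json` into the output directory alongside the serialized engine. Reading a few fields back is a cheap way to confirm the limits you requested (a sketch; the exact key layout can differ between TRT-LLM releases):

```python
import json

engine_dir = "/workspace/trt_engines/llama-3.1-8b-fp16"

with open(f"{engine_dir}/config.json") as f:
    cfg = json.load(f)

# Recent releases keep the build limits under "build_config"
build = cfg.get("build_config", {})
for key in ("max_batch_size", "max_input_len", "max_seq_len", "max_num_tokens"):
    print(f"{key}: {build.get(key)}")
print("dtype:", cfg.get("pretrained_config", {}).get("dtype"))
```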

### INT8 SmoothQuant Engine (Higher Throughput)

```bash
# Convert with SmoothQuant quantization
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_channel \
    --per_token

trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int8 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-int8 \
    --gemm_plugin float16 \
    --smoothquant_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192
```

### INT4 AWQ Engine (Maximum Throughput / Minimum Memory)

```bash
# Install AutoAWQ for quantization
pip install autoawq

# Quantize to INT4 AWQ
python3 << 'EOF'
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/workspace/models/llama-3.1-8b"
quant_path = "/workspace/models/llama-3.1-8b-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
EOF

# Convert AWQ to TRT-LLM
python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/llama-3.1-8b-awq-int4 \
    --output_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
    --dtype float16 \
    --quant_ckpt_path /workspace/models/llama-3.1-8b-awq-int4 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group

trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-int4 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-int4 \
    --gemm_plugin float16 \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 8192
```
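
With all three variants built, comparing their sizes on disk gives an immediate feel for the quantization savings (engine size roughly tracks weight memory at load time):

```python
from pathlib import Path

engines = {
    "FP16": "/workspace/trt_engines/llama-3.1-8b-fp16",
    "INT8": "/workspace/trt_engines/llama-3.1-8b-int8",
    "INT4": "/workspace/trt_engines/llama-3.1-8b-int4",
}

for name, path in engines.items():
    p = Path(path)
    if not p.exists():
        print(f"{name}: not built yet")
        continue
    size = sum(f.stat().st_size for f in p.rglob("*") if f.is_file())
    print(f"{name}: {size / 1e9:.1f} GB on disk")
```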

{% hint style="info" %}
**Engine build time:** 10–30 minutes depending on GPU and model size. This is a one-time operation — once built, the engine loads in seconds.
{% endhint %}

***

## Step 6 — Quick Test with TRT-LLM Python API

Before setting up Triton, verify the engine works:

```bash
python3 << 'EOF'
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

engine_dir = "/workspace/trt_engines/llama-3.1-8b-fp16"
tokenizer_dir = "/workspace/models/llama-3.1-8b"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
runner = ModelRunner.from_dir(
    engine_dir=engine_dir,
    rank=0
)

prompt = "What is the capital of France?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = runner.generate(
    batch_input_ids=[input_ids[0]],  # list of 1-D token-ID tensors, one per request
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9
)

output_ids = output[0][0][len(input_ids[0]):]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Response: {response}")
EOF
```

***

## Step 7 — Set Up Triton Inference Server

### Create Model Repository Structure

```bash
mkdir -p /workspace/triton_model_repo/llama/1

# Create model configuration
cat > /workspace/triton_model_repo/llama/config.pbtxt << 'EOF'
backend: "tensorrtllm"
name: "llama"
max_batch_size: 64
model_transaction_policy {
  decoupled: false  # set to true only when streaming tokens over gRPC
}

dynamic_batching {
  preferred_batch_size: [1, 2, 4, 8, 16, 32, 64]
  max_queue_delay_microseconds: 1000
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [-1]
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [] }
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [1]
    reshape: { shape: [] }
    optional: true
  }
]

output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [-1, -1]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [1]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]

parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}

parameters: {
  key: "gpt_model_path"
  value: { string_value: "/workspace/trt_engines/llama-3.1-8b-fp16" }
}

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "8192" }
}

parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "guaranteed_no_evict" }
}
EOF
```

### Create Engine Symlink

```bash
ln -s /workspace/trt_engines/llama-3.1-8b-fp16 \
    /workspace/triton_model_repo/llama/1/
```

### Start Triton Server

```bash
tritonserver \
    --model-repository=/workspace/triton_model_repo \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002 \
    --log-verbose=0 > /workspace/triton.log 2>&1 &

# Wait for server to start
sleep 30

# Check server health
curl -s http://localhost:8000/v2/health/ready
```
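
Beyond the readiness endpoint, the Triton client library can confirm that the `llama` model itself loaded and print the input/output signature it expects, which is useful before writing any application code. A minimal check (install the client first with `pip install tritonclient[all]` if needed):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("llama"))

# The metadata should mirror the inputs/outputs declared in config.pbtxt
meta = client.get_model_metadata("llama")
for t in meta["inputs"]:
    print("input: ", t["name"], t["datatype"], t["shape"])
for t in meta["outputs"]:
    print("output:", t["name"], t["datatype"], t["shape"])
```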

***

## Step 8 — Query the API

### Triton Generate Endpoint

The `text_input` / `text_output` form of Triton's generate endpoint assumes the full tokenizer-in/tokenizer-out ensemble from the [tensorrtllm\_backend](https://github.com/triton-inference-server/tensorrtllm_backend) repository. With the minimal `input_ids`-based model from Step 7, use the raw Triton client instead (as in Step 9); the snippet below applies once an ensemble is deployed.

```python
import requests
import json

def generate(prompt: str, max_tokens: int = 200) -> str:
    url = "http://localhost:8000/v2/models/llama/generate"
    
    payload = {
        "text_input": prompt,
        "parameters": {
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "top_p": 0.9
        }
    }
    
    response = requests.post(url, json=payload)
    result = response.json()
    return result.get("text_output", "")

# Test
print(generate("Explain quantum computing in simple terms:"))
```

### Benchmark Throughput

```bash
# Install tritonclient
pip install tritonclient[all]

# Run performance benchmark
perf_analyzer \
    -m llama \
    -u localhost:8001 \
    --protocol grpc \
    --input-data /workspace/sample_inputs.json \
    --concurrency-range 1:32:2 \
    --measurement-interval 10000 \
    --shape input_ids:512 \
    --shape input_lengths:1 \
    --shape request_output_len:1
```
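
The `--input-data` file above has to exist before `perf_analyzer` will run. Here is a sketch that generates a single synthetic request at `/workspace/sample_inputs.json`; the `content`/`shape` layout follows the `perf_analyzer` input-data JSON format, but verify it against the docs for the version in your container:

```python
import json
import random

# One synthetic request: 512 random token IDs, asking for 128 generated tokens
request = {
    "input_ids": {
        "content": [random.randint(10, 30000) for _ in range(512)],
        "shape": [512],
    },
    "input_lengths": {"content": [512], "shape": [1]},
    "request_output_len": {"content": [128], "shape": [1]},
}

with open("/workspace/sample_inputs.json", "w") as f:
    json.dump({"data": [request]}, f)

print("wrote /workspace/sample_inputs.json")
```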

***

## Step 9 — Add OpenAI-Compatible API Wrapper

For easier integration, add a FastAPI wrapper:

```bash
pip install fastapi uvicorn tritonclient[all]

cat > /workspace/openai_server.py << 'EOF'
from fastapi import FastAPI
from pydantic import BaseModel
import tritonclient.http as httpclient
import numpy as np
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("/workspace/models/llama-3.1-8b")
client = httpclient.InferenceServerClient("localhost:8000")

class ChatRequest(BaseModel):
    model: str = "llama"
    messages: list
    max_tokens: int = 512
    temperature: float = 0.7

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    prompt = tokenizer.apply_chat_template(
        req.messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    input_ids = tokenizer.encode(prompt)
    
    # Shapes include a leading batch dimension because config.pbtxt sets max_batch_size > 0
    inputs = [
        httpclient.InferInput("input_ids", [1, len(input_ids)], "INT32"),
        httpclient.InferInput("input_lengths", [1, 1], "INT32"),
        httpclient.InferInput("request_output_len", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.array([input_ids], dtype=np.int32))
    inputs[1].set_data_from_numpy(np.array([[len(input_ids)]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[req.max_tokens]], dtype=np.int32))

    result = client.infer("llama", inputs)
    # output_ids has shape (batch, beams, seq) and includes the prompt tokens
    output_ids = result.as_numpy("output_ids")[0][0][len(input_ids):]
    text = tokenizer.decode(output_ids, skip_special_tokens=True)
    
    return {
        "choices": [{"message": {"role": "assistant", "content": text}}]
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

python3 /workspace/openai_server.py &
```
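
With the wrapper running, any HTTP client can hit port 8080 using the familiar chat-completions request shape. A quick smoke test with `requests` (replace `localhost` with your Clore.ai server IP when calling remotely):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize what TensorRT-LLM does in two sentences."},
        ],
        "max_tokens": 256,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```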

***

## Troubleshooting

### Engine Build OOM

```bash
# Reduce max_batch_size (e.g. 32 → 8), max_input_len (4096 → 2048)
# and max_seq_len (8192 → 4096) until the build fits in VRAM
trtllm-build \
    --checkpoint_dir /workspace/trt_checkpoints/llama-3.1-8b-fp16 \
    --output_dir /workspace/trt_engines/llama-3.1-8b-fp16 \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096
```

### Triton Server Not Starting

```bash
# Check logs
cat /workspace/triton.log

# Verify engine files exist
ls -la /workspace/trt_engines/llama-3.1-8b-fp16/

# Check GPU memory
nvidia-smi
```

### Low Throughput

* Confirm `gpt_model_type` is set to `inflight_fused_batching` so requests are batched continuously.
* Increase client-side concurrency; throughput scales with batch size until the KV cache fills.
* Raise `max_tokens_in_paged_kv_cache` in `config.pbtxt` based on the VRAM left after the engine loads.

***

## Performance Benchmarks on Clore.ai GPUs

| Model         | GPU         | Quantization | Throughput (tokens/sec) |
| ------------- | ----------- | ------------ | ----------------------- |
| Llama 3.1 8B  | RTX 4090    | FP16         | \~3,500                 |
| Llama 3.1 8B  | RTX 4090    | INT4 AWQ     | \~6,200                 |
| Llama 3.1 70B | 2x A100 80G | FP16         | \~1,800                 |
| Mixtral 8x7B  | 2x RTX 4090 | INT8         | \~2,400                 |

***

## Additional Resources

* [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)
* [Triton Inference Server](https://github.com/triton-inference-server/server)
* [NGC Container Registry](https://catalog.ngc.nvidia.com/)
* [TRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
* [AWQ Quantization](https://github.com/mit-han-lab/llm-awq)

***

*TensorRT-LLM on Clore.ai is the optimal choice for production LLM serving where throughput and latency are critical. For simpler setups, consider the vLLM guide.*

***

## Clore.ai GPU Recommendations

| Use Case             | Recommended GPU | Est. Cost on Clore.ai |
| -------------------- | --------------- | --------------------- |
| Development/Testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Inference | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Models (70B+)  | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
