# TensorRT-LLM

This guide walks through deploying NVIDIA TensorRT-LLM behind Triton Inference Server on a rented Clore.ai GPU, from downloading weights and building an engine to exposing an OpenAI-compatible endpoint.
## Why TensorRT-LLM?

TensorRT-LLM compiles models ahead of time into optimized TensorRT engines, which typically yields higher throughput and lower latency than runtime-interpreted stacks such as vLLM, at the cost of a per-model, per-GPU build step.

| Feature | vLLM | TensorRT-LLM |
| --- | --- | --- |
## Prerequisites

You will need a Clore.ai account with credits, basic familiarity with SSH and Docker, and a HuggingFace account if your target model is gated (e.g. Llama).

### VRAM Requirements by Model

| Model | FP16 | INT8 | INT4 |
| --- | --- | --- | --- |
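As a rule of thumb, weight memory is parameter count times bytes per parameter, with KV cache and runtime overhead on top. The sketch below makes that arithmetic concrete; the layer/head defaults roughly match a 7B Llama-style model, and the 1.5 GB overhead figure is an assumption for illustration, not a measured value.

```python
# Back-of-envelope VRAM estimate: weights + KV cache + overhead.
# Defaults approximate a 7B Llama-style model; real usage depends on
# the engine build options chosen in Step 5.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_b: float, precision: str,
                     num_layers: int = 32, kv_heads: int = 32,
                     head_dim: int = 128, seq_len: int = 4096,
                     batch: int = 1) -> float:
    weights = params_b * 1e9 * BYTES_PER_PARAM[precision]
    # KV cache: two tensors (K and V) per layer, usually kept in FP16.
    kv_cache = 2 * num_layers * kv_heads * head_dim * seq_len * batch * 2
    overhead = 1.5e9  # assumed: CUDA context, activations, runtime buffers
    return (weights + kv_cache + overhead) / 1e9

print(f"7B FP16: ~{estimate_vram_gb(7, 'fp16'):.1f} GB")
print(f"7B INT4: ~{estimate_vram_gb(7, 'int4'):.1f} GB")
```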
## Step 1 — Choose Your GPU on Clore.ai

On the Clore.ai marketplace, filter listings by GPU model and VRAM, and pick a card whose memory covers your target model at your chosen quantization (see the table above). Ampere and Ada cards (e.g. RTX 3090, RTX 4090, A100) are well supported by TensorRT-LLM.
## Step 2 — Deploy Triton Inference Server with TRT-LLM Backend
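NVIDIA publishes Triton images with the TRT-LLM backend preinstalled on the NGC catalog; on Clore.ai you specify such an image when creating the deployment. The command below is the equivalent manual `docker run`; the tag shown is an example, so pick a current one and make sure its TensorRT-LLM version matches the version you build engines with.

```bash
# Example image tag; check the NGC catalog for the current release.
docker run --rm -it --gpus all --net host \
  --shm-size=2g --ulimit memlock=-1 \
  -v /opt/models:/models \
  nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash
```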
## Step 3 — Connect and Verify Installation
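SSH into the instance using the connection details shown in the Clore.ai dashboard, then confirm the GPU is visible and TensorRT-LLM imports cleanly:

```bash
ssh root@<instance-ip> -p <port>   # host and port come from the Clore.ai dashboard
nvidia-smi                          # GPU listed with the expected VRAM?
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```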
## Step 4 — Download and Prepare Model
### Install HuggingFace CLI
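```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login   # only needed for gated models such as Llama
```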
### Download Model Weights
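Llama-2-7B is used as the running example throughout this guide; substitute your own model ID and paths.

```bash
huggingface-cli download meta-llama/Llama-2-7b-hf \
  --local-dir /models/llama-2-7b-hf \
  --exclude "*.bin"   # prefer safetensors when both formats are present
```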
## Step 5 — Build TensorRT Engine
### FP16 Engine (Best Quality)
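Building is a two-step process: convert the HuggingFace checkpoint into TRT-LLM format, then compile it. The paths are this guide's example paths, and flag names have shifted across TensorRT-LLM releases, so check `trtllm-build --help` on your version.

```bash
# 1) Convert HF weights to a TRT-LLM checkpoint (script lives in the
#    TensorRT-LLM repo under examples/llama/).
python3 convert_checkpoint.py \
  --model_dir /models/llama-2-7b-hf \
  --output_dir /models/ckpt-fp16 \
  --dtype float16

# 2) Compile the engine.
trtllm-build \
  --checkpoint_dir /models/ckpt-fp16 \
  --output_dir /models/engine-fp16 \
  --gemm_plugin float16 \
  --max_batch_size 16
```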
### INT8 SmoothQuant Engine (Higher Throughput)
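SmoothQuant calibration happens during checkpoint conversion. The flags below (migration strength 0.5 with per-token/per-channel scaling) follow the TensorRT-LLM Llama example; exact flags vary by release.

```bash
python3 convert_checkpoint.py \
  --model_dir /models/llama-2-7b-hf \
  --output_dir /models/ckpt-sq \
  --dtype float16 \
  --smoothquant 0.5 --per_token --per_channel

trtllm-build \
  --checkpoint_dir /models/ckpt-sq \
  --output_dir /models/engine-int8 \
  --gemm_plugin float16 \
  --max_batch_size 32
```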
### INT4 AWQ Engine (Maximum Throughput / Minimum Memory)
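AWQ needs a calibration pass, handled by `quantize.py` from the TensorRT-LLM repo's `examples/quantization/` directory. Paths and calibration size are example values; check the script's `--help` on your release.

```bash
python3 quantize.py \
  --model_dir /models/llama-2-7b-hf \
  --output_dir /models/ckpt-awq \
  --qformat int4_awq \
  --calib_size 32

trtllm-build \
  --checkpoint_dir /models/ckpt-awq \
  --output_dir /models/engine-int4 \
  --gemm_plugin float16 \
  --max_batch_size 64
```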
## Step 6 — Quick Test with TRT-LLM Python API
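Before wiring up Triton, sanity-check the engine directly. Recent TensorRT-LLM releases ship a vLLM-style high-level API (`tensorrt_llm.LLM`); on older releases, use the `ModelRunner` examples from the repo instead. The engine path is this guide's example path from Step 5.

```python
from tensorrt_llm import LLM, SamplingParams

# Point at the engine directory built in Step 5.
llm = LLM(model="/models/engine-fp16")

params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["The three laws of robotics are"], params)

for out in outputs:
    print(out.outputs[0].text)
```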
## Step 7 — Set Up Triton Inference Server
### Create Model Repository Structure
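The `tensorrtllm_backend` repo ships a ready-made repository template. Each subdirectory's `config.pbtxt` contains placeholders (tokenizer path, batch size, and so on) that the repo's `tools/fill_template.py` script fills in; see its README for the exact invocation on your version.

```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm /models/triton_repo

# Resulting layout (each subdirectory holds a config.pbtxt):
# /models/triton_repo/
# ├── ensemble/
# ├── preprocessing/
# ├── postprocessing/
# └── tensorrt_llm/
```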
### Create Engine Symlink
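Triton loads the engine from the model's version directory; symlinking avoids duplicating multi-gigabyte files.

```bash
mkdir -p /models/triton_repo/tensorrt_llm/1
ln -s /models/engine-fp16/* /models/triton_repo/tensorrt_llm/1/
```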
### Start Triton Server
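The backend repo provides a launch script; `--world_size` must match the tensor parallelism the engine was built with (1 here). By default Triton listens on 8000 (HTTP), 8001 (gRPC), and 8002 (metrics).

```bash
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
  --world_size 1 \
  --model_repo /models/triton_repo
```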
## Step 8 — Query the API
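Once Triton reports ready, the ensemble model can be queried over plain HTTP via Triton's generate endpoint (`localhost` assumes you are querying from the instance itself):

```bash
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is TensorRT-LLM?", "max_tokens": 64}'
```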
### OpenAI-Compatible Client
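With an OpenAI-compatible endpoint in front of Triton (see the wrapper in Step 9 below), the standard `openai` Python client works unchanged. The base URL, port, and model name here are placeholders for whatever your endpoint exposes.

```python
from openai import OpenAI

# Works against any OpenAI-compatible endpoint, e.g. the Step 9 wrapper.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.completions.create(
    model="llama-2-7b",          # whatever name your endpoint advertises
    prompt="Explain KV caching in one sentence.",
    max_tokens=64,
)
print(resp.choices[0].text)
```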
### Benchmark Throughput
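TensorRT-LLM ships its own benchmarking tools, but a rough end-to-end number is easy to get with a concurrent client. This is an illustrative sketch against the Triton generate endpoint from above; the request count and concurrency are arbitrary, and output length is approximated by `max_tokens` rather than counted with a tokenizer.

```python
import concurrent.futures
import time

import requests

URL = "http://localhost:8000/v2/models/ensemble/generate"
PROMPT = {"text_input": "Write a haiku about GPUs.", "max_tokens": 128}

def one_request(_):
    r = requests.post(URL, json=PROMPT, timeout=120)
    r.raise_for_status()
    return r.json()["text_output"]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(one_request, range(64)))
elapsed = time.time() - start

# Approximate: assumes every request generated the full max_tokens.
print(f"{64 * 128 / elapsed:.0f} tokens/sec (64 requests x 128 tokens)")
```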
## Step 9 — Add OpenAI-Compatible API Wrapper
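Triton's TRT-LLM backend speaks its own request format, so tools that expect the OpenAI API need a thin translation layer. Below is a minimal FastAPI sketch, not a production server: it assumes the ensemble field names from Step 7, serves on port 8080, and omits streaming, authentication, and error mapping.

```python
# Minimal OpenAI-style /v1/completions shim in front of Triton.
# Run with: uvicorn wrapper:app --port 8080
import time
import uuid

import requests
from fastapi import FastAPI
from pydantic import BaseModel

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"
app = FastAPI()

class CompletionRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 64
    temperature: float = 1.0

@app.post("/v1/completions")
def completions(req: CompletionRequest):
    # Translate OpenAI fields to the tensorrtllm_backend ensemble's inputs.
    r = requests.post(TRITON_URL, json={
        "text_input": req.prompt,
        "max_tokens": req.max_tokens,
        "temperature": req.temperature,
    }, timeout=120)
    r.raise_for_status()
    return {
        "id": f"cmpl-{uuid.uuid4().hex[:12]}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{"index": 0, "text": r.json()["text_output"],
                     "finish_reason": "length"}],
    }
```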
## Troubleshooting
### Engine Build OOM

Engine compilation peaks well above inference memory because TensorRT holds both the weights and its build workspaces. Reduce the static shape limits you build with, or build on a larger GPU and copy the engine over; note that engines are only portable between identical GPU models running the same TensorRT-LLM version.
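For example, retrying the Step 5 build with smaller limits (flag names vary by release; newer versions use `--max_seq_len`/`--max_num_tokens` instead of `--max_input_len`):

```bash
# Peak build memory scales with these static shape limits.
trtllm-build \
  --checkpoint_dir /models/ckpt-fp16 \
  --output_dir /models/engine-fp16 \
  --max_batch_size 4 --max_input_len 2048
```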
### Triton Server Not Starting

If the server exits immediately, read the startup log first: the usual causes are an engine built with a different TensorRT-LLM version than the backend, an unfilled placeholder in a `config.pbtxt`, or ports 8000/8001/8002 already in use.
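```bash
curl -v localhost:8000/v2/health/ready   # returns 200 once all models are loaded
docker logs <container-id>               # or the console output of launch_triton_server.py
```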
### Low Throughput

Low numbers usually mean the server is underfed rather than slow: send many concurrent requests so inflight batching can form large batches, confirm batching is enabled in the `tensorrt_llm` model's `config.pbtxt`, and check that GPU utilization stays high under load.
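```bash
nvidia-smi dmon -s u                                 # GPU utilization over time
curl -s localhost:8002/metrics | grep nv_inference   # Triton request/queue counters
```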
## Performance Benchmarks on Clore.ai GPUs

| Model | GPU | Quantization | Throughput (tokens/sec) |
| --- | --- | --- | --- |
## Additional Resources
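* [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM)
* [TensorRT-LLM documentation](https://nvidia.github.io/TensorRT-LLM/)
* [Triton TRT-LLM backend](https://github.com/triton-inference-server/tensorrtllm_backend)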
## Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |