TensorRT-LLM

Maximum LLM inference throughput with NVIDIA TensorRT optimization — deployed via Triton Inference Server

TensorRT-LLM is NVIDIA's open-source library for optimizing Large Language Model inference on NVIDIA GPUs. It delivers state-of-the-art performance through kernel fusion, quantization (INT4, INT8, FP8), in-flight batching, and paged KV-caching. Combined with Triton Inference Server, you get a production-grade serving infrastructure.

GitHub: NVIDIA/TensorRT-LLM — 10K+ ⭐


Why TensorRT-LLM?

| Feature | vLLM | TensorRT-LLM |
| --- | --- | --- |
| Throughput | Excellent | Best-in-class |
| Latency | Good | Excellent |
| INT4/INT8 quantization | Partial | Native |
| FP8 support | Limited | Full |
| Multi-GPU tensor parallel | Yes | Yes |
| Setup complexity | Low | Medium-High |

Prerequisites

  • Clore.ai account with GPU rental

  • NVIDIA GPU with Ampere architecture or newer (RTX 3090, A100, RTX 4090, H100)

  • Basic Linux and Docker knowledge

  • Sufficient VRAM for your chosen model


VRAM Requirements by Model

| Model | FP16 | INT8 | INT4 |
| --- | --- | --- | --- |
| Llama-3.1 8B | 16GB | 8GB | 4GB |
| Llama-3.1 70B | 140GB | 70GB | 35GB |
| Mistral 7B | 14GB | 7GB | 4GB |
| Mixtral 8x7B | 90GB | 45GB | 24GB |
| Qwen2.5 72B | 144GB | 72GB | 36GB |


Step 1 — Choose Your GPU on Clore.ai

  1. Log in to clore.ai → Marketplace

  2. For single GPU serving (7B–13B models): RTX 4090 24GB or RTX 3090 24GB

  3. For large models (70B+): Multiple A100 80GB or H100


Multi-GPU Strategy:

  • 2x A100 80GB → Llama 3.1 70B in FP16 or Qwen2.5 72B

  • 4x A100 80GB → Llama 3.1 405B in INT4 (the INT8 weights alone are roughly 405GB, more than 4x 80GB)

  • Select servers with multiple GPUs listed in the Clore.ai marketplace


Step 2 — Deploy Triton Inference Server with TRT-LLM Backend

Docker Image: an NGC Triton release that bundles the TensorRT-LLM backend, i.e. a `tritonserver:<release>-trtllm-python-py3` tag from the NVIDIA NGC catalog

Exposed Ports: 8000 (HTTP inference), 8001 (gRPC), 8002 (Prometheus metrics)

Environment Variables: e.g. `HF_TOKEN`, needed to download gated models such as Llama 3.1

Volume/Disk: Minimum 100GB recommended (model weights, converted checkpoints, and built engines all live on disk)
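
Putting those settings together, a minimal launch on a Clore.ai host might look like the following (the image tag is illustrative; check the NGC catalog for the current `-trtllm-python-py3` release):

```bash
# Ports: 8000 = HTTP inference, 8001 = gRPC, 8002 = Prometheus metrics.
# --shm-size matters: Triton uses shared memory for tensor transport.
docker run --rm -it --gpus all --shm-size=2g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /workspace:/workspace \
  -e HF_TOKEN=<your-hf-token> \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
```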


Step 3 — Connect and Verify Installation
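
Once the container is up, a quick sanity check might look like this (connection details come from your Clore.ai dashboard; the Python check assumes the stock NGC image):

```bash
# SSH into the rented server (address and port from the Clore.ai dashboard)
ssh root@<server-ip> -p <port>

# Confirm the GPUs are visible to the driver
nvidia-smi

# Inside the container: confirm the TensorRT-LLM Python package is present
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```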


Step 4 — Download and Prepare Model

We'll use Llama 3.1 8B as the example. Adjust paths for your chosen model.

Install HuggingFace CLI
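
A typical install; gated models such as Llama 3.1 additionally require logging in with an access token from your Hugging Face account settings:

```bash
pip install -U "huggingface_hub[cli]"

# Paste an access token when prompted (required for gated repos)
huggingface-cli login
```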

Download Model Weights
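
For example (the local path is illustrative and reused in the build commands below; swap in your own model repo as needed):

```bash
# ~16GB of FP16 weights for the 8B model — make sure the volume has room
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir /workspace/models/llama-3.1-8b
```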


Step 5 — Build TensorRT Engine

This is the key step — compiling the model into an optimized TensorRT engine.

FP16 Engine (Best Quality)
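
A sketch of the two-step flow using the conversion script from TensorRT-LLM's Llama example (script location and flags vary between releases, so treat paths and values as illustrative):

```bash
# 1) Convert the Hugging Face checkpoint to TensorRT-LLM format
python3 /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir /workspace/models/llama-3.1-8b \
  --output_dir /workspace/ckpt/llama-8b-fp16 \
  --dtype float16

# 2) Compile the optimized engine
trtllm-build \
  --checkpoint_dir /workspace/ckpt/llama-8b-fp16 \
  --output_dir /workspace/engines/llama-8b-fp16 \
  --gemm_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 4096
```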

INT8 SmoothQuant Engine (Higher Throughput)
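
SmoothQuant migrates activation outliers into the weights so both can be quantized to INT8. A sketch with the same Llama example script (flag names depend on your TensorRT-LLM release):

```bash
# Convert with SmoothQuant enabled (alpha = 0.5 is a common default)
python3 /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir /workspace/models/llama-3.1-8b \
  --output_dir /workspace/ckpt/llama-8b-sq \
  --dtype float16 \
  --smoothquant 0.5 \
  --per_token --per_channel

trtllm-build \
  --checkpoint_dir /workspace/ckpt/llama-8b-sq \
  --output_dir /workspace/engines/llama-8b-int8 \
  --gemm_plugin float16
```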

INT4 AWQ Engine (Maximum Throughput / Minimum Memory)
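
AWQ goes through the quantization example rather than plain conversion, since it calibrates on sample data first (expect this step to take noticeably longer; paths and flags are illustrative):

```bash
# Quantize to INT4 AWQ with calibration
python3 /app/tensorrt_llm/examples/quantization/quantize.py \
  --model_dir /workspace/models/llama-3.1-8b \
  --output_dir /workspace/ckpt/llama-8b-awq \
  --qformat int4_awq \
  --awq_block_size 128

trtllm-build \
  --checkpoint_dir /workspace/ckpt/llama-8b-awq \
  --output_dir /workspace/engines/llama-8b-int4 \
  --gemm_plugin float16
```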


Engine build time: 10–30 minutes depending on GPU and model size. This is a one-time operation — once built, the engine loads in seconds.


Step 6 — Quick Test with TRT-LLM Python API

Before setting up Triton, verify the engine works:
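
A minimal smoke test on the GPU host might look like the following. The runtime API surface shifts between TensorRT-LLM releases, so treat this as a sketch rather than exact code:

```python
# Load the built engine and generate a few tokens to confirm it works.
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

engine_dir = "/workspace/engines/llama-8b-fp16"   # path from the build step
tokenizer = AutoTokenizer.from_pretrained("/workspace/models/llama-3.1-8b")

runner = ModelRunner.from_dir(engine_dir)
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids[0]

# generate() takes a batch of token-id tensors and returns [batch, beams, seq]
outputs = runner.generate(
    [input_ids],
    max_new_tokens=32,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```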


Step 7 — Set Up Triton Inference Server

Create Model Repository Structure
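
The backend templates ship in the triton-inference-server/tensorrtllm_backend repo as a four-model ensemble (preprocessing, tensorrt_llm, postprocessing, ensemble). A sketch using the in-flight batcher template set; the `fill_template.py` variable names vary between releases, so check the repo README for your version:

```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm /workspace/triton_repo

# Fill in the engine path and batching strategy; the preprocessing and
# postprocessing configs also need tokenizer_dir set the same way.
python3 tensorrtllm_backend/tools/fill_template.py -i \
  /workspace/triton_repo/tensorrt_llm/config.pbtxt \
  "triton_backend:tensorrtllm,engine_dir:/workspace/engines/llama-8b-fp16,batching_strategy:inflight_fused_batching,triton_max_batch_size:64"
```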

Start Triton Server
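
The tensorrtllm_backend repo includes a launcher that wraps `tritonserver` with the right MPI setup; `--world_size` must match the tensor parallelism the engine was built with (paths are illustrative):

```bash
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
  --model_repo /workspace/triton_repo \
  --world_size 1

# Returns HTTP 200 once all models in the ensemble are loaded
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
```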


Step 8 — Query the API

OpenAI-Compatible Client
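
Before the OpenAI wrapper from Step 9 is in place, you can talk to Triton's generate endpoint directly. A stdlib-only sketch; `ensemble` is the default model name in the tensorrtllm_backend templates, so adjust if yours differs:

```python
import json
import urllib.request

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

def build_payload(prompt: str, max_tokens: int = 128, temperature: float = 0.7) -> dict:
    # Field names expected by the TensorRT-LLM generate endpoint
    return {"text_input": prompt, "max_tokens": max_tokens, "temperature": temperature}

def generate(prompt: str, **kwargs) -> str:
    req = urllib.request.Request(
        TRITON_URL,
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["text_output"]

# On the server: print(generate("Explain paged KV caching in one sentence."))
```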

Benchmark Throughput
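
A rough client-side measurement: fire concurrent requests and divide generated tokens by wall-clock time. This sketch approximates generated tokens by the `max_tokens` budget (reasonable when the model exhausts it) and assumes the default ensemble endpoint:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v2/models/ensemble/generate"

def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    """Aggregate decode throughput across all concurrent requests."""
    return total_tokens / elapsed_s if elapsed_s > 0 else 0.0

def one_request(prompt: str, max_tokens: int = 128) -> int:
    body = json.dumps({"text_input": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120):
        return max_tokens  # approximation: assume the full budget was generated

def run_benchmark(n_requests: int = 32) -> float:
    # Concurrency matters: in-flight batching only shines with many parallel requests
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        counts = list(pool.map(one_request, ["Write a haiku about GPUs."] * n_requests))
    return tokens_per_second(sum(counts), time.time() - start)

# On the server: print(f"{run_benchmark():,.0f} tokens/sec")
```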


Step 9 — Add OpenAI-Compatible API Wrapper

For easier integration, add a FastAPI wrapper:
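
One possible shape for such a wrapper, as a sketch: the chat template below is the (simplified) Llama 3.1 instruct format, and the Triton model name and field names assume the default tensorrtllm_backend templates.

```python
import time
import uuid

import requests
from fastapi import FastAPI
from pydantic import BaseModel

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"
app = FastAPI()

class ChatRequest(BaseModel):
    model: str = "llama-3.1-8b"
    messages: list[dict]
    max_tokens: int = 256
    temperature: float = 0.7

def to_prompt(messages: list[dict]) -> str:
    # Simplified Llama 3.1 instruct chat template
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    r = requests.post(TRITON_URL, json={
        "text_input": to_prompt(req.messages),
        "max_tokens": req.max_tokens,
        "temperature": req.temperature,
    }, timeout=120)
    r.raise_for_status()
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": r.json()["text_output"]},
            "finish_reason": "stop",
        }],
    }
```

Run it with `uvicorn wrapper:app --port 9000`, then point any OpenAI-compatible client at `http://localhost:9000/v1`.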


Troubleshooting

Engine Build OOM

Lower `--max_batch_size` and `--max_input_len` in the `trtllm-build` call, or build a quantized (INT8/INT4) engine; checkpoint conversion is memory-hungry and can be run on a host with more RAM.

Triton Server Not Starting

Check the container logs first. Common causes: `--world_size` not matching the engine's tensor parallelism, engine or tokenizer paths wrong in the config templates, or ports 8000-8002 already in use. `curl localhost:8000/v2/health/ready` should return 200 once all ensemble models are loaded.

Low Throughput

Confirm in-flight batching and the paged KV cache are enabled in the model config, and that clients send enough concurrent requests to keep batches full; a single sequential client cannot saturate the engine.


Performance Benchmarks on Clore.ai GPUs

| Model | GPU | Quantization | Throughput (tokens/sec) |
| --- | --- | --- | --- |
| Llama 3.1 8B | RTX 4090 | FP16 | ~3,500 |
| Llama 3.1 8B | RTX 4090 | INT4 AWQ | ~6,200 |
| Llama 3.1 70B | 2x A100 80GB | FP16 | ~1,800 |
| Mixtral 8x7B | 2x RTX 4090 | INT8 | ~2,400 |


Additional Resources


TensorRT-LLM on Clore.ai is the optimal choice for production LLM serving where throughput and latency are critical. For simpler setups, consider the vLLM guide.


Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production Inference | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Models (70B+) | A100 80GB | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
