TGI (Text Generation Inference)

Run HuggingFace Text Generation Inference (TGI) for production LLM serving on Clore.ai GPUs

Text Generation Inference (TGI) is HuggingFace's production-grade LLM serving framework, designed for high-throughput and low-latency inference. It supports Flash Attention 2, continuous batching, PagedAttention, and tensor parallelism out of the box — making it the go-to solution for deploying large language models at scale on CLORE.AI GPU servers.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 16 GB | 32 GB+ |
| VRAM | 8 GB | 24 GB+ |
| Disk | 50 GB | 200 GB+ |
| GPU | Any NVIDIA (Ampere+ for Flash Attention) | A100, H100, RTX 4090 |

ℹ️ Flash Attention 2 requires Ampere architecture or newer (RTX 3000+, A100, H100). For older GPUs, TGI will fall back to standard attention automatically.

Quick Deploy on CLORE.AI

Docker Image: `ghcr.io/huggingface/text-generation-inference:latest`

Ports: `22/tcp`, `8080/http`

Environment Variables:

| Variable | Example | Description |
| --- | --- | --- |
| `MODEL_ID` | `mistralai/Mistral-7B-Instruct-v0.3` | HuggingFace model ID |
| `HF_TOKEN` | `hf_xxx...` | HuggingFace token (for gated models) |
| `NUM_SHARD` | `2` | Number of GPUs for tensor parallelism |
| `MAX_INPUT_LENGTH` | `4096` | Max input tokens |
| `MAX_TOTAL_TOKENS` | `8192` | Max input + output tokens |
| `QUANTIZE` | `bitsandbytes-nf4` | Quantization method |

Step-by-Step Setup

1. Rent a GPU Server on CLORE.AI

Go to the CLORE.AI Marketplace and filter servers by:

  • VRAM ≥ 24 GB for 7B models (full precision)

  • VRAM ≥ 12 GB for 7B models (4-bit quantization)

  • VRAM ≥ 80 GB for 70B models (full precision, single GPU)

2. Connect via SSH

After your order is confirmed, connect to your server using the SSH details from your CLORE.AI dashboard:
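A typical connection looks like the following — the host, port, and user are placeholders for the values shown in your dashboard:

```bash
# Replace the host and port with the SSH details from your CLORE.AI order
ssh -p <ssh-port> root@<server-ip>
```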

Or use the Web Terminal from your CLORE.AI order panel.

3. Pull the TGI Docker Image
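Pull the image named in the Quick Deploy section above:

```bash
docker pull ghcr.io/huggingface/text-generation-inference:latest
```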

4. Launch TGI with a Model

Basic launch (Mistral 7B):
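A minimal launch sketch, assuming host port 8080 is mapped to TGI's internal port 80 and models are cached under a `/data` volume:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3
```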

With HuggingFace token (for gated models like Llama 3):
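The same launch with a token passed via the `HF_TOKEN` environment variable — the token value and the Llama 3 model ID below are placeholders:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct
```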

With 4-bit quantization (for smaller VRAM):
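A sketch adding the `--quantize bitsandbytes-nf4` flag to fit a 7B model into roughly half the VRAM:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --quantize bitsandbytes-nf4
```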

Multi-GPU tensor parallelism (for 70B models):
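A multi-GPU sketch using `--num-shard` — the model ID and shard count are examples; match the shard count to the GPUs on your server:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 2
```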

5. Verify the Server is Running
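Assuming the port mapping above (host port 8080), a quick health check might look like:

```bash
curl http://localhost:8080/health
```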

Expected response: `{"status":"ok"}`

6. Access via CLORE.AI HTTP Proxy

In your CLORE.AI order panel, you'll see your http_pub URL for port 8080. This allows browser/API access without SSH tunneling:
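For example — the hostname below is a placeholder for your actual http_pub URL:

```bash
curl https://<your-http-pub-host>/generate \
  -X POST -H 'Content-Type: application/json' \
  -d '{"inputs":"Hello","parameters":{"max_new_tokens":20}}'
```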


Usage Examples

Example 1: Basic Text Generation
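A sketch against TGI's native `/generate` endpoint, assuming the server is reachable on `localhost:8080`:

```bash
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":100,"temperature":0.7}}'
```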

Example 2: Chat Completions (OpenAI-compatible)

TGI supports the OpenAI chat completions API format:
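A sketch against the `/v1/chat/completions` endpoint — the `model` field is commonly set to `"tgi"`, since a single-model server serves whatever it was launched with:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Explain continuous batching in one sentence."}],
    "max_tokens": 100
  }'
```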

Example 3: Streaming Response
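Token-by-token streaming uses server-sent events on `/generate_stream`; curl's `-N` flag disables output buffering so tokens appear as they arrive:

```bash
curl -N http://localhost:8080/generate_stream \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"Write a haiku about GPUs","parameters":{"max_new_tokens":60}}'
```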

Example 4: Python Client
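A sketch using the `huggingface_hub` client (`pip install huggingface_hub`); the URL is a placeholder for your server or http_pub address:

```python
from huggingface_hub import InferenceClient

# Point the client at your TGI server (or your CLORE.AI http_pub URL)
client = InferenceClient("http://localhost:8080")

# Blocking generation
print(client.text_generation("What is deep learning?", max_new_tokens=100))

# Streaming generation, token by token
for token in client.text_generation("Write a haiku about GPUs",
                                    max_new_tokens=60, stream=True):
    print(token, end="", flush=True)
```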

Example 5: Batch Requests
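TGI's continuous batching handles concurrent requests on the server side, so client-side "batching" is simply firing requests in parallel. A sketch with a thread pool — the URL and prompts are examples:

```python
import concurrent.futures
import requests

URL = "http://localhost:8080/generate"  # adjust to your server
prompts = [
    "What is deep learning?",
    "Explain tensor parallelism.",
    "What does 4-bit quantization trade off?",
]

def generate(prompt):
    r = requests.post(URL, json={"inputs": prompt,
                                 "parameters": {"max_new_tokens": 64}}, timeout=120)
    r.raise_for_status()
    return r.json()["generated_text"]

# Requests sent concurrently are batched together by TGI's scheduler
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for prompt, text in zip(prompts, pool.map(generate, prompts)):
        print(f"{prompt}\n  -> {text.strip()[:80]}\n")
```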


Configuration

Key CLI Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `--model-id` | required | HuggingFace model ID or local path |
| `--num-shard` | `1` | Number of GPU shards (tensor parallelism) |
| `--max-concurrent-requests` | `128` | Max simultaneous requests |
| `--max-input-length` | `1024` | Max input token length |
| `--max-total-tokens` | `2048` | Max input + output tokens |
| `--max-batch-total-tokens` | auto | Max tokens per batch |
| `--quantize` | none | Quantization: `bitsandbytes-nf4`, `gptq`, `awq` |
| `--dtype` | auto | `float16`, `bfloat16` |
| `--trust-remote-code` | `false` | Allow custom model code |
| `--port` | `80` | Server port |

Using a Local Model

If you have a model downloaded locally:
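A sketch mounting a host directory into the container and pointing `--model-id` at the mounted path — the host path and model directory are placeholders:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /root/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/mistral-7b
```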

AWQ Quantization (Faster than NF4)
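A sketch using a community AWQ checkpoint with `--quantize awq` — the model ID is an example; any AWQ-quantized repo works:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantize awq
```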


Performance Tips

1. Enable Flash Attention 2

Flash Attention 2 is automatically enabled on Ampere+ GPUs (RTX 3000+, A100, H100). No extra configuration needed.

2. Tune Max Batch Size

For high-throughput scenarios, increase batch size:
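For example — the values below are starting points to tune against your VRAM, not universal defaults:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --max-batch-total-tokens 65536 \
  --max-concurrent-requests 256
```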

3. Use bfloat16 on Ampere+ GPUs

This is more numerically stable than float16 and performs identically on modern GPUs.
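Enable it explicitly with the `--dtype` flag (note that `--dtype` cannot be combined with `--quantize`):

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype bfloat16
```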

4. Pre-download Models to Persistent Storage
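One way to do this is with `huggingface-cli` on the host — the target directory below is an example:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir /root/models/mistral-7b
```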

Then mount the local path to avoid re-downloading on restarts.

5. GPU Memory Management

For RTX 3090/4090 (24GB VRAM):
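A sketch combining 4-bit quantization with a slightly reduced CUDA memory fraction to leave headroom — the values are starting points, not hard rules:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --quantize bitsandbytes-nf4 \
  --cuda-memory-fraction 0.9
```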

6. Speculative Decoding

For faster generation with smaller models as draft:
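TGI's built-in option is the `--speculate` flag, which speculates N tokens ahead per step (n-gram speculation unless a model with Medusa heads is loaded); the value below is an example:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --speculate 3
```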


Troubleshooting

Problem: "CUDA out of memory"

Solution: Reduce --max-total-tokens or enable quantization:
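For example:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --quantize bitsandbytes-nf4 \
  --max-total-tokens 4096
```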

Problem: Model download is slow

Solution: Use HuggingFace mirror or pre-download:
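A sketch of both routes — mirror endpoints vary, so treat the `HF_ENDPOINT` value as a placeholder:

```bash
# Option A: download once on the host, then mount the directory into the container
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir /root/models/mistral-7b

# Option B: point the container at a HuggingFace mirror endpoint
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_ENDPOINT=https://<mirror-host> \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3
```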

Problem: Server not accessible via http_pub

Solution: Make sure port 8080 is mapped correctly. TGI listens on port 80 internally, but you map it to 8080 externally:
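The mapping in question is the `-p 8080:80` part of the run command:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3
```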

Problem: "trust_remote_code is required"

Some models (e.g., Falcon, Phi) require custom code:
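Pass the flag explicitly — the Phi-2 model ID below is one example of a model that has required it:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id microsoft/phi-2 \
  --trust-remote-code
```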

Problem: Slow first response

The first request triggers model loading into VRAM. This is normal. Subsequent requests will be fast.

Problem: Container exits immediately
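Solution: Check the container logs for the crash reason (commonly CUDA OOM, an invalid flag, or a failed model download); `<container_id>` is whatever `docker ps -a` reports:

```bash
docker ps -a                  # find the exited TGI container
docker logs <container_id>    # inspect the error that caused the exit
```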



Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production (7B–13B) | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Models (70B+) | A100 80GB / H100 | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
