# TGI (Text Generation Inference)

Text Generation Inference (TGI) is HuggingFace's production-grade LLM serving framework, designed for high-throughput and low-latency inference. It supports Flash Attention 2, continuous batching, PagedAttention, and tensor parallelism out of the box — making it the go-to solution for deploying large language models at scale on CLORE.AI GPU servers.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum                                  | Recommended          |
| --------- | ---------------------------------------- | -------------------- |
| RAM       | 16 GB                                    | 32 GB+               |
| VRAM      | 8 GB                                     | 24 GB+               |
| Disk      | 50 GB                                    | 200 GB+              |
| GPU       | Any NVIDIA (Ampere+ for Flash Attention) | A100, H100, RTX 4090 |

{% hint style="info" %}
Flash Attention 2 requires Ampere architecture or newer (RTX 3000+, A100, H100). For older GPUs, TGI will fall back to standard attention automatically.
{% endhint %}
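
If you're not sure which architecture your rented GPU uses, you can check its CUDA compute capability: 8.0 or higher means Ampere or newer, i.e. Flash Attention 2 capable. A minimal sketch, assuming PyTorch with CUDA is available in your container:

```python
import torch

# Compute capability (major, minor): 8.0+ means Ampere or newer (Flash Attention 2 capable)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
print("Flash Attention 2 supported:", major >= 8)
print("bfloat16 supported:", torch.cuda.is_bf16_supported())
```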

## Quick Deploy on CLORE.AI

**Docker Image:** `ghcr.io/huggingface/text-generation-inference:latest`

**Ports:** `22/tcp`, `8080/http`

**Environment Variables:**

| Variable           | Example                              | Description                           |
| ------------------ | ------------------------------------ | ------------------------------------- |
| `MODEL_ID`         | `mistralai/Mistral-7B-Instruct-v0.3` | HuggingFace model ID                  |
| `HF_TOKEN`         | `hf_xxx...`                          | HuggingFace token (for gated models)  |
| `NUM_SHARD`        | `2`                                  | Number of GPUs for tensor parallelism |
| `MAX_INPUT_LENGTH` | `4096`                               | Max input tokens                      |
| `MAX_TOTAL_TOKENS` | `8192`                               | Max input + output tokens             |
| `QUANTIZE`         | `bitsandbytes-nf4`                   | Quantization method                   |

## Step-by-Step Setup

### 1. Rent a GPU Server on CLORE.AI

Go to [CLORE.AI Marketplace](https://clore.ai/marketplace) and filter servers by:

* VRAM ≥ 24 GB for 7B models (16-bit weights)
* VRAM ≥ 12 GB for 7B models (4-bit quantization)
* VRAM ≥ 140 GB total (e.g., 2× A100 80 GB) for 70B models in 16-bit; a single 80 GB GPU fits a 70B model only with 4-bit quantization (see the estimation sketch below)
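
These thresholds come from a simple back-of-the-envelope rule: weight memory is roughly `parameters × bytes per parameter`, plus headroom for the KV cache and activations. A rough estimator, purely illustrative:

```python
def estimate_vram_gb(params_billion: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Very rough VRAM estimate: weight memory plus ~20% headroom for KV cache and activations."""
    weight_gb = params_billion * bits_per_param / 8  # billions of params * bytes each = GB
    return weight_gb * overhead

# 7B fp16 ≈ 17 GB, 7B 4-bit ≈ 4 GB, 70B fp16 ≈ 168 GB, 70B 4-bit ≈ 42 GB
for name, params, bits in [("7B fp16", 7, 16), ("7B 4-bit", 7, 4), ("70B fp16", 70, 16), ("70B 4-bit", 70, 4)]:
    print(f"{name}: ~{estimate_vram_gb(params, bits):.0f} GB")
```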

### 2. Connect via SSH

After your order is confirmed, connect to your server using the SSH details from your CLORE.AI dashboard:

```bash
ssh -p <PORT> root@<SERVER_IP>
```

Or use the Web Terminal from your CLORE.AI order panel.

### 3. Pull the TGI Docker Image

```bash
docker pull ghcr.io/huggingface/text-generation-inference:latest
```

### 4. Launch TGI with a Model

**Basic launch (Mistral 7B):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  -e MODEL_ID=mistralai/Mistral-7B-Instruct-v0.3 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

**With HuggingFace token (for gated models like Llama 3):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=hf_your_token_here \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --max-input-length 8192 \
  --max-total-tokens 16384
```

**With 4-bit quantization (for smaller VRAM):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --quantize bitsandbytes-nf4 \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

**Multi-GPU tensor parallelism (for 70B models):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 2g \
  -p 8080:80 \
  -v /root/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 2 \
  --max-input-length 8192 \
  --max-total-tokens 16384
```

### 5. Verify the Server is Running

```bash
# Check logs
docker logs -f tgi

# Wait for "Connected" message, then test:
curl http://localhost:8080/health
```

Expected result: an HTTP `200 OK` response once the model has finished loading (the body may be empty).
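
Model download and loading can take several minutes, so it is convenient to poll the health endpoint until it answers. A minimal sketch with `requests` (the URL is an assumption; substitute your server's address or http_pub URL):

```python
import time
import requests

BASE_URL = "http://localhost:8080"  # or your CLORE.AI http_pub URL

# Poll /health until TGI reports ready (HTTP 200); give up after ~10 minutes.
for _ in range(120):
    try:
        if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
            print("TGI is ready")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(5)
else:
    raise SystemExit("TGI did not become healthy in time; check `docker logs tgi`")
```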

### 6. Access via CLORE.AI HTTP Proxy

In your CLORE.AI order panel, you'll see your `http_pub` URL for port 8080. This allows browser/API access without SSH tunneling:

```
https://<order-id>.clore.ai/
```

***

## Usage Examples

### Example 1: Basic Text Generation

```bash
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is the capital of France?",
    "parameters": {
      "max_new_tokens": 100,
      "temperature": 0.7
    }
  }'
```

### Example 2: Chat Completions (OpenAI-compatible)

TGI supports the OpenAI chat completions API format:

```bash
curl http://localhost:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512,
    "temperature": 0.8,
    "stream": false
  }'
```
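
Because the endpoint is OpenAI-compatible, the official `openai` Python package (v1+) can talk to TGI by overriding the base URL. A sketch, assuming the package is installed; the API key is a placeholder that TGI ignores:

```python
from openai import OpenAI

# Point the OpenAI client at the TGI server; the key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model, so the name is just a label
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum entanglement in simple terms."},
    ],
    max_tokens=512,
    temperature=0.8,
)
print(response.choices[0].message.content)
```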

### Example 3: Streaming Response

```bash
curl http://localhost:8080/generate_stream \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Write a Python function to calculate Fibonacci numbers:",
    "parameters": {
      "max_new_tokens": 300,
      "temperature": 0.2
    }
  }' \
  --no-buffer
```
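
`/generate_stream` returns server-sent events, one `data:` line per generated token. A minimal Python sketch that prints tokens as they arrive; it assumes each event carries a `token.text` field (see the TGI API reference for the full event schema):

```python
import json
import requests

BASE_URL = "http://localhost:8080"  # or your CLORE.AI http_pub URL

payload = {
    "inputs": "Write a Python function to calculate Fibonacci numbers:",
    "parameters": {"max_new_tokens": 300, "temperature": 0.2},
}

# Each event is a line of the form "data:{...json...}"; print token text as it streams in.
with requests.post(f"{BASE_URL}/generate_stream", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)
print()
```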

### Example 4: Python Client

```python
from huggingface_hub import InferenceClient

# Replace with your CLORE.AI http_pub URL
client = InferenceClient(model="http://localhost:8080")

# Simple generation
response = client.text_generation(
    "Translate to French: Hello, how are you?",
    max_new_tokens=100,
    temperature=0.7,
)
print(response)

# Chat
for token in client.chat_completion(
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=200,
    stream=True,
):
    print(token.choices[0].delta.content, end="", flush=True)
```

### Example 5: Batch Requests

```python
import requests

BASE_URL = "http://localhost:8080"  # or your CLORE.AI http_pub URL

prompts = [
    "Summarize the French Revolution in 3 sentences.",
    "Write a haiku about GPU computing.",
    "What are the main benefits of Rust over C++?",
]

results = []
for prompt in prompts:
    response = requests.post(
        f"{BASE_URL}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 150}},
    )
    results.append(response.json()["generated_text"])

for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nAnswer: {result}\n{'-'*50}")
```
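
The loop above sends prompts one at a time. Since TGI batches requests continuously on the server, issuing them concurrently usually improves throughput; a sketch using a thread pool (the worker count is an arbitrary choice):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:8080"  # or your CLORE.AI http_pub URL

prompts = [
    "Summarize the French Revolution in 3 sentences.",
    "Write a haiku about GPU computing.",
    "What are the main benefits of Rust over C++?",
]

def generate(prompt: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 150}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Concurrent requests let TGI's continuous batching schedule them together.
with ThreadPoolExecutor(max_workers=8) as pool:
    for prompt, result in zip(prompts, pool.map(generate, prompts)):
        print(f"Prompt: {prompt}\nAnswer: {result}\n{'-' * 50}")
```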

***

## Configuration

### Key CLI Parameters

| Parameter                   | Default  | Description                                     |
| --------------------------- | -------- | ----------------------------------------------- |
| `--model-id`                | required | HuggingFace model ID or local path              |
| `--num-shard`               | 1        | Number of GPU shards (tensor parallelism)       |
| `--max-concurrent-requests` | 128      | Max simultaneous requests                       |
| `--max-input-length`        | 1024     | Max input token length                          |
| `--max-total-tokens`        | 2048     | Max input + output tokens                       |
| `--max-batch-total-tokens`  | auto     | Max tokens per batch                            |
| `--quantize`                | none     | Quantization: `bitsandbytes-nf4`, `gptq`, `awq` |
| `--dtype`                   | auto     | `float16`, `bfloat16`                           |
| `--trust-remote-code`       | false    | Allow custom model code                         |
| `--port`                    | 80       | Server port                                     |
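
To confirm which limits the running server actually picked up, TGI exposes an `/info` endpoint describing the loaded model and its configuration. A quick check; exact field names can differ between TGI versions, so missing keys simply print `None`:

```python
import requests

# /info returns JSON describing the loaded model and the active limits.
info = requests.get("http://localhost:8080/info", timeout=10).json()
for key in ("model_id", "max_input_tokens", "max_total_tokens", "max_concurrent_requests"):
    print(key, "=", info.get(key))
```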

### Using a Local Model

If you have a model downloaded locally:

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /path/to/your/model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
```

### AWQ Quantization (Faster than NF4)

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id casperhansen/mistral-7b-instruct-v0.2-awq \
  --quantize awq
```

***

## Performance Tips

### 1. Enable Flash Attention 2

Flash Attention 2 is automatically enabled on Ampere+ GPUs (RTX 3000+, A100, H100). No extra configuration needed.

### 2. Tune Max Batch Size

For high-throughput scenarios, increase batch size:

```bash
--max-batch-total-tokens 32000 \
--max-waiting-tokens 20
```

### 3. Use bfloat16 on Ampere+ GPUs

```bash
--dtype bfloat16
```

bfloat16 has a wider dynamic range than float16 (fewer overflow issues during generation) and runs at the same speed on Ampere and newer GPUs.

### 4. Pre-download Models to Persistent Storage

```bash
# On the server, pre-download before starting TGI
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('mistralai/Mistral-7B-Instruct-v0.3', local_dir='/root/models/mistral-7b')
"
```

Then mount the local path to avoid re-downloading on restarts.

### 5. GPU Memory Management

For RTX 3090/4090 (24GB VRAM):

```bash
# 7B model in float16 fits perfectly
--max-total-tokens 8192

# 13B model needs quantization
--quantize bitsandbytes-nf4
--max-total-tokens 4096
```

### 6. Speculative Decoding

TGI can speculate several tokens per forward pass (via n-gram speculation, or Medusa heads when the model ships with them), which speeds up generation of predictable text:

```bash
--speculate 4  # Number of speculative tokens
```

***

## Troubleshooting

### Problem: "CUDA out of memory"

```
Error: CUDA out of memory. Tried to allocate X GiB
```

**Solution:** Reduce `--max-total-tokens` or enable quantization:

```bash
--quantize bitsandbytes-nf4
--max-total-tokens 4096
```
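
To see how much headroom is left before hitting the limit, you can inspect GPU memory usage from Python with the NVML bindings (`pip install nvidia-ml-py`), or simply run `nvidia-smi` in the terminal. This is only a monitoring sketch, not part of TGI:

```python
import pynvml  # provided by the nvidia-ml-py package

# Print per-GPU memory usage to see how much VRAM TGI is consuming.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    mem = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
    print(f"GPU {i}: {mem.used / 1e9:.1f} GB used / {mem.total / 1e9:.1f} GB total")
pynvml.nvmlShutdown()
```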

### Problem: Model download is slow

**Solution:** Use HuggingFace mirror or pre-download:

```bash
# Set mirror
-e HF_ENDPOINT=https://hf-mirror.com
```

### Problem: Server not accessible via http\_pub

**Solution:** Make sure port 8080 is mapped correctly. TGI listens on port 80 internally, but you map it to 8080 externally:

```bash
-p 8080:80  # host:container
```

### Problem: "trust\_remote\_code is required"

Some models (e.g., Falcon, Phi) require custom code:

```bash
--trust-remote-code
```

### Problem: Slow first response

TGI loads the model into VRAM and runs a warmup pass while the container starts; requests sent before the router logs `Connected` will wait or fail. This is normal, and responses are fast once loading completes.

```bash
# Check loading progress
docker logs -f tgi | grep -E "Connected|Error|Loading"
```

### Problem: Container exits immediately

```bash
# Check for errors
docker logs tgi

# Common fix: increase shared memory
--shm-size 2g
```

***

## Links

* [GitHub](https://github.com/huggingface/text-generation-inference)
* [Documentation](https://huggingface.co/docs/text-generation-inference)
* [Docker Hub / GHCR](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
* [Supported Models](https://huggingface.co/docs/text-generation-inference/supported_models)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

***

## Clore.ai GPU Recommendations

| Use Case            | Recommended GPU  | Est. Cost on Clore.ai |
| ------------------- | ---------------- | --------------------- |
| Development/Testing | RTX 3090 (24GB)  | \~$0.12/gpu/hr        |
| Production (7B–13B) | RTX 4090 (24GB)  | \~$0.70/gpu/hr        |
| Large Models (70B+) | A100 80GB / H100 | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
