# TGI (Text Generation Inference)

Text Generation Inference (TGI) is HuggingFace's production-grade LLM serving framework, designed for high-throughput and low-latency inference. It supports Flash Attention 2, continuous batching, PagedAttention, and tensor parallelism out of the box — making it the go-to solution for deploying large language models at scale on CLORE.AI GPU servers.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum                                  | Recommended          |
| --------- | ---------------------------------------- | -------------------- |
| RAM       | 16 GB                                    | 32 GB+               |
| VRAM      | 8 GB                                     | 24 GB+               |
| Disk      | 50 GB                                    | 200 GB+              |
| GPU       | Any NVIDIA (Ampere+ for Flash Attention) | A100, H100, RTX 4090 |

{% hint style="info" %}
Flash Attention 2 requires Ampere architecture or newer (RTX 3000+, A100, H100). For older GPUs, TGI will fall back to standard attention automatically.
{% endhint %}
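
If you're not sure which architecture a listed GPU uses, its CUDA compute capability tells you whether Flash Attention 2 will be active (Ampere is 8.x, Hopper is 9.0). A quick check, assuming PyTorch is installed on the rented server:

```python
import torch

# Ampere (RTX 30xx, A100) is compute capability 8.0+; Ada (RTX 40xx) is 8.9, Hopper (H100) is 9.0.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected")

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    vram_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
    fa2 = "yes" if (major, minor) >= (8, 0) else "no (TGI falls back to standard attention)"
    print(f"GPU {i}: {name}, {vram_gb:.0f} GB VRAM, compute {major}.{minor}, Flash Attention 2: {fa2}")
```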

## Quick Deploy on CLORE.AI

**Docker Image:** `ghcr.io/huggingface/text-generation-inference:latest`

**Ports:** `22/tcp`, `8080/http`

**Environment Variables:**

| Variable           | Example                              | Description                           |
| ------------------ | ------------------------------------ | ------------------------------------- |
| `MODEL_ID`         | `mistralai/Mistral-7B-Instruct-v0.3` | HuggingFace model ID                  |
| `HF_TOKEN`         | `hf_xxx...`                          | HuggingFace token (for gated models)  |
| `NUM_SHARD`        | `2`                                  | Number of GPUs for tensor parallelism |
| `MAX_INPUT_LENGTH` | `4096`                               | Max input tokens                      |
| `MAX_TOTAL_TOKENS` | `8192`                               | Max input + output tokens             |
| `QUANTIZE`         | `bitsandbytes-nf4`                   | Quantization method                   |

## Step-by-Step Setup

### 1. Rent a GPU Server on CLORE.AI

Go to [CLORE.AI Marketplace](https://clore.ai/marketplace) and filter servers by:

* VRAM ≥ 24 GB for 7B models (unquantized fp16)
* VRAM ≥ 12 GB for 7B models (4-bit quantization)
* VRAM ≥ 80 GB for 70B models (quantized, single GPU); unquantized fp16 weights alone take roughly 140 GB, so plan for two 80 GB GPUs with tensor parallelism
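
These thresholds follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus headroom for the KV cache and CUDA context. A back-of-the-envelope sketch (the 20% overhead factor is an assumed round number; long contexts need more KV-cache headroom than it allows for):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, with a flat overhead factor for KV cache / CUDA context."""
    return params_billion * 1e9 * bytes_per_param * overhead / 1024**3

for label, params, bytes_per_param in [
    ("7B  fp16 ", 7, 2),     # half precision
    ("7B  4-bit", 7, 0.5),   # e.g. bitsandbytes-nf4
    ("70B fp16 ", 70, 2),
    ("70B 4-bit", 70, 0.5),
]:
    print(f"{label}: ~{estimate_vram_gb(params, bytes_per_param):.0f} GB")
```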

### 2. Connect via SSH

After your order is confirmed, connect to your server using the SSH details from your CLORE.AI dashboard:

```bash
ssh -p <PORT> root@<SERVER_IP>
```

Or use the Web Terminal from your CLORE.AI order panel.

### 3. Pull the TGI Docker Image

```bash
docker pull ghcr.io/huggingface/text-generation-inference:latest
```

### 4. Launch TGI with a Model

**Basic launch (Mistral 7B):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  -e MODEL_ID=mistralai/Mistral-7B-Instruct-v0.3 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

**With HuggingFace token (for gated models like Llama 3):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=hf_your_token_here \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --max-input-length 8192 \
  --max-total-tokens 16384
```

**With 4-bit quantization (to fit larger models into less VRAM):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --quantize bitsandbytes-nf4 \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

**Multi-GPU tensor parallelism (for 70B models):**

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 2g \
  -p 8080:80 \
  -v /root/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 2 \
  --max-input-length 8192 \
  --max-total-tokens 16384
```

### 5. Verify the Server is Running

```bash
# Check logs
docker logs -f tgi

# Wait for "Connected" message, then test:
curl http://localhost:8080/health
```

Expected result: HTTP `200`, which means the model is loaded and the server is ready to accept requests.
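
When scripting deployments, it is convenient to poll `/health` until the model has finished loading. A minimal sketch using `requests` (the 15-minute deadline is an arbitrary choice):

```python
import time
import requests

BASE_URL = "http://localhost:8080"  # or your CLORE.AI http_pub URL

deadline = time.time() + 15 * 60  # large models can take several minutes to download and load
while time.time() < deadline:
    try:
        if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
            print("TGI is ready")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(10)
else:
    raise SystemExit("TGI did not become healthy in time; check `docker logs tgi`")
```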

### 6. Access via CLORE.AI HTTP Proxy

In your CLORE.AI order panel, you'll see your `http_pub` URL for port 8080. This allows browser/API access without SSH tunneling:

```
https://<order-id>.clore.ai/
```

***

## Usage Examples

### Example 1: Basic Text Generation

```bash
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is the capital of France?",
    "parameters": {
      "max_new_tokens": 100,
      "temperature": 0.7
    }
  }'
```

### Example 2: Chat Completions (OpenAI-compatible)

TGI supports the OpenAI chat completions API format:

```bash
curl http://localhost:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512,
    "temperature": 0.8,
    "stream": false
  }'
```
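
Because the endpoint follows the OpenAI schema, the official `openai` Python SDK can also talk to TGI by overriding `base_url`. A minimal sketch (the API key is a placeholder, since TGI does not require one by default):

```python
from openai import OpenAI

# Point the OpenAI SDK at the TGI server; the API key is a dummy value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tgi",  # TGI serves whatever model it was launched with
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum entanglement in simple terms."},
    ],
    max_tokens=512,
    temperature=0.8,
)
print(response.choices[0].message.content)
```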

### Example 3: Streaming Response

```bash
curl http://localhost:8080/generate_stream \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Write a Python function to calculate Fibonacci numbers:",
    "parameters": {
      "max_new_tokens": 300,
      "temperature": 0.2
    }
  }' \
  --no-buffer
```

### Example 4: Python Client

```python
from huggingface_hub import InferenceClient

# Replace with your CLORE.AI http_pub URL
client = InferenceClient(model="http://localhost:8080")

# Simple generation
response = client.text_generation(
    "Translate to French: Hello, how are you?",
    max_new_tokens=100,
    temperature=0.7,
)
print(response)

# Chat
for token in client.chat_completion(
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=200,
    stream=True,
):
    print(token.choices[0].delta.content, end="", flush=True)
```

### Example 5: Batch Requests

```python
import requests

BASE_URL = "http://localhost:8080"  # or your CLORE.AI http_pub URL

prompts = [
    "Summarize the French Revolution in 3 sentences.",
    "Write a haiku about GPU computing.",
    "What are the main benefits of Rust over C++?",
]

results = []
for prompt in prompts:
    response = requests.post(
        f"{BASE_URL}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 150}},
    )
    results.append(response.json()["generated_text"])

for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nAnswer: {result}\n{'-'*50}")
```
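
TGI's continuous batching works across requests, so submitting prompts concurrently usually finishes sooner than the sequential loop above. A sketch of the same batch using a thread pool:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8080"  # or your CLORE.AI http_pub URL

prompts = [
    "Summarize the French Revolution in 3 sentences.",
    "Write a haiku about GPU computing.",
    "What are the main benefits of Rust over C++?",
]

def generate(prompt: str) -> str:
    response = requests.post(
        f"{BASE_URL}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 150}},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["generated_text"]

# All requests are in flight at once, so the server can batch them together.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    results = list(pool.map(generate, prompts))

for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nAnswer: {result}\n{'-'*50}")
```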

***

## Configuration

### Key CLI Parameters

| Parameter                   | Default  | Description                                     |
| --------------------------- | -------- | ----------------------------------------------- |
| `--model-id`                | required | HuggingFace model ID or local path              |
| `--num-shard`               | 1        | Number of GPU shards (tensor parallelism)       |
| `--max-concurrent-requests` | 128      | Max simultaneous requests                       |
| `--max-input-length`        | 1024     | Max input token length                          |
| `--max-total-tokens`        | 2048     | Max input + output tokens                       |
| `--max-batch-total-tokens`  | auto     | Max tokens per batch                            |
| `--quantize`                | none     | Quantization: `bitsandbytes-nf4`, `gptq`, `awq` |
| `--dtype`                   | auto     | `float16`, `bfloat16`                           |
| `--trust-remote-code`       | false    | Allow custom model code                         |
| `--port`                    | 80       | Server port                                     |
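
To confirm which limits the running server actually applied, you can query its `/info` endpoint. A minimal check (field names vary slightly between TGI versions, so this simply prints the whole payload):

```python
import json
import requests

# Ask the running server for its configuration (model ID, token limits, version, ...).
info = requests.get("http://localhost:8080/info", timeout=10).json()
print(json.dumps(info, indent=2))
```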

### Using a Local Model

If you have a model downloaded locally:

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /path/to/your/model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
```

### AWQ Quantization (Faster than NF4)

```bash
docker run -d \
  --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /root/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id casperhansen/mistral-7b-instruct-v0.2-awq \
  --quantize awq
```

***

## Performance Tips

### 1. Enable Flash Attention 2

Flash Attention 2 is automatically enabled on Ampere+ GPUs (RTX 3000+, A100, H100). No extra configuration needed.

### 2. Tune Max Batch Size

For high-throughput scenarios, increase batch size:

```bash
--max-batch-total-tokens 32000 \
--max-waiting-tokens 20
```
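
How high you can push `--max-batch-total-tokens` is bounded by the VRAM left over after the weights, since every batched token holds an entry in the KV cache. A rough estimate, assuming Mistral-7B-style dimensions (32 layers, 8 KV heads, head dim 128) and an fp16 cache:

```python
# Rough KV-cache budget for --max-batch-total-tokens.
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
batch_total_tokens = 32_000

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for {batch_total_tokens} batched tokens: "
      f"{kv_bytes_per_token * batch_total_tokens / 1024**3:.1f} GiB")
```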

### 3. Use bfloat16 on Ampere+ GPUs

```bash
--dtype bfloat16
```

bfloat16 keeps the dynamic range of float32, so it is more numerically stable than float16, and it runs at the same speed on Ampere and newer GPUs.

### 4. Pre-download Models to Persistent Storage

```bash
# On the server, pre-download before starting TGI
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('mistralai/Mistral-7B-Instruct-v0.3', local_dir='/root/models/mistral-7b')
"
```

Then mount the local path to avoid re-downloading on restarts.

### 5. GPU Memory Management

For RTX 3090/4090 (24GB VRAM):

```bash
# 7B model in float16 fits comfortably (~15 GB of weights, leaving room for the KV cache)
--max-total-tokens 8192

# 13B model needs quantization
--quantize bitsandbytes-nf4
--max-total-tokens 4096
```

### 6. Speculative Decoding

Speculative decoding drafts several tokens per step and verifies them in a single forward pass. TGI's `--speculate` flag uses Medusa heads when the model provides them and falls back to n-gram speculation otherwise:

```bash
--speculate 4  # Number of speculative tokens
```

***

## Troubleshooting

### Problem: "CUDA out of memory"

```
Error: CUDA out of memory. Tried to allocate X GiB
```

**Solution:** Reduce `--max-total-tokens` or enable quantization:

```bash
--quantize bitsandbytes-nf4
--max-total-tokens 4096
```

### Problem: Model download is slow

**Solution:** Use a HuggingFace mirror or pre-download the model:

```bash
# Set mirror
-e HF_ENDPOINT=https://hf-mirror.com
```

### Problem: Server not accessible via http\_pub

**Solution:** Make sure port 8080 is mapped correctly. TGI listens on port 80 internally, but you map it to 8080 externally:

```bash
-p 8080:80  # host:container
```

### Problem: "trust\_remote\_code is required"

Some models (e.g., Falcon, Phi) require custom code:

```bash
--trust-remote-code
```

### Problem: Slow first response

Model download and loading into VRAM happen at container startup and can take several minutes for large models; requests sent before the logs show `Connected` will fail or hang. Once loading and warmup are done, responses are fast.

```bash
# Check loading progress
docker logs -f tgi | grep -E "Connected|Error|Loading"
```

### Problem: Container exits immediately

```bash
# Check for errors
docker logs tgi

# Common fix: increase shared memory
--shm-size 2g
```

***

## Links

* [GitHub](https://github.com/huggingface/text-generation-inference)
* [Documentation](https://huggingface.co/docs/text-generation-inference)
* [Docker Hub / GHCR](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
* [Supported Models](https://huggingface.co/docs/text-generation-inference/supported_models)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

***

## Clore.ai GPU Recommendations

| Use Case            | Recommended GPU  | Est. Cost on Clore.ai |
| ------------------- | ---------------- | --------------------- |
| Development/Testing | RTX 3090 (24GB)  | \~$0.12/gpu/hr        |
| Production (7B–13B) | RTX 4090 (24GB)  | \~$0.70/gpu/hr        |
| Large Models (70B+) | A100 80GB / H100 | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

