TGI (Text Generation Inference)
Run Hugging Face Text Generation Inference (TGI) for production LLM serving on Clore.ai GPUs.
Server Requirements
| Parameter | Minimum | Recommended |
| --- | --- | --- |
Quick Deploy on Clore.ai
| Variable | Example | Description |
| --- | --- | --- |
Step-by-Step Setup
1. Rent a GPU Server on Clore.ai
2. Connect via SSH
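Connect with the host and port shown on your Clore.ai rental page. The host and port below are placeholders, not real values:

```shell
# Substitute the host and SSH port from your Clore.ai dashboard
ssh -p <PORT> root@<HOST>
```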
3. Pull the TGI Docker Image
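The official image is published on GitHub Container Registry. In production, pin a specific version tag instead of `:latest` so upgrades are deliberate:

```shell
# Pull the official TGI image
docker pull ghcr.io/huggingface/text-generation-inference:latest
```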
4. Launch TGI with a Model
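A minimal launch command, assuming host port 8080 and Mistral-7B-Instruct as an example model; adjust `--model-id` and the volume path for your rental. The container serves on port 80 internally, and `/data` caches downloaded weights across restarts:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2
```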
5. Verify the Server is Running
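Once the weights finish loading, TGI exposes `/health` and `/info` endpoints. A quick check from the server itself, assuming port 8080 as above:

```shell
# Liveness probe — returns HTTP 200 once the model is loaded
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080/health

# Model and server metadata
curl -s http://127.0.0.1:8080/info
```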
6. Access via the Clore.ai HTTP Proxy
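With the container port exposed through Clore.ai's `http_pub` feature, the same endpoints are reachable from outside. The URL below is a placeholder for the public address shown in your dashboard:

```shell
# Replace with the http_pub URL from your Clore.ai dashboard
curl -s https://<your-http-pub-url>/health
```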
Usage Examples
Example 1: Basic Text Generation
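TGI's native endpoint is `POST /generate`; the prompt goes in `inputs` and sampling options in `parameters`. Assuming the server from the setup steps on port 8080:

```shell
curl -s http://127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?",
       "parameters": {"max_new_tokens": 64, "temperature": 0.7}}'
```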
Example 2: Chat Completions (OpenAI-compatible)
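TGI also implements an OpenAI-compatible Messages API at `/v1/chat/completions`, so existing OpenAI clients can point at it with only a base-URL change. The `model` field is accepted but ignored since the server hosts a single model:

```shell
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Explain KV caching in one sentence."}],
    "max_tokens": 64
  }'
```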
Example 3: Streaming Response
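Streaming uses server-sent events: `POST /generate_stream` (or `"stream": true` on the chat endpoint) emits one `data: {json}` line per token, each carrying a `token.text` field. A minimal sketch of parsing one such event, with the sample line shaped like TGI's output but hand-written here:

```python
import json

def parse_sse_event(line: str):
    """Extract the token text from one TGI server-sent-event line,
    or return None for non-data lines (comments, keep-alives)."""
    if not line.startswith("data:"):
        return None
    payload = json.loads(line[len("data:"):].strip())
    return payload["token"]["text"]

# Hand-written sample in the shape /generate_stream emits
sample = ('data: {"token": {"id": 3, "text": " world", '
          '"logprob": -0.1, "special": false}, "generated_text": null}')
print(parse_sse_event(sample))
```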
Example 4: Python Client
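The `huggingface_hub` library ships an `InferenceClient` that speaks TGI's API directly; point it at the server's base URL. A hedged sketch, assuming the server from the setup steps and `pip install huggingface_hub`:

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:8080")

# One-shot generation
out = client.text_generation("What is deep learning?", max_new_tokens=64)
print(out)

# Token-by-token streaming
for token in client.text_generation("Tell me a joke.",
                                    max_new_tokens=64, stream=True):
    print(token, end="", flush=True)
```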
Example 5: Batch Requests
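TGI performs continuous batching on the server, so the client side of "batching" is simply issuing requests concurrently. A stdlib-only sketch with the HTTP call factored out, so you can swap in another HTTP library (the endpoint and payload match `/generate` above):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def tgi_generate(prompt: str,
                 url: str = "http://127.0.0.1:8080/generate") -> str:
    """POST one prompt to TGI's /generate endpoint."""
    body = json.dumps({"inputs": prompt,
                       "parameters": {"max_new_tokens": 64}}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

def generate_batch(prompts, generate=tgi_generate, workers=8):
    """Fire prompts concurrently; TGI batches them server-side.
    Results come back in the same order as the prompts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(generate, prompts))
```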
Configuration
Key CLI Parameters
| Parameter | Default | Description |
| --- | --- | --- |
Using a Local Model
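To serve weights already on disk, mount their directory into the container and pass the in-container path as `--model-id`. The paths below are placeholders:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /path/to/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/my-model
```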
AWQ Quantization (Faster than NF4)
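AWQ requires a checkpoint that ships AWQ-quantized weights; the model below is one example of such a repo, substitute your own:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantize awq
```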
Performance Tips
1. Enable Flash Attention 2
2. Tune Max Batch Size
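Throughput is governed mainly by the token budgets the launcher will batch together; raise them until you approach VRAM limits under load. A sketch of the relevant launcher arguments, appended to the `docker run` command from the setup steps (the values are illustrative starting points, not recommendations for your hardware):

```shell
--max-concurrent-requests 128 \
--max-batch-prefill-tokens 4096 \
--max-batch-total-tokens 16384
```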
3. Use bfloat16 on Ampere+ GPUs
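On Ampere or newer GPUs (A100, RTX 30/40 series), bfloat16 has the same memory footprint as float16 but a wider exponent range, which avoids overflow-related NaNs. Selected via the launcher's `--dtype` flag:

```shell
--dtype bfloat16
```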
4. Pre-download Models to Persistent Storage
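Downloading weights once to persistent storage and mounting that path avoids re-downloading after every container restart. A sketch using the Hugging Face CLI (`pip install -U "huggingface_hub[cli]"`); the local directory is a placeholder:

```shell
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 \
  --local-dir /persistent/models/mistral-7b-instruct

# Then mount it and serve from the local path:
#   -v /persistent/models:/data ... --model-id /data/mistral-7b-instruct
```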
5. GPU Memory Management
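The launcher can be told what fraction of GPU memory it may allocate, leaving headroom for other processes on a shared machine:

```shell
--cuda-memory-fraction 0.9
```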
6. Speculative Decoding
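TGI's `--speculate` flag enables speculative decoding: the server drafts several tokens per step (via Medusa heads if the model has them, otherwise n-gram speculation) and verifies them in one forward pass, which can speed up generation on repetitive text:

```shell
--speculate 3
```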
Troubleshooting
Problem: "CUDA out of memory"
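Typical mitigations are to quantize, shrink the batch token budget, or cap prompt length. The flags below are real launcher options, but the values are illustrative; note that older TGI versions spell the last one `--max-input-length`:

```shell
--quantize awq                  # or eetq / bitsandbytes-nf4
--max-batch-total-tokens 8192   # shrink the batch token budget
--max-input-tokens 2048         # cap prompt length
```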
Problem: Model download is slow
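One option worth checking for your image tag is `hf_transfer`-accelerated downloads, toggled via an environment variable on the container (an assumption to verify; some TGI images enable it by default):

```shell
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2
```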
Problem: Server not accessible via http_pub
Problem: "trust_remote_code is required"
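Models that ship custom modeling code refuse to load until you opt in explicitly. Only do this for repositories whose code you have reviewed and trust, since it executes on your server:

```shell
--trust-remote-code
</imports>
```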
Problem: Slow first response
Problem: Container exits immediately
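The launcher prints its error (bad `--model-id`, missing `--gpus all`, insufficient VRAM, gated model without a token) to the container log before exiting, so that log is the first place to look:

```shell
docker ps -a                 # find the exited container's ID
docker logs <container-id>   # the launcher's error is printed here
```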
Links
Clore.ai GPU Recommendations
| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |