# MLC-LLM

**Universal LLM deployment through ML Compilation** — run any large language model on any hardware with maximum performance using machine learning compilation.

> 🌟 **20,000+ GitHub stars** | Maintained by the MLC AI team | Apache-2.0 License

***

## What is MLC-LLM?

MLC-LLM (Machine Learning Compilation for Large Language Models) is a universal framework for deploying large language models efficiently across diverse hardware backends. By leveraging **TVM (Tensor Virtual Machine)** as its compiler stack, MLC-LLM compiles models directly to native hardware code — achieving near-optimal performance without hardware-specific engineering.

### Key Capabilities

* **Universal hardware support** — NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, WebGPU
* **OpenAI-compatible REST API** — drop-in replacement for existing workflows
* **Multiple model formats** — Llama, Mistral, Gemma, Phi, Qwen, Falcon, and more
* **4-bit / 8-bit quantization** — run large models on consumer GPUs
* **Chat interface** — built-in CLI chat for immediate testing
* **Python & CLI tools** — flexible integration options

### Why Use MLC-LLM on Clore.ai?

The Clore.ai GPU marketplace gives you access to high-performance NVIDIA GPUs at competitive rental rates. MLC-LLM's compilation approach squeezes maximum throughput from every GPU — making it ideal for:

* Production API inference at scale
* Research and benchmarking across model sizes
* Cost-efficient serving with quantized models
* Multi-model deployment on a single GPU instance

***

## Quick Start on Clore.ai

### Step 1: Find a GPU Server

1. Go to [clore.ai](https://clore.ai) marketplace
2. Filter servers: **NVIDIA GPU**, minimum **8GB VRAM** (16GB+ recommended for 7B+ models)
3. For optimal performance: RTX 3090, RTX 4090, A100, or H100

### Step 2: Deploy MLC-LLM

{% hint style="info" %}
**Note:** MLC-LLM does not publish an official pre-built Docker image to Docker Hub. The recommended deployment approach is to use an NVIDIA CUDA base image and install MLC-LLM via pip. Use `nvidia/cuda:12.1.0-devel-ubuntu22.04` as your base image on Clore.ai.
{% endhint %}

Use an NVIDIA CUDA base image in your Clore.ai order configuration:

```
Docker Image: nvidia/cuda:12.1.0-devel-ubuntu22.04
```

**Port mappings:**

| Container Port | Purpose         |
| -------------- | --------------- |
| `22`           | SSH access      |
| `8000`         | REST API server |

**Recommended environment variables** (these are conventions for your own startup script; MLC-LLM does not read them itself):

```
MLC_MODEL=HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
MLC_HOST=0.0.0.0
MLC_PORT=8000
```

**Install MLC-LLM** (run after SSH):

```bash
# The CUDA base image ships without Python, so install pip first
apt-get update && apt-get install -y python3-pip python-is-python3
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
```
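
To tie these together, here is a minimal launcher sketch that reads the `MLC_*` variables defined above and starts the server; the filename and structure are just a suggestion:

```python
#!/usr/bin/env python3
"""Launcher sketch: reads the custom MLC_* variables from the Clore.ai
order configuration and starts the OpenAI-compatible server."""
import os
import subprocess

model = os.environ.get("MLC_MODEL", "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC")
host = os.environ.get("MLC_HOST", "0.0.0.0")
port = os.environ.get("MLC_PORT", "8000")

# Equivalent to running `python -m mlc_llm serve MODEL --host ... --port ...`
subprocess.run(
    ["python3", "-m", "mlc_llm", "serve", model, "--host", host, "--port", port],
    check=True,
)
```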

### Step 3: Connect via SSH

```bash
ssh root@<clore-node-ip> -p <assigned-ssh-port>
```

***

## Installation & Setup

### Option A: Use Pre-compiled Models (Fastest)

MLC-AI maintains a library of pre-compiled models on Hugging Face. No compilation needed:

```bash
# Pull and run a pre-compiled Llama 3 8B (4-bit quantized)
python -m mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000
```

### Option B: Compile Your Own Model

For custom models or specific quantization requirements:

```bash
# Step 1: Convert model weights
python -m mlc_llm convert_weight \
  ./path/to/model \
  --quantization q4f16_1 \
  --output ./compiled/model-q4f16_1

# Step 2: Generate model configuration
python -m mlc_llm gen_config \
  ./path/to/model \
  --quantization q4f16_1 \
  --conv-template llama-3 \
  --output ./compiled/model-q4f16_1

# Step 3: Compile the model
python -m mlc_llm compile \
  ./compiled/model-q4f16_1/mlc-chat-config.json \
  --device cuda \
  --output ./compiled/model-q4f16_1/lib.so
```

{% hint style="info" %}
**Compilation time:** Compiling a 7B model typically takes 10–30 minutes on first run. Compiled artifacts are cached and reused on subsequent launches.
{% endhint %}
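
Once compilation finishes, the artifacts can be loaded directly through the MLC-LLM Python engine API. A minimal sketch, assuming the output paths from the steps above; note that the `model_lib` keyword is named `model_lib_path` in some older releases:

```python
from mlc_llm import MLCEngine

# Load the locally compiled weights and kernel library from the steps above.
engine = MLCEngine(
    model="./compiled/model-q4f16_1",
    model_lib="./compiled/model-q4f16_1/lib.so",  # `model_lib_path` in older releases
)

response = engine.chat.completions.create(
    model="./compiled/model-q4f16_1",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False,
)
print(response.choices[0].message.content)

engine.terminate()  # release GPU resources
```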

***

## Running the API Server

### Start the OpenAI-Compatible Server

```bash
python -m mlc_llm serve \
  HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000 \
  --max-batch-size 4 \
  --max-total-sequence-length 8192
```

### Server Startup Output

```
[2024-01-01 12:00:00] INFO: Loading model from HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-01-01 12:00:15] INFO: Model loaded successfully
[2024-01-01 12:00:15] INFO: Starting server on 0.0.0.0:8000
[2024-01-01 12:00:15] INFO: OpenAI-compatible API available at http://0.0.0.0:8000/v1
```

### Available API Endpoints

| Endpoint                     | Method | Description                      |
| ---------------------------- | ------ | -------------------------------- |
| `/v1/chat/completions`       | POST   | Chat completions (OpenAI format) |
| `/v1/completions`            | POST   | Text completions                 |
| `/v1/models`                 | GET    | List available models            |
| `/v1/debug/dump_event_trace` | GET    | Performance debugging (requires `--enable-debug`) |
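
A quick way to confirm the server is up is to list the served models; the `model` field in your requests should match an id returned here. A minimal check using only the Python standard library:

```python
import json
from urllib.request import urlopen

# Replace localhost:8000 with your Clore.ai node address and mapped API port.
with urlopen("http://localhost:8000/v1/models") as resp:
    payload = json.load(resp)

for model in payload.get("data", []):
    print(model["id"])
```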

***

## API Usage Examples

### Chat Completions (Python)

```python
from openai import OpenAI

# Point to your Clore.ai server
client = OpenAI(
    base_url="http://<clore-node-ip>:<api-port>/v1",
    api_key="none"  # MLC-LLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```

### Streaming Response

```python
stream = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[{"role": "user", "content": "Write a short story about AI."}],
    stream=True,
    max_tokens=1024
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### cURL Example

```bash
curl http://<clore-node-ip>:<api-port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

***

## Available Pre-compiled Models

MLC-AI provides ready-to-use compiled models on Hugging Face:

### Llama 3 Series

```bash
# 8B Instruct (recommended for most use cases)
HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

# 70B Instruct (requires 40GB+ VRAM or multi-GPU)
HF://mlc-ai/Llama-3-70B-Instruct-q4f16_1-MLC
```

### Mistral / Mixtral

```bash
HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
HF://mlc-ai/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC
```

### Gemma

```bash
HF://mlc-ai/gemma-2b-it-q4f16_1-MLC
HF://mlc-ai/gemma-7b-it-q4f16_1-MLC
```

### Phi

```bash
HF://mlc-ai/phi-2-q4f16_1-MLC
HF://mlc-ai/Phi-3-mini-4k-instruct-q4f16_1-MLC
```

{% hint style="success" %}
**Full model list:** Browse all pre-compiled models at [huggingface.co/mlc-ai](https://huggingface.co/mlc-ai)
{% endhint %}
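
The catalog can also be queried programmatically. A small sketch using the `huggingface_hub` package (install it with pip first):

```python
from huggingface_hub import list_models

# List 4-bit pre-compiled models published by the mlc-ai organization.
for model in list_models(author="mlc-ai", search="q4f16_1", limit=20):
    print(model.id)
```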

***

## Quantization Options

MLC-LLM supports multiple quantization schemes. Choose based on your VRAM budget:

| Quantization | Bits              | Quality | VRAM (7B) | VRAM (13B) |
| ------------ | ----------------- | ------- | --------- | ---------- |
| `q4f16_1`    | 4-bit             | ★★★★☆   | \~4GB     | \~7GB      |
| `q4f32_1`    | 4-bit (f32 accum) | ★★★★☆   | \~4GB     | \~7GB      |
| `q8f16_1`    | 8-bit             | ★★★★★   | \~8GB     | \~14GB     |
| `q0f16`      | 16-bit (no quant) | ★★★★★   | \~14GB    | \~26GB     |
| `q0f32`      | 32-bit (no quant) | ★★★★★   | \~28GB    | \~52GB     |

{% hint style="warning" %}
**VRAM recommendation:** Always leave 2–3GB headroom for CUDA overhead and KV cache. A 7B model with `q4f16_1` needs \~6–7GB total on a typical workload.
{% endhint %}
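
The table follows a simple rule of thumb: weight memory is the parameter count times bits per weight, plus headroom for the CUDA context and KV cache. A rough estimator, not an exact accounting:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float, headroom_gb: float = 2.5) -> float:
    """Weights (params * bits / 8 bytes) plus fixed headroom; KV cache grows with context."""
    weights_gb = params_b * bits_per_weight / 8  # params in billions -> GB
    return weights_gb + headroom_gb

for scheme, bits in [("q4f16_1", 4), ("q8f16_1", 8), ("q0f16", 16)]:
    print(f"7B @ {scheme}: ~{estimate_vram_gb(7.0, bits):.1f} GB")
```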

***

## Multi-GPU Deployment

For large models (70B+) requiring multiple GPUs:

```bash
# Enable tensor parallelism across 2 GPUs
python -m mlc_llm serve \
  HF://mlc-ai/Llama-3-70B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-shards 2
```

Check GPU topology before deploying:

```bash
nvidia-smi topo -m  # Check NVLink/PCIe connectivity
```

{% hint style="info" %}
**Best performance:** Multi-GPU works best with NVLink-connected cards (e.g., A100 80GB SXM pairs). PCIe-connected GPUs will show bottlenecks on large models.
{% endhint %}
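
A sensible default for `--tensor-parallel-shards` is the number of GPUs visible in the container; a quick way to check (a sketch shelling out to `nvidia-smi`):

```python
import subprocess

# `nvidia-smi -L` prints one line per visible GPU.
gpus = subprocess.run(
    ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
).stdout.strip().splitlines()
print(f"{len(gpus)} GPU(s) visible -> --tensor-parallel-shards {len(gpus)}")
```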

***

## Chat Interface

MLC-LLM does not bundle a browser UI with the REST server, but it includes a CLI chat for immediate interactive testing:

```bash
# Interactive terminal chat (downloads the model on first run)
python -m mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```

For a browser-based chat, point any OpenAI-compatible frontend (for example, Open WebUI) at `http://<clore-node-ip>:<api-port>/v1` while the API server from the previous section is running.

***

## Performance Tuning

### Optimize Batch Size

```bash
# Increase batch size for higher throughput (requires more VRAM)
python -m mlc_llm serve \
  HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000 \
  --max-batch-size 8 \
  --max-total-sequence-length 16384 \
  --prefill-chunk-size 2048
```

### Monitor GPU Utilization

```bash
# In a separate terminal
watch -n 1 nvidia-smi

# More detailed monitoring
nvidia-smi dmon -s u  # Streaming utilization metrics
```

### Benchmark Throughput

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.time()
response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[{"role": "user", "content": "Count from 1 to 100"}],
    max_tokens=512
)
elapsed = time.time() - start

tokens = response.usage.completion_tokens
print(f"Throughput: {tokens/elapsed:.1f} tokens/sec")
```

***

## Docker Compose Setup

For a production-ready deployment on Clore.ai using an NVIDIA CUDA base image with MLC-LLM installed via pip:

```yaml
version: '3.8'
services:
  mlc-llm:
    image: nvidia/cuda:12.1.0-devel-ubuntu22.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/models
      - mlc-cache:/root/.cache/mlc_llm
    command: >
      bash -c "apt-get update && apt-get install -y python3-pip python-is-python3 &&
      pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121 &&
      python -m mlc_llm serve
      HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
      --host 0.0.0.0
      --port 8000
      --max-batch-size 4"
    restart: unless-stopped

volumes:
  mlc-cache:
```

***

## Troubleshooting

### Model Download Fails

```bash
# Check internet connectivity
curl -I https://huggingface.co

# Manually download with huggingface-cli
pip install huggingface_hub
huggingface-cli download mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```

### Out of Memory (OOM)

```bash
# Reduce context length
python -m mlc_llm serve MODEL \
  --max-total-sequence-length 4096  # Reduce from default

# Use more aggressive quantization
# Switch from q8f16_1 to q4f16_1
```

### CUDA Version Mismatch

```bash
# Check CUDA version
nvcc --version
nvidia-smi | grep CUDA

# For CUDA 12.1 servers, install:
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121

# For CUDA 12.2+ servers, install:
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
```

{% hint style="danger" %}
**Common pitfall:** MLC-LLM pip wheels are CUDA-version specific. Make sure to install the correct variant matching your server's CUDA version. Check available wheels at [mlc.ai/wheels](https://mlc.ai/wheels).
{% endhint %}
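
If you want to automate wheel selection, the driver's CUDA version can be read from `nvidia-smi`; a sketch that maps it to a wheel suffix (not every suffix has a published wheel, so verify against [mlc.ai/wheels](https://mlc.ai/wheels)):

```python
import re
import subprocess

# The header of `nvidia-smi` reports the driver's CUDA version, e.g. "CUDA Version: 12.2".
out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", out)
if match is None:
    raise RuntimeError("could not parse CUDA version from nvidia-smi output")

suffix = f"cu{match.group(1)}{match.group(2)}"  # e.g. cu121, cu122
print(f"pip install --pre -U -f https://mlc.ai/wheels "
      f"mlc-llm-nightly-{suffix} mlc-ai-nightly-{suffix}")
```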

### Server Not Accessible

```bash
# Verify port is listening
ss -tlnp | grep 8000

# Check firewall
iptables -L -n | grep 8000

# Test locally first
curl http://localhost:8000/v1/models
```

***

## Clore.ai GPU Recommendations

MLC-LLM's compilation approach delivers near-optimal throughput on every GPU tier. Pick based on model size and budget:

| GPU       | VRAM  | Clore.ai Price | Best For                      | Throughput (Llama 3 8B Q4) |
| --------- | ----- | -------------- | ----------------------------- | -------------------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | 7B–13B models, budget serving | \~85 tok/s                 |
| RTX 4090  | 24 GB | \~$0.70/hr     | 7B–34B models, fast serving   | \~140 tok/s                |
| A100 40GB | 40 GB | \~$1.20/hr     | 34B–70B, production API       | \~110 tok/s                |
| A100 80GB | 80 GB | \~$2.00/hr     | 70B+, multi-model serving     | \~130 tok/s                |
| H100 SXM  | 80 GB | \~$3.50/hr     | Maximum throughput, FP8       | \~280 tok/s                |

**Recommended starting point:** the RTX 3090 at \~$0.12/hr offers the best price-to-performance for Llama 3 8B and Mistral 7B serving via MLC-LLM. The compiled kernels extract near-maximum utilization from consumer GPUs.

For 70B models (e.g., Llama 3 70B Q4): use A100 40GB (\~$1.20/hr) or two RTX 3090s via tensor parallelism.
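
Using the indicative prices and throughputs above, a quick cost-per-token comparison (illustrative figures only; actual Clore.ai rates vary):

```python
def usd_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Rental cost divided by tokens generated in an hour, scaled to 1M tokens."""
    return price_per_hour / (tokens_per_sec * 3600) * 1_000_000

for gpu, price, tps in [("RTX 3090", 0.12, 85), ("RTX 4090", 0.70, 140),
                        ("A100 40GB", 1.20, 110), ("H100 SXM", 3.50, 280)]:
    print(f"{gpu}: ${usd_per_million_tokens(price, tps):.2f} per 1M tokens")
```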

***

## Resources

* 📦 **Pip Wheels:** [mlc.ai/wheels](https://mlc.ai/wheels) (install via pip, no Docker Hub image available)
* 🐙 **GitHub:** [github.com/mlc-ai/mlc-llm](https://github.com/mlc-ai/mlc-llm)
* 📚 **Documentation:** [llm.mlc.ai/docs](https://llm.mlc.ai/docs)
* 🤗 **Pre-compiled Models:** [huggingface.co/mlc-ai](https://huggingface.co/mlc-ai)
* 💬 **Discord:** [discord.gg/9Xpy2HGBuD](https://discord.gg/9Xpy2HGBuD)
