# MLC-LLM

**Universal LLM deployment through machine learning compilation**: run large language models on a wide range of hardware with near-native performance.

> 🌟 **20,000+ GitHub stars** | Maintained by the MLC AI team | Apache-2.0 License

***

## What is MLC-LLM?

MLC-LLM (Machine Learning Compilation for Large Language Models) is a universal framework that enables efficient deployment of large language models across diverse hardware backends. By leveraging **Apache TVM** as its compilation backend, MLC-LLM compiles models directly to native code for each hardware target, achieving near-optimal performance without hand-written, hardware-specific kernels.

### Key Capabilities

* **Universal hardware support** — NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, WebGPU
* **OpenAI-compatible REST API** — drop-in replacement for existing workflows
* **Broad model support** — Llama, Mistral, Gemma, Phi, Qwen, Falcon, and more
* **4-bit / 8-bit quantization** — run large models on consumer GPUs
* **Chat interface** — built-in web UI for immediate testing
* **Python & CLI tools** — flexible integration options

### Why Use MLC-LLM on Clore.ai?

The Clore.ai GPU marketplace gives you access to high-performance NVIDIA GPUs at competitive rental rates. MLC-LLM's compilation approach squeezes maximum throughput from every GPU, making it ideal for:

* Production API inference at scale
* Research and benchmarking across model sizes
* Cost-efficient serving with quantized models
* Multi-model deployment on a single GPU instance

***

## Quick Start on Clore.ai

### Step 1: Find a GPU Server

1. Go to [clore.ai](https://clore.ai) marketplace
2. Filter servers: **NVIDIA GPU**, minimum **8GB VRAM** (16GB+ recommended for 7B+ models)
3. For optimal performance: RTX 3090, RTX 4090, A100, or H100

### Step 2: Deploy MLC-LLM

{% hint style="info" %}
**Note:** MLC-LLM does not publish an official pre-built Docker image to Docker Hub. The recommended deployment approach is to use an NVIDIA CUDA base image and install MLC-LLM via pip. Use `nvidia/cuda:12.1.0-devel-ubuntu22.04` as your base image on Clore.ai.
{% endhint %}

Use an NVIDIA CUDA base image in your Clore.ai order configuration:

```
Docker Image: nvidia/cuda:12.1.0-devel-ubuntu22.04
```

**Port mappings:**

| Container Port | Purpose         |
| -------------- | --------------- |
| `22`           | SSH access      |
| `8000`         | REST API server |

**Recommended environment variables:**

```
MLC_MODEL=HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
MLC_HOST=0.0.0.0
MLC_PORT=8000
```

**Startup script** (run after SSH):

```bash
# The CUDA base image ships without Python, so install pip first, then the CUDA 12.1 MLC wheels
apt-get update && apt-get install -y python3 python3-pip
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
```
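
The environment variables above are not read by MLC-LLM itself; they are plain shell variables you can use to parameterize your own startup command. A minimal sketch, assuming they are set in the container environment:

```bash
# Sketch: pass the shell variables defined earlier to the serve command
# (MLC_MODEL, MLC_HOST and MLC_PORT are ordinary environment variables, not MLC-LLM settings)
python -m mlc_llm serve "$MLC_MODEL" --host "$MLC_HOST" --port "$MLC_PORT"
```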

### Step 3: Connect via SSH

```bash
ssh root@<clore-node-ip> -p <assigned-ssh-port>
```

***

## Installation & Setup

### Option A: Use Pre-compiled Models (Fastest)

MLC-AI maintains a library of pre-compiled models on Hugging Face. No compilation needed:

```bash
# Pull and run a pre-compiled Llama 3 8B (4-bit quantized)
python -m mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000
```

### Option B: Compile Your Own Model

For custom models or specific quantization requirements:

```bash
# Step 1: Convert model weights
python -m mlc_llm convert_weight \
  ./path/to/model \
  --quantization q4f16_1 \
  --output ./compiled/model-q4f16_1

# Step 2: Generate model configuration
python -m mlc_llm gen_config \
  ./path/to/model \
  --quantization q4f16_1 \
  --conv-template llama-3 \
  --output ./compiled/model-q4f16_1

# Step 3: Compile the model
python -m mlc_llm compile \
  ./compiled/model-q4f16_1/mlc-chat-config.json \
  --device cuda \
  --output ./compiled/model-q4f16_1/lib.so
```

{% hint style="info" %}
**Compilation time:** Compiling a 7B model typically takes 10–30 minutes on first run. Compiled artifacts are cached and reused on subsequent launches.
{% endhint %}
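
Once compilation finishes, the local artifacts can be served in place of a `HF://` pre-built model. A minimal sketch; the exact flag for the compiled library (`--model-lib` below) can differ between MLC-LLM releases, so confirm with `python -m mlc_llm serve --help`:

```bash
# Sketch: serve the locally compiled weights and library produced in Option B
python -m mlc_llm serve ./compiled/model-q4f16_1 \
  --model-lib ./compiled/model-q4f16_1/lib.so \
  --host 0.0.0.0 \
  --port 8000
```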

***

## Running the API Server

### Start the OpenAI-Compatible Server

```bash
python -m mlc_llm serve \
  HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000 \
  --max-batch-size 4 \
  --max-total-sequence-length 8192
```

### Server Startup Output

```
[2024-01-01 12:00:00] INFO: Loading model from HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-01-01 12:00:15] INFO: Model loaded successfully
[2024-01-01 12:00:15] INFO: Starting server on 0.0.0.0:8000
[2024-01-01 12:00:15] INFO: OpenAI-compatible API available at http://0.0.0.0:8000/v1
```

### Available API Endpoints

| Endpoint                     | Method | Description                      |
| ---------------------------- | ------ | -------------------------------- |
| `/v1/chat/completions`       | POST   | Chat completions (OpenAI format) |
| `/v1/completions`            | POST   | Text completions                 |
| `/v1/models`                 | GET    | List available models            |
| `/v1/debug/dump_event_trace` | GET    | Performance debugging            |

***

## API Usage Examples

### Chat Completions (Python)

```python
from openai import OpenAI

# Point to your Clore.ai server
client = OpenAI(
    base_url="http://<clore-node-ip>:<api-port>/v1",
    api_key="none"  # MLC-LLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```

### Streaming Response

```python
stream = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[{"role": "user", "content": "Write a short story about AI."}],
    stream=True,
    max_tokens=1024
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### cURL Example

```bash
curl http://<clore-node-ip>:<api-port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

***

## Available Pre-compiled Models

MLC-AI provides ready-to-use compiled models on Hugging Face:

### Llama 3 Series

```bash
# 8B Instruct (recommended for most use cases)
HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

# 70B Instruct (requires 40GB+ VRAM or multi-GPU)
HF://mlc-ai/Llama-3-70B-Instruct-q4f16_1-MLC
```

### Mistral / Mixtral

```bash
HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
HF://mlc-ai/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC
```

### Gemma

```bash
HF://mlc-ai/gemma-2b-it-q4f16_1-MLC
HF://mlc-ai/gemma-7b-it-q4f16_1-MLC
```

### Phi

```bash
HF://mlc-ai/phi-2-q4f16_1-MLC
HF://mlc-ai/Phi-3-mini-4k-instruct-q4f16_1-MLC
```

{% hint style="success" %}
**Full model list:** Browse all pre-compiled models at [huggingface.co/mlc-ai](https://huggingface.co/mlc-ai)
{% endhint %}

***

## Quantization Options

MLC-LLM supports multiple quantization schemes. Choose based on your VRAM budget:

| Quantization | Bits              | Quality | VRAM (7B) | VRAM (13B) |
| ------------ | ----------------- | ------- | --------- | ---------- |
| `q4f16_1`    | 4-bit             | ★★★★☆   | \~4GB     | \~7GB      |
| `q4f32_1`    | 4-bit (f32 accum) | ★★★★☆   | \~4GB     | \~7GB      |
| `q8f16_1`    | 8-bit             | ★★★★★   | \~8GB     | \~14GB     |
| `q0f16`      | 16-bit (no quant) | ★★★★★   | \~14GB    | \~26GB     |
| `q0f32`      | 32-bit (no quant) | ★★★★★   | \~28GB    | \~52GB     |

{% hint style="warning" %}
**VRAM recommendation:** Always leave 2–3GB headroom for CUDA overhead and KV cache. A 7B model with `q4f16_1` needs \~6–7GB total on a typical workload.
{% endhint %}
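
The weight figures in the table follow from a simple rule of thumb: parameter count × bits per weight ÷ 8 bytes. A quick back-of-the-envelope check for a 7B model at 4-bit:

```bash
# Rough weight footprint: 7e9 params × 4 bits ÷ 8 ≈ 3.5 GB of weights;
# KV cache and CUDA overhead account for the rest of the ~6–7 GB total.
awk 'BEGIN { printf "%.1f GB\n", 7e9 * 4 / 8 / 1e9 }'
```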

***

## Multi-GPU Deployment

For large models (70B+) requiring multiple GPUs:

```bash
# Enable tensor parallelism across 2 GPUs
python -m mlc_llm serve \
  HF://mlc-ai/Llama-3-70B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-shards 2
```

Check GPU topology before deploying:

```bash
nvidia-smi topo -m  # Check NVLink/PCIe connectivity
```

{% hint style="info" %}
**Best performance:** Multi-GPU works best with NVLink-connected cards (e.g., A100 80GB SXM pairs). PCIe-connected GPUs will show bottlenecks on large models.
{% endhint %}

***

## Web Chat Interface

MLC-LLM includes a built-in web UI accessible once the server is running:

```bash
# Start server with web UI enabled
python -m mlc_llm serve \
  HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-debug  # Optional: enables debug endpoint
```

Access the UI at: `http://<clore-node-ip>:<api-port>`

***

## Performance Tuning

### Optimize Batch Size

```bash
# Increase batch size for higher throughput (requires more VRAM)
python -m mlc_llm serve \
  HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 \
  --port 8000 \
  --max-batch-size 8 \
  --max-total-sequence-length 16384 \
  --prefill-chunk-size 2048
```

### Monitor GPU Utilization

```bash
# In a separate terminal
watch -n 1 nvidia-smi

# More detailed monitoring
nvidia-smi dmon -s u  # Streaming utilization metrics
```

### Benchmark Throughput

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.time()
response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[{"role": "user", "content": "Count from 1 to 100"}],
    max_tokens=512
)
elapsed = time.time() - start

tokens = response.usage.completion_tokens
print(f"Throughput: {tokens/elapsed:.1f} tokens/sec")
```

***

## Docker Compose Setup

For a production-ready deployment on Clore.ai using an NVIDIA CUDA base image with MLC-LLM installed via pip:

```yaml
version: '3.8'
services:
  mlc-llm:
    image: nvidia/cuda:12.1.0-devel-ubuntu22.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/models
      - mlc-cache:/root/.cache/mlc_llm
    # The CUDA base image ships without Python, so the command installs pip before MLC-LLM
    command: >
      bash -c "apt-get update && apt-get install -y python3 python3-pip &&
      pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121 &&
      python -m mlc_llm serve
      HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
      --host 0.0.0.0
      --port 8000
      --max-batch-size 4"
    restart: unless-stopped

volumes:
  mlc-cache:
```
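
To bring the service up and follow the install and model-download progress (standard Docker Compose usage, assuming Docker with the NVIDIA runtime is available on the node):

```bash
docker compose up -d             # start the service in the background
docker compose logs -f mlc-llm   # follow install, download, and serve logs
```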

***

## Troubleshooting

### Model Download Fails

```bash
# Check internet connectivity
curl -I https://huggingface.co

# Manually download with huggingface-cli
pip install huggingface_hub
huggingface-cli download mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```

### Out of Memory (OOM)

```bash
# Reduce context length
python -m mlc_llm serve MODEL \
  --max-total-sequence-length 4096  # Reduce from default

# Use more aggressive quantization
# Switch from q8f16_1 to q4f16_1
```

### CUDA Version Mismatch

```bash
# Check CUDA version
nvcc --version
nvidia-smi | grep CUDA

# For CUDA 12.1 servers, install:
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121

# For CUDA 12.2+ servers, install:
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
```

{% hint style="danger" %}
**Common pitfall:** MLC-LLM pip wheels are CUDA-version specific. Make sure to install the correct variant matching your server's CUDA version. Check available wheels at [mlc.ai/wheels](https://mlc.ai/wheels).
{% endhint %}

### Server Not Accessible

```bash
# Verify port is listening
ss -tlnp | grep 8000

# Check firewall
iptables -L -n | grep 8000

# Test locally first
curl http://localhost:8000/v1/models
```

***

## Clore.ai GPU Recommendations

MLC-LLM's compilation approach delivers near-optimal throughput on every GPU tier. Pick based on model size and budget:

| GPU       | VRAM  | Clore.ai Price | Best For                      | Throughput (Llama 3 8B Q4) |
| --------- | ----- | -------------- | ----------------------------- | -------------------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | 7B–13B models, budget serving | \~85 tok/s                 |
| RTX 4090  | 24 GB | \~$0.70/hr     | 7B–34B models, fast serving   | \~140 tok/s                |
| A100 40GB | 40 GB | \~$1.20/hr     | 34B–70B, production API       | \~110 tok/s                |
| A100 80GB | 80 GB | \~$2.00/hr     | 70B+, multi-model serving     | \~130 tok/s                |
| H100 SXM  | 80 GB | \~$3.50/hr     | Maximum throughput, FP8       | \~280 tok/s                |

**Recommended starting point:** RTX 3090 at \~$0.12/hr is the best price-performance ratio for Llama 3 8B and Mistral 7B serving via MLC-LLM. The compiled kernels extract near-maximum utilization from consumer GPUs.

For 70B models (e.g., Llama 3 70B Q4): use A100 40GB (\~$1.20/hr) or two RTX 3090s via tensor parallelism.

***

## Resources

* 📦 **Pip Wheels:** [mlc.ai/wheels](https://mlc.ai/wheels) (install via pip, no Docker Hub image available)
* 🐙 **GitHub:** [github.com/mlc-ai/mlc-llm](https://github.com/mlc-ai/mlc-llm)
* 📚 **Documentation:** [llm.mlc.ai/docs](https://llm.mlc.ai/docs)
* 🤗 **Pre-compiled Models:** [huggingface.co/mlc-ai](https://huggingface.co/mlc-ai)
* 💬 **Discord:** [discord.gg/9Xpy2HGBuD](https://discord.gg/9Xpy2HGBuD)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/mlc-llm.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
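
For example, a URL-encoded query from the command line (the question text is illustrative):

```bash
# -G sends the request as a GET; --data-urlencode escapes spaces in the question
curl -G "https://docs.clore.ai/guides/language-models/mlc-llm.md" \
  --data-urlencode "ask=Which CUDA versions do the MLC-LLM pip wheels support?"
```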
