MLC-LLM

Universal LLM deployment: run large language models on a wide range of hardware with near-peak performance through machine learning compilation.

🌟 20,000+ GitHub stars | Maintained by the MLC AI team | Apache-2.0 License


What is MLC-LLM?

MLC-LLM (Machine Learning Compilation for Large Language Models) is a universal framework for deploying large language models efficiently across diverse hardware backends. Using TVM (Tensor Virtual Machine) as its compilation backend, MLC-LLM compiles models directly to native hardware code, achieving near-optimal performance without hand-written, hardware-specific kernels.

Key Capabilities

  • Universal hardware support — NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, WebGPU

  • OpenAI-compatible REST API — drop-in replacement for existing workflows

  • Multiple model formats — Llama, Mistral, Gemma, Phi, Qwen, Falcon, and more

  • 4-bit / 8-bit quantization — run large models on consumer GPUs

  • Chat interface — built-in web UI for immediate testing

  • Python & CLI tools — flexible integration options

Why Use MLC-LLM on Clore.ai?

The Clore.ai GPU marketplace gives you access to high-performance NVIDIA GPUs at competitive rental rates. MLC-LLM's compilation approach extracts maximum throughput from each GPU, making it ideal for:

  • Production API inference at scale

  • Research and benchmarking across model sizes

  • Cost-efficient serving with quantized models

  • Multi-model deployment on a single GPU instance


Quick Start on Clore.ai

Step 1: Find a GPU Server

  1. Go to the clore.ai marketplace

  2. Filter servers: NVIDIA GPU, minimum 8GB VRAM (16GB+ recommended for 7B+ models)

  3. For optimal performance: RTX 3090, RTX 4090, A100, or H100

Step 2: Deploy MLC-LLM


Note: MLC-LLM does not publish an official pre-built Docker image to Docker Hub. The recommended deployment approach is to use an NVIDIA CUDA base image and install MLC-LLM via pip. Use nvidia/cuda:12.1.0-devel-ubuntu22.04 as your base image on Clore.ai.

Use an NVIDIA CUDA base image in your Clore.ai order configuration:
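The equivalent docker run command is sketched below; on Clore.ai itself you enter the image, ports, and environment variables in the order form. The image tag comes from the note above; everything else is illustrative.

```shell
# docker run equivalent of the Clore.ai order settings
# (on Clore.ai you fill in the image, ports, and env vars in the order form).
docker run -d --gpus all \
  -p 22:22 -p 8000:8000 \
  --name mlc-llm \
  nvidia/cuda:12.1.0-devel-ubuntu22.04 \
  sleep infinity   # keep the container alive until you SSH in and start the server
```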

Port mappings:

Container Port    Purpose
22                SSH access
8000              REST API server

Recommended environment variables:
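A minimal sketch; HF_HOME and HF_TOKEN are standard Hugging Face Hub variables, and the token is only needed for gated models:

```shell
export HF_HOME=/root/.cache/huggingface   # keep model downloads on the instance disk
export HF_TOKEN=<your-hf-token>           # only needed for gated models (e.g. Llama)
```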

Startup script (run after SSH):
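A startup sketch, assuming the CUDA 12.1 base image above; the nightly wheel names follow the index at https://mlc.ai/wheels, and the cu121 suffix must match the image's CUDA version:

```shell
apt-get update && apt-get install -y python3-pip git
python3 -m pip install --pre -U -f https://mlc.ai/wheels \
  mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
python3 -c "import mlc_llm" && echo "MLC-LLM installed"   # sanity check
```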

Step 3: Connect via SSH
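Connect using the host and port shown on your Clore.ai order page (placeholders below):

```shell
ssh -p <mapped-ssh-port> root@<clore-node-ip>
```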


Installation & Setup

Option A: Use Pre-compiled Models (Fastest)

MLC-AI maintains a library of pre-compiled models on Hugging Face. No compilation needed:
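A minimal sketch: the HF:// scheme tells MLC-LLM to pull a pre-compiled model directly from Hugging Face. The repo below is one of the mlc-ai org's published conversions; substitute any other from the same org.

```shell
# Downloads the pre-compiled model on first run, then opens an interactive chat.
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```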

Option B: Compile Your Own Model

For custom models or specific quantization requirements:
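A sketch of MLC-LLM's documented three-step flow (convert weights, generate the chat config, compile a device-specific library); all paths and the model checkpoint below are placeholders:

```shell
MODEL=./dist/models/Llama-3-8B-Instruct      # local HF-format checkpoint (placeholder)
OUT=./dist/Llama-3-8B-Instruct-q4f16_1-MLC

# 1. Quantize and convert the weights to MLC format
mlc_llm convert_weight $MODEL --quantization q4f16_1 -o $OUT

# 2. Generate mlc-chat-config.json (conversation template must match the model)
mlc_llm gen_config $MODEL --quantization q4f16_1 \
  --conv-template llama-3 -o $OUT

# 3. Compile a CUDA-specific model library
mlc_llm compile $OUT/mlc-chat-config.json --device cuda \
  -o $OUT/libs/cuda.so
```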


Compilation time: Compiling a 7B model typically takes 10–30 minutes on first run. Compiled artifacts are cached and reused on subsequent launches.


Running the API Server

Start the OpenAI-Compatible Server
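Assuming MLC-LLM is installed and using an illustrative pre-compiled model, bind to 0.0.0.0 so the API is reachable through Clore.ai's port mapping:

```shell
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 --port 8000
```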

Once the model weights are loaded, the server logs the host and port it is listening on; wait for that line before sending requests.

Available API Endpoints

Endpoint                      Method   Description
/v1/chat/completions          POST     Chat completions (OpenAI format)
/v1/completions               POST     Text completions
/v1/models                    GET      List available models
/v1/debug/dump_event_trace    GET      Performance debugging


API Usage Examples

Chat Completions (Python)
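A minimal sketch using only the Python standard library; it assumes the server is listening on localhost:8000, and the model name is illustrative. The actual network call is commented out so the snippet is safe to run without a server.

```python
import json
import urllib.request

def build_chat_request(model, messages, temperature=0.7, stream=False):
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {"model": model, "messages": messages,
            "temperature": temperature, "stream": stream}

payload = build_chat_request(
    "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    [{"role": "user", "content": "Summarize ML compilation in one sentence."}],
)
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```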

Streaming Response
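With stream=true the server emits Server-Sent Events in the OpenAI format, one "data: {json}" line per chunk, terminated by "data: [DONE]". The parser below is pure, so it is shown against a simulated stream rather than a live server:

```python
import json

def parse_sse_line(line):
    """Return the decoded chunk dict, or None for blanks and [DONE]."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    return json.loads(data)

def collect_stream(lines):
    """Concatenate delta contents from an OpenAI-style chat stream."""
    out = []
    for line in lines:
        chunk = parse_sse_line(line)
        if chunk is None:
            continue
        out.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(out)

# Simulated stream (shape matches the OpenAI streaming format):
sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Hello
```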

cURL Example
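Assuming the server is listening on port 8000 with an illustrative model loaded:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```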


Available Pre-compiled Models

MLC-AI provides ready-to-use compiled models under the mlc-ai organization on Hugging Face (https://huggingface.co/mlc-ai):

Llama 3 Series

Mistral / Mixtral

Gemma

Phi


Quantization Options

MLC-LLM supports multiple quantization schemes. Choose based on your VRAM budget:

Quantization   Bits                Quality   VRAM (7B)   VRAM (13B)
q4f16_1        4-bit               ★★★★☆     ~4 GB       ~7 GB
q4f32_1        4-bit (f32 accum)   ★★★★☆     ~4 GB       ~7 GB
q8f16_1        8-bit               ★★★★★     ~8 GB       ~14 GB
q0f16          16-bit (no quant)   ★★★★★     ~14 GB      ~26 GB
q0f32          32-bit (no quant)   ★★★★★     ~28 GB      ~52 GB


Multi-GPU Deployment

For large models (70B+) requiring multiple GPUs:
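A sketch using MLC-LLM's tensor_parallel_shards override to shard the model across two GPUs; the 70B repo name is illustrative:

```shell
mlc_llm serve HF://mlc-ai/Llama-3-70B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 --port 8000 \
  --overrides "tensor_parallel_shards=2"
```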

Check GPU topology before deploying:
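nvidia-smi prints the interconnect matrix; look for NV# (NVLink) versus PHB/PIX (PCIe) links between the GPU pairs you plan to use:

```shell
nvidia-smi topo -m
```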


Best performance: Multi-GPU works best with NVLink-connected cards (e.g., A100 80GB SXM pairs). PCIe-connected GPUs will show bottlenecks on large models.


Web Chat Interface

MLC-LLM includes a built-in web UI accessible once the server is running:

Access the UI at: http://<clore-node-ip>:<api-port>


Performance Tuning

Optimize Batch Size
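Batching behavior is controlled through serve --overrides. The field names below (max_num_sequence, prefill_chunk_size) are engine-config knobs and the values are illustrative; treat this as a starting point to tune per workload and VRAM budget:

```shell
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 --port 8000 \
  --overrides "max_num_sequence=32;prefill_chunk_size=2048"
```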

Monitor GPU Utilization
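Two common ways to watch utilization and memory while the server is handling traffic:

```shell
watch -n 1 nvidia-smi     # full status, refreshed every second
nvidia-smi dmon -s um     # compact per-second utilization and memory columns
```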

Benchmark Throughput
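A tiny, offline-testable helper for computing tokens per second; the request function is injected, so wire it to the /v1 endpoints described above (reading the response's completion-token count) when benchmarking for real:

```python
import time

def measure_throughput(send_request, n_requests=5):
    """send_request() -> number of completion tokens for one request.

    Returns aggregate tokens per second across n_requests calls."""
    start = time.perf_counter()
    total_tokens = sum(send_request() for _ in range(n_requests))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")

# Offline smoke test with a stub that "generates" 100 tokens instantly:
tput = measure_throughput(lambda: 100, n_requests=3)
print(f"{tput:.0f} tok/s on the stub")
```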


Docker Compose Setup

For a production-ready deployment on Clore.ai using an NVIDIA CUDA base image with MLC-LLM installed via pip:
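A hedged compose sketch matching that pip-based setup; the model name is illustrative and the wheel suffix assumes the CUDA 12.1 image:

```yaml
services:
  mlc-llm:
    image: nvidia/cuda:12.1.0-devel-ubuntu22.04
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      bash -c "apt-get update && apt-get install -y python3-pip &&
               python3 -m pip install --pre -U -f https://mlc.ai/wheels
               mlc-llm-nightly-cu121 mlc-ai-nightly-cu121 &&
               mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
               --host 0.0.0.0 --port 8000"
```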


Troubleshooting

Model Download Fails
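Downloads usually fail on gated repos (missing token) or flaky connections; a sketch of the usual fixes, with the model name illustrative:

```shell
huggingface-cli login           # or: export HF_TOKEN=<token> for gated models
export HF_HUB_ENABLE_HF_TRANSFER=0   # fall back to plain HTTP downloads if transfers stall
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC   # retry the download
```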

Out of Memory (OOM)
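Roughly in order of impact: pick a smaller quantization (see the table above), shrink the context window, and cap concurrent sequences. The override names are engine-config fields; the values are illustrative:

```shell
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --host 0.0.0.0 --port 8000 \
  --overrides "context_window_size=4096;max_num_sequence=8"
```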

CUDA Version Mismatch
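The cu* suffix of the installed wheels must match the CUDA toolkit in the image; check both, then reinstall the matching pair:

```shell
nvcc --version   # toolkit version inside the container
nvidia-smi       # driver's maximum supported CUDA version
python3 -m pip install --pre -U -f https://mlc.ai/wheels \
  mlc-llm-nightly-cu121 mlc-ai-nightly-cu121   # matches the CUDA 12.1 image
```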


Server Not Accessible
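Check locally first, then through the Clore.ai port mapping (placeholders as elsewhere in this guide); the server must be started with --host 0.0.0.0 to be reachable from outside the container:

```shell
curl -s http://localhost:8000/v1/models       # inside the instance
ss -tlnp | grep 8000                          # confirm the bind address is 0.0.0.0
curl -s http://<clore-node-ip>:<mapped-port>/v1/models   # from your machine
```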


Clore.ai GPU Recommendations

MLC-LLM's compilation approach delivers near-optimal throughput on every GPU tier. Pick based on model size and budget:

GPU          VRAM    Clore.ai Price   Best For                        Throughput (Llama 3 8B Q4)
RTX 3090     24 GB   ~$0.12/hr        7B–13B models, budget serving   ~85 tok/s
RTX 4090     24 GB   ~$0.70/hr        7B–34B models, fast serving     ~140 tok/s
A100 40GB    40 GB   ~$1.20/hr        34B–70B, production API         ~110 tok/s
A100 80GB    80 GB   ~$2.00/hr        70B+, multi-model serving       ~130 tok/s
H100 SXM     80 GB   ~$3.50/hr        Maximum throughput, FP8         ~280 tok/s

Recommended starting point: RTX 3090 at ~$0.12/hr is the best price-performance ratio for Llama 3 8B and Mistral 7B serving via MLC-LLM. The compiled kernels extract near-maximum utilization from consumer GPUs.

For 70B models (e.g., Llama 3 70B Q4): use A100 40GB (~$1.20/hr) or two RTX 3090s via tensor parallelism.


Resources

  • MLC-LLM documentation: https://llm.mlc.ai/docs

  • GitHub repository: https://github.com/mlc-ai/mlc-llm

  • Pre-compiled models: https://huggingface.co/mlc-ai
