# LMDeploy

**Efficient LLM deployment toolkit by Shanghai AI Lab** — production-grade inference, quantization, and serving for large language models with continuous batching and PagedAttention.

> 🏛️ Developed by **OpenMMLab / Shanghai AI Lab** | Apache-2.0 License | 4,000+ GitHub stars

***

## What is LMDeploy?

LMDeploy is a comprehensive toolkit for compressing, deploying, and serving Large Language Models in production. Developed by the MMRazor and MMDeploy teams of OpenMMLab (the organization behind MMDetection and MMSegmentation), it brings research-grade optimizations to practical deployment:

* **TurboMind engine** — high-performance C++ inference backend with CUDA optimizations
* **PyTorch engine** — flexible Python-based engine for broad model compatibility
* **Continuous batching** — maximizes GPU utilization across concurrent requests
* **PagedAttention** — efficient KV cache management (similar to vLLM)
* **4-bit / 8-bit quantization** — AWQ and SmoothQuant support
* **Vision-Language Models** — InternVL, LLaVA, Qwen-VL support

Compared to vLLM, LMDeploy's TurboMind engine delivers \~1.36× higher throughput on Llama 3 8B at batch=32, and its AWQ quantization is first-class — not an afterthought. For VLMs (especially InternVL2), LMDeploy is the reference deployment stack.

### Why LMDeploy?

| Feature                   | LMDeploy | vLLM    | TGI     |
| ------------------------- | -------- | ------- | ------- |
| Continuous batching       | ✅        | ✅       | ✅       |
| AWQ quantization          | ✅        | ✅       | ❌       |
| Speculative decoding      | ✅        | ✅       | ✅       |
| Vision-Language           | ✅        | Limited | Limited |
| OpenAI API                | ✅        | ✅       | ✅       |
| TurboMind (custom engine) | ✅        | ❌       | ❌       |

***

## Quick Start on Clore.ai

### Step 1: Select a GPU Server

On [clore.ai](https://clore.ai) marketplace:

* **Minimum:** NVIDIA GPU with 8GB VRAM (7B models with 4-bit quantization)
* **Recommended:** RTX 3090/4090 (24GB) or A100 (40/80GB)
* **CUDA:** 11.8 or 12.x required

### Step 2: Deploy LMDeploy Docker

```
Docker Image: openmmlab/lmdeploy
```

**Port mappings:**

| Container Port | Purpose             |
| -------------- | ------------------- |
| `22`           | SSH access          |
| `23333`        | LMDeploy API server |

**Environment variables:**

```
HUGGING_FACE_HUB_TOKEN=your_hf_token_here  # For gated models
```

### Step 3: SSH and Verify

```bash
ssh root@<clore-node-ip> -p <ssh-port>

# Verify installation
python -c "import lmdeploy; print(lmdeploy.__version__)"
lmdeploy --help
```

***

## Starting the API Server

### OpenAI-Compatible Server (Recommended)

```bash
# Serve Llama 3 8B with TurboMind engine
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --model-name llama3-8b

# With explicit engine selection
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --backend turbomind \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --tp 1 \
  --max-batch-size 128 \
  --cache-max-entry-count 0.8
```

### PyTorch Engine (Broader Compatibility)

```bash
# Use PyTorch engine for models not supported by TurboMind
lmdeploy serve api_server \
  mistralai/Mistral-7B-Instruct-v0.2 \
  --backend pytorch \
  --server-port 23333 \
  --server-name 0.0.0.0
```

### Server Startup Output

```
[2024-01-01 12:00:00,000] INFO: Loading model: meta-llama/Meta-Llama-3-8B-Instruct
[2024-01-01 12:00:20,000] INFO: TurboMind engine initialized
[2024-01-01 12:00:20,000] INFO: Server started at http://0.0.0.0:23333
[2024-01-01 12:00:20,000] INFO: API docs: http://0.0.0.0:23333/docs
```

{% hint style="success" %}
Once started, LMDeploy exposes interactive API docs at `http://<your-ip>:23333/docs` — useful for testing endpoints directly from the browser.
{% endhint %}
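
Before wiring up clients, a quick smoke test confirms the server is reachable from outside the container. A minimal sketch using `requests` (substitute your node IP and mapped API port):

```python
import requests

BASE = "http://<clore-node-ip>:23333"  # use your mapped API port

# Health probe: returns HTTP 200 once the engine is loaded
print(requests.get(f"{BASE}/health").status_code)

# List models exposed through the OpenAI-compatible API
print(requests.get(f"{BASE}/v1/models").json())
```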

***

## Supported Models

### Text Models

```bash
# Llama 3
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-70B-Instruct

# Mistral / Mixtral
mistralai/Mistral-7B-Instruct-v0.2
mistralai/Mixtral-8x7B-Instruct-v0.1

# Qwen
Qwen/Qwen2-7B-Instruct
Qwen/Qwen2-72B-Instruct

# InternLM
internlm/internlm2-chat-7b
internlm/internlm2-chat-20b

# Yi
01-ai/Yi-1.5-9B-Chat
01-ai/Yi-1.5-34B-Chat

# Gemma
google/gemma-7b-it
google/gemma-2b-it
```

### Vision-Language Models

```bash
# InternVL (recommended VLM)
OpenGVLab/InternVL2-8B
OpenGVLab/InternVL2-26B

# LLaVA
llava-hf/llava-1.5-7b-hf

# Qwen-VL
Qwen/Qwen-VL-Chat
```

***

## Quantization

### AWQ 4-bit Quantization

LMDeploy's AWQ (Activation-aware Weight Quantization) produces excellent quality at 4-bit:

```bash
# Quantize a model to AWQ 4-bit
lmdeploy lite auto_awq \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --calib-dataset ptb \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir ./quantized/llama3-8b-awq

# Serve the quantized model
lmdeploy serve api_server \
  ./quantized/llama3-8b-awq \
  --server-port 23333 \
  --server-name 0.0.0.0
```
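
The quantized output directory also works with the native Python pipeline. A minimal sketch, assuming the `--work-dir` from above; `model_format='awq'` tells TurboMind to load the 4-bit weights:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Load the AWQ model produced by `lmdeploy lite auto_awq`
pipe = pipeline(
    './quantized/llama3-8b-awq',
    backend_config=TurbomindEngineConfig(model_format='awq')
)
print(pipe("Explain AWQ quantization in one sentence.").text)
```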

### SmoothQuant W8A8

8-bit weight and activation quantization (better for throughput-critical deployments):

```bash
lmdeploy lite smooth_quant \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --work-dir ./quantized/llama3-8b-sq \
  --calib-dataset ptb \
  --calib-samples 512
```
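
W8A8 checkpoints are served by the PyTorch engine rather than TurboMind. A minimal sketch for the output directory above:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# SmoothQuant W8A8 models run on the PyTorch backend
pipe = pipeline(
    './quantized/llama3-8b-sq',
    backend_config=PytorchEngineConfig()
)
print(pipe("Summarize SmoothQuant in one sentence.").text)
```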

### Quantization Impact

| Quantization     | VRAM (7B) | Quality Loss | Throughput Gain |
| ---------------- | --------- | ------------ | --------------- |
| None (bf16)      | \~14GB    | None         | Baseline        |
| SmoothQuant W8A8 | \~8GB     | Minimal      | +20%            |
| AWQ W4A16        | \~4GB     | Low          | +15%            |
| GPTQ W4A16       | \~4GB     | Low          | +10%            |

{% hint style="info" %}
**AWQ recommendation:** For most use cases, AWQ 4-bit is the best balance of quality and VRAM savings. Use `--w-group-size 128` for better quality at slightly higher memory usage.
{% endhint %}
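
The VRAM column follows from simple arithmetic: weight memory is roughly parameter count × bits per weight / 8, before KV cache and engine overhead. A quick sanity check:

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB (ignores group scales and KV cache)."""
    return params_billion * bits / 8

for label, bits in [("bf16", 16), ("W8A8", 8), ("W4A16", 4)]:
    print(f"7B @ {label}: ~{weight_vram_gb(7, bits):.1f} GB of weights")
# bf16 ~14 GB, W8A8 ~7 GB, W4A16 ~3.5 GB, matching the table above
```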

***

## API Usage Examples

### Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<api-port>/v1",
    api_key="none"
)

# Chat completion
response = client.chat.completions.create(
    model="llama3-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the history of AI in 3 sentences."}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "Write a poem about space."}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```

### LMDeploy Native Python Client

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Direct pipeline (no server needed)
pipe = pipeline(
    'meta-llama/Meta-Llama-3-8B-Instruct',
    backend_config=TurbomindEngineConfig(max_batch_size=16)
)

# Single inference
response = pipe("What is the capital of France?")
print(response.text)

# Batch inference
responses = pipe([
    "Explain gravity",
    "What is DNA?",
    "How does Bitcoin work?"
])
for r in responses:
    print(r.text)
    print("---")
```

### Vision-Language Model

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')

image = load_image('https://example.com/photo.jpg')
response = pipe(('Describe this image in detail', image))
print(response.text)
```

***

## Multi-GPU Deployment

### Tensor Parallelism

```bash
# Distribute a 70B model across 4 GPUs
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-70B-Instruct \
  --backend turbomind \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --tp 4 \
  --max-batch-size 64
```

```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'meta-llama/Meta-Llama-3-70B-Instruct',
    backend_config=TurbomindEngineConfig(tp=4)
)
```

***

## Advanced Configuration

### TurboMind Engine Config

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    max_batch_size=64,          # Maximum concurrent requests
    cache_max_entry_count=0.8,  # KV cache ratio (0.0-1.0)
    quant_policy=0,             # 0=no quant, 4=4bit KV cache, 8=8bit KV cache
    rope_scaling_factor=1.0,    # For extended context
    num_tokens_per_iter=4096,   # Prefill chunk size
    max_prefill_token_num=8192, # Max prefill length
)

pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct', backend_config=engine_config)
```

### Generation Config

```python
from lmdeploy import GenerationConfig

gen_config = GenerationConfig(
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_new_tokens=1024,
    stop_words=['<|eot_id|>', '<|end_of_text|>'],
)

response = pipe("Hello, world!", gen_config=gen_config)
```

***

## Monitoring & Metrics

### Check Server Health

```bash
# Health check endpoint
curl http://localhost:23333/health

# List available models
curl http://localhost:23333/v1/models

# Interactive API docs (Swagger UI)
# open http://localhost:23333/docs in a browser
```
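
When scripting deployments (CI, autoscaling), it helps to block until the model has finished loading. A minimal polling sketch against the `/health` endpoint above:

```python
import time
import requests

def wait_for_server(base_url: str, timeout: float = 600.0) -> bool:
    """Poll /health until the server answers or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).ok:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server still starting up
        time.sleep(5)
    return False

if wait_for_server("http://localhost:23333"):
    print("LMDeploy server is ready")
```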

### GPU Monitoring

```bash
# Real-time GPU stats
watch -n 1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv'
```

***

## Docker Compose Example

```yaml
version: '3.8'
services:
  lmdeploy:
    image: openmmlab/lmdeploy:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    ports:
      - "23333:23333"
      - "22:22"
    volumes:
      - hf-cache:/root/.cache/huggingface
      - ./models:/models
    command: >
      lmdeploy serve api_server
      meta-llama/Meta-Llama-3-8B-Instruct
      --server-port 23333
      --server-name 0.0.0.0
      --model-name llama3-8b
      --max-batch-size 64
    restart: unless-stopped
    shm_size: '2g'

volumes:
  hf-cache:
```

***

## Benchmarking

```bash
# Benchmark scripts ship in the LMDeploy repo (benchmark/ directory)
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy/benchmark

# Prompt source: the standard ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Offline throughput benchmark; repeat with --concurrency 1, 8, ... to sweep load
# (flag names may vary by version; see benchmark/README.md)
python profile_throughput.py \
  ShareGPT_V3_unfiltered_cleaned_split.json \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --concurrency 32 \
  --num-prompts 1000
```

Sample results across concurrency levels (RTX 4090, TurboMind, bf16):

```
concurrency=1:  throughput=42.3 tokens/s, latency_p50=23ms
concurrency=8:  throughput=287.1 tokens/s, latency_p50=156ms
concurrency=32: throughput=412.6 tokens/s, latency_p50=621ms
```

On A100 80GB, expect \~2.2× higher throughput vs RTX 4090 at high concurrency, thanks to its \~2 TB/s of HBM2e bandwidth versus the 4090's \~1 TB/s of GDDR6X.
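
You can also measure end-to-end throughput from the client side, independent of the repo scripts. A rough sketch against the OpenAI-compatible endpoint; counting streamed chunks approximates generated tokens, since each chunk usually carries one token:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "Write 300 words about GPUs."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        tokens += 1  # one chunk is roughly one token
elapsed = time.time() - start
print(f"~{tokens / elapsed:.1f} tokens/s (single request)")
```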

***

## Clore.ai GPU Recommendations

Choose based on your target model size and serving load:

| Use Case                    | GPU           | VRAM  | Why                                                          |
| --------------------------- | ------------- | ----- | ------------------------------------------------------------ |
| 7–13B models, dev/staging   | **RTX 3090**  | 24 GB | Best $/VRAM ratio; handles 7B bf16 or 13B AWQ                |
| 7–13B models, production    | **RTX 4090**  | 24 GB | \~40% faster than 3090 at same VRAM; 412 tok/s on Llama 3 8B |
| 70B models, team serving    | **A100 40GB** | 40 GB | Fits 70B AWQ; ECC memory for reliability                     |
| 70B models, high throughput | **A100 80GB** | 80 GB | Fits 70B bf16; 2× throughput vs A100 40GB at batch=32        |

**Budget pick:** RTX 3090 + AWQ 4-bit — serves Llama 3 8B at \~280 tok/s batch=8, covers most API use cases.

**Speed pick:** RTX 4090 — fastest per-dollar for 7–13B models; TurboMind squeezes out every GB/s of its 1 TB/s bandwidth.

**Production pick:** A100 80GB — run Qwen2-72B or Llama 3 70B in full bf16 without quantization quality tradeoffs; fits easily into multi-instance GPU serving.

***

## Troubleshooting

### Model Not Loading

```bash
# Check HuggingFace token is set
echo $HUGGING_FACE_HUB_TOKEN

# Manually download model
pip install huggingface_hub
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama3-8b

# Use local path instead
lmdeploy serve api_server ./llama3-8b --server-port 23333
```

### CUDA Out of Memory

```bash
# Reduce KV cache allocation
lmdeploy serve api_server MODEL \
  --cache-max-entry-count 0.5  # Reduce from 0.8

# Use quantized KV cache
lmdeploy serve api_server MODEL \
  --quant-policy 8  # 8-bit KV cache
```
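
The same mitigations are available in the Python pipeline through `TurbomindEngineConfig`:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Shrink the KV cache allocation and quantize it to 8-bit
config = TurbomindEngineConfig(
    cache_max_entry_count=0.5,  # down from the 0.8 default
    quant_policy=8,             # 8-bit KV cache
)
pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct', backend_config=config)
```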

### Port Already in Use

```bash
# Check what's using port 23333
ss -tlnp | grep 23333
fuser 23333/tcp

# Kill the process holding the port
fuser -k 23333/tcp
```

{% hint style="warning" %}
**Docker network mode:** When running in Docker, ensure the container uses `--network host` or proper port mapping (`-p 23333:23333`) so the API is reachable from outside.
{% endhint %}

***

## Clore.ai Pricing & Throughput

LMDeploy's TurboMind engine and W4A16 quantization deliver best-in-class throughput, especially on Ampere and Hopper GPUs. Figures below are approximate; aggregate throughput under continuous batching can be considerably higher (see the Benchmarking section above).

| GPU         | VRAM  | Clore.ai Price | Llama 3 8B Throughput         | Llama 3 70B Q4     |
| ----------- | ----- | -------------- | ----------------------------- | ------------------ |
| RTX 3090    | 24 GB | \~$0.12/hr     | \~120 tok/s (fp16)            | ❌ Too large        |
| RTX 4090    | 24 GB | \~$0.70/hr     | \~200 tok/s (fp16)            | ❌ Too large        |
| A100 40GB   | 40 GB | \~$1.20/hr     | \~160 tok/s (fp16)            | \~55 tok/s (W4A16) |
| A100 80GB   | 80 GB | \~$2.00/hr     | \~175 tok/s (fp16)            | \~80 tok/s (fp16)  |
| 2× RTX 4090 | 48 GB | \~$1.40/hr     | \~380 tok/s (tensor parallel) | \~60 tok/s         |

{% hint style="info" %}
**RTX 3090 at \~$0.12/hr** is the top choice for 7B–13B models. LMDeploy's TurboMind engine extracts near-maximum throughput from consumer GPUs. A single RTX 3090 serving Llama 3 8B handles 120 tok/s — sufficient for production APIs with 10–20 concurrent users.

For 70B models: A100 40GB (\~$1.20/hr) with W4A16 quantization delivers \~55 tok/s — more cost-effective than two RTX 4090s.
{% endhint %}
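
Hourly price and throughput combine into cost per token, often the number that matters for API serving. Using the approximate values from the table:

```python
def usd_per_million_tokens(price_per_hr: float, tok_per_s: float) -> float:
    """Cost per 1M generated tokens at full utilization."""
    return price_per_hr / (tok_per_s * 3600) * 1e6

print(usd_per_million_tokens(0.12, 120))  # RTX 3090:  ~$0.28 / Mtok
print(usd_per_million_tokens(0.70, 200))  # RTX 4090:  ~$0.97 / Mtok
print(usd_per_million_tokens(2.00, 175))  # A100 80GB: ~$3.17 / Mtok
```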

***

## Resources

* 📦 **Docker Hub:** [hub.docker.com/r/openmmlab/lmdeploy](https://hub.docker.com/r/openmmlab/lmdeploy)
* 🐙 **GitHub:** [github.com/InternLM/lmdeploy](https://github.com/InternLM/lmdeploy)
* 📚 **Documentation:** [lmdeploy.readthedocs.io](https://lmdeploy.readthedocs.io)
* 💬 **Discord:** [discord.gg/xa29JuW84p](https://discord.gg/xa29JuW84p)
* 🤗 **Pre-quantized Models:** [huggingface.co/lmdeploy](https://huggingface.co/lmdeploy)
