# LMDeploy

**Efficient LLM deployment toolkit by Shanghai AI Lab** — production-grade inference, quantization, and serving for large language models with continuous batching and PagedAttention.

> 🏛️ Developed by **OpenMMLab / Shanghai AI Lab** | Apache-2.0 License | 4,000+ GitHub stars

***

## What is LMDeploy?

LMDeploy is a comprehensive toolkit for compressing, deploying, and serving Large Language Models in production. Developed by the MMRazor and MMDeploy teams within OpenMMLab (the organization behind MMDetection and MMSegmentation), it brings research-grade optimizations to practical deployment:

* **TurboMind engine** — high-performance C++ inference backend with CUDA optimizations
* **PyTorch engine** — flexible Python-based engine for broad model compatibility
* **Continuous batching** — maximizes GPU utilization across concurrent requests
* **PagedAttention** — efficient KV cache management (similar to vLLM)
* **4-bit / 8-bit quantization** — AWQ and SmoothQuant support
* **Vision-Language Models** — InternVL, LLaVA, Qwen-VL support

Compared to vLLM, LMDeploy's TurboMind engine delivers \~1.36× higher throughput on Llama 3 8B at batch=32, and its AWQ quantization is first-class — not an afterthought. For VLMs (especially InternVL2), LMDeploy is the reference deployment stack.

### Why LMDeploy?

| Feature                   | LMDeploy | vLLM    | TGI     |
| ------------------------- | -------- | ------- | ------- |
| Continuous batching       | ✅        | ✅       | ✅       |
| AWQ quantization          | ✅        | ✅       | ❌       |
| Speculative decoding      | ✅        | ✅       | ✅       |
| Vision-Language           | ✅        | Limited | Limited |
| OpenAI API                | ✅        | ✅       | ✅       |
| TurboMind (custom engine) | ✅        | ❌       | ❌       |

***

## Quick Start on Clore.ai

### Step 1: Select a GPU Server

On [clore.ai](https://clore.ai) marketplace:

* **Minimum:** NVIDIA GPU with 8 GB VRAM for 7B models (a quick sizing sketch follows this list)
* **Recommended:** RTX 3090/4090 (24GB) or A100 (40/80GB)
* **CUDA:** 11.8 or 12.x required
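
As a back-of-envelope check before renting, weight memory is roughly parameter count × bytes per weight, plus headroom for the KV cache and activations. A minimal sketch (the `estimate_vram_gb` helper and the 1.2× headroom factor are illustrative assumptions, not LMDeploy internals):

```python
# Rough VRAM sizing: weights + headroom for KV cache/activations.
# Hypothetical helper for planning only; real usage depends on the
# engine, context length, batch size, and --cache-max-entry-count.

def estimate_vram_gb(params_billions: float, bytes_per_weight: float,
                     headroom: float = 1.2) -> float:
    return params_billions * bytes_per_weight * headroom

# Llama 3 8B: bf16 uses 2 bytes/weight, AWQ 4-bit roughly 0.5
print(f"bf16: ~{estimate_vram_gb(8, 2.0):.0f} GB")  # ~19 GB -> needs a 24 GB card
print(f"AWQ4: ~{estimate_vram_gb(8, 0.5):.0f} GB")  # ~5 GB  -> fits an 8 GB card
```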

### Step 2: Deploy LMDeploy Docker

```
Docker Image: openmmlab/lmdeploy
```

**Port mappings:**

| Container Port | Purpose             |
| -------------- | ------------------- |
| `22`           | SSH access          |
| `23333`        | LMDeploy API server |

**Environment variables:**

```
HUGGING_FACE_HUB_TOKEN=your_hf_token_here  # For gated models
```

### Step 3: SSH and Verify

```bash
ssh root@<clore-node-ip> -p <ssh-port>

# Verify installation
python -c "import lmdeploy; print(lmdeploy.__version__)"
lmdeploy --help
```

***

## Starting the API Server

### OpenAI-Compatible Server (Recommended)

```bash
# Serve Llama 3 8B with TurboMind engine
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --model-name llama3-8b

# With explicit engine selection
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --backend turbomind \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --tp 1 \
  --max-batch-size 128 \
  --cache-max-entry-count 0.8
```

### PyTorch Engine (Broader Compatibility)

```bash
# Use PyTorch engine for models not supported by TurboMind
lmdeploy serve api_server \
  mistralai/Mistral-7B-Instruct-v0.2 \
  --backend pytorch \
  --server-port 23333 \
  --server-name 0.0.0.0
```

### Server Startup Output

```
[2024-01-01 12:00:00,000] INFO: Loading model: meta-llama/Meta-Llama-3-8B-Instruct
[2024-01-01 12:00:20,000] INFO: TurboMind engine initialized
[2024-01-01 12:00:20,000] INFO: Server started at http://0.0.0.0:23333
[2024-01-01 12:00:20,000] INFO: API docs: http://0.0.0.0:23333/docs
```

{% hint style="success" %}
Once started, LMDeploy exposes interactive API docs at `http://<your-ip>:23333/docs` — useful for testing endpoints directly from the browser.
{% endhint %}
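
A quick smoke test from the node itself (assuming the server was started with `--model-name llama3-8b` as above):

```bash
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```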

***

## Supported Models

### Text Models

```bash
# Llama 3
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-70B-Instruct

# Mistral / Mixtral
mistralai/Mistral-7B-Instruct-v0.2
mistralai/Mixtral-8x7B-Instruct-v0.1

# Qwen
Qwen/Qwen2-7B-Instruct
Qwen/Qwen2-72B-Instruct

# InternLM
internlm/internlm2-chat-7b
internlm/internlm2-chat-20b

# Yi
01-ai/Yi-1.5-9B-Chat
01-ai/Yi-1.5-34B-Chat

# Gemma
google/gemma-7b-it
google/gemma-2b-it
```

### Vision-Language Models

```bash
# InternVL (recommended VLM)
OpenGVLab/InternVL2-8B
OpenGVLab/InternVL2-26B

# LLaVA
llava-hf/llava-1.5-7b-hf

# Qwen-VL
Qwen/Qwen-VL-Chat
```
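
These lists are a snapshot. To check what your installed version supports, the CLI can print the current support matrix (available in recent LMDeploy releases):

```bash
# List model architectures supported by each engine
lmdeploy list
```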

***

## Quantization

### AWQ 4-bit Quantization

LMDeploy's AWQ (Activation-aware Weight Quantization) produces excellent quality at 4-bit:

```bash
# Quantize a model to AWQ 4-bit
lmdeploy lite auto_awq \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --calib-dataset ptb \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir ./quantized/llama3-8b-awq

# Serve the quantized model
lmdeploy serve api_server \
  ./quantized/llama3-8b-awq \
  --model-format awq \
  --server-port 23333 \
  --server-name 0.0.0.0
```

### SmoothQuant W8A8

8-bit weight and activation quantization (better for throughput-critical deployments):

```bash
lmdeploy lite smooth_quant \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --work-dir ./quantized/llama3-8b-sq \
  --calib-dataset ptb \
  --calib-samples 512
```

### Quantization Impact

| Quantization     | VRAM (7B) | Quality Loss | Throughput Gain |
| ---------------- | --------- | ------------ | --------------- |
| None (bf16)      | \~14GB    | None         | Baseline        |
| SmoothQuant W8A8 | \~8GB     | Minimal      | +20%            |
| AWQ W4A16        | \~4GB     | Low          | +15%            |
| GPTQ W4A16       | \~4GB     | Low          | +10%            |

{% hint style="info" %}
**AWQ recommendation:** For most use cases, AWQ 4-bit is the best balance of quality and VRAM savings. Use `--w-group-size 128` for better quality at slightly higher memory usage.
{% endhint %}

***

## API Usage Examples

### Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<api-port>/v1",
    api_key="none"
)

# Chat completion
response = client.chat.completions.create(
    model="llama3-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the history of AI in 3 sentences."}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "Write a poem about space."}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```

### LMDeploy Native Python Client

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Direct pipeline (no server needed)
pipe = pipeline(
    'meta-llama/Meta-Llama-3-8B-Instruct',
    backend_config=TurbomindEngineConfig(max_batch_size=16)
)

# Single inference
response = pipe("What is the capital of France?")
print(response.text)

# Batch inference
responses = pipe([
    "Explain gravity",
    "What is DNA?",
    "How does Bitcoin work?"
])
for r in responses:
    print(r.text)
    print("---")
```

### Vision-Language Model

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')

image = load_image('https://example.com/photo.jpg')
response = pipe(('Describe this image in detail', image))
print(response.text)
```

***

## Multi-GPU Deployment

### Tensor Parallelism

```bash
# Distribute a 70B model across 4 GPUs
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-70B-Instruct \
  --backend turbomind \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --tp 4 \
  --max-batch-size 64
```

```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'meta-llama/Meta-Llama-3-70B-Instruct',
    backend_config=TurbomindEngineConfig(tp=4)
)
```

***

## Advanced Configuration

### TurboMind Engine Config

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    max_batch_size=64,          # Maximum concurrent requests
    cache_max_entry_count=0.8,  # KV cache ratio (0.0-1.0)
    quant_policy=0,             # 0=no quant, 4=4bit KV cache, 8=8bit KV cache
    rope_scaling_factor=1.0,    # For extended context
    num_tokens_per_iter=4096,   # Prefill chunk size
    max_prefill_token_num=8192, # Max prefill length
)

pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct', backend_config=engine_config)
```

### Generation Config

```python
from lmdeploy import GenerationConfig

gen_config = GenerationConfig(
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_new_tokens=1024,
    stop_words=['<|eot_id|>', '<|end_of_text|>'],
)

response = pipe("Hello, world!", gen_config=gen_config)
```

***

## Monitoring & Metrics

### Check Server Health

```bash
# Health check endpoint
curl http://localhost:23333/health

# List available models
curl http://localhost:23333/v1/models

# Server statistics (endpoint availability varies by LMDeploy version)
curl http://localhost:23333/stats
```
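
Large models can take minutes to load, so deploy scripts shouldn't assume the API is up immediately. A minimal readiness poll against the `/health` endpoint (a sketch using the third-party `requests` package; endpoint path as documented above):

```python
import time

import requests  # pip install requests

def wait_for_server(base_url: str = "http://localhost:23333",
                    timeout_s: float = 300.0) -> bool:
    """Poll /health until the server answers 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server still loading the model
        time.sleep(5)
    return False

print("ready" if wait_for_server() else "timed out")
```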

### GPU Monitoring

```bash
# Real-time GPU stats
watch -n 1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv'
```

***

## Docker Compose Example

```yaml
version: '3.8'
services:
  lmdeploy:
    image: openmmlab/lmdeploy:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    ports:
      - "23333:23333"
      - "22:22"
    volumes:
      - hf-cache:/root/.cache/huggingface
      - ./models:/models
    command: >
      lmdeploy serve api_server
      meta-llama/Meta-Llama-3-8B-Instruct
      --server-port 23333
      --server-name 0.0.0.0
      --model-name llama3-8b
      --max-batch-size 64
    restart: unless-stopped
    shm_size: '2g'

volumes:
  hf-cache:
```
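
Typical lifecycle commands for this stack:

```bash
# Start in the background and follow startup logs
docker compose up -d
docker compose logs -f lmdeploy

# Stop; the named hf-cache volume keeps downloaded weights for the next run
docker compose down
```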

***

## Benchmarking

```bash
# Built-in benchmark tooling; exact command and flags vary by version
# (the repo also ships benchmark/profile_*.py scripts)
lmdeploy benchmark \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --backend turbomind \
  --concurrency 1 4 8 16 32 \
  --num-prompts 1000 \
  --prompt-len 128 \
  --output-len 256
```

Sample output (RTX 4090, TurboMind, bf16):

```
concurrency=1:  throughput=42.3 tokens/s, latency_p50=23ms
concurrency=8:  throughput=287.1 tokens/s, latency_p50=156ms
concurrency=32: throughput=412.6 tokens/s, latency_p50=621ms
```

On an A100 80GB, expect roughly 2.2× higher throughput than an RTX 4090 at high concurrency, driven by memory bandwidth (\~2 TB/s HBM2e vs \~1 TB/s GDDR6X).
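
If the built-in tool isn't available in your build, a rough client-side probe against the running API server gives comparable numbers. A sketch using the OpenAI client (assumes the server reports token usage and was started with `--model-name llama3-8b`):

```python
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

def one_request(_: int) -> int:
    """Issue one chat completion and return the number of output tokens."""
    resp = client.chat.completions.create(
        model="llama3-8b",
        messages=[{"role": "user", "content": "Explain entropy briefly."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

CONCURRENCY, N_REQUESTS = 8, 32
start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    total_tokens = sum(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.time() - start
print(f"{total_tokens / elapsed:.1f} output tokens/s at concurrency={CONCURRENCY}")
```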

***

## Clore.ai GPU Recommendations

Choose based on your target model size and serving load:

| Use Case                    | GPU           | VRAM  | Why                                                          |
| --------------------------- | ------------- | ----- | ------------------------------------------------------------ |
| 7–13B models, dev/staging   | **RTX 3090**  | 24 GB | Best $/VRAM ratio; handles 7B bf16 or 13B AWQ                |
| 7–13B models, production    | **RTX 4090**  | 24 GB | \~40% faster than 3090 at same VRAM; 412 tok/s on Llama 3 8B |
| 70B models, team serving    | **A100 40GB** | 40 GB | Fits 70B AWQ; ECC memory for reliability                     |
| 70B models, high throughput | **A100 80GB** | 80 GB | Fits 70B bf16; 2× throughput vs A100 40GB at batch=32        |

**Budget pick:** RTX 3090 + AWQ 4-bit — serves Llama 3 8B at \~280 tok/s batch=8, covers most API use cases.

**Speed pick:** RTX 4090 — fastest per-dollar for 7–13B models; TurboMind squeezes out every GB/s of its 1 TB/s bandwidth.

**Production pick:** A100 80GB — run Qwen2-72B or Llama 3 70B in full bf16 without quantization quality tradeoffs; fits easily into multi-instance GPU serving.

***

## Troubleshooting

### Model Not Loading

```bash
# Check HuggingFace token is set
echo $HUGGING_FACE_HUB_TOKEN

# Manually download model
pip install huggingface_hub
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama3-8b

# Use local path instead
lmdeploy serve api_server ./llama3-8b --server-port 23333
```

### CUDA Out of Memory

```bash
# Reduce KV cache allocation
lmdeploy serve api_server MODEL \
  --cache-max-entry-count 0.5  # Reduce from 0.8

# Use quantized KV cache
lmdeploy serve api_server MODEL \
  --quant-policy 8  # 8-bit KV cache
```

### Port Already in Use

```bash
# Check what's using port 23333
ss -tlnp | grep 23333
fuser 23333/tcp

# Kill existing process
fuser -k 23333/tcp
```

{% hint style="warning" %}
**Docker network mode:** When running in Docker, ensure the container uses `--network host` or proper port mapping (`-p 23333:23333`) so the API is reachable from outside.
{% endhint %}
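
For reference, a manual `docker run` with explicit port mapping (same image and flags as used elsewhere in this guide):

```bash
docker run --gpus all --rm \
  -p 23333:23333 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  openmmlab/lmdeploy:latest \
  lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
  --server-port 23333 --server-name 0.0.0.0
```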

***

## Clore.ai Pricing & Throughput Reference

LMDeploy's TurboMind engine and W4A16 quantization deliver best-in-class throughput, especially on Ampere/Hopper GPUs. The figures below are rough, load-dependent estimates.

| GPU         | VRAM  | Clore.ai Price | Llama 3 8B Throughput         | Llama 3 70B Q4     |
| ----------- | ----- | -------------- | ----------------------------- | ------------------ |
| RTX 3090    | 24 GB | \~$0.12/hr     | \~120 tok/s (fp16)            | ❌ Too large        |
| RTX 4090    | 24 GB | \~$0.70/hr     | \~200 tok/s (fp16)            | ❌ Too large        |
| A100 40GB   | 40 GB | \~$1.20/hr     | \~160 tok/s (fp16)            | \~55 tok/s (W4A16) |
| A100 80GB   | 80 GB | \~$2.00/hr     | \~175 tok/s (fp16)            | \~80 tok/s (fp16)  |
| 2× RTX 4090 | 48 GB | \~$1.40/hr     | \~380 tok/s (tensor parallel) | \~60 tok/s         |

{% hint style="info" %}
**RTX 3090 at \~$0.12/hr** is the top choice for 7B–13B models. LMDeploy's TurboMind engine extracts near-maximum throughput from consumer GPUs. A single RTX 3090 serving Llama 3 8B handles 120 tok/s — sufficient for production APIs with 10–20 concurrent users.

For 70B models: A100 40GB (\~$1.20/hr) with W4A16 quantization delivers \~55 tok/s — more cost-effective than two RTX 4090s.
{% endhint %}
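
To compare options on cost rather than raw speed, divide hourly price by token throughput. A quick calculation using the table's figures (illustrative; marketplace prices fluctuate):

```python
# $/1M output tokens = hourly price / (tok/s * 3600 s/hr) * 1e6
def usd_per_mtok(price_per_hr: float, tok_per_s: float) -> float:
    return price_per_hr / (tok_per_s * 3600) * 1e6

for gpu, price, tps in [("RTX 3090", 0.12, 120),
                        ("RTX 4090", 0.70, 200),
                        ("A100 80GB", 2.00, 175)]:
    print(f"{gpu}: ${usd_per_mtok(price, tps):.2f} per 1M output tokens")
# The RTX 3090 comes out cheapest per token, matching the recommendation above.
```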

***

## Resources

* 📦 **Docker Hub:** [hub.docker.com/r/openmmlab/lmdeploy](https://hub.docker.com/r/openmmlab/lmdeploy)
* 🐙 **GitHub:** [github.com/InternLM/lmdeploy](https://github.com/InternLM/lmdeploy)
* 📚 **Documentation:** [lmdeploy.readthedocs.io](https://lmdeploy.readthedocs.io)
* 💬 **Discord:** [discord.gg/xa29JuW84p](https://discord.gg/xa29JuW84p)
* 🤗 **Pre-quantized Models:** [huggingface.co/lmdeploy](https://huggingface.co/lmdeploy)

