# LMDeploy

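**An efficient toolkit for deploying large models, from Shanghai AI Laboratory.** Production-oriented inference, quantization, and serving, with continuous batching and PagedAttention.

> 🏛️ Developed by **OpenMMLab / Shanghai AI Laboratory** | Apache-2.0 license | 4,000+ GitHub stars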

***

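## What is LMDeploy?

LMDeploy is a toolkit for compressing, deploying, and serving large language models in production. It is built by the team behind OpenMMLab (MMDetection, MMSegmentation) and brings research-grade optimizations to real deployments:

* **TurboMind engine**: a high-performance C++ inference backend with CUDA-optimized kernels
* **PyTorch engine**: a flexible Python-based engine with broad model compatibility
* **Continuous batching**: keeps the GPU busy across concurrent requests
* **PagedAttention**: efficient KV cache management (similar to vLLM)
* **4-bit / 8-bit quantization**: supports AWQ and SmoothQuant
* **Vision-language models**: supports InternVL, LLaVA, and Qwen-VL

Compared with vLLM, LMDeploy's TurboMind engine delivers roughly 1.36× higher throughput on Llama 3 8B at batch=32, and AWQ quantization is a first-class feature rather than an afterthought. For VLMs, InternVL2 in particular, LMDeploy is the reference deployment stack.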

### Why LMDeploy?

| Feature                   | LMDeploy | vLLM    | TGI     |
| ------------------------- | -------- | ------- | ------- |
| Continuous batching       | ✅        | ✅       | ✅       |
| AWQ quantization          | ✅        | ✅       | ✅       |
| Speculative decoding      | ✅        | ✅       | ✅       |
| Vision-language           | ✅        | Limited | Limited |
| OpenAI API                | ✅        | ✅       | ✅       |
| TurboMind (custom engine) | ✅        | ❌       | ❌       |

***

## Quick start on Clore.ai

### Step 1: Choose a GPU server

On the [clore.ai](https://clore.ai) marketplace:

* **Minimum:** NVIDIA GPU with 8 GB VRAM (enough for a 7B model quantized to 4 bits)
* **Recommended:** RTX 3090/4090 (24 GB) or A100 (40/80 GB)
* **CUDA:** 11.8 or 12.x required

### Step 2: Deploy the LMDeploy Docker image

```
Docker image: openmmlab/lmdeploy
```

**Port mappings:**

| Container port | Purpose             |
| -------------- | ------------------- |
| `22`           | SSH access          |
| `23333`        | LMDeploy API server |

**Environment variables:**

```
HUGGING_FACE_HUB_TOKEN=your_hf_token_here  # required for gated models
```

### Step 3: SSH in and verify

```bash
ssh root@<clore-node-ip> -p <ssh-port>

# Verify the installation
python -c "import lmdeploy; print(lmdeploy.__version__)"
lmdeploy --help
```

***

## Starting the API server

### OpenAI-compatible server (recommended)

```bash
# Serve Llama 3 8B with the TurboMind engine
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --model-name llama3-8b

# Select the engine explicitly and tune batching and KV cache
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --backend turbomind \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --tp 1 \
  --max-batch-size 128 \
  --cache-max-entry-count 0.8
```

### PyTorch engine (broader compatibility)

```bash
# Use the PyTorch engine for models that TurboMind does not support
lmdeploy serve api_server \
  mistralai/Mistral-7B-Instruct-v0.2 \
  --backend pytorch \
  --server-port 23333 \
  --server-name 0.0.0.0
```

### Example server startup output

```
[2024-01-01 12:00:00,000] INFO: Loading model: meta-llama/Meta-Llama-3-8B-Instruct
[2024-01-01 12:00:20,000] INFO: TurboMind engine initialized
[2024-01-01 12:00:20,000] INFO: Server started at http://0.0.0.0:23333
[2024-01-01 12:00:20,000] INFO: API docs: http://0.0.0.0:23333/docs
```

{% hint style="success" %}
Once started, LMDeploy exposes interactive API docs at `http://<your-ip>:23333/docs`, which makes it easy to test endpoints straight from the browser.
{% endhint %}
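Model loading can take a while, so a client usually has to wait before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route until it answers with HTTP 200. The helper names `fetch_status` and `wait_until_ready` are illustrative and not part of LMDeploy; the URL is a placeholder for your node's address and mapped port.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(url, attempts=30, delay=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll `url` until it answers 200 or the attempts run out."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False


# Example (placeholder address):
# wait_until_ready("http://localhost:23333/v1/models")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live server.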

***

## Supported models

### Text models

```bash
# Llama 3
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-70B-Instruct

# Mistral / Mixtral
mistralai/Mistral-7B-Instruct-v0.2
mistralai/Mixtral-8x7B-Instruct-v0.1

# Qwen
Qwen/Qwen2-7B-Instruct
Qwen/Qwen2-72B-Instruct

# InternLM
internlm/internlm2-chat-7b
internlm/internlm2-chat-20b

# Yi
01-ai/Yi-1.5-9B-Chat
01-ai/Yi-1.5-34B-Chat

# Gemma
google/gemma-7b-it
google/gemma-2b-it
```

### Vision-language models

```bash
# InternVL (recommended VLM)
OpenGVLab/InternVL2-8B
OpenGVLab/InternVL2-26B

# LLaVA
llava-hf/llava-1.5-7b-hf

# Qwen-VL
Qwen/Qwen-VL-Chat
```

***

## Quantization

### AWQ 4-bit quantization

LMDeploy's AWQ (Activation-aware Weight Quantization) preserves quality well at 4 bits:

```bash
# Quantize the model to 4-bit AWQ
lmdeploy lite auto_awq \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --calib-dataset ptb \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir ./quantized/llama3-8b-awq

# Serve the quantized model (--model-format awq marks the weights as AWQ)
lmdeploy serve api_server \
  ./quantized/llama3-8b-awq \
  --model-format awq \
  --server-port 23333 \
  --server-name 0.0.0.0
```

### SmoothQuant W8A8

8-bit weight and activation quantization (a better fit for throughput-bound deployments):

```bash
lmdeploy lite smooth_quant \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --work-dir ./quantized/llama3-8b-sq \
  --calib-dataset ptb \
  --calib-samples 512
```

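### Quantization impact

| Quantization     | VRAM (7B) | Quality loss | Throughput gain |
| ---------------- | --------- | ------------ | --------------- |
| None (bf16)      | ≈14 GB    | None         | Baseline        |
| SmoothQuant W8A8 | ≈8 GB     | Minimal      | +20%            |
| AWQ W4A16        | ≈4 GB     | Low          | +15%            |
| GPTQ W4A16       | ≈4 GB     | Low          | +10%            |

{% hint style="info" %}
**AWQ recommendation:** For most use cases, 4-bit AWQ offers the best balance between quality and VRAM savings. Using `--w-group-size 128` yields better quality than a coarser group size, at slightly higher memory use.
{% endhint %}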
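The VRAM column follows from simple arithmetic: parameter count times bytes per weight. The sketch below covers weights only; the helper name `weight_memory_gb` is illustrative, and real usage is somewhat higher because of quantization scales, layers that stay unquantized, activations, and the KV cache.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


for label, bits in [("bf16", 16), ("W8A8", 8), ("W4A16", 4)]:
    print(f"7B {label}: ~{weight_memory_gb(7, bits):.1f} GB of weights")
```

The same function shows why a 70B model in bf16 needs roughly 140 GB for weights alone, which is more than any single card listed in this guide.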

***

## API usage examples

### Python client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<api-port>/v1",
    api_key="none"
)

# Chat completion
response = client.chat.completions.create(
    model="llama3-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the history of AI in 3 sentences."}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "Write a poem about space."}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```

### LMDeploy native Python pipeline

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Direct pipeline (no server needed)
pipe = pipeline(
    'meta-llama/Meta-Llama-3-8B-Instruct',
    backend_config=TurbomindEngineConfig(max_batch_size=16)
)

# Single inference
response = pipe("What is the capital of France?")
print(response.text)

# Batch inference
responses = pipe([
    "Explain gravity",
    "What is DNA?",
    "How does Bitcoin work?"
])
for r in responses:
    print(r.text)
    print("---")
```

### Vision-language models

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')

image = load_image('https://example.com/photo.jpg')
response = pipe(('Describe this image in detail', image))
print(response.text)
```

***

## Multi-GPU deployment

### Tensor parallelism

```bash
# Shard a 70B model across 4 GPUs
lmdeploy serve api_server \
  meta-llama/Meta-Llama-3-70B-Instruct \
  --backend turbomind \
  --server-port 23333 \
  --server-name 0.0.0.0 \
  --tp 4 \
  --max-batch-size 64
```

```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'meta-llama/Meta-Llama-3-70B-Instruct',
    backend_config=TurbomindEngineConfig(tp=4)
)
```
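A rough way to pick `--tp` is to find the smallest GPU count whose share of the weights fits on one card. The sketch below assumes power-of-two GPU counts and reserves an assumed 10% of VRAM for KV cache and activations; `min_tensor_parallel` and `usable_fraction` are illustrative names, not LMDeploy settings.

```python
def min_tensor_parallel(params_billion, bits_per_weight, vram_gb,
                        usable_fraction=0.9, max_tp=8):
    """Smallest power-of-two GPU count whose per-GPU weight shard fits.

    usable_fraction is an assumed share of VRAM available for weights.
    Returns None if even max_tp GPUs are not enough.
    """
    weights_gb = params_billion * bits_per_weight / 8
    tp = 1
    while tp <= max_tp:
        if weights_gb / tp <= vram_gb * usable_fraction:
            return tp
        tp *= 2
    return None


print("70B bf16 on 40 GB cards:", min_tensor_parallel(70, 16, 40))
print("70B bf16 on 80 GB cards:", min_tensor_parallel(70, 16, 80))
print("70B 4-bit on 40 GB cards:", min_tensor_parallel(70, 4, 40))
```

This only checks that the weights fit. A deployment that needs long contexts or large batches should leave more headroom than the 10% assumed here.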

***

## Advanced configuration

### TurboMind engine configuration

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    max_batch_size=64,          # maximum concurrent requests
    cache_max_entry_count=0.8,  # KV cache ratio (0.0-1.0)
    quant_policy=0,             # 0 = no quantization, 4 = 4-bit KV cache, 8 = 8-bit KV cache
    rope_scaling_factor=1.0,    # for extended context
    num_tokens_per_iter=4096,   # prefill chunk size
    max_prefill_token_num=8192, # maximum prefill length
)

pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct', backend_config=engine_config)
```

### Generation configuration

```python
from lmdeploy import GenerationConfig

gen_config = GenerationConfig(
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_new_tokens=1024,
    stop_words=['<|eot_id|>', '<|end_of_text|>'],
)

response = pipe("Hello, world!", gen_config=gen_config)
```

***

## Monitoring and metrics

### Check server health

```bash
# Health check endpoint
curl http://localhost:23333/health

# List available models
curl http://localhost:23333/v1/models

# Server statistics
curl http://localhost:23333/stats
```

### GPU monitoring

```bash
# Live GPU status
watch -n 1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv'
```
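For logging or alerting, the CSV that `nvidia-smi` prints can be parsed with the standard library. The sketch below works on a captured string; the `sample` text is made up for illustration, and `parse_gpu_csv` and `memory_used_fraction` are hypothetical helper names.

```python
import csv
import io


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts."""
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    header = next(reader)
    # Header cells look like "memory.used [MiB]"; keep only the field name.
    keys = [cell.split(" [")[0] for cell in header]
    return [dict(zip(keys, row)) for row in reader if row]


def memory_used_fraction(gpu: dict) -> float:
    """Used memory as a fraction of used + free (ignores reserved memory)."""
    used = float(gpu["memory.used"].split()[0])
    free = float(gpu["memory.free"].split()[0])
    return used / (used + free)


# Illustrative sample, not a real measurement.
sample = """name, memory.used [MiB], memory.free [MiB], utilization.gpu [%]
NVIDIA GeForce RTX 4090, 18000 MiB, 6000 MiB, 97 %
"""
for gpu in parse_gpu_csv(sample):
    print(gpu["name"], f"{memory_used_fraction(gpu):.0%} memory in use")
```

In practice the text would come from running the same `nvidia-smi` query through `subprocess.run(..., capture_output=True, text=True)`.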

***

## Docker Compose example

```yaml
version: '3.8'
services:
  lmdeploy:
    image: openmmlab/lmdeploy:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    ports:
      - "23333:23333"
      - "22:22"
    volumes:
      - hf-cache:/root/.cache/huggingface
      - ./models:/models
    command: >
      lmdeploy serve api_server
      meta-llama/Meta-Llama-3-8B-Instruct
      --server-port 23333
      --server-name 0.0.0.0
      --model-name llama3-8b
      --max-batch-size 64
    restart: unless-stopped
    shm_size: '2g'

volumes:
  hf-cache:
```

***

## Benchmarking

```bash
# Built-in benchmark tool
lmdeploy benchmark \
  meta-llama/Meta-Llama-3-8B-Instruct \
  --backend turbomind \
  --concurrency 1 4 8 16 32 \
  --num-prompts 1000 \
  --prompt-len 128 \
  --output-len 256
```

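Example output (RTX 4090, TurboMind, bf16):

```
concurrency=1:  throughput=42.3 tokens/s, latency p50=23ms
concurrency=8:  throughput=287.1 tokens/s, latency p50=156ms
concurrency=32: throughput=412.6 tokens/s, latency p50=621ms
```

On an A100 80GB, expect roughly 2.2× the RTX 4090's throughput at high concurrency, thanks to its HBM2e memory bandwidth (about 2 TB/s versus 1 TB/s).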
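One way to read the example output is to compare aggregate throughput against perfect linear scaling from the single-request run. The sketch below uses the three sample figures from the output above; `scaling_efficiency` is an illustrative helper, not an LMDeploy function.

```python
# (concurrency, aggregate tokens/s) copied from the example output
samples = [(1, 42.3), (8, 287.1), (32, 412.6)]


def scaling_efficiency(samples):
    """Throughput at each concurrency relative to linear scaling from the first run."""
    base_conc, base_tps = samples[0]
    per_stream = base_tps / base_conc
    return {conc: tps / (per_stream * conc) for conc, tps in samples}


for conc, eff in scaling_efficiency(samples).items():
    print(f"concurrency={conc}: {eff:.0%} of linear scaling")
```

Efficiency falling as concurrency rises is the usual sign that the GPU is saturating, so extra concurrent requests mostly add latency rather than throughput.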

***

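## GPU recommendations for Clore.ai

Choose based on target model size and serving load:

| Use case                    | GPU           | VRAM  | Why                                                                                                                             |
| --------------------------- | ------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------- |
| 7–13B models, dev/staging   | **RTX 3090**  | 24 GB | Best $/VRAM ratio; handles 7B bf16 or 13B AWQ                                                                                   |
| 7–13B models, production    | **RTX 4090**  | 24 GB | About 40% faster than the 3090 at the same VRAM; 412 tok/s on Llama 3 8B                                                        |
| 70B models, team serving    | **A100 40GB** | 40 GB | Fits 70B AWQ; ECC memory for reliability                                                                                        |
| 70B models, high throughput | **A100 80GB** | 80 GB | Fits 70B AWQ with room for a large KV cache (70B bf16 needs about 140 GB, so two cards with `--tp 2`); 2× the throughput of the A100 40GB at batch=32 |

**Budget pick:** RTX 3090 with 4-bit AWQ delivers about 280 tok/s on Llama 3 8B at batch=8, which covers most API use cases.

**Speed pick:** RTX 4090 offers the highest performance per dollar on 7–13B models; TurboMind gets the most out of its 1 TB/s of memory bandwidth.

**Production pick:** A100 80GB runs Qwen2-72B or Llama 3 70B on a single card with 4-bit AWQ, or in full bf16 across two cards with tensor parallelism; it is also well suited to Multi-Instance GPU (MIG) serving.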

***

## Troubleshooting

### Model fails to load

```bash
# Check that the Hugging Face token is set
echo $HUGGING_FACE_HUB_TOKEN

# Download the model manually
pip install huggingface_hub
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama3-8b

# Use the local path instead
lmdeploy serve api_server ./llama3-8b --server-port 23333
```

### CUDA out of memory

```bash
# Reduce the KV cache allocation
lmdeploy serve api_server MODEL \
  --cache-max-entry-count 0.5  # lowered from 0.8

# Use a quantized KV cache
lmdeploy serve api_server MODEL \
  --quant-policy 8  # 8-bit KV cache
```
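To see what these two settings buy, it helps to estimate how many tokens the KV cache can hold. The sketch below uses Llama 3 8B's published shape (32 layers, 8 KV heads, head dimension 128) and assumes the cache fraction applies to the memory left after the weights are loaded; check the LMDeploy docs for how your version interprets it. The 8 GiB of free memory is an assumed figure, the helper names are illustrative, and the int8 estimate ignores the small overhead of quantization scales.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def cacheable_tokens(free_gib, cache_fraction, bytes_per_token):
    """Tokens that fit when cache_fraction of the free memory goes to the KV cache."""
    return int(free_gib * 1024**3 * cache_fraction) // bytes_per_token


# Llama 3 8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_value=2)
int8 = kv_bytes_per_token(32, 8, 128, bytes_per_value=1)

free_gib = 8  # assumed memory left after loading weights; measure with nvidia-smi
for fraction in (0.8, 0.5):
    print(f"fraction={fraction}: "
          f"fp16 {cacheable_tokens(free_gib, fraction, fp16)} tokens, "
          f"int8 {cacheable_tokens(free_gib, fraction, int8)} tokens")
```

Lowering the fraction frees memory at the cost of fewer cached tokens, while an 8-bit KV cache roughly doubles the token capacity for the same budget.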

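### Port already in use

```bash
# Check which process is using port 23333
ss -tlnp | grep 23333
fuser 23333/tcp

# Kill the existing process
kill -9 $(fuser 23333/tcp)
```

{% hint style="warning" %}
**Docker networking:** When running in Docker, make sure the container uses `--network host` or a correct port mapping (`-p 23333:23333`) so the API is reachable from outside.
{% endhint %}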
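A quick TCP probe tells you whether anything is listening on the API port, both from inside the container and from a remote machine. This is a minimal sketch using only the standard library; `port_is_open` is an illustrative helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("23333 open on localhost:", port_is_open("127.0.0.1", 23333))
```

If the probe succeeds on the node but fails from your workstation, the port mapping or network mode is the likely culprit rather than LMDeploy itself.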

***

## GPU pricing and throughput on Clore.ai

LMDeploy's TurboMind engine and W4A16 quantization deliver strong throughput, especially on Ampere/Hopper GPUs.

| GPU         | VRAM  | Clore.ai price | Llama 3 8B throughput        | Llama 3 70B Q4    |
| ----------- | ----- | -------------- | ---------------------------- | ----------------- |
| RTX 3090    | 24 GB | \~$0.12/hr     | ≈120 tok/s (fp16)            | ❌ Too large       |
| RTX 4090    | 24 GB | \~$0.70/hr     | ≈200 tok/s (fp16)            | ❌ Too large       |
| A100 40GB   | 40 GB | \~$1.20/hr     | ≈160 tok/s (fp16)            | ≈55 tok/s (W4A16) |
| A100 80GB   | 80 GB | \~$2.00/hr     | ≈175 tok/s (fp16)            | ≈80 tok/s (W4A16) |
| 2× RTX 4090 | 48 GB | \~$1.40/hr     | ≈380 tok/s (tensor parallel) | ≈60 tok/s         |

{% hint style="info" %}
**The RTX 3090 at about $0.12/hr** is the first choice for 7B–13B models. LMDeploy's TurboMind engine extracts close to the maximum throughput from consumer GPUs. A single RTX 3090 serving Llama 3 8B sustains about 120 tok/s, enough for a production API with 10–20 concurrent users.

For 70B models, an A100 40GB (about $1.20/hr) with W4A16 quantization delivers about 55 tok/s, which is more cost-effective than two RTX 4090s.
{% endhint %}
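The cost-effectiveness comparison can be checked by converting hourly price and throughput into dollars per million generated tokens. The sketch below uses the 70B figures from the table above and assumes the GPU runs at the listed rate for the whole hour; `usd_per_million_tokens` is an illustrative helper.

```python
def usd_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a sustained rate."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# (price in $/hour, tok/s) for Llama 3 70B Q4, taken from the table
llama3_70b_q4 = {
    "A100 40GB": (1.20, 55),
    "2x RTX 4090": (1.40, 60),
}
for gpu, (price, tps) in llama3_70b_q4.items():
    print(f"{gpu}: ${usd_per_million_tokens(price, tps):.2f} per 1M tokens")
```

Real costs per token will be higher whenever the server sits idle, so utilization matters as much as the hourly price.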

***

## Resources

* 📦 **Docker Hub:** [hub.docker.com/r/openmmlab/lmdeploy](https://hub.docker.com/r/openmmlab/lmdeploy)
* 🐙 **GitHub:** [github.com/InternLM/lmdeploy](https://github.com/InternLM/lmdeploy)
* 📚 **Documentation:** [lmdeploy.readthedocs.io](https://lmdeploy.readthedocs.io)
* 💬 **Discord:** [discord.gg/xa29JuW84p](https://discord.gg/xa29JuW84p)
* 🤗 **Pre-quantized models:** [huggingface.co/lmdeploy](https://huggingface.co/lmdeploy)

