LMDeploy

Efficient LLM deployment toolkit by Shanghai AI Lab — production-grade inference, quantization, and serving for large language models with continuous batching and PagedAttention.

🏛️ Developed by OpenMMLab / Shanghai AI Lab | Apache-2.0 License | 4,000+ GitHub stars


What is LMDeploy?

LMDeploy is a comprehensive toolkit for compressing, deploying, and serving Large Language Models in production. Built by the same team behind OpenMMLab (MMDetection, MMSeg), it brings research-grade optimizations to practical deployment:

  • TurboMind engine — high-performance C++ inference backend with CUDA optimizations

  • PyTorch engine — flexible Python-based engine for broad model compatibility

  • Continuous batching — maximizes GPU utilization across concurrent requests

  • PagedAttention — efficient KV cache management (similar to vLLM)

  • 4-bit / 8-bit quantization — AWQ and SmoothQuant support

  • Vision-Language Models — InternVL, LLaVA, Qwen-VL support

Compared to vLLM, LMDeploy's TurboMind engine delivers ~1.36× higher throughput on Llama 3 8B at batch=32, and its AWQ quantization is first-class — not an afterthought. For VLMs (especially InternVL2), LMDeploy is the reference deployment stack.

Why LMDeploy?

| Feature | LMDeploy | vLLM | TGI |
| --- | --- | --- | --- |
| Continuous batching | ✅ | ✅ | ✅ |
| AWQ quantization | ✅ | ✅ | ✅ |
| Speculative decoding | ✅ | ✅ | ✅ |
| Vision-Language | ✅ | Limited | Limited |
| OpenAI API | ✅ | ✅ | ✅ |
| Custom inference engine | ✅ TurboMind (custom engine) | — | — |


Quick Start on Clore.ai

Step 1: Select a GPU Server

On the clore.ai marketplace:

  • Minimum: NVIDIA GPU with 8GB VRAM (for 7B models with 4-bit quantization)

  • Recommended: RTX 3090/4090 (24GB) or A100 (40/80GB)

  • CUDA: 11.8 or 12.x required

Step 2: Deploy LMDeploy Docker
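A typical launch command looks like the following (the image tag and model name are examples, not prescriptions — check Docker Hub for the current `openmmlab/lmdeploy` tag):

```shell
# Run the official LMDeploy image and start the API server in one step
docker run --gpus all -d \
  -p 22:22 -p 23333:23333 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  openmmlab/lmdeploy:latest \
  lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333
```

Mounting the Hugging Face cache as a volume avoids re-downloading model weights every time the container restarts.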

Port mappings:

| Container Port | Purpose |
| --- | --- |
| 22 | SSH access |
| 23333 | LMDeploy API server |

Environment variables:
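Gated models (e.g. Llama 3) require a Hugging Face token; the variable names below are the standard Hugging Face ones:

```shell
# Token for downloading gated models from the Hugging Face Hub
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxx
# Optional: point the model cache at a persistent volume
HF_HOME=/root/.cache/huggingface
```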

Step 3: SSH and Verify
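Connect to the instance and confirm the GPU and LMDeploy install are healthy (host and port placeholders come from your Clore dashboard):

```shell
# SSH into the rented instance
ssh -p <mapped-ssh-port> root@<instance-ip>

# Confirm the GPU is visible to the container
nvidia-smi

# Print LMDeploy's view of its CUDA/PyTorch environment
lmdeploy check_env
```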


Starting the API Server
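With the default TurboMind engine, one command starts an OpenAI-compatible server (model name is an example):

```shell
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --backend turbomind \
  --server-port 23333
```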

PyTorch Engine (Broader Compatibility)
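For architectures TurboMind does not yet cover, switch to the PyTorch backend (model name is an example):

```shell
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
  --backend pytorch \
  --server-port 23333
```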

Server Startup Output


Supported Models
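Coverage grows with every release, so print the authoritative list for your installed version:

```shell
# List every model architecture your LMDeploy version supports
lmdeploy list
```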

Text Models

Llama 2 / Llama 3, InternLM / InternLM2, Qwen / Qwen2, Mistral, Mixtral, and other mainstream architectures (non-exhaustive).

Vision-Language Models

InternVL / InternVL2, LLaVA, Qwen-VL, MiniCPM-V, and other VLMs (non-exhaustive).


Quantization

AWQ 4-bit Quantization

LMDeploy's AWQ (Activation-aware Weight Quantization) produces excellent quality at 4-bit:
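The quantization and serving steps look like this (model name and output directory are examples; calibration runs on the GPU and takes a few minutes):

```shell
# Quantize to 4-bit AWQ with group size 128
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir ./internlm2_5-7b-chat-4bit

# Serve the quantized checkpoint
lmdeploy serve api_server ./internlm2_5-7b-chat-4bit --model-format awq
```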

SmoothQuant W8A8

8-bit weight and activation quantization (better for throughput-critical deployments):
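A sketch of the SmoothQuant workflow (model name is an example; W8A8 checkpoints are served by the PyTorch engine):

```shell
# Produce a W8A8 checkpoint via SmoothQuant calibration
lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat \
  --work-dir ./internlm2_5-7b-chat-w8a8

# Serve it with the PyTorch backend
lmdeploy serve api_server ./internlm2_5-7b-chat-w8a8 --backend pytorch
```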

Quantization Impact

| Quantization | VRAM (7B) | Quality Loss | Throughput Gain |
| --- | --- | --- | --- |
| None (bf16) | ~14 GB | None | Baseline |
| SmoothQuant W8A8 | ~8 GB | Minimal | +20% |
| AWQ W4A16 | ~4 GB | Low | +15% |
| GPTQ W4A16 | ~4 GB | Low | +10% |


AWQ recommendation: For most use cases, AWQ 4-bit is the best balance of quality and VRAM savings. Use --w-group-size 128 for better quality at slightly higher memory usage.


API Usage Examples

Python Client
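Because the server speaks the OpenAI protocol, the standard `openai` client works as-is; the base URL and model name below assume the server started earlier in this guide:

```python
from openai import OpenAI

# Point the standard OpenAI client at the LMDeploy server
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

response = client.chat.completions.create(
    model="internlm/internlm2_5-7b-chat",  # must match the served model name
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```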

Streaming
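Token streaming uses the same client with `stream=True` (server URL and model name as above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

# stream=True yields chunks as tokens are generated
stream = client.chat.completions.create(
    model="internlm/internlm2_5-7b-chat",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```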

LMDeploy Native Python Client
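The native `pipeline` API runs inference in-process, with no HTTP server in between (model name is an example):

```python
from lmdeploy import pipeline, GenerationConfig

# Build an in-process inference pipeline (TurboMind backend by default)
pipe = pipeline("internlm/internlm2_5-7b-chat")
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.7)

responses = pipe(["What is PagedAttention?"], gen_config=gen_config)
print(responses[0].text)
```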

Vision-Language Model
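VLMs use the same `pipeline` API with an image attached to the prompt; the InternVL checkpoint and image URL below are placeholders:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Vision-language pipeline (example InternVL2 checkpoint)
pipe = pipeline("OpenGVLab/InternVL2-8B")

image = load_image("https://example.com/photo.jpg")  # placeholder URL
response = pipe(("Describe this image.", image))
print(response.text)
```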


Multi-GPU Deployment

Tensor Parallelism
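Sharding across GPUs is a single flag; the example below splits a 70B AWQ checkpoint over two cards (model name is an example):

```shell
# --tp sets the tensor-parallel degree (must divide the attention head count)
lmdeploy serve api_server meta-llama/Meta-Llama-3-70B-Instruct \
  --model-format awq \
  --tp 2
```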


Advanced Configuration

TurboMind Engine Config
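The main TurboMind knobs are exposed via `TurbomindEngineConfig`; the values below are illustrative, not recommendations:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # fraction of free VRAM reserved for the KV cache
    session_len=8192,           # maximum context length per session
    tp=1,                       # tensor-parallel degree
)
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
```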

Generation Config
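Sampling behavior is controlled per-request with `GenerationConfig` (values here are illustrative defaults):

```python
from lmdeploy import GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.05,
)
# Pass to a pipeline call: pipe(prompts, gen_config=gen_config)
```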


Monitoring & Metrics

Check Server Health
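Two quick probes against the running server (port assumed from the setup above; the `/health` route is my assumption — `/v1/models` is the safe fallback):

```shell
# List served models; also confirms the server is reachable
curl http://localhost:23333/v1/models

# Basic liveness probe
curl http://localhost:23333/health
```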

GPU Monitoring
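Standard NVIDIA tooling is enough to watch utilization and VRAM during serving:

```shell
# Refresh full GPU status every second
watch -n 1 nvidia-smi

# Compact per-second stats: utilization and memory
nvidia-smi dmon -s um
```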


Docker Compose Example
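A minimal compose sketch, assuming the official image and the model used throughout this guide:

```yaml
services:
  lmdeploy:
    image: openmmlab/lmdeploy:latest
    command: lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333
    ports:
      - "23333:23333"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
```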


Benchmarking
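LMDeploy ships benchmark scripts in its repository; paths and arguments change between versions, so treat this as a sketch:

```shell
# Benchmark scripts live in the repo, not the pip package
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy

# Offline throughput benchmark (dataset path is a placeholder)
python benchmark/profile_throughput.py \
    <path-to-sharegpt-dataset.json> internlm/internlm2_5-7b-chat \
    --concurrency 32
```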

Sample output (RTX 4090, TurboMind, bf16):

On A100 80GB, expect ~2.2× higher throughput vs RTX 4090 at high concurrency due to HBM2e memory bandwidth (2 TB/s vs 1 TB/s).


Clore.ai GPU Recommendations

Choose based on your target model size and serving load:

| Use Case | GPU | VRAM | Why |
| --- | --- | --- | --- |
| 7–13B models, dev/staging | RTX 3090 | 24 GB | Best $/VRAM ratio; handles 7B bf16 or 13B AWQ |
| 7–13B models, production | RTX 4090 | 24 GB | ~40% faster than 3090 at same VRAM; 412 tok/s on Llama 3 8B |
| 70B models, team serving | A100 40GB | 40 GB | Fits 70B AWQ; ECC memory for reliability |
| 70B models, high throughput | A100 80GB | 80 GB | Fits 70B bf16; 2× throughput vs A100 40GB at batch=32 |

Budget pick: RTX 3090 + AWQ 4-bit — serves Llama 3 8B at ~280 tok/s batch=8, covers most API use cases.

Speed pick: RTX 4090 — fastest per-dollar for 7–13B models; TurboMind squeezes out every GB/s of its 1 TB/s bandwidth.

Production pick: A100 80GB — run Qwen2-72B or Llama 3 70B in full bf16 without quantization quality tradeoffs; fits easily into multi-instance GPU serving.


Troubleshooting

Model Not Loading
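Common causes are an incomplete download, a missing Hugging Face token, or an unsupported architecture; model name below is an example:

```shell
# Re-download; interrupted downloads are a common cause of load failures
huggingface-cli download internlm/internlm2_5-7b-chat

# Gated models need a valid token
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxx

# Confirm your LMDeploy version supports the architecture
lmdeploy list
```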

CUDA Out of Memory
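The usual fix is shrinking the KV cache's share of free VRAM and capping context length (values below are illustrative):

```shell
# Lower the KV-cache fraction (default ~0.8) and the max context length
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --cache-max-entry-count 0.4 \
  --session-len 4096
```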

Port Already in Use
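Either free port 23333 or move the server to another port:

```shell
# Find the process holding the port, then stop it
lsof -i :23333
kill <pid>

# Or simply serve on a different port
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23334
```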


Clore.ai Pricing & Throughput Reference

LMDeploy's TurboMind engine and W4A16 quantization deliver best-in-class throughput — especially on Ampere/Hopper GPUs.

| GPU | VRAM | Clore.ai Price | Llama 3 8B Throughput | Llama 3 70B Q4 |
| --- | --- | --- | --- | --- |
| RTX 3090 | 24 GB | ~$0.12/hr | ~120 tok/s (fp16) | ❌ Too large |
| RTX 4090 | 24 GB | ~$0.70/hr | ~200 tok/s (fp16) | ❌ Too large |
| A100 40GB | 40 GB | ~$1.20/hr | ~160 tok/s (fp16) | ~55 tok/s (W4A16) |
| A100 80GB | 80 GB | ~$2.00/hr | ~175 tok/s (fp16) | ~80 tok/s (fp16) |
| 2× RTX 4090 | 48 GB | ~$1.40/hr | ~380 tok/s (tensor parallel) | ~60 tok/s |


RTX 3090 at ~$0.12/hr is the top choice for 7B–13B models. LMDeploy's TurboMind engine extracts near-maximum throughput from consumer GPUs. A single RTX 3090 serving Llama 3 8B handles 120 tok/s — sufficient for production APIs with 10–20 concurrent users.

For 70B models: A100 40GB (~$1.20/hr) with W4A16 quantization delivers ~55 tok/s — more cost-effective than two RTX 4090s.


Resources
