# SGLang

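SGLang (Structured Generation Language) is a high-performance serving framework for large language models, developed by the LMSYS team, which is known for Vicuna and Chatbot Arena. It offers RadixAttention for KV cache sharing, efficient support for MoE (Mixture of Experts) models, and an OpenAI-compatible API, which makes it one of the fastest open-source inference engines available on CLORE.AI GPU servers.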

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum                                         | Recommended          |
| --------- | ----------------------------------------------- | -------------------- |
| RAM       | 16 GB                                           | 32 GB or more        |
| VRAM      | 8 GB                                            | 24 GB or more        |
| Disk      | 50 GB                                           | 200 GB or more       |
| GPU       | NVIDIA Turing or newer (RTX 20 series and later) | A100, H100, RTX 4090 |

{% hint style="info" %}
SGLang performs best on Ampere or newer GPUs with FlashInfer enabled. For MoE models such as Mixtral or DeepSeek, a multi-GPU setup is recommended.
{% endhint %}

## Quick Deployment on CLORE.AI

**Docker image:** `lmsysorg/sglang:latest`

**Ports:** `22/tcp`, `30000/http`

**Environment variables:**

| Variable               | Example     | Description                        |
| ---------------------- | ----------- | ---------------------------------- |
| `HF_TOKEN`             | `hf_xxx...` | HuggingFace token for gated models |
| `CUDA_VISIBLE_DEVICES` | `0,1`       | GPUs to use                        |

## Step-by-Step Setup

### 1. Rent a GPU server on CLORE.AI

Visit the [CLORE.AI Marketplace](https://clore.ai/marketplace) and choose a server:

* **7B models**: at least 16 GB VRAM (RTX 4080, A10)
* **13B models**: 24 GB VRAM (RTX 3090, RTX 4090, A5000)
* **70B models**: 80 GB or more VRAM (A100 80GB) or multiple GPUs
* **MoE models (Mixtral 8x7B)**: 48 GB VRAM or 2×24 GB
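
The VRAM tiers above can be sanity-checked with a rough weights-only estimate: parameter count multiplied by bytes per parameter. The sketch below is an approximation, not an SGLang utility. The helper name `weight_memory_gb` and the byte-size table are illustrative assumptions, and the result ignores the KV cache, activations, and runtime overhead.

```python
# Rough, weights-only VRAM estimate (illustrative helper, not part of SGLang).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "fp8": 1.0,
    "int4": 0.5,  # e.g. 4-bit AWQ/GPTQ, ignoring scale and zero-point overhead
}


def weight_memory_gb(params_billion: float, dtype: str = "bfloat16") -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


for size in (7, 13, 70):
    print(
        f"{size}B: bf16 ~{weight_memory_gb(size):.0f} GB, "
        f"4-bit ~{weight_memory_gb(size, 'int4'):.1f} GB"
    )
```

By this estimate a 13B model in bfloat16 needs about 26 GB for weights alone, so fitting it on a 24 GB card implies 8-bit or 4-bit quantization, and the same reasoning applies to the larger tiers.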

### 2. SSH into your server

```bash
ssh -p <PORT> root@<SERVER_IP>
```

### 3. Pull the SGLang Docker image

```bash
docker pull lmsysorg/sglang:latest
```

### 4. Launch the SGLang server

**Basic launch (Llama 3.1 8B):**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 16g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```

**With a HuggingFace token:**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 16g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  -e HF_TOKEN=hf_your_token_here \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
```

**Qwen2.5 72B on multiple GPUs:**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 32g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tp 2 \
    --dtype bfloat16
```

**DeepSeek-V2-Lite (MoE model):**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 32g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite-Chat \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code \
    --tp 1
```

### 5. Check server health

```bash
# View logs
docker logs -f sglang

# Health check (model loading takes roughly 2-3 minutes)
curl http://localhost:30000/health

# Get model info
curl http://localhost:30000/get_model_info
```
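
Because the model takes a few minutes to load, a script that depends on the server may want to poll `/health` instead of sleeping for a fixed time. The following is a minimal sketch using only the standard library. The function names, the 600-second limit, and the injectable `probe` and `sleep` parameters are assumptions made for illustration and testing.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_healthy(
    url: str = "http://localhost:30000/health",
    max_wait_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or `max_wait_s` has elapsed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Calling `wait_until_healthy()` with the defaults blocks until the local server responds or ten minutes pass, and returns a boolean that the caller can act on.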

### 6. Access from outside via the CLORE.AI proxy

Your CLORE.AI dashboard provides an `http_pub` URL for port 30000:

```
https://<order-id>-30000.clore.ai/
```

Use this URL, with the `/v1` path appended, as the base URL in any OpenAI-compatible client.
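
OpenAI-style clients expect the base URL to end in `/v1`, as the localhost examples in this guide do. A small normalizing helper avoids doubled or missing slashes when pasting the dashboard URL. This is a sketch: the function name is hypothetical and the hostname in the example is a placeholder, not a real order.

```python
def openai_base_url(http_pub_url: str) -> str:
    """Normalize a server root URL into an OpenAI-style base URL ending in /v1."""
    root = http_pub_url.strip().rstrip("/")
    if root.endswith("/v1"):
        return root
    return root + "/v1"


# Placeholder hostname for illustration only.
print(openai_base_url("https://example-30000.clore.ai/"))
```

The returned string can be passed as `base_url` to the OpenAI Python client.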

***

## Usage Examples

### Example 1: OpenAI-compatible chat completion

```bash
curl http://localhost:30000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a quicksort implementation in Python."}
    ],
    "max_tokens": 512,
    "temperature": 0.2
  }'
```

### Example 2: Streaming response

```bash
curl http://localhost:30000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain how transformer attention works."}
    ],
    "max_tokens": 800,
    "stream": true
  }' \
  --no-buffer
```
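
With `"stream": true`, the response arrives as server-sent events: lines of the form `data: {json}` followed by a final `data: [DONE]`. The sketch below shows one way to pull the text deltas out of such lines, assuming the OpenAI-style chunk layout (`choices[].delta.content`). The function name and the sample lines are illustrative, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style `data: ...` stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = (choice.get("delta") or {}).get("content")
            if delta:
                yield delta


# Hand-written sample events for illustration.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints: Hello world
```

In a real client the `lines` argument could come from a streaming HTTP response iterated line by line.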

### Example 3: Python OpenAI client

```python
from openai import OpenAI

# Point the client at your CLORE.AI SGLang server
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="none",  # SGLang requires no authentication by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a data science expert."},
        {"role": "user", "content": "What is gradient boosting?"},
    ],
    max_tokens=400,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

### Example 4: Text generation with SGLang's native API

SGLang's native `/generate` endpoint gives additional control over sampling:

```python
import requests

# Generate a completion
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The future of AI is",
        "sampling_params": {
            "max_new_tokens": 200,
            "temperature": 0.8,
            "top_p": 0.95,
        },
    },
)
print(response.json()["text"])
```
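
For batch inference, the same endpoint can carry several prompts in one request. The sketch below only builds the request body. It assumes that `/generate` accepts a list of strings under `text` and returns one result per prompt in order, which should be verified against the SGLang version in use. The helper name `build_batch_payload` is hypothetical.

```python
from typing import Any, Dict, List


def build_batch_payload(
    prompts: List[str],
    max_new_tokens: int = 200,
    temperature: float = 0.8,
    top_p: float = 0.95,
) -> Dict[str, Any]:
    """Build one /generate request body carrying several prompts.

    Assumption: the endpoint accepts a list under "text" and applies the
    same sampling parameters to every prompt.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")
    return {
        "text": list(prompts),
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }


payload = build_batch_payload(["The future of AI is", "The capital of France is"])
print(payload["text"])
```

The payload would be sent with `requests.post("http://localhost:30000/generate", json=payload)`. Under the assumption above, the JSON response is a list with one entry per prompt.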

### Example 5: Constrained JSON output

SGLang supports structured output generation:

```python
import json

import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name", "age", "city"],
}

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Extract information: John Smith, 35 years old, lives in New York.",
        "sampling_params": {
            "max_new_tokens": 100,
            "temperature": 0.0,
            # The schema goes inside sampling_params, serialized as a JSON string
            "json_schema": json.dumps(schema),
        },
    },
)
print(response.json()["text"])
# Example output: {"name": "John Smith", "age": 35, "city": "New York"}
```
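
Even with constrained decoding, it is reasonable to parse and check the returned text before using it downstream, for example because generation may stop at the token limit before the object is complete. The following is a minimal sketch that covers only required keys and primitive types. It is not a full JSON Schema validator, and the helper name `parse_and_check` is hypothetical.

```python
import json
from typing import Any, Dict

# Maps JSON Schema primitive type names to Python types (small subset only).
_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def parse_and_check(text: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Parse model output as JSON and check required keys and primitive types."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in schema.get("required", []):
        if key not in data:
            raise ValueError(f"missing required key: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        type_name = spec.get("type")
        expected = _PY_TYPES.get(type_name)
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if expected and (is_stray_bool or not isinstance(value, expected)):
            raise ValueError(f"wrong type for key: {key}")
    return data


person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name", "age", "city"],
}
record = parse_and_check(
    '{"name": "John Smith", "age": 35, "city": "New York"}', person_schema
)
print(record["age"])  # prints: 35
```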

***

## Configuration

### Main launch parameters

| Parameter               | Default       | Description                                                                       |
| ----------------------- | ------------- | --------------------------------------------------------------------------------- |
| `--model-path`          | required      | HuggingFace model ID or local path                                                |
| `--host`                | `127.0.0.1`   | Bind host (set to `0.0.0.0` for external access)                                  |
| `--port`                | `30000`       | Server port                                                                       |
| `--tp`                  | `1`           | Tensor parallelism degree (number of GPUs)                                        |
| `--dp`                  | `1`           | Data parallelism degree                                                           |
| `--dtype`               | `auto`        | `float16`, `bfloat16`, `float32`                                                  |
| `--mem-fraction-static` | `0.88`        | Fraction of GPU memory for static allocation (model weights plus KV cache pool)   |
| `--max-prefill-tokens`  | auto          | Maximum number of tokens in one prefill step                                      |
| `--context-length`      | model maximum | Override the maximum context length                                               |
| `--trust-remote-code`   | false         | Allow custom model code                                                           |
| `--quantization`        | none          | `awq`, `gptq`, `fp8`                                                              |
| `--load-format`         | `auto`        | `auto`, `pt`, `safetensors`                                                       |
| `--tokenizer-path`      | same as model | Custom tokenizer path                                                             |

### Quantization options

**AWQ (recommended for speed):**

```bash
python3 -m sglang.launch_server \
  --model-path casperhansen/mistral-7b-instruct-v0.2-awq \
  --quantization awq \
  --host 0.0.0.0 \
  --port 30000
```

**FP8 (for H100-class GPUs; A100 has no native FP8 tensor cores):**

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --host 0.0.0.0 \
  --port 30000
```

***

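## Performance Tuning Tips

### 1. RadixAttention: the key advantage

SGLang's RadixAttention automatically reuses the KV cache for shared prompt prefixes. This is especially useful for:

* Chatbots with long system prompts
* RAG applications with repeated context
* Batched API calls that share the same prefix

No extra configuration is needed. It is enabled by default and can be turned off with `--disable-radix-cache`.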
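
To make the prefix-reuse idea concrete, the toy below counts how many leading tokens of a new request were already seen in an earlier one. It is an illustration of prefix matching only and not SGLang's implementation: the real radix tree stores KV tensors, compresses edges, and evicts entries, none of which is modeled here. The class name and token IDs are made up.

```python
from typing import Dict, Sequence


class ToyPrefixCache:
    """Toy trie over token IDs that reports how long a cached prefix is."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def match_and_insert(self, tokens: Sequence[int]) -> int:
        """Return the number of leading tokens already cached, then cache all tokens."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok in node:
                matched += 1  # this position could reuse existing KV entries
            else:
                node[tok] = {}  # new branch; later tokens cannot match below it
            node = node[tok]
        return matched


system_prompt = [1, 2, 3, 4]  # made-up token IDs for a shared system prompt
cache = ToyPrefixCache()
first = cache.match_and_insert(system_prompt + [10, 11])
second = cache.match_and_insert(system_prompt + [20])
print(first, second)  # prints: 0 4
```

The second request matches all four system-prompt tokens, which is the portion of prefill work a prefix cache can skip.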

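### 2. Increase the KV cache size

```bash
--mem-fraction-static 0.90  # Reserve 90% of VRAM for model weights plus the KV cache pool
```

Do not set it too high: the remaining memory is needed for activations and other runtime buffers, and too little headroom leads to out-of-memory errors.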
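
The effect of this setting can be estimated with simple arithmetic, assuming (as SGLang's server-argument documentation describes) that the static fraction covers both the model weights and the KV cache pool. The sketch below uses an assumed Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128, bfloat16) and about 16 GB of weights. The function names are illustrative and the result is a rough upper bound, not a measured figure.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(
    vram_gb: float, mem_fraction_static: float, weights_gb: float, bytes_per_token: int
) -> int:
    """Approximate number of tokens the KV cache pool can hold."""
    pool_bytes = (vram_gb * mem_fraction_static - weights_gb) * 1e9
    return max(0, int(pool_bytes // bytes_per_token))


# Assumed model shape: 32 layers, 8 KV heads (GQA), head_dim 128, 2-byte dtype.
per_token = kv_bytes_per_token(32, 8, 128)
for fraction in (0.80, 0.88, 0.90):
    tokens = kv_token_capacity(24, fraction, 16, per_token)
    print(f"mem-fraction-static {fraction:.2f}: ~{tokens} cached tokens on 24 GB")
```

Under these assumptions each token costs 128 KiB of cache, so on a 24 GB card a change of a few percentage points in the fraction shifts the capacity by thousands of tokens.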

### 3. Use chunked prefill for long contexts

```bash
--chunked-prefill-size 4096  # Process long prompts in chunks of up to 4096 tokens
```
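
As a rough picture of what the chunk size means, the sketch below splits a prompt length into prefill chunks. It models only the token counts per step, not how SGLang schedules the chunks alongside decoding requests, and the function name is hypothetical.

```python
from typing import List


def prefill_chunks(prompt_tokens: int, chunk_size: int = 4096) -> List[int]:
    """Return the number of tokens processed in each prefill step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    full, rest = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


print(prefill_chunks(10000))  # prints: [4096, 4096, 1808]
```

A smaller chunk size lowers the peak memory of each prefill step at the cost of more steps for a long prompt.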

### 4. Enable the FlashInfer backend

SGLang uses FlashInfer automatically when it is available (Ampere or newer GPUs). To select it explicitly:

```bash
--attention-backend flashinfer
```

### 5. Multi-GPU tensor parallelism

For models that do not fit on a single GPU:

```bash
--tp 4  # Use 4 GPUs
```

Each GPU must have enough VRAM to hold its shard of the model.
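
A quick way to check the shard requirement is to divide the weight size by the tensor-parallel degree. The sketch below assumes weights split evenly across GPUs and ignores the KV cache and activations, so the real per-GPU need is higher. The function name is illustrative.

```python
def per_gpu_weight_gb(params_billion: float, bytes_per_param: float, tp: int) -> float:
    """Approximate weight shard size per GPU in GB, assuming an even split."""
    if tp < 1:
        raise ValueError("tp must be at least 1")
    return params_billion * bytes_per_param / tp


# A 72B-parameter model in bfloat16 (2 bytes per parameter).
for tp in (1, 2, 4):
    print(f"--tp {tp}: ~{per_gpu_weight_gb(72, 2, tp):.0f} GB of weights per GPU")
```

By this estimate a 72B model in bfloat16 needs roughly 72 GB of weights per GPU at `--tp 2`, which fits only on 80 GB cards, and roughly 36 GB per GPU at `--tp 4`.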

### 6. Tune for throughput versus latency

**Low latency (single user):**

```bash
--max-running-requests 4
```

**High throughput (many users):**

```bash
--max-running-requests 64 \
--schedule-policy lpm  # Longest-prefix-match scheduling
```

***

## Troubleshooting

### Issue: "torch.cuda.OutOfMemoryError"

```
torch.cuda.OutOfMemoryError: CUDA out of memory
```

**Solution:** Lower the static memory fraction or use quantization:

```bash
--mem-fraction-static 0.80
# or
--quantization awq
```

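### Issue: Server fails to start (hangs while loading)

```bash
# Check that the GPU is visible inside the container
docker exec -it sglang nvidia-smi

# Check model download progress (show the last 50 lines, then follow)
docker logs --tail 50 -f sglang
```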

### Issue: "trust\_remote\_code required"

Add `--trust-remote-code` to the launch command to support models with custom architectures (such as DeepSeek or Falcon).

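### Issue: Slow generation with MoE models

MoE models (Mixtral, DeepSeek) are bound by memory bandwidth. Make sure to use:

```bash
--dtype bfloat16  # Generally preferred over float16 for MoE models
--tp 2            # Shard across GPUs if available
```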

### Issue: Context length errors

```bash
# Override the context length
--context-length 32768
```

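### Issue: Port 30000 is unreachable

Verify in your CLORE.AI order configuration that the port is exposed. From outside the server, use the http\_pub URL shown in the order dashboard rather than localhost.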

***

## Links

* [GitHub](https://github.com/sgl-project/sglang)
* [Documentation](https://sgl-project.github.io/start/install.html)
* [Docker Hub](https://hub.docker.com/r/lmsysorg/sglang)
* [Supported models](https://github.com/sgl-project/sglang?tab=readme-ov-file#supported-models)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

***

## Clore.ai GPU Recommendations

| Use case            | Recommended GPU  | Estimated Clore.ai cost |
| ------------------- | ---------------- | ----------------------- |
| Development/testing | RTX 3090 (24GB)  | \~$0.12 per GPU/hour    |
| Production (7B–13B) | RTX 4090 (24GB)  | \~$0.70 per GPU/hour    |
| Large models (70B+) | A100 80GB / H100 | \~$1.20 per GPU/hour    |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour, with no commitment and full root access.

