# DeepSpeed Training

Train large models efficiently with Microsoft DeepSpeed.

{% hint style="success" %}
All examples can be run on GPU servers rented from the [CLORE.AI marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit the [CLORE.AI marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-demand** (fixed rate) or **Spot** (bid pricing)
4. Configure your order:
   * Choose a Docker image
   * Set ports (TCP for SSH, HTTP for web interfaces)
   * Add environment variables if needed
   * Enter the startup command
5. Choose a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment

### Accessing Your Server

* Find connection details under **My Orders**
* Web interface: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What Is DeepSpeed?

DeepSpeed enables:

* Training models that do not fit in GPU memory
* Multi-GPU and multi-node training
* ZeRO optimizations (memory efficiency)
* Mixed-precision training

## ZeRO Stages

| Stage         | Memory Savings              | Notes           |
| ------------- | --------------------------- | --------------- |
| ZeRO-1        | Partitions optimizer states | Fastest         |
| ZeRO-2        | + partitions gradients      | Balanced        |
| ZeRO-3        | + partitions parameters     | Maximum savings |
| ZeRO-Infinity | Offloads to CPU/NVMe        | Largest models  |

## Quick Deployment

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install deepspeed transformers datasets accelerate
```

## Installation

```bash
pip install deepspeed

# Verify the installation
ds_report
```
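
You can also run a quick sanity check from Python to confirm that DeepSpeed is importable and that your GPUs are visible (optional, minimal sketch):

```python
import torch
import deepspeed

print("DeepSpeed version:", deepspeed.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```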

## Basic Training

### DeepSpeed Configuration

**ds\_config.json:**

```json
{
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 100
        }
    },
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true
    }
}
```
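
DeepSpeed requires `train_batch_size` to equal `train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs`. A quick sanity check for the config above, assuming (purely for illustration) a 2-GPU node:

```python
# Illustrative numbers: the 2-GPU node is an assumption, not part of the config above
train_batch_size = 32
gradient_accumulation_steps = 4
num_gpus = 2

micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * num_gpus)
assert micro_batch_per_gpu * gradient_accumulation_steps * num_gpus == train_batch_size
print(micro_batch_per_gpu)  # 4 samples per GPU per forward/backward pass
```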

### Training Script

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Initialize DeepSpeed (wraps model, optimizer, and scheduler from ds_config.json)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

# Training loop (assumes a dataloader yielding batches with a "text" field)
num_epochs = 3
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}

        outputs = model_engine(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

        model_engine.backward(loss)
        model_engine.step()
```
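
The loop assumes a `dataloader` that yields batches with a `"text"` field. A minimal sketch using the `datasets` library (WikiText-2 is only an illustrative corpus):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Any dataset with a "text" column works here; WikiText-2 is just an example
raw_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataloader = DataLoader(raw_dataset, batch_size=8, shuffle=True)
```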

## ZeRO Stage 2 Configuration

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true
    }
}
```

## ZeRO Stage 3 Configuration

For large models:

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
```
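
With ZeRO-3 a model can also be constructed directly in a partitioned state, so its full weights never have to materialize on a single device. A minimal sketch using `deepspeed.zero.Init` (the GPT-2 config is only a placeholder for a much larger model):

```python
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

model_config = AutoConfig.from_pretrained("gpt2")  # placeholder; use your model's config

# Parameters are allocated already sharded across ranks (and offloaded per the config)
with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    model = AutoModelForCausalLM.from_config(model_config)
```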

## Using with Hugging Face Transformers

### Trainer Integration

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

trainer.train()
```
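
The snippet assumes `model`, `tokenizer`, and `train_dataset` already exist. A minimal way to prepare a causal-LM dataset and collator (again, WikiText-2 is only an example corpus); pass the collator to `Trainer` via `data_collator=data_collator`:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

raw_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=raw_dataset.column_names)

# Pads each batch and copies input_ids to labels for causal-LM training
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```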

## Multi-GPU Training

### Launch Commands

```bash
# Single node, 4 GPUs
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json

# Specify GPUs explicitly
deepspeed --include="localhost:0,1,2,3" train.py --deepspeed ds_config.json
```

### Using torchrun

```bash
torchrun --nproc_per_node=4 train.py --deepspeed ds_config.json
```
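
Whichever launcher you use, `train.py` needs to accept the arguments passed on the command line. A minimal argument-parsing sketch matching the commands above:

```python
import argparse

# Minimal argument handling for the launch commands shown above
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1,
                    help="Set automatically by the deepspeed/torchrun launcher")
parser.add_argument("--deepspeed", type=str, default="ds_config.json",
                    help="Path to the DeepSpeed config file")
args = parser.parse_args()

# Pass the path on to deepspeed.initialize(..., config=args.deepspeed)
```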

## Multi-Node Training

### Hostfile

**hostfile:**

```
node1 slots=4
node2 slots=4
```

### Launch

```bash
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json
```

### SSH Setup

```bash
# Ensure passwordless SSH between all nodes
ssh-keygen -t rsa
ssh-copy-id user@node2
```

## Memory-Saving Configurations

### 7B Model on a 24GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "gradient_checkpointing": true,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16
}
```

### 13B Model on a 24GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "stage3_param_persistence_threshold": 0
    },
    "gradient_checkpointing": true,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32
}
```
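
As a rough rule of thumb (an approximation that ignores activations and temporary buffers), mixed-precision Adam training needs on the order of 16-18 bytes per parameter, which is why 7B and 13B models do not fit on a 24GB card without ZeRO-3 and CPU offload:

```python
def training_memory_gb(num_params: float, bytes_per_param: float = 18.0) -> float:
    """Rough estimate: fp16/bf16 weights and gradients plus fp32 master weights
    and Adam moments, excluding activations and temporary buffers."""
    return num_params * bytes_per_param / 1e9

for billions in (7, 13):
    estimate = training_memory_gb(billions * 1e9)
    print(f"{billions}B parameters -> roughly {estimate:.0f} GB without offload")
```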

## Gradient Checkpointing

Save memory by recomputing activations during the backward pass:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()
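# The generation KV cache is incompatible with gradient checkpointing during training
model.config.use_cache = False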
```

## Saving and Loading Checkpoints

### Saving

```python
# DeepSpeed handles checkpoint saving (sharded across ranks)
model_engine.save_checkpoint("./checkpoints", tag="step_1000")
```

### Loading

```python
model_engine.load_checkpoint("./checkpoints", tag="step_1000")
```
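
The checkpoint API also supports a `client_state` dict for anything you want to restore alongside the model, such as the global step (a sketch; the key name is arbitrary):

```python
# Save extra training state next to the model/optimizer shards
model_engine.save_checkpoint("./checkpoints", tag="step_1000",
                             client_state={"step": 1000})

# load_checkpoint returns (load_path, client_state)
load_path, client_state = model_engine.load_checkpoint("./checkpoints", tag="step_1000")
start_step = client_state["step"]
```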

### Saving in Hugging Face Format

```python
# Convert a ZeRO checkpoint to a standard fp32 state dict
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints", tag="step_1000")
model.load_state_dict(state_dict)
model.save_pretrained("./hf_model")
```

## Monitoring

### TensorBoard

```json
{
    "tensorboard": {
        "enabled": true,
        "output_path": "./logs",
        "job_name": "training_run"
    }
}
```

### Weights & Biases

```json
{
    "wandb": {
        "enabled": true,
        "project": "my_project"
    }
}
```

## Common Issues

### Out of Memory

Try:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "train_micro_batch_size_per_gpu": 1
}
```

### Slow Training

* Reduce CPU offloading
* Increase the batch size
* Use ZeRO Stage 2 instead of Stage 3

### NCCL Errors

```bash
# Set debugging and fallback environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
```

## Performance Tips

| Tip                           | Effect             |
| ----------------------------- | ------------------ |
| Use bf16 instead of fp16      | Better stability   |
| Enable gradient checkpointing | Lower memory usage |
| Tune the batch size           | Better throughput  |
| Use NVMe offloading           | Larger models      |

## Performance Comparison

| Model | GPUs    | ZeRO Stage | Training Speed  |
| ----- | ------- | ---------- | --------------- |
| 7B    | 1x A100 | ZeRO-3     | \~1000 tokens/s |
| 7B    | 4x A100 | ZeRO-2     | \~4000 tokens/s |
| 13B   | 4x A100 | ZeRO-3     | \~2000 tokens/s |
| 70B   | 8x A100 | ZeRO-3     | \~800 tokens/s  |

## Tips

* Use a fixed seed for consistent results
* Download all required checkpoints in advance and check file integrity
* Verify CUDA compatibility before renting

## Cost Estimation

Typical CLORE.AI marketplace options for DeepSpeed training range from an RTX 3060 for small experiments up to an A100 40GB or A100 80GB for large models. Prices vary by provider and demand; check the [CLORE.AI marketplace](https://clore.ai/marketplace) for current rates.

* Use **spot** pricing for flexible workloads (usually 30-50% cheaper)
* Pay with **CLORE** to save on fees

## Related Guides

* [Fine-Tune Large Language Models](https://docs.clore.ai/guides/guides_v2-zh/xun-lian/finetune-llm) - LoRA training
* vLLM Inference - deploy trained models
* [Hugging Face Guide](https://docs.clore.ai/guides/guides_v2-zh/xun-lian/huggingface-transformers) - Transformers library


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/guides_v2-zh/xun-lian/deepspeed-training.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
