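# TRL (RLHF/DPO Training)

**TRL** (Transformer Reinforcement Learning) is HuggingFace's official library for training language models with reinforcement learning. It has more than 10K GitHub stars and provides state-of-the-art implementations of alignment algorithms for large language models, including RLHF with PPO, DPO, and GRPO.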

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

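## What is TRL?

TRL is the library behind many of today's best-aligned language models. It provides:

* **SFT (Supervised Fine-Tuning)**: standard instruction fine-tuning on ChatML-formatted data
* **RLHF/PPO**: classic Proximal Policy Optimization (PPO) with a reward model
* **DPO**: Direct Preference Optimization (no reward model needed)
* **GRPO**: Group Relative Policy Optimization (the method used for DeepSeek-R1)
* **KTO**: Kahneman-Tversky Optimization (for unpaired preference data)
* **Reward modeling**: training reward models from human preference data
* **IterativeSFT**: online training with a simpler setup
* **ORPO**: Odds Ratio Preference Optimization

TRL integrates natively with the HuggingFace ecosystem: `transformers`, `peft`, `datasets`, `accelerate`, and `bitsandbytes`.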

{% hint style="info" %}
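**Which algorithm should you use?**

* **DPO**: the simplest and most stable. Use it when you have paired preference data (chosen/rejected).
* **PPO**: the most flexible, but more complex. Use it when you have a reward model or a scoring function.
* **GRPO**: well suited to reasoning and math tasks. This is the training method used for DeepSeek-R1.
* **SFT**: always start here before applying any reinforcement learning method.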
  {% endhint %}

***

## Server Requirements

| Component | Minimum                   | Recommended                 |
| --------- | ------------------------- | --------------------------- |
| GPU       | RTX 3090 (24 GB)          | A100 80 GB / H100           |
| VRAM      | 16 GB (SFT/DPO 7B + LoRA) | 80 GB (7B full fine-tuning) |
| RAM       | 32 GB                     | 64 GB+                      |
| CPU       | 8 cores                   | 16+ cores                   |
| Storage   | 100 GB                    | 300 GB+                     |
| OS        | Ubuntu 20.04+             | Ubuntu 22.04                |
| Python    | 3.9+                      | 3.11                        |
| CUDA      | 11.8+                     | 12.1+                       |

### VRAM Requirements by Task

| Task | Model       | Method           | VRAM               |
| ---- | ----------- | ---------------- | ------------------ |
| SFT  | Llama 3 8B  | QLoRA 4-bit      | \~8 GB             |
| DPO  | Llama 3 8B  | LoRA             | \~20 GB            |
| PPO  | Llama 3 8B  | Full fine-tuning | \~80 GB (2×A100)   |
| GRPO | Qwen 7B     | LoRA             | \~24 GB            |
| SFT  | Llama 3 70B | QLoRA 4-bit      | \~48 GB            |
| DPO  | Llama 3 70B | LoRA             | \~80 GB            |
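
As a rough cross-check of the figures in the table, the sketch below estimates memory from the parameter count alone. The function names are hypothetical, and the 16 bytes per parameter used for full fine-tuning is a common rule of thumb for mixed-precision AdamW, not a number from TRL. Activations, the KV cache, and framework overhead are not included, so real usage is higher than the weights-only figures, while sharding or 8-bit optimizers can lower the per-GPU cost of full fine-tuning.

```python
def weights_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def full_finetune_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Weights + gradients + AdamW state for mixed-precision training.

    Rule of thumb (an assumption): 2 bytes bf16 weights + 2 bytes bf16
    gradients + 12 bytes fp32 master weights, momentum, and variance.
    Activations are NOT included.
    """
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    n = 8e9  # an 8B-parameter model
    print(f"4-bit weights:    {weights_gb(n, 4):.1f} GB")
    print(f"bf16 weights:     {weights_gb(n, 16):.1f} GB")
    print(f"full fine-tuning: {full_finetune_gb(n):.1f} GB (before activations)")
```

The gap between the 4 GB of 4-bit weights and the \~8 GB listed for QLoRA is the room needed for adapters, optimizer state, and activations.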

***

## Ports

| Port | Service | Notes                                      |
| ---- | ------- | ------------------------------------------ |
| 22   | SSH     | Terminal access, file transfer, monitoring |

TRL is a training library. It runs as a CLI or Python script and does not need a web server.

***

## Installation on Clore.ai

### Step 1: Rent a server

1. Go to the [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **VRAM ≥ 24 GB** (RTX 3090, A100, or H100)
3. Choose a **PyTorch** or **CUDA 12.1** base image
4. Choose **storage ≥ 200 GB** for models and datasets
5. Open port **22** for SSH access

### Step 2: Connect over SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3: Install TRL

```bash
# Create a Python virtual environment
python3 -m venv /opt/trl
source /opt/trl/bin/activate

# Install TRL and its core dependencies
pip install trl

# Install extra packages for the full workflow
pip install \
    transformers \
    datasets \
    peft \
    accelerate \
    bitsandbytes \
    wandb \
    scipy \
    sentencepiece \
    protobuf

# Verify GPU support
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
```
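
Because the trainer arguments differ between TRL releases, it can help to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    stack = ["trl", "transformers", "peft", "datasets", "accelerate", "bitsandbytes"]
    for name, ver in installed_versions(stack).items():
        print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```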

### Step 4: Authenticate with HuggingFace

```bash
# Log in to access gated models (such as Llama and Gemma)
huggingface-cli login
# Enter your HF token from https://huggingface.co/settings/tokens

# Or set an environment variable
export HF_TOKEN=hf_your-token-here
```

### Step 5 (optional): Weights & Biases tracking

```bash
# Set up experiment tracking (strongly recommended)
pip install wandb
wandb login  # enter your W&B API key from https://wandb.ai/settings

# Or disable W&B
export WANDB_DISABLED=true
```

***

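## Supervised Fine-Tuning (SFT)

SFT is always the first step before applying any reinforcement learning technique.

### Prepare your dataset

```python
# Format: a `datasets` Dataset with a 'messages' or 'text' column
# ChatML-style messages (recommended)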
from datasets import Dataset

data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful GPU cloud assistant."},
            {"role": "user", "content": "How do I rent a GPU on Clore.ai?"},
            {"role": "assistant", "content": "Visit clore.ai/marketplace, filter by GPU specs, select a server, and click Rent. SSH access is provided immediately after payment."}
        ]
    },
    # … more examples
]

dataset = Dataset.from_list(data)
dataset.save_to_disk("data/sft_dataset")
dataset.push_to_hub("your-username/my-sft-dataset")  # optional
```
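
Malformed examples (missing roles, empty content) tend to surface only as confusing errors or NaN losses during training, so it may be worth checking the data before saving it. The sketch below validates the `messages` structure shown above; `validate_example` and the accepted role set are assumptions made for illustration, not part of TRL or `datasets`.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Return a list of problems found in one SFT example (empty list means OK)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' is missing or empty"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a dict")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # The model learns from assistant turns, so a conversation should end with one.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


if __name__ == "__main__":
    sample = {"messages": [{"role": "user", "content": "Hi"},
                           {"role": "assistant", "content": ""}]}
    print(validate_example(sample))
```

Running it over every row (for example `[validate_example(row) for row in data]`) gives a quick report before `Dataset.from_list` is called.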

### SFT training script

```python
# sft_train.py
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import torch

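# Model configuration (the 8B instruct model is published as Llama 3.1; Llama 3.2 has no 8B variant)
model_name = "meta-llama/Llama-3.1-8B-Instruct"

# QLoRA: 4-bit quantization configuration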
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load the dataset (conversational format with a 'messages' column)
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:10%]")

# Training configuration
training_config = SFTConfig(
    output_dir="./sft_output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,
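    max_seq_length=2048,  # named max_length in recent TRL releases
    # dataset_text_field is not set: the 'messages' column is detected
    # automatically and formatted with the tokenizer's chat template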
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    push_to_hub=False,
    report_to="wandb",  # or "none"
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,  # TRL >= 0.12; older releases take tokenizer=tokenizer
)

# Train
trainer.train()
trainer.save_model("./sft_final")
```

```bash
# Run training
python3 sft_train.py
```
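
To see why LoRA fits in so little memory, the sketch below counts the trainable parameters for the `LoraConfig` used above (`r=16` on all seven projection modules). Each adapted linear layer of shape `d_in × d_out` adds `r × (d_in + d_out)` parameters. The layer shapes are the published Llama 3 8B dimensions and are treated here as an assumption; other models need their own shapes.

```python
# (d_in, d_out) of each adapted projection in one decoder layer.
# Assumed Llama 3 8B dimensions: hidden 4096, KV dim 1024, MLP dim 14336.
LLAMA3_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(r, shapes=LLAMA3_8B_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (d_in x r) and B (r x d_out) per module: r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


if __name__ == "__main__":
    n = lora_trainable_params(r=16)
    print(f"trainable LoRA parameters: {n:,} ({n / 8.03e9:.2%} of ~8.03B)")
```

Doubling `r` doubles the adapter size, which is a cheap knob compared with unfreezing base weights.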

***

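## DPO (Direct Preference Optimization)

DPO is the most popular alignment method: it needs no reward model, only preference pairs.

### Prepare a DPO dataset

```python
# Format: each example contains 'prompt', 'chosen', and 'rejected'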
from datasets import Dataset

data = [
    {
        "prompt": "Explain how to optimize GPU utilization",
        "chosen": "To optimize GPU utilization: 1) Use larger batch sizes to maximize occupancy, 2) Enable mixed precision (bf16/fp16), 3) Profile with nvidia-smi to identify bottlenecks, 4) Use CUDA streams for parallel operations.",
        "rejected": "Just use more GPUs."
    },
    # … more preference pairs
]

dataset = Dataset.from_list(data)
```

### DPO training script

```python
# dpo_train.py
from trl import DPOTrainer, DPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import torch

model_name = "./sft_final"  # start from your SFT model

# Load the SFT model (the policy to be aligned)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Reference model (a frozen copy of the SFT model)
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:5%]")

# DPO configuration
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,           # much lower than for SFT
    beta=0.1,                     # strength of the KL penalty toward the reference model
    loss_type="sigmoid",          # standard DPO loss
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
    save_steps=50,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # TRL >= 0.12; older releases take tokenizer=tokenizer
)

trainer.train()
trainer.save_model("./dpo_final")
```
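
To make `beta` and `loss_type="sigmoid"` concrete, here is a minimal sketch of the per-example DPO loss from the DPO paper, written with plain floats. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the reference model. The function name and the sample numbers are hypothetical; the real trainer computes this on batched tensors.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair."""
    # Implicit rewards: how far the policy moved from the reference, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2)
    print(dpo_sigmoid_loss(-40.0, -55.0, -40.0, -55.0))
    # Policy raised the chosen response and lowered the rejected one: loss drops
    print(dpo_sigmoid_loss(-35.0, -60.0, -40.0, -55.0))
```

A positive margin means the policy prefers the chosen response more strongly than the reference does; training pushes that margin up.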

***

## PPO (Proximal Policy Optimization)

PPO is the classic RLHF method. Use it when you have a reward signal:

```python
# ppo_train.py
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
from datasets import load_dataset
import torch

model_name = "./sft_final"

# Policy model (with the value head that PPO needs)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Reward model (any scoring function can be used)
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)

def reward_fn(texts):
    """Score each response. Returns a list of scalar reward tensors."""
    results = sentiment_pipe(texts)
    rewards = []
    for result in results:
        score = result["score"] if result["label"] == "POSITIVE" else -result["score"]
        rewards.append(torch.tensor(score))
    return rewards

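# NOTE: step() and generate() below belong to the legacy PPOTrainer API
# (TRL 0.11 and earlier); newer releases replaced it with a different interface.
ppo_config = PPOConfig(
    learning_rate=1.41e-5,
    mini_batch_size=1,
    batch_size=4,
    gradient_accumulation_steps=4,
    kl_penalty="kl",
    target_kl=6.0,
    cliprange=0.2,
    vf_coef=0.1,
)

# Tokenize the prompts; the trainer builds its dataloader from this dataset
def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["text"])[:64]  # truncated review as the query
    return sample

dataset = load_dataset("imdb", split="train[:1000]").map(tokenize)
dataset.set_format(type="torch", columns=["input_ids"])

def collator(data):
    """Keep variable-length queries as lists of tensors instead of padding them."""
    return {key: [d[key] for d in data] for key in data[0]}

trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=None,  # a copy of the initial model is created as the reference
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator,
)

# Training loop
for epoch in range(3):
    for batch in trainer.dataloader:
        queries = batch["input_ids"]

        # Generate responses (without the prompt tokens)
        responses = trainer.generate(queries, return_prompt=False, max_new_tokens=100)

        # Score the responses
        texts = tokenizer.batch_decode(responses, skip_special_tokens=True)
        rewards = reward_fn(texts)

        # PPO update
        stats = trainer.step(queries, responses, rewards)
        trainer.log_stats(stats, batch, rewards)
```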
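
The `cliprange` setting controls how far a single update may move the policy. As an illustration only, the sketch below evaluates the standard PPO clipped surrogate objective for one token, where `ratio` is the new policy probability divided by the old one and `advantage` is the estimated advantage. The function name is hypothetical and this is not TRL's internal code.

```python
def clipped_surrogate(ratio, advantage, cliprange=0.2):
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Good action, policy already moved far toward it: the gain is capped
    print(clipped_surrogate(ratio=1.5, advantage=1.0))
    # Bad action, policy moved toward it: the full penalty is kept
    print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

Taking the minimum makes the objective pessimistic: improvements beyond the clip range earn nothing extra, while mistakes are never discounted.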

***

## GRPO (Group Relative Policy Optimization)

GRPO was used for the reasoning training of DeepSeek-R1:

```python
# grpo_train.py
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset
import re, torch

model_name = "Qwen/Qwen2.5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Math dataset
def make_math_dataset():
    examples = [
        {"prompt": "What is 2+2?", "answer": "4"},
        {"prompt": "What is 15 * 7?", "answer": "105"},
        # … more math problems
    ]
    return Dataset.from_list(examples)

dataset = make_math_dataset()

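def correctness_reward(completions, answer, **kwargs):
    """Reward 1.0 if the last number in the completion matches the answer, else 0.0."""
    rewards = []
    # Prompts are plain strings here, so each completion is also a string;
    # `answer` arrives as a list aligned with `completions`.
    for completion, target in zip(completions, answer):
        numbers = re.findall(r'\d+', completion)
        rewards.append(1.0 if numbers and numbers[-1] == target else 0.0)
    return rewards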

grpo_config = GRPOConfig(
    output_dir="./grpo_output",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # the global batch must be divisible by num_generations
    num_generations=8,       # GRPO samples G completions per prompt
    learning_rate=5e-7,
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=correctness_reward,
    processing_class=tokenizer,
)

trainer.train()
```
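
The "group relative" part of GRPO can be sketched in a few lines: the rewards of the `num_generations` completions for one prompt are normalized against each other, so no value model is needed. The helper below is a hypothetical illustration; the choice of population standard deviation and the `eps` value are assumptions and may differ from what TRL does internally.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's rewards: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # 8 completions for one prompt, scored by a 0/1 correctness reward
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]))
    # If every completion gets the same reward, there is no learning signal
    print(group_relative_advantages([1.0] * 8))
```

This is why a prompt that the model always solves, or never solves, contributes nothing to the update: all of its advantages are zero.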

***

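## Multi-GPU Training

Use `accelerate` for distributed training:

```bash
# Configure accelerate for multiple GPUs
accelerate config

# Example configuration for 4 GPUs:
# - compute_environment: LOCAL_MACHINE
# - distributed_type: MULTI_GPU
# - num_processes: 4
# - mixed_precision: bf16

# Launch training on all GPUs
# (for data-parallel runs, remove device_map="auto" from the scripts so that
#  each process loads the model onto its own GPU)
accelerate launch sft_train.py
accelerate launch dpo_train.py

# Or select the GPUs explicitly
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
  --num_processes 4 \
  --mixed_precision bf16 \
  sft_train.py
```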
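
Adding GPUs changes the effective batch size, which in turn can change how the learning rate behaves. A minimal sketch of the arithmetic, using the values from the SFT script above (the function name is hypothetical):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of sequences that contribute to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


if __name__ == "__main__":
    print(effective_batch_size(2, 4, num_gpus=1))  # single-GPU SFT script
    print(effective_batch_size(2, 4, num_gpus=4))  # same script on 4 GPUs
    print(effective_batch_size(2, 1, num_gpus=4))  # 4 GPUs, accumulation reduced to match
```

To keep a run comparable to the single-GPU baseline, one option is to lower `gradient_accumulation_steps` as the GPU count goes up.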

***

## Using the TRL CLI

TRL provides convenient CLI commands:

```bash
# SFT via the CLI
trl sft \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --dataset_name HuggingFaceH4/ultrachat_200k \
  --dataset_train_split train_sft \
  --output_dir ./cli_sft_output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --bf16 \
  --use_peft \
  --lora_r 16 \
  --lora_alpha 32

# DPO via the CLI
trl dpo \
  --model_name_or_path ./cli_sft_output \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir ./cli_dpo_output \
  --num_train_epochs 1 \
  --beta 0.1 \
  --bf16
```

***

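## Monitoring Training

```bash
# Watch GPU utilization
watch -n 1 nvidia-smi

# Monitor the training loss (if using W&B)
# Open https://wandb.ai/your-username in a browser

# Inspect the checkpoints in the output directory
ls -lh sft_output/checkpoint-*/

# Resume from a checkpoint: sft_train.py does not parse command-line flags,
# so pass the path inside the script instead:
#   trainer.train(resume_from_checkpoint="sft_output/checkpoint-500")
```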
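
When resuming, the checkpoint with the highest step number is usually the one wanted, and a plain alphabetical sort gets this wrong (`checkpoint-90` sorts after `checkpoint-1000`). The sketch below picks the latest checkpoint numerically; `latest_checkpoint` is a hypothetical helper written for this guide.

```python
import re
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for path in root.iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint("sft_output"))
```

The result can be passed as a string to `trainer.train(resume_from_checkpoint=...)`.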

***

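## GPU Recommendations on Clore.ai

TRL training is one of the most VRAM-hungry workloads. Choose your GPU according to model size and method:

| Task                                     | GPU                | Notes                                                                                                   |
| ---------------------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------- |
| SFT / DPO on 7–8B (QLoRA)                | **RTX 3090** 24 GB | QLoRA 4-bit needs \~8 GB; runs comfortably; about $0.12/hour on Clore.ai                                |
| SFT / DPO on 7–8B (LoRA bf16)            | **RTX 4090** 24 GB | Same VRAM as the 3090 but about 30% faster compute; a good fit for faster iteration                     |
| Full SFT on 7B or DPO on 13B             | **A100 40 GB**     | 40 GB holds full-precision training of a 7B model; ECC memory guards against silent errors              |
| PPO / full fine-tuning on 7B, 70B QLoRA  | **A100 80 GB**     | PPO keeps both the policy model and the reference model in VRAM; 80 GB runs both without OOM            |

**Practical tip:** start experimenting with QLoRA on an RTX 3090. Training Llama 3 8B on 10K examples takes roughly 2 hours. Once the pipeline is validated, move to an A100 80GB for full-precision runs or for training 70B models.

**Speed reference (Llama 3 8B SFT, QLoRA, batch=4, seq=2048):**

* RTX 3090: about 1,100 tokens/sec training throughput
* RTX 4090: about 1,450 tokens/sec
* A100 80GB: about 2,800 tokens/sec (full bf16, no quantization)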
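
The throughput figures above can be turned into a wall-clock and cost estimate before renting a server. The sketch below is simple arithmetic; the average of 800 tokens per example is a hypothetical value chosen for illustration, so measure the real average on your own dataset.

```python
def estimated_hours(num_examples, avg_tokens_per_example, tokens_per_sec, epochs=1):
    """Training time in hours for a given throughput, ignoring startup and eval time."""
    total_tokens = num_examples * avg_tokens_per_example * epochs
    return total_tokens / tokens_per_sec / 3600


if __name__ == "__main__":
    # 10K examples, assumed 800 tokens each, on an RTX 3090 at ~1,100 tokens/sec
    hours = estimated_hours(10_000, 800, 1_100)
    print(f"{hours:.1f} h, about ${hours * 0.12:.2f} at $0.12/hour")
    # Same data if every example filled the 2048-token window
    print(f"{estimated_hours(10_000, 2048, 1_100):.1f} h")
```

Sequence length dominates the estimate, which is why trimming `max_seq_length` is one of the cheaper ways to shorten a run.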

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce the batch size (set these in SFTConfig / DPOConfig)
per_device_train_batch_size=1
gradient_accumulation_steps=8  # keeps the effective batch size at 8 (was 2 × 4)

# Use 4-bit quantization (QLoRA)
# Add a BitsAndBytesConfig with load_in_4bit=True

# Enable gradient checkpointing
gradient_checkpointing=True

# Reduce the sequence length
max_seq_length=1024  # instead of 2048+

# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```

### Loss is NaN

```bash
# Common cause: learning rate too high
learning_rate=1e-5  # try a lower value

# Common cause: bad data (empty strings, None values)
# Validate the dataset:
python3 -c "
from datasets import load_from_disk
ds = load_from_disk('data/sft_dataset')
print(ds[0])
print(f'Length: {len(ds)}')
# Check for None values
none_count = sum(1 for x in ds if x.get('messages') is None)
print(f'None count: {none_count}')
"

# Use bf16 instead of fp16 (more stable)
bf16=True
fp16=False
```

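### DPO: `chosen_rewards > rejected_rewards` is False

```python
# The model prefers the rejected responses, which points to overfitting or bad data
# Possible fixes:
# 1. Check the quality of your dataset
# 2. Lower beta (a weaker KL penalty)
# 3. Lower the learning rate
# 4. Train longer with SFT before DPO
beta=0.05  # try a smaller value
```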

### Training is very slow

```bash
# Enable Flash Attention 2
pip install flash-attn --no-build-isolation

# In your code:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

# On Ampere and newer GPUs (A100, RTX 30-series and later), use bf16 instead of fp16
bf16=True

# Increase the number of DataLoader workers
dataloader_num_workers=4

# Check that the GPU is actually being used
nvidia-smi  # should show high GPU utilization
```

### `tokenizer.pad_token` warning

```python
# Standard fix for Llama/Mistral tokenizers
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # important for training stability
```

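### Permission denied / HuggingFace 401

```bash
# Log in again
huggingface-cli login

# Set the token in an environment variable
export HF_TOKEN=hf_your-token

# For gated or private models and datasets, make sure you have access:
# Visit https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
# Request access and accept the license
```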

***

## Save and Share Your Model

```bash
# Merge the LoRA weights into the base model
python3 << 'EOF'
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./sft_final")
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./merged_model")
print("Merged model saved!")
EOF

# Push to HuggingFace
huggingface-cli upload your-username/my-trl-model ./merged_model
```

***

## Useful Links

* **GitHub**: <https://github.com/huggingface/trl> ⭐ 10K+
* **Documentation**: <https://huggingface.co/docs/trl>
* **DPO paper**: <https://arxiv.org/abs/2305.18290>
* **GRPO / DeepSeek-R1**: <https://arxiv.org/abs/2501.12948>
* **InstructGPT paper (RLHF with PPO)**: <https://arxiv.org/abs/2203.02155>
* **HuggingFace PEFT**: <https://github.com/huggingface/peft>
* **Weights & Biases**: <https://wandb.ai>
* **Flash Attention**: <https://github.com/Dao-AILab/flash-attention>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>

