# TRL (RLHF/DPO Training)

**TRL** (Transformer Reinforcement Learning) is HuggingFace's official library for training language models with reinforcement learning techniques. With 10K+ GitHub stars, it provides state-of-the-art implementations of RLHF, DPO, PPO, GRPO, and other alignment algorithms for LLMs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What is TRL?

TRL is the library behind many of today's best-aligned language models. It provides:

* **SFT (Supervised Fine-Tuning)** — standard instruction tuning with ChatML format
* **RLHF/PPO** — classic Proximal Policy Optimization with a reward model
* **DPO** — Direct Preference Optimization (no reward model needed!)
* **GRPO** — Group Relative Policy Optimization (DeepSeek-R1's method)
* **KTO** — Kahneman-Tversky Optimization (works with unpaired preferences)
* **Reward Modeling** — train a reward model from human preference data
* **IterativeSFT** — iterative fine-tuning on model generations, a simpler setup for online-style training
* **ORPO** — Odds Ratio Preference Optimization

TRL integrates natively with HuggingFace ecosystem: `transformers`, `peft`, `datasets`, `accelerate`, and `bitsandbytes`.

{% hint style="info" %}
**Which algorithm should you use?**

* **DPO** — simplest, most stable. Use when you have paired preference data (chosen/rejected).
* **PPO** — most powerful but complex. Use when you have a reward model or scoring function.
* **GRPO** — great for reasoning/math tasks. DeepSeek-R1's training method.
* **SFT** — always start here before applying any RL method.
{% endhint %}

***

## Server Requirements

| Component | Minimum                   | Recommended              |
| --------- | ------------------------- | ------------------------ |
| GPU       | RTX 3090 (24 GB)          | A100 80 GB / H100        |
| VRAM      | 16 GB (SFT/DPO 7B + LoRA) | 80 GB (full finetune 7B) |
| RAM       | 32 GB                     | 64 GB+                   |
| CPU       | 8 cores                   | 16+ cores                |
| Storage   | 100 GB                    | 300 GB+                  |
| OS        | Ubuntu 20.04+             | Ubuntu 22.04             |
| Python    | 3.9+                      | 3.11                     |
| CUDA      | 11.8+                     | 12.1+                    |

### VRAM by Task

| Task | Model       | Method      | VRAM             |
| ---- | ----------- | ----------- | ---------------- |
| SFT  | Llama 3 8B  | QLoRA 4-bit | \~8 GB           |
| DPO  | Llama 3 8B  | LoRA        | \~20 GB          |
| PPO  | Llama 3 8B  | Full        | \~80 GB (2×A100) |
| GRPO | Qwen 7B     | LoRA        | \~24 GB          |
| SFT  | Llama 3 70B | QLoRA 4-bit | \~48 GB          |
| DPO  | Llama 3 70B | LoRA        | \~80 GB          |

***

## Ports

| Port | Service | Notes                                      |
| ---- | ------- | ------------------------------------------ |
| 22   | SSH     | Terminal access, file transfer, monitoring |

TRL is a training library — it runs as a CLI/Python script, no web server required.

***

## Installation on Clore.ai

### Step 1 — Rent a Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **VRAM ≥ 24 GB** (RTX 3090, A100, or H100)
3. Choose a **PyTorch** or **CUDA 12.1** base image
4. Select **Storage ≥ 200 GB** for models and datasets
5. Open port **22** for SSH access

### Step 2 — Connect via SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3 — Install TRL

```bash
# Create Python virtual environment
python3 -m venv /opt/trl
source /opt/trl/bin/activate

# Install TRL with all dependencies
pip install trl

# Install additional dependencies for full workflows
pip install \
    transformers \
    datasets \
    peft \
    accelerate \
    bitsandbytes \
    wandb \
    scipy \
    sentencepiece \
    protobuf

# Verify GPU support
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
```
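
Optionally, confirm which library versions were installed — argument names in TRL (for example `max_seq_length` vs `max_length`) shift between releases, so it helps to know what you are running:

```bash
# Print installed versions for reference
python3 -c "import trl, transformers, peft; print(f'trl {trl.__version__}, transformers {transformers.__version__}, peft {peft.__version__}')"
```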

### Step 4 — HuggingFace Authentication

```bash
# Login to access gated models (Llama, Gemma)
huggingface-cli login
# Enter your HF token from https://huggingface.co/settings/tokens

# Or set environment variable
export HF_TOKEN=hf_your-token-here
```
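
A quick way to confirm the token is active before starting a long model download:

```bash
# Prints your HF username if the token is valid
python3 -c "from huggingface_hub import whoami; print(whoami()['name'])"
```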

### Step 5 — Optional: Weights & Biases Tracking

```bash
# Set up experiment tracking (highly recommended)
pip install wandb
wandb login  # Enter your W&B API key from https://wandb.ai/settings

# Or disable W&B
export WANDB_DISABLED=true
```

***

## Supervised Fine-Tuning (SFT)

SFT is always the first step before any RL technique.

### Prepare Your Dataset

```python
# Format: datasets library with 'messages' or 'text' column
# ChatML format (recommended)
from datasets import Dataset

data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful GPU cloud assistant."},
            {"role": "user", "content": "How do I rent a GPU on Clore.ai?"},
            {"role": "assistant", "content": "Visit clore.ai/marketplace, filter by GPU specs, select a server, and click Rent. SSH access is provided immediately after payment."}
        ]
    },
    # ... more examples
]

dataset = Dataset.from_list(data)
dataset.save_to_disk("data/sft_dataset")
dataset.push_to_hub("your-username/my-sft-dataset")  # optional
```
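
Before training, it's worth checking how the tokenizer will render your `messages` into a single training string. A quick sanity check — it assumes the model's tokenizer ships a chat template, which Llama 3 instruct models do:

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

dataset = load_from_disk("data/sft_dataset")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Render the first example with the model's chat template to inspect the final format
print(tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False))
```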

### SFT Training Script

```python
# sft_train.py
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import torch

# Model configuration
model_name = "meta-llama/Llama-3.2-8B-Instruct"

# QLoRA: 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load dataset
dataset = load_dataset("trl-lib/ultrachat_200k", split="train_sft[:10%]")

# Training configuration
training_config = SFTConfig(
    output_dir="./sft_output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,
    max_seq_length=2048,          # renamed to `max_length` in newer TRL releases
    # No dataset_text_field needed: conversational datasets with a "messages"
    # column are formatted via the tokenizer's chat template automatically
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    push_to_hub=False,
    report_to="wandb",  # or "none"
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,  # older TRL releases use tokenizer=
)

# Train
trainer.train()
trainer.save_model("./sft_final")
```

```bash
# Run training
python3 sft_train.py
```
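
Once training finishes, a quick smoke test of the saved adapter is worth doing before moving on to DPO. A minimal sketch, assuming the adapter was saved to `./sft_final` as in the script above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./sft_final")  # attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [{"role": "user", "content": "How do I rent a GPU on Clore.ai?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```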

***

## DPO (Direct Preference Optimization)

DPO is the most popular alignment method — no reward model needed, just preference pairs.

### Prepare DPO Dataset

```python
# Format: each example has 'prompt', 'chosen', 'rejected'
from datasets import Dataset

data = [
    {
        "prompt": "Explain how to optimize GPU utilization",
        "chosen": "To optimize GPU utilization: 1) Use larger batch sizes to maximize occupancy, 2) Enable mixed precision (bf16/fp16), 3) Profile with nvidia-smi to identify bottlenecks, 4) Use CUDA streams for parallel operations.",
        "rejected": "Just use more GPUs."
    },
    # ... more preference pairs
]

dataset = Dataset.from_list(data)
```
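
If your preference data lives in another schema, a simple `map` converts it to the expected columns. A sketch — the `question` / `good_answer` / `bad_answer` names below are hypothetical placeholders for whatever your source dataset actually uses:

```python
from datasets import load_from_disk

raw = load_from_disk("data/raw_preferences")  # hypothetical source dataset

def to_dpo_format(example):
    return {
        "prompt": example["question"],
        "chosen": example["good_answer"],
        "rejected": example["bad_answer"],
    }

dataset = raw.map(to_dpo_format, remove_columns=raw.column_names)
```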

### DPO Training Script

```python
# dpo_train.py
from trl import DPOTrainer, DPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import torch

model_name = "./sft_final"  # Start from your SFT model!

# Load SFT model (the policy to align)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Reference model (frozen copy of SFT model)
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:5%]")

# DPO configuration
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,           # Much lower than SFT
    beta=0.1,                     # KL penalty coefficient
    loss_type="sigmoid",          # Standard DPO loss
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
    save_steps=50,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer=
)

trainer.train()
trainer.save_model("./dpo_final")
```
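
The key health metric during DPO is the reward margin (chosen minus rejected), which DPOTrainer logs as `rewards/margins` — it should trend upward. If you're not using W&B, you can read it back from a checkpoint's `trainer_state.json`. A small sketch, assuming a `checkpoint-50` directory from the config above exists:

```python
import json

# Every checkpoint directory contains a trainer_state.json with the logged metrics
with open("dpo_output/checkpoint-50/trainer_state.json") as f:
    state = json.load(f)

for entry in state["log_history"]:
    if "rewards/margins" in entry:
        print(entry["step"], round(entry["rewards/margins"], 4))
```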

***

## PPO (Proximal Policy Optimization)

PPO is the classic RLHF approach — use when you have a reward signal:

```python
# ppo_train.py
# Note: this script uses the legacy PPOTrainer loop API (roughly TRL < 0.12);
# run `pip install "trl<0.12"` to use it as-is. Newer TRL releases replace it
# with a Trainer-style API driven by trainer.train() — see the TRL docs.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
from datasets import load_dataset
import torch

model_name = "./sft_final"

# Policy model (with value head for PPO)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Reward model (can be any scoring function)
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)

def reward_fn(texts):
    """Score each response. Return list of reward tensors."""
    results = sentiment_pipe(texts)
    rewards = []
    for result in results:
        score = result["score"] if result["label"] == "POSITIVE" else -result["score"]
        rewards.append(torch.tensor(score))
    return rewards

# Legacy PPOConfig has no output_dir field — save manually with model.save_pretrained
ppo_config = PPOConfig(
    learning_rate=1.41e-5,
    mini_batch_size=1,
    batch_size=4,
    gradient_accumulation_steps=4,
    kl_penalty="kl",
    target_kl=6.0,
    cliprange=0.2,
    vf_coef=0.1,
)

# Prompts must be tokenized and handed to PPOTrainer so that trainer.dataloader exists
dataset = load_dataset("imdb", split="train[:1000]")
dataset = dataset.map(
    lambda x: {"input_ids": tokenizer.encode(x["text"])[:64]},
    remove_columns=dataset.column_names,
)
dataset.set_format(type="torch")

def collator(data):
    # Keep ragged query tensors as a list — the format trainer.step expects
    return {key: [d[key] for d in data] for key in data[0]}

trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=None,  # None: TRL copies the initial policy as the frozen reference
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator,
)

# Training loop
for epoch in range(3):
    for batch in trainer.dataloader:
        queries = batch["input_ids"]

        # Generate responses (list of query tensors in, list of response tensors out)
        responses = trainer.generate(queries, return_prompt=False, max_new_tokens=100)

        # Score responses
        texts = tokenizer.batch_decode(responses, skip_special_tokens=True)
        rewards = reward_fn(texts)

        # PPO update
        stats = trainer.step(queries, responses, rewards)
        trainer.log_stats(stats, batch, rewards)
```
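
When the loop finishes, save the policy and tokenizer manually (the value head is only needed during training); this continues the script above:

```python
# Save the PPO-trained policy for inference or further alignment
model.save_pretrained("./ppo_final")
tokenizer.save_pretrained("./ppo_final")
```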

***

## GRPO (Group Relative Policy Optimization)

GRPO is used in DeepSeek-R1 for reasoning training:

```python
# grpo_train.py
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset
import re, torch

model_name = "Qwen/Qwen2.5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Math dataset
def make_math_dataset():
    examples = [
        {"prompt": "What is 2+2?", "answer": "4"},
        {"prompt": "What is 15 * 7?", "answer": "105"},
        # ... more math problems
    ]
    return Dataset.from_list(examples)

dataset = make_math_dataset()

def correctness_reward(completions, answer, **kwargs):
    """Reward 1.0 if the final number in the completion matches the answer, else 0.0."""
    # With plain-string prompts, each completion is a plain string; extra dataset
    # columns (here `answer`) are passed as per-sample lists via kwargs
    rewards = []
    for completion, ans in zip(completions, answer):
        numbers = re.findall(r"\d+", completion)
        rewards.append(1.0 if numbers and numbers[-1] == ans else 0.0)
    return rewards

grpo_config = GRPOConfig(
    output_dir="./grpo_output",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # must be divisible by num_generations
    num_generations=8,              # GRPO generates G responses per prompt
    learning_rate=5e-7,
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=correctness_reward,
    processing_class=tokenizer,
)

trainer.train()
```
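
`reward_funcs` also accepts a list of reward functions whose outputs are combined per sample — the usual way to add a lightweight format reward alongside correctness in R1-style recipes. A sketch that plugs into the script above, assuming you want answers wrapped in `<answer>...</answer>` tags:

```python
def format_reward(completions, **kwargs):
    """Small bonus when the completion wraps its result in <answer> tags."""
    return [0.2 if re.search(r"<answer>.*?</answer>", c, re.DOTALL) else 0.0 for c in completions]

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward],  # rewards are summed per sample
    processing_class=tokenizer,
)
```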

***

## Multi-GPU Training

Use `accelerate` for distributed training:

```bash
# Configure accelerate for multi-GPU
accelerate config

# Example config for 4 GPUs:
# - compute_environment: LOCAL_MACHINE
# - distributed_type: MULTI_GPU
# - num_processes: 4
# - mixed_precision: bf16

# Launch training across all GPUs
accelerate launch sft_train.py
accelerate launch dpo_train.py

# Or specify GPUs explicitly
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
  --num_processes 4 \
  --mixed_precision bf16 \
  sft_train.py
```

***

## Using the TRL CLI

TRL provides convenient CLI commands:

```bash
# SFT via CLI
trl sft \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --dataset_name HuggingFaceH4/ultrachat_200k \
  --dataset_train_split train_sft \
  --output_dir ./cli_sft_output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --bf16 \
  --use_peft \
  --lora_r 16 \
  --lora_alpha 32

# DPO via CLI
trl dpo \
  --model_name_or_path ./cli_sft_output \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir ./cli_dpo_output \
  --num_train_epochs 1 \
  --beta 0.1 \
  --bf16
```

***

## Monitoring Training

```bash
# Watch GPU utilization
watch -n 1 nvidia-smi

# Monitor training loss (if using W&B)
# Open https://wandb.ai/your-username in browser

# Check output directory for checkpoints
ls -lh sft_output/checkpoint-*/

# Resume from a checkpoint: pass the path to trainer.train() inside your script,
# e.g. trainer.train(resume_from_checkpoint="sft_output/checkpoint-500")
```
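
If you prefer not to use W&B, TensorBoard works over an SSH tunnel, so no extra ports need to be opened on the server — set `report_to="tensorboard"` in your training config, then:

```bash
# On the server: install TensorBoard and point it at the output directory
pip install tensorboard
tensorboard --logdir ./sft_output --port 6006

# On your local machine: tunnel the port, then open http://localhost:6006
ssh -L 6006:localhost:6006 root@<server-ip> -p <ssh-port>
```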

***

## Clore.ai GPU Recommendations

TRL training is one of the most VRAM-intensive workloads. Pick your GPU based on model size and method:

| Task                                     | GPU                | Notes                                                                  |
| ---------------------------------------- | ------------------ | ---------------------------------------------------------------------- |
| SFT / DPO on 7–8B (QLoRA)                | **RTX 3090** 24 GB | \~8 GB for QLoRA 4-bit; fits comfortably; \~$0.12/hr on Clore.ai       |
| SFT / DPO on 7–8B (LoRA bf16)            | **RTX 4090** 24 GB | Same VRAM as 3090 but 30% faster compute; great for iteration speed    |
| Full SFT on 7B or DPO on 13B             | **A100 40 GB**     | 40 GB fits 7B full-precision training; ECC memory avoids silent errors |
| PPO / full finetune 7B, or any 70B QLoRA | **A100 80 GB**     | PPO needs 2× policy+ref model in VRAM; 80 GB runs both without OOM     |

**Practical tip:** Start on RTX 3090 with QLoRA for experimentation — train Llama 3 8B in \~2 hrs on 10K examples. Once you've validated the pipeline, move to A100 80GB for full-precision runs or 70B models.

**Speed numbers (Llama 3 8B SFT, QLoRA, batch=4, seq=2048):**

* RTX 3090: \~1,100 tokens/sec training throughput
* RTX 4090: \~1,450 tokens/sec
* A100 80GB: \~2,800 tokens/sec (full bf16, no quantization)

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce batch size
per_device_train_batch_size=1
gradient_accumulation_steps=16  # Keep effective batch size the same

# Use 4-bit quantization (QLoRA)
# Add BitsAndBytesConfig with load_in_4bit=True

# Enable gradient checkpointing
gradient_checkpointing=True

# Reduce sequence length
max_seq_length=1024  # instead of 2048+

# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```

### Loss is NaN

```bash
# Common cause: learning rate too high
learning_rate=1e-5  # Try lower

# Common cause: bad data (empty strings, None values)
# Validate dataset:
python3 -c "
from datasets import load_from_disk
ds = load_from_disk('data/sft_dataset')
print(ds[0])
print(f'Length: {len(ds)}')
# Check for None
none_count = sum(1 for x in ds if x.get('messages') is None)
print(f'None count: {none_count}')
"

# Enable bf16 instead of fp16 (more stable)
bf16=True
fp16=False
```

### DPO: chosen rewards lower than rejected rewards

```bash
# This means the model prefers rejected responses — overfitting or bad data
# Solutions:
# 1. Check your dataset quality
# 2. Reduce beta (less KL penalty)
# 3. Reduce learning rate
# 4. Add more SFT training before DPO
beta=0.05  # Try smaller values
```

### Training is very slow

```bash
# Enable Flash Attention 2
pip install flash-attn --no-build-isolation

# In your code:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

# Use bf16 instead of fp16 on Ampere+ GPUs (A100, RTX 30-series and newer)
bf16=True

# Increase DataLoader workers
dataloader_num_workers=4

# Check if GPU is actually being used
nvidia-smi  # Should show high GPU utilization
```

### `tokenizer.pad_token` warning

```python
# Standard fix for Llama/Mistral tokenizers
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Important for training stability
```

### Permission denied / HuggingFace 401

```bash
# Re-login
huggingface-cli login

# Set token in environment
export HF_TOKEN=hf_your-token

# For private models/datasets, ensure you have access:
# Go to https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
# Click "Request access" and accept the license
```

***

## Saving and Sharing Your Model

```bash
# Merge LoRA weights into base model
python3 << 'EOF'
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./sft_final")
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-8B-Instruct")
tokenizer.save_pretrained("./merged_model")
print("Merged model saved!")
EOF

# Push to HuggingFace
huggingface-cli upload your-username/my-trl-model ./merged_model
```
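
To confirm the merged checkpoint loads standalone (no PEFT required), a quick generation test:

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="./merged_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(pipe("How do I rent a GPU on Clore.ai?", max_new_tokens=100)[0]["generated_text"])
```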

***

## Useful Links

* **GitHub**: <https://github.com/huggingface/trl> ⭐ 10K+
* **Documentation**: <https://huggingface.co/docs/trl>
* **DPO Paper**: <https://arxiv.org/abs/2305.18290>
* **GRPO / DeepSeek-R1**: <https://arxiv.org/abs/2501.12948>
* **InstructGPT Paper (RLHF with PPO)**: <https://arxiv.org/abs/2203.02155>
* **HuggingFace PEFT**: <https://github.com/huggingface/peft>
* **Weights & Biases**: <https://wandb.ai>
* **Flash Attention**: <https://github.com/Dao-AILab/flash-attention>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>
