# TRL (RLHF/DPO Training)

**TRL** (Transformer Reinforcement Learning) is HuggingFace's official library for training language models with reinforcement learning techniques. With 10K+ GitHub stars, it provides state-of-the-art implementations of RLHF, DPO, PPO, GRPO, and other alignment algorithms for LLMs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What is TRL?

TRL is the library behind many of today's best-aligned language models. It provides:

* **SFT (Supervised Fine-Tuning)** — standard instruction tuning with ChatML format
* **RLHF/PPO** — classic Proximal Policy Optimization with a reward model
* **DPO** — Direct Preference Optimization (no reward model needed!)
* **GRPO** — Group Relative Policy Optimization (DeepSeek-R1's method)
* **KTO** — Kahneman-Tversky Optimization (works with unpaired preferences)
* **Reward Modeling** — train a reward model from human preference data
* **IterativeSFT** — iterative fine-tuning on model generations, a simpler setup for online-style training
* **ORPO** — Odds Ratio Preference Optimization

TRL integrates natively with HuggingFace ecosystem: `transformers`, `peft`, `datasets`, `accelerate`, and `bitsandbytes`.

{% hint style="info" %}
**Which algorithm should you use?**

* **DPO** — simplest, most stable. Use when you have paired preference data (chosen/rejected).
* **PPO** — most powerful but complex. Use when you have a reward model or scoring function.
* **GRPO** — great for reasoning/math tasks. DeepSeek-R1's training method.
* **SFT** — always start here before applying any RL method.
{% endhint %}

***

## Server Requirements

| Component | Minimum                   | Recommended              |
| --------- | ------------------------- | ------------------------ |
| GPU       | RTX 3090 (24 GB)          | A100 80 GB / H100        |
| VRAM      | 16 GB (SFT/DPO 7B + LoRA) | 80 GB (full finetune 7B) |
| RAM       | 32 GB                     | 64 GB+                   |
| CPU       | 8 cores                   | 16+ cores                |
| Storage   | 100 GB                    | 300 GB+                  |
| OS        | Ubuntu 20.04+             | Ubuntu 22.04             |
| Python    | 3.9+                      | 3.11                     |
| CUDA      | 11.8+                     | 12.1+                    |

### VRAM by Task

| Task | Model       | Method      | VRAM             |
| ---- | ----------- | ----------- | ---------------- |
| SFT  | Llama 3 8B  | QLoRA 4-bit | \~8 GB           |
| DPO  | Llama 3 8B  | LoRA        | \~20 GB          |
| PPO  | Llama 3 8B  | Full        | \~80 GB (2×A100) |
| GRPO | Qwen 7B     | LoRA        | \~24 GB          |
| SFT  | Llama 3 70B | QLoRA 4-bit | \~48 GB          |
| DPO  | Llama 3 70B | LoRA        | \~80 GB          |

***

## Ports

| Port | Service | Notes                                      |
| ---- | ------- | ------------------------------------------ |
| 22   | SSH     | Terminal access, file transfer, monitoring |

TRL is a training library — it runs as a CLI/Python script, no web server required.

***

## Installation on Clore.ai

### Step 1 — Rent a Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **VRAM ≥ 24 GB** (RTX 3090, A100, or H100)
3. Choose a **PyTorch** or **CUDA 12.1** base image
4. Select **Storage ≥ 200 GB** for models and datasets
5. Open port **22** for SSH access

### Step 2 — Connect via SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3 — Install TRL

```bash
# Create Python virtual environment
python3 -m venv /opt/trl
source /opt/trl/bin/activate

# Install TRL with all dependencies
pip install trl

# Install additional dependencies for full workflows
pip install \
    transformers \
    datasets \
    peft \
    accelerate \
    bitsandbytes \
    wandb \
    scipy \
    sentencepiece \
    protobuf

# Verify GPU support
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
```
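
Optionally, confirm which library versions were installed — argument names in TRL (for example `max_seq_length` vs `max_length`) shift between releases, so it helps to know what you are running:

```bash
# Print installed versions for reference
python3 -c "import trl, transformers, peft; print(f'trl {trl.__version__}, transformers {transformers.__version__}, peft {peft.__version__}')"
```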

### Step 4 — HuggingFace Authentication

```bash
# Login to access gated models (Llama, Gemma)
huggingface-cli login
# Enter your HF token from https://huggingface.co/settings/tokens

# Or set environment variable
export HF_TOKEN=hf_your-token-here
```
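
A quick way to confirm the token is active before starting a long model download:

```bash
# Prints your HF username if the token is valid
python3 -c "from huggingface_hub import whoami; print(whoami()['name'])"
```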

### Step 5 — Optional: Weights & Biases Tracking

```bash
# Set up experiment tracking (highly recommended)
pip install wandb
wandb login  # Enter your W&B API key from https://wandb.ai/settings

# Or disable W&B
export WANDB_DISABLED=true
```

***

## Supervised Fine-Tuning (SFT)

SFT is always the first step before any RL technique.

### Prepare Your Dataset

```python
# Format: datasets library with 'messages' or 'text' column
# ChatML format (recommended)
from datasets import Dataset

data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful GPU cloud assistant."},
            {"role": "user", "content": "How do I rent a GPU on Clore.ai?"},
            {"role": "assistant", "content": "Visit clore.ai/marketplace, filter by GPU specs, select a server, and click Rent. SSH access is provided immediately after payment."}
        ]
    },
    # ... more examples
]

dataset = Dataset.from_list(data)
dataset.save_to_disk("data/sft_dataset")
dataset.push_to_hub("your-username/my-sft-dataset")  # optional
```
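
Before training, it's worth checking how the tokenizer will render your `messages` into a single training string. A quick sanity check — it assumes the model's tokenizer ships a chat template, which Llama 3 instruct models do:

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

dataset = load_from_disk("data/sft_dataset")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Render the first example with the model's chat template to inspect the final format
print(tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False))
```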

### SFT Training Script

```python
# sft_train.py
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import torch

# Model configuration
model_name = "meta-llama/Llama-3.2-8B-Instruct"

# QLoRA: 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load dataset
dataset = load_dataset("trl-lib/ultrachat_200k", split="train_sft[:10%]")

# Training configuration
training_config = SFTConfig(
    output_dir="./sft_output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,
    max_seq_length=2048,          # renamed to `max_length` in newer TRL releases
    # No dataset_text_field needed: conversational datasets with a "messages"
    # column are formatted via the tokenizer's chat template automatically
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    push_to_hub=False,
    report_to="wandb",  # or "none"
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,  # older TRL releases use tokenizer=
)

# Train
trainer.train()
trainer.save_model("./sft_final")
```

```bash
# Run training
python3 sft_train.py
```
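
Once training finishes, a quick smoke test of the saved adapter is worth doing before moving on to DPO. A minimal sketch, assuming the adapter was saved to `./sft_final` as in the script above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./sft_final")  # attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [{"role": "user", "content": "How do I rent a GPU on Clore.ai?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```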

***

## DPO (Direct Preference Optimization)

DPO is the most popular alignment method — no reward model needed, just preference pairs.

### Prepare DPO Dataset

```python
# Format: each example has 'prompt', 'chosen', 'rejected'
from datasets import Dataset

data = [
    {
        "prompt": "Explain how to optimize GPU utilization",
        "chosen": "To optimize GPU utilization: 1) Use larger batch sizes to maximize occupancy, 2) Enable mixed precision (bf16/fp16), 3) Profile with nvidia-smi to identify bottlenecks, 4) Use CUDA streams for parallel operations.",
        "rejected": "Just use more GPUs."
    },
    # ... more preference pairs
]

dataset = Dataset.from_list(data)
```
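
If your preference data lives in another schema, a simple `map` converts it to the expected columns. A sketch — the `question` / `good_answer` / `bad_answer` names below are hypothetical placeholders for whatever your source dataset actually uses:

```python
from datasets import load_from_disk

raw = load_from_disk("data/raw_preferences")  # hypothetical source dataset

def to_dpo_format(example):
    return {
        "prompt": example["question"],
        "chosen": example["good_answer"],
        "rejected": example["bad_answer"],
    }

dataset = raw.map(to_dpo_format, remove_columns=raw.column_names)
```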

### DPO Training Script

```python
# dpo_train.py
from trl import DPOTrainer, DPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import torch

model_name = "./sft_final"  # Start from your SFT model!

# Load SFT model (the policy to align)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Reference model (frozen copy of SFT model)
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:5%]")

# DPO configuration
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,           # Much lower than SFT
    beta=0.1,                     # KL penalty coefficient
    loss_type="sigmoid",          # Standard DPO loss
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
    save_steps=50,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer=
)

trainer.train()
trainer.save_model("./dpo_final")
```
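
The key health metric during DPO is the reward margin (chosen minus rejected), which DPOTrainer logs as `rewards/margins` — it should trend upward. If you're not using W&B, you can read it back from a checkpoint's `trainer_state.json`. A small sketch, assuming a `checkpoint-50` directory from the config above exists:

```python
import json

# Every checkpoint directory contains a trainer_state.json with the logged metrics
with open("dpo_output/checkpoint-50/trainer_state.json") as f:
    state = json.load(f)

for entry in state["log_history"]:
    if "rewards/margins" in entry:
        print(entry["step"], round(entry["rewards/margins"], 4))
```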

***

## PPO (Proximal Policy Optimization)

PPO is the classic RLHF approach — use when you have a reward signal:

```python
# ppo_train.py
# Note: this script uses the legacy PPOTrainer loop API (roughly TRL < 0.12);
# run `pip install "trl<0.12"` to use it as-is. Newer TRL releases replace it
# with a Trainer-style API driven by trainer.train() — see the TRL docs.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
from datasets import load_dataset
import torch

model_name = "./sft_final"

# Policy model (with value head for PPO)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Reward model (can be any scoring function)
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)

def reward_fn(texts):
    """Score each response. Return list of reward tensors."""
    results = sentiment_pipe(texts)
    rewards = []
    for result in results:
        score = result["score"] if result["label"] == "POSITIVE" else -result["score"]
        rewards.append(torch.tensor(score))
    return rewards

# Legacy PPOConfig has no output_dir field — save manually with model.save_pretrained
ppo_config = PPOConfig(
    learning_rate=1.41e-5,
    mini_batch_size=1,
    batch_size=4,
    gradient_accumulation_steps=4,
    kl_penalty="kl",
    target_kl=6.0,
    cliprange=0.2,
    vf_coef=0.1,
)

# Prompts must be tokenized and handed to PPOTrainer so that trainer.dataloader exists
dataset = load_dataset("imdb", split="train[:1000]")
dataset = dataset.map(
    lambda x: {"input_ids": tokenizer.encode(x["text"])[:64]},
    remove_columns=dataset.column_names,
)
dataset.set_format(type="torch")

def collator(data):
    # Keep ragged query tensors as a list — the format trainer.step expects
    return {key: [d[key] for d in data] for key in data[0]}

trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=None,  # None: TRL copies the initial policy as the frozen reference
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator,
)

# Training loop
for epoch in range(3):
    for batch in trainer.dataloader:
        queries = batch["input_ids"]

        # Generate responses (list of query tensors in, list of response tensors out)
        responses = trainer.generate(queries, return_prompt=False, max_new_tokens=100)

        # Score responses
        texts = tokenizer.batch_decode(responses, skip_special_tokens=True)
        rewards = reward_fn(texts)

        # PPO update
        stats = trainer.step(queries, responses, rewards)
        trainer.log_stats(stats, batch, rewards)
```
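
When the loop finishes, save the policy and tokenizer manually (the value head is only needed during training); this continues the script above:

```python
# Save the PPO-trained policy for inference or further alignment
model.save_pretrained("./ppo_final")
tokenizer.save_pretrained("./ppo_final")
```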

***

## GRPO (Group Relative Policy Optimization)

GRPO is used in DeepSeek-R1 for reasoning training:

```python
# grpo_train.py
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset
import re, torch

model_name = "Qwen/Qwen2.5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Math dataset
def make_math_dataset():
    examples = [
        {"prompt": "What is 2+2?", "answer": "4"},
        {"prompt": "What is 15 * 7?", "answer": "105"},
        # ... more math problems
    ]
    return Dataset.from_list(examples)

dataset = make_math_dataset()

def correctness_reward(completions, answer, **kwargs):
    """Reward 1.0 if the final number in the completion matches the answer, else 0.0."""
    # With plain-string prompts, each completion is a plain string; extra dataset
    # columns (here `answer`) are passed as per-sample lists via kwargs
    rewards = []
    for completion, ans in zip(completions, answer):
        numbers = re.findall(r"\d+", completion)
        rewards.append(1.0 if numbers and numbers[-1] == ans else 0.0)
    return rewards

grpo_config = GRPOConfig(
    output_dir="./grpo_output",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # must be divisible by num_generations
    num_generations=8,              # GRPO generates G responses per prompt
    learning_rate=5e-7,
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=correctness_reward,
    processing_class=tokenizer,
)

trainer.train()
```
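
`reward_funcs` also accepts a list of reward functions whose outputs are combined per sample — the usual way to add a lightweight format reward alongside correctness in R1-style recipes. A sketch that plugs into the script above, assuming you want answers wrapped in `<answer>...</answer>` tags:

```python
def format_reward(completions, **kwargs):
    """Small bonus when the completion wraps its result in <answer> tags."""
    return [0.2 if re.search(r"<answer>.*?</answer>", c, re.DOTALL) else 0.0 for c in completions]

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward],  # rewards are summed per sample
    processing_class=tokenizer,
)
```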

***

## Multi-GPU Training

Use `accelerate` for distributed training:

```bash
# Configure accelerate for multi-GPU
accelerate config

# Example config for 4 GPUs:
# - compute_environment: LOCAL_MACHINE
# - distributed_type: MULTI_GPU
# - num_processes: 4
# - mixed_precision: bf16

# Launch training across all GPUs
accelerate launch sft_train.py
accelerate launch dpo_train.py

# Or specify GPUs explicitly
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
  --num_processes 4 \
  --mixed_precision bf16 \
  sft_train.py
```

***

## Using the TRL CLI

TRL provides convenient CLI commands:

```bash
# SFT via CLI
trl sft \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --dataset_name HuggingFaceH4/ultrachat_200k \
  --dataset_train_split train_sft \
  --output_dir ./cli_sft_output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --bf16 \
  --use_peft \
  --lora_r 16 \
  --lora_alpha 32

# DPO via CLI
trl dpo \
  --model_name_or_path ./cli_sft_output \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir ./cli_dpo_output \
  --num_train_epochs 1 \
  --beta 0.1 \
  --bf16
```

***

## Monitoring Training

```bash
# Watch GPU utilization
watch -n 1 nvidia-smi

# Monitor training loss (if using W&B)
# Open https://wandb.ai/your-username in browser

# Check output directory for checkpoints
ls -lh sft_output/checkpoint-*/

# Resume from a checkpoint: pass the path to trainer.train() inside your script,
# e.g. trainer.train(resume_from_checkpoint="sft_output/checkpoint-500")
```
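
If you prefer not to use W&B, TensorBoard works over an SSH tunnel, so no extra ports need to be opened on the server — set `report_to="tensorboard"` in your training config, then:

```bash
# On the server: install TensorBoard and point it at the output directory
pip install tensorboard
tensorboard --logdir ./sft_output --port 6006

# On your local machine: tunnel the port, then open http://localhost:6006
ssh -L 6006:localhost:6006 root@<server-ip> -p <ssh-port>
```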

***

## Clore.ai GPU Recommendations

TRL training is one of the most VRAM-intensive workloads. Pick your GPU based on model size and method:

| Task                                     | GPU                | Notes                                                                  |
| ---------------------------------------- | ------------------ | ---------------------------------------------------------------------- |
| SFT / DPO on 7–8B (QLoRA)                | **RTX 3090** 24 GB | \~8 GB for QLoRA 4-bit; fits comfortably; \~$0.12/hr on Clore.ai       |
| SFT / DPO on 7–8B (LoRA bf16)            | **RTX 4090** 24 GB | Same VRAM as 3090 but 30% faster compute; great for iteration speed    |
| Full SFT on 7B or DPO on 13B             | **A100 40 GB**     | 40 GB fits 7B full-precision training; ECC memory avoids silent errors |
| PPO / full finetune 7B, or any 70B QLoRA | **A100 80 GB**     | PPO needs 2× policy+ref model in VRAM; 80 GB runs both without OOM     |

**Practical tip:** Start on RTX 3090 with QLoRA for experimentation — train Llama 3 8B in \~2 hrs on 10K examples. Once you've validated the pipeline, move to A100 80GB for full-precision runs or 70B models.

**Speed numbers (Llama 3 8B SFT, QLoRA, batch=4, seq=2048):**

* RTX 3090: \~1,100 tokens/sec training throughput
* RTX 4090: \~1,450 tokens/sec
* A100 80GB: \~2,800 tokens/sec (full bf16, no quantization)

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce batch size
per_device_train_batch_size=1
gradient_accumulation_steps=16  # Keep effective batch size the same

# Use 4-bit quantization (QLoRA)
# Add BitsAndBytesConfig with load_in_4bit=True

# Enable gradient checkpointing
gradient_checkpointing=True

# Reduce sequence length
max_seq_length=1024  # instead of 2048+

# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```

### Loss is NaN

```bash
# Common cause: learning rate too high
learning_rate=1e-5  # Try lower

# Common cause: bad data (empty strings, None values)
# Validate dataset:
python3 -c "
from datasets import load_from_disk
ds = load_from_disk('data/sft_dataset')
print(ds[0])
print(f'Length: {len(ds)}')
# Check for None
none_count = sum(1 for x in ds if x.get('messages') is None)
print(f'None count: {none_count}')
"

# Enable bf16 instead of fp16 (more stable)
bf16=True
fp16=False
```

### DPO: chosen rewards lower than rejected rewards

```bash
# This means the model prefers rejected responses — overfitting or bad data
# Solutions:
# 1. Check your dataset quality
# 2. Reduce beta (less KL penalty)
# 3. Reduce learning rate
# 4. Add more SFT training before DPO
beta=0.05  # Try smaller values
```

### Training is very slow

```bash
# Enable Flash Attention 2
pip install flash-attn --no-build-isolation

# In your code:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

# Use bf16 instead of fp16 on Ampere+ GPUs (A100, RTX 30-series and newer)
bf16=True

# Increase DataLoader workers
dataloader_num_workers=4

# Check if GPU is actually being used
nvidia-smi  # Should show high GPU utilization
```

### `tokenizer.pad_token` warning

```python
# Standard fix for Llama/Mistral tokenizers
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Important for training stability
```

### Permission denied / HuggingFace 401

```bash
# Re-login
huggingface-cli login

# Set token in environment
export HF_TOKEN=hf_your-token

# For private models/datasets, ensure you have access:
# Go to https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
# Click "Request access" and accept the license
```

***

## Saving and Sharing Your Model

```bash
# Merge LoRA weights into base model
python3 << 'EOF'
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./sft_final")
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-8B-Instruct")
tokenizer.save_pretrained("./merged_model")
print("Merged model saved!")
EOF

# Push to HuggingFace
huggingface-cli upload your-username/my-trl-model ./merged_model
```
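
To confirm the merged checkpoint loads standalone (no PEFT required), a quick generation test:

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="./merged_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(pipe("How do I rent a GPU on Clore.ai?", max_new_tokens=100)[0]["generated_text"])
```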

***

## Useful Links

* **GitHub**: <https://github.com/huggingface/trl> ⭐ 10K+
* **Documentation**: <https://huggingface.co/docs/trl>
* **DPO Paper**: <https://arxiv.org/abs/2305.18290>
* **GRPO / DeepSeek-R1**: <https://arxiv.org/abs/2501.12948>
* **InstructGPT Paper (RLHF with PPO)**: <https://arxiv.org/abs/2203.02155>
* **HuggingFace PEFT**: <https://github.com/huggingface/peft>
* **Weights & Biases**: <https://wandb.ai>
* **Flash Attention**: <https://github.com/Dao-AILab/flash-attention>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>
