TRL (RLHF/DPO Training)

TRL (Transformer Reinforcement Learning) is Hugging Face's official library for training language models with reinforcement learning techniques. With 10K+ GitHub stars, it provides state-of-the-art implementations of RLHF, DPO, PPO, GRPO, and other alignment algorithms for LLMs.


What is TRL?

TRL is the library behind many of today's best-aligned language models. It provides:

  • SFT (Supervised Fine-Tuning) — standard instruction tuning with ChatML format

  • RLHF/PPO — classic Proximal Policy Optimization with a reward model

  • DPO — Direct Preference Optimization (no reward model needed!)

  • GRPO — Group Relative Policy Optimization (DeepSeek-R1's method)

  • KTO — Kahneman-Tversky Optimization (works with unpaired preferences)

  • Reward Modeling — train a reward model from human preference data

  • IterativeSFT — online RL with a simpler setup

  • ORPO — Odds Ratio Preference Optimization

TRL integrates natively with the Hugging Face ecosystem: transformers, peft, datasets, accelerate, and bitsandbytes.


Which algorithm should you use?

  • DPO — simplest, most stable. Use when you have paired preference data (chosen/rejected).

  • PPO — most powerful but complex. Use when you have a reward model or scoring function.

  • GRPO — great for reasoning/math tasks. DeepSeek-R1's training method.

  • SFT — always start here before applying any RL method.


Server Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB) | A100 80 GB / H100 |
| VRAM | 16 GB (SFT/DPO 7B + LoRA) | 80 GB (full finetune 7B) |
| RAM | 32 GB | 64 GB+ |
| CPU | 8 cores | 16+ cores |
| Storage | 100 GB | 300 GB+ |
| OS | Ubuntu 20.04+ | Ubuntu 22.04 |
| Python | 3.9+ | 3.11 |
| CUDA | 11.8+ | 12.1+ |

VRAM by Task

| Task | Model | Method | VRAM |
|---|---|---|---|
| SFT | Llama 3 8B | QLoRA 4-bit | ~8 GB |
| DPO | Llama 3 8B | LoRA | ~20 GB |
| PPO | Llama 3 8B | Full | ~80 GB (2×A100) |
| GRPO | Qwen 7B | LoRA | ~24 GB |
| SFT | Llama 3 70B | QLoRA 4-bit | ~48 GB |
| DPO | Llama 3 70B | LoRA | ~80 GB |


Ports

| Port | Service | Notes |
|---|---|---|
| 22 | SSH | Terminal access, file transfer, monitoring |

TRL is a training library — it runs as a CLI/Python script, no web server required.


Installation on Clore.ai

Step 1 — Rent a Server

  1. Filter for VRAM ≥ 24 GB (RTX 3090, A100, or H100)

  2. Choose a PyTorch or CUDA 12.1 base image

  3. Select Storage ≥ 200 GB for models and datasets

  4. Open port 22 for SSH access

Step 2 — Connect via SSH
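Connect with the SSH details shown in your Clore.ai dashboard. The host below is a placeholder — substitute your server's IP and port:

```shell
# Connect to the rented server (replace YOUR_SERVER_IP and the port with your server's values)
ssh -p 22 root@YOUR_SERVER_IP

# Optional: forward a local port so you can view TensorBoard/W&B dashboards later
ssh -p 22 -L 6006:localhost:6006 root@YOUR_SERVER_IP
```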

Step 3 — Install TRL
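A minimal install sketch; TRL pulls in transformers, accelerate, and datasets as dependencies, and peft/bitsandbytes enable LoRA and 4-bit quantization:

```shell
# Create an isolated environment (assumes Python 3.10+ on the base image)
python3 -m venv ~/trl-env && source ~/trl-env/bin/activate

# Install TRL plus the usual companions for LoRA/QLoRA training
pip install -U trl peft bitsandbytes

# Verify the install
python -c "import trl; print(trl.__version__)"
```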

Step 4 — HuggingFace Authentication
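Log in so gated models (e.g. Llama 3, which also requires accepting the license on its model page) can be downloaded:

```shell
# Interactive login — paste a token from huggingface.co/settings/tokens
huggingface-cli login

# Or non-interactively via an environment variable
export HF_TOKEN=hf_xxx
```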

Step 5 — Optional: Weights & Biases Tracking
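If you want experiment tracking, a sketch of the standard W&B setup (the project name is a placeholder):

```shell
pip install wandb
wandb login                              # paste your API key from wandb.ai/authorize
export WANDB_PROJECT=trl-experiments     # optional: group runs under one project
```

With this in place, pass `report_to="wandb"` in your training config and metrics stream to the dashboard automatically.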


Supervised Fine-Tuning (SFT)

SFT is always the first step before any RL technique.

Prepare Your Dataset
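SFTTrainer accepts a dataset with a `messages` column in the standard chat format (or a plain `text` column); the tokenizer's chat template is applied for you. A minimal sketch that writes a few illustrative examples to a JSONL file — the filename and contents are placeholders:

```python
import json

# Each record holds one conversation in the chat ("messages") format.
examples = [
    {"messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]},
    {"messages": [
        {"role": "user", "content": "Write a haiku about GPUs."},
        {"role": "assistant", "content": "Silicon furnace / tensors bloom in parallel / gradients descend"},
    ]},
]

with open("sft_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Load it for training with `load_dataset("json", data_files="sft_data.jsonl", split="train")`.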

SFT Training Script
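A minimal QLoRA SFT sketch. Exact field names vary slightly between TRL releases; the output directory, hyperparameters, and dataset file are illustrative, not prescriptive:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")

# LoRA adapter on the attention projections
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="llama3-8b-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
    gradient_checkpointing=True,
    model_init_kwargs={"load_in_4bit": True},  # QLoRA via bitsandbytes
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # a string is loaded for you
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```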


DPO (Direct Preference Optimization)

DPO is the most popular alignment method — no reward model needed, just preference pairs.

Prepare DPO Dataset
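DPO expects preference pairs: for each prompt, a preferred ("chosen") and a dispreferred ("rejected") completion. A minimal sketch with illustrative data:

```python
import json

pairs = [
    {
        "prompt": "Explain overfitting in one sentence.",
        "chosen": "Overfitting is when a model memorizes its training data and fails to generalize to new inputs.",
        "rejected": "Overfitting is good because the training loss goes to zero.",
    },
    {
        "prompt": "Is it safe to share my password?",
        "chosen": "No — never share your password; use a password manager instead.",
        "rejected": "Sure, sharing passwords with friends is fine.",
    },
]

with open("dpo_data.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```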

DPO Training Script
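A hedged DPO sketch starting from your SFT checkpoint (paths and hyperparameters are placeholders; in recent TRL the tokenizer is passed as `processing_class`). With a `peft_config`, TRL derives the frozen reference model from the base weights, so no separate copy is needed:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "llama3-8b-sft"  # your SFT checkpoint from the previous step
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("json", data_files="dpo_data.jsonl", split="train")

training_args = DPOConfig(
    output_dir="llama3-8b-dpo",
    beta=0.1,                      # strength of the implicit KL penalty toward the reference
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,            # DPO uses much lower LRs than SFT
    logging_steps=10,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```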


PPO (Proximal Policy Optimization)

PPO is the classic RLHF approach — use when you have a reward signal:
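The PPO API has changed significantly across TRL versions; this sketch follows the trainer-style API of recent releases, and the model names, dataset file, and exact keyword names are assumptions to verify against your installed version:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)
from trl import PPOConfig, PPOTrainer

model_name = "llama3-8b-sft"            # your SFT checkpoint
reward_model_name = "my-reward-model"   # a trained sequence-classification reward model

tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
ref_policy = AutoModelForCausalLM.from_pretrained(model_name)   # frozen reference
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, num_labels=1)

dataset = load_dataset("json", data_files="prompts.jsonl", split="train")

trainer = PPOTrainer(
    args=PPOConfig(output_dir="llama3-8b-ppo",
                   per_device_train_batch_size=1,
                   gradient_accumulation_steps=16,
                   learning_rate=1e-6),
    processing_class=tokenizer,
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=dataset,
)
trainer.train()
```

Note the VRAM table above: PPO keeps policy, reference, reward, and value models resident, which is why it needs far more memory than DPO.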


GRPO (Group Relative Policy Optimization)

GRPO is used in DeepSeek-R1 for reasoning training:
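A hedged GRPO sketch assuming a plain-text prompt dataset. A reward function scores each generated completion; GRPO compares rewards within a group of samples for the same prompt, so no value model is needed. The toy reward, model choice, and dataset file are illustrative:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: favor completions that contain a numeric answer.
# Real setups use verifiable rewards (exact-match answers, unit tests, etc.).
def reward_has_number(completions, **kwargs):
    return [1.0 if any(ch.isdigit() for ch in comp) else 0.0 for comp in completions]

dataset = load_dataset("json", data_files="math_prompts.jsonl", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=reward_has_number,
    args=GRPOConfig(output_dir="qwen-grpo",
                    num_generations=8,          # group size sampled per prompt
                    per_device_train_batch_size=8,
                    bf16=True),
    train_dataset=dataset,
)
trainer.train()
```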


Multi-GPU Training

Use accelerate for distributed training:
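The script names and DeepSpeed config file below are placeholders for your own training scripts:

```shell
# One-time: answer the interactive prompts (multi-GPU, bf16, etc.)
accelerate config

# Launch the same training script across 2 GPUs
accelerate launch --num_processes 2 sft_train.py

# DeepSpeed ZeRO-3 for models that don't fit on a single GPU
accelerate launch --config_file ds_zero3.yaml dpo_train.py
```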


Using the TRL CLI

TRL provides convenient CLI commands:
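A sketch of the `trl sft` and `trl dpo` subcommands; flags mirror the corresponding config fields, and the datasets shown are public TRL example datasets — swap in your own:

```shell
# SFT from the command line
trl sft \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset_name trl-lib/Capybara \
  --output_dir llama3-8b-sft \
  --per_device_train_batch_size 4 \
  --use_peft --lora_r 16

# DPO from the command line
trl dpo \
  --model_name_or_path llama3-8b-sft \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir llama3-8b-dpo
```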


Monitoring Training
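From a second SSH session, watch GPU usage and training metrics (the log directory is a placeholder for your `output_dir`):

```shell
# GPU utilization and VRAM usage, refreshed every 2 seconds
watch -n 2 nvidia-smi

# TensorBoard over the SSH tunnel from Step 2; open localhost:6006 locally
tensorboard --logdir llama3-8b-sft --port 6006
```

If you enabled W&B in Step 5, loss, rewards, and KL curves also appear live on your project dashboard.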


Clore.ai GPU Recommendations

TRL training is one of the most VRAM-intensive workloads. Pick your GPU based on model size and method:

| Task | GPU | Notes |
|---|---|---|
| SFT / DPO on 7–8B (QLoRA) | RTX 3090 24 GB | ~8 GB for QLoRA 4-bit; fits comfortably; ~$0.12/hr on Clore.ai |
| SFT / DPO on 7–8B (LoRA bf16) | RTX 4090 24 GB | Same VRAM as the 3090 but ~30% faster compute; great for iteration speed |
| Full SFT on 7B or DPO on 13B | A100 40 GB | 40 GB fits 7B full-precision training; ECC memory avoids silent errors |
| PPO / full finetune 7B, or any 70B QLoRA | A100 80 GB | PPO keeps both policy and reference models in VRAM; 80 GB runs both without OOM |

Practical tip: Start on RTX 3090 with QLoRA for experimentation — train Llama 3 8B in ~2 hrs on 10K examples. Once you've validated the pipeline, move to A100 80GB for full-precision runs or 70B models.

Speed numbers (Llama 3 8B SFT, QLoRA, batch=4, seq=2048):

  • RTX 3090: ~1,100 tokens/sec training throughput

  • RTX 4090: ~1,450 tokens/sec

  • A100 80GB: ~2,800 tokens/sec (full bf16, no quantization)


Troubleshooting

CUDA Out of Memory
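Common levers, roughly in order of impact, sketched as SFTConfig fields (the same fields exist on DPOConfig; exact names can vary by TRL version):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="out",
    per_device_train_batch_size=1,   # smallest batch, compensate with...
    gradient_accumulation_steps=16,  # ...accumulation for the same effective batch
    gradient_checkpointing=True,     # trade compute for activation memory
    max_seq_length=1024,             # shorter sequences cut activation memory
    bf16=True,
    optim="paged_adamw_8bit",        # 8-bit optimizer states via bitsandbytes
)
```

If that is still not enough, switch to QLoRA (4-bit base weights) or move up the GPU table above.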

Loss is NaN

DPO: chosen_rewards > rejected_rewards is False

Training is very slow

tokenizer.pad_token warning
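Llama-family tokenizers ship without a pad token; the usual fix is to reuse EOS for padding before constructing the trainer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # pad with EOS so batching works
```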

Permission denied / HuggingFace 401
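A 401 usually means the token is missing, expired, or the model's license has not been accepted on its Hugging Face page. Check and re-authenticate:

```shell
huggingface-cli whoami   # shows which account (if any) is logged in
huggingface-cli login    # paste a fresh token with at least "read" scope
```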


Saving and Sharing Your Model

