# Fine-tuning Tools Comparison

Choose the right fine-tuning framework for training LLMs on Clore.ai GPU servers.

{% hint style="info" %}
**Fine-tuning** adapts a pre-trained LLM to your specific task or domain. This guide compares the four leading open-source tools: Unsloth, Axolotl, LLaMA-Factory, and TRL — covering speed, memory efficiency, supported models, and ease of use.
{% endhint %}

***

## Quick Decision Matrix

|                       | Unsloth                        | Axolotl                | LLaMA-Factory     | TRL             |
| --------------------- | ------------------------------ | ---------------------- | ----------------- | --------------- |
| **Best for**          | Speed + memory                 | Config-driven training | Beginner-friendly | Research + RLHF |
| **Speed vs baseline** | 2-5× faster                    | \~1× (standard)        | \~1× (standard)   | \~1× (standard) |
| **Memory reduction**  | 70-80% less                    | QLoRA standard         | QLoRA standard    | Standard        |
| **RLHF/DPO/PPO**      | Basic                          | ✅                      | ✅                 | ✅ (native)      |
| **WebUI**             | ❌                              | ❌                      | ✅                 | ❌               |
| **GitHub stars**      | 23K+                           | 9K+                    | 37K+              | 10K+            |
| **License**           | Apache 2.0 (core; Pro tier is paid) | Apache 2.0             | Apache 2.0        | Apache 2.0      |

***

## Overview

### Unsloth

Unsloth is laser-focused on one thing: making fine-tuning as fast and memory-efficient as possible. It rewrites key operations in Triton and optimizes CUDA kernels.

**Philosophy**: Maximum speed, minimum VRAM — no compromises.

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # ~30% less VRAM vs standard checkpointing
    random_state=42,
)
```

### Axolotl

Axolotl wraps HuggingFace Transformers with a YAML-based configuration system. It handles the complexity of training setup so you can focus on data and hyperparameters.

**Philosophy**: Everything in YAML, full flexibility underneath.

```yaml
# config.yml
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

datasets:
  - path: mhenrichsen/alpaca_data_cleaned
    type: alpaca

load_in_4bit: true
adapter: qlora

lora_r: 32
lora_alpha: 16
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
```

### LLaMA-Factory

LLaMA-Factory supports the widest range of models (100+) and training methods, with a web UI for configuration. It's the most accessible option for non-researchers.

**Philosophy**: Everything works, for everyone.

```bash
# Train via command line
llamafactory-cli train \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --stage sft \
  --do_train \
  --dataset alpaca_gpt4_en \
  --template llama3 \
  --finetuning_type lora \
  --lora_rank 8 \
  --output_dir saves/llama3-8b-lora \
  --num_train_epochs 3.0 \
  --per_device_train_batch_size 2

# Or use WebUI
llamafactory-cli webui
```

### TRL (Transformer Reinforcement Learning)

TRL is HuggingFace's official RLHF library. It's the standard for PPO, DPO, ORPO, and other alignment training methods.

**Philosophy**: Research-first, alignment training native.

```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=load_dataset("tatsu-lab/alpaca", split="train"),
    processing_class=tokenizer,  # named `tokenizer=` in older TRL versions
)

trainer.train()
```

***

## Speed Benchmarks

### Training Speed Comparison (tokens/second)

Test setup: LLaMA 3.1 8B, LoRA r=16, 4-bit quantization unless noted, batch size 4, A100 80GB

| Tool                  | Tokens/sec | vs Baseline | Memory (VRAM) |
| --------------------- | ---------- | ----------- | ------------- |
| Unsloth (4-bit)       | \~4,200    | **2.8×**    | \~8GB         |
| Axolotl (QLoRA)       | \~1,500    | 1.0×        | \~16GB        |
| LLaMA-Factory (QLoRA) | \~1,480    | \~1.0×      | \~16GB        |
| TRL (QLoRA)           | \~1,450    | \~0.97×     | \~18GB        |
| Unsloth (16-bit LoRA) | \~2,800    | **1.9×**    | \~22GB        |

{% hint style="success" %}
**Unsloth's advantage is real**: the 2-5× speedup comes from custom Triton kernels for attention, cross-entropy, RoPE, and LoRA, not just marketing.
{% endhint %}
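
To sanity-check throughput on your own Clore.ai instance, a rough probe is to time a short training run and divide trained tokens by wall-clock time. A minimal sketch, assuming a `trainer` object built as in the framework examples in this guide and fully packed sequences (padding will lower the real number):

```python
import time

def tokens_per_second(trainer, num_steps=20, seq_len=2048):
    """Rough throughput probe: time `num_steps` optimizer steps and
    divide the tokens processed by elapsed wall-clock time."""
    args = trainer.args
    tokens_per_step = (args.per_device_train_batch_size
                       * args.gradient_accumulation_steps
                       * seq_len)
    args.max_steps = num_steps  # cap the run for the measurement
    start = time.perf_counter()
    trainer.train()
    elapsed = time.perf_counter() - start
    return num_steps * tokens_per_step / elapsed
```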

### VRAM Usage Comparison

Training LLaMA 3.1 8B, sequence length 2048:

| Method                  | Unsloth | Axolotl | LLaMA-Factory | TRL  |
| ----------------------- | ------- | ------- | ------------- | ---- |
| Full fine-tune (bf16)   | 60GB    | 70GB    | 72GB          | 74GB |
| LoRA (bf16)             | 18GB    | 24GB    | 25GB          | 26GB |
| QLoRA (4-bit)           | **8GB** | 16GB    | 16GB          | 18GB |
| QLoRA (4-bit, long ctx) | 12GB    | 24GB    | 24GB          | 26GB |

**Minimum GPU for 8B model**:

* Unsloth: RTX 3080 (10GB) ✅
* Others: a 16GB+ card (RTX 4080 class); RTX 3090 (24GB) for comfortable headroom
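
To see where your own run lands in this table, PyTorch's allocator statistics report the peak figure directly:

```python
import torch

# Before training starts:
torch.cuda.reset_peak_memory_stats()

# ... trainer.train() ...

# After training finishes:
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM used: {peak_gb:.1f} GB")
```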

***

## Supported Models

### Model Support Matrix

| Model Family | Unsloth | Axolotl | LLaMA-Factory | TRL |
| ------------ | ------- | ------- | ------------- | --- |
| LLaMA 3.x    | ✅       | ✅       | ✅             | ✅   |
| LLaMA 2      | ✅       | ✅       | ✅             | ✅   |
| Mistral      | ✅       | ✅       | ✅             | ✅   |
| Mixtral MoE  | ✅       | ✅       | ✅             | ✅   |
| Gemma 2      | ✅       | ✅       | ✅             | ✅   |
| Phi-3/3.5    | ✅       | ✅       | ✅             | ✅   |
| Qwen 2.5     | ✅       | ✅       | ✅             | ✅   |
| DeepSeek     | ✅       | ✅       | ✅             | ✅   |
| Falcon       | ✅       | ✅       | ✅             | ✅   |
| GPT-NeoX     | Partial | ✅       | ✅             | ✅   |
| T5/FLAN      | ❌       | ✅       | ✅             | ✅   |
| BERT/RoBERTa | ❌       | ✅       | ✅             | ✅   |
| Vision LLMs  | Partial | Partial | ✅             | ✅   |

### Training Method Support

| Method                      | Unsloth | Axolotl | LLaMA-Factory | TRL        |
| --------------------------- | ------- | ------- | ------------- | ---------- |
| Full fine-tune              | ✅       | ✅       | ✅             | ✅          |
| LoRA                        | ✅       | ✅       | ✅             | ✅          |
| QLoRA                       | ✅       | ✅       | ✅             | ✅          |
| DoRA                        | ✅       | ✅       | ✅             | ❌          |
| PEFT                        | ✅       | ✅       | ✅             | ✅          |
| SFT                         | ✅       | ✅       | ✅             | ✅ (native) |
| DPO                         | ✅       | ✅       | ✅             | ✅ (native) |
| PPO                         | ❌       | ✅       | ✅             | ✅ (native) |
| ORPO                        | ✅       | ✅       | ✅             | ✅          |
| KTO                         | ❌       | ✅       | ✅             | ✅ (native) |
| GRPO                        | ✅       | ❌       | ✅             | ✅          |
| CPT (continued pretraining) | ✅       | ✅       | ✅             | ✅          |
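
Note that the alignment methods in this table (DPO, ORPO, KTO) consume *preference* data rather than plain instruction/response pairs. A single record, using field names roughly matching what TRL and LLaMA-Factory expect (content is illustrative):

```python
preference_record = {
    "prompt": "Explain LoRA in one sentence.",
    "chosen": "LoRA fine-tunes a model by training small low-rank adapter matrices instead of all weights.",
    "rejected": "LoRA is a long-range radio protocol.",
}
```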

***

## Unsloth: Deep Dive

### What Makes It Fast

1. **Triton kernels**: Rewrites Flash Attention, cross-entropy loss, and LoRA in Triton
2. **Fused operations**: Combines multiple CUDA ops into one kernel
3. **Smart gradient checkpointing**: "unsloth" mode saves \~30% more memory than standard checkpointing (see the sketch after this list)
4. **Efficient backprop**: Avoids materializing large intermediate tensors
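
For intuition on point 3: gradient checkpointing trades recompute for memory by not storing intermediate activations. In vanilla PyTorch the same idea looks like this; Unsloth's variant does the bookkeeping inside its own kernels, so this is only an illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint

linear = torch.nn.Linear(4096, 4096)
x = torch.randn(8, 4096, requires_grad=True)

def block(inp):
    # Activations inside this function are recomputed during backward
    # instead of being kept alive, lowering peak memory.
    return torch.nn.functional.gelu(linear(inp))

y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```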

### Installation on Clore.ai

```bash
# CUDA 12.1
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Or with conda
conda create --name unsloth_env python=3.11
conda activate unsloth_env
conda install pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers -y
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```
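
A quick smoke test, assuming the install succeeded (Unsloth typically prints a startup banner with detected GPU and kernel info on import):

```python
import torch
import unsloth  # importing triggers Unsloth's environment check

print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0))
```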

### Complete Training Script

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# 1. Load model with Unsloth optimization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,        # Auto-detect
    load_in_4bit=True,
)

# 2. Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# 3. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_prompt(example):
    # Append EOS so the model learns where a response ends
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}" + tokenizer.eos_token}

dataset = dataset.map(format_prompt)

# 4. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
trainer.train()

# 5. Save
model.save_pretrained("lora_model")
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")
```
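
To reload the adapter for a quick test after training, the saved directory can be passed straight back to `FastLanguageModel.from_pretrained`, and `for_inference` switches on Unsloth's faster generation path. A minimal sketch, assuming the `lora_model` directory saved above:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # LoRA adapter directory saved above
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable optimized inference mode

prompt = "### Instruction:\nName three uses of fine-tuning.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```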

**Weaknesses**: No PPO support, limited to its supported model list, and the open-source version targets single-GPU training (multi-GPU is part of the paid tier)

***

## Axolotl: Deep Dive

### Configuration-First Approach

Axolotl shines when you want reproducible, version-controlled training configurations:

```yaml
# axolotl_config.yml — full example
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

# Data
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
  - path: ./my_custom_data.jsonl
    type: sharegpt
dataset_prepared_path: ./prepared_data
val_set_size: 0.01

# Quantization
load_in_4bit: true
adapter: qlora
bf16: true
tf32: true

# LoRA
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Training
sequence_len: 4096
sample_packing: true  # Packs short sequences for efficiency
pad_to_sequence_len: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine

# Logging
logging_steps: 10
eval_steps: 100
save_steps: 100
output_dir: ./outputs/my-model

# wandb
wandb_project: my-fine-tune
wandb_run_id: run-001
```
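
The `./my_custom_data.jsonl` entry above uses the ShareGPT schema: one JSON object per line, each holding a `conversations` turn list. An illustrative record (content is made up):

```json
{"conversations": [{"from": "human", "value": "What does QLoRA do?"}, {"from": "gpt", "value": "QLoRA trains LoRA adapters on top of a 4-bit quantized base model."}]}
```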

```bash
# Install and run
pip install "axolotl[flash-attn,deepspeed]"
axolotl train axolotl_config.yml
```

**Best for**: Teams that want reproducible, config-versioned training runs

***

## LLaMA-Factory: Deep Dive

### WebUI Walkthrough

```bash
# Install
pip install llamafactory

# Launch WebUI
llamafactory-cli webui
# Open http://localhost:7860
```

WebUI tabs:

1. **Train** — configure base model, dataset, method
2. **Evaluate** — run MMLU, CMMLU benchmarks
3. **Chat** — interactive inference
4. **Export** — merge LoRA, quantize to GGUF

### CLI Training Example

```bash
# Supervised Fine-Tuning
llamafactory-cli train \
  --stage sft \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset alpaca_gpt4_en,glaive_toolcall_en \
  --template llama3 \
  --finetuning_type lora \
  --lora_rank 8 \
  --lora_alpha 16 \
  --lora_target all \
  --output_dir saves/llama3-lora \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --quantization_bit 4 \
  --flash_attn fa2

# DPO Training
llamafactory-cli train \
  --stage dpo \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset dpo_mix_en \
  --template llama3 \
  --finetuning_type lora \
  --output_dir saves/llama3-dpo
```
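
Once training finishes, the same CLI can merge the LoRA adapter back into the base weights for deployment. A sketch, with flag names that may vary slightly between LLaMA-Factory versions:

```bash
llamafactory-cli export \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --adapter_name_or_path saves/llama3-lora \
  --template llama3 \
  --finetuning_type lora \
  --export_dir exports/llama3-merged
```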

**Best for**: Beginners, teams wanting WebUI, DPO/RLHF without deep research knowledge

***

## TRL: Deep Dive

### RLHF Pipeline Example

TRL is the go-to for alignment training:

```python
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# DPO (Direct Preference Optimization) — most common alignment method
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dpo_config = DPOConfig(
    output_dir="dpo_outputs",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    beta=0.1,             # KL penalty coefficient
    loss_type="sigmoid",  # or "hinge", "ipo"
    learning_rate=5e-7,
)

# Preference pairs (chosen + rejected); recent TRL can derive the
# prompt from the shared prefix of the two responses
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model=model,                 # a frozen reference model is created automatically
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL versions
)
trainer.train()
```
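
For reference, `beta` scales the implicit reward margin in the DPO objective, where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ the frozen reference model, and $(x, y_w, y_l)$ a prompt with chosen and rejected responses:

$$
\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$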

**Best for**: Alignment research, RLHF, DPO, PPO, ORPO implementations

***

## Choosing the Right Tool

### Decision Flow

```
Need maximum speed/minimum VRAM?
  → YES → Unsloth (2-5× faster, fits on smaller GPUs)

Need alignment training (DPO/PPO/RLHF)?
  → YES → TRL or LLaMA-Factory
  → Research/custom → TRL
  → Production/easy → LLaMA-Factory

Need configuration-first reproducibility?
  → YES → Axolotl

Non-technical team or want WebUI?
  → YES → LLaMA-Factory

Just want to start quickly?
  → LLaMA-Factory or Unsloth
```

### By Team Type

| Team                  | Recommendation | Reason                        |
| --------------------- | -------------- | ----------------------------- |
| Individual researcher | Unsloth        | Speed + Jupyter notebooks     |
| ML engineer           | Axolotl        | Config-driven, reproducible   |
| Product team          | LLaMA-Factory  | WebUI, wide model support     |
| Alignment team        | TRL            | Native RLHF primitives        |
| Startup               | Unsloth + TRL  | Speed + alignment when needed |

***

## Clore.ai GPU Recommendations

| Task              | Min GPU         | Recommended  | Tool            |
| ----------------- | --------------- | ------------ | --------------- |
| 7-8B LoRA (QLoRA) | RTX 3080 (10GB) | RTX 3090     | Unsloth         |
| 13B LoRA          | RTX 3090 (24GB) | A6000 (48GB) | Unsloth/Axolotl |
| 70B LoRA          | A100 (80GB)     | 2×A100       | Axolotl/TRL     |
| 8B Full FT        | A100 (80GB)     | 2×A100       | Any             |
| DPO/PPO 7B        | RTX 4090 (24GB) | A6000 (48GB) | TRL             |

***

## Useful Links

* [Unsloth GitHub](https://github.com/unslothai/unsloth) — 23K+ stars
* [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl) — 9K+ stars
* [LLaMA-Factory GitHub](https://github.com/hiyouga/LLaMA-Factory) — 37K+ stars
* [TRL GitHub](https://github.com/huggingface/trl) — 10K+ stars
* [HuggingFace PEFT Docs](https://huggingface.co/docs/peft)

***

## Summary

| Tool              | Best for                            | Key advantage                 |
| ----------------- | ----------------------------------- | ----------------------------- |
| **Unsloth**       | Speed-critical training, small GPUs | 2-5× faster, 70% less VRAM    |
| **Axolotl**       | Config-driven, reproducible runs    | YAML-first, many data formats |
| **LLaMA-Factory** | 100+ models, WebUI, beginners       | Most model support, GUI       |
| **TRL**           | RLHF, DPO, alignment research       | Native alignment training     |

For most Clore.ai use cases: start with **Unsloth** (speed + memory efficiency), add **TRL** if you need DPO or PPO alignment training.
