# Fine-tuning Tools Comparison

Choose the right fine-tuning framework for training LLMs on Clore.ai GPU servers.

{% hint style="info" %}
**Fine-tuning** adapts a pre-trained LLM to your specific task or domain. This guide compares the four leading open-source tools: Unsloth, Axolotl, LLaMA-Factory, and TRL — covering speed, memory efficiency, supported models, and ease of use.
{% endhint %}

***

## Quick Decision Matrix

|                       | Unsloth                        | Axolotl                | LLaMA-Factory     | TRL             |
| --------------------- | ------------------------------ | ---------------------- | ----------------- | --------------- |
| **Best for**          | Speed + memory                 | Config-driven training | Beginner-friendly | Research + RLHF |
| **Speed vs baseline** | 2-5× faster                    | \~1× (standard)        | \~1× (standard)   | \~1× (standard) |
| **Memory reduction**  | 70-80% less                    | QLoRA standard         | QLoRA standard    | Standard        |
| **RLHF/DPO/PPO**      | Basic                          | ✅                      | ✅                 | ✅ (native)      |
| **WebUI**             | ❌                              | ❌                      | ✅                 | ❌               |
| **GitHub stars**      | 23K+                           | 9K+                    | 37K+              | 10K+            |
| **License**           | LGPL-3.0 (copyleft; check terms) | Apache 2.0             | Apache 2.0        | Apache 2.0      |

***

## Overview

### Unsloth

Unsloth is laser-focused on one thing: making fine-tuning as fast and memory-efficient as possible. It rewrites key operations in Triton and optimizes CUDA kernels.

**Philosophy**: Maximum speed, minimum VRAM — no compromises.

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # ~30% more batch size
    random_state=42,
)
```

### Axolotl

Axolotl wraps HuggingFace Transformers with a YAML-based configuration system. It handles the complexity of training setup so you can focus on data and hyperparameters.

**Philosophy**: Everything in YAML, full flexibility underneath.

```yaml
# config.yml
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca

load_in_4bit: true
adapter: qlora

lora_r: 32
lora_alpha: 16
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
```

### LLaMA-Factory

LLaMA-Factory supports the widest range of models (100+) and training methods, with a web UI for configuration. It's the most accessible option for non-researchers.

**Philosophy**: Everything works, for everyone.

```bash
# Train via command line
llamafactory-cli train \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --stage sft \
  --do_train \
  --dataset alpaca_gpt4_en \
  --template llama3 \
  --finetuning_type lora \
  --lora_rank 8 \
  --output_dir saves/llama3-8b-lora \
  --num_train_epochs 3.0 \
  --per_device_train_batch_size 2

# Or use WebUI
llamafactory-cli webui
```

### TRL (Transformer Reinforcement Learning)

TRL is HuggingFace's official RLHF library. It's the standard for PPO, DPO, ORPO, and other alignment training methods.

**Philosophy**: Research-first, alignment training native.

```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # renamed to `processing_class` in newer TRL versions
    args=training_args,
    train_dataset=load_dataset("tatsu-lab/alpaca", split="train"),
)

trainer.train()
```

***

## Speed Benchmarks

### Training Speed Comparison (tokens/second)

Test setup: LLaMA 3.1 8B, LoRA r=16, 4-bit quantization, batch size 4, A100 80GB

| Tool                  | Tokens/sec | vs Baseline | Memory (VRAM) |
| --------------------- | ---------- | ----------- | ------------- |
| Unsloth (4-bit)       | \~4,200    | **2.8×**    | \~8GB         |
| Axolotl (QLoRA)       | \~1,500    | 1.0×        | \~16GB        |
| LLaMA-Factory (QLoRA) | \~1,480    | \~1.0×      | \~16GB        |
| TRL (QLoRA)           | \~1,450    | \~0.97×     | \~18GB        |
| Unsloth (full 16-bit) | \~2,800    | **1.9×**    | \~22GB        |
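
To turn throughput into wall-clock time, a back-of-the-envelope sketch (the \~10M tokens per epoch is an illustrative assumption, not a benchmark figure):

```python
# Rough epoch-time estimate from the throughput table above
tokens_per_epoch = 10_000_000  # assumed corpus size; adjust for your dataset

for tool, tps in {"Unsloth (4-bit)": 4200, "Axolotl (QLoRA)": 1500}.items():
    hours = tokens_per_epoch / tps / 3600
    print(f"{tool}: ~{hours:.1f} h per epoch")
# Unsloth (4-bit): ~0.7 h per epoch
# Axolotl (QLoRA): ~1.9 h per epoch
```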

{% hint style="success" %}
**The Unsloth advantage is real**: the 2-5× speedup comes from custom Triton kernels for attention, cross-entropy, RoPE, and LoRA, not just marketing.
{% endhint %}

### VRAM Usage Comparison

Training LLaMA 3.1 8B, sequence length 2048:

| Method                  | Unsloth | Axolotl | LLaMA-Factory | TRL  |
| ----------------------- | ------- | ------- | ------------- | ---- |
| Full fine-tune (bf16)   | 60GB    | 70GB    | 72GB          | 74GB |
| LoRA (bf16)             | 18GB    | 24GB    | 25GB          | 26GB |
| QLoRA (4-bit)           | **8GB** | 16GB    | 16GB          | 18GB |
| QLoRA (4-bit, long ctx) | 12GB    | 24GB    | 24GB          | 26GB |

**Minimum GPU for 8B model**:

* Unsloth: RTX 3080 (10GB) ✅
* Others: RTX 3090 (24GB) required
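
The 4-bit numbers follow from simple arithmetic: 4-bit weights cost \~0.5 bytes per parameter, and the rest is LoRA gradients/optimizer state plus activations. A rough sketch (the overhead figure is an illustrative assumption):

```python
# Back-of-the-envelope QLoRA VRAM estimate; overhead varies by framework
def qlora_vram_gb(n_params_billion: float, overhead_gb: float = 4.0) -> float:
    weights_gb = n_params_billion * 0.5  # 4-bit quantization = 0.5 bytes/param
    return weights_gb + overhead_gb      # + LoRA optimizer state + activations

print(qlora_vram_gb(8))  # ~8 GB, matching the Unsloth row above
```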

***

## Supported Models

### Model Support Matrix

| Model Family | Unsloth | Axolotl | LLaMA-Factory | TRL |
| ------------ | ------- | ------- | ------------- | --- |
| LLaMA 3.x    | ✅       | ✅       | ✅             | ✅   |
| LLaMA 2      | ✅       | ✅       | ✅             | ✅   |
| Mistral      | ✅       | ✅       | ✅             | ✅   |
| Mixtral MoE  | ✅       | ✅       | ✅             | ✅   |
| Gemma 2      | ✅       | ✅       | ✅             | ✅   |
| Phi-3/3.5    | ✅       | ✅       | ✅             | ✅   |
| Qwen 2.5     | ✅       | ✅       | ✅             | ✅   |
| DeepSeek     | ✅       | ✅       | ✅             | ✅   |
| Falcon       | ✅       | ✅       | ✅             | ✅   |
| GPT-NeoX     | Partial | ✅       | ✅             | ✅   |
| T5/FLAN      | ❌       | ✅       | ✅             | ✅   |
| BERT/RoBERTa | ❌       | ✅       | ✅             | ✅   |
| Vision LLMs  | Partial | Partial | ✅             | ✅   |

### Training Method Support

| Method                      | Unsloth | Axolotl | LLaMA-Factory | TRL        |
| --------------------------- | ------- | ------- | ------------- | ---------- |
| Full fine-tune              | ✅       | ✅       | ✅             | ✅          |
| LoRA                        | ✅       | ✅       | ✅             | ✅          |
| QLoRA                       | ✅       | ✅       | ✅             | ✅          |
| DoRA                        | ✅       | ✅       | ✅             | ❌          |
| PEFT                        | ✅       | ✅       | ✅             | ✅          |
| SFT                         | ✅       | ✅       | ✅             | ✅ (native) |
| DPO                         | ✅       | ✅       | ✅             | ✅ (native) |
| PPO                         | ❌       | ✅       | ✅             | ✅ (native) |
| ORPO                        | ✅       | ✅       | ✅             | ✅          |
| KTO                         | ❌       | ✅       | ✅             | ✅ (native) |
| GRPO                        | ✅       | ❌       | ✅             | ✅          |
| CPT (continued pretraining) | ✅       | ✅       | ✅             | ✅          |

***

## Unsloth: Deep Dive

### What Makes It Fast

1. **Triton kernels**: Rewrites Flash Attention, cross-entropy loss, and LoRA in Triton
2. **Fused operations**: Combines multiple CUDA ops into one kernel (see the toy sketch after this list)
3. **Smart gradient checkpointing**: "unsloth" mode saves \~30% more memory
4. **Efficient backprop**: Avoids materializing large intermediate tensors
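
To make the fusion idea concrete, here is a toy Triton kernel, not Unsloth's actual code: fusing an elementwise add and a ReLU into one kernel means the intermediate sum never round-trips through global memory.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide chunk
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # add + ReLU fused in registers; no intermediate tensor is materialized
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```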

### Installation on Clore.ai

```bash
# CUDA 12.1
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Or with conda
conda create --name unsloth_env python=3.11
conda activate unsloth_env
conda install pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers -y
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```
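
A quick sanity check after installing (assumes a CUDA GPU is visible in the instance):

```python
import torch
from unsloth import FastLanguageModel  # fails fast if the install is broken

print("GPU:", torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())
```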

### Complete Training Script

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# 1. Load model with Unsloth optimization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,        # Auto-detect
    load_in_4bit=True,
)

# 2. Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# 3. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_prompt(example):
    # Append EOS so the model learns where a response ends; without it, generations never stop
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text + tokenizer.eos_token}

dataset = dataset.map(format_prompt)

# 4. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
trainer.train()

# 5. Save
model.save_pretrained("lora_model")
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")
```
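
Before exporting, it's worth a quick generation smoke test in the same session. `FastLanguageModel.for_inference` switches the model to Unsloth's fast inference path (the prompt below is just an example):

```python
FastLanguageModel.for_inference(model)  # enable fast inference on the trained model

prompt = "### Instruction:\nName three budget GPUs for fine-tuning.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```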

**Weaknesses**: No PPO support, limited to its supported model list, LGPL-3.0 license (copyleft; review obligations before commercial use)

***

## Axolotl: Deep Dive

### Configuration-First Approach

Axolotl shines when you want reproducible, version-controlled training configurations:

```yaml
# axolotl_config.yml — full example
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

# Data
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
  - path: ./my_custom_data.jsonl
    type: sharegpt
dataset_prepared_path: ./prepared_data
val_set_size: 0.01

# Quantization
load_in_4bit: true
adapter: qlora
bf16: true
tf32: true

# LoRA
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Training
sequence_len: 4096
sample_packing: true  # Packs short sequences for efficiency
pad_to_sequence_len: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine

# Logging
logging_steps: 10
eval_steps: 100
save_steps: 100
output_dir: ./outputs/my-model

# wandb
wandb_project: my-fine-tune
wandb_run_id: run-001
```
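
The `./my_custom_data.jsonl` file referenced above must follow the ShareGPT schema expected by `type: sharegpt`. A minimal sketch of writing one record (the contents are illustrative):

```python
import json

# One conversation record in the ShareGPT schema
record = {
    "conversations": [
        {"from": "human", "value": "What is QLoRA?"},
        {"from": "gpt", "value": "QLoRA trains LoRA adapters on top of a 4-bit quantized base model."},
    ]
}
with open("my_custom_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```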

```bash
# Install and run
pip install "axolotl[flash-attn,deepspeed]"
axolotl train axolotl_config.yml
```

**Best for**: Teams that want reproducible, config-versioned training runs

***

## LLaMA-Factory: Deep Dive

### WebUI Walkthrough

```bash
# Install
pip install llamafactory

# Launch WebUI
llamafactory-cli webui
# Open http://localhost:7860
```

WebUI tabs:

1. **Train** — configure base model, dataset, method
2. **Evaluate** — run MMLU, CMMLU benchmarks
3. **Chat** — interactive inference
4. **Export** — merge LoRA, quantize to GGUF

### CLI Training Example

```bash
# Supervised Fine-Tuning
llamafactory-cli train \
  --stage sft \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset alpaca_gpt4_en,glaive_toolcall_en \
  --template llama3 \
  --finetuning_type lora \
  --lora_rank 8 \
  --lora_alpha 16 \
  --lora_target all \
  --output_dir saves/llama3-lora \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --quantization_bit 4 \
  --flash_attn fa2

# DPO Training
llamafactory-cli train \
  --stage dpo \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset dpo_mix_en \
  --template llama3 \
  --finetuning_type lora \
  --output_dir saves/llama3-dpo
```

**Best for**: Beginners, teams wanting WebUI, DPO/RLHF without deep research knowledge

***

## TRL: Deep Dive

### RLHF Pipeline Example

TRL is the go-to for alignment training:

```python
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# DPO (Direct Preference Optimization), the most common alignment method
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dpo_config = DPOConfig(
    output_dir="dpo_outputs",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    beta=0.1,             # KL penalty coefficient
    loss_type="sigmoid",  # or "hinge", "ipo"
    learning_rate=5e-7,
)

# Preference dataset: pairs of chosen/rejected responses
# (recent TRL versions extract the shared prompt prefix automatically)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # renamed to `processing_class` in newer TRL versions
)
trainer.train()
```
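
For reference, one explicit-prompt preference record has this shape (contents are made up):

```python
# The trainer pushes the model toward "chosen" and away from "rejected"
example = {
    "prompt": "Explain LoRA in one sentence.",
    "chosen": "LoRA freezes the base weights and trains small low-rank update matrices.",
    "rejected": "LoRA is a long-range radio protocol.",
}
```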

**Best for**: Alignment research, RLHF, DPO, PPO, ORPO implementations

***

## Choosing the Right Tool

### Decision Flow

```
Need maximum speed/minimum VRAM?
  → YES → Unsloth (2-5× faster, fits on smaller GPUs)

Need alignment training (DPO/PPO/RLHF)?
  → YES → TRL or LLaMA-Factory
  → Research/custom → TRL
  → Production/easy → LLaMA-Factory

Need configuration-first reproducibility?
  → YES → Axolotl

Non-technical team or want WebUI?
  → YES → LLaMA-Factory

Just want to start quickly?
  → LLaMA-Factory or Unsloth
```

### By Team Type

| Team                  | Recommendation | Reason                        |
| --------------------- | -------------- | ----------------------------- |
| Individual researcher | Unsloth        | Speed + Jupyter notebooks     |
| ML engineer           | Axolotl        | Config-driven, reproducible   |
| Product team          | LLaMA-Factory  | WebUI, wide model support     |
| Alignment team        | TRL            | Native RLHF primitives        |
| Startup               | Unsloth + TRL  | Speed + alignment when needed |

***

## Clore.ai GPU Recommendations

| Task              | Min GPU         | Recommended  | Tool            |
| ----------------- | --------------- | ------------ | --------------- |
| 7-8B LoRA (QLoRA) | RTX 3080 (10GB) | RTX 3090     | Unsloth         |
| 13B LoRA          | RTX 3090 (24GB) | A6000 (48GB) | Unsloth/Axolotl |
| 70B LoRA          | A100 (80GB)     | 2×A100       | Axolotl/TRL     |
| 8B Full FT        | A100 (40GB)     | A100 (80GB)  | Any             |
| DPO/PPO 7B        | RTX 4090 (24GB) | A6000 (48GB) | TRL             |

***

## Useful Links

* [Unsloth GitHub](https://github.com/unslothai/unsloth) — 23K+ stars
* [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl) — 9K+ stars
* [LLaMA-Factory GitHub](https://github.com/hiyouga/LLaMA-Factory) — 37K+ stars
* [TRL GitHub](https://github.com/huggingface/trl) — 10K+ stars
* [HuggingFace PEFT Docs](https://huggingface.co/docs/peft)

***

## Summary

| Tool              | Best for                            | Key advantage                 |
| ----------------- | ----------------------------------- | ----------------------------- |
| **Unsloth**       | Speed-critical training, small GPUs | 2-5× faster, 70% less VRAM    |
| **Axolotl**       | Config-driven, reproducible runs    | YAML-first, many data formats |
| **LLaMA-Factory** | 100+ models, WebUI, beginners       | Most model support, GUI       |
| **TRL**           | RLHF, DPO, alignment research       | Native alignment training     |

For most Clore.ai use cases: start with **Unsloth** (speed + memory efficiency), add **TRL** if you need DPO or PPO alignment training.

