Axolotl Universal Fine-tuning

YAML-driven LLM fine-tuning with Axolotl on Clore.ai — LoRA, QLoRA, DPO, multi-GPU

Axolotl wraps HuggingFace Transformers, PEFT, TRL, and DeepSpeed into a single YAML-driven interface. You define your model, dataset, training method, and hyperparameters in one config file — then launch with a single command. No Python scripting required for standard workflows.

Key Features

  • YAML-only configuration — define everything in one file, no Python needed

  • All training methods — LoRA, QLoRA, full fine-tune, DPO, ORPO, KTO, RLHF

  • Multi-GPU out of the box — DeepSpeed ZeRO 1/2/3 and FSDP with one flag

  • Sample packing — concatenate short examples to fill sequence length, 3–5× throughput gain

  • Flash Attention 2 — automatic VRAM savings on supported hardware

  • Broad model support — Llama 3.x, Mistral, Qwen 2.5, Gemma 2, Phi-4, DeepSeek, Falcon

  • Built-in dataset formats — alpaca, sharegpt, chat_template, completion, and custom

Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3060 12 GB | RTX 4090 24 GB (×2+) |
| VRAM | 12 GB | 24+ GB |
| RAM | 16 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 11.8 | 12.1+ |
| Python | 3.10 | 3.11 |

Clore.ai pricing: RTX 4090 ≈ $0.5–2/day · RTX 3090 ≈ $0.3–1/day · RTX 3060 ≈ $0.15–0.3/day

Quick Start

1. Install Axolotl
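A pip-based install, sketched after the pattern in Axolotl's README; the extras names (`flash-attn`, `deepspeed`) are the ones the project documents, but verify them against the current README before running:

```shell
# Build helpers first; flash-attn compiles against your local CUDA toolkit
pip3 install -U packaging setuptools wheel ninja
pip3 install --no-build-isolation "axolotl[flash-attn,deepspeed]"
```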

Or use the Docker image (recommended for reproducibility):
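A minimal sketch, assuming the `axolotlai/axolotl` image name and `main-latest` tag; check Docker Hub for the tags currently published:

```shell
# Mount the working directory so configs and outputs persist on the host
docker run --gpus all -it -v "$(pwd)":/workspace axolotlai/axolotl:main-latest
```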

2. Create a Config File

Save this as config.yml:
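A minimal QLoRA starting point. The keys follow Axolotl's config schema; the model name, dataset path, and hyperparameter values are illustrative assumptions to adapt:

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: tatsu-lab/alpaca      # placeholder dataset
    type: alpaca

sequence_len: 2048
sample_packing: true
flash_attention: true
gradient_checkpointing: true

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_steps: 10
optimizer: adamw_bnb_8bit
bf16: auto

val_set_size: 0.02
output_dir: ./outputs
```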

3. Launch Training
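The documented launch command; `accelerate` picks up single- or multi-GPU settings from its own configuration:

```shell
accelerate launch -m axolotl.cli.train config.yml
```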

Training progress logs to stdout and optionally to Weights & Biases.

Configuration Deep Dive

Dataset Formats

Axolotl supports multiple input formats natively:
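Sketches of the `datasets:` block for three of the built-in formats (paths are placeholders):

```yaml
datasets:
  # alpaca: rows shaped like {"instruction": ..., "input": ..., "output": ...}
  - path: tatsu-lab/alpaca
    type: alpaca

  # sharegpt: multi-turn conversations, one "conversations" list per row
  - path: ./data/conversations.jsonl
    type: sharegpt

  # completion: raw text for continued pretraining
  - path: ./data/corpus.jsonl
    type: completion
```

Multiple entries can be listed together; Axolotl tokenizes and combines them into one training set.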

Multi-GPU with DeepSpeed

Create deepspeed_zero2.json:
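A ZeRO-2 sketch using standard DeepSpeed config keys; the `"auto"` values defer to what Axolotl passes through from the YAML config:

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```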

Add to your config:
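One line in config.yml points at the file:

```yaml
deepspeed: deepspeed_zero2.json
```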

Then launch:
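`accelerate` shards the run across every visible GPU; restrict it with `CUDA_VISIBLE_DEVICES` if you want a subset:

```shell
accelerate launch -m axolotl.cli.train config.yml
```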

DPO / ORPO Alignment
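A DPO sketch modeled on Axolotl's RLHF examples: the `rl:` key selects the method, and training expects a preference dataset with chosen/rejected response pairs. The dataset path and `type` here are assumptions; check Axolotl's RLHF docs for the `type` each method expects (ORPO and KTO use the same `rl:` key with a different value):

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
rl: dpo

adapter: lora
lora_r: 16
lora_alpha: 32

datasets:
  - path: Intel/orca_dpo_pairs   # assumed preference dataset
    split: train
    type: chatml.intel

learning_rate: 5e-6
num_epochs: 1
```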

Full Fine-Tune (No LoRA)
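For a full fine-tune the `adapter:` key is simply omitted, so all weights are trained; DeepSpeed ZeRO-3 (per the VRAM table below) keeps per-GPU memory manageable. Values are illustrative, and `deepspeed_zero3.json` is assumed to exist alongside the ZeRO-2 file above:

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B
# no adapter key -> all weights are trained
load_in_8bit: false
load_in_4bit: false

deepspeed: deepspeed_zero3.json
gradient_checkpointing: true

micro_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2e-5
```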

Usage Examples

Inference After Training
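The inference entry point, per Axolotl's CLI; `--lora_model_dir` points at the adapter produced by training (the path here is an assumption):

```shell
accelerate launch -m axolotl.cli.inference config.yml --lora_model_dir="./outputs"
```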

Merge LoRA into Base Model
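The merge entry point writes full merged weights you can serve without PEFT; the adapter path is an assumption:

```shell
python3 -m axolotl.cli.merge_lora config.yml --lora_model_dir="./outputs"
```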

Preprocess Dataset (Validate Before Training)
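The same command the Troubleshooting table recommends for debugging format errors:

```shell
python -m axolotl.cli.preprocess config.yml
```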

This tokenizes and validates the dataset. Useful for catching format errors before a long training run.

VRAM Usage Reference

| Model | Method | GPUs | VRAM/GPU | Config |
|---|---|---|---|---|
| Llama 3.1 8B | QLoRA 4-bit | 1 | ~12 GB | r=32, seq_len=2048 |
| Llama 3.1 8B | LoRA 16-bit | 1 | ~20 GB | r=16, seq_len=2048 |
| Llama 3.1 8B | Full | 2 | ~22 GB | DeepSpeed ZeRO-3 |
| Qwen 2.5 14B | QLoRA 4-bit | 1 | ~16 GB | r=16, seq_len=2048 |
| Llama 3.3 70B | QLoRA 4-bit | 2 | ~22 GB | r=16, seq_len=2048 |
| Llama 3.3 70B | Full | 4 | ~40 GB | DeepSpeed ZeRO-3 + offload |

Tips

  • Always enable sample_packing: true — single biggest throughput improvement (3–5× on short datasets)

  • Use flash_attention: true on Ampere+ GPUs for 20–40% VRAM savings

  • Start with QLoRA for experiments, switch to full fine-tune only when LoRA quality plateaus

  • Set val_set_size: 0.02 to monitor overfitting during training

  • Preprocess first — run axolotl.cli.preprocess to validate data formatting before committing to a long run

  • Use the Docker image for reproducible environments — avoids dependency conflicts

  • lora_target_linear: true applies LoRA to all linear layers, generally better than targeting only attention

Troubleshooting

| Problem | Solution |
|---|---|
| `OutOfMemoryError` | Lower `micro_batch_size` to 1; enable `gradient_checkpointing` |
| Dataset format errors | Run `python -m axolotl.cli.preprocess config.yml` to debug |
| `sample_packing` slow on first epoch | Normal; the initial packing computation is a one-time cost |
| Multi-GPU training hangs | Check NCCL with `export NCCL_DEBUG=INFO`; ensure all GPUs are visible |
| `flash_attention` import error | Install it: `pip install flash-attn --no-build-isolation` |
| Loss not decreasing | Lower the learning rate to 1e-4, increase warmup, check dataset quality |
| WandB connection error | Run `wandb login`, or set `wandb_project:` to an empty string to disable logging |
