# Axolotl Universal Fine-tuning

Axolotl wraps Hugging Face Transformers, PEFT, TRL, and DeepSpeed into a single YAML-driven interface. You define your model, dataset, training method, and hyperparameters in one config file — then launch with a single command. No Python scripting required for standard workflows.

{% hint style="success" %}
All examples run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Key Features

* **YAML-only configuration** — define everything in one file, no Python needed
* **All training methods** — LoRA, QLoRA, full fine-tune, DPO, ORPO, KTO, RLHF
* **Multi-GPU out of the box** — DeepSpeed ZeRO 1/2/3 and FSDP with one flag
* **Sample packing** — concatenate short examples to fill sequence length, 3–5× throughput gain
* **Flash Attention 2** — automatic VRAM savings on supported hardware
* **Broad model support** — Llama 3.x, Mistral, Qwen 2.5, Gemma 2, Phi-4, DeepSeek, Falcon
* **Built-in dataset formats** — alpaca, sharegpt, chat\_template, completion, and custom

## Requirements

| Component | Minimum        | Recommended          |
| --------- | -------------- | -------------------- |
| GPU       | RTX 3060 12 GB | RTX 4090 24 GB (×2+) |
| VRAM      | 12 GB          | 24+ GB               |
| RAM       | 16 GB          | 64 GB                |
| Disk      | 50 GB          | 100 GB               |
| CUDA      | 11.8           | 12.1+                |
| Python    | 3.10           | 3.11                 |

**Clore.ai pricing:** RTX 4090 ≈ $0.5–2/day · RTX 3090 ≈ $0.3–1/day · RTX 3060 ≈ $0.15–0.3/day

## Quick Start

### 1. Install Axolotl

```bash
# Clone and install
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git
cd axolotl

pip install packaging ninja
pip install -e '.[flash-attn,deepspeed]'
```

Or use the Docker image (recommended for reproducibility):

```bash
docker run --gpus all -it --rm \
  -v /workspace:/workspace \
  winglian/axolotl:main-latest
```

### 2. Create a Config File

Save this as `config.yml`:

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

wandb_project: axolotl-clore
wandb_name: llama3-qlora

output_dir: /workspace/axolotl-output

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
learning_rate: 2e-4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 10

bf16: auto
flash_attention: true
gradient_checkpointing: true

logging_steps: 10
save_strategy: steps
save_steps: 500
eval_steps: 500

val_set_size: 0.02
```
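
A useful cross-check before launching: the effective batch size per optimizer step is `micro_batch_size × gradient_accumulation_steps × num_gpus`. A minimal sketch (the helper name is ours, not an Axolotl API):

```python
# Effective batch size implied by an Axolotl config
# (illustrative helper, not part of Axolotl itself).
def effective_batch_size(micro_batch_size, grad_accum_steps, num_gpus=1):
    # Number of sequences contributing to one optimizer step.
    return micro_batch_size * grad_accum_steps * num_gpus

# With the config above (micro_batch_size=2, gradient_accumulation_steps=4):
print(effective_batch_size(2, 4, num_gpus=1))  # 8
print(effective_batch_size(2, 4, num_gpus=2))  # 16 across two GPUs
```

If you change GPU count, adjust `gradient_accumulation_steps` to keep the effective batch size — and thus the learning-rate behavior — comparable.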

### 3. Launch Training

```bash
# Single GPU
accelerate launch -m axolotl.cli.train config.yml

# Multi-GPU (all available GPUs)
accelerate launch --multi_gpu -m axolotl.cli.train config.yml
```

Training progress logs to stdout and optionally to Weights & Biases.

## Configuration Deep Dive

### Dataset Formats

Axolotl supports multiple input formats natively:

```yaml
# Alpaca-style (instruction / input / output)
datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca

# ShareGPT multi-turn chat
datasets:
  - path: anon8231489123/ShareGPT_Vicuna_unfiltered
    type: sharegpt
    conversation: chatml

# Chat template (auto-detect from tokenizer)
datasets:
  - path: HuggingFaceH4/ultrachat_200k
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content

# Local JSONL file
datasets:
  - path: /workspace/data/my_dataset.jsonl
    type: alpaca
    ds_type: json
```
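
The local-JSONL variant expects one JSON object per line carrying the alpaca keys. A minimal sketch that writes and sanity-checks such a file (file name and record contents are illustrative):

```python
import json

# Two toy records in alpaca format (instruction / input / output).
records = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Give an antonym.", "input": "hot", "output": "cold"},
]

# Write one JSON object per line, as `ds_type: json` expects.
with open("my_dataset.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Sanity check: every line parses and carries the three alpaca keys.
with open("my_dataset.jsonl") as f:
    for line in f:
        assert {"instruction", "input", "output"} <= json.loads(line).keys()
```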

### Multi-GPU with DeepSpeed

Create `deepspeed_zero2.json`:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0
}
```

Add to your config:

```yaml
deepspeed: deepspeed_zero2.json
```

Then launch:

```bash
accelerate launch --num_processes 4 -m axolotl.cli.train config.yml
```
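
Why stage 2 helps: ZeRO-2 shards gradients and optimizer state across data-parallel ranks. A back-of-envelope for the Adam state alone (illustrative arithmetic, not a DeepSpeed API):

```python
# Per-GPU Adam optimizer-state memory under ZeRO-2 (rough estimate).
# fp32 Adam keeps two moments per parameter, ~8 bytes/param in total;
# ZeRO stage 2 shards that state across data-parallel ranks.
def adam_state_gb(num_params, num_gpus, bytes_per_param=8):
    return num_params * bytes_per_param / num_gpus / 1e9

print(adam_state_gb(8e9, 1))  # 64.0 GB on a single GPU
print(adam_state_gb(8e9, 4))  # 16.0 GB per GPU with 4-way sharding
```

With `"offload_optimizer": {"device": "cpu"}` as in the JSON above, the sharded state lives in system RAM instead of VRAM — one reason the recommended 64 GB RAM matters.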

### DPO / ORPO Alignment

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
rl: dpo
# or: rl: orpo

datasets:
  - path: argilla/ultrafeedback-binarized-preferences
    type: chat_template.default
    field_messages: chosen
    field_chosen: chosen
    field_rejected: rejected

dpo_beta: 0.1
```
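
A binarized preference dataset pairs one prompt with a preferred (`chosen`) and a dispreferred (`rejected`) completion. A sketch of what one record looks like (the contents are illustrative, not taken from the dataset above):

```python
# Shape of a binarized preference record used for DPO-style training.
record = {
    "chosen": [
        {"role": "user", "content": "Explain DPO in one sentence."},
        {"role": "assistant", "content": "DPO fine-tunes directly on preference pairs, no separate reward model."},
    ],
    "rejected": [
        {"role": "user", "content": "Explain DPO in one sentence."},
        {"role": "assistant", "content": "It's a thing."},
    ],
}

# Both branches must share the same prompt; only the completions differ.
assert record["chosen"][0] == record["rejected"][0]
```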

### Full Fine-Tune (No LoRA)

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct

# No adapter, no quantization
adapter:
load_in_4bit: false
load_in_8bit: false

learning_rate: 5e-6
micro_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
flash_attention: true
bf16: auto

deepspeed: deepspeed_zero3.json  # required for 8B+ full fine-tune
```
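
The ZeRO-3 requirement follows from simple arithmetic: bf16 weights and gradients (2 bytes/param each), fp32 Adam moments (8 bytes/param), and an fp32 master copy (4 bytes/param) add up to roughly 16 bytes per parameter before activations. This is a rough rule of thumb; exact numbers vary with optimizer and precision settings:

```python
# Rough training-state memory for a full fine-tune (illustrative estimate):
# bf16 weights (2) + bf16 grads (2) + fp32 Adam m,v (8) + fp32 master (4)
# ~= 16 bytes per parameter, excluding activations.
def full_ft_gb(num_params, bytes_per_param=16):
    return num_params * bytes_per_param / 1e9

print(full_ft_gb(8e9))  # 128.0 GB -> far beyond one 24 GB card, hence ZeRO-3 sharding
```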

## Usage Examples

### Inference After Training

```bash
# Launch interactive inference
accelerate launch -m axolotl.cli.inference config.yml \
  --lora_model_dir /workspace/axolotl-output
```

### Merge LoRA into Base Model

```bash
accelerate launch -m axolotl.cli.merge_lora config.yml \
  --lora_model_dir /workspace/axolotl-output \
  --output_dir /workspace/merged-model
```

### Preprocess Dataset (Validate Before Training)

```bash
python -m axolotl.cli.preprocess config.yml
```

This tokenizes and validates the dataset. Useful for catching format errors before a long training run.

## VRAM Usage Reference

| Model         | Method     | GPUs | VRAM/GPU | Config                   |
| ------------- | ---------- | ---- | -------- | ------------------------ |
| Llama 3.1 8B  | QLoRA 4bit | 1    | \~12 GB  | r=32, seq\_len=2048      |
| Llama 3.1 8B  | LoRA 16bit | 1    | \~20 GB  | r=16, seq\_len=2048      |
| Llama 3.1 8B  | Full       | 2    | \~22 GB  | DeepSpeed ZeRO-3         |
| Qwen 2.5 14B  | QLoRA 4bit | 1    | \~16 GB  | r=16, seq\_len=2048      |
| Llama 3.3 70B | QLoRA 4bit | 2    | \~22 GB  | r=16, seq\_len=2048      |
| Llama 3.3 70B | Full       | 4    | \~40 GB  | DeepSpeed ZeRO-3+offload |
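
The QLoRA rows are easy to sanity-check: 4-bit quantization stores roughly half a byte per base-model parameter, so an 8B model's frozen weights fit in about 4 GB; the rest of the ~12 GB budget goes to LoRA adapters, optimizer state, activations, and CUDA overhead (back-of-envelope only):

```python
# 4-bit quantized base weights: ~0.5 bytes per parameter (rough estimate).
def four_bit_weight_gb(num_params):
    return num_params * 0.5 / 1e9

print(four_bit_weight_gb(8e9))   # 4.0 GB for Llama 3.1 8B
print(four_bit_weight_gb(70e9))  # 35.0 GB -> why a 70B QLoRA run spans two 24 GB GPUs
```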

## Tips

* **Always enable `sample_packing: true`** — single biggest throughput improvement (3–5× on short datasets)
* **Use `flash_attention: true`** on Ampere+ GPUs for 20–40% VRAM savings
* **Start with QLoRA** for experiments, switch to full fine-tune only when LoRA quality plateaus
* **Set `val_set_size: 0.02`** to monitor overfitting during training
* **Preprocess first** — run `axolotl.cli.preprocess` to validate data formatting before committing to a long run
* **Use the Docker image** for reproducible environments — avoids dependency conflicts
* **`lora_target_linear: true`** applies LoRA to all linear layers, generally better than targeting only attention
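
On the last point: a LoRA adapter on a `d_out × d_in` linear layer trains two small matrices, `B (d_out × r)` and `A (r × d_in)`, adding only `r × (d_out + d_in)` parameters per layer — which is why targeting all linear layers stays cheap. A quick illustration (dimensions chosen as an example):

```python
# Trainable parameters LoRA adds to one linear layer.
def lora_params(d_out, d_in, r):
    return r * (d_out + d_in)

full = 4096 * 4096                     # frozen weights in one 4096x4096 projection
added = lora_params(4096, 4096, r=32)  # r matches the example config's lora_r
print(added, f"{added / full:.1%}")    # 262144, 1.6% of the layer's parameters
```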

## Troubleshooting

| Problem                              | Solution                                                       |
| ------------------------------------ | -------------------------------------------------------------- |
| `OutOfMemoryError`                   | Lower `micro_batch_size` to 1, enable `gradient_checkpointing` |
| Dataset format errors                | Run `python -m axolotl.cli.preprocess config.yml` to debug     |
| `sample_packing` slow on first epoch | Normal — initial packing computation is one-time               |
| Multi-GPU training hangs             | Check NCCL: `export NCCL_DEBUG=INFO`, ensure all GPUs visible  |
| `flash_attention` import error       | Install: `pip install flash-attn --no-build-isolation`         |
| Loss not decreasing                  | Lower LR to 1e-4, increase warmup, check dataset quality       |
| WandB connection error               | Run `wandb login` or set `wandb_project:` to empty string      |

## Resources

* [Axolotl GitHub](https://github.com/OpenAccess-AI-Collective/axolotl)
* [Example Configs](https://github.com/OpenAccess-AI-Collective/axolotl/tree/main/examples)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)
