Axolotl Universal Fine-tuning

YAML-driven LLM fine-tuning with Axolotl on Clore.ai — LoRA, QLoRA, DPO, multi-GPU

Axolotl wraps HuggingFace Transformers, PEFT, TRL, and DeepSpeed into a single YAML-driven interface. You define your model, dataset, training method, and hyperparameters in one config file — then launch with a single command. No Python scripting required for standard workflows.

Key Features

  • YAML-only configuration — define everything in one file, no Python needed

  • All training methods — LoRA, QLoRA, full fine-tune, DPO, ORPO, KTO, RLHF

  • Multi-GPU out of the box — DeepSpeed ZeRO 1/2/3 and FSDP with one flag

  • Sample packing — concatenate short examples to fill sequence length, 3–5× throughput gain

  • Flash Attention 2 — automatic VRAM savings on supported hardware

  • Broad model support — Llama 3.x, Mistral, Qwen 2.5, Gemma 2, Phi-4, DeepSeek, Falcon

  • Built-in dataset formats — alpaca, sharegpt, chat_template, completion, and custom

Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3060 12 GB | RTX 4090 24 GB (×2+) |
| VRAM | 12 GB | 24+ GB |
| RAM | 16 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 11.8 | 12.1+ |
| Python | 3.10 | 3.11 |

Clore.ai pricing: RTX 4090 ≈ $0.5–2/day · RTX 3090 ≈ $0.3–1/day · RTX 3060 ≈ $0.15–0.3/day

Quick Start

1. Install Axolotl
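A pip-based install, sketched after the pattern in Axolotl's README; the extras names (`flash-attn`, `deepspeed`) are the ones the project documents, but verify them against the current README before running:

```shell
# Build helpers first; flash-attn compiles against your local CUDA toolkit
pip3 install -U packaging setuptools wheel ninja
pip3 install --no-build-isolation "axolotl[flash-attn,deepspeed]"
```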

Or use the Docker image (recommended for reproducibility):
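A minimal sketch, assuming the `axolotlai/axolotl` image name and `main-latest` tag; check Docker Hub for the tags currently published:

```shell
# Mount the working directory so configs and outputs persist on the host
docker run --gpus all -it -v "$(pwd)":/workspace axolotlai/axolotl:main-latest
```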

2. Create a Config File

Save this as config.yml:
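A minimal QLoRA starting point. The keys follow Axolotl's config schema; the model name, dataset path, and hyperparameter values are illustrative assumptions to adapt:

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: tatsu-lab/alpaca      # placeholder dataset
    type: alpaca

sequence_len: 2048
sample_packing: true
flash_attention: true
gradient_checkpointing: true

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_steps: 10
optimizer: adamw_bnb_8bit
bf16: auto

val_set_size: 0.02
output_dir: ./outputs
```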

3. Launch Training
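The documented launch command; `accelerate` picks up single- or multi-GPU settings from its own configuration:

```shell
accelerate launch -m axolotl.cli.train config.yml
```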

Training progress logs to stdout and optionally to Weights & Biases.

Configuration Deep Dive

Dataset Formats

Axolotl supports multiple input formats natively:
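Sketches of the `datasets:` block for three of the built-in formats (paths are placeholders):

```yaml
datasets:
  # alpaca: rows shaped like {"instruction": ..., "input": ..., "output": ...}
  - path: tatsu-lab/alpaca
    type: alpaca

  # sharegpt: multi-turn conversations, one "conversations" list per row
  - path: ./data/conversations.jsonl
    type: sharegpt

  # completion: raw text for continued pretraining
  - path: ./data/corpus.jsonl
    type: completion
```

Multiple entries can be listed together; Axolotl tokenizes and combines them into one training set.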

Multi-GPU with DeepSpeed

Create deepspeed_zero2.json:
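A ZeRO-2 sketch using standard DeepSpeed config keys; the `"auto"` values defer to what Axolotl passes through from the YAML config:

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```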

Add to your config:
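One line in config.yml points at the file:

```yaml
deepspeed: deepspeed_zero2.json
```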

Then launch:
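`accelerate` shards the run across every visible GPU; restrict it with `CUDA_VISIBLE_DEVICES` if you want a subset:

```shell
accelerate launch -m axolotl.cli.train config.yml
```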

DPO / ORPO Alignment
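A DPO sketch modeled on Axolotl's RLHF examples: the `rl:` key selects the method, and training expects a preference dataset with chosen/rejected response pairs. The dataset path and `type` here are assumptions; check Axolotl's RLHF docs for the `type` each method expects (ORPO and KTO use the same `rl:` key with a different value):

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
rl: dpo

adapter: lora
lora_r: 16
lora_alpha: 32

datasets:
  - path: Intel/orca_dpo_pairs   # assumed preference dataset
    split: train
    type: chatml.intel

learning_rate: 5e-6
num_epochs: 1
```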

Full Fine-Tune (No LoRA)
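For a full fine-tune the `adapter:` key is simply omitted, so all weights are trained; DeepSpeed ZeRO-3 (per the VRAM table below) keeps per-GPU memory manageable. Values are illustrative, and `deepspeed_zero3.json` is assumed to exist alongside the ZeRO-2 file above:

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B
# no adapter key -> all weights are trained
load_in_8bit: false
load_in_4bit: false

deepspeed: deepspeed_zero3.json
gradient_checkpointing: true

micro_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2e-5
```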

Usage Examples

Inference After Training
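The inference entry point, per Axolotl's CLI; `--lora_model_dir` points at the adapter produced by training (the path here is an assumption):

```shell
accelerate launch -m axolotl.cli.inference config.yml --lora_model_dir="./outputs"
```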

Merge LoRA into Base Model
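The merge entry point writes full merged weights you can serve without PEFT; the adapter path is an assumption:

```shell
python3 -m axolotl.cli.merge_lora config.yml --lora_model_dir="./outputs"
```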

Preprocess Dataset (Validate Before Training)
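The same command the Troubleshooting table recommends for debugging format errors:

```shell
python -m axolotl.cli.preprocess config.yml
```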

This tokenizes and validates the dataset. Useful for catching format errors before a long training run.

VRAM Usage Reference

| Model | Method | GPUs | VRAM/GPU | Config |
|---|---|---|---|---|
| Llama 3.1 8B | QLoRA 4-bit | 1 | ~12 GB | r=32, seq_len=2048 |
| Llama 3.1 8B | LoRA 16-bit | 1 | ~20 GB | r=16, seq_len=2048 |
| Llama 3.1 8B | Full | 2 | ~22 GB | DeepSpeed ZeRO-3 |
| Qwen 2.5 14B | QLoRA 4-bit | 1 | ~16 GB | r=16, seq_len=2048 |
| Llama 3.3 70B | QLoRA 4-bit | 2 | ~22 GB | r=16, seq_len=2048 |
| Llama 3.3 70B | Full | 4 | ~40 GB | DeepSpeed ZeRO-3 + offload |

Tips

  • Always enable sample_packing: true — single biggest throughput improvement (3–5× on short datasets)

  • Use flash_attention: true on Ampere+ GPUs for 20–40% VRAM savings

  • Start with QLoRA for experiments, switch to full fine-tune only when LoRA quality plateaus

  • Set val_set_size: 0.02 to monitor overfitting during training

  • Preprocess first — run axolotl.cli.preprocess to validate data formatting before committing to a long run

  • Use the Docker image for reproducible environments — avoids dependency conflicts

  • lora_target_linear: true applies LoRA to all linear layers, generally better than targeting only attention

Troubleshooting

| Problem | Solution |
|---|---|
| `OutOfMemoryError` | Lower `micro_batch_size` to 1; enable `gradient_checkpointing` |
| Dataset format errors | Run `python -m axolotl.cli.preprocess config.yml` to debug |
| `sample_packing` slow on first epoch | Normal; the initial packing computation is a one-time cost |
| Multi-GPU training hangs | Check NCCL with `export NCCL_DEBUG=INFO`; ensure all GPUs are visible |
| `flash_attention` import error | Install it: `pip install flash-attn --no-build-isolation` |
| Loss not decreasing | Lower the learning rate to 1e-4, increase warmup, check dataset quality |
| WandB connection error | Run `wandb login`, or set `wandb_project:` to an empty string to disable logging |
