# Axolotl Universal Fine-Tuning

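Axolotl wraps HuggingFace Transformers, PEFT, TRL, and DeepSpeed in a single YAML-driven interface. You define the model, datasets, training method, and hyperparameters in one config file, then launch with a single command. The standard workflow requires no Python scripts.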

{% hint style="success" %}
All examples run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

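## Key Features

* **YAML-only configuration**: define everything in one file, no Python required
* **All training methods**: LoRA, QLoRA, full fine-tuning, DPO, ORPO, KTO, RLHF
* **Multi-GPU out of the box**: enable DeepSpeed ZeRO 1/2/3 and FSDP with a single flag
* **Sample packing**: concatenates short examples to fill the sequence length, raising throughput 3–5× on datasets of short examples
* **Flash Attention 2**: saves VRAM automatically on supported hardware
* **Broad model support**: Llama 3.x, Mistral, Qwen 2.5, Gemma 2, Phi-4, DeepSeek, Falcon
* **Built-in dataset formats**: alpaca, sharegpt, chat\_template, completion, and custom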

## Requirements

| Component | Minimum        | Recommended                 |
| --------- | -------------- | --------------------------- |
| GPU       | RTX 3060 12 GB | RTX 4090 24 GB (×2 or more) |
| VRAM      | 12 GB          | 24+ GB                      |
| RAM       | 16 GB          | 64 GB                       |
| Disk      | 50 GB          | 100 GB                      |
| CUDA      | 11.8           | 12.1+                       |
| Python    | 3.10           | 3.11                        |

**Clore.ai pricing:** RTX 4090 about $0.5–2/day · RTX 3090 about $0.3–1/day · RTX 3060 about $0.15–0.3/day

## Quick Start

### 1. Install Axolotl

```bash
# Clone and install
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git
cd axolotl

pip install packaging ninja
pip install -e '.[flash-attn,deepspeed]'
```

Or use the Docker image (recommended for reproducibility):

```bash
docker run --gpus all -it --rm \
  -v /workspace:/workspace \
  winglian/axolotl:main-latest
```

### 2. Create the Config File

Save this as `config.yml`:

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

wandb_project: axolotl-clore
wandb_name: llama3-qlora

output_dir: /workspace/axolotl-output

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
learning_rate: 2e-4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 10

bf16: auto
flash_attention: true
gradient_checkpointing: true

logging_steps: 10
save_strategy: steps
save_steps: 500
eval_steps: 500

evals_per_epoch:
val_set_size: 0.02
```
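To get a feel for what `lora_r: 32` combined with `lora_target_linear: true` means in trainable weights, the sketch below counts LoRA parameters for an 8B Llama-style model. The layer shapes (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 blocks) are assumptions based on the published Llama 3.1 8B architecture, and `lora_param_count` is a hypothetical helper, not part of Axolotl.

```python
# Assumed linear-layer shapes per transformer block: (in_features, out_features).
LLAMA_8B_LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer blocks


def lora_param_count(rank, shapes=LLAMA_8B_LINEAR_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: each adapted layer adds A (rank x in) and B (out x rank)."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers


if __name__ == "__main__":
    for r in (16, 32):
        print(f"r={r}: {lora_param_count(r):,} trainable parameters")
```

Under these assumptions the adapter grows linearly with the rank, and at `lora_r: 32` it stays at roughly 1% of the base model's 8 billion weights.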

### 3. Launch Training

```bash
# Single GPU
accelerate launch -m axolotl.cli.train config.yml

# Multi-GPU (all available GPUs)
accelerate launch --multi_gpu -m axolotl.cli.train config.yml
```

Training progress is logged to stdout and, optionally, to Weights & Biases.

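## Configuration Deep Dive

### Dataset Formats

Axolotl natively supports several input formats. Each `datasets:` block below is an alternative; keep only one `datasets:` key in a config, with one or more entries under it: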

```yaml
# Alpaca style (instruction / input / output)
datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca

# ShareGPT multi-turn conversations
datasets:
  - path: anon8231489123/ShareGPT_Vicuna_unfiltered
    type: sharegpt
    conversation: chatml

# Chat template (auto-detected from the tokenizer)
datasets:
  - path: HuggingFaceH4/ultrachat_200k
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content

# Local JSONL file
datasets:
  - path: /workspace/data/my_dataset.jsonl
    type: alpaca
    ds_type: json
```
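Before pointing Axolotl at a local file, a quick structural check can catch obvious problems early. This is a minimal sketch, assuming the alpaca layout named above (`instruction`, optional `input`, `output`); `check_alpaca_jsonl` is a hypothetical helper and does not replace Axolotl's own validation.

```python
import json


def check_alpaca_jsonl(lines):
    """Return (line_number, problem) pairs for records that do not look alpaca-style."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        # 'input' is optional in this sketch; 'instruction' and 'output' must be non-empty strings.
        for key in ("instruction", "output"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"instruction": "Add 2 and 3.", "input": "", "output": "5"}',
        '{"instruction": "Say hi."}',
        "not json",
    ]
    for number, problem in check_alpaca_jsonl(sample):
        print(f"line {number}: {problem}")
```

To check a real file, pass an open file handle instead of the sample list, for example `check_alpaca_jsonl(open(path, encoding="utf-8"))`.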

### Multi-GPU with DeepSpeed

Create `deepspeed_zero2.json`:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0
}
```

Add it to your config:

```yaml
deepspeed: deepspeed_zero2.json
```

Then launch:

```bash
accelerate launch --num_processes 4 -m axolotl.cli.train config.yml
```
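Launching with more processes changes how many sequences feed each optimizer step. The sketch below assumes plain data parallelism, where every process handles its own micro-batches; `effective_batch_size` is a hypothetical helper written for illustration, and the numbers come from the quick-start config.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_processes):
    """Sequences contributing to one optimizer step across all data-parallel processes."""
    return micro_batch_size * gradient_accumulation_steps * num_processes


if __name__ == "__main__":
    # micro_batch_size: 2 and gradient_accumulation_steps: 4, on 1 and then 4 processes.
    for gpus in (1, 4):
        size = effective_batch_size(
            micro_batch_size=2, gradient_accumulation_steps=4, num_processes=gpus
        )
        print(f"{gpus} process(es): effective batch size {size}")
```

If the single-GPU batch size should be preserved on four GPUs, one option is to lower `gradient_accumulation_steps` from 4 to 1; otherwise the learning rate may need retuning for the larger batch.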

### DPO / ORPO Alignment

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
rl: dpo
# or: rl: orpo

datasets:
  - path: argilla/ultrafeedback-binarized-preferences
    type: chat_template.default
    field_messages: chosen
    field_chosen: chosen
    field_rejected: rejected

dpo_beta: 0.1
```
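To illustrate what `dpo_beta` controls, here is a sketch of the standard DPO objective for a single preference pair. It is written from the published formula and is not the code Axolotl or TRL runs; the function name and the log-probability inputs are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of each full response."""
    # How much more the policy likes each response than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed in a numerically stable softplus form.
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy has moved toward the chosen response and away from the rejected one.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
    # Same pair with a larger beta: the margin is weighted more heavily.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.5))
```

A larger `beta` scales the margin more aggressively, which in practice keeps the policy closer to the reference model; 0.1 is a common starting point.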

### Full Fine-Tuning (No LoRA)

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct

# No adapter, no quantization
adapter:
load_in_4bit: false
load_in_8bit: false

learning_rate: 5e-6
micro_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
flash_attention: true
bf16: auto

deepspeed: deepspeed_zero3.json  # needed for full fine-tuning of 8B+ models
```
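A rough estimate shows why full fine-tuning at this scale needs sharding. The sketch assumes mixed-precision Adam at about 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights and the two Adam moments) and an even split across GPUs as in ZeRO-3. It ignores activations and any CPU offload, so measured per-GPU numbers can differ considerably. `model_state_gib` is a hypothetical helper.

```python
def model_state_gib(num_params, num_gpus=1, bytes_per_param=16):
    """Approximate per-GPU memory (GiB) for weights, gradients, and optimizer states.

    Assumes the states are sharded evenly across GPUs; activations are not included.
    """
    return num_params * bytes_per_param / num_gpus / 2**30


if __name__ == "__main__":
    params = 8e9  # an 8B-parameter model
    for gpus in (1, 2, 4, 8):
        print(f"{gpus} GPU(s): ~{model_state_gib(params, gpus):.0f} GiB of model states per GPU")
```

On a single GPU the model states alone exceed 100 GiB under these assumptions, which is why the config above relies on ZeRO-3 rather than plain data parallelism.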

## Usage Examples

### Inference After Training

```bash
# Start interactive inference
accelerate launch -m axolotl.cli.inference config.yml \
  --lora_model_dir /workspace/axolotl-output
```

### Merge LoRA into the Base Model

```bash
accelerate launch -m axolotl.cli.merge_lora config.yml \
  --lora_model_dir /workspace/axolotl-output \
  --output_dir /workspace/merged-model
```

### Preprocess the Dataset (Validate Before Training)

```bash
python -m axolotl.cli.preprocess config.yml
```

This tokenizes and validates the dataset, which helps catch formatting errors before a long training run.

## VRAM Usage Reference

| Model         | Method      | GPUs | VRAM per GPU | Configuration            |
| ------------- | ----------- | ---- | ------------ | ------------------------ |
| Llama 3.1 8B  | QLoRA 4-bit | 1    | \~12 GB      | r=32, seq\_len=2048      |
| Llama 3.1 8B  | LoRA 16-bit | 1    | \~20 GB      | r=16, seq\_len=2048      |
| Llama 3.1 8B  | Full        | 2    | \~22 GB      | DeepSpeed ZeRO-3         |
| Qwen 2.5 14B  | QLoRA 4-bit | 1    | \~16 GB      | r=16, seq\_len=2048      |
| Llama 3.3 70B | QLoRA 4-bit | 2    | \~22 GB      | r=16, seq\_len=2048      |
| Llama 3.3 70B | Full        | 4    | \~40 GB      | DeepSpeed ZeRO-3+offload |

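## Tips

* **Always enable `sample_packing: true`**: it is the single largest throughput gain (3–5× on datasets of short examples)
* **Use `flash_attention: true`**: on Ampere and newer GPUs it can save 20–40% VRAM
* **Start with QLoRA** for experiments, and switch to full fine-tuning only when LoRA quality plateaus
* **Set `val_set_size: 0.02`** to monitor overfitting during training
* **Preprocess first**: run `axolotl.cli.preprocess` to validate the data format before starting a long run
* **Use the Docker image** for a reproducible environment that avoids dependency conflicts
* **`lora_target_linear: true`** applies LoRA to all linear layers, which usually works better than targeting attention layers only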
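The throughput gain from sample packing comes from filling each training sequence with several short examples instead of padding each one. The sketch below uses a greedy first-fit-decreasing strategy on token counts purely as an illustration; it is not Axolotl's packing algorithm, and `pack_first_fit` and the example lengths are hypothetical.

```python
def pack_first_fit(lengths, sequence_len):
    """Greedily pack example token lengths into bins holding at most sequence_len tokens."""
    bins = []  # each bin is a list of example lengths whose sum is <= sequence_len
    for length in sorted(lengths, reverse=True):
        if length > sequence_len:
            raise ValueError(f"example of {length} tokens exceeds sequence_len={sequence_len}")
        for current in bins:
            if sum(current) + length <= sequence_len:
                current.append(length)  # first bin with enough room
                break
        else:
            bins.append([length])  # no bin had room, so open a new one
    return bins


if __name__ == "__main__":
    lengths = [300, 1200, 150, 800, 500, 90, 1000]
    packed = pack_first_fit(lengths, sequence_len=2048)
    print(packed)
    print(f"{len(lengths)} padded sequences -> {len(packed)} packed sequences")
```

With these made-up lengths, seven padded sequences collapse into two packed ones, so each step processes far fewer padding tokens. Real gains depend on the length distribution of the dataset.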

## Troubleshooting

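| Problem                                     | Solution                                                                          |
| ------------------------------------------- | --------------------------------------------------------------------------------- |
| `OutOfMemoryError`                          | Lower `micro_batch_size` to 1 and enable `gradient_checkpointing`                 |
| Dataset format errors                       | Run `python -m axolotl.cli.preprocess config.yml` to debug                        |
| `sample_packing` is slow in the first epoch | Expected: the initial packing computation is a one-time cost                      |
| Multi-GPU training hangs                    | Check NCCL with `export NCCL_DEBUG=INFO` and make sure all GPUs are visible       |
| `flash_attention` import error              | Install it: `pip install flash-attn --no-build-isolation`                         |
| Loss not decreasing                         | Lower the learning rate to 1e-4, increase warmup steps, and check dataset quality |
| WandB connection error                      | Run `wandb login`, or set `wandb_project:` to an empty string                     |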

## Resources

* [Axolotl GitHub](https://github.com/OpenAccess-AI-Collective/axolotl)
* [Example configs](https://github.com/OpenAccess-AI-Collective/axolotl/tree/main/examples)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

