# Unsloth: 2x Faster Fine-Tuning

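Unsloth rewrites the performance-critical parts of HuggingFace Transformers with hand-optimized Triton kernels, delivering **2x training speed** and **70% less VRAM** with no loss in accuracy. It is a drop-in replacement: swap the import and your existing TRL/PEFT scripts work without further changes.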

{% hint style="success" %}
All examples were run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

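## Key Features

* **2x faster training**: custom Triton kernels for attention, RoPE, cross-entropy, and RMSNorm
* **70% less VRAM**: smart gradient checkpointing and memory-mapped weights
* **Drop-in replacement for HuggingFace**: change one import and nothing else
* **QLoRA / LoRA / full fine-tuning**: all modes are supported out of the box
* **Native export**: save directly to GGUF (all quantization types), LoRA adapters, or a merged 16-bit model
* **Broad model coverage**: Llama 3.x, Mistral, Qwen 2.5, Gemma 2, DeepSeek-R1, Phi-4, and more
* **Free and open source** (Apache 2.0)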

## Requirements

| Component | Minimum        | Recommended    |
| --------- | -------------- | -------------- |
| GPU       | RTX 3060 12 GB | RTX 4090 24 GB |
| VRAM      | 10 GB          | 24 GB          |
| RAM       | 16 GB          | 32 GB          |
| Disk      | 40 GB          | 80 GB          |
| CUDA      | 11.8           | 12.1+          |
| Python    | 3.10           | 3.11           |

**Clore.ai pricing:** RTX 4090 about $0.5–2/day · RTX 3090 about $0.3–1/day · RTX 3060 about $0.15–0.3/day

A 7B model with 4-bit QLoRA fits in **\~10 GB of VRAM**, which makes even an RTX 3060 viable.
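
As a rough sanity check on that figure, the memory taken by the weights alone can be estimated from the parameter count and the bits per weight. The helper below is a hypothetical back-of-envelope sketch, not something Unsloth provides. It ignores activations, LoRA gradients, optimizer state, and CUDA overhead, which account for the gap between the weight footprint and the \~10 GB total.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory taken by model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30

# Compare the weight footprint of a 7B model at common precisions.
for bits in (16, 8, 4):
    print(f"7B weights at {bits}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Going from 16-bit to 4-bit cuts the weight footprint by a factor of four, which is what brings a 7B model within reach of a 12 GB card.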

## Quick Start

### 1. Install Unsloth

```bash
# Create a virtual environment (recommended)
python -m venv /workspace/unsloth-env
source /workspace/unsloth-env/bin/activate

pip install --upgrade pip
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes xformers
```

### 2. Load a Model with 4-bit Quantization

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,            # auto-detect (float16 on pre-Ampere GPUs such as T4/V100, bfloat16 on Ampere and newer)
    load_in_4bit=True,
)
```
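
The auto-detection behind `dtype=None` comes down to whether the GPU supports bfloat16 natively, which requires CUDA compute capability 8.0 or higher. The function below is an illustrative sketch of that rule with a hypothetical name; it is not Unsloth's actual implementation. In a real script, `torch.cuda.is_bf16_supported()` (used in step 4) answers the same question directly.

```python
def pick_dtype(compute_capability_major: int) -> str:
    """bfloat16 needs compute capability 8.0+ (Ampere and newer); otherwise use float16."""
    return "bfloat16" if compute_capability_major >= 8 else "float16"

# Major compute capability: Tesla T4 is 7.5, RTX 3060 is 8.6, RTX 4090 is 8.9.
for gpu, major in [("Tesla T4", 7), ("RTX 3060", 8), ("RTX 4090", 8)]:
    print(gpu, "->", pick_dtype(major))
```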

### 3. Apply LoRA Adapters

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",   # Unsloth's memory-saving gradient checkpointing
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)
```
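
To see how small the trainable portion is with these settings, the adapter size can be counted by hand. Each adapted linear layer of shape `in x out` gains two matrices with `r * (in + out)` parameters in total. The sketch below assumes the published Llama 3.1 8B layer shapes (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of 128 dimensions); the constants and the function name are illustrative assumptions, not values read from the model.

```python
# Assumed Llama 3.1 8B shapes: (in_features, out_features) per projection
PROJECTIONS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # grouped-query attention: 8 KV heads x 128 dims
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

def lora_params(rank: int) -> int:
    """Each adapted linear layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS

print(f"r=16 -> {lora_params(16):,} trainable parameters")
# r=16 -> 41,943,040 trainable parameters
```

Under these assumptions the adapters come to roughly 42 million parameters, about half a percent of the 8B base model. The count scales linearly with `r`.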

### 4. Prepare Data and Train

```python
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

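dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# alpaca-cleaned ships instruction/input/output columns, so build the
# "text" column that dataset_text_field points at (simple prompt layout).
def to_text(example):
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return {"text": prompt + tokenizer.eos_token}  # EOS marks the end of the answer

dataset = dataset.map(to_text)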

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="/workspace/outputs",
    ),
)

stats = trainer.train()
print(f"Training loss: {stats.training_loss:.4f}")
```
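
The batch settings above combine into an effective batch size, which in turn determines how many optimizer steps one epoch takes. The helper below is a hypothetical sketch of that arithmetic for a run without packing; with `packing=True` several short examples share one sequence, so the real step count will be lower. The 10,000-example dataset size is an arbitrary illustration.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Return (effective batch size, approximate optimizer steps) without packing."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

batch, steps = training_schedule(num_examples=10_000, per_device_batch=2, grad_accum=4)
print(f"effective batch = {batch}, optimizer steps = {steps}")
```

Knowing the step count helps when choosing `warmup_steps` and `logging_steps`: 10 warmup steps is a small fraction of a run with over a thousand steps, but a large one for a run with only a few dozen.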

## Exporting the Model

### Save Only the LoRA Adapter

```python
model.save_pretrained("/workspace/lora-adapter")
tokenizer.save_pretrained("/workspace/lora-adapter")
```

### Merge and Save the Full Model (float16)

```python
model.save_pretrained_merged(
    "/workspace/merged-model",
    tokenizer,
    save_method="merged_16bit",
)
```

### Export to GGUF for Ollama / llama.cpp

```python
# Quantize to Q4_K_M (a good balance between size and quality)
model.save_pretrained_gguf(
    "/workspace/gguf-output",
    tokenizer,
    quantization_method="q4_k_m",
)

# Other options: q5_k_m, q8_0, f16
```

After exporting, deploy with Ollama:

```bash
# Create a Modelfile for Ollama
cat > Modelfile <<EOF
FROM /workspace/gguf-output/unsloth.Q4_K_M.gguf
TEMPLATE "{{ .System }}\n{{ .Prompt }}"
PARAMETER temperature 0.7
EOF

ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Summarize the key points of transformers architecture"
```
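
Once the model is registered, it can also be called from code through Ollama's local REST API. The sketch below assumes the server is running on its default address, `http://localhost:11434`, and that the model was created under the name `my-finetuned` as above. The helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("my-finetuned", "Summarize the key points of transformers architecture")` should return the completion as a string.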

## Usage Examples

### Fine-Tune on a Custom Chat Dataset

```python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format_chat)
```
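
The `format_chat` function above reads only the `instruction` and `output` columns. Alpaca-style rows often carry an optional `input` column with extra context, and dropping it can leave the instruction incomplete. The following is a minimal sketch of one way to fold it into the user turn; `to_messages` is a hypothetical helper, and joining the two fields with a blank line is an assumption, not a convention required by Unsloth.

```python
def to_messages(example: dict, system_prompt: str = "You are a helpful assistant.") -> list:
    """Convert one Alpaca-style row into a chat message list."""
    user_text = example["instruction"]
    extra = (example.get("input") or "").strip()
    if extra:  # append the optional context only when it is non-empty
        user_text = f"{user_text}\n\n{extra}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": example["output"]},
    ]

row = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(to_messages(row)[1]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template(messages, tokenize=False)` in the same way as the `messages` list in `format_chat`.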

### DPO / ORPO Alignment Training

```python
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,          # no separate copy: with LoRA, the adapter-free base model acts as the reference
    args=DPOConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=1,
        beta=0.1,
        output_dir="/workspace/dpo-output",
    ),
    train_dataset=dpo_dataset,   # preference dataset with prompt/chosen/rejected columns
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
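
To make the role of `beta` concrete, here is the standard DPO objective for a single preference pair in plain Python. It is a sketch for intuition, not the code TRL runs: the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and the reference model, and the numbers are made up.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities."""
    chosen_gain = policy_chosen - ref_chosen        # how much the policy raised the chosen answer
    rejected_gain = policy_rejected - ref_rejected  # how much it raised the rejected answer
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))            # equals -log(sigmoid(margin))

# Policy identical to the reference: margin is zero, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy prefers the chosen answer more than the reference does: lower loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

A larger `beta` scales the margin up, so the same shift in log-probabilities moves the loss further and the policy is held closer to the reference.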

## VRAM Usage Reference

| Model          | Quantization | Method | VRAM    | GPU         |
| -------------- | ------------ | ------ | ------- | ----------- |
| Llama 3.1 8B   | 4-bit        | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.1 8B   | 16-bit       | LoRA   | \~18 GB | RTX 4090    |
| Qwen 2.5 14B   | 4-bit        | QLoRA  | \~14 GB | RTX 4090    |
| Mistral 7B     | 4-bit        | QLoRA  | \~9 GB  | RTX 3060    |
| DeepSeek-R1 7B | 4-bit        | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.3 70B  | 4-bit        | QLoRA  | \~44 GB | 2× RTX 3090 |

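## Tips

* **Always use `use_gradient_checkpointing="unsloth"`**: it is the single biggest VRAM saver and is unique to Unsloth
* **Set `lora_dropout=0`**: Unsloth's Triton kernels are optimized for zero dropout and run faster
* **Use `packing=True`** in SFTTrainer to avoid wasting compute on padding for short examples
* **Start with `r=16`** for the LoRA rank; increase to 32 or 64 only if the validation loss plateaus
* **Monitor with wandb**: add `report_to="wandb"` to TrainingArguments to track the loss
* **Tune the batch size**: increase `per_device_train_batch_size` until you approach the VRAM limit, then use `gradient_accumulation_steps` to compensate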
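
The batch-size tip can be automated by probing from above: start with a large per-device batch, halve it whenever a trial step runs out of memory, and raise gradient accumulation so the effective batch stays the same. The sketch below is a hypothetical helper in which `try_step` stands in for one real training step and `MemoryError` stands in for the GPU out-of-memory exception; a real run would catch PyTorch's CUDA out-of-memory error instead.

```python
def fit_batch_size(try_step, target_effective_batch=8, start_batch=8):
    """Halve the per-device batch until try_step succeeds, then set gradient
    accumulation so per-device batch x accumulation matches the target."""
    batch = start_batch
    while batch >= 1:
        try:
            try_step(batch)
            return batch, max(1, target_effective_batch // batch)
        except MemoryError:
            batch //= 2
    raise RuntimeError("even batch size 1 does not fit; shorten max_seq_length")

# Stand-in for one real training step: pretend anything above 2 runs out of memory.
def fake_step(batch):
    if batch > 2:
        raise MemoryError

print(fit_batch_size(fake_step))
# (2, 4)
```

The returned pair maps onto `per_device_train_batch_size` and `gradient_accumulation_steps` in `TrainingArguments`.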

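## Troubleshooting

| Problem                            | Solution                                                                              |
| ---------------------------------- | ------------------------------------------------------------------------------------- |
| `OutOfMemoryError` during training | Reduce the batch size to 1, lower `max_seq_length`, or use 4-bit quantization         |
| Triton kernel compilation errors   | Run `pip install triton --upgrade` and make sure the CUDA toolkit version matches     |
| Slow first step (compiling)        | Expected: Triton compiles kernels on the first run and caches them afterwards         |
| `bitsandbytes` CUDA version error  | Install a matching version: `pip install bitsandbytes --upgrade`                      |
| Loss spikes during training        | Lower the learning rate to 1e-4 and increase the warmup steps                         |
| GGUF export crashes                | Make sure there is enough RAM (2x the model size) and disk space for the conversion   |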

## Resources

* [Unsloth GitHub](https://github.com/unslothai/unsloth)
* [Unsloth Wiki: all notebooks](https://github.com/unslothai/unsloth/wiki)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

