# Unsloth 2x Faster Fine-tuning

Unsloth rewrites the performance-critical parts of HuggingFace Transformers with hand-optimized Triton kernels, delivering **2x training speed** and **70% VRAM reduction** with zero accuracy loss. It is a drop-in replacement — your existing TRL/PEFT scripts work unchanged after swapping the import.

{% hint style="success" %}
All examples run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}
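
As a minimal illustration of the swap (the loader mirrors the familiar `from_pretrained` API; full details in the Quick Start below):

```python
# Hugging Face:  from transformers import AutoModelForCausalLM, AutoTokenizer
# Unsloth:       one import and one loader call; TRL/PEFT code stays unchanged
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
)
```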

## Key Features

* **2x faster training** — custom Triton kernels for attention, RoPE, cross-entropy, and RMS norm
* **70% less VRAM** — intelligent gradient checkpointing and memory-mapped weights
* **Drop-in HuggingFace replacement** — one import change, nothing else
* **QLoRA / LoRA / full fine-tune** — all modes supported out of the box
* **Native export** — save directly to GGUF (all quant types), LoRA adapters, or merged 16-bit
* **Broad model coverage** — Llama 3.x, Mistral, Qwen 2.5, Gemma 2, DeepSeek-R1, Phi-4, and more
* **Free and open source** (Apache 2.0)

## Requirements

| Component | Minimum        | Recommended    |
| --------- | -------------- | -------------- |
| GPU       | RTX 3060 12 GB | RTX 4090 24 GB |
| VRAM      | 10 GB          | 24 GB          |
| RAM       | 16 GB          | 32 GB          |
| Disk      | 40 GB          | 80 GB          |
| CUDA      | 11.8           | 12.1+          |
| Python    | 3.10           | 3.11           |

**CLORE.AI pricing:** RTX 4090 ≈ $0.5–2/day · RTX 3090 ≈ $0.3–1/day · RTX 3060 ≈ $0.15–0.3/day

A 7B model with 4-bit QLoRA fits in **\~10 GB VRAM**, making even an RTX 3060 viable.
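
A rough back-of-envelope for that figure (illustrative numbers, not exact measurements):

```python
# Approximate QLoRA memory budget for a 7B model
base_weights_gb = 7e9 * 0.5 / 1e9  # 4-bit weights: ~0.5 bytes/param ≈ 3.5 GB
adapters_gb     = 0.2              # LoRA adapters + 8-bit optimizer states (small)
runtime_gb      = 5.0              # activations, gradients, CUDA context; grows
                                   # with batch size and sequence length
print(base_weights_gb + adapters_gb + runtime_gb)  # ≈ 8.7 GB, ~10 GB with headroom
```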

## Quick Start

### 1. Install Unsloth

```bash
# Create a venv (recommended)
python -m venv /workspace/unsloth-env
source /workspace/unsloth-env/bin/activate

pip install --upgrade pip
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes xformers
```
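
A quick smoke test (run inside the venv) confirms that the build imports and that CUDA sees your GPU:

```python
import torch
from unsloth import FastLanguageModel  # importing alone verifies the install

print(torch.__version__, torch.cuda.is_available())
print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 4090"
```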

### 2. Load a Model with 4-bit Quantization

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,            # auto-detect (bfloat16 on Ampere and newer, float16 on older GPUs)
    load_in_4bit=True,
)
```

### 3. Apply LoRA Adapters

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",   # 70% VRAM reduction
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)
```
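
Since `get_peft_model` returns a standard PEFT model, you can optionally confirm that only the adapters are trainable:

```python
# At r=16 on an 8B model this typically reports well under 1% of parameters as trainable
model.print_trainable_parameters()
```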

### 4. Prepare Data and Train

```python
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import torch  # used below for the fp16/bf16 capability check

dataset = load_dataset("yahma/alpaca-cleaned", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="/workspace/outputs",
    ),
)

stats = trainer.train()
print(f"Training loss: {stats.training_loss:.4f}")
```
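
Before exporting, it is worth sanity-checking the fine-tuned model with a quick generation. A minimal sketch using Unsloth's fast inference mode:

```python
FastLanguageModel.for_inference(model)  # enables Unsloth's optimized inference path

inputs = tokenizer("Give three tips for staying healthy.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```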

## Exporting the Model

### Save LoRA Adapter Only

```python
model.save_pretrained("/workspace/lora-adapter")
tokenizer.save_pretrained("/workspace/lora-adapter")
```
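
To resume later, the adapter directory loads back with the same call used in the Quick Start; Unsloth resolves the base model from the adapter config:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/workspace/lora-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
```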

### Merge and Save Full Model (float16)

```python
model.save_pretrained_merged(
    "/workspace/merged-model",
    tokenizer,
    save_method="merged_16bit",
)
```

### Export to GGUF for Ollama / llama.cpp

```python
# Quantize to Q4_K_M (good balance of size and quality)
model.save_pretrained_gguf(
    "/workspace/gguf-output",
    tokenizer,
    quantization_method="q4_k_m",
)

# Other options: q5_k_m, q8_0, f16
```

After export, serve with Ollama:

```bash
# Create an Ollama modelfile
cat > Modelfile <<EOF
FROM /workspace/gguf-output/unsloth.Q4_K_M.gguf
TEMPLATE "{{ .System }}\n{{ .Prompt }}"
PARAMETER temperature 0.7
EOF

ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Summarize the key points of transformers architecture"
```

## Usage Examples

### Fine-Tune on a Custom Chat Dataset

```python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format_chat)
```
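
A quick check that the template rendered as expected before training:

```python
# Inspect one formatted example; expect Llama 3.1 special tokens such as
# <|begin_of_text|> and <|start_header_id|> around each message
print(dataset[0]["text"])
```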

### DPO / ORPO Alignment Training

```python
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,          # not needed with LoRA: the frozen base weights (adapters disabled) serve as the reference
    args=DPOConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=1,
        beta=0.1,
        output_dir="/workspace/dpo-output",
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
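
`dpo_dataset` is assumed to be a preference dataset in the standard TRL layout, one record per comparison:

```python
# Hypothetical record: DPOTrainer expects "prompt", "chosen", and "rejected" columns
example = {
    "prompt":   "Explain LoRA in one sentence.",
    "chosen":   "LoRA fine-tunes a model by training small low-rank adapter matrices on top of frozen weights.",
    "rejected": "LoRA is a long-range radio protocol.",
}
```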

## VRAM Usage Reference

| Model          | Quant  | Method | VRAM    | GPU         |
| -------------- | ------ | ------ | ------- | ----------- |
| Llama 3.1 8B   | 4-bit  | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.1 8B   | 16-bit | LoRA   | \~18 GB | RTX 3090    |
| Qwen 2.5 14B   | 4-bit  | QLoRA  | \~14 GB | RTX 3090    |
| Mistral 7B     | 4-bit  | QLoRA  | \~9 GB  | RTX 3060    |
| DeepSeek-R1 7B | 4-bit  | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.3 70B  | 4-bit  | QLoRA  | \~44 GB | 2× RTX 3090 |

## Tips

* **Always use `use_gradient_checkpointing="unsloth"`** — this is the single biggest VRAM saver, unique to Unsloth
* **Set `lora_dropout=0`** — Unsloth's Triton kernels are optimized for zero dropout and run faster
* **Use `packing=True`** in SFTTrainer to avoid padding waste on short examples
* **Start with `r=16`** for LoRA rank — increase to 32 or 64 only if validation loss plateaus
* **Monitor with wandb** — add `report_to="wandb"` in TrainingArguments for loss tracking
* **Batch size tuning** — increase `per_device_train_batch_size` until you approach the VRAM limit, then compensate with `gradient_accumulation_steps` (see the sketch below)
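
The two knobs multiply: the optimizer sees an effective batch of `per_device_train_batch_size × gradient_accumulation_steps`, so one can be traded for the other while keeping training dynamics the same. With the Quick Start values:

```python
# Effective batch size under the Quick Start settings
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
print(per_device_train_batch_size * gradient_accumulation_steps)  # 8 sequences per optimizer step
```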

## Troubleshooting

| Problem                            | Solution                                                            |
| ---------------------------------- | ------------------------------------------------------------------- |
| `OutOfMemoryError` during training | Lower batch size to 1, reduce `max_seq_length`, or use 4-bit quant  |
| Triton kernel compilation errors   | Run `pip install --upgrade triton` and make sure the CUDA toolkit matches your PyTorch build |
| Slow first step (compiling)        | Normal — Triton compiles kernels on first run, cached afterwards    |
| `bitsandbytes` CUDA version error  | Upgrade to a build that matches your CUDA: `pip install --upgrade bitsandbytes` |
| Loss spikes during training        | Lower learning rate to 1e-4, add warmup steps                       |
| GGUF export crashes                | Ensure enough RAM (2× model size) and disk space for the conversion |
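
For the OOM case specifically, a minimal mitigation sketch combining the three knobs from the table (values are illustrative):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=1024,   # halved from the Quick Start's 2048
    load_in_4bit=True,     # 4-bit quantization
)
# ...then set per_device_train_batch_size=1 in TrainingArguments and raise
# gradient_accumulation_steps to preserve the effective batch size
```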

## Resources

* [Unsloth GitHub](https://github.com/unslothai/unsloth)
* [Unsloth Wiki — All Notebooks](https://github.com/unslothai/unsloth/wiki)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)


