# Unsloth 2x Faster Fine-tuning

Unsloth rewrites the performance-critical parts of HuggingFace Transformers with hand-optimized Triton kernels, delivering **2x training speed** and **70% VRAM reduction** with zero accuracy loss. It is a drop-in replacement — your existing TRL/PEFT scripts work unchanged after swapping the import.

{% hint style="success" %}
All examples run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Key Features

* **2x faster training** — custom Triton kernels for attention, RoPE, cross-entropy, and RMS norm
* **70% less VRAM** — smart gradient checkpointing that asynchronously offloads activations to system RAM
* **Drop-in HuggingFace replacement** — one import change, nothing else
* **QLoRA / LoRA / full fine-tune** — all modes supported out of the box
* **Native export** — save directly to GGUF (all quant types), LoRA adapters, or merged 16-bit
* **Broad model coverage** — Llama 3.x, Mistral, Qwen 2.5, Gemma 2, DeepSeek-R1, Phi-4, and more
* **Free and open source** (Apache 2.0)

## Requirements

| Component | Minimum        | Recommended    |
| --------- | -------------- | -------------- |
| GPU       | RTX 3060 12 GB | RTX 4090 24 GB |
| VRAM      | 10 GB          | 24 GB          |
| RAM       | 16 GB          | 32 GB          |
| Disk      | 40 GB          | 80 GB          |
| CUDA      | 11.8           | 12.1+          |
| Python    | 3.10           | 3.11           |

**Clore.ai pricing:** RTX 4090 ≈ $0.5–2/day · RTX 3090 ≈ $0.3–1/day · RTX 3060 ≈ $0.15–0.3/day

A 7B model with 4-bit QLoRA fits in **\~10 GB VRAM**, making even an RTX 3060 viable.
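
At these day rates, budgeting a run is simple arithmetic. A minimal sketch (the 2 s/step throughput is a hypothetical figure; measure your own from the trainer logs):

```python
def training_cost_usd(steps, sec_per_step, usd_per_day):
    # wall-clock hours -> fraction of a rental day -> cost
    hours = steps * sec_per_step / 3600
    return hours / 24 * usd_per_day

# e.g. 1,000 optimizer steps at ~2 s/step on a $1.50/day RTX 4090
print(f"${training_cost_usd(1000, 2.0, 1.50):.2f}")
```

Even a multi-hour run on a 4090 stays under a dollar at these rates, so favor the faster GPU over the cheaper one when both fit the model.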

## Quick Start

### 1. Install Unsloth

```bash
# Create a venv (recommended)
python -m venv /workspace/unsloth-env
source /workspace/unsloth-env/bin/activate

pip install --upgrade pip
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes xformers
```

### 2. Load a Model with 4-bit Quantization

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,            # auto-detect (bfloat16 on Ampere and newer, float16 on older GPUs)
    load_in_4bit=True,
)
```

### 3. Apply LoRA Adapters

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",   # 70% VRAM reduction
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)
```
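
As a sanity check on what `r=16` buys you: each targeted weight `W` (shape `d_out × d_in`) gains two low-rank factors totaling `r · (d_out + d_in)` trainable parameters. A back-of-envelope count using Llama 3.1 8B's published dimensions (hidden 4096, MLP 14336, 32 layers, 8 KV heads of width 128):

```python
r = 16                               # LoRA rank from the config above
h, inter, layers = 4096, 14336, 32   # Llama 3.1 8B: hidden, MLP, layer count
kv = 1024                            # k/v projection width (8 KV heads x 128)

# (d_out, d_in) of each targeted projection; a LoRA pair adds r*(d_out + d_in)
shapes = {
    "q_proj": (h, h), "k_proj": (kv, h), "v_proj": (kv, h), "o_proj": (h, h),
    "gate_proj": (inter, h), "up_proj": (inter, h), "down_proj": (h, inter),
}
per_layer = sum(r * (do + di) for do, di in shapes.values())
total = per_layer * layers
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~41.9M, well under 1% of 8B
```

Doubling `r` doubles this count, which is why bumping the rank is cheap in VRAM terms relative to the frozen base model.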

### 4. Prepare Data and Train

```python
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# alpaca-cleaned ships instruction/input/output columns, so build the
# "text" field that SFTTrainer expects below
alpaca_prompt = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"

def to_text(example):
    text = alpaca_prompt.format(
        example["instruction"], example["input"], example["output"]
    )
    return {"text": text + tokenizer.eos_token}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="/workspace/outputs",
    ),
)

stats = trainer.train()
print(f"Training loss: {stats.training_loss:.4f}")
```

## Exporting the Model

### Save LoRA Adapter Only

```python
model.save_pretrained("/workspace/lora-adapter")
tokenizer.save_pretrained("/workspace/lora-adapter")
```

### Merge and Save Full Model (float16)

```python
model.save_pretrained_merged(
    "/workspace/merged-model",
    tokenizer,
    save_method="merged_16bit",
)
```

### Export to GGUF for Ollama / llama.cpp

```python
# Quantize to Q4_K_M (good balance of size and quality)
model.save_pretrained_gguf(
    "/workspace/gguf-output",
    tokenizer,
    quantization_method="q4_k_m",
)

# Other options: q5_k_m, q8_0, f16
```
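
To size disk before exporting, multiply parameter count by the quant's bits per weight. The figures below are rough community estimates for llama.cpp quant types, not exact file sizes (k-quants mix block sizes, so real files vary slightly):

```python
# Approximate bits per weight for common GGUF quant types (rough estimates)
QUANT_BPW = {"q4_k_m": 4.85, "q5_k_m": 5.69, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(params_b, quant):
    # billions of params x bits/weight / 8 bits per byte = GB
    return params_b * QUANT_BPW[quant] / 8

for q in QUANT_BPW:
    print(f"8B {q}: ~{gguf_size_gb(8, q):.1f} GB")
```

For an 8B model this puts `q4_k_m` around 5 GB, which matches the size of published Llama 3 GGUF files.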

After export, serve with Ollama:

```bash
# Create an Ollama modelfile
cat > Modelfile <<EOF
FROM /workspace/gguf-output/unsloth.Q4_K_M.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
PARAMETER temperature 0.7
EOF

ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Summarize the key points of transformers architecture"
```

## Usage Examples

### Fine-Tune on a Custom Chat Dataset

```python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format_chat)
```

### DPO / ORPO Alignment Training

```python
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,          # with LoRA, the frozen base model serves as the implicit reference
    args=DPOConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=1,
        beta=0.1,
        output_dir="/workspace/dpo-output",
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
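
`DPOTrainer` expects `dpo_dataset` to expose text columns named `prompt`, `chosen`, and `rejected`. A minimal sketch of reshaping a raw preference record into that form (the source field names `question`, `good_answer`, and `bad_answer` are hypothetical — substitute your dataset's own columns):

```python
def to_dpo_format(example):
    # Map raw preference fields onto the column names DPOTrainer expects
    return {
        "prompt": example["question"],
        "chosen": example["good_answer"],
        "rejected": example["bad_answer"],
    }

raw = {"question": "What is 2+2?", "good_answer": "4.", "bad_answer": "22."}
print(to_dpo_format(raw))
```

Applied to a full dataset: `dpo_dataset = raw_dataset.map(to_dpo_format)`.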

## VRAM Usage Reference

| Model          | Quant  | Method | VRAM    | GPU         |
| -------------- | ------ | ------ | ------- | ----------- |
| Llama 3.1 8B   | 4-bit  | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.1 8B   | 16-bit | LoRA   | \~18 GB | RTX 3090    |
| Qwen 2.5 14B   | 4-bit  | QLoRA  | \~14 GB | RTX 3090    |
| Mistral 7B     | 4-bit  | QLoRA  | \~9 GB  | RTX 3060    |
| DeepSeek-R1 7B | 4-bit  | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.3 70B  | 4-bit  | QLoRA  | \~44 GB | 2× RTX 3090 |

## Tips

* **Always use `use_gradient_checkpointing="unsloth"`** — this is the single biggest VRAM saver, unique to Unsloth
* **Set `lora_dropout=0`** — Unsloth's Triton kernels are optimized for zero dropout and run faster
* **Use `packing=True`** in SFTTrainer to avoid padding waste on short examples
* **Start with `r=16`** for LoRA rank — increase to 32 or 64 only if validation loss plateaus
* **Monitor with wandb** — add `report_to="wandb"` in TrainingArguments for loss tracking
* **Batch size tuning** — increase `per_device_train_batch_size` until you approach VRAM limit, then compensate with `gradient_accumulation_steps`
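
The batch-size trade-off in the last tip works because the optimizer sees the product of the two settings. A one-liner to sanity-check a config:

```python
def effective_batch_size(per_device, grad_accum, num_gpus=1):
    # Samples contributing to each optimizer step
    return per_device * grad_accum * num_gpus

# The quick-start config: 2 per device x 4 accumulation steps
print(effective_batch_size(2, 4))  # -> 8
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps this number, and therefore the training dynamics, roughly unchanged while cutting activation VRAM.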

## Troubleshooting

| Problem                            | Solution                                                            |
| ---------------------------------- | ------------------------------------------------------------------- |
| `OutOfMemoryError` during training | Lower batch size to 1, reduce `max_seq_length`, or use 4-bit quant  |
| Triton kernel compilation errors   | Run `pip install triton --upgrade` and ensure CUDA toolkit matches  |
| Slow first step (compiling)        | Normal — Triton compiles kernels on first run, cached afterwards    |
| `bitsandbytes` CUDA version error  | Install matching version: `pip install bitsandbytes --upgrade`      |
| Loss spikes during training        | Lower learning rate to 1e-4, add warmup steps                       |
| GGUF export crashes                | Ensure enough RAM (2× model size) and disk space for the conversion |
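
Following the last row's 2× rule of thumb, a quick disk pre-flight check before starting a GGUF export can save a crash mid-conversion (this sketch checks only disk, not RAM; `gguf_preflight` is an illustrative helper, not an Unsloth API):

```python
import shutil

def gguf_preflight(model_size_gb, out_dir="."):
    """Return True if out_dir has at least 2x the model size free."""
    free_gb = shutil.disk_usage(out_dir).free / 1e9
    return free_gb >= 2 * model_size_gb

# A merged fp16 8B model is roughly 16 GB on disk
print(gguf_preflight(16))
```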

## Resources

* [Unsloth GitHub](https://github.com/unslothai/unsloth)
* [Unsloth Wiki — All Notebooks](https://github.com/unslothai/unsloth/wiki)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)
