# Unsloth 2x Faster Fine-tuning

Unsloth rewrites the performance-critical parts of HuggingFace Transformers with hand-optimized Triton kernels, delivering **2x training speed** and **70% VRAM reduction** with zero accuracy loss. It is a drop-in replacement — your existing TRL/PEFT scripts work unchanged after swapping the import.

{% hint style="success" %}
All examples run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Key Features

* **2x faster training** — custom Triton kernels for attention, RoPE, cross-entropy, and RMS norm
* **70% less VRAM** — smart gradient checkpointing that asynchronously offloads activations to system RAM
* **Drop-in HuggingFace replacement** — one import change, nothing else
* **QLoRA / LoRA / full fine-tune** — all modes supported out of the box
* **Native export** — save directly to GGUF (all quant types), LoRA adapters, or merged 16-bit
* **Broad model coverage** — Llama 3.x, Mistral, Qwen 2.5, Gemma 2, DeepSeek-R1, Phi-4, and more
* **Free and open source** (Apache 2.0)

## Requirements

| Component | Minimum        | Recommended    |
| --------- | -------------- | -------------- |
| GPU       | RTX 3060 12 GB | RTX 4090 24 GB |
| VRAM      | 10 GB          | 24 GB          |
| RAM       | 16 GB          | 32 GB          |
| Disk      | 40 GB          | 80 GB          |
| CUDA      | 11.8           | 12.1+          |
| Python    | 3.10           | 3.11           |

**Clore.ai pricing:** RTX 4090 ≈ $0.5–2/day · RTX 3090 ≈ $0.3–1/day · RTX 3060 ≈ $0.15–0.3/day

A 7B model with 4-bit QLoRA fits in **\~10 GB VRAM**, making even an RTX 3060 viable.
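
At these day rates, budgeting a run is simple arithmetic. A minimal sketch (the 2 s/step throughput is a hypothetical figure; measure your own from the trainer logs):

```python
def training_cost_usd(steps, sec_per_step, usd_per_day):
    # wall-clock hours -> fraction of a rental day -> cost
    hours = steps * sec_per_step / 3600
    return hours / 24 * usd_per_day

# e.g. 1,000 optimizer steps at ~2 s/step on a $1.50/day RTX 4090
print(f"${training_cost_usd(1000, 2.0, 1.50):.2f}")
```

Even a multi-hour run on a 4090 stays under a dollar at these rates, so favor the faster GPU over the cheaper one when both fit the model.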

## Quick Start

### 1. Install Unsloth

```bash
# Create a venv (recommended)
python -m venv /workspace/unsloth-env
source /workspace/unsloth-env/bin/activate

pip install --upgrade pip
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes xformers
```

### 2. Load a Model with 4-bit Quantization

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,            # auto-detect (bfloat16 on Ampere and newer, float16 on older GPUs)
    load_in_4bit=True,
)
```

### 3. Apply LoRA Adapters

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",   # 70% VRAM reduction
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)
```
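
As a sanity check on what `r=16` buys you: each targeted weight `W` (shape `d_out × d_in`) gains two low-rank factors totaling `r · (d_out + d_in)` trainable parameters. A back-of-envelope count using Llama 3.1 8B's published dimensions (hidden 4096, MLP 14336, 32 layers, 8 KV heads of width 128):

```python
r = 16                               # LoRA rank from the config above
h, inter, layers = 4096, 14336, 32   # Llama 3.1 8B: hidden, MLP, layer count
kv = 1024                            # k/v projection width (8 KV heads x 128)

# (d_out, d_in) of each targeted projection; a LoRA pair adds r*(d_out + d_in)
shapes = {
    "q_proj": (h, h), "k_proj": (kv, h), "v_proj": (kv, h), "o_proj": (h, h),
    "gate_proj": (inter, h), "up_proj": (inter, h), "down_proj": (h, inter),
}
per_layer = sum(r * (do + di) for do, di in shapes.values())
total = per_layer * layers
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~41.9M, well under 1% of 8B
```

Doubling `r` doubles this count, which is why bumping the rank is cheap in VRAM terms relative to the frozen base model.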

### 4. Prepare Data and Train

```python
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# alpaca-cleaned ships instruction/input/output columns, so build the
# "text" field that SFTTrainer expects below
alpaca_prompt = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"

def to_text(example):
    text = alpaca_prompt.format(
        example["instruction"], example["input"], example["output"]
    )
    return {"text": text + tokenizer.eos_token}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="/workspace/outputs",
    ),
)

stats = trainer.train()
print(f"Training loss: {stats.training_loss:.4f}")
```

## Exporting the Model

### Save LoRA Adapter Only

```python
model.save_pretrained("/workspace/lora-adapter")
tokenizer.save_pretrained("/workspace/lora-adapter")
```

### Merge and Save Full Model (float16)

```python
model.save_pretrained_merged(
    "/workspace/merged-model",
    tokenizer,
    save_method="merged_16bit",
)
```

### Export to GGUF for Ollama / llama.cpp

```python
# Quantize to Q4_K_M (good balance of size and quality)
model.save_pretrained_gguf(
    "/workspace/gguf-output",
    tokenizer,
    quantization_method="q4_k_m",
)

# Other options: q5_k_m, q8_0, f16
```
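
To size disk before exporting, multiply parameter count by the quant's bits per weight. The figures below are rough community estimates for llama.cpp quant types, not exact file sizes (k-quants mix block sizes, so real files vary slightly):

```python
# Approximate bits per weight for common GGUF quant types (rough estimates)
QUANT_BPW = {"q4_k_m": 4.85, "q5_k_m": 5.69, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(params_b, quant):
    # billions of params x bits/weight / 8 bits per byte = GB
    return params_b * QUANT_BPW[quant] / 8

for q in QUANT_BPW:
    print(f"8B {q}: ~{gguf_size_gb(8, q):.1f} GB")
```

For an 8B model this puts `q4_k_m` around 5 GB, which matches the size of published Llama 3 GGUF files.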

After export, serve with Ollama:

```bash
# Create an Ollama modelfile
cat > Modelfile <<EOF
FROM /workspace/gguf-output/unsloth.Q4_K_M.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
PARAMETER temperature 0.7
EOF

ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Summarize the key points of transformers architecture"
```

## Usage Examples

### Fine-Tune on a Custom Chat Dataset

```python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format_chat)
```

### DPO / ORPO Alignment Training

```python
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,          # with LoRA, the frozen base model serves as the implicit reference
    args=DPOConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=1,
        beta=0.1,
        output_dir="/workspace/dpo-output",
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
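
`DPOTrainer` expects `dpo_dataset` to expose text columns named `prompt`, `chosen`, and `rejected`. A minimal sketch of reshaping a raw preference record into that form (the source field names `question`, `good_answer`, and `bad_answer` are hypothetical — substitute your dataset's own columns):

```python
def to_dpo_format(example):
    # Map raw preference fields onto the column names DPOTrainer expects
    return {
        "prompt": example["question"],
        "chosen": example["good_answer"],
        "rejected": example["bad_answer"],
    }

raw = {"question": "What is 2+2?", "good_answer": "4.", "bad_answer": "22."}
print(to_dpo_format(raw))
```

Applied to a full dataset: `dpo_dataset = raw_dataset.map(to_dpo_format)`.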

## VRAM Usage Reference

| Model          | Quant  | Method | VRAM    | GPU         |
| -------------- | ------ | ------ | ------- | ----------- |
| Llama 3.1 8B   | 4-bit  | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.1 8B   | 16-bit | LoRA   | \~18 GB | RTX 3090    |
| Qwen 2.5 14B   | 4-bit  | QLoRA  | \~14 GB | RTX 3090    |
| Mistral 7B     | 4-bit  | QLoRA  | \~9 GB  | RTX 3060    |
| DeepSeek-R1 7B | 4-bit  | QLoRA  | \~10 GB | RTX 3060    |
| Llama 3.3 70B  | 4-bit  | QLoRA  | \~44 GB | 2× RTX 3090 |

## Tips

* **Always use `use_gradient_checkpointing="unsloth"`** — this is the single biggest VRAM saver, unique to Unsloth
* **Set `lora_dropout=0`** — Unsloth's Triton kernels are optimized for zero dropout and run faster
* **Use `packing=True`** in SFTTrainer to avoid padding waste on short examples
* **Start with `r=16`** for LoRA rank — increase to 32 or 64 only if validation loss plateaus
* **Monitor with wandb** — add `report_to="wandb"` in TrainingArguments for loss tracking
* **Batch size tuning** — increase `per_device_train_batch_size` until you approach VRAM limit, then compensate with `gradient_accumulation_steps`
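
The batch-size trade-off in the last tip works because the optimizer sees the product of the two settings. A one-liner to sanity-check a config:

```python
def effective_batch_size(per_device, grad_accum, num_gpus=1):
    # Samples contributing to each optimizer step
    return per_device * grad_accum * num_gpus

# The quick-start config: 2 per device x 4 accumulation steps
print(effective_batch_size(2, 4))  # -> 8
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps this number, and therefore the training dynamics, roughly unchanged while cutting activation VRAM.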

## Troubleshooting

| Problem                            | Solution                                                            |
| ---------------------------------- | ------------------------------------------------------------------- |
| `OutOfMemoryError` during training | Lower batch size to 1, reduce `max_seq_length`, or use 4-bit quant  |
| Triton kernel compilation errors   | Run `pip install triton --upgrade` and ensure CUDA toolkit matches  |
| Slow first step (compiling)        | Normal — Triton compiles kernels on first run, cached afterwards    |
| `bitsandbytes` CUDA version error  | Install matching version: `pip install bitsandbytes --upgrade`      |
| Loss spikes during training        | Lower learning rate to 1e-4, add warmup steps                       |
| GGUF export crashes                | Ensure enough RAM (2× model size) and disk space for the conversion |
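
Following the last row's 2× rule of thumb, a quick disk pre-flight check before starting a GGUF export can save a crash mid-conversion (this sketch checks only disk, not RAM; `gguf_preflight` is an illustrative helper, not an Unsloth API):

```python
import shutil

def gguf_preflight(model_size_gb, out_dir="."):
    """Return True if out_dir has at least 2x the model size free."""
    free_gb = shutil.disk_usage(out_dir).free / 1e9
    return free_gb >= 2 * model_size_gb

# A merged fp16 8B model is roughly 16 GB on disk
print(gguf_preflight(16))
```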

## Resources

* [Unsloth GitHub](https://github.com/unslothai/unsloth)
* [Unsloth Wiki — All Notebooks](https://github.com/unslothai/unsloth/wiki)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)
