Unsloth 2x Faster Fine-tuning

Fine-tune LLMs 2x faster with 70% less VRAM using Unsloth on Clore.ai

Unsloth rewrites the performance-critical parts of Hugging Face Transformers with hand-optimized Triton kernels, delivering roughly 2x training speed and up to 70% less VRAM with no loss in accuracy. It is a drop-in replacement: existing TRL/PEFT scripts work unchanged after swapping the import.

Key Features

  • 2x faster training — custom Triton kernels for attention, RoPE, cross-entropy, and RMS norm

  • 70% less VRAM — intelligent gradient checkpointing and memory-mapped weights

  • Drop-in HuggingFace replacement — one import change, nothing else

  • QLoRA / LoRA / full fine-tune — all modes supported out of the box

  • Native export — save directly to GGUF (all quant types), LoRA adapters, or merged 16-bit

  • Broad model coverage — Llama 3.x, Mistral, Qwen 2.5, Gemma 2, DeepSeek-R1, Phi-4, and more

  • Free and open source (Apache 2.0)

Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | RTX 3060 12 GB | RTX 4090 24 GB |
| VRAM | 10 GB | 24 GB |
| RAM | 16 GB | 32 GB |
| Disk | 40 GB | 80 GB |
| CUDA | 11.8 | 12.1+ |
| Python | 3.10 | 3.11 |

Clore.ai pricing: RTX 4090 ≈ $0.5–2/day · RTX 3090 ≈ $0.3–1/day · RTX 3060 ≈ $0.15–0.3/day

A 7B model with 4-bit QLoRA fits in ~10 GB VRAM, making even an RTX 3060 viable.

Quick Start

1. Install Unsloth
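
A minimal install sketch, assuming a working CUDA-enabled PyTorch environment; the Unsloth README lists pinned commands for specific CUDA/torch combinations:

```bash
# Base install; pulls in the core dependencies (transformers, peft, trl, bitsandbytes, triton)
pip install unsloth

# Optional: upgrade Unsloth itself without touching its dependencies
pip install --upgrade --no-deps unsloth
```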

2. Load a Model with 4-bit Quantization
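
A minimal loading sketch; the model name and sequence length are illustrative:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # illustrative; any supported model works
    max_seq_length=2048,   # context length used during training
    dtype=None,            # auto-detects bfloat16 on Ampere+ GPUs, float16 otherwise
    load_in_4bit=True,     # NF4 quantization via bitsandbytes (QLoRA)
)
```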

3. Apply LoRA Adapters
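
A sketch of attaching LoRA adapters via Unsloth's get_peft_model wrapper; the rank and target modules shown are common starting points, not requirements:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                   # LoRA rank (see Tips below)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,                         # 0 keeps Unsloth's fast kernel path
    bias="none",
    use_gradient_checkpointing="unsloth",   # the biggest VRAM saver
    random_state=3407,
)
```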

4. Prepare Data and Train
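
A compact training sketch using TRL's SFTTrainer in the style of the Unsloth notebooks; the dataset path is a placeholder (any dataset with a "text" column of formatted prompts works), and kwarg names can vary slightly across TRL versions:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder path

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",            # column holding the fully formatted prompt
    max_seq_length=2048,
    packing=True,                         # pack short examples to avoid padding waste
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=60,
        warmup_steps=5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()
```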

Exporting the Model

Save LoRA Adapter Only
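
Saving only the adapter uses the standard PEFT save path and produces a small artifact (typically a few hundred MB):

```python
model.save_pretrained("lora_adapter")       # adapter weights + config only
tokenizer.save_pretrained("lora_adapter")
```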

Merge and Save Full Model (float16)
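
Unsloth's save_pretrained_merged merges the LoRA weights into the base model and writes a standalone float16 checkpoint; the output directory name is illustrative:

```python
model.save_pretrained_merged(
    "merged_model",                 # output directory (illustrative)
    tokenizer,
    save_method="merged_16bit",     # "merged_4bit" and "lora" are also available
)
```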

Export to GGUF for Ollama / llama.cpp
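
GGUF export goes through save_pretrained_gguf; q4_k_m is a common llama.cpp quantization, and other methods such as q8_0 or f16 can be passed instead:

```python
model.save_pretrained_gguf(
    "gguf_model",                    # output directory (illustrative)
    tokenizer,
    quantization_method="q4_k_m",
)
```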

After export, serve with Ollama:
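
A minimal Ollama sketch; the GGUF filename depends on what the export step produced, so adjust the FROM path accordingly:

```bash
# Point a Modelfile at the exported GGUF (filename is illustrative)
echo 'FROM ./gguf_model/unsloth.Q4_K_M.gguf' > Modelfile

# Register the model with Ollama and chat with it
ollama create my-finetune -f Modelfile
ollama run my-finetune
```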

Usage Examples

Fine-Tune on a Custom Chat Dataset
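
A sketch of preparing a chat dataset with Unsloth's chat template helper before training as in step 4; the template name, dataset path, and column name are assumptions about your data:

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# Wrap the tokenizer with the chat template matching the base model (assumed here to be Llama 3.1)
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_conversations(examples):
    # Assumes each row has a "conversations" list of {"role": ..., "content": ...} messages
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

dataset = load_dataset("json", data_files="chat_data.jsonl", split="train")  # placeholder path
dataset = dataset.map(format_conversations, batched=True)
# Train exactly as in step 4, pointing SFTTrainer at the "text" column.
```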

DPO / ORPO Alignment Training
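
A preference-training sketch with TRL's DPOTrainer on the LoRA model from the Quick Start; PatchDPOTrainer is Unsloth's documented hook, the dataset is a placeholder with prompt/chosen/rejected columns, and older TRL versions pass beta and tokenizer directly to the trainer instead of using DPOConfig/processing_class:

```python
from unsloth import PatchDPOTrainer
PatchDPOTrainer()                       # patch TRL's DPOTrainer with Unsloth's fast paths

from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

dataset = load_dataset("json", data_files="preferences.jsonl", split="train")  # prompt / chosen / rejected

trainer = DPOTrainer(
    model=model,                        # LoRA model from the Quick Start
    ref_model=None,                     # with LoRA, the frozen base weights act as the reference
    train_dataset=dataset,
    processing_class=tokenizer,
    args=DPOConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,                       # DPO temperature
        max_steps=100,
        output_dir="dpo_outputs",
    ),
)
trainer.train()
```

ORPO follows the same pattern with TRL's ORPOTrainer and ORPOConfig, except it needs no reference model.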

VRAM Usage Reference

| Model | Quant | Method | VRAM | GPU |
|-------|-------|--------|------|-----|
| Llama 3.1 8B | 4-bit | QLoRA | ~10 GB | RTX 3060 |
| Llama 3.1 8B | 16-bit | LoRA | ~18 GB | RTX 3090 |
| Qwen 2.5 14B | 4-bit | QLoRA | ~14 GB | RTX 3090 |
| Mistral 7B | 4-bit | QLoRA | ~9 GB | RTX 3060 |
| DeepSeek-R1 7B | 4-bit | QLoRA | ~10 GB | RTX 3060 |
| Llama 3.3 70B | 4-bit | QLoRA | ~44 GB | 2× RTX 3090 |

Tips

  • Always use use_gradient_checkpointing="unsloth" — this is the single biggest VRAM saver, unique to Unsloth

  • Set lora_dropout=0 — Unsloth's Triton kernels are optimized for zero dropout and run faster

  • Use packing=True in SFTTrainer to avoid padding waste on short examples

  • Start with r=16 for LoRA rank — increase to 32 or 64 only if validation loss plateaus

  • Monitor with wandb — add report_to="wandb" in TrainingArguments for loss tracking

  • Batch size tuning — increase per_device_train_batch_size until you approach the VRAM limit, then compensate with gradient_accumulation_steps (see the sketch below)
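
A TrainingArguments sketch combining the batch-size, gradient-accumulation, and wandb tips above; the numbers are starting points to tune per GPU, not recommendations:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,      # raise until you approach the VRAM limit
    gradient_accumulation_steps=8,      # effective batch size = 4 * 8 = 32
    learning_rate=2e-4,
    warmup_steps=10,
    num_train_epochs=1,
    logging_steps=1,
    report_to="wandb",                  # stream loss curves to Weights & Biases
    output_dir="outputs",
)
```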

Troubleshooting

| Problem | Solution |
|---------|----------|
| OutOfMemoryError during training | Lower batch size to 1, reduce max_seq_length, or use 4-bit quantization |
| Triton kernel compilation errors | Run pip install triton --upgrade and ensure the CUDA toolkit version matches |
| Slow first step (compiling) | Normal: Triton compiles kernels on the first run and caches them afterwards |
| bitsandbytes CUDA version error | Install a matching version: pip install bitsandbytes --upgrade |
| Loss spikes during training | Lower the learning rate to 1e-4 and add warmup steps |
| GGUF export crashes | Ensure enough RAM (2× model size) and disk space for the conversion |

Resources
