- **LoRA (Low-Rank Adaptation)**: train small adapter layers instead of the full model
- **QLoRA**: LoRA with 4-bit quantization for even less VRAM
- Train a 7B model on a single RTX 3090
- Train a 70B model on a single A100
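To make the "small adapter layers" claim concrete, here is a back-of-the-envelope parameter count (plain Python, no ML libraries). LoRA replaces updates to a `d_out x d_in` weight matrix with two low-rank factors `B` (`d_out x r`) and `A` (`r x d_in`), so the trainable parameter count drops from `d_out * d_in` to `r * (d_out + d_in)`:

```python
# Back-of-the-envelope: trainable parameters, full fine-tuning vs. LoRA.
# For a weight matrix of shape (d_out, d_in), LoRA trains two low-rank
# factors B (d_out x r) and A (r x d_in) instead of the full matrix.

def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    return r * (d_out + d_in)

# A single 4096x4096 attention projection (typical size in a 7B model):
d = 4096
print(full_params(d, d))        # -> 16777216 parameters for full fine-tuning
print(lora_params(d, d, r=16))  # -> 131072 parameters, under 1% of the full matrix
```

This is why the optimizer state and gradients fit in a fraction of the VRAM that full fine-tuning needs; the frozen base weights still occupy memory, which is where QLoRA's quantization helps.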
## Requirements

| Model | Method    | Min VRAM | Recommended |
|-------|-----------|----------|-------------|
| 7B    | QLoRA     | 12GB     | RTX 3090    |
| 13B   | QLoRA     | 20GB     | RTX 4090    |
| 70B   | QLoRA     | 48GB     | A100 80GB   |
| 7B    | Full LoRA | 24GB     | RTX 4090    |
## Quick Deploy

- Docker Image:
- Ports:
- Command:
## Accessing Your Service

After deployment, find your http_pub URL in My Orders:

1. Go to the My Orders page
2. Click on your order
3. Find the http_pub URL (e.g., abc123.clorecloud.net)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.
## Dataset Preparation

### Chat Format (Recommended)

### Instruction Format

### Alpaca Format
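The three formats above are typically stored as JSONL, one training example per line. A minimal sketch follows; the field names (`messages`, `prompt`/`completion`, `instruction`/`input`/`output`) follow common community conventions, so adjust them to whatever your training script actually expects:

```python
import json

# Chat format: role-tagged messages, usable with tokenizer.apply_chat_template
chat_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "LoRA trains small low-rank adapter layers."},
    ]
}

# Instruction format: a flat prompt/completion pair
instruction_example = {
    "prompt": "Summarize LoRA in one sentence.",
    "completion": "LoRA fine-tunes a model by training small low-rank adapters.",
}

# Alpaca format: instruction, optional input context, and output
alpaca_example = {
    "instruction": "Summarize the text in one sentence.",
    "input": "LoRA adds trainable low-rank matrices to frozen weights.",
    "output": "LoRA fine-tunes by training small added matrices.",
}

# Write a JSONL file: one JSON object per line
with open("train.jsonl", "w") as f:
    f.write(json.dumps(chat_example) + "\n")
```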
## Supported Modern Models (2025)

| Model | HF ID | Min VRAM (QLoRA) |
|-------|-------|------------------|
| Llama 3.1 / 3.3 8B | meta-llama/Llama-3.1-8B-Instruct | 12GB |
| Qwen 2.5 7B / 14B | Qwen/Qwen2.5-7B-Instruct | 12GB / 20GB |
| DeepSeek-R1-Distill (7B/8B) | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 12GB |
| Mistral 7B v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | 12GB |
| Gemma 2 9B | google/gemma-2-9b-it | 14GB |
| Phi-4 14B | microsoft/phi-4 | 20GB |
## QLoRA Fine-tuning Script

A modern example with PEFT 0.14+, Flash Attention 2, DoRA support, and Qwen2.5 / DeepSeek-R1 compatibility:
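A minimal sketch of such a script is shown below, using `transformers`, `bitsandbytes`, `peft`, and `trl`. Treat it as a starting point, not a verified implementation: `SFTTrainer`/`SFTConfig` arguments have shifted across `trl` releases, the dataset file and output directory are placeholders, and the hyperparameters are reasonable defaults rather than tuned values. It requires a CUDA GPU (12GB+ for a 7B/8B model).

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # any model from the table above

# 4-bit NF4 quantization: the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Ampere+ GPUs only
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on attention and MLP projections
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Placeholder dataset path; see Dataset Preparation above
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(
        output_dir="./lora_adapter",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()
trainer.save_model("./lora_adapter")
```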
## Flash Attention 2

Flash Attention 2 reduces VRAM usage and speeds up training significantly. It requires an Ampere or newer GPU (RTX 3090, RTX 4090, A100).

| Setting | VRAM (7B) | Speed |
|---------|-----------|-------|
| Standard attention (fp16) | ~22GB | baseline |
| Flash Attention 2 (bf16) | ~16GB | +30% |
| Flash Attention 2 + QLoRA | ~12GB | +30% |

```python
# Enable in model loading:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",  # <-- add this
    torch_dtype=torch.bfloat16,  # FA2 requires bf16 or fp16
    device_map="auto",
)
```

## DoRA (Weight-Decomposed LoRA)

DoRA (PEFT >= 0.14) decomposes pre-trained weights into magnitude and direction components. It improves fine-tuning quality, especially at smaller ranks.

```python
from peft import LoraConfig

# Standard LoRA
lora_config = LoraConfig(r=64, lora_alpha=16, use_dora=False, ...)

# DoRA — same parameters, better quality
lora_config = LoraConfig(r=64, lora_alpha=16, use_dora=True, ...)

# Note: DoRA adds ~5-10% VRAM overhead vs standard LoRA
# Note: not compatible with quantized (4-bit/8-bit) models in all cases
```

## Qwen2.5 & DeepSeek-R1-Distill Examples

### Qwen2.5 Fine-tuning

```python
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
# For 14B: "Qwen/Qwen2.5-14B-Instruct" (needs 20GB+ VRAM with QLoRA)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # required for Qwen2.5
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Qwen2.5 uses ChatML format — use apply_chat_template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
```

### DeepSeek-R1-Distill Fine-tuning

DeepSeek-R1-Distill models (Qwen-7B, Qwen-14B, Llama-8B, Llama-70B) are reasoning-focused. Fine-tune them to adapt their chain-of-thought style to your domain.

```python
# DeepSeek-R1-Distill variants
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"    # 7B on Qwen2.5 base
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # 8B on Llama3 base
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"  # 14B (needs A100)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)

# DeepSeek-R1 uses <think>...</think> tags for reasoning
# Keep this in training data to preserve chain-of-thought capability
example_format = """<|im_start|>user
Solve: What is 15 * 23?<|im_end|>
<|im_start|>assistant
<think>
15 * 23 = 15 * 20 + 15 * 3 = 300 + 45 = 345
</think>
The answer is 345.<|im_end|>"""

# LoRA target modules for DeepSeek-R1-Distill (Qwen2.5 base)
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
```

## Saving Your Model

```python
# Save LoRA adapter
trainer.save_model("./lora_adapter")

# Save merged model
merged_model.save_pretrained("./full_model")

# Upload to HuggingFace (run `huggingface-cli login` in a shell first)
merged_model.push_to_hub("username/my-model")
```
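To use the saved adapter later, load it on top of the base model with `peft`. A minimal sketch, assuming the `./lora_adapter` path from the save step and a base model matching the one you fine-tuned (requires a GPU and a model download):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # the base model you fine-tuned

base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the trained LoRA adapter to the frozen base weights
model = PeftModel.from_pretrained(base_model, "./lora_adapter")
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Optional: fold the adapter into the base weights for faster inference;
# this produces the merged_model used in the save/upload step above
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./full_model")
```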