# Fine-tune LLM

Train your own custom LLM using efficient fine-tuning techniques on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is LoRA/QLoRA?

* **LoRA** (Low-Rank Adaptation) - Train small adapter layers instead of full model
* **QLoRA** - LoRA with quantization for even less VRAM
* Train a 7B model on a single RTX 3090
* Train a 70B model on a single A100 80GB
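
The bullet points above can be made concrete with a little arithmetic: a rank-`r` adapter on a `d_out x d_in` weight stores two small matrices instead of the full matrix. A quick sketch (the 4096 hidden size is an assumption typical of 7B models):

```python
# Rough estimate of LoRA adapter size vs. a full linear layer.
# A rank-r adapter on a (d_out x d_in) weight stores A (r x d_in)
# and B (d_out x r), so r * (d_in + d_out) parameters in total.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

d = 4096                            # hidden size typical of 7B models (assumption)
full = d * d                        # one full attention projection
adapter = lora_params(d, d, r=64)   # rank-64 adapter for the same layer

print(f"full layer: {full:,} params")
print(f"rank-64 adapter: {adapter:,} params ({adapter / full:.1%} of full)")
```

This is why a 7B model fits on a 24GB card for LoRA: only the small adapters need gradients and optimizer state, while the frozen base weights can additionally be quantized (QLoRA).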

## Requirements

| Model | Method    | Min VRAM | Recommended |
| ----- | --------- | -------- | ----------- |
| 7B    | QLoRA     | 12GB     | RTX 3090    |
| 13B   | QLoRA     | 20GB     | RTX 4090    |
| 70B   | QLoRA     | 48GB     | A100 80GB   |
| 7B    | Full LoRA | 24GB     | RTX 4090    |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
8888/http
6006/http
```

**Command:**

```bash
pip install "transformers>=4.45" "datasets>=2.20" accelerate "peft>=0.14" \
    bitsandbytes "trl>=0.12" wandb jupyterlab && \
jupyter lab --ip=0.0.0.0 --port=8888 --allow-root
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Dataset Preparation

### Chat Format (Recommended)

```json
[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Python?"},
      {"role": "assistant", "content": "Python is a programming language..."}
    ]
  }
]
```

### Instruction Format

```json
[
  {
    "instruction": "Translate to French",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous?"
  }
]
```

### Alpaca Format

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat balanced meals..."
  }
]
```
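
The instruction/Alpaca records above can be converted into the chat `messages` format with a small helper. A sketch using the exact field names from the examples (`to_messages` is a hypothetical helper, not part of any library):

```python
# Convert an instruction/Alpaca-style record into the chat "messages" format.
def to_messages(record, system="You are a helpful assistant."):
    user = record["instruction"]
    if record.get("input"):          # Alpaca rows may have an empty "input"
        user += "\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["output"]},
        ]
    }

rows = [{"instruction": "Translate to French",
         "input": "Hello, how are you?",
         "output": "Bonjour, comment allez-vous?"}]
chat_rows = [to_messages(r) for r in rows]
print(chat_rows[0]["messages"][1]["content"])
# prints:
# Translate to French
# Hello, how are you?
```

Keeping everything in one format up front means the training script only needs a single code path.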

## Supported Modern Models (2025)

| Model                       | HF ID                                     | Min VRAM (QLoRA) |
| --------------------------- | ----------------------------------------- | ---------------- |
| Llama 3.1 / 3.3 8B          | `meta-llama/Llama-3.1-8B-Instruct`        | 12GB             |
| Qwen 2.5 7B / 14B           | `Qwen/Qwen2.5-7B-Instruct`                | 12GB / 20GB      |
| DeepSeek-R1-Distill (7B/8B) | `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` | 12GB             |
| Mistral 7B v0.3             | `mistralai/Mistral-7B-Instruct-v0.3`      | 12GB             |
| Gemma 2 9B                  | `google/gemma-2-9b-it`                    | 14GB             |
| Phi-4 14B                   | `microsoft/phi-4`                         | 20GB             |

## QLoRA Fine-tuning Script

Modern example with PEFT 0.14+, Flash Attention 2, DoRA support, and Qwen2.5 / DeepSeek-R1 compatibility:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# === Configuration ===
# Choose one of: Qwen2.5, DeepSeek-R1-Distill, Llama 3.1, Mistral, etc.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
# MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

DATASET = "your_dataset.json"  # or HuggingFace dataset name
OUTPUT_DIR = "./output"
MAX_SEQ_LENGTH = 4096           # Qwen2.5 supports up to 32K context
USE_DORA = True                 # DoRA improves quality over standard LoRA
USE_FLASH_ATTN = True           # Flash Attention 2 saves VRAM & speeds up

# === Load Model with 4-bit Quantization ===
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Required for Qwen2.5 and DeepSeek
    # Flash Attention 2: requires Ampere+ GPU (RTX 30/40, A100)
    attn_implementation="flash_attention_2" if USE_FLASH_ATTN else "eager",
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# === Configure LoRA with optional DoRA ===
# DoRA (Weight-Decomposed Low-Rank Adaptation) — PEFT >= 0.14 required
# use_dora=True decomposes weights into magnitude + direction for better quality
lora_config = LoraConfig(
    r=64,                    # Rank (higher = more capacity, more VRAM)
    lora_alpha=16,           # Scaling factor (the QLoRA paper pairs alpha=16 with r=64)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",      # MLP layers
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=USE_DORA,        # DoRA: improved quality (PEFT 0.14+)
    # use_rslora=True,        # Optional: Rank-Stabilized LoRA
)

# Prepare model for QLoRA training
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
model = get_peft_model(model, lora_config)

# Print trainable parameters summary
model.print_trainable_parameters()
# Example output: trainable params: 42,991,616 || all params: 7,284,891,648 || trainable%: 0.59

# === Load Dataset ===
dataset = load_dataset("json", data_files=DATASET)
# Or use a public dataset:
# dataset = load_dataset("HuggingFaceH4/ultrachat_200k")

# === Format Dataset for Qwen2.5 / ChatML format ===
def format_chat_qwen(example):
    """Format for Qwen2.5 using ChatML template."""
    messages = example.get("messages", [])
    if not messages:
        # Handle alpaca-style data
        text = f"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        text += f"<|im_start|>user\n{example['instruction']}"
        if example.get("input"):
            text += f"\n{example['input']}"
        text += f"<|im_end|>\n<|im_start|>assistant\n{example['output']}<|im_end|>"
    else:
        # Handle messages format (ChatML)
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    return {"text": text}

dataset = dataset.map(format_chat_qwen, remove_columns=dataset["train"].column_names)

# === Training Arguments (PEFT 0.14+ / TRL 0.12+) ===
training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # Effective batch = 2 * 8 = 16
    learning_rate=2e-4,
    weight_decay=0.001,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,                             # Use bf16 for modern GPUs (A100, RTX 30/40)
    # fp16=True,                           # Use fp16 for older GPUs
    optim="paged_adamw_8bit",
    max_grad_norm=0.3,
    group_by_length=True,
    report_to="wandb",                     # or "tensorboard"
    # SFTConfig-specific:
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=True,                          # Pack multiple examples for efficiency
)

# === Train ===
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # TRL 0.12+ renamed the `tokenizer` argument
    args=training_args,
)

trainer.train()

# === Save LoRA adapter ===
trainer.save_model(f"{OUTPUT_DIR}/final")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/final")
print(f"Model saved to {OUTPUT_DIR}/final")
```
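
The training arguments above determine how many optimizer steps a run takes, which is useful for estimating rental time. A sketch assuming a hypothetical 10,000-example dataset (without packing; packing reduces the step count further):

```python
import math

# Estimate optimizer steps from the training arguments above.
examples = 10_000            # hypothetical dataset size
per_device_batch = 2
grad_accum = 8
epochs = 3

effective_batch = per_device_batch * grad_accum          # gradients applied per step
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)     # 16 625 1875
```

Multiply `total_steps` by your observed seconds-per-step (visible in the trainer logs) to estimate how many GPU hours to rent.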

## Flash Attention 2

Flash Attention 2 reduces VRAM usage and speeds up training significantly. Requires Ampere+ GPU (RTX 3090, RTX 4090, A100).

```bash
# Install Flash Attention 2
pip install flash-attn --no-build-isolation
```

```python
# Enable in model loading:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",  # <-- add this
    torch_dtype=torch.bfloat16,               # FA2 requires bf16 or fp16
    device_map="auto",
)
```

| Setting                   | VRAM (7B) | Speed    |
| ------------------------- | --------- | -------- |
| Standard attention (fp16) | \~22GB    | baseline |
| Flash Attention 2 (bf16)  | \~16GB    | +30%     |
| Flash Attention 2 + QLoRA | \~12GB    | +30%     |

## DoRA (Weight-Decomposed LoRA)

DoRA (PEFT >= 0.14) decomposes pre-trained weights into magnitude and direction components. It improves fine-tuning quality, especially for smaller ranks.

```python
from peft import LoraConfig

# Standard LoRA (remaining arguments as in the script above)
lora_config = LoraConfig(r=64, lora_alpha=16, use_dora=False)

# DoRA — same parameters, better quality
lora_config = LoraConfig(r=64, lora_alpha=16, use_dora=True)
# Note: DoRA adds ~5-10% VRAM overhead vs standard LoRA
# Note: DoRA support for quantized (4-bit/8-bit) base models has caveats;
# check the PEFT release notes for your version
```
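
The decomposition DoRA relies on can be illustrated with one weight column: the column is stored as a scalar magnitude times a unit-norm direction, updates touch the direction, and the magnitude trains as a separate parameter. A toy sketch, not PEFT's actual implementation:

```python
import math

# Toy DoRA-style decomposition of one weight column (illustration only).
w = [3.0, 4.0]                      # one column of a weight matrix

m = math.hypot(*w)                  # magnitude: ||w|| = 5.0
v = [x / m for x in w]              # unit direction: [0.6, 0.8]

# Reconstruction is exact before any update
assert all(abs(m * vi - wi) < 1e-9 for vi, wi in zip(v, w))

# A (LoRA-style) update nudges the direction, which is then renormalized;
# the magnitude m stays an independently trained parameter.
delta = [0.1, -0.1]
u = [vi + di for vi, di in zip(v, delta)]
norm_u = math.hypot(*u)
v_new = [x / norm_u for x in u]

w_new = [m * x for x in v_new]      # same magnitude, rotated direction
print([round(x, 3) for x in w_new])  # [3.536, 3.536]
```

Separating "how large" from "which way" is what lets DoRA recover quality at lower ranks than plain LoRA.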

## Qwen2.5 & DeepSeek-R1-Distill Examples

### Qwen2.5 Fine-tuning

```python
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
# For 14B: "Qwen/Qwen2.5-14B-Instruct" (needs 20GB+ VRAM with QLoRA)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,          # Required for Qwen2.5
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Qwen2.5 uses ChatML format — use apply_chat_template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
```
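
To see what `apply_chat_template` produces for Qwen2.5, the ChatML structure can be rendered by hand. This is a sketch of the format for inspection only; the tokenizer's own template is the authoritative version:

```python
# Render messages in ChatML-style markup (sketch of Qwen2.5's format).
def chatml(messages):
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(chatml(messages))
# prints:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
```

If your training data already contains these markers, prefer `apply_chat_template` over hand-built strings so special tokens are tokenized correctly.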

### DeepSeek-R1-Distill Fine-tuning

DeepSeek-R1-Distill models (Qwen-7B, Qwen-14B, Llama-8B, Llama-70B) are reasoning-focused. Fine-tune to adapt their chain-of-thought style to your domain.

```python
# DeepSeek-R1-Distill variants
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"   # 7B on Qwen2.5 base
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # 8B on Llama3 base
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B" # 14B (needs A100)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)

# DeepSeek-R1 uses <think>...</think> tags for reasoning
# Keep this in training data to preserve chain-of-thought capability
example_format = """<|im_start|>user
Solve: What is 15 * 23?<|im_end|>
<|im_start|>assistant
<think>
15 * 23 = 15 * 20 + 15 * 3 = 300 + 45 = 345
</think>
The answer is 345.<|im_end|>"""

# LoRA target modules for DeepSeek-R1-Distill (Qwen2.5 base)
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
```
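
At inference time you will often want to show users only the final answer, not the reasoning trace. A small helper based on the `<think>...</think>` convention shown above (`strip_think` is a hypothetical name):

```python
import re

# Remove DeepSeek-R1-style <think>...</think> reasoning blocks from output.
def strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

raw = "<think>\n15 * 23 = 345\n</think>\nThe answer is 345."
print(strip_think(raw))  # The answer is 345.
```

Strip the tags only at serving time; removing them from training data would erase the chain-of-thought behavior you are trying to preserve.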

## Using Axolotl (Easier)

Axolotl simplifies fine-tuning with YAML configs:

```bash
pip install axolotl

# Create config
cat > config.yml << 'EOF'
base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16

datasets:
  - path: your_data.json
    type: alpaca

sequence_len: 4096
sample_packing: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 2e-4

output_dir: ./output
EOF

# Train
accelerate launch -m axolotl.cli.train config.yml
```

## Axolotl Config Examples

### Chat Model

```yaml
base_model: mistralai/Mistral-7B-Instruct-v0.2
load_in_4bit: true
adapter: qlora

datasets:
  - path: data.json
    type: sharegpt

chat_template: mistral
```

### Code Model

```yaml
base_model: codellama/CodeLlama-7b-hf
load_in_4bit: true
adapter: qlora

datasets:
  - path: code_data.json
    type: alpaca

sequence_len: 8192  # Longer context for code
```

## Merging LoRA Weights

After training, merge LoRA back into base model:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load LoRA
model = PeftModel.from_pretrained(base_model, "./output/final")

# Merge
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
```

## Convert to GGUF

For use with llama.cpp/Ollama:

```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Convert (convert.py was replaced by convert_hf_to_gguf.py)
pip install -r requirements.txt
python convert_hf_to_gguf.py ../merged_model --outtype f16 --outfile model-f16.gguf

# Quantize (the binary is now called llama-quantize)
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m
```

## Monitoring Training

### Weights & Biases

```python
import wandb
wandb.init(project="llm-finetune", name="mistral-7b-lora")
```

### TensorBoard

```python
# In TrainingArguments / SFTConfig:
report_to="tensorboard",
logging_dir="./logs",
```

Then view via the 6006/http port:

```bash
tensorboard --logdir ./logs --port 6006 --bind_all
```

## Best Practices

### Hyperparameters

| Parameter   | 7B Model | 13B Model | 70B Model |
| ----------- | -------- | --------- | --------- |
| batch\_size | 4        | 2         | 1         |
| grad\_accum | 4        | 8         | 16        |
| lr          | 2e-4     | 1e-4      | 5e-5      |
| lora\_r     | 64       | 32        | 16        |
| epochs      | 3        | 2-3       | 1-2       |

### Dataset Size

* Minimum: 1,000 examples
* Good: 10,000+ examples
* Quality > Quantity
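
Whatever the dataset size, hold out an eval split so the overfitting safeguards below have something to measure against. A plain-Python sketch (`train_eval_split` is a hypothetical helper; `datasets.train_test_split` does the same for HF datasets):

```python
import random

# Shuffle records deterministically and split off an eval fraction.
def train_eval_split(rows, eval_frac=0.1, seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = max(1, int(len(rows) * eval_frac))
    return rows[n_eval:], rows[:n_eval]

data = [{"id": i} for i in range(1000)]
train, eval_set = train_eval_split(data)
print(len(train), len(eval_set))  # 900 100
```

A fixed seed keeps the split reproducible across runs, so eval losses remain comparable between experiments.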

### Avoiding Overfitting

```python
# Add these to the training arguments; also pass an eval_dataset to the trainer
training_args = TrainingArguments(
    output_dir="./output",
    weight_decay=0.01,
    warmup_ratio=0.03,
    save_total_limit=3,
    load_best_model_at_end=True,
    eval_strategy="steps",   # named `evaluation_strategy` before transformers 4.46
    eval_steps=100,
)
```

## Multi-GPU Training

```bash
# With accelerate
accelerate launch --multi_gpu --num_processes 4 train.py

# With DeepSpeed
accelerate launch --use_deepspeed --num_processes 4 train.py
```

DeepSpeed config:

```json
{
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"}
  }
}
```

## Saving & Exporting

```python
# Save LoRA adapter
trainer.save_model("./lora_adapter")

# Save merged model
merged_model.save_pretrained("./full_model")

# Upload to HuggingFace (run `huggingface-cli login` in a shell first)
merged_model.push_to_hub("username/my-model")
```

## Troubleshooting

### OOM Errors

* Reduce batch size
* Increase gradient accumulation
* Use `gradient_checkpointing=True`
* Reduce lora\_r

### Training Loss Not Decreasing

* Check data format
* Increase learning rate
* Check for data issues

### NaN Loss

* Reduce learning rate
* Use bf16 (or fp32) instead of fp16 — fp16 overflows more easily
* Check for corrupted data

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

> 📚 See also: [How to Fine-Tune LLaMA 3 on a Cloud GPU — Step-by-Step Guide](https://blog.clore.ai/how-to-fine-tune-llama-3-cloud-gpu/)

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/training/finetune-llm.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
