# Fine-tune LLM

Train your own custom LLM using efficient fine-tuning techniques on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is LoRA/QLoRA?

* **LoRA** (Low-Rank Adaptation) - Train small adapter layers instead of full model
* **QLoRA** - LoRA with quantization for even less VRAM
* Train a 7B model on a single RTX 3090
* Train a 70B model on a single A100 80GB
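
The bullet points above can be made concrete with a little arithmetic: a rank-`r` adapter on a `d_out x d_in` weight stores two small matrices instead of the full matrix. A quick sketch (the 4096 hidden size is an assumption typical of 7B models):

```python
# Rough estimate of LoRA adapter size vs. a full linear layer.
# A rank-r adapter on a (d_out x d_in) weight stores A (r x d_in)
# and B (d_out x r), so r * (d_in + d_out) parameters in total.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

d = 4096                            # hidden size typical of 7B models (assumption)
full = d * d                        # one full attention projection
adapter = lora_params(d, d, r=64)   # rank-64 adapter for the same layer

print(f"full layer: {full:,} params")
print(f"rank-64 adapter: {adapter:,} params ({adapter / full:.1%} of full)")
```

This is why a 7B model fits on a 24GB card for LoRA: only the small adapters need gradients and optimizer state, while the frozen base weights can additionally be quantized (QLoRA).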

## Requirements

| Model | Method    | Min VRAM | Recommended |
| ----- | --------- | -------- | ----------- |
| 7B    | QLoRA     | 12GB     | RTX 3090    |
| 13B   | QLoRA     | 20GB     | RTX 4090    |
| 70B   | QLoRA     | 48GB     | A100 80GB   |
| 7B    | Full LoRA | 24GB     | RTX 4090    |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
8888/http
6006/http
```

**Command:**

```bash
pip install "transformers>=4.45" "datasets>=2.20" accelerate "peft>=0.14" \
    bitsandbytes "trl>=0.12" wandb jupyterlab && \
jupyter lab --ip=0.0.0.0 --port=8888 --allow-root
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Dataset Preparation

### Chat Format (Recommended)

```json
[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Python?"},
      {"role": "assistant", "content": "Python is a programming language..."}
    ]
  }
]
```

### Instruction Format

```json
[
  {
    "instruction": "Translate to French",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous?"
  }
]
```

### Alpaca Format

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat balanced meals..."
  }
]
```
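
The instruction/Alpaca records above can be converted into the chat `messages` format with a small helper. A sketch using the exact field names from the examples (`to_messages` is a hypothetical helper, not part of any library):

```python
# Convert an instruction/Alpaca-style record into the chat "messages" format.
def to_messages(record, system="You are a helpful assistant."):
    user = record["instruction"]
    if record.get("input"):          # Alpaca rows may have an empty "input"
        user += "\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["output"]},
        ]
    }

rows = [{"instruction": "Translate to French",
         "input": "Hello, how are you?",
         "output": "Bonjour, comment allez-vous?"}]
chat_rows = [to_messages(r) for r in rows]
print(chat_rows[0]["messages"][1]["content"])
# prints:
# Translate to French
# Hello, how are you?
```

Keeping everything in one format up front means the training script only needs a single code path.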

## Supported Modern Models (2025)

| Model                       | HF ID                                     | Min VRAM (QLoRA) |
| --------------------------- | ----------------------------------------- | ---------------- |
| Llama 3.1 / 3.3 8B          | `meta-llama/Llama-3.1-8B-Instruct`        | 12GB             |
| Qwen 2.5 7B / 14B           | `Qwen/Qwen2.5-7B-Instruct`                | 12GB / 20GB      |
| DeepSeek-R1-Distill (7B/8B) | `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` | 12GB             |
| Mistral 7B v0.3             | `mistralai/Mistral-7B-Instruct-v0.3`      | 12GB             |
| Gemma 2 9B                  | `google/gemma-2-9b-it`                    | 14GB             |
| Phi-4 14B                   | `microsoft/phi-4`                         | 20GB             |

## QLoRA Fine-tuning Script

Modern example with PEFT 0.14+, Flash Attention 2, DoRA support, and Qwen2.5 / DeepSeek-R1 compatibility:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# === Configuration ===
# Choose one of: Qwen2.5, DeepSeek-R1-Distill, Llama 3.1, Mistral, etc.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
# MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

DATASET = "your_dataset.json"  # or HuggingFace dataset name
OUTPUT_DIR = "./output"
MAX_SEQ_LENGTH = 4096           # Qwen2.5 supports up to 32K context
USE_DORA = True                 # DoRA improves quality over standard LoRA
USE_FLASH_ATTN = True           # Flash Attention 2 saves VRAM & speeds up

# === Load Model with 4-bit Quantization ===
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Required for Qwen2.5 and DeepSeek
    # Flash Attention 2: requires Ampere+ GPU (RTX 30/40, A100)
    attn_implementation="flash_attention_2" if USE_FLASH_ATTN else "eager",
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# === Configure LoRA with optional DoRA ===
# DoRA (Weight-Decomposed Low-Rank Adaptation) — PEFT >= 0.14 required
# use_dora=True decomposes weights into magnitude + direction for better quality
lora_config = LoraConfig(
    r=64,                    # Rank (higher = more capacity, more VRAM)
    lora_alpha=16,           # Scaling factor (the QLoRA paper pairs alpha=16 with r=64)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",      # MLP layers
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=USE_DORA,        # DoRA: improved quality (PEFT 0.14+)
    # use_rslora=True,        # Optional: Rank-Stabilized LoRA
)

# Prepare model for QLoRA training
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
model = get_peft_model(model, lora_config)

# Print trainable parameters summary
model.print_trainable_parameters()
# Example output: trainable params: 42,991,616 || all params: 7,284,891,648 || trainable%: 0.59

# === Load Dataset ===
dataset = load_dataset("json", data_files=DATASET)
# Or use a public dataset:
# dataset = load_dataset("HuggingFaceH4/ultrachat_200k")

# === Format Dataset for Qwen2.5 / ChatML format ===
def format_chat_qwen(example):
    """Format for Qwen2.5 using ChatML template."""
    messages = example.get("messages", [])
    if not messages:
        # Handle alpaca-style data
        text = f"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        text += f"<|im_start|>user\n{example['instruction']}"
        if example.get("input"):
            text += f"\n{example['input']}"
        text += f"<|im_end|>\n<|im_start|>assistant\n{example['output']}<|im_end|>"
    else:
        # Handle messages format (ChatML)
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    return {"text": text}

dataset = dataset.map(format_chat_qwen, remove_columns=dataset["train"].column_names)

# === Training Arguments (PEFT 0.14+ / TRL 0.12+) ===
training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # Effective batch = 2 * 8 = 16
    learning_rate=2e-4,
    weight_decay=0.001,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,                             # Use bf16 for modern GPUs (A100, RTX 30/40)
    # fp16=True,                           # Use fp16 for older GPUs
    optim="paged_adamw_8bit",
    max_grad_norm=0.3,
    group_by_length=True,
    report_to="wandb",                     # or "tensorboard"
    # SFTConfig-specific:
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=True,                          # Pack multiple examples for efficiency
)

# === Train ===
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # TRL 0.12+ renamed the `tokenizer` argument
    args=training_args,
)

trainer.train()

# === Save LoRA adapter ===
trainer.save_model(f"{OUTPUT_DIR}/final")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/final")
print(f"Model saved to {OUTPUT_DIR}/final")
```
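
The training arguments above determine how many optimizer steps a run takes, which is useful for estimating rental time. A sketch assuming a hypothetical 10,000-example dataset (without packing; packing reduces the step count further):

```python
import math

# Estimate optimizer steps from the training arguments above.
examples = 10_000            # hypothetical dataset size
per_device_batch = 2
grad_accum = 8
epochs = 3

effective_batch = per_device_batch * grad_accum          # gradients applied per step
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)     # 16 625 1875
```

Multiply `total_steps` by your observed seconds-per-step (visible in the trainer logs) to estimate how many GPU hours to rent.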

## Flash Attention 2

Flash Attention 2 reduces VRAM usage and speeds up training significantly. Requires Ampere+ GPU (RTX 3090, RTX 4090, A100).

```bash
# Install Flash Attention 2
pip install flash-attn --no-build-isolation
```

```python
# Enable in model loading:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",  # <-- add this
    torch_dtype=torch.bfloat16,               # FA2 requires bf16 or fp16
    device_map="auto",
)
```

| Setting                   | VRAM (7B) | Speed    |
| ------------------------- | --------- | -------- |
| Standard attention (fp16) | \~22GB    | baseline |
| Flash Attention 2 (bf16)  | \~16GB    | +30%     |
| Flash Attention 2 + QLoRA | \~12GB    | +30%     |

## DoRA (Weight-Decomposed LoRA)

DoRA (PEFT >= 0.14) decomposes pre-trained weights into magnitude and direction components. It improves fine-tuning quality, especially for smaller ranks.

```python
from peft import LoraConfig

# Standard LoRA (remaining arguments as in the script above)
lora_config = LoraConfig(r=64, lora_alpha=16, use_dora=False)

# DoRA — same parameters, better quality
lora_config = LoraConfig(r=64, lora_alpha=16, use_dora=True)
# Note: DoRA adds ~5-10% VRAM overhead vs standard LoRA
# Note: DoRA support for quantized (4-bit/8-bit) base models has caveats;
# check the PEFT release notes for your version
```
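
The decomposition DoRA relies on can be illustrated with one weight column: the column is stored as a scalar magnitude times a unit-norm direction, updates touch the direction, and the magnitude trains as a separate parameter. A toy sketch, not PEFT's actual implementation:

```python
import math

# Toy DoRA-style decomposition of one weight column (illustration only).
w = [3.0, 4.0]                      # one column of a weight matrix

m = math.hypot(*w)                  # magnitude: ||w|| = 5.0
v = [x / m for x in w]              # unit direction: [0.6, 0.8]

# Reconstruction is exact before any update
assert all(abs(m * vi - wi) < 1e-9 for vi, wi in zip(v, w))

# A (LoRA-style) update nudges the direction, which is then renormalized;
# the magnitude m stays an independently trained parameter.
delta = [0.1, -0.1]
u = [vi + di for vi, di in zip(v, delta)]
norm_u = math.hypot(*u)
v_new = [x / norm_u for x in u]

w_new = [m * x for x in v_new]      # same magnitude, rotated direction
print([round(x, 3) for x in w_new])  # [3.536, 3.536]
```

Separating "how large" from "which way" is what lets DoRA recover quality at lower ranks than plain LoRA.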

## Qwen2.5 & DeepSeek-R1-Distill Examples

### Qwen2.5 Fine-tuning

```python
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
# For 14B: "Qwen/Qwen2.5-14B-Instruct" (needs 20GB+ VRAM with QLoRA)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,          # Required for Qwen2.5
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Qwen2.5 uses ChatML format — use apply_chat_template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
```
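
To see what `apply_chat_template` produces for Qwen2.5, the ChatML structure can be rendered by hand. This is a sketch of the format for inspection only; the tokenizer's own template is the authoritative version:

```python
# Render messages in ChatML-style markup (sketch of Qwen2.5's format).
def chatml(messages):
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(chatml(messages))
# prints:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
```

If your training data already contains these markers, prefer `apply_chat_template` over hand-built strings so special tokens are tokenized correctly.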

### DeepSeek-R1-Distill Fine-tuning

DeepSeek-R1-Distill models (Qwen-7B, Qwen-14B, Llama-8B, Llama-70B) are reasoning-focused. Fine-tune to adapt their chain-of-thought style to your domain.

```python
# DeepSeek-R1-Distill variants
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"   # 7B on Qwen2.5 base
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # 8B on Llama3 base
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B" # 14B (needs A100)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)

# DeepSeek-R1 uses <think>...</think> tags for reasoning
# Keep this in training data to preserve chain-of-thought capability
example_format = """<|im_start|>user
Solve: What is 15 * 23?<|im_end|>
<|im_start|>assistant
<think>
15 * 23 = 15 * 20 + 15 * 3 = 300 + 45 = 345
</think>
The answer is 345.<|im_end|>"""

# LoRA target modules for DeepSeek-R1-Distill (Qwen2.5 base)
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
```
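
At inference time you will often want to show users only the final answer, not the reasoning trace. A small helper based on the `<think>...</think>` convention shown above (`strip_think` is a hypothetical name):

```python
import re

# Remove DeepSeek-R1-style <think>...</think> reasoning blocks from output.
def strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

raw = "<think>\n15 * 23 = 345\n</think>\nThe answer is 345."
print(strip_think(raw))  # The answer is 345.
```

Strip the tags only at serving time; removing them from training data would erase the chain-of-thought behavior you are trying to preserve.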

## Using Axolotl (Easier)

Axolotl simplifies fine-tuning with YAML configs:

```bash
pip install axolotl

# Create config
cat > config.yml << 'EOF'
base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16

datasets:
  - path: your_data.json
    type: alpaca

sequence_len: 4096
sample_packing: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 2e-4

output_dir: ./output
EOF

# Train
accelerate launch -m axolotl.cli.train config.yml
```

## Axolotl Config Examples

### Chat Model

```yaml
base_model: mistralai/Mistral-7B-Instruct-v0.2
load_in_4bit: true
adapter: qlora

datasets:
  - path: data.json
    type: sharegpt

chat_template: mistral
```

### Code Model

```yaml
base_model: codellama/CodeLlama-7b-hf
load_in_4bit: true
adapter: qlora

datasets:
  - path: code_data.json
    type: alpaca

sequence_len: 8192  # Longer context for code
```

## Merging LoRA Weights

After training, merge LoRA back into base model:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load LoRA
model = PeftModel.from_pretrained(base_model, "./output/final")

# Merge
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
```

## Convert to GGUF

For use with llama.cpp/Ollama:

```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Convert (convert.py was replaced by convert_hf_to_gguf.py)
pip install -r requirements.txt
python convert_hf_to_gguf.py ../merged_model --outtype f16 --outfile model-f16.gguf

# Quantize (the binary is now called llama-quantize)
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m
```

## Monitoring Training

### Weights & Biases

```python
import wandb
wandb.init(project="llm-finetune", name="mistral-7b-lora")
```

### TensorBoard

```python
# In TrainingArguments / SFTConfig:
report_to="tensorboard",
logging_dir="./logs",
```

Then view via the 6006/http port:

```bash
tensorboard --logdir ./logs --port 6006 --bind_all
```

## Best Practices

### Hyperparameters

| Parameter   | 7B Model | 13B Model | 70B Model |
| ----------- | -------- | --------- | --------- |
| batch\_size | 4        | 2         | 1         |
| grad\_accum | 4        | 8         | 16        |
| lr          | 2e-4     | 1e-4      | 5e-5      |
| lora\_r     | 64       | 32        | 16        |
| epochs      | 3        | 2-3       | 1-2       |

### Dataset Size

* Minimum: 1,000 examples
* Good: 10,000+ examples
* Quality > Quantity
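
Whatever the dataset size, hold out an eval split so the overfitting safeguards below have something to measure against. A plain-Python sketch (`train_eval_split` is a hypothetical helper; `datasets.train_test_split` does the same for HF datasets):

```python
import random

# Shuffle records deterministically and split off an eval fraction.
def train_eval_split(rows, eval_frac=0.1, seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = max(1, int(len(rows) * eval_frac))
    return rows[n_eval:], rows[:n_eval]

data = [{"id": i} for i in range(1000)]
train, eval_set = train_eval_split(data)
print(len(train), len(eval_set))  # 900 100
```

A fixed seed keeps the split reproducible across runs, so eval losses remain comparable between experiments.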

### Avoiding Overfitting

```python
# Add these to the training arguments; also pass an eval_dataset to the trainer
training_args = TrainingArguments(
    output_dir="./output",
    weight_decay=0.01,
    warmup_ratio=0.03,
    save_total_limit=3,
    load_best_model_at_end=True,
    eval_strategy="steps",   # named `evaluation_strategy` before transformers 4.46
    eval_steps=100,
)
```

## Multi-GPU Training

```bash
# With accelerate
accelerate launch --multi_gpu --num_processes 4 train.py

# With DeepSpeed
accelerate launch --use_deepspeed --num_processes 4 train.py
```

DeepSpeed config:

```json
{
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"}
  }
}
```

## Saving & Exporting

```python
# Save LoRA adapter
trainer.save_model("./lora_adapter")

# Save merged model
merged_model.save_pretrained("./full_model")

# Upload to HuggingFace (run `huggingface-cli login` in a shell first)
merged_model.push_to_hub("username/my-model")
```

## Troubleshooting

### OOM Errors

* Reduce batch size
* Increase gradient accumulation
* Use `gradient_checkpointing=True`
* Reduce lora\_r

### Training Loss Not Decreasing

* Check data format
* Increase learning rate
* Check for data issues

### NaN Loss

* Reduce learning rate
* Use bf16 (or fp32) instead of fp16 — fp16 overflows more easily
* Check for corrupted data

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

> 📚 See also: [How to Fine-Tune LLaMA 3 on a Cloud GPU — Step-by-Step Guide](https://blog.clore.ai/how-to-fine-tune-llama-3-cloud-gpu/)

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/training/finetune-llm.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
