# LLaMA-Factory

LLaMA-Factory is one of the most comprehensive open-source fine-tuning frameworks, supporting 100+ language models including the LLaMA family, Qwen, Mistral, Phi, Falcon, ChatGLM, and more. It offers LoRA, QLoRA, full fine-tuning, RLHF, DPO, and PPO, all through an intuitive web interface (LLaMA Board) or the CLI. CLORE.AI's on-demand GPU servers make it an ideal platform for launching fine-tuning jobs at a fraction of the cost of major cloud providers.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum          | Recommended    |
| --------- | ---------------- | -------------- |
| RAM       | 16 GB            | 32 GB+         |
| VRAM      | 8 GB (QLoRA)     | 24 GB+         |
| Disk      | 50 GB            | 200 GB+        |
| GPU       | NVIDIA RTX 2080+ | A100, RTX 4090 |

{% hint style="info" %}
**Training method determines GPU requirements:**

* **QLoRA (4-bit)**: 8 GB VRAM for 7B models, 16 GB for 13B
* **LoRA (float16)**: 16 GB VRAM for 7B models, 40 GB for 13B
* **Full fine-tuning**: \~14 GB VRAM just for a 7B model's fp16 weights, with gradients and optimizer states on top
* Multi-GPU (DeepSpeed/FSDP) scales across any number of GPUs
{% endhint %}
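
The numbers in the hint above follow a common rule of thumb: full fine-tuning with Adam in fp16 costs about 2 bytes/param for weights, 2 for gradients, and 8 for fp32 optimizer states, roughly 12 bytes per parameter before activations. A quick back-of-envelope check:

```shell
# Rough full fine-tuning VRAM estimate (rule of thumb; ignores activations):
# fp16 weights (2 B/param) + fp16 gradients (2 B/param) + fp32 Adam states (8 B/param)
PARAMS_B=7  # model size in billions of parameters
awk -v p="$PARAMS_B" 'BEGIN { printf "~%d GB VRAM before activations\n", p * 12 }'
```

For a 7B model this prints `~84 GB VRAM before activations`, which is why full fine-tuning is listed as A100-class hardware.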

## Quick Deploy on CLORE.AI

**Docker Image:** `hiyouga/llamafactory:latest`

**Ports:** `22/tcp`, `7860/http`

**Environment Variables:**

| Variable               | Example     | Description                              |
| ---------------------- | ----------- | ---------------------------------------- |
| `HF_TOKEN`             | `hf_xxx...` | HuggingFace token for gated models       |
| `WANDB_API_KEY`        | `xxx...`    | Weights & Biases for experiment tracking |
| `CUDA_VISIBLE_DEVICES` | `0,1`       | GPUs to use                              |

## Step-by-Step Setup

### 1. Rent a GPU Server on CLORE.AI

Visit [CLORE.AI Marketplace](https://clore.ai/marketplace) and select based on your task:

| Task       | VRAM   | Recommended GPU |
| ---------- | ------ | --------------- |
| QLoRA 7B   | 8 GB   | RTX 3070/2080   |
| QLoRA 13B  | 16 GB  | RTX 3090/A4000  |
| LoRA 7B    | 16 GB  | RTX 3090/A4000  |
| LoRA 13B   | 40 GB  | A6000/A100 40GB |
| Full FT 7B | 80 GB  | A100 80GB       |
| Multi-GPU  | Varies | 2-8× any GPU    |

### 2. SSH into Your Server

```bash
ssh -p <PORT> root@<SERVER_IP>
```

### 3. Create Working Directories

```bash
mkdir -p /root/llamafactory/{data,models,output,saves}
```

### 4. Pull the Docker Image

```bash
docker pull hiyouga/llamafactory:latest
```

### 5. Launch LLaMA-Factory

**Launch with Web UI (LLaMA Board):**

```bash
docker run -d \
  --name llamafactory \
  --gpus all \
  -p 7860:7860 \
  -v /root/llamafactory/data:/app/LLaMA-Factory/data \
  -v /root/llamafactory/models:/root/.cache/huggingface \
  -v /root/llamafactory/output:/app/LLaMA-Factory/output \
  -v /root/llamafactory/saves:/app/LLaMA-Factory/saves \
  -e HF_TOKEN=hf_your_token_here \
  hiyouga/llamafactory:latest \
  llamafactory-cli webui
```

**With Weights & Biases tracking:**

```bash
docker run -d \
  --name llamafactory \
  --gpus all \
  -p 7860:7860 \
  -v /root/llamafactory/data:/app/LLaMA-Factory/data \
  -v /root/llamafactory/models:/root/.cache/huggingface \
  -v /root/llamafactory/output:/app/LLaMA-Factory/output \
  -v /root/llamafactory/saves:/app/LLaMA-Factory/saves \
  -e HF_TOKEN=hf_your_token_here \
  -e WANDB_API_KEY=your_wandb_key \
  hiyouga/llamafactory:latest \
  llamafactory-cli webui
```

**Multi-GPU (4 GPUs visible; DeepSpeed is enabled later, at training time):**

```bash
docker run -d \
  --name llamafactory \
  --gpus all \
  --shm-size 16g \
  --ipc host \
  -p 7860:7860 \
  -v /root/llamafactory/data:/app/LLaMA-Factory/data \
  -v /root/llamafactory/models:/root/.cache/huggingface \
  -v /root/llamafactory/output:/app/LLaMA-Factory/output \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  hiyouga/llamafactory:latest \
  llamafactory-cli webui
```

### 6. Access the Web Interface

Check logs and get the URL:

```bash
docker logs -f llamafactory
```

Your CLORE.AI http\_pub URL for port 7860:

```
https://<order-id>-7860.clore.ai/
```

***

## Usage Examples

### Example 1: LoRA Fine-Tuning via Web UI (LLaMA Board)

1. Open LLaMA Board at your CLORE.AI URL
2. Go to the **Train** tab
3. Configure:
   * **Model Name**: `LLaMA-3` → `Meta-Llama-3-8B-Instruct`
   * **Training Stage**: `Supervised Fine-Tuning`
   * **Dataset**: Select your dataset (or upload custom)
   * **Fine-tuning method**: `lora`
   * **LoRA rank**: `8` (higher = more parameters trained)
   * **Learning rate**: `1e-4`
   * **Epochs**: `3`
   * **Output dir**: `llama3-finetuned`
4. Click **Start** to begin training
5. Monitor loss curves in the **Loss** chart

### Example 2: CLI-Based QLoRA Fine-Tuning

Prepare a training config YAML:

```bash
docker exec -it llamafactory bash

mkdir -p /app/LLaMA-Factory/configs
cat > /app/LLaMA-Factory/configs/qlora_mistral.yaml << 'EOF'
### Model
model_name_or_path: mistralai/Mistral-7B-Instruct-v0.3

### Method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05

### Dataset
dataset: alpaca_en
template: mistral
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### Output
output_dir: saves/mistral-qlora
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

### Quantization
quantization_method: bitsandbytes
quantization_bit: 4
EOF

# Run training
llamafactory-cli train /app/LLaMA-Factory/configs/qlora_mistral.yaml
```
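
One thing worth checking in the config above: the effective batch size seen by the optimizer is `per_device_train_batch_size × gradient_accumulation_steps × number of GPUs`. For this single-GPU config:

```shell
# Effective batch size = per-device batch x gradient accumulation x GPU count
PER_DEVICE=2; GRAD_ACCUM=8; NUM_GPUS=1
echo $(( PER_DEVICE * GRAD_ACCUM * NUM_GPUS ))
```

This prints `16`. If you raise `per_device_train_batch_size`, lower `gradient_accumulation_steps` accordingly to keep the effective batch (and thus learning-rate behavior) comparable.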

### Example 3: Upload Custom Dataset

Create a custom dataset in Alpaca format:

```bash
# Create dataset file
cat > /root/llamafactory/data/my_dataset.json << 'EOF'
[
  {
    "instruction": "You are a customer service agent for a tech company. Answer helpfully.",
    "input": "My laptop won't turn on after the update. What should I do?",
    "output": "I understand how frustrating that can be. Let's troubleshoot step by step:\n\n1. **Hold power button for 10 seconds** to force shutdown\n2. **Remove power cable** and battery if removable, wait 30 seconds\n3. **Reconnect power** and try turning on\n4. If still no response, press **F8/F11** during boot to access recovery mode\n\nWould you like me to walk you through any of these steps in detail?"
  },
  {
    "instruction": "You are a customer service agent for a tech company.",
    "input": "How do I reset my password?",
    "output": "Resetting your password is easy! Here's how:\n\n1. Go to the login page and click **'Forgot Password'**\n2. Enter your **registered email address**\n3. Check your email for a reset link (check spam folder too)\n4. Click the link and **create a new password**\n\nThe reset link expires in 24 hours. If you don't receive the email within 5 minutes, contact our support team."
  }
]
EOF

# Register the dataset in dataset_info.json (the host data/ dir is mounted at
# /app/LLaMA-Factory/data inside the container, so a bare file name works)
docker exec -it llamafactory python3 -c "
import json
path = '/app/LLaMA-Factory/data/dataset_info.json'
with open(path) as f:
    info = json.load(f)
info['my_dataset'] = {'file_name': 'my_dataset.json'}
with open(path, 'w') as f:
    json.dump(info, f, indent=2)
"
```

Then select `my_dataset` in the LLaMA Board Dataset dropdown.
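
Before pointing training at the file, it is worth confirming that every record carries the three Alpaca keys. A quick check (a sketch; assumes `python3` is available on the server):

```shell
# Verify every record has the Alpaca keys (instruction/input/output)
DATASET=/root/llamafactory/data/my_dataset.json
python3 - "$DATASET" << 'EOF'
import json, sys
with open(sys.argv[1]) as f:
    data = json.load(f)
required = {"instruction", "input", "output"}
bad = [i for i, rec in enumerate(data) if not required <= rec.keys()]
print("records missing keys:", bad if bad else "none")
EOF
```

A clean dataset prints `records missing keys: none`; otherwise you get the indices of the offending records.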

### Example 4: DPO (Direct Preference Optimization)

```yaml
### configs/dpo_llama.yaml

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### Method - DPO
stage: dpo
do_train: true
finetuning_type: lora
lora_rank: 8

### DPO-specific
pref_beta: 0.1
pref_loss: sigmoid  # sigmoid, hinge, ipo

### Dataset (must be preference format)
dataset: dpo_en_demo
template: llama3
cutoff_len: 2048

### Output
output_dir: saves/llama3-dpo
logging_steps: 10
save_steps: 100

### Train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5e-5
num_train_epochs: 1.0
fp16: true
```

```bash
docker exec -it llamafactory bash -c "llamafactory-cli train /app/LLaMA-Factory/configs/dpo_llama.yaml"
```
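
DPO needs a preference dataset: each prompt is paired with a preferred (`chosen`) and a rejected response. A minimal Alpaca-style ranking sample as a sketch (check the project's dataset README for the exact schema, and remember to register the file in `dataset_info.json` with ranking enabled):

```shell
# Minimal preference-format sample (sketch; field names follow the Alpaca-style ranking format)
cat > /root/llamafactory/data/my_dpo_dataset.json << 'EOF'
[
  {
    "instruction": "Explain what DPO training does.",
    "input": "",
    "chosen": "DPO tunes a model toward responses humans preferred, using pairs of accepted and rejected answers.",
    "rejected": "DPO is a type of database."
  }
]
EOF
python3 -m json.tool /root/llamafactory/data/my_dpo_dataset.json > /dev/null && echo "valid JSON"
```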

### Example 5: Inference with Fine-Tuned Model

After training, test your model:

```bash
docker exec -it llamafactory bash

# Interactive chat
llamafactory-cli chat \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.3 \
  --adapter_name_or_path /app/LLaMA-Factory/saves/mistral-qlora \
  --template mistral \
  --finetuning_type lora
```

Or export the merged model:

```bash
llamafactory-cli export \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.3 \
  --adapter_name_or_path /app/LLaMA-Factory/saves/mistral-qlora \
  --template mistral \
  --finetuning_type lora \
  --export_dir /app/LLaMA-Factory/output/mistral-merged \
  --export_size 4 \
  --export_legacy_format false
```

***

## Configuration

### Key Training Parameters

| Parameter                     | Typical Value | Description                          |
| ----------------------------- | ------------- | ------------------------------------ |
| `lora_rank`                   | 8–64          | LoRA rank (higher = more expressive) |
| `lora_alpha`                  | 2× rank       | LoRA alpha scaling                   |
| `lora_dropout`                | 0.0–0.1       | Dropout for LoRA layers              |
| `lora_target`                 | `all`         | Which layers to apply LoRA           |
| `learning_rate`               | `1e-4`        | Starting learning rate               |
| `num_train_epochs`            | 1–5           | Training epochs                      |
| `per_device_train_batch_size` | 1–4           | Batch size per GPU                   |
| `gradient_accumulation_steps` | 4–16          | Effective batch multiplier           |
| `cutoff_len`                  | 1024–4096     | Max sequence length                  |
| `quantization_bit`            | 4 or 8        | QLoRA quantization bits              |
| `warmup_ratio`                | 0.05–0.1      | LR warmup fraction                   |
| `lr_scheduler_type`           | `cosine`      | LR schedule                          |
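
As a rule of thumb for the `lora_rank` row above: each adapted weight matrix adds rank × (d_in + d_out) trainable parameters (the LoRA factors A and B). With illustrative 7B-class dimensions:

```shell
# One LoRA adapter = A (rank x d_in) + B (d_out x rank) trainable weights
RANK=8; D_IN=4096; D_OUT=4096
echo $(( RANK * (D_IN + D_OUT) ))
```

This prints `65536`, about 65K trainable parameters per adapted 4096×4096 projection at rank 8; doubling the rank doubles the count.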

### Supported Fine-tuning Methods

| Method               | Memory Use | Quality   | When to Use        |
| -------------------- | ---------- | --------- | ------------------ |
| `full`               | Very High  | Best      | Unlimited VRAM     |
| `freeze`             | Medium     | Good      | Freeze base layers |
| `lora`               | Low        | Very Good | Default choice     |
| `qlora` (lora+quant) | Lowest     | Good      | Limited VRAM       |

### Multi-GPU DeepSpeed Training

For training on multiple GPUs, force a distributed `torchrun` launch and point it at a DeepSpeed config:

```bash
docker exec -it llamafactory bash -c "
FORCE_TORCHRUN=1 NNODES=1 NODE_RANK=0 MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 \
llamafactory-cli train configs/qlora_mistral.yaml \
  --deepspeed examples/deepspeed/ds_z3_config.json
"
```
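
The image bundles ready-made configs under `examples/deepspeed/`, which are the safer default. If you need to tweak one, a minimal ZeRO-3 sketch using standard DeepSpeed keys looks like this (the `auto` values defer to the trainer's settings):

```shell
# Minimal ZeRO-3 DeepSpeed config (sketch; the bundled ds_z3_config.json is the safer default)
cat > /app/LLaMA-Factory/configs/ds_z3_min.json << 'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF
python3 -m json.tool /app/LLaMA-Factory/configs/ds_z3_min.json > /dev/null && echo "valid JSON"
```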

***

## Performance Tips

### 1. Optimal QLoRA Settings by GPU

**8 GB VRAM (RTX 3070):**

```yaml
quantization_bit: 4
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
cutoff_len: 1024
```

**24 GB VRAM (RTX 3090/4090):**

```yaml
quantization_bit: 4  # Still use QLoRA for larger batch size
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
cutoff_len: 2048
```

**80 GB VRAM (A100):**

```yaml
# No quantization needed — use LoRA directly
finetuning_type: lora
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
cutoff_len: 4096
fp16: true
```

### 2. Flash Attention 2 for Longer Contexts

```yaml
flash_attn: fa2  # Requires Ampere+ GPU
```

This allows roughly 2× longer sequences in the same VRAM budget.

### 3. Gradient Checkpointing

Saves VRAM at the cost of \~20% slower training:

```yaml
gradient_checkpointing: true
```

### 4. Choose the Right LoRA Target

```yaml
lora_target: all  # All linear layers (default, best quality)
# or
lora_target: q_proj,v_proj  # Minimal, fastest, lower quality
```

### 5. Freeze Top Layers for Fast Adaptation

```yaml
finetuning_type: freeze
freeze_trainable_layers: 2   # Train only top 2 layers
freeze_trainable_modules: all
```

Much faster than full LoRA for simple task adaptation.

### 6. Monitor with TensorBoard

```bash
# In a separate terminal
docker exec -it llamafactory bash -c "
tensorboard --logdir /app/LLaMA-Factory/saves --host 0.0.0.0 --port 6006
"
```

Add port 6006 to your CLORE.AI order to access TensorBoard.

***

## Troubleshooting

### Problem: "CUDA out of memory" during training

1. Reduce batch size: `per_device_train_batch_size: 1`
2. Enable gradient checkpointing: `gradient_checkpointing: true`
3. Reduce context length: `cutoff_len: 512`
4. Use QLoRA (4-bit): `quantization_bit: 4`
5. Reduce LoRA rank: `lora_rank: 4`

### Problem: Training loss not decreasing

* Check learning rate — try `5e-5` or `2e-4`
* Verify dataset format matches template
* Increase `lora_rank` (8→16→32)
* Check that `lora_target: all` is set

### Problem: Slow training speed

```bash
# Check GPU utilization inside container
docker exec -it llamafactory bash -c "watch -n 1 nvidia-smi"
```

If GPU utilization stays below 80%:

* Increase batch size
* Use Flash Attention: `flash_attn: fa2`
* Remove `gradient_checkpointing` if VRAM allows

### Problem: Model not found in web UI

```bash
# Pre-download to cache volume
docker exec -it llamafactory bash -c "
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3
"
```

Then refresh the model list in LLaMA Board.

### Problem: Dataset format errors

All dataset formats must match `dataset_info.json` specification:

```bash
# Validate dataset
docker exec -it llamafactory python3 -c "
import json
with open('/app/LLaMA-Factory/data/my_dataset.json') as f:
    data = json.load(f)
print(f'Dataset has {len(data)} samples')
print('First sample keys:', list(data[0].keys()))
"
```

### Problem: WebUI port not accessible

Ensure LLaMA-Factory started the Gradio server:

```bash
docker logs llamafactory 2>&1 | grep -E "Running on|Error|Traceback"
```

As an alternative, set `GRADIO_SHARE=1` in the container environment when starting `llamafactory-cli webui` to get a public Gradio URL.

***

## Links

* [GitHub](https://github.com/hiyouga/LLaMA-Factory)
* [Documentation](https://llamafactory.readthedocs.io)
* [Docker Hub (hiyouga)](https://hub.docker.com/r/hiyouga/llamafactory)
* [Supported Models](https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#supported-models)
* [Dataset Format](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

***

## Clore.ai GPU Recommendations

| Use Case             | Recommended GPU | Est. Cost on Clore.ai |
| -------------------- | --------------- | --------------------- |
| Development/Testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Fine-tuning (7B–13B) | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Models (70B+)  | A100 80GB       | \~$1.20/gpu/hr        |
| Multi-GPU Training   | 2-4x A100 80GB  | \~$2.40–$4.80/hr      |
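
Using the example rates above, a rough cost estimate is just runtime × hourly rate. For an assumed 3-hour QLoRA run on a single RTX 4090 at \~$0.70/hr (illustrative numbers only; actual rates and runtimes vary):

```shell
# Ballpark training cost: hours x hourly rate (illustrative numbers)
awk 'BEGIN { printf "~$%.2f total\n", 3 * 0.70 }'
```

This prints `~$2.10 total`.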

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
