LLaMA-Factory

Fine-tune 100+ LLMs with LoRA/QLoRA and a web UI on Clore.ai GPUs using LLaMA-Factory

LLaMA-Factory is one of the most comprehensive open-source fine-tuning frameworks, supporting 100+ language models including the LLaMA family, Qwen, Mistral, Phi, Falcon, ChatGLM, and more. It offers LoRA, QLoRA, full fine-tuning, and preference-optimization methods such as RLHF (PPO) and DPO, all through an intuitive web interface (LLaMA Board) or a CLI. CLORE.AI's on-demand GPU servers make it a practical platform for launching fine-tuning jobs at a fraction of the cost of major cloud providers.

Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 16 GB | 32 GB+ |
| VRAM | 8 GB (QLoRA) | 24 GB+ |
| Disk | 50 GB | 200 GB+ |
| GPU | NVIDIA RTX 2080+ | A100, RTX 4090 |

Training method determines GPU requirements:

  • QLoRA (4-bit): 8 GB VRAM for 7B models, 16 GB for 13B

  • LoRA (float16): 16 GB VRAM for 7B models, 40 GB for 13B

  • Full fine-tuning: ~14 GB VRAM just for a 7B model's fp16 weights, plus gradients and optimizer states on top

  • Multi-GPU (DeepSpeed/FSDP) scales across any number of GPUs

Quick Deploy on CLORE.AI

Docker Image: hiyouga/llamafactory:latest

Ports: 22/tcp, 7860/http

Environment Variables:

| Variable | Example | Description |
| --- | --- | --- |
| HF_TOKEN | hf_xxx... | HuggingFace token for gated models |
| WANDB_API_KEY | xxx... | Weights & Biases key for experiment tracking |
| CUDA_VISIBLE_DEVICES | 0,1 | GPUs to use |

Step-by-Step Setup

1. Rent a GPU Server on CLORE.AI

Visit the CLORE.AI Marketplace and select based on your task:

| Task | VRAM | Recommended GPU |
| --- | --- | --- |
| QLoRA 7B | 8 GB | RTX 3070/2080 |
| QLoRA 13B | 16 GB | RTX 3090/A4000 |
| LoRA 7B | 16 GB | RTX 3090/A4000 |
| LoRA 13B | 40 GB | A6000/A100 40GB |
| Full FT 7B | 80 GB | A100 80GB |
| Multi-GPU | Varies | 2–8× any GPU |

2. SSH into Your Server
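The connection details (host, port, user) are shown on your CLORE.AI order page; the values below are placeholders:

```shell
# Replace the host and port with the values from your CLORE.AI order page
ssh -p 10001 root@your-server.clore.ai
```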

3. Create Working Directories
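For example, assuming a home-directory layout (the paths are arbitrary; they will be bind-mounted into the container in the next steps):

```shell
# Host directories for datasets, checkpoints, and downloaded models
mkdir -p ~/llama-factory/data ~/llama-factory/output ~/llama-factory/models
```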

4. Pull the Docker Image
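The image bundles CUDA and PyTorch, so the pull can take a while:

```shell
docker pull hiyouga/llamafactory:latest
```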

5. Launch LLaMA-Factory

Launch with Web UI (LLaMA Board):
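A minimal sketch, assuming the default Gradio port 7860 and the host directories created earlier; container-side mount paths may differ between image versions:

```shell
docker run -d --gpus all --name llamafactory \
  -p 7860:7860 \
  -e GRADIO_SERVER_NAME=0.0.0.0 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/llama-factory/data:/app/data" \
  -v "$HOME/llama-factory/output:/app/output" \
  hiyouga/llamafactory:latest \
  llamafactory-cli webui
```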

With Weights & Biases tracking:
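The same launch with the W&B key passed through; set report_to: wandb in your training config to actually log runs:

```shell
docker run -d --gpus all --name llamafactory \
  -p 7860:7860 \
  -e GRADIO_SERVER_NAME=0.0.0.0 \
  -e WANDB_API_KEY="$WANDB_API_KEY" \
  hiyouga/llamafactory:latest \
  llamafactory-cli webui
```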

Multi-GPU with DeepSpeed (4 GPUs):
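One possible shape, using LLaMA-Factory's FORCE_TORCHRUN launcher and one of the repo's DeepSpeed example configs (the exact config path depends on the repo version):

```shell
docker run --gpus all --rm \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e FORCE_TORCHRUN=1 \
  -v "$HOME/llama-factory/output:/app/output" \
  hiyouga/llamafactory:latest \
  llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml
```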

6. Access the Web Interface

Check logs and get the URL:
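Assuming the container was named llamafactory at launch:

```shell
docker logs -f llamafactory
# a line like "Running on local URL: http://0.0.0.0:7860" confirms the UI is up
```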

Then open your CLORE.AI http_pub URL for port 7860 in a browser.


Usage Examples

Example 1: LoRA Fine-Tuning via Web UI (LLaMA Board)

  1. Open LLaMA Board at your CLORE.AI URL

  2. Go to the Train tab

  3. Configure:

    • Model Name: Meta-Llama-3-8B-Instruct

    • Training Stage: Supervised Fine-Tuning

    • Dataset: Select your dataset (or upload custom)

    • Fine-tuning method: lora

    • LoRA rank: 8 (higher = more parameters trained)

    • Learning rate: 1e-4

    • Epochs: 3

    • Output dir: llama3-finetuned

  4. Click Start to begin training

  5. Monitor loss curves in the Loss chart

Example 2: CLI-Based QLoRA Fine-Tuning

Prepare a training config YAML:
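A minimal sketch of a QLoRA SFT config; the model, dataset, and output names are examples (alpaca_en_demo is a demo dataset that ships with LLaMA-Factory):

```yaml
### qlora_sft.yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
quantization_bit: 4            # 4-bit base weights (QLoRA)

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

dataset: alpaca_en_demo
template: llama3
cutoff_len: 1024

output_dir: output/llama3-qlora
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
```

Start training inside the container with `llamafactory-cli train qlora_sft.yaml`.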

Example 3: Upload Custom Dataset

Create a custom dataset in Alpaca format:
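For example (filenames are arbitrary), a tiny Alpaca-format file with instruction/input/output fields, plus a registration entry. Note that in the container data/dataset_info.json already exists, so add the entry to it rather than overwriting the file:

```shell
mkdir -p data

# A minimal Alpaca-format dataset
cat > data/my_dataset.json <<'EOF'
[
  {
    "instruction": "Summarize the following text.",
    "input": "CLORE.AI is a marketplace for renting GPU servers by the hour.",
    "output": "CLORE.AI lets you rent GPU servers hourly with no commitments."
  }
]
EOF

# Register the file so it shows up in the Dataset dropdown
cat > data/dataset_info.json <<'EOF'
{
  "my_dataset": {
    "file_name": "my_dataset.json"
  }
}
EOF
```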

Then select my_dataset in the LLaMA Board Dataset dropdown.

Example 4: DPO (Direct Preference Optimization)
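DPO trains on preference pairs (a chosen and a rejected response per prompt). A sketch of a LoRA-based DPO config; dpo_en_demo is a demo preference dataset bundled with LLaMA-Factory, and pref_beta is the DPO temperature:

```yaml
### dpo.yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
stage: dpo
do_train: true
finetuning_type: lora
lora_target: all
pref_beta: 0.1

dataset: dpo_en_demo
template: llama3

output_dir: output/llama3-dpo
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 1.0
```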

Example 5: Inference with Fine-Tuned Model

After training, test your model:
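For example, loading the base model together with the trained LoRA adapter (paths follow the earlier examples):

```shell
llamafactory-cli chat \
  --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
  --adapter_name_or_path output/llama3-qlora \
  --template llama3
```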

Or export the merged model:
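A sketch that merges the adapter into the base weights and writes a standalone model directory:

```shell
llamafactory-cli export \
  --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
  --adapter_name_or_path output/llama3-qlora \
  --template llama3 \
  --export_dir output/llama3-merged
```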


Configuration

Key Training Parameters

| Parameter | Typical Value | Description |
| --- | --- | --- |
| lora_rank | 8–64 | LoRA rank (higher = more expressive) |
| lora_alpha | 2× rank | LoRA alpha scaling |
| lora_dropout | 0.0–0.1 | Dropout for LoRA layers |
| lora_target | all | Which layers to apply LoRA to |
| learning_rate | 1e-4 | Starting learning rate |
| num_train_epochs | 1–5 | Training epochs |
| per_device_train_batch_size | 1–4 | Batch size per GPU |
| gradient_accumulation_steps | 4–16 | Effective batch multiplier |
| cutoff_len | 1024–4096 | Max sequence length |
| quantization_bit | 4 or 8 | QLoRA quantization bits |
| warmup_ratio | 0.05–0.1 | LR warmup fraction |
| lr_scheduler_type | cosine | LR schedule |

Supported Fine-tuning Methods

| Method | Memory Use | Quality | When to Use |
| --- | --- | --- | --- |
| full | Very High | Best | Unlimited VRAM |
| freeze | Medium | Good | Freeze base layers |
| lora | Low | Very Good | Default choice |
| qlora (lora + quant) | Lowest | Good | Limited VRAM |

Multi-GPU DeepSpeed Training

For training on multiple GPUs, launch with torchrun:
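LLaMA-Factory wraps torchrun itself when FORCE_TORCHRUN is set; my_sft.yaml is a placeholder for your own training config, with a DeepSpeed ZeRO JSON referenced via the deepspeed key:

```shell
# Launches one process per visible GPU via torchrun
FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1,2,3 \
  llamafactory-cli train my_sft.yaml
```

In my_sft.yaml, add e.g. `deepspeed: examples/deepspeed/ds_z3_config.json` (the example path may differ by repo version).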


Performance Tips

1. Optimal QLoRA Settings by GPU

8 GB VRAM (RTX 3070):
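One plausible set of values for a 7B model (not tuned for any particular dataset):

```yaml
quantization_bit: 4
lora_rank: 8
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
cutoff_len: 512
gradient_checkpointing: true
```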

24 GB VRAM (RTX 3090/4090):
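With more headroom, larger batches and contexts become feasible (values are illustrative):

```yaml
quantization_bit: 4
lora_rank: 16
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
cutoff_len: 2048
flash_attn: fa2
```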

80 GB VRAM (A100):
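At this size quantization is usually unnecessary for 7B–13B models; a sketch of plain LoRA in bf16:

```yaml
lora_rank: 64
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
cutoff_len: 4096
bf16: true
```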

2. Flash Attention 2 for Longer Contexts
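Enable it in the training config (or the extra-arguments field in LLaMA Board):

```yaml
flash_attn: fa2
```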

This enables training with 2× longer sequences on the same VRAM.

3. Gradient Checkpointing

Saves VRAM at the cost of ~20% slower training:
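Enable it in the training config:

```yaml
gradient_checkpointing: true
```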

4. Choose the Right LoRA Target
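lora_target: all adapts every linear layer and is the usual default; restricting it to the attention projections trains fewer parameters and runs faster, at some quality cost:

```yaml
lora_target: all
# or, narrower and faster:
# lora_target: q_proj,v_proj
```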

5. Freeze Top Layers for Fast Adaptation
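A sketch using LLaMA-Factory's freeze tuning, which trains only the last few transformer layers (the parameter name may vary by version):

```yaml
finetuning_type: freeze
freeze_trainable_layers: 2
```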

Much faster than full LoRA for simple task adaptation.

6. Monitor with TensorBoard
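In the training config (output paths follow the earlier examples):

```yaml
report_to: tensorboard
logging_dir: output/llama3-qlora/runs
```

Then run `tensorboard --logdir output/llama3-qlora/runs --host 0.0.0.0 --port 6006` on the server.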

Add port 6006 to your CLORE.AI order to access TensorBoard.


Troubleshooting

Problem: "CUDA out of memory" during training

  1. Reduce batch size: per_device_train_batch_size: 1

  2. Enable gradient checkpointing: gradient_checkpointing: true

  3. Reduce context length: cutoff_len: 512

  4. Use QLoRA (4-bit): quantization_bit: 4

  5. Reduce LoRA rank: lora_rank: 4

Problem: Training loss not decreasing

  • Check learning rate — try 5e-5 or 2e-4

  • Verify dataset format matches template

  • Increase lora_rank (8→16→32)

  • Check that lora_target: all is set

Problem: Slow training speed

If GPU is < 80% utilized:

  • Increase batch size

  • Use Flash Attention: flash_attn: fa2

  • Remove gradient_checkpointing if VRAM allows

Problem: Model not found in web UI
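If a gated model has not been downloaded yet, pre-fetch it with your HuggingFace token (the model name here is an example):

```shell
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --token "$HF_TOKEN"
```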

Then refresh the model list in LLaMA Board.

Problem: Dataset format errors

All dataset formats must match dataset_info.json specification:
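A minimal entry; formatting and columns describe how your JSON fields map onto LLaMA-Factory's Alpaca schema (in the container, add this entry to the existing data/dataset_info.json rather than replacing the file):

```shell
mkdir -p data
cat > data/dataset_info.json <<'EOF'
{
  "my_dataset": {
    "file_name": "my_dataset.json",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}
EOF
```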

Problem: WebUI port not accessible

Ensure LLaMA-Factory started the Gradio server:
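Assuming the container was named llamafactory:

```shell
docker logs llamafactory 2>&1 | grep -i "running on"
# look for a line like: Running on local URL: http://0.0.0.0:7860
```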

Alternatively, add the --share flag to get a public Gradio URL.



Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Fine-tuning (7B–13B) | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Models (70B+) | A100 80GB | ~$1.20/gpu/hr |
| Multi-GPU Training | 2–4× A100 80GB | ~$2.40–$4.80/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
