Fine-Tuning Models with Hugging Face

What We're Building

A complete pipeline for fine-tuning Hugging Face models (LLMs, vision models, and more) on Clore GPUs with QLoRA, PEFT, and automatic checkpoint management, delivering strong results at a fraction of typical cloud costs.

Prerequisites

  • Clore.ai API key

  • Python 3.10+

  • Hugging Face account (for model access)
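
With the prerequisites in place, set up the toolchain. The package list below is an assumption based on the QLoRA stack the configuration targets (transformers, peft, trl, bitsandbytes):

```shell
# Install the fine-tuning stack (package list assumes the QLoRA setup below).
pip install transformers datasets peft trl bitsandbytes accelerate

# Log in so gated checkpoints such as Llama 2 can be downloaded.
huggingface-cli login
```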

Step 1: Fine-Tuning Configuration

# finetune_config.py
"""Configuration for Hugging Face fine-tuning."""

from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class ModelConfig:
    """Model configuration."""
    model_name: str = "meta-llama/Llama-2-7b-hf"
    trust_remote_code: bool = True
    torch_dtype: str = "bfloat16"
    
    # Quantization
    load_in_4bit: bool = True
    bnb_4bit_compute_dtype: str = "bfloat16"
    bnb_4bit_quant_type: str = "nf4"
    bnb_4bit_use_double_quant: bool = True

@dataclass
class LoRAConfig:
    """LoRA/QLoRA configuration."""
    r: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    target_modules: List[str] = field(default_factory=lambda: [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ])
    bias: str = "none"
    task_type: str = "CAUSAL_LM"

@dataclass
class TrainingConfig:
    """Training configuration."""
    # Data
    dataset_name: str = "databricks/databricks-dolly-15k"
    max_seq_length: int = 2048
    
    # Training
    num_epochs: int = 3
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    learning_rate: float = 2e-4
    weight_decay: float = 0.01
    warmup_ratio: float = 0.03
    
    # Optimizer
    optim: str = "paged_adamw_8bit"
    
    # Logging
    logging_steps: int = 10
    save_steps: int = 100
    
    # Output
    output_dir: str = "/workspace/fine-tuned-model"
    hub_model_id: Optional[str] = None

@dataclass
class CloreConfig:
    """Clore provisioning configuration."""
    api_key: str = ""
    gpu_type: str = "RTX 4090"
    max_price_usd: float = 0.50
    image: str = "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-devel"
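
With the defaults above, the effective batch size and rough step count work out as follows. A back-of-envelope sketch; the ~15k example count for databricks-dolly-15k is an approximation of the dataset size:

```python
# Back-of-envelope step count for the default TrainingConfig above.
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_epochs = 3
dataset_size = 15_000  # databricks-dolly-15k is ~15k instruction examples

# Optimizer updates happen once per accumulated batch.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = dataset_size // effective_batch_size
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size)  # 16
print(total_steps)           # 2811
```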

Step 2: Fine-Tuning Script
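
The script wires the Step 1 dataclass values into the standard QLoRA stack. A minimal sketch, assuming the transformers/peft/trl toolchain; the `SFTTrainer` keyword set varies across trl releases, so treat the trainer call as illustrative rather than definitive:

```python
# finetune.py -- hedged sketch of a QLoRA run using the Step 1 defaults.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"

# Quantization settings mirror ModelConfig above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Adapter settings mirror LoRAConfig above.
peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Training settings mirror TrainingConfig above.
args = TrainingArguments(
    output_dir="/workspace/fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_steps=100,
    bf16=True,
)

trainer = SFTTrainer(model=model, args=args, train_dataset=dataset)
trainer.train()
trainer.save_model("/workspace/fine-tuned-model")
```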

Step 3: Remote Fine-Tuning Orchestrator
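
The orchestrator rents a machine that matches CloreConfig, launches the training image, and can later tear the server down. A rough sketch only: the endpoint paths, auth header name, and response fields below are assumptions, so check them against the official Clore API reference before use:

```python
# orchestrate.py -- hedged sketch of remote provisioning on Clore.
# Endpoint paths, header names, and payload fields are assumptions,
# not verified API details.
import requests

API_BASE = "https://api.clore.ai/v1"   # assumed base URL
API_KEY = "YOUR_CLORE_API_KEY"
HEADERS = {"auth": API_KEY}            # header name is an assumption

def find_offer(gpu_type: str = "RTX 4090",
               max_price: float = 0.50) -> dict | None:
    """Pick the cheapest marketplace offer matching CloreConfig."""
    resp = requests.get(f"{API_BASE}/marketplace",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    offers = [
        s for s in resp.json().get("servers", [])
        if gpu_type in s.get("specs", {}).get("gpu", "")
        and s.get("price", {}).get("on_demand", float("inf")) <= max_price
    ]
    offers.sort(key=lambda s: s["price"]["on_demand"])
    return offers[0] if offers else None

def rent(server_id: int, image: str) -> None:
    """Create an on-demand order running the training image."""
    payload = {"currency": "usd", "image": image,
               "renting_server": server_id, "type": "on-demand"}
    resp = requests.post(f"{API_BASE}/create_order", headers=HEADERS,
                         json=payload, timeout=30)
    resp.raise_for_status()

if __name__ == "__main__":
    offer = find_offer()
    if offer is None:
        raise SystemExit("no matching offer under the price cap")
    rent(offer["id"], "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-devel")
```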

Model-Specific Configurations

Llama 2 7B (QLoRA)

Mistral 7B

Llama 2 13B (QLoRA)
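
The three variants above differ mainly in the model id and the memory headroom they need. A sketch of per-model overrides on the finetune_config.py defaults; the batch sizes are illustrative assumptions sized for a 24GB card, not tuned settings:

```python
# Illustrative per-model overrides for the dataclasses in finetune_config.py.
# Model ids are the public Hugging Face repos; batch sizes are assumptions.
MODEL_PRESETS = {
    "llama2-7b-qlora": {
        "model_name": "meta-llama/Llama-2-7b-hf",
        "per_device_batch_size": 4,
    },
    "mistral-7b": {
        "model_name": "mistralai/Mistral-7B-v0.1",
        "per_device_batch_size": 4,
    },
    "llama2-13b-qlora": {
        "model_name": "meta-llama/Llama-2-13b-hf",
        "per_device_batch_size": 2,  # 13B needs more headroom on a 24GB card
    },
}

def apply_preset(config_kwargs: dict, preset: str) -> dict:
    """Merge a preset into keyword overrides for the config dataclasses."""
    merged = dict(config_kwargs)
    merged.update(MODEL_PRESETS[preset])
    return merged
```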

VRAM Requirements

| Model | 4-bit QLoRA | Full Fine-tune |
| --- | --- | --- |
| Llama 2 7B | ~18GB | ~56GB |
| Llama 2 13B | ~22GB | ~104GB |
| Llama 2 70B | ~48GB (2x24GB) | ~560GB |
| Mistral 7B | ~16GB | ~56GB |
| Falcon 7B | ~16GB | ~56GB |
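
The 4-bit figures can be sanity-checked with a rough estimate: NF4 stores weights at about half a byte per parameter, and adapters, gradients, optimizer state, activations, and KV cache stack on top. A back-of-envelope sketch; the half-byte figure ignores quantization constants and per-layer overhead:

```python
# Rough 4-bit memory sketch: weights at ~0.5 bytes/parameter under NF4.
def qlora_weight_gb(params_billion: float) -> float:
    bytes_per_param = 0.5  # 4-bit quantized weights
    return params_billion * 1e9 * bytes_per_param / 1e9

print(qlora_weight_gb(7))   # 3.5 GB of base weights alone
print(qlora_weight_gb(13))  # 6.5 GB
# The ~18-22GB totals in the table come from everything stacked on top
# of these base weights, not the weights by themselves.
```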

Cost Comparison

| Model | Dataset | Clore (RTX 4090) | AWS (A100) | Savings |
| --- | --- | --- | --- | --- |
| Llama 2 7B | Dolly 15K | ~$0.80 | ~$12.00 | 93% |
| Mistral 7B | Custom 50K | ~$2.40 | ~$36.00 | 93% |
| Llama 2 13B | Alpaca | ~$1.60 | ~$24.00 | 93% |
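
The Clore figures are simply rental rate times wall-clock time. A sketch at the $0.50/hr cap from CloreConfig; the run times used here are illustrative assumptions derived from the table, not measurements:

```python
# Cost = hourly rate x training hours (hours are illustrative assumptions).
rate_usd_per_hour = 0.50  # max_price_usd from CloreConfig

def run_cost(hours: float, rate: float = rate_usd_per_hour) -> float:
    return round(hours * rate, 2)

print(run_cost(1.6))  # 0.8 -> matches the ~$0.80 Llama 2 7B row
```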

Next Steps
