Training a PyTorch Model on Clore

What We're Building

A complete PyTorch training pipeline that automatically provisions a GPU, trains a model with checkpointing, logs metrics to Weights & Biases, and cleans up afterwards, at a fraction of typical cloud GPU costs (roughly 96% less than AWS in the estimates under Cost Comparison).

Prerequisites

  • Clore.ai API key

  • Python 3.10+

  • PyTorch experience

Step 1: Project Setup

# requirements.txt
torch>=2.7.1
torchvision>=0.22.0
wandb>=0.15.0
requests>=2.28.0
paramiko>=3.0.0
scp>=0.14.0

# config.py
"""Training configuration."""

from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    # Model
    model_name: str = "resnet18"
    num_classes: int = 10
    pretrained: bool = True
    
    # Training
    epochs: int = 10
    batch_size: int = 64
    learning_rate: float = 0.001
    weight_decay: float = 1e-4
    
    # Data
    dataset: str = "cifar10"
    num_workers: int = 4
    
    # Checkpointing
    checkpoint_dir: str = "/workspace/checkpoints"
    checkpoint_every: int = 1  # epochs
    
    # Logging
    wandb_project: str = "clore-training"
    wandb_run_name: Optional[str] = None
    log_every: int = 100  # batches
    
    # GPU
    gpu_type: str = "RTX 4090"
    max_price_usd: float = 0.50

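@dataclass
class CloreConfig:
    api_key: str
    image: str = "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-devel"
    ssh_password: str = "PyTorchTrain123!"  # placeholder; replace with your own before deploying
    ports: Optional[dict] = None  # filled in by __post_init__ when left unset

    def __post_init__(self):
        # Dataclass defaults cannot be mutable, so the default dict is built here:
        # port 22 over TCP and port 6006 over HTTP.
        self.ports = self.ports or {"22": "tcp", "6006": "http"}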
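
Because CloreConfig carries an API key and an SSH password, it can be convenient to read both from the environment instead of committing them to config.py. The following is a minimal sketch of that idea, not part of the original pipeline. The variable names CLORE_API_KEY and CLORE_SSH_PASSWORD and the helper clore_config_from_env are hypothetical, and the dataclass is a trimmed copy of the one above so the snippet runs on its own.

```python
import os
from dataclasses import dataclass, field


@dataclass
class CloreConfig:
    # Trimmed copy of the CloreConfig above so this sketch is self-contained.
    api_key: str
    ssh_password: str
    image: str = "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-devel"
    ports: dict = field(default_factory=lambda: {"22": "tcp", "6006": "http"})


def clore_config_from_env(env=None) -> CloreConfig:
    """Build a CloreConfig from environment variables.

    CLORE_API_KEY and CLORE_SSH_PASSWORD are hypothetical names chosen
    for this sketch; use whatever naming fits your setup.
    """
    env = os.environ if env is None else env
    required = ("CLORE_API_KEY", "CLORE_SSH_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early, before any GPU is provisioned.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return CloreConfig(
        api_key=env["CLORE_API_KEY"],
        ssh_password=env["CLORE_SSH_PASSWORD"],
    )
```

Calling clore_config_from_env() with no arguments reads from os.environ; passing a plain dict makes the helper easy to exercise in tests.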

Step 2: The Training Script (Runs on GPU)
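
One thing the training script needs, given the checkpoint_dir and checkpoint_every settings in TrainingConfig, is a way to decide when to save and which file to resume from if the rented machine is interrupted. The sketch below shows one possible approach using only the standard library. The epoch_NNN.pt file naming and the helper names are assumptions made for illustration, not something this guide prescribes.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming convention for this sketch: epoch_001.pt, epoch_002.pt, ...
CHECKPOINT_PATTERN = re.compile(r"^epoch_(\d+)\.pt$")


def checkpoint_path(checkpoint_dir: str, epoch: int) -> Path:
    """Return the file path used for the checkpoint of a given (1-based) epoch."""
    return Path(checkpoint_dir) / f"epoch_{epoch:03d}.pt"


def should_checkpoint(epoch: int, checkpoint_every: int, total_epochs: int) -> bool:
    """Save every `checkpoint_every` epochs, and always after the final epoch."""
    return epoch % checkpoint_every == 0 or epoch == total_epochs


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_path, best_epoch = None, -1
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

In a training loop, the path returned by latest_checkpoint can be handed to torch.load before the first epoch, and checkpoint_path gives the destination for torch.save whenever should_checkpoint returns True.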

Step 3: Remote Training Orchestrator

Step 4: One-Shot Training Script

Quick Start

Cost Comparison

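| Model    | Epochs | Clore (RTX 4090) | AWS (p4d.24xlarge) | Savings |
|----------|--------|------------------|--------------------|---------|
| ResNet18 | 10     | ~$0.20           | ~$5.30             | 96%     |
| ResNet50 | 20     | ~$0.80           | ~$21.20            | 96%     |
| ViT-Base | 50     | ~$4.00           | ~$106.00           | 96%     |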

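Estimates assume ~$0.40/hr for a Clore RTX 4090 and ~$10.60/hr for an AWS A100, with the same training time on both platforms (0.5, 2, and 10 hours for the three rows).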
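
The figures above follow from multiplying training hours by the hourly rate. As a quick check, the sketch below recomputes each row. The hours per model are not stated in the table; they are inferred here by dividing each listed cost by its hourly rate, and they assume both platforms take the same wall-clock time, which will not hold exactly in practice.

```python
CLORE_RATE_USD_PER_HOUR = 0.40   # from the note above
AWS_RATE_USD_PER_HOUR = 10.60    # from the note above


def compare(hours: float):
    """Return (clore_cost, aws_cost, savings_fraction) for a run of `hours`."""
    clore_cost = hours * CLORE_RATE_USD_PER_HOUR
    aws_cost = hours * AWS_RATE_USD_PER_HOUR
    savings = 1 - clore_cost / aws_cost
    return clore_cost, aws_cost, savings


# Hours inferred from the listed costs (cost / hourly rate); an assumption of this sketch.
runs = [("ResNet18", 0.5), ("ResNet50", 2.0), ("ViT-Base", 10.0)]

for model, hours in runs:
    clore_cost, aws_cost, savings = compare(hours)
    print(f"{model}: ${clore_cost:.2f} vs ${aws_cost:.2f} ({savings:.0%} savings)")
```

Because the savings fraction depends only on the ratio of the two hourly rates, it comes out the same for every row regardless of training time.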

Next Steps
