Training YOLO Object Detection Models

What We're Building

A complete YOLOv8 object detection training pipeline on Clore.ai GPUs. Train custom detection, segmentation, and pose estimation models with automatic GPU provisioning, data preparation, and model export.

Key Features:

Automatic GPU provisioning via Clore.ai API
YOLOv8 detection, segmentation, and pose models
Custom dataset training (YOLO-format labels)
Data augmentation and preprocessing
Model export (ONNX, TensorRT, CoreML)
Training metrics and visualization
Multi-GPU training support

Prerequisites

Clore.ai account with API key
Python 3.10+
Labeled dataset in YOLO format (convert COCO annotations to YOLO format first)

pip install requests paramiko scp ultralytics roboflow
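
Before renting a GPU, it can be worth checking the dataset locally. The sketch below assumes the common Ultralytics layout (images/<split>/ with matching labels/<split>/ text files); the function name check_yolo_dataset and the list of image suffixes are illustrative choices, not part of any library.

```python
from pathlib import Path

# Suffixes treated as images in this sketch (an assumption, extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def check_yolo_dataset(root: str, splits=("train", "val")) -> dict:
    """Count images per split and list images that have no label file.

    Assumes <root>/images/<split>/ and <root>/labels/<split>/.
    """
    report = {}
    for split in splits:
        image_dir = Path(root) / "images" / split
        label_dir = Path(root) / "labels" / split
        images = [p for p in sorted(image_dir.glob("*"))
                  if p.suffix.lower() in IMAGE_SUFFIXES]
        missing = [p.name for p in images
                   if not (label_dir / f"{p.stem}.txt").exists()]
        report[split] = {"images": len(images), "missing_labels": missing}
    return report

# Usage (hypothetical path):
# print(check_yolo_dataset("./my_dataset"))
```

An image without a label file is treated as a background image during training, so a non-empty missing_labels list is a prompt to double-check rather than a hard error.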

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                  YOLOv8 Training Pipeline                        │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │  Dataset    │  │  Training   │  │   Export & Deploy       │  │
│  │  Roboflow/  │──│  YOLOv8     │──│   ONNX/TensorRT/CoreML  │  │
│  │  Local      │  │  Ultralytics│  │                         │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│         │                │                    │                  │
│         └────────────────┴────────────────────┘                  │
│                          │                                       │
│                 ┌────────▼────────┐                              │
│                 │  Clore.ai GPU   │                              │
│                 │  RTX 4090/A100  │                              │
│                 └─────────────────┘                              │
└─────────────────────────────────────────────────────────────────┘

Step 1: Clore.ai YOLO Client

# clore_yolo_client.py
import requests
import time
import secrets
from typing import Dict, Any, Optional
from dataclasses import dataclass

@dataclass
class YOLOServer:
    """GPU server for YOLO training."""
    server_id: int
    order_id: int
    ssh_host: str
    ssh_port: int
    ssh_password: str
    gpu_model: str
    gpu_count: int
    hourly_cost: float


class CloreYOLOClient:
    """Clore.ai client for YOLO training."""
    
    BASE_URL = "https://api.clore.ai"
    YOLO_IMAGE = "ultralytics/ultralytics:latest-python"
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.headers = {"auth": api_key}
    
    def _request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
        """Make API request."""
        url = f"{self.BASE_URL}{endpoint}"
        
        for attempt in range(3):
            response = requests.request(
                method, url,
                headers=self.headers,
                timeout=30,
                **kwargs
            )
            data = response.json()
            
            if data.get("code") == 5:
                time.sleep(2 ** attempt)
                continue
            
            if data.get("code") != 0:
                raise Exception(f"API Error: {data}")
            return data
        
        raise Exception("Max retries exceeded")
    
    def find_yolo_gpu(self, max_price_usd: float = 0.50) -> Optional[Dict]:
        """Find GPU suitable for YOLO training."""
        servers = self._request("GET", "/v1/marketplace")["servers"]
        
        # GPUs good for YOLO (fast training, reasonable VRAM)
        yolo_gpus = ["RTX 4090", "RTX 4080", "RTX 3090", "RTX 3080",
                     "A100", "A6000", "A5000"]
        
        candidates = []
        for server in servers:
            if server.get("rented"):
                continue
            
            gpu_array = server.get("gpu_array", [])
            if not any(any(g in gpu for g in yolo_gpus) for gpu in gpu_array):
                continue
            
            price = server.get("price", {}).get("usd", {}).get("spot")
            if not price or price > max_price_usd:
                continue
            
            candidates.append({
                "id": server["id"],
                "gpus": gpu_array,
                "gpu_count": len(gpu_array),
                "price_usd": price,
                "reliability": server.get("reliability", 0)
            })
        
        if not candidates:
            return None
        
        candidates.sort(key=lambda x: (x["price_usd"], -x["reliability"]))
        return candidates[0]
    
    def rent_yolo_server(self, server: Dict, use_spot: bool = True) -> YOLOServer:
        """Rent a server for YOLO training."""
        ssh_password = secrets.token_urlsafe(16)
        
        order_data = {
            "renting_server": server["id"],
            "type": "spot" if use_spot else "on-demand",
            "currency": "CLORE-Blockchain",
            "image": self.YOLO_IMAGE,
            "ports": {"22": "tcp", "6006": "http"},
            "env": {"NVIDIA_VISIBLE_DEVICES": "all"},
            "ssh_password": ssh_password
        }
        
        if use_spot:
            order_data["spotprice"] = server["price_usd"] * 1.15
        
        result = self._request("POST", "/v1/create_order", json=order_data)
        order_id = result["order_id"]
        
        # Wait for server
        for _ in range(120):
            orders = self._request("GET", "/v1/my_orders")["orders"]
            order = next((o for o in orders if o["order_id"] == order_id), None)
            
            if order and order.get("status") == "running":
                conn = order["connection"]["ssh"]
                parts = conn.split()
                ssh_host = parts[1].split("@")[1] if "@" in parts[1] else parts[1]
                ssh_port = int(parts[-1]) if "-p" in conn else 22
                
                return YOLOServer(
                    server_id=server["id"],
                    order_id=order_id,
                    ssh_host=ssh_host,
                    ssh_port=ssh_port,
                    ssh_password=ssh_password,
                    gpu_model=server["gpus"][0] if server["gpus"] else "Unknown",
                    gpu_count=server["gpu_count"],
                    hourly_cost=server["price_usd"]
                )
            
            time.sleep(2)
        
        raise Exception("Timeout waiting for server")
    
    def cancel_order(self, order_id: int):
        """Cancel an order."""
        self._request("POST", "/v1/cancel_order", json={"id": order_id})
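
The selection logic in find_yolo_gpu can be tried offline before calling the API. This is a minimal sketch that mirrors the filter and sort order used above; the sample entries are invented for illustration, and the field names (rented, gpu_array, price.usd.spot, reliability) are taken from the client code rather than from the API documentation.

```python
from typing import Dict, List, Optional

PREFERRED_GPUS = ["RTX 4090", "RTX 3090", "A100"]


def pick_server(servers: List[Dict], max_price_usd: float) -> Optional[Dict]:
    """Return the cheapest unrented server with a preferred GPU, or None.

    Ties on price are broken in favour of higher reliability.
    """
    candidates = []
    for server in servers:
        if server.get("rented"):
            continue
        gpus = server.get("gpu_array", [])
        if not any(name in gpu for gpu in gpus for name in PREFERRED_GPUS):
            continue
        price = server.get("price", {}).get("usd", {}).get("spot")
        if price is None or price > max_price_usd:
            continue
        candidates.append((price, -server.get("reliability", 0), server["id"]))
    if not candidates:
        return None
    price, neg_reliability, server_id = min(candidates)
    return {"id": server_id, "price_usd": price, "reliability": -neg_reliability}


# Invented marketplace entries, shaped like the fields the client reads.
sample = [
    {"id": 1, "rented": True, "gpu_array": ["1x RTX 4090"],
     "price": {"usd": {"spot": 0.20}}, "reliability": 0.99},
    {"id": 2, "rented": False, "gpu_array": ["1x RTX 3090"],
     "price": {"usd": {"spot": 0.30}}, "reliability": 0.95},
    {"id": 3, "rented": False, "gpu_array": ["1x RTX 4090"],
     "price": {"usd": {"spot": 0.30}}, "reliability": 0.99},
    {"id": 4, "rented": False, "gpu_array": ["1x GTX 1080"],
     "price": {"usd": {"spot": 0.05}}, "reliability": 0.99},
]
print(pick_server(sample, max_price_usd=0.50))
```

With this sample, server 1 is skipped because it is rented, server 4 because its GPU is not in the list, and server 3 wins the price tie with server 2 on reliability.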

Step 2: YOLO Training Engine

# yolo_trainer.py
import paramiko
from scp import SCPClient
import json
import time
import os
from typing import Dict, List, Optional
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """YOLO training configuration."""
    model: str = "yolov8n.pt"  # yolov8n, yolov8s, yolov8m, yolov8l, yolov8x
    task: str = "detect"  # detect, segment, classify, pose
    epochs: int = 100
    batch_size: int = 16
    img_size: int = 640
    learning_rate: float = 0.01
    device: str = "0"
    workers: int = 8
    patience: int = 50
    project: str = "yolo_training"
    name: str = "run"


@dataclass
class TrainingResult:
    """Training results."""
    model_path: str
    metrics: Dict
    training_time_seconds: float
    epochs_completed: int
    success: bool
    error: Optional[str] = None


class RemoteYOLOTrainer:
    """Execute YOLO training on remote GPU."""
    
    def __init__(self, ssh_host: str, ssh_port: int, ssh_password: str):
        self.ssh_host = ssh_host
        self.ssh_port = ssh_port
        self.ssh_password = ssh_password
        self._ssh = None
        self._scp = None
    
    def connect(self):
        """Establish SSH connection."""
        self._ssh = paramiko.SSHClient()
        self._ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self._ssh.connect(
            self.ssh_host,
            port=self.ssh_port,
            username="root",
            password=self.ssh_password,
            timeout=30
        )
        self._scp = SCPClient(self._ssh.get_transport())
    
    def disconnect(self):
        """Close connections."""
        if self._scp:
            self._scp.close()
        if self._ssh:
            self._ssh.close()
    
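    def _exec(self, cmd: str, timeout: int = 7200) -> str:
        """Execute a command and return its stdout."""
        stdin, stdout, stderr = self._ssh.exec_command(cmd, timeout=timeout)
        # Read stdout before waiting for the exit status: waiting first can
        # hang when the command prints more than the SSH channel window holds.
        output = stdout.read().decode()
        stdout.channel.recv_exit_status()
        return output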
    
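    def upload_dataset(self, local_path: str, dataset_name: str = "dataset"):
        """Upload a directory so its contents land at /tmp/<dataset_name>."""
        remote_path = f"/tmp/{dataset_name}"
        # Create only the parent and clear the target: if the target already
        # exists, scp nests the upload as <target>/<local dirname>.
        self._exec(f"mkdir -p {os.path.dirname(remote_path)} && rm -rf {remote_path}")
        self._scp.put(os.path.normpath(local_path), remote_path, recursive=True)
        return remote_path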
    
    def upload_file(self, local_path: str, remote_path: str):
        """Upload single file."""
        self._scp.put(local_path, remote_path)
    
    def download_file(self, remote_path: str, local_path: str):
        """Download file."""
        self._scp.get(remote_path, local_path)
    
    def download_directory(self, remote_path: str, local_path: str):
        """Download directory."""
        self._scp.get(remote_path, local_path, recursive=True)
    
    def setup_environment(self):
        """Ensure YOLOv8 is installed."""
        print("Setting up environment...")
        self._exec("pip install -q ultralytics")
        self._exec("mkdir -p /tmp/yolo_training")
    
    def verify_gpu(self) -> Dict:
        """Verify GPU availability."""
        script = '''
import torch
from ultralytics import YOLO

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
'''
        output = self._exec(f"python3 -c '{script}'")
        return {"output": output}
    
    def train(self, dataset_yaml: str, config: TrainingConfig) -> TrainingResult:
        """Train YOLO model."""
        
        training_script = f'''
import json
import time
from ultralytics import YOLO

start_time = time.time()
result = {{"success": False}}

try:
    # Load model
    model = YOLO("{config.model}")
    
    # Train
    results = model.train(
        data="{dataset_yaml}",
        epochs={config.epochs},
        batch={config.batch_size},
        imgsz={config.img_size},
        lr0={config.learning_rate},
        device="{config.device}",
        workers={config.workers},
        patience={config.patience},
        project="/tmp/{config.project}",
        name="{config.name}",
        exist_ok=True,
        verbose=True
    )
    
    # Get best model path
    best_model = f"/tmp/{config.project}/{config.name}/weights/best.pt"
    
    # Validate
    metrics = model.val()
    
    result = {{
        "success": True,
        "model_path": best_model,
        "metrics": {{
            "mAP50": float(metrics.box.map50) if hasattr(metrics.box, 'map50') else 0,
            "mAP50_95": float(metrics.box.map) if hasattr(metrics.box, 'map') else 0,
            "precision": float(metrics.box.mp) if hasattr(metrics.box, 'mp') else 0,
            "recall": float(metrics.box.mr) if hasattr(metrics.box, 'mr') else 0
        }},
        "epochs_completed": {config.epochs},
        "training_time": time.time() - start_time
    }}
    
except Exception as e:
    result = {{"success": False, "error": str(e)}}

print("RESULT:" + json.dumps(result))
'''
        
        # Write script
        self._exec(f"cat > /tmp/train_yolo.py << 'EOF'\n{training_script}\nEOF")
        
        # Run training
        print(f"Training {config.model} for {config.epochs} epochs...")
        output = self._exec("python3 /tmp/train_yolo.py 2>&1", timeout=86400)
        
        # Parse result
        for line in output.split("\n"):
            if line.startswith("RESULT:"):
                result_data = json.loads(line[7:])
                return TrainingResult(
                    model_path=result_data.get("model_path", ""),
                    metrics=result_data.get("metrics", {}),
                    training_time_seconds=result_data.get("training_time", 0),
                    epochs_completed=result_data.get("epochs_completed", 0),
                    success=result_data.get("success", False),
                    error=result_data.get("error")
                )
        
        return TrainingResult(
            model_path="",
            metrics={},
            training_time_seconds=0,
            epochs_completed=0,
            success=False,
            error="Failed to parse training result"
        )
    
    def export_model(self, model_path: str, format: str = "onnx") -> str:
        """Export model to different format."""
        export_script = f'''
from ultralytics import YOLO
model = YOLO("{model_path}")
path = model.export(format="{format}")
print(f"EXPORTED:{{path}}")
'''
        output = self._exec(f"python3 -c '{export_script}'")
        
        for line in output.split("\n"):
            if line.startswith("EXPORTED:"):
                return line[9:]
        
        return ""
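
The trainer gets structured results back from the remote process by printing one marker-prefixed JSON line among the regular training logs. The sketch below isolates that parsing step as a standalone function; parse_marked_json is a hypothetical helper name, and unlike the loop above it takes the last matching line and tolerates malformed JSON.

```python
import json
from typing import Optional


def parse_marked_json(output: str, marker: str = "RESULT:") -> Optional[dict]:
    """Return the JSON payload of the last line starting with marker, or None."""
    for line in reversed(output.splitlines()):
        line = line.strip()
        if line.startswith(marker):
            try:
                return json.loads(line[len(marker):])
            except json.JSONDecodeError:
                return None
    return None


# Made-up log text in the shape the training script prints.
log = 'Epoch 1/2\nEpoch 2/2\nRESULT:{"success": true, "metrics": {"mAP50": 0.5}}\n'
print(parse_marked_json(log))
```

Returning None for both a missing marker and a broken payload lets the caller map either case to a failed TrainingResult.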

Step 3: Complete YOLO Training Pipeline

# yolo_pipeline.py
import os
import time
import yaml
from typing import Optional
from dataclasses import asdict

from clore_yolo_client import CloreYOLOClient, YOLOServer
from yolo_trainer import RemoteYOLOTrainer, TrainingConfig, TrainingResult


class YOLOPipeline:
    """End-to-end YOLO training pipeline on Clore.ai."""
    
    def __init__(self, api_key: str):
        self.client = CloreYOLOClient(api_key)
        self.server: Optional[YOLOServer] = None
        self.trainer: Optional[RemoteYOLOTrainer] = None
    
    def setup(self, max_price_usd: float = 0.50):
        """Provision GPU for YOLO training."""
        
        print("🔍 Finding GPU for YOLO training...")
        gpu = self.client.find_yolo_gpu(max_price_usd=max_price_usd)
        
        if not gpu:
            raise Exception(f"No GPU available under ${max_price_usd}/hr")
        
        print(f"   Found: {gpu['gpus']} @ ${gpu['price_usd']:.2f}/hr")
        
        print("🚀 Provisioning server...")
        self.server = self.client.rent_yolo_server(gpu)
        
        print(f"   Server ready: {self.server.ssh_host}:{self.server.ssh_port}")
        
        # Connect trainer
        self.trainer = RemoteYOLOTrainer(
            self.server.ssh_host,
            self.server.ssh_port,
            self.server.ssh_password
        )
        self.trainer.connect()
        self.trainer.setup_environment()
        self.trainer.verify_gpu()
        
        return self
    
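    def prepare_dataset(self, 
                        images_path: str,
                        labels_path: str,
                        classes: list,
                        val_split: float = 0.2) -> str:
        """Prepare and upload dataset.

        images_path and labels_path must each already contain train/ and val/
        subdirectories; val_split is accepted but not applied in this version.
        """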
        
        # Create dataset.yaml
        dataset_yaml = {
            "path": "/tmp/dataset",
            "train": "images/train",
            "val": "images/val",
            "names": {i: name for i, name in enumerate(classes)}
        }
        
        # Write YAML locally
        yaml_path = "/tmp/dataset.yaml"
        with open(yaml_path, "w") as f:
            yaml.dump(dataset_yaml, f)
        
        # Upload dataset
        print("📤 Uploading dataset...")
        self.trainer.upload_dataset(images_path, "dataset/images")
        self.trainer.upload_dataset(labels_path, "dataset/labels")
        self.trainer.upload_file(yaml_path, "/tmp/dataset.yaml")
        
        return "/tmp/dataset.yaml"
    
    def train(self, 
              dataset_yaml: str,
              model: str = "yolov8n.pt",
              epochs: int = 100,
              batch_size: int = 16,
              img_size: int = 640) -> TrainingResult:
        """Train YOLO model."""
        
        config = TrainingConfig(
            model=model,
            epochs=epochs,
            batch_size=batch_size,
            img_size=img_size
        )
        
        return self.trainer.train(dataset_yaml, config)
    
    def export(self, model_path: str, format: str = "onnx") -> str:
        """Export trained model."""
        return self.trainer.export_model(model_path, format)
    
    def download_model(self, remote_path: str, local_path: str):
        """Download trained model."""
        self.trainer.download_file(remote_path, local_path)
    
    def download_training_results(self, local_dir: str):
        """Download all training results."""
        self.trainer.download_directory("/tmp/yolo_training", local_dir)
    
    def cleanup(self):
        """Release resources."""
        if self.trainer:
            self.trainer.disconnect()
        if self.server:
            print("🧹 Releasing server...")
            self.client.cancel_order(self.server.order_id)
    
    def __enter__(self):
        return self
    
    def __exit__(self, *args):
        self.cleanup()
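
The generated dataset.yaml points at images/train and images/val, so the local data has to be split before upload. A minimal sketch of that split follows, assuming one flat directory of images and one of YOLO label files with matching stems; split_dataset and its seed parameter are hypothetical and not part of the pipeline above.

```python
import random
import shutil
from pathlib import Path


def split_dataset(images_dir: str, labels_dir: str, out_dir: str,
                  val_split: float = 0.2, seed: int = 0) -> dict:
    """Copy image/label pairs into images/{train,val} and labels/{train,val}."""
    images = sorted(p for p in Path(images_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(images)  # seeded so the split is repeatable
    n_val = int(len(images) * val_split)
    splits = {"val": images[:n_val], "train": images[n_val:]}
    for split, files in splits.items():
        image_out = Path(out_dir) / "images" / split
        label_out = Path(out_dir) / "labels" / split
        image_out.mkdir(parents=True, exist_ok=True)
        label_out.mkdir(parents=True, exist_ok=True)
        for image in files:
            shutil.copy2(image, image_out / image.name)
            label = Path(labels_dir) / f"{image.stem}.txt"
            if label.exists():  # unlabeled images are copied without a label
                shutil.copy2(label, label_out / label.name)
    return {split: len(files) for split, files in splits.items()}

# Usage (hypothetical paths):
# split_dataset("./raw/images", "./raw/labels", "./my_dataset", val_split=0.2)
```

The resulting ./my_dataset/images and ./my_dataset/labels directories could then serve as images_path and labels_path for prepare_dataset.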

Full Script: Production YOLO Training

#!/usr/bin/env python3
"""
YOLOv8 Training on Clore.ai GPUs.

Usage:
    # Train with local dataset
    python train_yolo.py --api-key YOUR_API_KEY --data dataset.yaml --model yolov8s.pt --epochs 100
    
    # Train with Roboflow dataset
    python train_yolo.py --api-key YOUR_API_KEY --roboflow WORKSPACE/PROJECT/VERSION --model yolov8m.pt
"""

import argparse
import os
import time
import json
import secrets
import requests
import paramiko
from scp import SCPClient
from typing import Dict, Optional
from dataclasses import dataclass


@dataclass
class TrainingResult:
    model_path: str
    mAP50: float
    mAP50_95: float
    precision: float
    recall: float
    epochs: int
    time_seconds: float
    cost_usd: float
    success: bool


class CloreYOLOTrainer:
    """Complete YOLOv8 training on Clore.ai."""
    
    BASE_URL = "https://api.clore.ai"
    IMAGE = "ultralytics/ultralytics:latest-python"
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.headers = {"auth": api_key}
        self.order_id = None
        self.ssh_host = None
        self.ssh_port = None
        self.ssh_password = None
        self.hourly_cost = 0.0
        self._ssh = None
        self._scp = None
    
    def _api(self, method: str, endpoint: str, **kwargs) -> Dict:
        url = f"{self.BASE_URL}{endpoint}"
        for attempt in range(3):
            response = requests.request(method, url, headers=self.headers, timeout=30, **kwargs)
            data = response.json()
            if data.get("code") == 5:
                time.sleep(2 ** attempt)
                continue
            if data.get("code") != 0:
                raise Exception(f"API Error: {data}")
            return data
        raise Exception("Max retries")
    
    def setup(self, max_price: float = 0.50):
        print("🔍 Finding GPU...")
        servers = self._api("GET", "/v1/marketplace")["servers"]
        
        gpus = ["RTX 4090", "RTX 4080", "RTX 3090", "RTX 3080", "A100"]
        candidates = []
        
        for s in servers:
            if s.get("rented"):
                continue
            gpu_array = s.get("gpu_array", [])
            if not any(any(g in gpu for g in gpus) for gpu in gpu_array):
                continue
            price = s.get("price", {}).get("usd", {}).get("spot")
            if price and price <= max_price:
                candidates.append({"id": s["id"], "gpus": gpu_array, "price": price})
        
        if not candidates:
            raise Exception(f"No GPU under ${max_price}/hr")
        
        gpu = min(candidates, key=lambda x: x["price"])
        print(f"   {gpu['gpus']} @ ${gpu['price']:.2f}/hr")
        
        self.ssh_password = secrets.token_urlsafe(16)
        self.hourly_cost = gpu["price"]
        
        print("🚀 Provisioning server...")
        order_data = {
            "renting_server": gpu["id"],
            "type": "spot",
            "currency": "CLORE-Blockchain",
            "image": self.IMAGE,
            "ports": {"22": "tcp"},
            "env": {"NVIDIA_VISIBLE_DEVICES": "all"},
            "ssh_password": self.ssh_password,
            "spotprice": gpu["price"] * 1.15
        }
        
        result = self._api("POST", "/v1/create_order", json=order_data)
        self.order_id = result["order_id"]
        
        print("⏳ Waiting for server...")
        for _ in range(120):
            orders = self._api("GET", "/v1/my_orders")["orders"]
            order = next((o for o in orders if o["order_id"] == self.order_id), None)
            if order and order.get("status") == "running":
                conn = order["connection"]["ssh"]
                parts = conn.split()
                self.ssh_host = parts[1].split("@")[1]
                self.ssh_port = int(parts[-1]) if "-p" in conn else 22
                break
            time.sleep(2)
        else:
            raise Exception("Timeout")
        
        # Connect SSH
        self._ssh = paramiko.SSHClient()
        self._ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self._ssh.connect(self.ssh_host, port=self.ssh_port,
                          username="root", password=self.ssh_password, timeout=30)
        self._scp = SCPClient(self._ssh.get_transport())
        
        print(f"✅ Server ready: {self.ssh_host}:{self.ssh_port}")
        
        # Setup YOLO
        print("📦 Setting up YOLOv8...")
        self._exec("pip install -q ultralytics", timeout=120)
    
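    def _exec(self, cmd: str, timeout: int = 86400) -> str:
        stdin, stdout, stderr = self._ssh.exec_command(cmd, timeout=timeout)
        # Read stdout before waiting for the exit status: waiting first can
        # hang when the command prints more than the SSH channel window holds.
        output = stdout.read().decode()
        stdout.channel.recv_exit_status()
        return output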
    
    def upload_dataset(self, local_path: str) -> str:
        """Upload local dataset."""
        print(f"📤 Uploading dataset from {local_path}...")
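        remote_path = "/tmp/dataset"
        # Clear the target first: if it already exists, scp nests the upload
        # as /tmp/dataset/<dirname> and /tmp/dataset/data.yaml is never created.
        self._exec(f"rm -rf {remote_path}")
        self._scp.put(os.path.normpath(local_path), remote_path, recursive=True)
        return remote_path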
    
    def download_roboflow(self, workspace: str, project: str, version: int, api_key: str) -> str:
        """Download dataset from Roboflow."""
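        print(f"📥 Downloading from Roboflow: {workspace}/{project}/v{version}")
        
        # The download script runs on the server, so make sure the roboflow
        # package is available there as well.
        self._exec("pip install -q roboflow", timeout=300)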
        
        script = f'''
from roboflow import Roboflow
rf = Roboflow(api_key="{api_key}")
project = rf.workspace("{workspace}").project("{project}")
dataset = project.version({version}).download("yolov8", location="/tmp/dataset")
print("DONE:/tmp/dataset/data.yaml")
'''
        output = self._exec(f"python3 -c '{script}'", timeout=600)
        
        for line in output.split("\n"):
            if line.startswith("DONE:"):
                return line[5:]
        
        return "/tmp/dataset/data.yaml"
    
    def train(self, data_yaml: str, model: str = "yolov8n.pt", epochs: int = 100,
              batch: int = 16, imgsz: int = 640) -> TrainingResult:
        
        script = f'''
import json
import time
from ultralytics import YOLO

start = time.time()
result = {{"success": False}}

try:
    model = YOLO("{model}")
    results = model.train(
        data="{data_yaml}",
        epochs={epochs},
        batch={batch},
        imgsz={imgsz},
        device=0,
        project="/tmp/runs",
        name="train",
        exist_ok=True,
        verbose=True
    )
    
    metrics = model.val()
    
    result = {{
        "success": True,
        "model_path": "/tmp/runs/train/weights/best.pt",
        "mAP50": float(metrics.box.map50) if hasattr(metrics.box, 'map50') else 0,
        "mAP50_95": float(metrics.box.map) if hasattr(metrics.box, 'map') else 0,
        "precision": float(metrics.box.mp) if hasattr(metrics.box, 'mp') else 0,
        "recall": float(metrics.box.mr) if hasattr(metrics.box, 'mr') else 0,
        "epochs": {epochs},
        "time": time.time() - start
    }}
except Exception as e:
    result = {{"success": False, "error": str(e)}}

print("RESULT:" + json.dumps(result))
'''
        
        self._exec(f"cat > /tmp/train.py << 'EOF'\n{script}\nEOF")
        
        print(f"🎯 Training {model} for {epochs} epochs...")
        start = time.time()
        output = self._exec("python3 /tmp/train.py 2>&1", timeout=86400)
        elapsed = time.time() - start
        
        # Parse result
        result_data = {"success": False}
        for line in output.split("\n"):
            if line.startswith("RESULT:"):
                result_data = json.loads(line[7:])
                break
        
        cost = (elapsed / 3600) * self.hourly_cost
        
        return TrainingResult(
            model_path=result_data.get("model_path", ""),
            mAP50=result_data.get("mAP50", 0),
            mAP50_95=result_data.get("mAP50_95", 0),
            precision=result_data.get("precision", 0),
            recall=result_data.get("recall", 0),
            epochs=result_data.get("epochs", 0),
            time_seconds=elapsed,
            cost_usd=cost,
            success=result_data.get("success", False)
        )
    
    def export(self, model_path: str, format: str = "onnx") -> str:
        """Export model to different format."""
        script = f'''
from ultralytics import YOLO
model = YOLO("{model_path}")
path = model.export(format="{format}")
print(f"EXPORTED:{{path}}")
'''
        output = self._exec(f"python3 -c '{script}'", timeout=600)
        
        for line in output.split("\n"):
            if line.startswith("EXPORTED:"):
                return line[9:]
        return ""
    
    def download_model(self, remote_path: str, local_path: str):
        """Download model file."""
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        self._scp.get(remote_path, local_path)
    
    def cleanup(self):
        if self._scp:
            self._scp.close()
        if self._ssh:
            self._ssh.close()
        if self.order_id:
            print("🧹 Releasing server...")
            self._api("POST", "/v1/cancel_order", json={"id": self.order_id})
    
    def __enter__(self):
        return self
    
    def __exit__(self, *args):
        self.cleanup()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", required=True, help="Clore.ai API key")
    parser.add_argument("--data", help="Local dataset path or dataset.yaml")
    parser.add_argument("--roboflow", help="Roboflow dataset (WORKSPACE/PROJECT/VERSION)")
    parser.add_argument("--roboflow-key", help="Roboflow API key")
    parser.add_argument("--model", default="yolov8n.pt", help="Base model")
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--batch", type=int, default=16)
    parser.add_argument("--imgsz", type=int, default=640)
    parser.add_argument("--output", default="./best.pt")
    parser.add_argument("--export", choices=["onnx", "torchscript", "tflite", "coreml"])
    parser.add_argument("--max-price", type=float, default=0.50)
    args = parser.parse_args()
    
    with CloreYOLOTrainer(args.api_key) as trainer:
        trainer.setup(args.max_price)
        
        # Get dataset
        if args.roboflow:
            parts = args.roboflow.split("/")
            workspace, project, version = parts[0], parts[1], int(parts[2])
            data_yaml = trainer.download_roboflow(workspace, project, version, args.roboflow_key)
        elif args.data:
            if os.path.isdir(args.data):
                trainer.upload_dataset(args.data)
                data_yaml = "/tmp/dataset/data.yaml"
            else:
                trainer._scp.put(args.data, "/tmp/data.yaml")
                data_yaml = "/tmp/data.yaml"
        else:
            # Use COCO128 for demo
            data_yaml = "coco128.yaml"
        
        # Train
        result = trainer.train(data_yaml, args.model, args.epochs, args.batch, args.imgsz)
        
        print("\n" + "="*60)
        print("📊 TRAINING COMPLETE")
        print("="*60)
        print(f"   Model: {args.model}")
        print(f"   Epochs: {result.epochs}")
        print(f"   Time: {result.time_seconds:.1f}s ({result.time_seconds/60:.1f} min)")
        print(f"   Cost: ${result.cost_usd:.4f}")
        print(f"\n📈 Metrics:")
        print(f"   mAP50: {result.mAP50:.4f}")
        print(f"   mAP50-95: {result.mAP50_95:.4f}")
        print(f"   Precision: {result.precision:.4f}")
        print(f"   Recall: {result.recall:.4f}")
        
        if result.success and result.model_path:
            # Download model
            trainer.download_model(result.model_path, args.output)
            print(f"\n✅ Model saved: {args.output}")
            
            # Export if requested
            if args.export:
                print(f"\n📦 Exporting to {args.export}...")
                exported = trainer.export(result.model_path, args.export)
                if exported:
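                    # Name the local copy after the exported file's real
                    # extension (for example .onnx or .torchscript).
                    export_ext = os.path.splitext(exported.strip())[1]
                    export_local = os.path.splitext(args.output)[0] + export_ext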
                    trainer.download_model(exported, export_local)
                    print(f"   Exported: {export_local}")


if __name__ == "__main__":
    main()
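
The --roboflow argument is split on "/" and indexed directly in main(), so a malformed value fails with an IndexError or a bare ValueError only after a server has been rented. A small validation sketch that could run before setup is shown below; RoboflowSpec and parse_roboflow_spec are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class RoboflowSpec(NamedTuple):
    workspace: str
    project: str
    version: int


def parse_roboflow_spec(spec: str) -> RoboflowSpec:
    """Parse 'WORKSPACE/PROJECT/VERSION' and raise ValueError on bad input."""
    parts = spec.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected WORKSPACE/PROJECT/VERSION, got {spec!r}")
    workspace, project, version = parts
    if not version.isdigit():
        raise ValueError(f"version must be an integer, got {version!r}")
    return RoboflowSpec(workspace, project, int(version))


print(parse_roboflow_spec("myworkspace/myproject/1"))
```

Validating the argument up front keeps a typo from costing a provisioning cycle.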

Example Training Commands

# Train YOLOv8 nano on COCO128 (demo)
python train_yolo.py --api-key YOUR_KEY --model yolov8n.pt --epochs 50

# Train YOLOv8 small on custom dataset
python train_yolo.py --api-key YOUR_KEY --data ./my_dataset --model yolov8s.pt --epochs 100

# Train from Roboflow dataset
python train_yolo.py --api-key YOUR_KEY \
    --roboflow myworkspace/myproject/1 \
    --roboflow-key RF_API_KEY \
    --model yolov8m.pt --epochs 150

# Train and export to ONNX
python train_yolo.py --api-key YOUR_KEY --data dataset.yaml \
    --model yolov8l.pt --epochs 200 --export onnx

Model Variants Comparison

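| Model   | Parameters | mAP50-95 (COCO val) | Speed (V100) | Clore.ai Cost (100 epochs) |
|---------|------------|---------------------|--------------|----------------------------|
| YOLOv8n | 3.2M       | 37.3                | 1.2ms        | ~$0.15                     |
| YOLOv8s | 11.2M      | 44.9                | 2.0ms        | ~$0.25                     |
| YOLOv8m | 25.9M      | 50.2                | 3.5ms        | ~$0.40                     |
| YOLOv8l | 43.7M      | 52.9                | 5.5ms        | ~$0.60                     |
| YOLOv8x | 68.2M      | 53.9                | 8.5ms        | ~$0.80                     |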

Cost Comparison

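| Platform         | Hourly price | 100 epochs COCO | Cost    |
|------------------|--------------|-----------------|---------|
| Clore.ai         | $0.35/hr     | ~45 min         | $0.26   |
| AWS p3.2xlarge   | $3.06/hr     | ~90 min         | $4.59   |
| Google Colab Pro | $10/mo       | ~60 min         | Limited |
| Lambda Labs      | $1.10/hr     | ~45 min         | $0.83   |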
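
The Cost figures for the hourly-billed platforms are the hourly price multiplied by the run time. The sketch below reproduces that arithmetic with the prices and durations from the table; training_cost is a hypothetical helper, and Decimal is used so that half-cent values round up consistently. Google Colab Pro is left out because it is billed as a flat monthly fee.

```python
from decimal import Decimal, ROUND_HALF_UP


def training_cost(hourly_rate_usd: str, minutes: int) -> Decimal:
    """Cost of a run in USD, rounded half-up to whole cents."""
    cost = Decimal(hourly_rate_usd) * Decimal(minutes) / Decimal(60)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hourly price and approximate run time, as listed in the table.
runs = [
    ("Clore.ai", "0.35", 45),
    ("AWS p3.2xlarge", "3.06", 90),
    ("Lambda Labs", "1.10", 45),
]
for platform, rate, minutes in runs:
    print(f"{platform}: ${training_cost(rate, minutes)}")
```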

Next Steps
