Auto-Scaling ML Training Pipeline

What We're Building

A production-grade auto-scaling pipeline that dynamically provisions and releases Clore GPUs based on your job queue depth. Submit training jobs, and the pipeline handles everything: finding servers, renting them, distributing work, managing checkpoints, and releasing resources when idle, all while respecting your hourly budget.

Key Features:

  • File-based job queue (no Redis required; optional Redis backend)

  • Automatic GPU scaling from 1 to N workers

  • Cost-aware scheduling with hourly budget caps

  • Checkpoint saving and resumption across workers

  • Real-time terminal dashboard

  • Graceful shutdown and cleanup

Prerequisites

```shell
pip install requests rich schedule
# Optional: pip install redis  (for the Redis queue backend)
```

Architecture Overview

Step 1: Job Queue

The queue is a simple JSON-lines file; each line is one job, so it works without any external services.
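A minimal sketch of that queue, using only the standard library (the file name `jobs.jsonl` and the job fields shown are illustrative, not fixed by the pipeline):

```python
import json
import uuid
from pathlib import Path

QUEUE_FILE = Path("jobs.jsonl")  # hypothetical queue path

def submit_job(script: str, priority: int = 0) -> str:
    """Append one job as a single JSON line."""
    job = {"id": uuid.uuid4().hex[:8], "script": script,
           "priority": priority, "status": "pending"}
    with QUEUE_FILE.open("a") as f:
        f.write(json.dumps(job) + "\n")
    return job["id"]

def pending_jobs() -> list[dict]:
    """Read the whole file back; highest-priority jobs come first."""
    if not QUEUE_FILE.exists():
        return []
    jobs = [json.loads(line) for line in QUEUE_FILE.read_text().splitlines()
            if line.strip()]
    return sorted((j for j in jobs if j["status"] == "pending"),
                  key=lambda j: -j["priority"])
```

Appending a line is effectively atomic for small records, which is why a plain file is enough until you need multiple submitters, at which point the optional Redis backend takes over.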

Step 2: Set Up the Clore Client

📦 Using the standard Clore API client. See Clore API Client Reference for the full implementation and setup instructions. Save it as clore_client.py in your project.

Step 3: Worker - Runs One Job on a Rented GPU
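The core of a worker, sketched with the rental/SSH transport stubbed out so it stays self-contained (the `command` field and return shape are assumptions for this sketch; the real worker runs the command on the rented Clore server):

```python
import subprocess
from pathlib import Path

def run_job(job: dict, workdir: Path = Path(".")) -> dict:
    """Run one job's shell command and record the outcome on the job dict.
    In the real pipeline this executes over SSH on the rented server;
    here it runs locally to keep the sketch testable."""
    result = subprocess.run(job["command"], shell=True, cwd=workdir,
                            capture_output=True, text=True)
    job["status"] = "done" if result.returncode == 0 else "failed"
    job["returncode"] = result.returncode
    job["log_tail"] = result.stdout[-2000:]  # keep only the last output chunk
    return job
```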

Step 4: Auto-Scaling Scheduler

This is the brain. It watches queue depth, compares it against active workers, and decides when to scale up or down.
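The decision itself reduces to a small pure function. This sketch mirrors the scaling behaviour table below, moving one worker per poll cycle (the thresholds and the `desired_workers` name are illustrative):

```python
def desired_workers(queue_depth: int, active: int,
                    min_workers: int = 0, max_workers: int = 4) -> int:
    """One scaling decision per poll cycle, one step at a time:
    grow while at least 2 jobs wait, shrink once the queue drains."""
    if queue_depth >= 2 and active < max_workers:
        return active + 1   # scale up
    if queue_depth < 1 and active > min_workers:
        return active - 1   # scale down
    return active           # hold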

Step 5: Checkpoint Management

Checkpoints let jobs resume if a worker is evicted or fails mid-run.

Step 6: Terminal Dashboard

A live dashboard using rich so you can watch the pipeline in real time.
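One way to structure it: a function that builds a single frame, wrapped in `rich.live.Live` for refresh. The worker-dict fields here are illustrative, not a fixed schema:

```python
import time
from rich.live import Live
from rich.table import Table

def build_frame(workers: list[dict], queue_depth: int, spend: float) -> Table:
    """One dashboard frame: a row per worker, queue/spend in the title."""
    table = Table(title=f"queue: {queue_depth} pending | spend: ${spend:.2f}/hr")
    table.add_column("Worker")
    table.add_column("Job")
    table.add_column("Status")
    for w in workers:
        table.add_row(w["id"], w.get("job", "-"), w["status"])
    return table

if __name__ == "__main__":
    # Demo loop with fake data; the pipeline feeds it real scheduler state
    with Live(build_frame([], 0, 0.0), refresh_per_second=2) as live:
        for depth in (3, 2, 1, 0):
            live.update(build_frame(
                [{"id": "w1", "job": "job-a1", "status": "running"}],
                depth, 0.35))
            time.sleep(0.5)
```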

Step 7: Putting It All Together

Scaling Behaviour Reference

| Queue Depth | Active Workers | Action |
| --- | --- | --- |
| ≥ 2 pending | 0 workers | Scale up to 1 |
| ≥ 2 pending | 1 worker | Scale up to 2 |
| ≥ 2 pending | 4 workers (max) | Hold |
| < 1 pending | 2 workers idle | Scale down |
| 0 pending, 0 running | any | Pipeline exits |

Cost Estimation

| Workers | GPU | Price/hr | 6-job run (est.) |
| --- | --- | --- | --- |
| 1 | RTX 4090 | $0.35 | ~$2.80 (8 hr) |
| 2 | RTX 4090 | $0.70 | ~$2.10 (3 hr) |
| 4 | RTX 4090 | $1.40 | ~$1.75 (1.25 hr) |

With max_hourly_budget=8.0 and RTX 4090s at $0.35/hr ($3.50/hr for 10 GPUs), you can safely run up to 10 parallel workers.

Budget Guard Rails

The scaler checks _current_spend() before every scale-up event. If adding another worker would exceed max_hourly_budget, it logs a warning and waits for the next poll cycle. This prevents runaway spend even if a bug submits hundreds of jobs.
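The check itself is a one-liner once spend is known. A sketch assuming a uniform per-GPU price (the real guard sums each rental's actual rate via `_current_spend()`):

```python
def can_scale_up(active: int, price_per_gpu: float,
                 max_hourly_budget: float) -> bool:
    """Run before every scale-up: adding one more GPU must keep
    projected hourly spend within budget."""
    projected = (active + 1) * price_per_gpu
    return projected <= max_hourly_budget
```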

Best Practices

  1. Set max_hourly_budget first. It's your safety net; configure it before anything else.

  2. Use min_workers=0 for batch workloads; only pay when jobs are actually running.

  3. Set idle_timeout_sec=60-120 to avoid paying for idle GPUs between jobs.

  4. Use priority field to push urgent jobs to the front of the queue without rewriting code.

  5. Save checkpoints every N epochs so a failed worker doesn't waste all previous compute.

  6. Keep max_price_per_gpu honest: too low means the scaler won't find servers; too high wastes money.

Monitoring & Alerts

The dashboard gives you real-time visibility in the terminal. For production, add a simple notifier.
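A minimal sketch using only the standard library, posting to a generic chat-style webhook (the URL and payload shape are placeholders; adapt them to your Slack/Discord/etc. endpoint):

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/notify"  # hypothetical endpoint

def build_payload(event: str, detail: str) -> bytes:
    """Small JSON body in the shape most chat webhooks accept."""
    return json.dumps({"text": f"[pipeline] {event}: {detail}"}).encode()

def notify(event: str, detail: str) -> None:
    """Fire-and-forget POST; a failed alert must never kill the pipeline."""
    req = urllib.request.Request(
        WEBHOOK_URL, data=build_payload(event, detail),
        headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError:
        pass  # swallow network errors; alerting is best-effort
```

Call `notify` from the scheduler on scale-up, scale-down, job failure, and budget-cap events.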

Next Steps
