Hyperparameter Sweeps with Optuna

What We're Building

An end-to-end hyperparameter optimization system that uses Optuna to intelligently search the parameter space while dynamically provisioning and releasing Clore GPUs for each trial. Each trial gets its own rented GPU, runs in isolation, and is pruned early if it's clearly not competitive — saving real money.

Key Features:

  • Full Optuna integration with Clore GPU provisioning

  • Parallel trial execution across multiple rented GPUs

  • MedianPruner to kill underperforming trials early

  • PostgreSQL/SQLite database backend for distributed studies

  • Per-trial cost tracking with hard budget cap

  • Results visualization (HTML report + terminal summary)

  • Proper error handling and retry logic

Prerequisites

```bash
pip install optuna optuna-dashboard requests paramiko rich

# For PostgreSQL backend (recommended for parallel runs):
pip install psycopg2-binary

# For plotting:
pip install plotly kaleido
```

Architecture Overview

The sweep driver runs on your local machine: an Optuna study proposes hyperparameters for each trial, the driver rents a Clore GPU, launches the training script on it, streams intermediate metrics back for pruning decisions, and releases the server when the trial ends, all while a cost tracker enforces the budget cap.

Step 1: Clore GPU Provisioner for Trials

Each Optuna trial provisions its own Clore server, runs training, streams the result back, then releases the server.
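A sketch of the provisioner, assuming a REST-style Clore API: the base URL, the `create_order`/`cancel_order` endpoint paths, and the payload fields are placeholders to be replaced with the real Clore API calls. The offer-selection helper is pure Python.

```python
import requests

CLORE_API = "https://api.clore.ai/v1"  # assumption: check Clore's API docs for the real base URL


def pick_cheapest_offer(offers, max_price_hr=0.30, min_vram_gb=16):
    """Pure helper: choose the cheapest marketplace offer that meets the specs."""
    eligible = [
        o for o in offers
        if o["price_hr"] <= max_price_hr and o["vram_gb"] >= min_vram_gb
    ]
    return min(eligible, key=lambda o: o["price_hr"]) if eligible else None


class TrialServer:
    """Rent one Clore server for the lifetime of a single Optuna trial."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.order_id = None

    def provision(self, offer: dict) -> None:
        # Endpoint path and payload fields below are illustrative placeholders.
        resp = requests.post(
            f"{CLORE_API}/create_order",
            headers={"auth": self.api_key},
            json={"currency": "usd", "server_id": offer["id"]},
            timeout=30,
        )
        resp.raise_for_status()
        self.order_id = resp.json()["order_id"]

    def release(self) -> None:
        """Cancel the order so billing stops; safe to call twice."""
        if self.order_id is None:
            return
        requests.post(
            f"{CLORE_API}/cancel_order",
            headers={"auth": self.api_key},
            json={"id": self.order_id},
            timeout=30,
        )
        self.order_id = None
```

Always call `release()` in a `finally` block so a crashed trial does not keep billing.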

Step 2: Cost Tracker
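One way to implement the tracker, a minimal sketch (the `can_afford` pre-check and the field names are our own naming, not a fixed API):

```python
class CostTracker:
    """Track per-trial GPU spend against a hard budget cap."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.trial_costs: dict[int, float] = {}

    def record(self, trial_number: int, price_hr: float, seconds: float) -> float:
        """Charge a trial for `seconds` of GPU time; returns the cost added."""
        cost = price_hr * seconds / 3600.0
        self.trial_costs[trial_number] = self.trial_costs.get(trial_number, 0.0) + cost
        return cost

    @property
    def total(self) -> float:
        return sum(self.trial_costs.values())

    def can_afford(self, price_hr: float, est_seconds: float) -> bool:
        """Refuse to start a trial whose worst-case cost would bust the budget."""
        return self.total + price_hr * est_seconds / 3600.0 <= self.budget_usd
```

The study runner checks `can_afford()` before provisioning each new trial, which is what makes the budget cap hard rather than advisory.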

Step 3: Remote Training Script

This script runs on the rented GPU. It accepts hyperparameters via environment variables and prints intermediate metrics in a fixed, parseable format, so the driver can relay them to Optuna for pruning decisions.
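A minimal sketch of the remote script. The `TRIAL_*` variable names and the `METRIC ...` line format are conventions invented for this example, and the actual training loop is stubbed out:

```python
# train_remote.py -- runs on the rented GPU.
# Hyperparameters arrive as TRIAL_* environment variables (our convention).
import os


def read_hparams(env=os.environ):
    """Parse hyperparameters from environment variables, with defaults."""
    return {
        "lr": float(env.get("TRIAL_LR", "1e-3")),
        "weight_decay": float(env.get("TRIAL_WEIGHT_DECAY", "1e-4")),
        "batch_size": int(env.get("TRIAL_BATCH_SIZE", "128")),
        "epochs": int(env.get("TRIAL_EPOCHS", "10")),
    }


def train(hparams):
    for epoch in range(hparams["epochs"]):
        # ... real training and validation steps go here ...
        val_acc = 0.0  # placeholder for the measured validation accuracy
        # One line per epoch in a fixed format the driver can parse over SSH:
        print(f"METRIC epoch={epoch} val_acc={val_acc:.4f}", flush=True)


if __name__ == "__main__":
    train(read_hparams())
```

`flush=True` matters: without it, the driver watching stdout over SSH may not see a metric line until the buffer fills, delaying pruning decisions.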

Step 4: Optuna Objective with Clore Provisioning

Step 5: Study Runner with Parallel Trials

Step 6: Terminal Results Summary
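One way to render the summary with `rich` (which the prerequisites install). The `top_trials` helper is pure and works on anything with `.number`, `.value`, and `.params` attributes, such as the `FrozenTrial` objects in `study.trials`:

```python
from rich.console import Console
from rich.table import Table


def top_trials(trials, k=5):
    """Pure helper: best finished trials as (number, value, params) tuples."""
    done = [t for t in trials if t.value is not None]
    done.sort(key=lambda t: t.value, reverse=True)
    return [(t.number, t.value, t.params) for t in done[:k]]


def print_summary(trials, k=5):
    table = Table(title=f"Top {k} trials")
    table.add_column("Trial", justify="right")
    table.add_column("Value", justify="right")
    table.add_column("Params")
    for number, value, params in top_trials(trials, k):
        table.add_row(str(number), f"{value:.4f}", str(params))
    Console().print(table)
```

Call `print_summary(study.trials)` after `study.optimize(...)` returns.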

Distributed Study with PostgreSQL Backend

When running many parallel trials from multiple machines, use PostgreSQL as the Optuna storage backend so all workers share state.

Optuna Dashboard

While the sweep runs, launch the Optuna dashboard for a web UI:
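Point the dashboard at the same storage URL the study uses (shown here assuming the SQLite file from the study runner above):

```bash
optuna-dashboard sqlite:///sweep.db --port 8080

# or, for the PostgreSQL backend (placeholder credentials):
# optuna-dashboard postgresql://user:pass@host:5432/sweeps --port 8080
```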

Open http://localhost:8080 to see real-time trial progress, parameter importance, and the optimization history plot.

Pruning Deep-Dive: How MedianPruner Saves Money

Cost Optimization Results

| Strategy | Trials | Avg Cost/Trial | Total Cost | Best Acc |
| --- | --- | --- | --- | --- |
| Sequential, no pruning | 30 | $0.55 | $16.50 | 0.921 |
| Parallel (3x), no pruning | 30 | $0.55 | $16.50 | 0.924 |
| Parallel (3x) + MedianPruner | 30 | $0.21 | $6.30 | 0.922 |
| Parallel (5x) + MedianPruner | 50 | $0.18 | $9.00 | 0.931 |

Pruning cuts cost by roughly 60% while matching or beating the unpruned accuracy, because the savings let more trials fit within the same budget.

Tips for Effective Sweeps

1. Start narrow, then expand. Run 10–15 trials to get a feel for the landscape, then expand n_trials once you know which parameters matter.

2. Use log-scale for learning rate and weight decay. Both span multiple orders of magnitude — log-scale sampling finds good values much faster.

3. Prune aggressively for cheap GPUs. If you're using budget GPUs ($0.10–0.20/hr), set n_warmup_steps=1 to prune as early as epoch 1.

4. Set load_if_exists=True to resume interrupted sweeps without losing completed trials.

5. Budget = your real constraint. Set budget_usd before n_trials. If the budget runs out, no more trials start — you won't get surprise bills.

6. Keep trials short. A 10-epoch trial on CIFAR-10 is usually predictive enough. Long trials burn money on runs that are clearly not the best.

Next Steps
