An end-to-end hyperparameter optimization system that uses Optuna to intelligently search the parameter space while dynamically provisioning and releasing Clore GPUs for each trial. Each trial gets its own rented GPU, runs in isolation, and is pruned early if it's clearly not competitive — saving real money.
Key Features:
- Full Optuna integration with Clore GPU provisioning
- Parallel trial execution across multiple rented GPUs
- MedianPruner to kill underperforming trials early
- PostgreSQL/SQLite database backend for distributed studies
```bash
pip install optuna optuna-dashboard requests paramiko rich

# For PostgreSQL backend (recommended for parallel runs):
pip install psycopg2-binary

# For plotting:
pip install plotly kaleido
```
Architecture Overview
Step 1: Clore GPU Provisioner for Trials
Each Optuna trial provisions its own Clore server, runs training, streams the result back, then releases the server.
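The provisioning logic can be sketched as a context manager so the server is always released, even when a trial crashes or is pruned. The `api` client below (`list_offers`, `rent`, `release`) and the offer fields (`id`, `price_per_hr`) are hypothetical stand-ins for the real Clore HTTP calls; consult the Clore.ai API documentation for the actual contract.

```python
def pick_cheapest(offers, max_price_per_hr):
    """Pick the cheapest marketplace offer under a price ceiling.

    Each offer is assumed to look like {"id": ..., "price_per_hr": ...}
    (illustrative field names). Returns None when nothing fits the budget.
    """
    candidates = [o for o in offers if o["price_per_hr"] <= max_price_per_hr]
    return min(candidates, key=lambda o: o["price_per_hr"], default=None)


class TrialServer:
    """Context manager sketch: rent a server on entry, always release on exit."""

    def __init__(self, api, max_price_per_hr=0.25):
        # `api` is a hypothetical client wrapping the real Clore endpoints.
        self.api = api
        self.max_price = max_price_per_hr
        self.order_id = None

    def __enter__(self):
        offer = pick_cheapest(self.api.list_offers(), self.max_price)
        if offer is None:
            raise RuntimeError("no offer under the price ceiling")
        self.order_id = self.api.rent(offer["id"])
        return self

    def __exit__(self, exc_type, exc, tb):
        if self.order_id is not None:
            self.api.release(self.order_id)  # release even if the trial failed
        return False  # propagate trial exceptions to Optuna
```

The `__exit__` guarantee is what keeps a pruned or crashed trial from leaking a rented GPU.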
Step 2: Cost Tracker
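The tracker only needs two jobs: accumulate per-trial spend and answer "can we afford another trial?" before provisioning. This is a minimal sketch; the class and method names (`CostTracker`, `can_afford`) are this guide's illustrations, not part of any API.

```python
class CostTracker:
    """Accumulate per-trial GPU spend against a hard budget (sketch)."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.per_trial = {}  # trial_number -> cost in USD

    def record(self, trial_number, hours, price_per_hr):
        """Record one trial's rental cost and return it."""
        cost = hours * price_per_hr
        self.per_trial[trial_number] = cost
        self.spent_usd += cost
        return cost

    def can_afford(self, est_hours, price_per_hr):
        """True if an estimated trial still fits within the budget."""
        return self.spent_usd + est_hours * price_per_hr <= self.budget_usd
```

Checking `can_afford` before renting, rather than after, is what prevents the surprise bills mentioned in the tips below.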
Step 3: Remote Training Script
This script runs on the rented GPU. It accepts hyperparameters via environment variables and prints intermediate metrics after each epoch so the controller can decide whether to prune the trial.
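The shape of that script can be sketched as follows. The environment variable names (`LR`, `BATCH_SIZE`, ...) and the one-JSON-line-per-epoch convention are this guide's assumptions, not a Clore requirement; a real script would train an actual model between `report` calls.

```python
# train_remote.py (sketch) — runs on the rented GPU.
import json
import os


def read_hparams(env=os.environ):
    """Parse hyperparameters from environment variables, with defaults."""
    return {
        "lr": float(env.get("LR", "1e-3")),
        "batch_size": int(env.get("BATCH_SIZE", "128")),
        "weight_decay": float(env.get("WEIGHT_DECAY", "0")),
        "epochs": int(env.get("EPOCHS", "10")),
    }


def report(epoch, val_acc):
    """Emit one machine-parseable line per epoch.

    The controller greps these from the SSH stream and feeds them to
    trial.report() so the pruner can act mid-run.
    """
    print(json.dumps({"epoch": epoch, "val_acc": val_acc}), flush=True)
```

`flush=True` matters: without it, output can sit in a buffer and the controller would see the metrics only after the run ends, defeating early pruning.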
Step 4: Optuna Objective with Clore Provisioning
Step 5: Study Runner with Parallel Trials
Step 6: Terminal Results Summary
Distributed Study with PostgreSQL Backend
When running many parallel trials from multiple machines, use PostgreSQL as the Optuna storage backend so all workers share state.
Optuna Dashboard
While the sweep runs, launch the Optuna dashboard for a web UI:
Open http://localhost:8080 to see real-time trial progress, parameter importance, and the optimization history plot.
Pruning Deep-Dive: How MedianPruner Saves Money
Cost Optimization Results
| Strategy | Trials | Avg Cost/Trial | Total Cost | Best Acc |
|---|---|---|---|---|
| Sequential, no pruning | 30 | $0.55 | $16.50 | 0.921 |
| Parallel (3x), no pruning | 30 | $0.55 | $16.50 | 0.924 |
| Parallel (3x) + MedianPruner | 30 | $0.21 | $6.30 | 0.922 |
| Parallel (5x) + MedianPruner | 50 | $0.18 | $9.00 | 0.931 |
Pruning cuts total cost by roughly 60% while matching or beating the unpruned results, because more trials fit within the same budget.
Tips for Effective Sweeps
1. Start narrow, then expand. Run 10–15 trials to get a feel for the landscape, then expand n_trials once you know which parameters matter.
2. Use log-scale for learning rate and weight decay. Both span multiple orders of magnitude — log-scale sampling finds good values much faster.
3. Prune aggressively for cheap GPUs. If you're using budget GPUs ($0.10–0.20/hr), set n_warmup_steps=1 to prune as early as epoch 1.
4. Set load_if_exists=True to resume interrupted sweeps without losing completed trials.
5. Budget = your real constraint. Set budget_usd before n_trials. If the budget runs out, no more trials start — you won't get surprise bills.
6. Keep trials short. A 10-epoch trial on CIFAR-10 is usually predictive enough. Long trials burn money on runs that are clearly not the best.
```python
# print_results.py
"""Print a formatted summary of sweep results to the terminal."""
import json
import sys
from pathlib import Path

from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.text import Text

console = Console()


def print_sweep_summary(results_path: str = "sweep_results/results.json"):
    path = Path(results_path)
    if not path.exists():
        console.print(f"[red]Results not found: {path}[/]")
        sys.exit(1)

    with open(path) as f:
        data = json.load(f)

    best = data["best"]
    trials = data["all_trials"]
    costs = data["cost_summary"]

    # ── Header ────────────────────────────────────────────────────
    console.print(Panel(
        f"[bold cyan]Best Trial #{best['trial_number']}[/]\n"
        f"[green]Val Accuracy: {best['value']:.4f}[/]\n\n"
        + "\n".join(f"  {k}: {v}" for k, v in best["params"].items()),
        title="🏆 Sweep Results",
    ))

    # ── Cost summary ──────────────────────────────────────────────
    console.print(Panel(
        f"Total trials: {costs['total_trials']}\n"
        f"Pruned: {costs['pruned_trials']} "
        f"({100*costs['pruned_trials']/max(costs['total_trials'],1):.0f}%)\n"
        f"Failed: {costs['failed_trials']}\n"
        f"Total cost: ${costs['total_cost_usd']:.4f}\n"
        f"Budget used: {costs['budget_used_pct']:.1f}%\n"
        f"Cost per trial: ${costs['total_cost_usd']/max(costs['total_trials'],1):.4f}",
        title="💸 Cost Summary",
    ))

    # ── Top 10 trials table ───────────────────────────────────────
    completed = [
        t for t in trials
        if t["state"] == "TrialState.COMPLETE" and t["value"] is not None
    ]
    completed.sort(key=lambda t: -(t["value"] or 0))

    table = Table(title="📊 Top Trials", show_header=True,
                  header_style="bold cyan")
    table.add_column("#", width=5)
    table.add_column("Acc", width=8)
    table.add_column("LR", width=10)
    table.add_column("Batch", width=6)
    table.add_column("WD", width=10)
    table.add_column("Dropout", width=8)
    table.add_column("Optim", width=8)

    for t in completed[:10]:
        p = t["params"]
        highlight = "bold green" if t["number"] == best["trial_number"] else "white"
        table.add_row(
            Text(str(t["number"]), style=highlight),
            Text(f"{t['value']:.4f}", style=highlight),
            f"{p.get('lr', 0):.2e}",
            str(p.get("batch_size", "")),
            f"{p.get('weight_decay', 0):.2e}",
            f"{p.get('dropout', 0):.2f}",
            p.get("optimizer", ""),
        )
    console.print(table)


if __name__ == "__main__":
    results_path = sys.argv[1] if len(sys.argv) > 1 else "sweep_results/results.json"
    print_sweep_summary(results_path)
```
```bash
# Start PostgreSQL (or use a managed service)
docker run -d --name optuna-db \
  -e POSTGRES_DB=optuna \
  -e POSTGRES_USER=optuna \
  -e POSTGRES_PASSWORD=optuna_pw \
  -p 5432:5432 \
  postgres:15
```
```python
# run_distributed_sweep.py
"""Run the sweep from multiple machines using a shared PostgreSQL study."""
import optuna

STORAGE = "postgresql://optuna:optuna_pw@your-db-host:5432/optuna"
STUDY_NAME = "clore-cifar10-sweep"

# All workers load the same study; Optuna coordinates trial assignment
study = optuna.load_study(
    study_name=STUDY_NAME,
    storage=STORAGE,
)

# Each machine runs its own subset of trials
study.optimize(
    objective,    # same CloreObjective as above
    n_trials=10,  # this machine's share
    n_jobs=2,     # 2 parallel GPUs on this machine
)
```
```bash
pip install optuna-dashboard

optuna-dashboard sqlite:///sweep.db
# or
optuna-dashboard postgresql://optuna:optuna_pw@localhost:5432/optuna
```
```python
# The pruner compares each trial's intermediate value
# against the median of all completed trials at the same step.
# Trials below the median are killed early.
pruner = optuna.pruners.MedianPruner(
    n_startup_trials=5,  # Complete 5 trials before pruning begins
    n_warmup_steps=2,    # Never prune before step (epoch) 2
    interval_steps=1,    # Check at every reported step
)

# Alternative: PercentilePruner (kills bottom X% instead of below median)
pruner = optuna.pruners.PercentilePruner(
    percentile=25.0,  # Kill bottom 25%
    n_startup_trials=5,
    n_warmup_steps=3,
)

# Alternative: HyperbandPruner (Hyperband algorithm, very aggressive)
pruner = optuna.pruners.HyperbandPruner(
    min_resource=1,
    max_resource=10,
    reduction_factor=3,
)
```
```python
lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)  # ✅
lr = trial.suggest_float("lr", 0.00001, 0.01)         # ❌ wastes budget on tiny values
```