ClearML


ClearML (formerly Trains) is an open-source MLOps platform for experiment tracking, data versioning, model management, pipeline orchestration, and compute resource management — all in one unified suite.

Overview

ClearML is a comprehensive ML lifecycle management platform from Allegro AI. It automatically captures experiment parameters, metrics, artifacts, and code with minimal code changes. ClearML supports the full ML workflow: from data management and experiment tracking to model registry, automated pipelines, and distributed task execution on GPU clusters.

| Property | Value |
| --- | --- |
| Category | MLOps / Experiment Tracking |
| Developer | Allegro AI |
| License | Apache 2.0 |
| Stars | 5.5K+ |
| Docker Hub | `allegroai/clearml` |
| Ports | 22 (SSH), 8008 (API Server), 8080 (Web UI), 8081 (File Server) |


Architecture

ClearML consists of five main components:

| Component | Port | Description |
| --- | --- | --- |
| ClearML Server | n/a | Backend coordinator |
| Web UI | 8080 | Browser-based dashboard |
| API Server | 8008 | REST API for SDK and agents |
| File Server | 8081 | Artifact and model storage |
| ClearML Agent | n/a | Worker that executes ML tasks |


Key Features

  • Two-line experiment tracking — add two lines of code to capture everything automatically

  • Automatic logging — metrics, parameters, models, console output, plots, images

  • Git integration — auto-capture git commit, diff, and uncommitted changes

  • Data management — versioned datasets with lineage tracking

  • Model registry — store, version, and serve ML models

  • Pipeline orchestration — build and run multi-step ML pipelines

  • Remote execution — queue experiments and run on remote GPU workers (ClearML Agent)

  • Hyperparameter optimization — automated HPO with pluggable search strategies (grid, random, Optuna, BOHB)

  • Resource monitoring — GPU/CPU/RAM monitoring per experiment

  • Self-hosted or cloud — run your own server or use ClearML's hosted platform


Clore.ai Setup

Option 1 — Full Self-Hosted Server

Run the ClearML server on Clore.ai for full control.

Step 1 — Choose a Server

| Use Case | Recommended | VRAM | RAM |
| --- | --- | --- | --- |
| Server only (no training) | CPU instance | n/a | 8 GB+ |
| Server + training | RTX 3080 | 10 GB | 16 GB |
| Full MLOps cluster | Multiple GPUs | | 32 GB+ |

Step 2 — Rent a Server on Clore.ai

  1. Go to clore.ai → Marketplace

  2. For the server component: CPU instances work fine

  3. For training workers: GPU instances (RTX 3090, 4090, A100)

  4. Open ports: 22, 8008, 8081

  5. Ensure ≥ 50 GB disk for experiment artifacts

Step 3 — Deploy with Docker Compose

Create docker-compose.yml:
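The official `docker-compose.yml` from the clearml-server repository is the recommended starting point; the sketch below is abridged to the core services (service names, environment variables, and image tags follow the official file, but check it for the full set of volumes and settings):

```yaml
version: "3.6"
services:
  apiserver:
    image: allegroai/clearml:latest
    command: apiserver
    ports: ["8008:8008"]
    depends_on: [mongo, redis, elasticsearch]
    environment:
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
  webserver:
    image: allegroai/clearml:latest
    command: webserver
    ports: ["8080:80"]
    depends_on: [apiserver]
  fileserver:
    image: allegroai/clearml:latest
    command: fileserver
    ports: ["8081:8081"]
    volumes: ["./data/fileserver:/mnt/fileserver"]
  mongo:
    image: mongo:4.4
    volumes: ["./data/mongo:/data/db"]
  redis:
    image: redis:6.2
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.18
    environment:
      discovery.type: single-node
      ES_JAVA_OPTS: "-Xms2g -Xmx2g"
```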

Start the stack:
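A typical startup sequence (Elasticsearch requires raising `vm.max_map_count` on the host first):

```shell
# Elasticsearch refuses to start with the default mmap limit
sudo sysctl -w vm.max_map_count=262144

docker compose up -d               # start all services in the background
docker compose ps                  # verify every container is up
docker compose logs -f apiserver   # watch the API server finish booting
```

The Web UI becomes available at http://<server-ip>:8080 once Elasticsearch finishes indexing.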


Option 2 — Use ClearML Hosted (Free)

For experiment tracking without running a server, use the free hosted plan:
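With the hosted plan there is nothing to deploy; point the SDK at app.clear.ml instead of your own server:

```shell
pip install clearml
clearml-init   # paste the credentials created at https://app.clear.ml (Settings → Workspace)
```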


Accessing the Interface

Web Dashboard

Open http://<server-ip>:8080 in your browser. No default credentials are set; create your account on first login.

API Server
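A quick liveness check against the API server (the `debug.ping` endpoint is assumed here; any authenticated endpoint works as well):

```shell
curl http://<server-ip>:8008/debug.ping
```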

Via SSH
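If the ports are not exposed publicly, tunnel them over SSH (substitute the SSH port shown in your Clore.ai dashboard):

```shell
ssh -p <ssh-port> root@<server-ip> \
    -L 8080:localhost:8080 \
    -L 8008:localhost:8008 \
    -L 8081:localhost:8081
```

Then open http://localhost:8080 in your local browser.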


SDK Integration

Installation
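The SDK is a single pip package:

```shell
pip install clearml
```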

Initial Configuration

When prompted, paste the configuration generated by the dashboard: API server (http://<server-ip>:8008), web server (http://<server-ip>:8080), file server (http://<server-ip>:8081), and the API credentials created under Settings → Workspace.
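Run the setup wizard and paste the configuration block the dashboard generates (the host values below are illustrative):

```shell
clearml-init
# Paste the snippet from Settings → Workspace → "Create new credentials", e.g.:
# api {
#   web_server: http://<server-ip>:8080
#   api_server: http://<server-ip>:8008
#   files_server: http://<server-ip>:8081
#   credentials { "access_key" = "...", "secret_key" = "..." }
# }
```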

Or configure programmatically:
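For containers or CI jobs where a `clearml.conf` file is inconvenient, credentials can be set in code (placeholder hosts and keys shown):

```python
from clearml import Task

# Placeholder values; use the credentials generated in your dashboard
Task.set_credentials(
    api_host="http://<server-ip>:8008",
    web_host="http://<server-ip>:8080",
    files_host="http://<server-ip>:8081",
    key="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY",
)
```

Equivalently, set the `CLEARML_API_HOST`, `CLEARML_WEB_HOST`, `CLEARML_FILES_HOST`, `CLEARML_API_ACCESS_KEY`, and `CLEARML_API_SECRET_KEY` environment variables.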


Tracking Experiments

Minimal Integration (2 lines)
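These are the two lines in question; the project and task names are your choice:

```python
from clearml import Task

# Everything else in the training script stays unchanged; ClearML hooks the
# common frameworks and logs parameters, metrics, and models automatically.
task = Task.init(project_name="my-project", task_name="baseline-run")
```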

Manual Metric Logging
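For metrics not emitted through a hooked framework, report them explicitly via the task logger:

```python
from clearml import Task

task = Task.init(project_name="my-project", task_name="manual-logging")
logger = task.get_logger()

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # stand-in for your real loss value
    logger.report_scalar(title="loss", series="train",
                         value=train_loss, iteration=epoch)

logger.report_text("training finished")
```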

Hyperparameter Tracking
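Connecting a plain dict registers each key as a tracked (and UI-editable) hyperparameter:

```python
from clearml import Task

task = Task.init(project_name="my-project", task_name="hparam-demo")

params = {"lr": 3e-4, "batch_size": 64, "epochs": 20}
params = task.connect(params)  # returned dict reflects any UI overrides
```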


Data Management
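A sketch of the `Dataset` API for creating and consuming a versioned dataset (paths and names are illustrative):

```python
from clearml import Dataset

# Create, populate, and publish a dataset version
ds = Dataset.create(dataset_name="images-v1", dataset_project="my-project")
ds.add_files(path="./data/images")
ds.upload()      # push files to the file server (or S3/GCS if configured)
ds.finalize()    # freeze this version; later versions can build on it

# On any other machine: fetch a cached, read-only local copy
local_path = (
    Dataset.get(dataset_name="images-v1", dataset_project="my-project")
    .get_local_copy()
)
```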


Model Registry
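Registering and retrieving model weights, sketched with the `OutputModel`/`InputModel` classes (the model ID is a placeholder):

```python
from clearml import Task, OutputModel, InputModel

task = Task.init(project_name="my-project", task_name="train-model")

# Publish trained weights to the registry (uploads to the file server)
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights("model.pt")

# Elsewhere: pull a registered model by its ID for inference
model = InputModel(model_id="<model-id>")
weights_path = model.get_weights()  # local path to the downloaded weights
```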


Pipeline Orchestration
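A minimal `PipelineController` sketch chaining two existing tasks (the referenced base tasks are assumed to already exist in the project):

```python
from clearml.automation.controller import PipelineController

pipe = PipelineController(name="train-pipeline", project="my-project", version="1.0")

# Each step clones an existing task and submits it to an agent queue
pipe.add_step(
    name="prepare_data",
    base_task_project="my-project",
    base_task_name="prepare-data",
)
pipe.add_step(
    name="train",
    parents=["prepare_data"],  # runs only after prepare_data succeeds
    base_task_project="my-project",
    base_task_name="baseline-run",
)

pipe.set_default_execution_queue("default")
pipe.start()
```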


ClearML Agent (Worker)

Run a ClearML Agent on a GPU server to execute queued experiments:
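On each worker node:

```shell
pip install clearml-agent
clearml-agent init                                         # point the agent at your server
clearml-agent daemon --queue default --gpus 0 --detached   # pull jobs from the "default" queue
```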

On Clore.ai, spin up multiple GPU nodes as ClearML agents to create a distributed compute cluster.


Hyperparameter Optimization
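A hedged sketch using `HyperParameterOptimizer` with the Optuna backend (requires `pip install optuna`; the template task ID and metric names are assumptions about your training task):

```python
from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer, UniformParameterRange, DiscreteParameterRange,
)
from clearml.automation.optuna import OptimizerOptuna

task = Task.init(project_name="my-project", task_name="hpo-controller",
                 task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<template-task-id>",   # existing training task to clone
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=1e-5, max_value=1e-2),
        DiscreteParameterRange("General/batch_size", values=[32, 64, 128]),
    ],
    objective_metric_title="loss",
    objective_metric_series="validation",
    objective_metric_sign="min",
    optimizer_class=OptimizerOptuna,
    execution_queue="default",           # trials run on agents in this queue
    max_number_of_concurrent_tasks=4,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```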


Monitoring & Alerts
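Per-experiment GPU/CPU/RAM usage is captured automatically by `Task.init`. For simple alerting, one option is to poll the server for failed tasks and forward them to your notification channel of choice (a hedged sketch; the project name is illustrative):

```python
from clearml import Task

# Query the server for failed tasks in a project
failed = Task.get_tasks(
    project_name="my-project",
    task_filter={"status": ["failed"]},
)
for t in failed:
    # Replace print with a Slack/email/webhook notification as needed
    print(f"FAILED: {t.name} ({t.id})")
```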


Troubleshooting


Experiments not appearing in UI — Check that CLEARML_API_HOST in your SDK config points to http://<server-ip>:8008, not localhost.


Out of disk space — ClearML stores all artifacts locally. Configure S3/GCS storage or increase disk allocation in Clore.ai.

| Issue | Fix |
| --- | --- |
| MongoDB connection refused | Check the mongo container: `docker logs clearml_mongo_1` |
| Task stuck in queue | Ensure a ClearML Agent is running and listening on that queue |
| Slow UI | Elasticsearch needs time to index; wait 2–3 min after startup |
| API 401 Unauthorized | Regenerate API credentials in the ClearML web dashboard |


Use Cases for GPU Researchers

  • Track training runs — never lose hyperparameters or results again

  • Compare experiments — side-by-side metric comparison in the UI

  • Reproduce results — ClearML captures git commit + code diff automatically

  • Share results — collaborators see all experiments in the shared dashboard

  • Remote GPU jobs — queue training jobs from laptop, run on Clore.ai GPU nodes

  • Automated HPO — run hyperparameter search across multiple GPU nodes in parallel



ClearML on Clore.ai combines experiment tracking with GPU compute management — giving your ML team full MLOps capabilities without cloud vendor lock-in.


Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production Training | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Scale Experiments | A100 80GB | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
