# ClearML

{% hint style="info" %}
**ClearML** (formerly Trains) is an open-source MLOps platform for experiment tracking, data versioning, model management, pipeline orchestration, and compute resource management — all in one unified suite.
{% endhint %}

## Overview

ClearML is a comprehensive ML lifecycle management platform, developed by the company of the same name (formerly Allegro AI). It automatically captures experiment parameters, metrics, artifacts, and code with minimal code changes. ClearML supports the full ML workflow: from data management and experiment tracking to model registry, automated pipelines, and distributed task execution on GPU clusters.

| Property       | Value                                                     |
| -------------- | --------------------------------------------------------- |
| **Category**   | MLOps / Experiment Tracking                               |
| **Developer**  | ClearML (formerly Allegro AI)                             |
| **License**    | Apache 2.0                                                |
| **GitHub**     | [allegroai/clearml](https://github.com/allegroai/clearml) |
| **Stars**      | 5.5K+                                                     |
| **Docker Hub** | `allegroai/clearml`                                       |
| **Ports**      | 22 (SSH), 8008 (API Server), 8080 (Web UI), 8081 (File Server) |

***

## Architecture

ClearML consists of five main components:

| Component          | Port | Description                   |
| ------------------ | ---- | ----------------------------- |
| **ClearML Server** | —    | Backend coordinator           |
| **Web UI**         | 8080 | Browser-based dashboard       |
| **API Server**     | 8008 | REST API for SDK and agents   |
| **File Server**    | 8081 | Artifact and model storage    |
| **ClearML Agent**  | —    | Worker that executes ML tasks |

***

## Key Features

* **Two-line experiment tracking** — add two lines of code to capture everything automatically
* **Automatic logging** — metrics, parameters, models, console output, plots, images
* **Git integration** — auto-capture git commit, diff, and uncommitted changes
* **Data management** — versioned datasets with lineage tracking
* **Model registry** — store, version, and serve ML models
* **Pipeline orchestration** — build and run multi-step ML pipelines
* **Remote execution** — queue experiments and run on remote GPU workers (ClearML Agent)
* **Hyperparameter optimization** — automated HPO with grid, random, and Bayesian (Optuna / BOHB) search strategies
* **Resource monitoring** — GPU/CPU/RAM monitoring per experiment
* **Self-hosted or cloud** — run your own server or use ClearML's hosted platform

***

## Clore.ai Setup

### Option 1 — Full Self-Hosted Server

Run the ClearML server on Clore.ai for full control.

### Step 1 — Choose a Server

| Use Case                  | Recommended   | VRAM  | RAM    |
| ------------------------- | ------------- | ----- | ------ |
| Server only (no training) | CPU instance  | —     | 8 GB+  |
| Server + training         | RTX 3080      | 10 GB | 16 GB  |
| Full MLOps cluster        | Multiple GPUs | —     | 32 GB+ |

### Step 2 — Rent a Server on Clore.ai

1. Go to [clore.ai](https://clore.ai) → **Marketplace**
2. For the **server** component: CPU instances work fine
3. For **training workers**: GPU instances (RTX 3090, 4090, A100)
4. Open ports: **22**, **8008**, **8080**, **8081**
5. Ensure **≥ 50 GB disk** for experiment artifacts

### Step 3 — Deploy with Docker Compose

Create `docker-compose.yml`:

```yaml
version: "3.6"

services:
  apiserver:
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
      - /opt/clearml/logs:/var/log/clearml
      - /opt/clearml/config:/opt/clearml/config
      - /opt/clearml/data/fileserver:/mnt/fileserver
    environment:
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_ELASTICSEARCH_SERVICE_HOST: elasticsearch
      CLEARML_ELASTICSEARCH_SERVICE_PORT: 9200
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
    ports:
      - "8008:8008"
    depends_on:
      - mongo
      - elasticsearch
      - redis

  webserver:
    image: allegroai/clearml-webserver:latest
    restart: unless-stopped
    ports:
      - "8081:80"
    environment:
      CLEARML_API_HOST: http://$CLEARML_HOST_IP:8008  # must be reachable from the browser, not localhost

  fileserver:
    image: allegroai/clearml-fileserver:latest
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/fileserver:/mnt/fileserver
    ports:
      - "8081:8081"

  mongo:
    image: mongo:4.4
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/mongo:/data/db
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.6
    restart: unless-stopped
    environment:
      ES_JAVA_OPTS: "-Xms512m -Xmx2048m"
      bootstrap.memory_lock: "true"
      cluster.name: "clearml"
      discovery.type: "single-node"
      http.publish_host: "$CLEARML_HOST_IP"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - /opt/clearml/data/elastic:/usr/share/elasticsearch/data

  redis:
    image: redis:6
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/redis:/data

networks:
  default:
    name: clearml_network
```

Start the stack:

```bash
mkdir -p /opt/clearml/{logs,config,data/{fileserver,mongo,elastic,redis}}

# Set your server's public IP
export CLEARML_HOST_IP=<your-server-ip>

docker-compose up -d
```

{% hint style="warning" %}
ClearML Server requires \~4 GB RAM for the full stack (MongoDB + Elasticsearch + Redis + API server + WebUI). Make sure your Clore.ai instance has sufficient RAM.
{% endhint %}

### Option 2 — Use ClearML Hosted (Free)

For experiment tracking without running a server, use the free hosted plan:

```bash
# Install SDK
pip install clearml

# Configure with hosted server
clearml-init
# Enter: https://api.clear.ml  when prompted for API host
# Get credentials from: https://app.clear.ml/settings/workspace-configuration
```

***

## Accessing the Interface

### Web Dashboard

```
http://<server-ip>:8080
```

There are no default credentials; create your account on first login.

### API Server

```
http://<server-ip>:8008
```
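
Before configuring the SDK, it can help to confirm the API server is reachable from your workstation. A minimal sketch using `requests`; `debug.ping` is the lightweight health endpoint the official server containers use for their own healthchecks, and it should answer even before any credentials exist:

```python
import requests

API_HOST = "http://<server-ip>:8008"  # replace with your Clore.ai server IP

# debug.ping responds without authentication, so it works on a fresh install
resp = requests.get(f"{API_HOST}/debug.ping", timeout=10)
print(resp.status_code, resp.text)
```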

### Via SSH

```bash
ssh root@<server-ip> -p 22
```

***

## SDK Integration

### Installation

```bash
pip install clearml
```

### Initial Configuration

```bash
clearml-init
```

Enter your API server URL (`http://<server-ip>:8008`) and the credentials generated in the web dashboard (**Settings → Workspace → Create new credentials**).

Or configure programmatically:

```python
from clearml import Task

Task.set_credentials(
    api_host="http://<server-ip>:8008",
    web_host="http://<server-ip>:8081",
    files_host="http://<server-ip>:8081",
    key="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY"
)
```
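
Alternatively, the SDK reads the same settings from environment variables, which is convenient for CI jobs or headless Clore.ai instances where you do not want to write a `clearml.conf`. Hosts and keys below are placeholders:

```python
import os

# Set before importing clearml so the SDK picks the values up
os.environ["CLEARML_API_HOST"] = "http://<server-ip>:8008"
os.environ["CLEARML_WEB_HOST"] = "http://<server-ip>:8080"
os.environ["CLEARML_FILES_HOST"] = "http://<server-ip>:8081"
os.environ["CLEARML_API_ACCESS_KEY"] = "YOUR_ACCESS_KEY"
os.environ["CLEARML_API_SECRET_KEY"] = "YOUR_SECRET_KEY"

from clearml import Task

task = Task.init(project_name="MyProject", task_name="env-config-check")
task.close()
```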

***

## Tracking Experiments

### Minimal Integration (2 lines)

```python
from clearml import Task

# Initialize task — this captures EVERYTHING automatically
task = Task.init(project_name="MyProject", task_name="experiment-001")

# Your existing training code — no changes needed
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    loss = torch.tensor(1.0 / (epoch + 1))
    # With supported frameworks (PyTorch Lightning, Keras, TensorBoard, etc.)
    # ClearML hooks into their reporting automatically; this toy loop only
    # captures console output, resource usage, and hyperparameters
    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

task.close()
```
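
Beyond metrics, a task can also store arbitrary artifacts (config dicts, prediction files, dataframes) next to the run. A short sketch; the file name and contents are only illustrative:

```python
from pathlib import Path
from clearml import Task

task = Task.init(project_name="MyProject", task_name="artifact-demo")

# Dict artifacts are serialized and browsable in the UI's Artifacts tab
task.upload_artifact(name="run-config", artifact_object={"lr": 0.001, "batch_size": 32})

# File artifacts are uploaded to the file server (port 8081 in this setup)
Path("predictions.csv").write_text("id,prediction\n1,0.93\n")
task.upload_artifact(name="predictions", artifact_object=Path("predictions.csv"))

task.close()
```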

### Manual Metric Logging

```python
from clearml import Task, Logger

task = Task.init(project_name="MyProject", task_name="manual-logging-demo")
logger = task.get_logger()

for epoch in range(50):
    train_loss = 1.0 / (epoch + 1)
    val_accuracy = 0.95 - 0.5 / (epoch + 1)

    # Log scalars
    logger.report_scalar("Loss", "train", value=train_loss, iteration=epoch)
    logger.report_scalar("Accuracy", "validation", value=val_accuracy, iteration=epoch)

    # Log learning rate
    logger.report_scalar("Learning Rate", "lr", value=0.001 * 0.9**epoch, iteration=epoch)

print("Training complete!")
task.close()
```
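
Scalars are not the only thing the logger can report: histograms, tables, images, and matplotlib figures land under the experiment's Plots and Debug Samples tabs. A small sketch with synthetic values:

```python
import numpy as np
import pandas as pd
from clearml import Task

task = Task.init(project_name="MyProject", task_name="rich-logging-demo")
logger = task.get_logger()

# Histogram of (fake) prediction scores
logger.report_histogram(
    title="Prediction scores",
    series="epoch-0",
    values=np.random.rand(100),
    iteration=0,
)

# DataFrames render as interactive tables in the UI
df = pd.DataFrame({"class": ["cat", "dog"], "precision": [0.91, 0.88]})
logger.report_table(title="Per-class metrics", series="validation", iteration=0, table_plot=df)

task.close()
```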

### Hyperparameter Tracking

```python
from clearml import Task

task = Task.init(project_name="HPO-Demo", task_name="run-001")

# Connect hyperparameters — auto-logged and overrideable remotely
params = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "num_layers": 4,
    "dropout": 0.3,
    "optimizer": "adam",
    "epochs": 100,
}
params = task.connect(params)  # Now overrideable by ClearML HPO

print(f"Training with lr={params['learning_rate']}, batch={params['batch_size']}")
```
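
Because connected parameters can be overridden remotely, the same script can be launched from a laptop and executed on a Clore.ai GPU worker. A sketch using `Task.execute_remotely()`; the queue name assumes an agent is listening on `gpu-queue` (see the ClearML Agent section below):

```python
from clearml import Task

task = Task.init(project_name="HPO-Demo", task_name="run-remote")

params = {"learning_rate": 0.001, "batch_size": 32}
params = task.connect(params)

# Stop the local run here and re-launch this exact script on a remote worker
task.execute_remotely(queue_name="gpu-queue", exit_process=True)

# Everything below only runs on the remote GPU node
print(f"Training remotely with lr={params['learning_rate']}")
```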

***

## Data Management

```python
from clearml import Dataset

# Create a versioned dataset
dataset = Dataset.create(
    dataset_name="my-training-data",
    dataset_project="MyProject",
    dataset_version="1.0",
)

# Add files
dataset.add_files(path="/data/images/", recursive=True)
dataset.add_files(path="/data/labels.csv")

# Upload to ClearML server
dataset.upload()
dataset.finalize()
print(f"Dataset ID: {dataset.id}")

# Later: use the dataset in experiments
dataset = Dataset.get(dataset_name="my-training-data", dataset_version="1.0")
local_path = dataset.get_local_copy()
print(f"Dataset at: {local_path}")
```
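
To publish an updated version of the same dataset, create a child version that lists the previous one as its parent; lineage between versions is tracked automatically. The extra data path below is a placeholder:

```python
from clearml import Dataset

parent = Dataset.get(dataset_name="my-training-data", dataset_version="1.0")

# New version built on top of 1.0
child = Dataset.create(
    dataset_name="my-training-data",
    dataset_project="MyProject",
    dataset_version="1.1",
    parent_datasets=[parent.id],
)
child.add_files(path="/data/images_new/", recursive=True)
child.upload()
child.finalize()
print(f"New version ID: {child.id}")
```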

***

## Model Registry

```python
from clearml import Task, OutputModel, InputModel
import torch

task = Task.init(project_name="ModelRegistry", task_name="training-run")

# After training, register the model
model = torch.nn.Linear(100, 10)
torch.save(model.state_dict(), "my_model.pt")

# Register output model
output_model = OutputModel(task=task, name="MyModel-v1")
output_model.update_weights("my_model.pt")
output_model.publish()  # Mark as ready to use

print(f"Model registered: {output_model.id}")

# In deployment: load model by name
input_model = InputModel(model_id="<model-id-from-dashboard>")
local_model_path = input_model.get_local_copy()
state_dict = torch.load(local_model_path)
```
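
Instead of hard-coding a model ID, recent SDK versions can query the registry by project and name. A sketch using the names from the registration example above:

```python
from clearml import Model

# Look up published models by project/name instead of copying IDs around
models = Model.query_models(
    project_name="ModelRegistry",
    model_name="MyModel-v1",
    only_published=True,
)
for m in models:
    print(m.id, m.name, m.tags)
```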

***

## Pipeline Orchestration

```python
from clearml.automation import PipelineController

def step_preprocess(dataset_id: str) -> str:
    """Data preprocessing step."""
    # Each function step runs as its own ClearML task created by the
    # pipeline controller, so no Task.init() is needed inside the step
    # ... preprocessing logic
    return "processed_data_id"

def step_train(data_id: str, lr: float = 0.001) -> str:
    """Model training step."""
    # ... training logic
    return "model_id"

def step_evaluate(model_id: str) -> float:
    """Model evaluation step."""
    # ... evaluation logic
    return 0.95

# Build pipeline
pipe = PipelineController(
    name="ML-Training-Pipeline",
    project="MyPipelines",
    version="1.0"
)

pipe.add_function_step(
    name="preprocess",
    function=step_preprocess,
    function_kwargs={"dataset_id": "raw-data-id"},
    function_return=["processed_id"],
)

pipe.add_function_step(
    name="train",
    parents=["preprocess"],
    function=step_train,
    function_kwargs={"data_id": "${preprocess.processed_id}"},
    function_return=["model_id"],
    execution_queue="gpu-queue",  # Run on GPU worker
)

pipe.add_function_step(
    name="evaluate",
    parents=["train"],
    function=step_evaluate,
    function_kwargs={"model_id": "${train.model_id}"},
    function_return=["accuracy"],
)

pipe.start()
pipe.wait()
print("Pipeline complete!")
```
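
Note that `pipe.start()` enqueues the controller task (by default to the `services` queue), so an agent must be available to run it. While developing, it can be easier to run everything in the current process; a sketch reusing the `pipe` object built above:

```python
# Debug mode: run the controller and all steps locally, no agents required
pipe.start_locally(run_pipeline_steps_locally=True)
```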

***

## ClearML Agent (Worker)

Run a ClearML Agent on a GPU server to execute queued experiments:

```bash
# Install agent
pip install clearml-agent

# Configure (uses same credentials as SDK)
clearml-agent init

# Start worker on GPU
clearml-agent daemon --queue "gpu-queue" --gpus 0,1

# Start worker with Docker isolation (recommended)
clearml-agent daemon \
    --queue "gpu-queue" \
    --docker pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime \
    --gpus all
```

On Clore.ai, spin up multiple GPU nodes as ClearML agents to create a distributed compute cluster.
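
Once agents are listening on a queue, any tracked experiment can be cloned and pushed to them straight from Python. A sketch; project, task, and queue names are the ones used earlier in this guide:

```python
from clearml import Task

# Clone an existing experiment and enqueue the clone for a Clore.ai worker
template = Task.get_task(project_name="MyProject", task_name="experiment-001")
cloned = Task.clone(source_task=template, name="experiment-001-on-clore")
Task.enqueue(cloned, queue_name="gpu-queue")
print(f"Enqueued task {cloned.id} on gpu-queue")
```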

***

## Hyperparameter Optimization

```python
from clearml.automation import (
    HyperParameterOptimizer,
    UniformParameterRange,
    DiscreteParameterRange,
    GridSearch,
)

optimizer = HyperParameterOptimizer(
    base_task_id="<task-id-to-optimize>",
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-5, max_value=1e-2, step_size=1e-5),
        DiscreteParameterValues("General/batch_size", values=[16, 32, 64, 128]),
        DiscreteParameterValues("General/optimizer", values=["adam", "sgd", "adamw"]),
    ],
    objective_metric_title="Accuracy",
    objective_metric_series="validation",
    objective_metric_sign="max",  # Maximize validation accuracy
    max_number_of_concurrent_tasks=4,
    optimizer_class=GridSearch,
    execution_queue="gpu-queue",
    total_max_jobs=50,
)

optimizer.start()
optimizer.wait()   # block until the search budget (total_max_jobs) is exhausted
top_exps = optimizer.get_top_experiments(top_k=3)
print("Best experiments:", top_exps)
optimizer.stop()   # make sure the background optimization thread is stopped
```
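
Grid search is the simplest strategy; for Bayesian search you can swap in the Optuna-backed optimizer (requires `pip install optuna` on the machine driving the search). A sketch with a single parameter range:

```python
from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

optimizer = HyperParameterOptimizer(
    base_task_id="<task-id-to-optimize>",
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-5, max_value=1e-2),
    ],
    objective_metric_title="Accuracy",
    objective_metric_series="validation",
    objective_metric_sign="max",
    optimizer_class=OptimizerOptuna,  # Bayesian search instead of grid
    execution_queue="gpu-queue",
    total_max_jobs=50,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```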

***

## Monitoring & Alerts

```python
from clearml import Task

task = Task.init(project_name="Production", task_name="monitoring")

# Set task tags for easy filtering
task.add_tags(["production", "v2.1", "gpu"])

# Log system metrics automatically — just init the task
# ClearML captures: CPU, RAM, GPU utilization, GPU VRAM automatically

# Add custom scalar monitoring
logger = task.get_logger()
import time
for i in range(100):
    gpu_util = 85 + (i % 10)
    logger.report_scalar("GPU", "utilization_%", value=gpu_util, iteration=i)
    time.sleep(1)
```

***

## Troubleshooting

{% hint style="warning" %}
**Elasticsearch fails to start** — Set `vm.max_map_count=262144` on the host: `sysctl -w vm.max_map_count=262144`. Add to `/etc/sysctl.conf` for persistence.
{% endhint %}

{% hint style="warning" %}
**Cannot connect to server** — Verify ports 8008, 8080, and 8081 are open in Clore.ai port settings. Check `docker ps` to ensure all containers are running.
{% endhint %}

{% hint style="info" %}
**Experiments not appearing in UI** — Check that `CLEARML_API_HOST` in your SDK config points to `http://<server-ip>:8008`, not localhost.
{% endhint %}

{% hint style="info" %}
**Out of disk space** — ClearML stores all artifacts locally. Configure S3/GCS storage or increase disk allocation in Clore.ai.
{% endhint %}
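
The simplest way to push artifacts and model weights to object storage is to set `output_uri` when initializing the task (the bucket name below is a placeholder; credentials go in `clearml.conf` or the standard AWS environment variables):

```python
from clearml import Task

# Artifacts and model checkpoints for this task go to S3 instead of the
# file server's local disk
task = Task.init(
    project_name="MyProject",
    task_name="s3-output-demo",
    output_uri="s3://my-bucket/clearml",
)
```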

| Issue                      | Fix                                                            |
| -------------------------- | -------------------------------------------------------------- |
| MongoDB connection refused | Check the mongo container: `docker-compose logs mongo`         |
| Task stuck in queue        | Ensure ClearML Agent is running and connected to the queue     |
| Slow UI                    | Elasticsearch needs time to index — wait 2–3 min after startup |
| API 401 Unauthorized       | Regenerate API credentials in ClearML web dashboard            |

***

## Use Cases for GPU Researchers

* **Track training runs** — never lose hyperparameters or results again
* **Compare experiments** — side-by-side metric comparison in the UI
* **Reproduce results** — ClearML captures git commit + code diff automatically
* **Share results** — collaborators see all experiments in the shared dashboard
* **Remote GPU jobs** — queue training jobs from laptop, run on Clore.ai GPU nodes
* **Automated HPO** — run hyperparameter search across multiple GPU nodes in parallel

***

## Related Tools

* [MLflow](https://docs.clore.ai/guides/mlops-and-deployment/mlflow) — experiment tracking alternative
* [Weights & Biases](https://wandb.ai/) — hosted ML experiment tracking
* [Ray](https://www.ray.io/) — distributed ML training and HPO

***

*ClearML on Clore.ai combines experiment tracking with GPU compute management — giving your ML team full MLOps capabilities without cloud vendor lock-in.*

***

## Clore.ai GPU Recommendations

| Use Case                | Recommended GPU | Est. Cost on Clore.ai |
| ----------------------- | --------------- | --------------------- |
| Development/Testing     | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Training     | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Scale Experiments | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
