# ClearML

{% hint style="info" %}
**ClearML** (formerly Trains) is an open-source MLOps platform for experiment tracking, data versioning, model management, pipeline orchestration, and compute resource management — all in one unified suite.
{% endhint %}

## Overview

ClearML is a comprehensive ML lifecycle management platform from Allegro AI. It automatically captures experiment parameters, metrics, artifacts, and code with minimal code changes. ClearML supports the full ML workflow: from data management and experiment tracking to model registry, automated pipelines, and distributed task execution on GPU clusters.

| Property       | Value                                                     |
| -------------- | --------------------------------------------------------- |
| **Category**   | MLOps / Experiment Tracking                               |
| **Developer**  | Allegro AI                                                |
| **License**    | Apache 2.0                                                |
| **GitHub**     | [allegroai/clearml](https://github.com/allegroai/clearml) |
| **Stars**      | 5.5K+                                                     |
| **Docker Hub** | `allegroai/clearml`                                       |
| **Ports**      | 22 (SSH), 8008 (API Server), 8080 (Web UI), 8081 (File Server) |

***

## Architecture

ClearML consists of four main components. The first three together make up the ClearML Server:

| Component         | Port | Description                         |
| ----------------- | ---- | ----------------------------------- |
| **Web Server**    | 8080 | Browser-based dashboard             |
| **API Server**    | 8008 | REST API for the SDK and agents     |
| **File Server**   | 8081 | Artifact and model storage          |
| **ClearML Agent** | —    | Worker that executes queued ML tasks |

***

## Key Features

* **Zero-code experiment tracking** — add 2 lines of code (see the snippet after this list) to capture everything automatically
* **Automatic logging** — metrics, parameters, models, console output, plots, images
* **Git integration** — auto-capture git commit, diff, and uncommitted changes
* **Data management** — versioned datasets with lineage tracking
* **Model registry** — store, version, and serve ML models
* **Pipeline orchestration** — build and run multi-step ML pipelines
* **Remote execution** — queue experiments and run on remote GPU workers (ClearML Agent)
* **Hyperparameter optimization** — automated HPO with grid, random, and Bayesian (Optuna / BOHB) strategies
* **Resource monitoring** — GPU/CPU/RAM monitoring per experiment
* **Self-hosted or cloud** — run your own server or use ClearML's hosted platform

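The two lines from the first bullet, with placeholder project and task names:

```python
from clearml import Task

task = Task.init(project_name="MyProject", task_name="my-experiment")
```
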
***

## Clore.ai Setup

### Option 1 — Full Self-Hosted Server

Run the ClearML server on Clore.ai for full control.

#### Step 1 — Choose a Server

| Use Case                  | Recommended   | VRAM  | RAM    |
| ------------------------- | ------------- | ----- | ------ |
| Server only (no training) | CPU instance  | —     | 8 GB+  |
| Server + training         | RTX 3080      | 10 GB | 16 GB  |
| Full MLOps cluster        | Multiple GPUs | —     | 32 GB+ |

#### Step 2 — Rent a Server on Clore.ai

1. Go to [clore.ai](https://clore.ai) → **Marketplace**
2. For the **server** component: CPU instances work fine
3. For **training workers**: GPU instances (RTX 3090, 4090, A100)
4. Open ports: **22**, **8008**, **8080**, **8081**
5. Ensure **≥ 50 GB disk** for experiment artifacts

#### Step 3 — Deploy with Docker Compose

Create `docker-compose.yml`:

```yaml
version: "3.6"

services:
  apiserver:
    image: allegroai/clearml:latest
    command: ["apiserver"]
    restart: unless-stopped
    volumes:
      - /opt/clearml/logs:/var/log/clearml
      - /opt/clearml/config:/opt/clearml/config
      - /opt/clearml/data/fileserver:/mnt/fileserver
    environment:
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_ELASTICSEARCH_SERVICE_HOST: elasticsearch
      CLEARML_ELASTICSEARCH_SERVICE_PORT: 9200
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
    ports:
      - "8008:8008"
    depends_on:
      - mongo
      - elasticsearch
      - redis

  webserver:
    image: allegroai/clearml:latest
    command: ["webserver"]
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
      - "8080:80"

  fileserver:
    image: allegroai/clearml:latest
    command: ["fileserver"]
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/fileserver:/mnt/fileserver
    ports:
      - "8081:8081"

  mongo:
    image: mongo:4.4
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/mongo:/data/db
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.6
    restart: unless-stopped
    environment:
      ES_JAVA_OPTS: "-Xms512m -Xmx2048m"
      bootstrap.memory_lock: "true"
      cluster.name: "clearml"
      discovery.type: "single-node"
      http.publish_host: "$CLEARML_HOST_IP"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - /opt/clearml/data/elastic:/usr/share/elasticsearch/data

  redis:
    image: redis:6
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/redis:/data

networks:
  default:
    name: clearml_network
```

Start the stack:

```bash
mkdir -p /opt/clearml/{logs,config,data/{fileserver,mongo,elastic,redis}}

# Set your server's public IP
export CLEARML_HOST_IP=<your-server-ip>

docker-compose up -d
```
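
Once the stack is up, a quick health check (the API server answers on its `debug.ping` endpoint):

```bash
curl http://<server-ip>:8008/debug.ping
```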

{% hint style="warning" %}
ClearML Server requires \~4 GB RAM for the full stack (MongoDB + Elasticsearch + Redis + API server + WebUI). Make sure your Clore.ai instance has sufficient RAM.
{% endhint %}

### Option 2 — Use ClearML Hosted (Free)

For experiment tracking without running a server, use the free hosted plan:

```bash
# Install SDK
pip install clearml

# Configure with hosted server
clearml-init
# Enter: https://api.clear.ml  when prompted for API host
# Get credentials from: https://app.clear.ml/settings/workspace-configuration
```

***

## Accessing the Interface

### Web Dashboard

```
http://<server-ip>:8080
```

There are no preset credentials; create your account on first login.

### API Server

```
http://<server-ip>:8008
```

### Via SSH

```bash
ssh root@<server-ip> -p 22
```

***

## SDK Integration

### Installation

```bash
pip install clearml
```

### Initial Configuration

```bash
clearml-init
```

Enter your server URL (`http://<server-ip>:8008`) and API credentials from the dashboard.
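
`clearml-init` writes `~/clearml.conf`; after pointing it at your server, the `api` section should look roughly like this:

```
api {
    web_server: http://<server-ip>:8080
    api_server: http://<server-ip>:8008
    files_server: http://<server-ip>:8081
    credentials {
        access_key: "YOUR_ACCESS_KEY"
        secret_key: "YOUR_SECRET_KEY"
    }
}
```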

Or configure programmatically:

```python
from clearml import Task

Task.set_credentials(
    api_host="http://<server-ip>:8008",
    web_host="http://<server-ip>:8080",
    files_host="http://<server-ip>:8081",
    key="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY"
)
```

***

## Tracking Experiments

### Minimal Integration (2 lines)

```python
from clearml import Task

# Initialize task — this captures EVERYTHING automatically
task = Task.init(project_name="MyProject", task_name="experiment-001")

# Your existing training code — no changes needed
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    loss = torch.tensor(1.0 / (epoch + 1))
    # Console output is captured automatically; scalars are auto-logged
    # when reported via TensorBoard or supported framework callbacks
    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

task.close()
```

### Manual Metric Logging

```python
from clearml import Task, Logger

task = Task.init(project_name="MyProject", task_name="manual-logging-demo")
logger = task.get_logger()

for epoch in range(50):
    train_loss = 1.0 / (epoch + 1)
    val_accuracy = 0.95 - 0.5 / (epoch + 1)

    # Log scalars
    logger.report_scalar("Loss", "train", value=train_loss, iteration=epoch)
    logger.report_scalar("Accuracy", "validation", value=val_accuracy, iteration=epoch)

    # Log learning rate
    logger.report_scalar("Learning Rate", "lr", value=0.001 * 0.9**epoch, iteration=epoch)

print("Training complete!")
task.close()
```

### Hyperparameter Tracking

```python
from clearml import Task

task = Task.init(project_name="HPO-Demo", task_name="run-001")

# Connect hyperparameters — auto-logged and overridable remotely
params = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "num_layers": 4,
    "dropout": 0.3,
    "optimizer": "adam",
    "epochs": 100,
}
params = task.connect(params)  # Now overridable by ClearML HPO

print(f"Training with lr={params['learning_rate']}, batch={params['batch_size']}")
```

***

## Data Management

```python
from clearml import Dataset

# Create a versioned dataset
dataset = Dataset.create(
    dataset_name="my-training-data",
    dataset_project="MyProject",
    dataset_version="1.0",
)

# Add files
dataset.add_files(path="/data/images/", recursive=True)
dataset.add_files(path="/data/labels.csv")

# Upload to ClearML server
dataset.upload()
dataset.finalize()
print(f"Dataset ID: {dataset.id}")

# Later: use the dataset in experiments
dataset = Dataset.get(dataset_name="my-training-data", dataset_version="1.0")
local_path = dataset.get_local_copy()
print(f"Dataset at: {local_path}")
```

***

## Model Registry

```python
from clearml import Task, OutputModel, InputModel
import torch

task = Task.init(project_name="ModelRegistry", task_name="training-run")

# After training, register the model
model = torch.nn.Linear(100, 10)
torch.save(model.state_dict(), "my_model.pt")

# Register output model
output_model = OutputModel(task=task, name="MyModel-v1")
output_model.update_weights("my_model.pt")
output_model.publish()  # Mark as ready to use

print(f"Model registered: {output_model.id}")

# In deployment: load model by name
input_model = InputModel(model_id="<model-id-from-dashboard>")
local_model_path = input_model.get_local_copy()
state_dict = torch.load(local_model_path)
```

***

## Pipeline Orchestration

```python
from clearml.automation import PipelineController

def step_preprocess(dataset_id: str) -> str:
    """Data preprocessing step."""
    # Each function step runs as its own ClearML task created by the
    # pipeline controller, so no explicit Task.init is needed here
    # ... preprocessing logic
    return "processed_data_id"

def step_train(data_id: str, lr: float = 0.001) -> str:
    """Model training step."""
    # ... training logic
    return "model_id"

def step_evaluate(model_id: str) -> float:
    """Model evaluation step."""
    # ... evaluation logic
    return 0.95

# Build pipeline
pipe = PipelineController(
    name="ML-Training-Pipeline",
    project="MyPipelines",
    version="1.0"
)

pipe.add_function_step(
    name="preprocess",
    function=step_preprocess,
    function_kwargs={"dataset_id": "raw-data-id"},
    function_return=["processed_id"],
)

pipe.add_function_step(
    name="train",
    parents=["preprocess"],
    function=step_train,
    function_kwargs={"data_id": "${preprocess.processed_id}"},
    function_return=["model_id"],
    execution_queue="gpu-queue",  # Run on GPU worker
)

pipe.add_function_step(
    name="evaluate",
    parents=["train"],
    function=step_evaluate,
    function_kwargs={"model_id": "${train.model_id}"},
    function_return=["accuracy"],
)

pipe.start()
pipe.wait()
print("Pipeline complete!")
```

***

## ClearML Agent (Worker)

Run a ClearML Agent on a GPU server to execute queued experiments:

```bash
# Install agent
pip install clearml-agent

# Configure (uses same credentials as SDK)
clearml-agent init

# Start worker on GPU
clearml-agent daemon --queue "gpu-queue" --gpus 0,1

# Start worker with Docker isolation (recommended)
clearml-agent daemon \
    --queue "gpu-queue" \
    --docker pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime \
    --gpus all
```

On Clore.ai, spin up multiple GPU nodes as ClearML agents to create a distributed compute cluster.
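
For example, you can queue work to those agents from any machine with the SDK configured. A minimal sketch, assuming an existing tracked experiment and a queue named `gpu-queue` that matches the agent's `--queue` flag:

```python
from clearml import Task

# Clone a previously tracked experiment and enqueue the clone for remote execution
template = Task.get_task(project_name="MyProject", task_name="experiment-001")
cloned = Task.clone(source_task=template, name="experiment-001-remote")
Task.enqueue(task=cloned, queue_name="gpu-queue")
print(f"Enqueued {cloned.id} on gpu-queue")
```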

***

## Hyperparameter Optimization

```python
from clearml.automation import (
    HyperParameterOptimizer,
    UniformParameterRange,
    DiscreteParameterRange,
    GridSearch,
)

optimizer = HyperParameterOptimizer(
    base_task_id="<task-id-to-optimize>",
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-5, max_value=1e-2, step_size=1e-5),
        DiscreteParameterRange("General/batch_size", values=[16, 32, 64, 128]),
        DiscreteParameterRange("General/optimizer", values=["adam", "sgd", "adamw"]),
    ],
    objective_metric_title="Accuracy",
    objective_metric_series="validation",
    objective_metric_sign="max",  # Maximize validation accuracy
    max_number_of_concurrent_tasks=4,
    optimizer_class=GridSearch,
    execution_queue="gpu-queue",
    total_max_jobs=50,
)

optimizer.start()
top_exps = optimizer.get_top_experiments(top_k=3)
print("Best experiments:", top_exps)
```

***

## Monitoring & Alerts

```python
from clearml import Task

task = Task.init(project_name="Production", task_name="monitoring")

# Set task tags for easy filtering
task.add_tags(["production", "v2.1", "gpu"])

# Log system metrics automatically — just init the task
# ClearML captures: CPU, RAM, GPU utilization, GPU VRAM automatically

# Add custom scalar monitoring
logger = task.get_logger()
import time
for i in range(100):
    gpu_util = 85 + (i % 10)
    logger.report_scalar("GPU", "utilization_%", value=gpu_util, iteration=i)
    time.sleep(1)
```

***

## Troubleshooting

{% hint style="warning" %}
**Elasticsearch fails to start** — Set `vm.max_map_count=262144` on the host: `sysctl -w vm.max_map_count=262144`. Add to `/etc/sysctl.conf` for persistence.
{% endhint %}

{% hint style="warning" %}
**Cannot connect to server** — Verify ports 8008, 8080, and 8081 are open in Clore.ai port settings. Check `docker ps` to ensure all containers are running.
{% endhint %}

{% hint style="info" %}
**Experiments not appearing in UI** — Check that `CLEARML_API_HOST` in your SDK config points to `http://<server-ip>:8008`, not localhost.
{% endhint %}

{% hint style="info" %}
**Out of disk space** — ClearML stores all artifacts locally. Configure S3/GCS storage or increase disk allocation in Clore.ai.
{% endhint %}
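
One way to offload artifacts is the `output_uri` argument of `Task.init`. A minimal sketch, assuming an S3 bucket you control (`my-bucket` is a placeholder) with credentials already set in `clearml.conf`:

```python
from clearml import Task

# Artifacts and model checkpoints go to S3 instead of the local file server
task = Task.init(
    project_name="MyProject",
    task_name="s3-artifacts-demo",
    output_uri="s3://my-bucket/clearml",
)
```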

| Issue                      | Fix                                                            |
| -------------------------- | -------------------------------------------------------------- |
| MongoDB connection refused | Check the mongo container: `docker-compose logs mongo`         |
| Task stuck in queue        | Ensure ClearML Agent is running and connected to the queue     |
| Slow UI                    | Elasticsearch needs time to index — wait 2–3 min after startup |
| API 401 Unauthorized       | Regenerate API credentials in ClearML web dashboard            |

***

## Use Cases for GPU Researchers

* **Track training runs** — never lose hyperparameters or results again
* **Compare experiments** — side-by-side metric comparison in the UI
* **Reproduce results** — ClearML captures git commit + code diff automatically
* **Share results** — collaborators see all experiments in the shared dashboard
* **Remote GPU jobs** — queue training jobs from a laptop and run them on Clore.ai GPU nodes (see the sketch after this list)
* **Automated HPO** — run hyperparameter search across multiple GPU nodes in parallel

***

## Related Tools

* [MLflow](https://docs.clore.ai/guides/mlops-and-deployment/mlflow) — experiment tracking alternative
* [Weights & Biases](https://wandb.ai/) — hosted ML experiment tracking
* [Ray](https://www.ray.io/) — distributed ML training and HPO

***

*ClearML on Clore.ai combines experiment tracking with GPU compute management — giving your ML team full MLOps capabilities without cloud vendor lock-in.*

***

## Clore.ai GPU Recommendations

| Use Case                | Recommended GPU | Est. Cost on Clore.ai |
| ----------------------- | --------------- | --------------------- |
| Development/Testing     | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Training     | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Scale Experiments | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

