MLflow

MLflow is an open-source platform for managing the complete Machine Learning lifecycle — from experiment tracking and model versioning to deployment and monitoring. Used by thousands of organizations worldwide, MLflow brings structure and reproducibility to ML workflows. Run it on Clore.ai's GPU cloud to get a centralized tracking server alongside your training jobs.


What is MLflow?

MLflow provides four core components:

Component
Description

Tracking

Log parameters, metrics, artifacts, and code from ML runs

Projects

Package code for reproducible runs

Models

Standard model format for deployment across frameworks

Model Registry

Centralized model store with versioning and lifecycle

Supported frameworks (built-in autologging):

  • PyTorch, TensorFlow/Keras

  • Scikit-learn, XGBoost, LightGBM

  • HuggingFace Transformers

  • Spark MLlib, statsmodels, Prophet


Prerequisites

Requirement
Value

GPU VRAM

Any (MLflow server itself is CPU-bound)

Storage

20 GB+ (for artifacts)

RAM

4 GB minimum for server

Ports

22 (SSH), 5000 (MLflow UI)

circle-info

MLflow tracking server is lightweight. You can run it on a small CPU instance and point your GPU training jobs at it. Alternatively, co-locate it with your training GPU instance.


Step 1 — Rent a Server on Clore.ai

  1. Click Marketplace.

  2. For a dedicated tracking server: filter by RAM ≥ 8 GB (GPU optional).

  3. For co-located: use your existing training instance.

  4. Set Docker image: ghcr.io/mlflow/mlflow:latest

  5. Set open ports: 22 (SSH) and 5000 (MLflow UI).

  6. Click Rent.


Step 2 — Launch the MLflow Tracking Server

The official ghcr.io/mlflow/mlflow image requires a startup command override.

In Clore.ai Docker Configuration

Set the command (or entrypoint override) to:

Alternative: Custom Dockerfile


Step 3 — Access the MLflow UI

Open your browser:

You should see the MLflow Experiments dashboard.

circle-info

The default SQLite backend (mlflow.db) stores all run metadata locally. For production or team use, switch to PostgreSQL — see Advanced Configuration below.


Step 4 — Log Your First Experiment

Connect from a Remote Training Job

On your training machine (or another Clore.ai instance), set the tracking URI:

Basic PyTorch Experiment Logging

HuggingFace Transformers Autologging


Step 5 — Scikit-learn with Autologging


Step 6 — Model Registry

Register and manage model versions via the UI or API:


Step 7 — Serve a Model

MLflow can serve any logged model as a REST API:

Test the served model:


Advanced Configuration

PostgreSQL Backend (Production)

S3 Artifact Store

Authentication (Enterprise)


Comparing Runs in the UI

  1. Open the MLflow UI at http://<clore-host>:<port>

  2. Select an experiment from the left panel

  3. Check the boxes next to multiple runs

  4. Click Compare to see side-by-side metrics and parameters

  5. Use the Charts tab for visual comparison


Troubleshooting

Cannot Connect to Tracking Server

Solutions:

  • Check that port 5000 is open and forwarded in Clore.ai

  • Verify the server is running: ps aux | grep mlflow

  • Test connectivity: curl http://<clore-host>:<port>/health

Artifact Upload Fails

Solution: Ensure the artifact directory is writable:

SQLite Locked Error (Concurrent Writes)

Solution: Switch to PostgreSQL for multi-user setups:

Model Registry Not Showing

Solution: Verify you're using a --backend-store-uri that supports the registry (SQLite or PostgreSQL — not just a local path).


Cost Estimation

Instance
Use Case
Est. Price
Notes

CPU 4-core

Tracking server only

~$0.05/hr

Very lightweight

RTX 3080

Co-located training

~$0.10/hr

Training + MLflow

RTX 4090

Heavy training + tracking

~$0.35/hr

Most common setup

circle-info

Run MLflow on a cheap CPU instance and point all your GPU training jobs at it. This way the tracking server runs continuously without burning expensive GPU credits.


Useful Resources


Clore.ai GPU Recommendations

Use Case
Recommended GPU
Est. Cost on Clore.ai

Development/Testing

RTX 3090 (24GB)

~$0.12/gpu/hr

Production Training

RTX 4090 (24GB)

~$0.70/gpu/hr

Large Scale Experiments

A100 80GB

~$1.20/gpu/hr

💡 All examples in this guide can be deployed on Clore.aiarrow-up-right GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

Last updated

Was this helpful?