# MLflow

**MLflow** is an open-source platform for managing the complete **Machine Learning lifecycle** — from experiment tracking and model versioning to deployment and monitoring. Used by thousands of organizations worldwide, MLflow brings structure and reproducibility to ML workflows. Run it on Clore.ai's GPU cloud to get a centralized tracking server alongside your training jobs.

***

## What is MLflow?

MLflow provides four core components:

| Component          | Description                                               |
| ------------------ | --------------------------------------------------------- |
| **Tracking**       | Log parameters, metrics, artifacts, and code from ML runs |
| **Projects**       | Package code for reproducible runs                        |
| **Models**         | Standard model format for deployment across frameworks    |
| **Model Registry** | Centralized model store with versioning and lifecycle     |

**Supported frameworks (built-in autologging):**

* PyTorch, TensorFlow/Keras
* Scikit-learn, XGBoost, LightGBM
* HuggingFace Transformers
* Spark MLlib, statsmodels, Prophet

***

## Prerequisites

| Requirement | Value                                   |
| ----------- | --------------------------------------- |
| GPU VRAM    | Any (the MLflow server itself does not use the GPU) |
| Storage     | 20 GB+ (for artifacts)                  |
| RAM         | 4 GB minimum for server                 |
| Ports       | 22 (SSH), 5000 (MLflow UI)              |

{% hint style="info" %}
The MLflow tracking server is lightweight. You can run it on a small CPU instance and point your GPU training jobs at it, or co-locate it with your training GPU instance.
{% endhint %}

***

## Step 1 — Rent a Server on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace**.
3. For a dedicated tracking server: filter by RAM ≥ 8 GB (GPU optional).
4. For co-located: use your existing training instance.
5. Set Docker image: **`ghcr.io/mlflow/mlflow:latest`**
6. Set open ports: `22` (SSH) and `5000` (MLflow UI).
7. Click **Rent**.

***

## Step 2 — Launch the MLflow Tracking Server

The official `ghcr.io/mlflow/mlflow` image ships without an SSH server, so override the startup command to install SSH and configure the tracking server.

### In Clore.ai Docker Configuration

Set the **command** (or entrypoint override) to:

```bash
bash -c "apt-get update -q && apt-get install -y -q openssh-server && \
    mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    service ssh start && \
    mlflow server \
        --host 0.0.0.0 \
        --port 5000 \
        --default-artifact-root /mlflow/artifacts \
        --backend-store-uri sqlite:////mlflow/mlflow.db"
```
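
Once the container is up, verify the server answers before pointing training jobs at it. A minimal stdlib sketch (the `/health` endpoint is built into the MLflow tracking server; the host and port are placeholders):

```python
from urllib.request import urlopen

def health_url(base_url: str) -> str:
    # MLflow exposes a liveness probe at /health on the tracking port
    return base_url.rstrip("/") + "/health"

def mlflow_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the MLflow /health endpoint answers with HTTP 200."""
    try:
        with urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# Example: mlflow_is_up("http://<clore-host>:<public-port-5000>")
```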

### Alternative: Custom Dockerfile

```dockerfile
FROM ghcr.io/mlflow/mlflow:latest

RUN apt-get update && apt-get install -y \
    openssh-server \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Additional Python packages
RUN pip install boto3 psycopg2-binary

RUN mkdir -p /mlflow/artifacts

EXPOSE 22 5000

CMD service ssh start && \
    mlflow server \
        --host 0.0.0.0 \
        --port 5000 \
        --default-artifact-root /mlflow/artifacts \
        --backend-store-uri sqlite:////mlflow/mlflow.db
```

***

## Step 3 — Access the MLflow UI

Open your browser:

```
http://<clore-host>:<public-port-5000>
```

You should see the MLflow Experiments dashboard.

{% hint style="info" %}
The default SQLite backend (`mlflow.db`) stores all run metadata locally. For production or team use, switch to PostgreSQL — see Advanced Configuration below.
{% endhint %}

***

## Step 4 — Log Your First Experiment

### Connect from a Remote Training Job

On your training machine (or another Clore.ai instance), set the tracking URI:

```bash
export MLFLOW_TRACKING_URI=http://<clore-host>:<public-port-5000>
```
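
MLflow's Python client picks up `MLFLOW_TRACKING_URI` automatically, so scripts need no hardcoded URL. A small sketch of that lookup with an explicit fallback (the localhost default is an assumption for ad-hoc local use):

```python
import os

def resolve_tracking_uri(default: str = "http://localhost:5000") -> str:
    # Mirrors MLflow's own lookup: the env var wins, fallback otherwise
    return os.environ.get("MLFLOW_TRACKING_URI", default)

# mlflow.set_tracking_uri(resolve_tracking_uri())  # explicit, if preferred
```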

### Basic PyTorch Experiment Logging

```python
import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn
import torch.optim as optim

# Connect to MLflow server
mlflow.set_tracking_uri("http://<clore-host>:<public-port-5000>")
mlflow.set_experiment("my-first-experiment")

# Define a simple model
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

# Training with MLflow tracking
with mlflow.start_run(run_name="training-run-001"):
    # Log hyperparameters
    params = {
        "learning_rate": 0.001,
        "batch_size": 64,
        "epochs": 100,
        "hidden_size": 256,
        "optimizer": "adam"
    }
    mlflow.log_params(params)
    
    # Initialize model
    model = SimpleNet(784, 256, 10).cuda()
    optimizer = optim.Adam(model.parameters(), lr=params["learning_rate"])
    criterion = nn.CrossEntropyLoss()
    
    # Training loop
    for epoch in range(params["epochs"]):
        loss = torch.tensor(0.5 / (epoch + 1))  # Simulated
        accuracy = 0.7 + epoch * 0.003
        
        # Log metrics at each epoch
        mlflow.log_metrics({
            "train_loss": loss.item(),
            "train_accuracy": accuracy,
        }, step=epoch)
    
    # Log the final model
    mlflow.pytorch.log_model(model, "model")
    
    # Log final metrics
    mlflow.log_metric("final_accuracy", accuracy)
    
    print(f"Run logged to MLflow. ID: {mlflow.active_run().info.run_id}")
```

### HuggingFace Transformers Autologging

```python
import mlflow
from transformers import TrainingArguments, Trainer

mlflow.set_tracking_uri("http://<clore-host>:<public-port-5000>")
mlflow.set_experiment("llm-finetuning")

# Enable autologging: the Trainer's params and metrics are logged
# automatically (transformers also ships a built-in MLflowCallback)
mlflow.transformers.autolog()

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,                  # your pretrained model (defined elsewhere)
    args=training_args,
    train_dataset=train_dataset,  # your tokenized datasets (defined elsewhere)
    eval_dataset=eval_dataset,
)

with mlflow.start_run():
    trainer.train()
```

***

## Step 5 — Scikit-learn with Autologging

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

mlflow.set_tracking_uri("http://<clore-host>:<public-port-5000>")
mlflow.set_experiment("sklearn-experiments")

# Autolog everything
mlflow.sklearn.autolog()

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

with mlflow.start_run(run_name="random-forest-v1"):
    rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)
    
    score = rf.score(X_test, y_test)
    print(f"Test Accuracy: {score:.4f}")
    # All params, metrics, and model automatically logged!
```

***

## Step 6 — Model Registry

Register and manage model versions via the UI or API:

```python
import mlflow

client = mlflow.MlflowClient("http://<clore-host>:<public-port-5000>")

# Register a model from a run
run_id = "your-run-id-here"
model_uri = f"runs:/{run_id}/model"

registered = mlflow.register_model(
    model_uri=model_uri,
    name="production-classifier"
)

print(f"Version: {registered.version}")

# Transition model stage (stages are deprecated in MLflow 2.9+
# in favor of model version aliases, but still supported)
client.transition_model_version_stage(
    name="production-classifier",
    version=registered.version,
    stage="Production"
)

# Load a production model anywhere
model = mlflow.pyfunc.load_model(
    model_uri="models:/production-classifier/Production"
)
```
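
The `models:/` URIs used above follow a fixed scheme: `models:/<name>/<version-or-stage>` (newer MLflow releases also accept `models:/<name>@<alias>`). A tiny helper, as a sketch:

```python
def registry_uri(name: str, version_or_stage) -> str:
    # models:/<name>/<version-or-stage>,
    # e.g. models:/clf/3 or models:/clf/Production
    return f"models:/{name}/{version_or_stage}"
```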

***

## Step 7 — Serve a Model

MLflow can serve any logged model as a REST API:

```bash
# On the MLflow server instance
export MLFLOW_TRACKING_URI=http://localhost:5000

mlflow models serve \
    --model-uri "models:/production-classifier/Production" \
    --host 0.0.0.0 \
    --port 5001 \
    --no-conda
```

Test the served model:

```bash
curl -X POST http://<clore-host>:5001/invocations \
    -H "Content-Type: application/json" \
    -d '{"inputs": [[1.0, 2.0, 3.0, ...]]}'
```
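
The `/invocations` endpoint accepts JSON with an `inputs` key (among other formats, such as `dataframe_split`). Building the payload in Python avoids hand-escaping; the 4-feature row below is made up for illustration and must match your model's input shape:

```python
import json

# Hypothetical single input row; shape must match the model signature
payload = {"inputs": [[1.0, 2.0, 3.0, 4.0]]}
body = json.dumps(payload)

# POST `body` to http://<clore-host>:5001/invocations with
# Content-Type: application/json (e.g. via urllib or requests)
```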

***

## Advanced Configuration

### PostgreSQL Backend (Production)

```bash
# Launch with PostgreSQL
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri postgresql://user:password@db-host/mlflow \
    --default-artifact-root s3://my-bucket/mlflow-artifacts
```

### S3 Artifact Store

```bash
pip install boto3

export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --default-artifact-root s3://my-mlflow-bucket/artifacts \
    --backend-store-uri sqlite:////mlflow/mlflow.db
```
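
In this (non-proxied) artifact mode, the *client* uploads artifacts straight to S3, so training jobs need the same AWS credentials in their environment. A sketch (the values are placeholders):

```python
import os

# Training-side credentials for direct artifact upload to S3
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
```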

### Authentication (Enterprise)

```bash
pip install 'mlflow[auth]'

mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --app-name basic-auth \
    --backend-store-uri sqlite:////mlflow/mlflow.db \
    --default-artifact-root /mlflow/artifacts
```
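
With basic auth enabled, clients authenticate via environment variables that the MLflow client reads automatically. A sketch (the username and password are placeholders; change the server defaults immediately):

```python
import os

# Read by the MLflow client when the server runs the basic-auth app
os.environ["MLFLOW_TRACKING_USERNAME"] = "admin"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "change-me"
```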

***

## Comparing Runs in the UI

1. Open the MLflow UI at `http://<clore-host>:<port>`
2. Select an experiment from the left panel
3. Check the boxes next to multiple runs
4. Click **Compare** to see side-by-side metrics and parameters
5. Use the **Charts** tab for visual comparison
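
The same comparison can be done programmatically: `mlflow.search_runs()` returns runs as a DataFrame you can sort by metric. A stdlib-only sketch of the selection logic, over hypothetical run summaries shaped like that DataFrame's rows:

```python
# Hypothetical run summaries (column names follow mlflow.search_runs)
runs = [
    {"run_id": "a1b2", "metrics.final_accuracy": 0.91},
    {"run_id": "c3d4", "metrics.final_accuracy": 0.95},
    {"run_id": "e5f6", "metrics.final_accuracy": 0.89},
]

# Pick the best run by metric, the programmatic analogue of "Compare"
best = max(runs, key=lambda r: r["metrics.final_accuracy"])
```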

***

## Troubleshooting

### Cannot Connect to Tracking Server

```
mlflow.exceptions.MlflowException: API request failed with status code 503
```

**Solutions:**

* Check that port 5000 is open and forwarded in Clore.ai
* Verify the server is running: `ps aux | grep mlflow`
* Test connectivity: `curl http://<clore-host>:<port>/health`

### Artifact Upload Fails

**Solution:** Ensure the artifact directory is writable:

```bash
chmod 777 /mlflow/artifacts
```

### SQLite Locked Error (Concurrent Writes)

**Solution:** Switch to PostgreSQL for multi-user setups:

```bash
pip install psycopg2-binary
```

### Model Registry Not Showing

**Solution:** Verify you're using a `--backend-store-uri` that supports the registry (SQLite or PostgreSQL — not just a local path).

***

## Cost Estimation

| Instance   | Use Case                  | Est. Price | Notes             |
| ---------- | ------------------------- | ---------- | ----------------- |
| CPU 4-core | Tracking server only      | \~$0.05/hr | Very lightweight  |
| RTX 3080   | Co-located training       | \~$0.10/hr | Training + MLflow |
| RTX 4090   | Heavy training + tracking | \~$0.35/hr | Most common setup |

{% hint style="info" %}
Run MLflow on a cheap CPU instance and point all your GPU training jobs at it. This way the tracking server runs continuously without burning expensive GPU credits.
{% endhint %}

***

## Useful Resources

* [MLflow Official Documentation](https://mlflow.org/docs/latest/index.html)
* [MLflow GitHub](https://github.com/mlflow/mlflow)
* [MLflow Docker Hub](https://github.com/mlflow/mlflow/pkgs/container/mlflow)
* [MLflow Model Registry Guide](https://mlflow.org/docs/latest/model-registry.html)
* [MLflow Tracking API Reference](https://mlflow.org/docs/latest/python_api/mlflow.html)

***

## Clore.ai GPU Recommendations

| Use Case                | Recommended GPU | Est. Cost on Clore.ai |
| ----------------------- | --------------- | --------------------- |
| Development/Testing     | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Training     | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Scale Experiments | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

