# Monitoring with Prometheus + Grafana

## What We're Building

A complete monitoring stack built on Prometheus and Grafana that tracks Clore.ai marketplace availability, GPU prices, order status, and costs, with Grafana dashboards and alert notifications through Alertmanager.

**Key Features:**

* GPU availability and price metrics per GPU type
* Cost tracking per workload
* Order status monitoring
* Price history visualization
* Alert rules for cost thresholds
* Beautiful Grafana dashboards

## Prerequisites

* Clore.ai account with API key
* Docker and Docker Compose
* Python 3.10+

```bash
pip install prometheus_client requests flask
```

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                        Monitoring Stack                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐    │
│   │   Grafana    │◀────│  Prometheus  │◀────│   Exporter   │    │
│   │ (Dashboards) │     │ (Metrics DB) │     │  (Clore.ai)  │    │
│   └──────────────┘     └──────────────┘     └──────────────┘    │
│                               │                     │           │
│                               ▼                     ▼           │
│                        ┌──────────────┐     ┌──────────────┐    │
│                        │ Alertmanager │     │   Clore.ai   │    │
│                        │  (Alerting)  │     │     API      │    │
│                        └──────────────┘     └──────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Step 1: Prometheus Exporter for Clore.ai

```python
#!/usr/bin/env python3
"""
Prometheus Exporter for Clore.ai Metrics

Exposes metrics about GPU marketplace, orders, and costs.
"""

import time
import requests
from flask import Flask, Response
from prometheus_client import (
    Counter, Gauge, Histogram,
    generate_latest, CONTENT_TYPE_LATEST,
    CollectorRegistry
)

# Create custom registry
registry = CollectorRegistry()

# --- Marketplace Metrics ---
gpu_available = Gauge(
    'clore_gpu_available_total',
    'Number of available GPUs by type',
    ['gpu_type'],
    registry=registry
)

gpu_price_spot = Gauge(
    'clore_gpu_price_spot_usd',
    'Minimum spot price per GPU type in USD',
    ['gpu_type'],
    registry=registry
)

gpu_price_ondemand = Gauge(
    'clore_gpu_price_ondemand_usd',
    'Minimum on-demand price per GPU type in USD',
    ['gpu_type'],
    registry=registry
)

marketplace_total_servers = Gauge(
    'clore_marketplace_servers_total',
    'Total number of servers in marketplace',
    registry=registry
)

marketplace_available_servers = Gauge(
    'clore_marketplace_servers_available',
    'Number of available servers',
    registry=registry
)

# --- Order Metrics ---
active_orders = Gauge(
    'clore_orders_active_total',
    'Number of active orders',
    registry=registry
)

orders_by_status = Gauge(
    'clore_orders_by_status',
    'Orders by status',
    ['status'],
    registry=registry
)

order_hourly_cost = Gauge(
    'clore_order_hourly_cost_usd',
    'Hourly cost of order in USD',
    ['order_id', 'gpu_type'],
    registry=registry
)

order_runtime_seconds = Gauge(
    'clore_order_runtime_seconds',
    'Runtime of order in seconds',
    ['order_id'],
    registry=registry
)

# --- Cost Metrics ---
total_daily_cost = Gauge(
    'clore_daily_cost_usd',
    'Estimated daily cost of all active orders',
    registry=registry
)

# --- Wallet Metrics ---
wallet_balance = Gauge(
    'clore_wallet_balance',
    'Wallet balance by currency',
    ['currency'],
    registry=registry
)

# --- Scrape Metrics ---
scrape_duration = Histogram(
    'clore_scrape_duration_seconds',
    'Time spent scraping Clore.ai API',
    registry=registry
)

scrape_errors = Counter(
    'clore_scrape_errors_total',
    'Number of scrape errors',
    registry=registry
)


class CloreExporter:
    """Prometheus exporter for Clore.ai metrics."""
    
    BASE_URL = "https://api.clore.ai"
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.headers = {"auth": api_key}
    
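    def _request(self, endpoint: str):
        """Make an authenticated GET request and return the decoded JSON body."""
        response = requests.get(
            f"{self.BASE_URL}{endpoint}",
            headers=self.headers,
            timeout=30
        )
        # Fail fast on HTTP-level errors (4xx/5xx) before parsing the body
        response.raise_for_status()
        data = response.json()
        # Any non-zero "code" is treated as an API-level failure
        if data.get("code") != 0:
            raise RuntimeError(f"API Error: {data}")
        return data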
    
    def _normalize_gpu(self, name: str) -> str:
        """Normalize GPU name for consistent labels."""
        patterns = [
            ("RTX_4090", ["4090"]),
            ("RTX_4080", ["4080"]),
            ("RTX_3090", ["3090"]),
            ("RTX_3080", ["3080"]),
            ("RTX_3070", ["3070"]),
            ("A100", ["a100"]),
            ("A6000", ["a6000"]),
            ("A5000", ["a5000"]),
        ]
        
        name_lower = name.lower()
        for normalized, matches in patterns:
            if any(m in name_lower for m in matches):
                return normalized
        return name.replace(" ", "_")
    
    def collect_marketplace_metrics(self):
        """Collect marketplace metrics."""
        data = self._request("/v1/marketplace")
        servers = data.get("servers", [])
        
        # Track by GPU type
        gpu_data = {}
        total_servers = len(servers)
        available_servers = 0
        
        for server in servers:
            is_available = not server.get("rented", True)
            if is_available:
                available_servers += 1
            
            gpu_array = server.get("gpu_array", [])
            for gpu in gpu_array:
                gpu_type = self._normalize_gpu(gpu)
                
                if gpu_type not in gpu_data:
                    gpu_data[gpu_type] = {
                        "available": 0,
                        "spot_min": float('inf'),
                        "ondemand_min": float('inf')
                    }
                
                if is_available:
                    gpu_data[gpu_type]["available"] += 1
                    
                    usd = server.get("price", {}).get("usd", {})
                    spot = usd.get("spot")
                    ondemand = usd.get("on_demand_clore")
                    
                    if spot:
                        gpu_data[gpu_type]["spot_min"] = min(
                            gpu_data[gpu_type]["spot_min"], spot
                        )
                    if ondemand:
                        gpu_data[gpu_type]["ondemand_min"] = min(
                            gpu_data[gpu_type]["ondemand_min"], ondemand
                        )
        
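        # Set metrics
        marketplace_total_servers.set(total_servers)
        marketplace_available_servers.set(available_servers)
        
        # Drop prices from the previous scrape so GPU types with no
        # available servers do not keep reporting a stale minimum.
        gpu_price_spot.clear()
        gpu_price_ondemand.clear()
        
        # "stats" avoids shadowing the API response held in "data"
        for gpu_type, stats in gpu_data.items():
            gpu_available.labels(gpu_type=gpu_type).set(stats["available"])
            
            if stats["spot_min"] != float('inf'):
                gpu_price_spot.labels(gpu_type=gpu_type).set(stats["spot_min"])
            
            if stats["ondemand_min"] != float('inf'):
                gpu_price_ondemand.labels(gpu_type=gpu_type).set(stats["ondemand_min"])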
    
    def collect_order_metrics(self):
        """Collect order metrics."""
        data = self._request("/v1/my_orders")
        orders = data.get("orders", [])
        
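        # Drop per-order series from the previous scrape so orders that
        # have finished do not keep reporting their last cost and runtime.
        order_hourly_cost.clear()
        order_runtime_seconds.clear()
        
        # Count by status
        status_counts = {}
        total_hourly_cost = 0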
        
        for order in orders:
            status = order.get("status", "unknown")
            status_counts[status] = status_counts.get(status, 0) + 1
            
            order_id = str(order.get("order_id", ""))
            
            # Get GPU type from order
            gpu_type = "unknown"
            if order.get("gpu_array"):
                gpu_type = self._normalize_gpu(order["gpu_array"][0])
            
            # Calculate hourly cost (price is per minute)
            price_per_minute = order.get("price", 0)
            hourly = price_per_minute * 60
            
            if status == "running":
                total_hourly_cost += hourly
                order_hourly_cost.labels(
                    order_id=order_id,
                    gpu_type=gpu_type
                ).set(hourly)
                
                # Runtime
                started = order.get("started", 0)
                if started:
                    runtime = time.time() - started
                    order_runtime_seconds.labels(order_id=order_id).set(runtime)
        
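        # Set status metrics. Reset first so a status that emptied since the
        # last scrape does not keep its old count. "running" and "expired"
        # are always exported (0 when absent) because the alert rules and
        # dashboards query them.
        orders_by_status.clear()
        for status in ("running", "expired"):
            status_counts.setdefault(status, 0)
        for status, count in status_counts.items():
            orders_by_status.labels(status=status).set(count)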
        
        active_orders.set(status_counts.get("running", 0))
        total_daily_cost.set(total_hourly_cost * 24)
    
    def collect_wallet_metrics(self):
        """Collect wallet balance metrics."""
        data = self._request("/v1/wallets")
        wallets = data.get("wallets", [])
        
        for wallet in wallets:
            currency = wallet.get("name", "unknown")
            balance = wallet.get("balance", 0)
            wallet_balance.labels(currency=currency).set(balance)
    
    def collect(self):
        """Collect all metrics."""
        start = time.time()
        
        try:
            self.collect_marketplace_metrics()
            self.collect_order_metrics()
            self.collect_wallet_metrics()
        except Exception as e:
            scrape_errors.inc()
            print(f"Error collecting metrics: {e}")
        
        duration = time.time() - start
        scrape_duration.observe(duration)


# Flask app for metrics endpoint
app = Flask(__name__)
exporter = None


@app.route('/metrics')
def metrics():
    """Prometheus metrics endpoint."""
    if exporter:
        exporter.collect()
    return Response(
        generate_latest(registry),
        mimetype=CONTENT_TYPE_LATEST
    )


@app.route('/health')
def health():
    return 'OK'


def start_exporter(api_key: str, port: int = 9090):
    """Start the exporter."""
    global exporter
    exporter = CloreExporter(api_key)
    app.run(host='0.0.0.0', port=port)


if __name__ == '__main__':
    import os
    api_key = os.environ.get('CLORE_API_KEY')
    if not api_key:
        print("Set CLORE_API_KEY environment variable")
        raise SystemExit(1)
    
    start_exporter(api_key, port=9090)
```
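
The cost gauges come down to simple arithmetic: the exporter treats each order's `price` as a per-minute rate, multiplies it by 60 for the hourly figure, and multiplies the hourly total of running orders by 24 for the daily estimate. A minimal standalone sketch of that calculation follows. The field names mirror those read in `collect_order_metrics`; the helper name `estimate_daily_cost` and the sample prices are illustrative assumptions, not real API output.

```python
from typing import Iterable, Mapping


def estimate_daily_cost(orders: Iterable[Mapping]) -> float:
    """Sum per-minute prices of running orders and scale to 24 hours."""
    hourly_total = 0.0
    for order in orders:
        if order.get("status") != "running":
            continue  # only running orders accrue cost
        hourly_total += order.get("price", 0) * 60  # per-minute -> per-hour
    return hourly_total * 24  # per-hour -> per-day


# Illustrative data only; real values come from the API.
sample_orders = [
    {"order_id": 1, "status": "running", "price": 0.005},
    {"order_id": 2, "status": "running", "price": 0.01},
    {"order_id": 3, "status": "expired", "price": 0.02},
]

daily = estimate_daily_cost(sample_orders)
print(f"Estimated daily cost: ${daily:.2f}")
```

With these illustrative rates the estimate comes to $21.60 per day, which stays under the $50 `HighDailyCost` threshold used in the alert rules.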

## Step 2: Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/alerts/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'clore_exporter'
    static_configs:
      - targets: ['clore-exporter:9090']
    scrape_interval: 60s
    scrape_timeout: 30s
```
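
Once Prometheus is running, its HTTP API can confirm that the `clore_exporter` job is actually being scraped. The sketch below builds an instant query for the built-in `up` metric and interprets the JSON response. It assumes Prometheus is reachable on host port 9091, as published by the Docker Compose stack in Step 5; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9091"  # assumption: host port from the Compose file


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def target_is_up(response: dict) -> bool:
    """Return True if the query succeeded and every returned series has value 1."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # Each instant-vector sample is [timestamp, "value-as-string"].
    return bool(results) and all(r["value"][1] == "1" for r in results)


def check_exporter(base_url: str = PROMETHEUS_URL) -> bool:
    """Ask a running Prometheus whether the clore_exporter job is up."""
    url = build_query_url(base_url, 'up{job="clore_exporter"}')
    with urllib.request.urlopen(url, timeout=10) as resp:
        return target_is_up(json.load(resp))


# Offline demonstration with a hand-written response in the API's shape
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"job": "clore_exporter"}, "value": [1700000000, "1"]}],
    },
}
print("exporter up:", target_is_up(sample))
```

Call `check_exporter()` after the stack has started to run the same check against the live server.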

## Step 3: Alert Rules

```yaml
# alerts/clore_alerts.yml
groups:
  - name: clore_costs
    rules:
      - alert: HighDailyCost
        expr: clore_daily_cost_usd > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High daily cost detected"
          description: "Daily cost is ${{ $value | printf \"%.2f\" }}"
      
      - alert: CriticalDailyCost
        expr: clore_daily_cost_usd > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical daily cost"
          description: "Daily cost is ${{ $value | printf \"%.2f\" }}!"

  - name: clore_availability
    rules:
      - alert: NoGPUsAvailable
        expr: clore_marketplace_servers_available == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No GPUs available"
          description: "No servers available in marketplace"
      
      - alert: LowGPUAvailability
        expr: clore_gpu_available_total{gpu_type="RTX_4090"} < 5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low RTX 4090 availability"
          description: "Only {{ $value }} RTX 4090 GPUs available"

  - name: clore_orders
    rules:
      - alert: OrderFailed
        expr: delta(clore_orders_by_status{status="expired"}[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Order expired/failed"
          description: "An order has expired in the last 5 minutes"
      
      - alert: LowWalletBalance
        expr: clore_wallet_balance{currency=~".*CLORE.*"} < 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Low CLORE balance"
          description: "CLORE balance is {{ $value }}"

  - name: clore_prices
    rules:
      - alert: PriceDrop
        expr: |
          (clore_gpu_price_spot_usd{gpu_type="RTX_4090"} 
          - clore_gpu_price_spot_usd{gpu_type="RTX_4090"} offset 1h) 
          / clore_gpu_price_spot_usd{gpu_type="RTX_4090"} offset 1h < -0.2
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "RTX 4090 price dropped"
          description: "RTX 4090 spot price dropped by more than 20%"
```
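
The `PriceDrop` expression compares the current spot price with the value from one hour earlier and fires when the relative change falls below -0.2. The same arithmetic as a small Python sketch, with illustrative prices and hypothetical function names:

```python
def relative_change(current: float, previous: float) -> float:
    """(current - previous) / previous, as in the PriceDrop expression."""
    if previous == 0:
        raise ValueError("previous price must be non-zero")
    return (current - previous) / previous


def price_dropped(current: float, previous: float, threshold: float = -0.2) -> bool:
    """True when the relative change is below the (negative) threshold."""
    return relative_change(current, previous) < threshold


# Illustrative prices: 0.40 an hour ago
print(price_dropped(0.30, 0.40))  # a 25% drop
print(price_dropped(0.35, 0.40))  # a 12.5% drop
```

Only the first call crosses the 20% threshold, so only that case would trigger the alert once the `for: 5m` window has elapsed.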

## Step 4: Grafana Dashboard

```json
{
  "dashboard": {
    "title": "Clore.ai GPU Monitoring",
    "panels": [
      {
        "title": "Daily Cost (USD)",
        "type": "stat",
        "targets": [
          {
            "expr": "clore_daily_cost_usd",
            "legendFormat": "Daily Cost"
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "area"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "currencyUSD",
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 25},
                {"color": "red", "value": 50}
              ]
            }
          }
        }
      },
      {
        "title": "Active Orders",
        "type": "stat",
        "targets": [
          {
            "expr": "clore_orders_active_total"
          }
        ]
      },
      {
        "title": "GPU Availability",
        "type": "bargauge",
        "targets": [
          {
            "expr": "clore_gpu_available_total",
            "legendFormat": "{{ gpu_type }}"
          }
        ]
      },
      {
        "title": "Spot Prices (USD/hr)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "clore_gpu_price_spot_usd",
            "legendFormat": "{{ gpu_type }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "currencyUSD"
          }
        }
      },
      {
        "title": "Wallet Balances",
        "type": "table",
        "targets": [
          {
            "expr": "clore_wallet_balance",
            "format": "table"
          }
        ]
      },
      {
        "title": "Orders by Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "clore_orders_by_status",
            "legendFormat": "{{ status }}"
          }
        ]
      },
      {
        "title": "Available Servers",
        "type": "timeseries",
        "targets": [
          {
            "expr": "clore_marketplace_servers_available"
          }
        ]
      },
      {
        "title": "Order Runtime",
        "type": "table",
        "targets": [
          {
            "expr": "clore_order_runtime_seconds / 3600",
            "format": "table",
            "legendFormat": "{{ order_id }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "h"
          }
        }
      }
    ]
  }
}
```
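
The JSON above is already wrapped in a top-level `dashboard` key, which is the shape Grafana's `POST /api/dashboards/db` endpoint expects. A minimal sketch of preparing that request with the standard library is shown below. It assumes Grafana on `http://localhost:3000` with the `admin`/`admin` credentials from the Docker Compose file; `build_import_request` is a hypothetical helper.

```python
import base64
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # assumption: port from the Compose file


def build_import_request(payload: dict, base_url: str = GRAFANA_URL,
                         user: str = "admin", password: str = "admin") -> urllib.request.Request:
    """Prepare a POST to Grafana's dashboard endpoint without sending it."""
    body = dict(payload)
    body.setdefault("overwrite", True)  # replace an existing dashboard of the same title
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Trimmed stand-in for the dashboard JSON above
payload = {"dashboard": {"title": "Clore.ai GPU Monitoring", "panels": []}}
request = build_import_request(payload)
print(request.get_method(), request.full_url)
```

To actually import the dashboard, load the full JSON from disk with `json.load` and pass the prepared request to `urllib.request.urlopen(request, timeout=10)`.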

## Step 5: Docker Compose Stack

```yaml
# docker-compose.yml
version: '3.8'

services:
  clore-exporter:
    build:
      context: .
      dockerfile: Dockerfile.exporter
    environment:
      - CLORE_API_KEY=${CLORE_API_KEY}
    ports:
      - "9090:9090"
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.47.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
      - prometheus-data:/prometheus
    ports:
      - "9091:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.1.0
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
```

## Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-app-password'

route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'critical'

receivers:
  - name: 'default'
    email_configs:
      - to: 'you@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
  
  - name: 'critical'
    email_configs:
      - to: 'you@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#critical-alerts'
```
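
To check that the routing tree sends `critical` alerts to the right receiver without waiting for a real cost spike, a test alert can be posted to Alertmanager's `/api/v2/alerts` endpoint. The sketch below only builds the request. It assumes Alertmanager on `http://localhost:9093`; the alert name `RoutingTest` and the helper names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: port from the Compose file


def build_test_alert(severity: str = "warning", minutes: int = 5) -> list:
    """Build a one-element alert list that resolves itself after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": "RoutingTest", "severity": severity},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def build_alert_request(alerts: list,
                        base_url: str = ALERTMANAGER_URL) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request(build_test_alert("critical"))
print(request.get_method(), request.full_url)
```

Sending it with `urllib.request.urlopen(request, timeout=10)` should make the alert appear in the Alertmanager UI and, if routing works as configured, in the `critical` receiver's channels.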

## Running the Stack

```bash
# Set API key
export CLORE_API_KEY=YOUR_API_KEY

# Start everything
docker-compose up -d

# Access:
# - Prometheus: http://localhost:9091
# - Grafana: http://localhost:3000 (admin/admin)
# - Alertmanager: http://localhost:9093
```
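
After `docker-compose up -d`, a quick probe of each service's health endpoint shows whether the stack came up. The sketch below uses the exporter's `/health` route from Step 1 plus the standard health endpoints of Prometheus, Grafana, and Alertmanager, on the host ports published by the Compose file. The function names are hypothetical, and the `fetch` parameter exists only so the logic can be exercised without a running stack.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Host ports as published by the Compose file above
ENDPOINTS = {
    "exporter": "http://localhost:9090/health",
    "prometheus": "http://localhost:9091/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "alertmanager": "http://localhost:9093/-/healthy",
}


def http_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def check_stack(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True if its health endpoint returned 200."""
    return {name: fetch(url) == 200 for name, url in ENDPOINTS.items()}
```

Calling `check_stack()` with no arguments probes the live services; any entry that comes back `False` is a good place to start with `docker-compose logs <service>`.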

## Key Metrics to Monitor

| Metric                      | Description                     | Alert Threshold                    |
| --------------------------- | ------------------------------- | ---------------------------------- |
| `clore_daily_cost_usd`      | Estimated daily spend           | > $50 (warning), > $100 (critical) |
| `clore_orders_active_total` | Running orders                  | N/A                                |
| `clore_gpu_available_total` | Available GPUs per type         | < 5 (RTX 4090)                     |
| `clore_gpu_price_spot_usd`  | Minimum spot price per GPU type | > 20% drop within 1h (RTX 4090)    |
| `clore_wallet_balance`      | Wallet balance per currency     | < 10 CLORE                         |

## Next Steps

* [Cost Optimization](https://docs.clore.ai/dev/devops-and-automation/cost-optimization)
* [GitHub Actions Integration](https://docs.clore.ai/dev/devops-and-automation/github-actions)
* [Spot Manager](https://docs.clore.ai/dev/devops-and-automation/spot-manager)
