# Deploying a REST API for Model Inference

## Deploy Model as REST API on Clore.ai

### What We're Building

A complete production-ready pipeline to:

1. **Download or train** a machine learning model
2. **Rent a Clore.ai GPU** programmatically
3. **Deploy Flask/FastAPI** inference endpoint
4. **Handle requests** with proper authentication
5. **Monitor** health, performance, and costs

By the end, you'll have a live REST API serving your model on a rented GPU — for a fraction of cloud provider costs.

### Prerequisites

* Clore.ai account with **20+ CLORE** balance
* Python 3.10+
* A model to deploy (we'll use Hugging Face models)

```bash
pip install requests fastapi uvicorn torch transformers flask gunicorn pydantic prometheus-client
```

```
┌─────────────────────────────────────────────────────────────┐
│                      Your Application                       │
└─────────────────────────────────────┬───────────────────────┘
                                      │ HTTPS
                                      ▼
┌─────────────────────────────────────────────────────────────┐
│                 Clore.ai Rented GPU Server                  │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │                    Docker Container                     │ │
│ │  ┌───────────┐   ┌────────────┐   ┌──────────────┐      │ │
│ │  │  FastAPI  │───│   Model    │───│  GPU (CUDA)  │      │ │
│ │  │  Server   │   │  Pipeline  │   │  RTX 4090    │      │ │
│ │  └─────┬─────┘   └────────────┘   └──────────────┘      │ │
│ │        │                                                │ │
│ │  ┌─────┴─────┐   ┌────────────┐                         │ │
│ │  │   Auth    │   │  Metrics   │                         │ │
│ │  │ Middleware│   │ Prometheus │                         │ │
│ │  └───────────┘   └────────────┘                         │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│  Ports: 22 (SSH), 8000 (API), 9090 (Metrics)                │
└─────────────────────────────────────────────────────────────┘
```

### Step 1: Set Up the Clore Client

> 📦 **Using the standard Clore API client.** See [Clore API Client Reference](../reference/clore-client.md) for the full implementation and setup instructions. Save it as `clore_client.py` in your project.

```python
from clore_client import CloreClient

client = CloreClient(api_key="your-api-key")
```

```python
# clore_client.py (excerpt)
import requests
import time
from typing import Dict, Any, List, Optional
from dataclasses import dataclass

@dataclass
class RentalInfo:
    """Information about an active rental."""
    order_id: int
    server_id: int
    status: str
    ssh_host: str
    ssh_port: int
    http_endpoint: str
    cost_per_hour: float
    started_at: int
```

### Step 2: FastAPI Inference Server

```python
# inference_server.py
"""
Production-ready inference server with:
- Model loading and caching
- API key authentication
- Request validation
- Health checks
- Prometheus metrics
- Error handling
"""

import os
import time
import torch
import logging
from typing import Optional, List
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException, Depends, Security, Request
from fastapi.security import APIKeyHeader
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import Response
from pydantic import BaseModel, Field
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Prometheus metrics
REQUEST_COUNT = Counter(
    "inference_requests_total",
    "Total inference requests",
    ["endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "inference_latency_seconds",
    "Request latency in seconds",
    ["endpoint"]
)
GPU_MEMORY_USED = Gauge(
    "gpu_memory_used_bytes",
    "GPU memory used in bytes"
)
MODEL_LOADED = Gauge(
    "model_loaded",
    "Whether model is loaded (1) or not (0)"
)

# Configuration
API_KEYS = set(os.environ.get("API_KEYS", "demo-key-12345").split(","))
MODEL_ID = os.environ.get("MODEL_ID", "microsoft/DialoGPT-medium")
MAX_LENGTH = int(os.environ.get("MAX_LENGTH", "256"))
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Global model state
model = None
tokenizer = None
generator = None

class CompletionRequest(BaseModel):
    """Text completion request."""
    prompt: str = Field(..., min_length=1, max_length=4096)
    max_tokens: int = Field(default=128, ge=1, le=2048)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    top_p: float = Field(default=0.9, ge=0.0, le=1.0)
    stop: Optional[List[str]] = None

class CompletionResponse(BaseModel):
    """Text completion response."""
    id: str
    object: str = "text_completion"
    created: int
    model: str
    choices: List[dict]
    usage: dict

class HealthResponse(BaseModel):
    """Health check response."""
    status: str
    model: str
    device: str
    gpu_available: bool
    gpu_memory_used_mb: Optional[float]
    uptime_seconds: float

# Security
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key

def load_model():
    """Load the model into GPU memory."""
    global model, tokenizer, generator
    
    logger.info(f"Loading model: {MODEL_ID}")
    logger.info(f"Device: {DEVICE}")
    
    try:
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        
        # Set padding token if not set
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
            device_map="auto" if DEVICE == "cuda" else None,
            low_cpu_mem_usage=True
        )
        
        generator = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device=0 if DEVICE == "cuda" else -1
        )
        
        MODEL_LOADED.set(1)
        logger.info("Model loaded successfully")
        
        # Log GPU memory
        if torch.cuda.is_available():
            mem = torch.cuda.memory_allocated() / 1024**2
            logger.info(f"GPU memory used: {mem:.2f} MB")
        
    except Exception as e:
        MODEL_LOADED.set(0)
        logger.error(f"Failed to load model: {e}")
        raise

# Track startup time
startup_time = time.time()

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Application lifespan handler."""
    load_model()
    yield
    # Cleanup
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

app = FastAPI(
    title="Model Inference API",
    description="Production ML inference endpoint on Clore.ai GPU",
    version="1.0.0",
    lifespan=lifespan
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.post("/v1/completions", response_model=CompletionResponse)
async def create_completion(
    request: CompletionRequest,
    api_key: str = Depends(verify_api_key)
):
    """Generate text completion."""
    
    start_time = time.time()
    
    try:
        # Generate
        outputs = generator(
            request.prompt,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            do_sample=request.temperature > 0,
            pad_token_id=tokenizer.eos_token_id,
            return_full_text=False
        )
        
        generated_text = outputs[0]["generated_text"]
        
        # Handle stop sequences
        if request.stop:
            for stop_seq in request.stop:
                if stop_seq in generated_text:
                    generated_text = generated_text.split(stop_seq)[0]
        
        # Calculate tokens
        prompt_tokens = len(tokenizer.encode(request.prompt))
        completion_tokens = len(tokenizer.encode(generated_text))
        
        latency = time.time() - start_time
        
        # Update metrics
        REQUEST_COUNT.labels(endpoint="/v1/completions", status="success").inc()
        REQUEST_LATENCY.labels(endpoint="/v1/completions").observe(latency)
        
        if torch.cuda.is_available():
            GPU_MEMORY_USED.set(torch.cuda.memory_allocated())
        
        return CompletionResponse(
            id=f"cmpl-{int(time.time()*1000)}",
            created=int(time.time()),
            model=MODEL_ID,
            choices=[{
                "text": generated_text,
                "index": 0,
                "finish_reason": "stop" if request.stop else "length"
            }],
            usage={
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens
            }
        )
    
    except Exception as e:
        REQUEST_COUNT.labels(endpoint="/v1/completions", status="error").inc()
        logger.error(f"Inference error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/v1/chat/completions")
async def create_chat_completion(
    request: dict,
    api_key: str = Depends(verify_api_key)
):
    """Chat completion endpoint (OpenAI compatible)."""
    
    messages = request.get("messages", [])
    
    # Convert messages to prompt
    prompt = ""
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role == "system":
            prompt += f"System: {content}\n"
        elif role == "user":
            prompt += f"User: {content}\n"
        elif role == "assistant":
            prompt += f"Assistant: {content}\n"
    
    prompt += "Assistant:"
    
    # Generate
    completion_request = CompletionRequest(
        prompt=prompt,
        max_tokens=request.get("max_tokens", 128),
        temperature=request.get("temperature", 0.7),
        top_p=request.get("top_p", 0.9)
    )
    
    result = await create_completion(completion_request, api_key)
    
    # Convert to chat format
    return {
        "id": result.id,
        "object": "chat.completion",
        "created": result.created,
        "model": result.model,
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": result.choices[0]["text"]
            },
            "finish_reason": result.choices[0]["finish_reason"]
        }],
        "usage": result.usage
    }

@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint."""
    
    gpu_memory = None
    if torch.cuda.is_available():
        gpu_memory = torch.cuda.memory_allocated() / 1024**2
    
    return HealthResponse(
        status="healthy" if generator is not None else "unhealthy",
        model=MODEL_ID,
        device=DEVICE,
        gpu_available=torch.cuda.is_available(),
        gpu_memory_used_mb=gpu_memory,
        uptime_seconds=time.time() - startup_time
    )

@app.get("/v1/models")
async def list_models():
    """List available models."""
    return {
        "object": "list",
        "data": [{
            "id": MODEL_ID,
            "object": "model",
            "created": int(startup_time),
            "owned_by": "clore-deployment"
        }]
    }

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    if torch.cuda.is_available():
        GPU_MEMORY_USED.set(torch.cuda.memory_allocated())
    
    return Response(
        content=generate_latest(),
        media_type="text/plain"
    )

@app.get("/")
async def root():
    """Root endpoint."""
    return {
        "name": "Model Inference API",
        "version": "1.0.0",
        "model": MODEL_ID,
        "docs": "/docs"
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
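The `/v1/chat/completions` handler above flattens the OpenAI-style `messages` array into a single prompt. That conversion is easy to exercise in isolation; this standalone copy mirrors the server logic:

```python
from typing import List

def messages_to_prompt(messages: List[dict]) -> str:
    """Flatten OpenAI-style chat messages into a plain-text prompt,
    mirroring the conversion in create_chat_completion()."""
    prompt = ""
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role == "system":
            prompt += f"System: {content}\n"
        elif role == "user":
            prompt += f"User: {content}\n"
        elif role == "assistant":
            prompt += f"Assistant: {content}\n"
    # Trailing tag cues the model to respond as the assistant
    return prompt + "Assistant:"

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]
print(messages_to_prompt(messages))
# System: You are helpful.
# User: Hi!
# Assistant:
```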

### Step 3: Docker Configuration

```dockerfile
# Dockerfile
FROM nvidia/cuda:12.8.0-runtime-ubuntu22.04

# Install Python
RUN apt-get update && apt-get install -y \
    python3 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application
COPY inference_server.py .

# Environment
ENV MODEL_ID=microsoft/DialoGPT-medium
ENV MAX_LENGTH=256
ENV API_KEYS=demo-key-12345
ENV NVIDIA_VISIBLE_DEVICES=all

# Pre-download model (optional, speeds up startup)
RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
    AutoTokenizer.from_pretrained('${MODEL_ID}'); \
    AutoModelForCausalLM.from_pretrained('${MODEL_ID}')"

EXPOSE 8000

CMD ["uvicorn", "inference_server:app", "--host", "0.0.0.0", "--port", "8000"]
```

```txt
# requirements.txt
fastapi>=0.100.0
uvicorn>=0.23.0
torch>=2.0.0
transformers>=4.30.0
pydantic>=2.0.0
prometheus-client>=0.17.0
accelerate>=0.21.0
```

### Step 4: Deployment Script

```python
#!/usr/bin/env python3
"""
Complete Deployment Script for Model REST API on Clore.ai

Usage:
    export CLORE_API_KEY=your_api_key
    python deploy_model_api.py

This script will:
1. Find a suitable GPU
2. Rent the server
3. Deploy the inference container
4. Wait for service to be ready
5. Test the endpoint
"""

import os
import sys
import time
import requests
from clore_client import CloreClient, RentalInfo

# Configuration
MODEL_CONFIG = {
    "model_id": "microsoft/DialoGPT-medium",  # Change to your model
    "docker_image": "your-registry/inference-server:latest",  # placeholder: push the image built in Step 3 to a registry you control (the health check and X-API-Key auth below assume that server)
    "gpu_type": "RTX",
    "max_price_usd": 0.50,
    "min_vram_gb": 16,
    "api_keys": "my-secret-key-123,backup-key-456"
}

SSH_PASSWORD = "SecureClorePass123!"

def print_step(step: str, message: str):
    """Print formatted step."""
    print(f"\n{'='*60}")
    print(f"📋 Step {step}: {message}")
    print('='*60)

def check_balance(client: CloreClient) -> float:
    """Check and display balance."""
    print_step("1", "Checking Balance")
    
    balances = client.get_balance()
    clore = balances.get("CLORE-Blockchain", 0)
    
    print(f"   💰 CLORE: {clore:.2f}")
    print(f"   💵 BTC: {balances.get('bitcoin', 0):.8f}")
    print(f"   💲 USD: {balances.get('USD-Blockchain', 0):.2f}")
    
    if clore < 10:
        print("\n❌ Insufficient CLORE balance (need 10+)")
        sys.exit(1)
    
    return clore

def find_gpu(client: CloreClient) -> dict:
    """Find suitable GPU."""
    print_step("2", "Finding GPU")
    
    gpu = client.find_gpu(
        gpu_type=MODEL_CONFIG["gpu_type"],
        max_price=MODEL_CONFIG["max_price_usd"],
        min_vram=MODEL_CONFIG["min_vram_gb"]
    )
    
    if not gpu:
        print(f"\n❌ No {MODEL_CONFIG['gpu_type']} GPU available under ${MODEL_CONFIG['max_price_usd']}/hr")
        sys.exit(1)
    
    print(f"   🖥️  Server ID: {gpu['id']}")
    print(f"   🎮 GPU: {', '.join(gpu['gpus'])}")
    print(f"   💵 Price: ${gpu['price_usd']:.2f}/hr (${gpu['price_usd']*24:.2f}/day)")
    print(f"   ⭐ Reliability: {gpu['reliability']}%")
    
    return gpu

def deploy_server(client: CloreClient, server_id: int) -> RentalInfo:
    """Deploy the inference server."""
    print_step("3", "Deploying Server")
    
    print(f"   🐳 Image: {MODEL_CONFIG['docker_image']}")
    print(f"   🤖 Model: {MODEL_CONFIG['model_id']}")
    print("   ⏳ Creating order...")
    
    # Environment variables for the container
    env = {
        "NVIDIA_VISIBLE_DEVICES": "all",
        "MODEL_ID": MODEL_CONFIG["model_id"],
        "API_KEYS": MODEL_CONFIG["api_keys"],
        "HF_TOKEN": os.environ.get("HF_TOKEN", ""),  # For gated models
    }
    
    # Ports to expose
    ports = {
        "22": "tcp",      # SSH
        "8000": "http",   # API
        "9090": "http"    # Metrics (optional)
    }
    
    rental = client.rent_gpu(
        server_id=server_id,
        docker_image=MODEL_CONFIG["docker_image"],
        ports=ports,
        env=env,
        ssh_password=SSH_PASSWORD
    )
    
    print(f"   ✅ Order created: {rental.order_id}")
    print(f"   🔗 SSH: ssh root@{rental.ssh_host} -p {rental.ssh_port}")
    print(f"   🌐 API: {rental.http_endpoint}")
    
    return rental

def wait_for_service(endpoint: str, timeout: int = 300) -> bool:
    """Wait for the inference service to be ready."""
    print_step("4", "Waiting for Service")
    
    # Clean up endpoint
    if not endpoint.startswith("http"):
        endpoint = f"http://{endpoint}"
    
    health_url = f"{endpoint}/health"
    print(f"   🔍 Checking: {health_url}")
    
    start = time.time()
    last_error = None
    
    while time.time() - start < timeout:
        try:
            response = requests.get(health_url, timeout=5)
            if response.status_code == 200:
                data = response.json()
                if data.get("status") == "healthy":
                    print(f"\n   ✅ Service is healthy!")
                    print(f"   🤖 Model: {data.get('model', 'unknown')}")
                    print(f"   🎮 GPU: {'available' if data.get('gpu_available') else 'not available'}")
                    if data.get("gpu_memory_used_mb"):
                        print(f"   💾 GPU Memory: {data['gpu_memory_used_mb']:.0f} MB")
                    return True
        except requests.exceptions.ConnectionError:
            last_error = "Connection refused (server starting...)"
        except requests.exceptions.Timeout:
            last_error = "Timeout"
        except Exception as e:
            last_error = str(e)
        
        elapsed = int(time.time() - start)
        print(f"   ⏳ Waiting... {elapsed}s (last: {last_error})", end="\r")
        time.sleep(5)
    
    print(f"\n   ❌ Timeout waiting for service")
    return False

def test_inference(endpoint: str, api_key: str) -> bool:
    """Test the inference endpoint."""
    print_step("5", "Testing Inference")
    
    if not endpoint.startswith("http"):
        endpoint = f"http://{endpoint}"
    
    url = f"{endpoint}/v1/completions"
    
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": api_key
    }
    
    payload = {
        "prompt": "Hello, how are you today?",
        "max_tokens": 50,
        "temperature": 0.7
    }
    
    print(f"   📤 Request: {payload['prompt']}")
    
    try:
        start = time.time()
        response = requests.post(url, json=payload, headers=headers, timeout=60)
        latency = time.time() - start
        
        if response.status_code == 200:
            data = response.json()
            generated = data.get("choices", [{}])[0].get("text", "")
            tokens = data.get("usage", {}).get("total_tokens", 0)
            
            print(f"   📥 Response: {generated[:100]}...")
            print(f"   ⏱️  Latency: {latency:.2f}s")
            print(f"   🔢 Tokens: {tokens}")
            return True
        else:
            print(f"   ❌ Error: {response.status_code} - {response.text}")
            return False
    
    except Exception as e:
        print(f"   ❌ Error: {e}")
        return False

def print_summary(rental: RentalInfo, endpoint: str, api_key: str):
    """Print deployment summary."""
    print(f"\n{'='*60}")
    print("🎉 DEPLOYMENT SUCCESSFUL!")
    print('='*60)
    
    print(f"""
📋 Deployment Details:
   Order ID: {rental.order_id}
   Server ID: {rental.server_id}
   
🔗 Connection:
   SSH: ssh root@{rental.ssh_host} -p {rental.ssh_port}
   Password: {SSH_PASSWORD}
   
🌐 API Endpoint:
   Base URL: {endpoint}
   Health: {endpoint}/health
   Docs: {endpoint}/docs
   
🔑 Authentication:
   Header: X-API-Key: {api_key}
   
💰 Cost:
   ${rental.cost_per_hour:.2f}/hour
   ~${rental.cost_per_hour * 24:.2f}/day
   ~${rental.cost_per_hour * 720:.2f}/month

📝 Example Usage:

curl -X POST {endpoint}/v1/completions \\
  -H "Content-Type: application/json" \\
  -H "X-API-Key: {api_key}" \\
  -d '{{"prompt": "Hello!", "max_tokens": 100}}'

Python:

import requests

response = requests.post(
    "{endpoint}/v1/completions",
    headers={{"X-API-Key": "{api_key}"}},
    json={{"prompt": "Hello!", "max_tokens": 100}}
)
print(response.json())
""")

def cleanup(client: CloreClient, order_id: int):
    """Cleanup resources."""
    print("\n🧹 Cleaning up...")
    try:
        client.cancel_order(order_id)
        print(f"   ✅ Order {order_id} cancelled")
    except Exception as e:
        print(f"   ⚠️ Cleanup warning: {e}")

def main():
    # Get API key
    api_key = os.environ.get("CLORE_API_KEY")
    if not api_key:
        print("❌ Set CLORE_API_KEY environment variable")
        print("   export CLORE_API_KEY=your_api_key")
        sys.exit(1)
    
    # Initialize client
    client = CloreClient(api_key)
    rental = None
    
    try:
        # Step 1: Check balance
        check_balance(client)
        
        # Step 2: Find GPU
        gpu = find_gpu(client)
        
        # Confirm
        confirm = input("\n🚀 Proceed with deployment? (y/n): ").strip().lower()
        if confirm != 'y':
            print("Cancelled.")
            sys.exit(0)
        
        # Step 3: Deploy
        rental = deploy_server(client, gpu["id"])
        
        # Step 4: Wait for service
        endpoint = rental.http_endpoint
        if not endpoint:
            endpoint = f"{rental.ssh_host}:8000"
        
        service_ready = wait_for_service(endpoint)
        if not service_ready:
            print("\n⚠️ Service may still be starting. Check manually.")
        
        # Step 5: Test
        test_api_key = MODEL_CONFIG["api_keys"].split(",")[0]
        if service_ready:
            test_inference(endpoint, test_api_key)
        
        # Summary
        print_summary(rental, endpoint, test_api_key)
        
        # Keep running
        print("\n⏸️ Press Enter to cancel the order and cleanup...")
        input()
        
    except KeyboardInterrupt:
        print("\n\n⚠️ Interrupted!")
    finally:
        if rental:
            cleanup(client, rental.order_id)
    
    print("\n✅ Done!")

if __name__ == "__main__":
    main()
```
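`wait_for_service()` above is one instance of a general poll-until-ready pattern. A reusable sketch (`wait_until` is a hypothetical helper, not part of the deploy script) with injectable clock and sleep functions so the timeout logic can be unit-tested without real waiting:

```python
import time
from typing import Callable

def wait_until(check: Callable[[], bool], timeout: float = 300, interval: float = 5,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() every `interval` seconds until it returns True
    or `timeout` seconds elapse. Returns whether check() succeeded."""
    start = clock()
    while clock() - start < timeout:
        if check():
            return True
        sleep(interval)
    return False
```

In the deploy script this would be, e.g., `wait_until(lambda: requests.get(health_url, timeout=5).ok, timeout=300)`, with the health-JSON inspection layered on top.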

### Step 5: Monitoring Dashboard

```python
# monitoring.py
"""
Simple monitoring dashboard for deployed model.
Displays real-time metrics and cost tracking.
"""

import os
import time
import requests
from datetime import datetime
from clore_client import CloreClient

class ModelMonitor:
    """Monitor deployed model health and costs."""
    
    def __init__(self, endpoint: str, api_key: str, order_id: int,
                 cost_per_hour: float, clore_client: CloreClient):
        self.endpoint = endpoint
        self.api_key = api_key
        self.order_id = order_id
        self.cost_per_hour = cost_per_hour
        self.client = clore_client
        self.start_time = time.time()
        self.request_count = 0
        self.error_count = 0
    
    def get_health(self) -> dict:
        """Get service health."""
        try:
            url = f"{self.endpoint}/health"
            response = requests.get(url, timeout=5)
            return response.json() if response.status_code == 200 else {"status": "error"}
        except requests.RequestException:
            return {"status": "unreachable"}
    
    def get_metrics(self) -> dict:
        """Get Prometheus metrics."""
        try:
            url = f"{self.endpoint}/metrics"
            response = requests.get(url, timeout=5)
            return {"raw": response.text} if response.status_code == 200 else {}
        except requests.RequestException:
            return {}
    
    def calculate_cost(self) -> dict:
        """Calculate current costs."""
        runtime_hours = (time.time() - self.start_time) / 3600
        current_cost = runtime_hours * self.cost_per_hour
        
        return {
            "runtime_hours": runtime_hours,
            "cost_usd": current_cost,
            "cost_per_request": current_cost / max(self.request_count, 1),
            "projected_daily": self.cost_per_hour * 24,
            "projected_monthly": self.cost_per_hour * 24 * 30
        }
    
    def check_order_status(self) -> dict:
        """Check order status on Clore."""
        try:
            order = self.client.get_order_status(self.order_id)
            return order if order else {"status": "not_found"}
        except Exception:
            return {"status": "error"}
    
    def print_status(self):
        """Print current status dashboard."""
        health = self.get_health()
        costs = self.calculate_cost()
        order = self.check_order_status()
        
        os.system('clear' if os.name != 'nt' else 'cls')
        
        print("="*60)
        print("📊 MODEL DEPLOYMENT MONITOR")
        print("="*60)
        print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        print()
        
        # Health
        status_icon = "✅" if health.get("status") == "healthy" else "❌"
        print(f"🏥 Health: {status_icon} {health.get('status', 'unknown')}")
        print(f"   Model: {health.get('model', 'unknown')}")
        print(f"   Device: {health.get('device', 'unknown')}")
        print(f"   GPU: {'✓' if health.get('gpu_available') else '✗'}")
        if health.get("gpu_memory_used_mb"):
            print(f"   GPU Memory: {health['gpu_memory_used_mb']:.0f} MB")
        print()
        
        # Order
        order_status = order.get("status", "unknown")
        print(f"📦 Order: {self.order_id}")
        print(f"   Status: {order_status}")
        print()
        
        # Costs
        print(f"💰 Costs:")
        print(f"   Runtime: {costs['runtime_hours']:.2f} hours")
        print(f"   Current: ${costs['cost_usd']:.4f}")
        print(f"   Per Request: ${costs['cost_per_request']:.6f}")
        print(f"   Daily (proj): ${costs['projected_daily']:.2f}")
        print(f"   Monthly (proj): ${costs['projected_monthly']:.2f}")
        print()
        
        # Stats
        print(f"📈 Stats:")
        print(f"   Requests: {self.request_count}")
        print(f"   Errors: {self.error_count}")
        if self.request_count > 0:
            print(f"   Error Rate: {self.error_count / self.request_count * 100:.1f}%")
        print()
        
        print("="*60)
        print("Press Ctrl+C to stop monitoring")
    
    def run(self, interval: int = 10):
        """Run monitoring loop."""
        print("Starting monitor...")
        
        try:
            while True:
                self.print_status()
                time.sleep(interval)
        except KeyboardInterrupt:
            print("\n\nMonitoring stopped.")

def main():
    """Run monitoring for a deployed model."""
    import argparse
    
    parser = argparse.ArgumentParser(description="Monitor deployed model")
    parser.add_argument("--endpoint", required=True, help="Model endpoint URL")
    parser.add_argument("--order-id", type=int, required=True, help="Clore order ID")
    parser.add_argument("--cost", type=float, required=True, help="Cost per hour in USD")
    parser.add_argument("--interval", type=int, default=10, help="Update interval in seconds")
    
    args = parser.parse_args()
    
    api_key = os.environ.get("CLORE_API_KEY")
    if not api_key:
        print("Set CLORE_API_KEY environment variable")
        return
    
    client = CloreClient(api_key)
    
    monitor = ModelMonitor(
        endpoint=args.endpoint,
        api_key="",  # Not needed for health checks
        order_id=args.order_id,
        cost_per_hour=args.cost,
        clore_client=client
    )
    
    monitor.run(interval=args.interval)

if __name__ == "__main__":
    main()
```
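`get_metrics()` above returns the raw Prometheus exposition text. To show individual counters (e.g. `inference_requests_total`) in the dashboard, a small parser for the text format is enough. A sketch (`parse_metric` is illustrative, not part of `monitoring.py`) assuming the standard `name{labels} value` line layout and label values without spaces:

```python
def parse_metric(text: str, name: str) -> dict:
    """Extract all samples of one metric from Prometheus exposition text.
    Returns {label_string: value}; un-labelled samples use the key ''."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        metric, _, value = line.rpartition(" ")
        if metric == name:
            samples[""] = float(value)
        elif metric.startswith(name + "{"):
            labels = metric[len(name) + 1:-1]  # strip name and braces
            samples[labels] = float(value)
    return samples

sample = """\
# HELP inference_requests_total Total inference requests
# TYPE inference_requests_total counter
inference_requests_total{endpoint="/v1/completions",status="success"} 42.0
gpu_memory_used_bytes 1048576.0
"""
print(parse_metric(sample, "inference_requests_total"))
# {'endpoint="/v1/completions",status="success"': 42.0}
```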

### Running the Deployment

```bash
# 1. Set environment variables
export CLORE_API_KEY=your_api_key
export HF_TOKEN=your_huggingface_token  # Optional, for gated models

# 2. Run deployment
python deploy_model_api.py

# Expected output:
# ============================================================
# 📋 Step 1: Checking Balance
# ============================================================
#    💰 CLORE: 150.00
#    💵 BTC: 0.00123456
#    💲 USD: 50.00
#
# ============================================================
# 📋 Step 2: Finding GPU
# ============================================================
#    🖥️  Server ID: 12345
#    🎮 GPU: RTX 4090
#    💵 Price: $0.35/hr ($8.40/day)
#    ⭐ Reliability: 98%
#
# 🚀 Proceed with deployment? (y/n): y
# ...

# 3. Monitor the deployment
python monitoring.py \
  --endpoint http://your-endpoint:8000 \
  --order-id 12345 \
  --cost 0.35

# 4. Test the API
curl -X POST http://your-endpoint:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: my-secret-key-123" \
  -d '{"prompt": "Write a poem about AI:", "max_tokens": 100}'
```

### Using Pre-built Images

Instead of building your own Docker image, use these pre-built inference servers:

```python
# Popular inference images for Clore
INFERENCE_IMAGES = {
    # Hugging Face Text Generation Inference (TGI)
    "tgi": "ghcr.io/huggingface/text-generation-inference:latest",
    
    # vLLM - High-throughput inference
    "vllm": "vllm/vllm-openai:latest",
    
    # Ollama - Easy model serving
    "ollama": "ollama/ollama:latest",
    
    # Basic CUDA with Python
    "cuda": "nvidia/cuda:12.8.0-runtime-ubuntu22.04",
}

# Example: Deploy with vLLM
env = {
    "NVIDIA_VISIBLE_DEVICES": "all",
    "MODEL": "meta-llama/Llama-2-7b-chat-hf",
    "TENSOR_PARALLEL_SIZE": "1"
}

ports = {"22": "tcp", "8000": "http"}

rental = client.rent_gpu(
    server_id=gpu["id"],
    docker_image=INFERENCE_IMAGES["vllm"],
    ports=ports,
    env=env,
    ssh_password="MyPassword123!"
)
```

### Cost Comparison

| Model Size      | GPU       | Clore.ai   | AWS p3.2xlarge | Savings |
| --------------- | --------- | ---------- | -------------- | ------- |
| 7B params       | RTX 3090  | ~$0.20/hr | $3.06/hr       | 93%     |
| 13B params      | RTX 4090  | ~$0.35/hr | $3.06/hr       | 89%     |
| 70B params      | A100 40GB | ~$1.20/hr | $4.10/hr       | 71%     |
| 70B params (Q4) | RTX 4090  | ~$0.35/hr | N/A            | ∞       |

**Monthly savings for a 70B model:** ~$2,100/month compared to AWS.
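The savings column follows directly from the hourly rates; a quick check of the arithmetic (`savings_pct` is just an illustration):

```python
def savings_pct(clore_hr: float, aws_hr: float) -> int:
    """Percent saved versus AWS at the given hourly rates."""
    return round((1 - clore_hr / aws_hr) * 100)

# Rows of the table above: (Clore $/hr, AWS $/hr)
print(savings_pct(0.20, 3.06))  # 93
print(savings_pct(0.35, 3.06))  # 89
print(savings_pct(1.20, 4.10))  # 71

# Monthly savings for the 70B row (~720 billable hours/month)
print(round((4.10 - 1.20) * 720))  # 2088, i.e. roughly $2,100/month
```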

### Next Steps

* [Multi-Model Inference Router](https://docs.clore.ai/dev/inference-and-deployment/model-router) — A/B testing and canary deployments
* [Real-Time Video Processing](https://docs.clore.ai/dev/inference-and-deployment/video-processing) — GPU-accelerated video pipelines
* [Spot Market Strategies](https://github.com/defiocean/dev/blob/main/cost-optimization/spot-bidding.md) — Further cost reduction
