Multi-Model Inference Router on Clore.ai

What We're Building

A production-ready multi-model inference routing system that:

  • Routes inference requests to different models across multiple Clore.ai GPUs

  • Supports A/B testing for model comparison

  • Enables canary deployments for gradual rollouts

  • Tracks performance metrics (latency, cost, accuracy) per model

  • Dynamically adjusts routing based on model performance and cost

Think of it as your own load balancer for AI models — deploy multiple versions across cheap Clore.ai GPUs and let the router decide which one handles each request.

Use cases:

  • Compare Llama 3.1 70B vs Llama 3.2 90B in production

  • Roll out a fine-tuned model to 10% of traffic first

  • Route expensive requests to cheaper GPU tiers

  • Failover between models when one goes down

Prerequisites

  • Clore.ai account with 50+ CLORE balance

  • Python 3.10+

  • Docker (for model server images)

  • Basic understanding of FastAPI

Architecture

```
                    ┌──────────┐      ┌──────────────────┐
Inference Request ─▶│ Strategy │      │  Model Registry  │
                    │  Engine  │      │  ┌────────────┐  │
                    └────┬─────┘      │  │  model-a   │  │
                         │            │  │  model-b   │  │
                         ▼            │  │  model-c   │  │
                    ┌─────────┐       │  └────────────┘  │
                    │ Health  │       └──────────────────┘
                    │ Checker │
                    └────┬────┘
                         │
     ┌───────────────────┼───────────────────┐
     ▼                   ▼                   ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│  Clore GPU 1  │ │  Clore GPU 2  │ │  Clore GPU 3  │
│   RTX 4090    │ │   RTX 3090    │ │     A100      │
│   Llama 70B   │ │   Llama 70B   │ │   Llama 90B   │
│   (v1.0.0)    │ │   (v1.1.0)    │ │   (canary)    │
└───────────────┘ └───────────────┘ └───────────────┘
```

Step 1: Clore Client (clore_client.py)

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import requests


@dataclass
class DeployedModel:
    """Represents a deployed model on Clore.ai."""
    order_id: int
    server_id: int
    model_name: str
    model_version: str
    endpoint: str
    gpu_type: str
    cost_per_hour: float
    weight: float = 1.0  # For A/B testing
    is_canary: bool = False
    health_status: str = "unknown"
    avg_latency_ms: float = 0.0
    request_count: int = 0
    error_count: int = 0
```
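Alongside the dataclass, clore_client.py needs a thin HTTP client for the Clore.ai API. Here is a minimal sketch; the base URL and auth header name are assumptions for illustration, so check the official API docs for the real values:

```python
import requests


class CloreClient:
    """Thin wrapper around the Clore.ai marketplace HTTP API.

    The base URL and header name below are illustrative assumptions,
    not confirmed endpoints.
    """

    BASE_URL = "https://api.clore.ai/v1"  # assumed base URL

    def __init__(self, api_key: str):
        self.session = requests.Session()
        self.session.headers.update({"auth": api_key})

    def get(self, path: str, **params):
        # Every call reuses one session so TCP connections are pooled.
        resp = self.session.get(f"{self.BASE_URL}/{path}", params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()
```

Keeping the API surface behind one class makes it easy to swap in a mock client when testing the router offline.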

Step 2: Routing Strategies
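The two strategies from the feature list, A/B weighting and canary splitting, can be sketched as plain functions. This uses a pared-down stand-in for the DeployedModel dataclass (only routing-relevant fields), so it is a sketch rather than the full implementation:

```python
import random
from dataclasses import dataclass


@dataclass
class Model:
    """Minimal stand-in for DeployedModel with only routing fields."""
    weight: float = 1.0
    is_canary: bool = False
    health_status: str = "healthy"


def weighted_choice(models, rng=random):
    """A/B routing: pick a healthy model with probability proportional to weight."""
    healthy = [m for m in models if m.health_status == "healthy"]
    if not healthy:
        raise RuntimeError("no healthy models available")
    r = rng.uniform(0, sum(m.weight for m in healthy))
    cum = 0.0
    for m in healthy:
        cum += m.weight
        if r <= cum:
            return m
    return healthy[-1]


def canary_choice(models, canary_fraction=0.1, rng=random):
    """Canary routing: divert a fixed fraction of traffic to canary replicas."""
    canaries = [m for m in models if m.is_canary]
    stable = [m for m in models if not m.is_canary]
    if canaries and rng.random() < canary_fraction:
        return weighted_choice(canaries, rng)
    return weighted_choice(stable or canaries, rng)
```

With `canary_fraction=0.1`, roughly 10% of requests hit the canary, matching the "roll out to 10% of traffic first" use case above.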

Step 3: Model Registry and Health Checker
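A minimal sketch of the registry with a background health checker. The probe function is injected (e.g. an HTTP GET against each model's `/health` endpoint); the class and method names here are illustrative, not the document's exact implementation:

```python
import threading
import time


class ModelRegistry:
    """In-memory registry; a background thread marks models healthy/unhealthy."""

    def __init__(self, check_fn, interval_s=15.0):
        self._models = {}
        self._lock = threading.Lock()
        self._check_fn = check_fn  # callable(model) -> bool, e.g. probes /health
        self._interval = interval_s

    def register(self, key, model):
        with self._lock:
            self._models[key] = model

    def healthy_models(self):
        with self._lock:
            return [m for m in self._models.values() if m.health_status == "healthy"]

    def run_checks_once(self):
        # Snapshot under the lock, probe outside it so slow checks don't block routing.
        with self._lock:
            models = list(self._models.values())
        for m in models:
            m.health_status = "healthy" if self._check_fn(m) else "unhealthy"

    def start(self):
        def loop():
            while True:
                self.run_checks_once()
                time.sleep(self._interval)
        threading.Thread(target=loop, daemon=True).start()
```

Routing strategies only ever see `healthy_models()`, which is what makes the failover use case above automatic: a backend that stops answering its health probe simply drops out of the candidate set.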

Step 4: Model Deployer
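The deployer rents a GPU server, starts the model image, and polls until the endpoint is live. A hedged sketch: `create_order` and `order_status` are placeholder method names standing in for whatever the client in clore_client.py actually exposes, not the real Clore.ai API:

```python
import time


def deploy_model(client, server_id, image, model_name, model_version,
                 poll_s=10.0, timeout_s=600.0):
    """Rent a server, start the model image, and wait for an endpoint.

    `client` is assumed to expose `create_order` and `order_status`;
    those names are placeholders, not the confirmed Clore.ai API.
    """
    order = client.create_order(server_id=server_id, image=image)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = client.order_status(order["order_id"])
        if status.get("endpoint"):
            return {
                "order_id": order["order_id"],
                "endpoint": status["endpoint"],
                "model_name": model_name,
                "model_version": model_version,
            }
        time.sleep(poll_s)
    raise TimeoutError(f"{model_name}:{model_version} never became ready")
```

Because the client is passed in, the polling loop can be exercised with a fake client before spending any CLORE on real orders.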

Step 5: FastAPI Router Service

Step 6: Complete Deployment Script
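The deployment script ties the steps together: it reads a rollout plan matching the architecture diagram and deploys each entry. A sketch of the driver; the server choices and traffic weights below are made-up examples, not recommended values:

```python
"""deploy_all.py - illustrative driver; GPU picks and weights are made up."""

PLAN = [
    {"model": "llama-3.1-70b", "version": "v1.0.0", "gpu": "RTX 4090",
     "weight": 0.6, "canary": False},
    {"model": "llama-3.1-70b", "version": "v1.1.0", "gpu": "RTX 3090",
     "weight": 0.3, "canary": False},
    {"model": "llama-3.2-90b", "version": "canary", "gpu": "A100",
     "weight": 0.1, "canary": True},
]


def validate_plan(plan):
    """Traffic weights must partition 100% of requests before deploying."""
    total = sum(p["weight"] for p in plan)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return plan


if __name__ == "__main__":
    for p in validate_plan(PLAN):
        print(f"deploying {p['model']}:{p['version']} on {p['gpu']} "
              f"(weight={p['weight']}, canary={p['canary']})")
```

Validating the plan up front avoids discovering a mis-weighted rollout only after three GPU orders are already billing.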

Running the Router
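Once the service is up (e.g. via uvicorn), any HTTP client can send requests. A stdlib-only Python example; the URL, path, and payload shape are assumptions about the router's contract, so adjust them to your deployment:

```python
import json
import urllib.request


def build_payload(prompt, max_tokens=128):
    """Serialize an inference request body (shape is an assumed contract)."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()


def send_inference(prompt, url="http://localhost:8080/v1/infer"):
    """POST a prompt to the router and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```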

Cost Comparison

| Setup | GPU Config | Hourly Cost | Monthly Cost | Typical Cloud |
|---|---|---|---|---|
| Single Model | 1x RTX 4090 | ~$0.40 | ~$288 | AWS g5.xlarge: $720 |
| A/B Test (2 models) | 2x RTX 4090 | ~$0.80 | ~$576 | AWS (2x): $1,440 |
| Canary (3 replicas) | 2x RTX 4090 + 1x A100 | ~$1.80 | ~$1,296 | AWS (3x): $2,160 |
| Cost-Optimized | Mixed tier | ~$1.20 | ~$864 | AWS mixed: $1,500+ |

Savings: roughly 40-60% compared to major cloud providers, with more GPU variety.

Next Steps
