# ONNX Runtime GPU

> **Cross-platform, hardware-accelerated ML inference — deploy any model from any framework**

ONNX Runtime (ORT) is Microsoft's open-source inference engine for ONNX (Open Neural Network Exchange) models. It provides hardware-accelerated inference across CPUs, GPUs, and specialized accelerators through a unified API. Whether your model was trained in PyTorch, TensorFlow, Scikit-learn, or XGBoost — if you can export it to ONNX format, ORT can run it, typically faster than the source framework's own inference path.

**GitHub:** [microsoft/onnxruntime](https://github.com/microsoft/onnxruntime) — 14K+ ⭐

***

## Why ONNX Runtime?

| Feature                | ONNX Runtime      | TorchScript    | TensorFlow Serving |
| ---------------------- | ----------------- | -------------- | ------------------ |
| Framework-agnostic     | ✅                 | ❌ PyTorch only | ❌ TF only          |
| GPU acceleration       | ✅ CUDA/TensorRT   | ✅              | ✅                  |
| INT8/FP16 quantization | ✅                 | Partial        | Partial            |
| Mobile/Edge deploy     | ✅                 | Limited        | Limited            |
| Operator fusion        | ✅                 | Partial        | ✅                  |
| Easy integration       | ✅ Python/C++/Java | Python         | Python/gRPC        |

{% hint style="success" %}
**Key benefit:** ONNX Runtime with CUDA execution provider typically delivers **1.5–3x speedup** over native PyTorch inference for computer vision and NLP models.
{% endhint %}

***

## Supported Execution Providers

ONNX Runtime supports multiple hardware backends (Execution Providers):

| Provider                    | Hardware      | Use Case              |
| --------------------------- | ------------- | --------------------- |
| `CUDAExecutionProvider`     | NVIDIA GPUs   | General GPU inference |
| `TensorrtExecutionProvider` | NVIDIA GPUs   | Maximum throughput    |
| `CPUExecutionProvider`      | CPU           | Fallback / edge       |
| `ROCMExecutionProvider`     | AMD GPUs      | AMD hardware          |
| `CoreMLExecutionProvider`   | Apple Silicon | macOS/iOS             |
| `OpenVINOExecutionProvider` | Intel         | Intel CPUs/GPUs       |
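`InferenceSession` takes an ordered provider list and falls back down the list: the first available provider that can handle a node runs it. A small helper makes that priority explicit (`select_providers` and the preference ranking are our own illustration, not part of the ORT API):

```python
# Build an ordered provider list from a preference ranking.
# Providers missing from the current onnxruntime build are skipped,
# so the session falls back to the next best option.
PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "CPUExecutionProvider",
]

def select_providers(available, preference=PREFERENCE):
    """Return preferred providers that are actually available, best first."""
    chosen = [p for p in preference if p in available]
    # Always keep the CPU provider as a last-resort fallback
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# Usage (with onnxruntime installed):
#   import onnxruntime as ort
#   providers = select_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```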

***

## Prerequisites

* Clore.ai account with a GPU rental
* Basic Python knowledge
* A trained model (PyTorch, TensorFlow, or pre-exported ONNX)

***

## Step 1 — Rent a GPU on Clore.ai

1. Go to [clore.ai](https://clore.ai) → **Marketplace**
2. Any NVIDIA GPU works — from RTX 3070 for small models to A100 for large transformers
3. **For transformer models:** RTX 4090 or A100 recommended
4. **For computer vision:** RTX 3090 or RTX 4090 is sufficient

***

## Step 2 — Deploy Your Container

ONNX Runtime doesn't have an official pre-built container, but the NVIDIA CUDA base is ideal:

**Docker Image:**

```
nvcr.io/nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04
```

**Ports:**

```
22
```

**Environment Variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
```

{% hint style="info" %}
Alternatively, use `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime` which includes CUDA and a Python environment ready for ORT installation.
{% endhint %}

***

## Step 3 — Install ONNX Runtime with GPU Support

```bash
ssh root@<server-ip> -p <ssh-port>

# Update packages
apt-get update && apt-get install -y \
    python3-pip \
    python3-dev \
    wget \
    git \
    libgomp1

# Install ONNX Runtime with CUDA support
# (onnxruntime-gpu wheels target specific CUDA/cuDNN versions — check the
#  CUDA Execution Provider requirements table on onnxruntime.ai if the
#  provider fails to load)
pip install onnxruntime-gpu

# Install supporting packages
pip install \
    onnx \
    numpy \
    Pillow \
    transformers \
    torch \
    torchvision \
    fastapi \
    uvicorn

# Verify installation
python3 << 'EOF'
import onnxruntime as ort
print(f"ORT Version: {ort.__version__}")
print(f"Available providers: {ort.get_available_providers()}")
# Should include: CUDAExecutionProvider, TensorrtExecutionProvider, CPUExecutionProvider
EOF
```

***

## Step 4 — Export Your Model to ONNX

### PyTorch Model Export

```python
import torch
import torch.nn as nn
import onnx

# Example: Export ResNet50
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
model.eval()

# Create dummy input (batch=1, RGB image 224x224)
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    export_params=True,
    opset_version=17,              # Use latest stable opset
    do_constant_folding=True,      # Optimize constant ops
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},    # Dynamic batch
        "output": {0: "batch_size"}
    }
)
print("Model exported successfully!")

# Verify the exported model
onnx_model = onnx.load("resnet50.onnx")
onnx.checker.check_model(onnx_model)
print("ONNX model is valid!")
```
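After exporting, it is worth confirming that ORT reproduces the PyTorch outputs on the same input — exported graphs rarely match bit-for-bit (different kernels, fused ops), so a small tolerance is expected. The comparison itself reduces to a max-absolute-difference check; `outputs_match` below is a hypothetical helper sketching that, with the actual torch/ORT calls shown as comments:

```python
def outputs_match(reference, candidate, atol=1e-4):
    """Element-wise comparison of two flat sequences of floats.

    A small absolute tolerance is normal for exported models; a large
    discrepancy usually means a wrong opset, input layout, or dtype.
    """
    if len(reference) != len(candidate):
        return False
    max_diff = max(abs(a - b) for a, b in zip(reference, candidate))
    return max_diff <= atol

# In practice, compare the framework output against the ORT output, e.g.:
#   torch_out = model(dummy_input).detach().numpy().ravel()
#   ort_out = session.run(None, {"input": dummy_input.numpy()})[0].ravel()
#   assert outputs_match(torch_out, ort_out)
```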

### HuggingFace Transformers Export

```bash
# Install optimum for HuggingFace ONNX export
pip install optimum[exporters]

# Export BERT for text classification
optimum-cli export onnx \
    --model bert-base-uncased \
    --task text-classification \
    ./bert_onnx/

# Export with optimization
optimum-cli export onnx \
    --model microsoft/phi-2 \
    --task text-generation \
    --optimize O2 \
    ./phi2_onnx/
```

### Export with ORT Optimization

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Load and optimize
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)

optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_for_gpu=True,
    fp16=True
)

optimizer.optimize(
    save_dir="./distilbert_optimized",
    optimization_config=optimization_config
)
```

***

## Step 5 — Run Inference with ONNX Runtime

### Basic GPU Inference

```python
import onnxruntime as ort
import numpy as np
from PIL import Image
import torchvision.transforms as transforms

# Configure session with GPU execution providers
# Providers are tried in order — CUDA first, then CPU fallback
providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4GB limit
        "cudnn_conv_algo_search": "EXHAUSTIVE",
        "do_copy_in_default_stream": True,
    }),
    "CPUExecutionProvider"
]

# Session options for performance
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 8
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL

# Load model
session = ort.InferenceSession(
    "resnet50.onnx",
    sess_options=opts,
    providers=providers
)

print(f"Running on: {session.get_providers()}")

# Prepare input
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("test_image.jpg").convert("RGB")
img_tensor = transform(img).unsqueeze(0).numpy()

# Run inference
# Run inference — the model outputs raw logits, not probabilities
outputs = session.run(None, {"input": img_tensor})
logits = outputs[0][0]
top5_idx = logits.argsort()[-5:][::-1]
print("Top 5 predictions:", top5_idx, logits[top5_idx])
```
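The values the script above ranks are unnormalized logits. If you want scores that sum to one, apply a softmax before ranking — a minimal, dependency-free sketch (`softmax` and `top_k` are illustrative helpers, not ORT API):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k=5):
    """Return (class index, probability) pairs for the k highest scores."""
    return sorted(enumerate(probs), key=lambda p: p[1], reverse=True)[:k]

# Applied to the session output above:
#   probs = softmax(outputs[0][0].tolist())
#   for idx, p in top_k(probs):
#       print(idx, f"{p:.3f}")
```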

### Batch Inference for Throughput

```python
import onnxruntime as ort
import numpy as np
import time

session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CUDAExecutionProvider"]
)

# Warm up GPU
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
for _ in range(10):
    session.run(None, {"input": dummy})

# Benchmark batch sizes
for batch_size in [1, 4, 8, 16, 32, 64]:
    inputs = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)
    
    start = time.time()
    n_iter = 100
    for _ in range(n_iter):
        session.run(None, {"input": inputs})
    elapsed = time.time() - start
    
    throughput = (batch_size * n_iter) / elapsed
    latency = (elapsed / n_iter) * 1000  # ms
    
    print(f"Batch {batch_size:3d}: {throughput:7.1f} img/sec, {latency:.1f}ms/batch")
```
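The benchmark loop prints a throughput/latency pair per batch size; in production you typically want the highest-throughput batch size whose latency still fits your SLO. A small selection helper over measurements in that shape (the `best_batch` name and the sample numbers are illustrative, not measured):

```python
def best_batch(results, latency_budget_ms):
    """Pick the batch size with the highest throughput whose per-batch
    latency stays within the budget. `results` maps batch size to a
    (throughput_img_per_sec, latency_ms) tuple, as measured by the
    benchmark loop. Returns None if no batch size fits the budget."""
    feasible = {b: r for b, r in results.items() if r[1] <= latency_budget_ms}
    if not feasible:
        return None
    return max(feasible, key=lambda b: feasible[b][0])

# Hypothetical measurements in the shape produced by the benchmark loop:
measurements = {
    1: (900.0, 1.1),
    8: (3100.0, 2.6),
    32: (4100.0, 7.8),
    64: (4300.0, 14.9),
}
print(best_batch(measurements, latency_budget_ms=10.0))  # prints 32
```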

***

## Step 6 — TensorRT Execution Provider (Maximum Performance)

For NVIDIA GPUs, TensorRT EP provides even better performance:

```python
import onnxruntime as ort
import numpy as np

# TensorRT execution provider configuration
tensorrt_provider_options = {
    "trt_max_workspace_size": 4 * 1024 * 1024 * 1024,  # 4GB
    "trt_fp16_enable": True,          # Enable FP16 for faster inference
    "trt_int8_enable": False,
    "trt_engine_cache_enable": True,   # Cache compiled engines
    "trt_engine_cache_path": "/tmp/trt_cache",
    "trt_max_partition_iterations": 1000,
    "trt_min_subgraph_size": 1,
    "trt_timing_cache_enable": True,
}

providers = [
    ("TensorrtExecutionProvider", tensorrt_provider_options),
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider"
]

session = ort.InferenceSession("resnet50.onnx", providers=providers)
print("Active provider:", session.get_providers()[0])

# The first inference compiles the TensorRT engine (can take 1-5 minutes);
# subsequent runs reuse the cached engine and are fast
```

{% hint style="warning" %}
**TensorRT engine compilation** happens on the first inference and can take 1–5 minutes. Enable caching (`trt_engine_cache_enable: True`) so the compiled engine is reused across sessions.
{% endhint %}

***

## Step 7 — INT8 Quantization for Maximum Speed

```python
from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType
import onnxruntime as ort
import numpy as np

# Dynamic INT8 quantization (no calibration data needed; mainly benefits
# MatMul-heavy models such as transformers — for CNNs prefer static quantization)
quantize_dynamic(
    model_input="resnet50.onnx",
    model_output="resnet50_int8_dynamic.onnx",
    weight_type=QuantType.QInt8
)

# Static INT8 quantization (requires calibration data)
from onnxruntime.quantization import CalibrationDataReader

class ImageCalibrationReader(CalibrationDataReader):
    def __init__(self, data_dir, input_name="input"):
        self.data_dir = data_dir
        self.input_name = input_name
        self.images = self._load_images()
        self.idx = 0
    
    def _load_images(self):
        # Load 100 calibration images
        import glob, torchvision.transforms as T
        from PIL import Image
        transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
        images = []
        for path in glob.glob(f"{self.data_dir}/*.jpg")[:100]:
            img = Image.open(path).convert("RGB")
            images.append(transform(img).numpy())
        return images
    
    def get_next(self):
        if self.idx >= len(self.images):
            return None
        # Add the batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
        data = {self.input_name: np.expand_dims(self.images[self.idx], axis=0)}
        self.idx += 1
        return data

from onnxruntime.quantization import QuantFormat
quantize_static(
    model_input="resnet50.onnx",
    model_output="resnet50_int8_static.onnx",
    calibration_data_reader=ImageCalibrationReader("/data/calibration_images"),
    quant_format=QuantFormat.QDQ,
    weight_type=QuantType.QInt8
)
```
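Quantization trades precision for speed, so it pays to measure how often the INT8 model still agrees with the FP32 model on a held-out set. The metric itself is a simple top-1 agreement rate (`top1_agreement` is our own sketch, not an onnxruntime.quantization API):

```python
def top1_agreement(fp32_logits, int8_logits):
    """Fraction of samples where the FP32 and INT8 models pick the same
    top-1 class. A large drop suggests the calibration set was too small
    or unrepresentative."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    same = sum(argmax(a) == argmax(b) for a, b in zip(fp32_logits, int8_logits))
    return same / len(fp32_logits)

# e.g. run both sessions over a held-out set and compare:
#   fp32 = [sess_fp32.run(None, {"input": x})[0][0].tolist() for x in samples]
#   int8 = [sess_int8.run(None, {"input": x})[0][0].tolist() for x in samples]
#   print(f"top-1 agreement: {top1_agreement(fp32, int8):.2%}")
```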

***

## Step 8 — Build an Inference API

```bash
cat > /workspace/onnx_api.py << 'EOF'
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
import onnxruntime as ort
import numpy as np
from PIL import Image
import io
import torchvision.transforms as transforms
import json

app = FastAPI(title="ONNX Runtime Inference API")

# Load model at startup
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Load ImageNet class labels
with open("imagenet_classes.json") as f:
    classes = json.load(f)

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@app.get("/health")
async def health():
    return {"status": "ok", "providers": session.get_providers()}

@app.post("/predict")
async def predict(file: UploadFile = File(...), topk: int = 5):
    image_data = await file.read()
    img = Image.open(io.BytesIO(image_data)).convert("RGB")
    tensor = transform(img).unsqueeze(0).numpy()
    
    logits = session.run(None, {"input": tensor})[0][0]
    # Convert logits to probabilities so scores are comparable across images
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top_indices = probs.argsort()[-topk:][::-1]
    
    results = [
        {"label": classes[str(i)], "score": float(probs[i])}
        for i in top_indices
    ]
    return JSONResponse({"predictions": results})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

python3 /workspace/onnx_api.py &

# Test the API
curl -X POST "http://localhost:8080/predict" \
    -H "accept: application/json" \
    -F "file=@test_image.jpg"
```

***

## Step 9 — Monitor GPU Usage

```bash
# Real-time GPU monitoring during inference
watch -n 0.5 nvidia-smi

# Or use nvitop for a better UI
pip install nvitop
nvitop
```

***

## Performance Benchmarks

| Model     | GPU      | Provider      | Throughput (inf/sec) |
| --------- | -------- | ------------- | -------------------- |
| ResNet50  | RTX 4090 | CUDA          | \~4,200              |
| ResNet50  | RTX 4090 | TensorRT FP16 | \~8,500              |
| BERT Base | RTX 4090 | CUDA          | \~380                |
| BERT Base | RTX 4090 | TensorRT FP16 | \~720                |
| YOLOv8n   | RTX 3090 | CUDA          | \~1,800              |
| YOLOv8x   | A100     | TensorRT FP16 | \~920                |

***

## Troubleshooting

### CUDA Provider Not Available

```bash
# Check CUDA ORT is installed (not CPU-only version)
pip uninstall onnxruntime
pip install onnxruntime-gpu

python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

### TensorRT Compilation Errors

```bash
# Check TensorRT version compatibility
python3 -c "import tensorrt; print(tensorrt.__version__)"

# If TensorRT EP keeps failing, skip it and use the CUDA EP in your Python code:
#   providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
```

### Shape Mismatch Errors

```python
# Check model input/output shapes
for inp in session.get_inputs():
    print(f"Input: {inp.name}, shape: {inp.shape}, type: {inp.type}")

for out in session.get_outputs():
    print(f"Output: {out.name}, shape: {out.shape}, type: {out.type}")
```

***

## Advanced: Multi-Model Pipeline

```python
import onnxruntime as ort
import numpy as np

class MultiModelPipeline:
    def __init__(self):
        providers = ["CUDAExecutionProvider"]
        self.detector = ort.InferenceSession("detector.onnx", providers=providers)
        self.classifier = ort.InferenceSession("classifier.onnx", providers=providers)
    
    def run(self, image: np.ndarray) -> list:
        # Stage 1: Object detection
        boxes = self.detector.run(None, {"image": image})[0]
        
        results = []
        for box in boxes:
            # Crop detected region
            crop = self._crop(image, box)
            
            # Stage 2: Classify each region
            label = self.classifier.run(None, {"input": crop})[0]
            results.append({"box": box.tolist(), "label": int(label.argmax())})
        
        return results
    
    def _crop(self, image, box):
        x1, y1, x2, y2 = box.astype(int)
        return image[:, :, y1:y2, x1:x2]

pipeline = MultiModelPipeline()
```
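One caveat with the `_crop` step above: detectors can emit boxes that extend past the frame, and slicing NumPy arrays with raw coordinates silently yields empty or reversed crops. A defensive clamp (the `clamp_box` helper is our own addition) guards against that:

```python
def clamp_box(box, width, height):
    """Clamp (x1, y1, x2, y2) to image bounds and guarantee a non-empty crop.
    Out-of-range or degenerate boxes would otherwise produce empty slices
    that crash the classifier stage."""
    x1, y1, x2, y2 = box
    x1 = max(0, min(int(x1), width - 1))
    y1 = max(0, min(int(y1), height - 1))
    x2 = max(x1 + 1, min(int(x2), width))
    y2 = max(y1 + 1, min(int(y2), height))
    return x1, y1, x2, y2

# In _crop above (NCHW layout), this would become:
#   x1, y1, x2, y2 = clamp_box(box, image.shape[3], image.shape[2])
#   return image[:, :, y1:y2, x1:x2]
```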

***

## Additional Resources

* [ONNX Runtime GitHub](https://github.com/microsoft/onnxruntime)
* [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)
* [Hugging Face Optimum](https://huggingface.co/docs/optimum/)
* [ONNX Model Zoo](https://github.com/onnx/models) — Pre-exported models
* [Netron](https://netron.app/) — ONNX model visualizer
* [ONNX Runtime Python API](https://onnxruntime.ai/docs/api/python/)

***

*ONNX Runtime on Clore.ai is the ideal choice for production inference services that need to serve models from different frameworks with maximum GPU efficiency.*

***

## Clore.ai GPU Recommendations

| Use Case               | Recommended GPU | Est. Cost on Clore.ai |
| ---------------------- | --------------- | --------------------- |
| Development/Testing    | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Inference   | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Scale Deployment | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
