# ONNX Runtime GPU

> **Cross-platform, hardware-accelerated ML inference — deploy any model from any framework**

ONNX Runtime (ORT) is Microsoft's open-source inference engine for ONNX (Open Neural Network Exchange) models. It provides hardware-accelerated inference across CPUs, GPUs, and specialized accelerators through a unified API. Whether your model was trained in PyTorch, TensorFlow, Scikit-learn, or XGBoost — if you can export it to ONNX format, ORT can run it, typically faster than the source framework's own inference path.

**GitHub:** [microsoft/onnxruntime](https://github.com/microsoft/onnxruntime) — 14K+ ⭐

***

## Why ONNX Runtime?

| Feature                | ONNX Runtime      | TorchScript    | TensorFlow Serving |
| ---------------------- | ----------------- | -------------- | ------------------ |
| Framework-agnostic     | ✅                 | ❌ PyTorch only | ❌ TF only          |
| GPU acceleration       | ✅ CUDA/TensorRT   | ✅              | ✅                  |
| INT8/FP16 quantization | ✅                 | Partial        | Partial            |
| Mobile/Edge deploy     | ✅                 | Limited        | Limited            |
| Operator fusion        | ✅                 | Partial        | ✅                  |
| Easy integration       | ✅ Python/C++/Java | Python         | Python/gRPC        |

{% hint style="success" %}
**Key benefit:** ONNX Runtime with CUDA execution provider typically delivers **1.5–3x speedup** over native PyTorch inference for computer vision and NLP models.
{% endhint %}

***

## Supported Execution Providers

ONNX Runtime supports multiple hardware backends (Execution Providers):

| Provider                    | Hardware      | Use Case              |
| --------------------------- | ------------- | --------------------- |
| `CUDAExecutionProvider`     | NVIDIA GPUs   | General GPU inference |
| `TensorrtExecutionProvider` | NVIDIA GPUs   | Maximum throughput    |
| `CPUExecutionProvider`      | CPU           | Fallback / edge       |
| `ROCMExecutionProvider`     | AMD GPUs      | AMD hardware          |
| `CoreMLExecutionProvider`   | Apple Silicon | macOS/iOS             |
| `OpenVINOExecutionProvider` | Intel         | Intel CPUs/GPUs       |
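`InferenceSession` takes an ordered provider list and falls back down the list: the first available provider that can handle a node runs it. A small helper makes that priority explicit (`select_providers` and the preference ranking are our own illustration, not part of the ORT API):

```python
# Build an ordered provider list from a preference ranking.
# Providers missing from the current onnxruntime build are skipped,
# so the session falls back to the next best option.
PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "CPUExecutionProvider",
]

def select_providers(available, preference=PREFERENCE):
    """Return preferred providers that are actually available, best first."""
    chosen = [p for p in preference if p in available]
    # Always keep the CPU provider as a last-resort fallback
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# Usage (with onnxruntime installed):
#   import onnxruntime as ort
#   providers = select_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```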

***

## Prerequisites

* Clore.ai account with a GPU rental
* Basic Python knowledge
* A trained model (PyTorch, TensorFlow, or pre-exported ONNX)

***

## Step 1 — Rent a GPU on Clore.ai

1. Go to [clore.ai](https://clore.ai) → **Marketplace**
2. Any NVIDIA GPU works — from RTX 3070 for small models to A100 for large transformers
3. **For transformer models:** RTX 4090 or A100 recommended
4. **For computer vision:** RTX 3090 or RTX 4090 is sufficient

***

## Step 2 — Deploy Your Container

ONNX Runtime doesn't have an official pre-built container, but the NVIDIA CUDA base is ideal:

**Docker Image:**

```
nvcr.io/nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04
```

**Ports:**

```
22
```

**Environment Variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
```

{% hint style="info" %}
Alternatively, use `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime` which includes CUDA and a Python environment ready for ORT installation.
{% endhint %}

***

## Step 3 — Install ONNX Runtime with GPU Support

```bash
ssh root@<server-ip> -p <ssh-port>

# Update packages
apt-get update && apt-get install -y \
    python3-pip \
    python3-dev \
    wget \
    git \
    libgomp1

# Install ONNX Runtime with CUDA support
# (onnxruntime-gpu wheels target specific CUDA/cuDNN versions — check the
#  CUDA Execution Provider requirements table on onnxruntime.ai if the
#  provider fails to load)
pip install onnxruntime-gpu

# Install supporting packages
pip install \
    onnx \
    numpy \
    Pillow \
    transformers \
    torch \
    torchvision \
    fastapi \
    uvicorn

# Verify installation
python3 << 'EOF'
import onnxruntime as ort
print(f"ORT Version: {ort.__version__}")
print(f"Available providers: {ort.get_available_providers()}")
# Should include: CUDAExecutionProvider, TensorrtExecutionProvider, CPUExecutionProvider
EOF
```

***

## Step 4 — Export Your Model to ONNX

### PyTorch Model Export

```python
import torch
import torch.nn as nn
import onnx

# Example: Export ResNet50
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
model.eval()

# Create dummy input (batch=1, RGB image 224x224)
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    export_params=True,
    opset_version=17,              # Use latest stable opset
    do_constant_folding=True,      # Optimize constant ops
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},    # Dynamic batch
        "output": {0: "batch_size"}
    }
)
print("Model exported successfully!")

# Verify the exported model
onnx_model = onnx.load("resnet50.onnx")
onnx.checker.check_model(onnx_model)
print("ONNX model is valid!")
```
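After exporting, it is worth confirming that ORT reproduces the PyTorch outputs on the same input — exported graphs rarely match bit-for-bit (different kernels, fused ops), so a small tolerance is expected. The comparison itself reduces to a max-absolute-difference check; `outputs_match` below is a hypothetical helper sketching that, with the actual torch/ORT calls shown as comments:

```python
def outputs_match(reference, candidate, atol=1e-4):
    """Element-wise comparison of two flat sequences of floats.

    A small absolute tolerance is normal for exported models; a large
    discrepancy usually means a wrong opset, input layout, or dtype.
    """
    if len(reference) != len(candidate):
        return False
    max_diff = max(abs(a - b) for a, b in zip(reference, candidate))
    return max_diff <= atol

# In practice, compare the framework output against the ORT output, e.g.:
#   torch_out = model(dummy_input).detach().numpy().ravel()
#   ort_out = session.run(None, {"input": dummy_input.numpy()})[0].ravel()
#   assert outputs_match(torch_out, ort_out)
```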

### HuggingFace Transformers Export

```bash
# Install optimum for HuggingFace ONNX export
pip install optimum[exporters]

# Export BERT for text classification
optimum-cli export onnx \
    --model bert-base-uncased \
    --task text-classification \
    ./bert_onnx/

# Export with optimization
optimum-cli export onnx \
    --model microsoft/phi-2 \
    --task text-generation \
    --optimize O2 \
    ./phi2_onnx/
```

### Export with ORT Optimization

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Load and optimize
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)

optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_for_gpu=True,
    fp16=True
)

optimizer.optimize(
    save_dir="./distilbert_optimized",
    optimization_config=optimization_config
)
```

***

## Step 5 — Run Inference with ONNX Runtime

### Basic GPU Inference

```python
import onnxruntime as ort
import numpy as np
from PIL import Image
import torchvision.transforms as transforms

# Configure session with GPU execution providers
# Providers are tried in order — CUDA first, then CPU fallback
providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4GB limit
        "cudnn_conv_algo_search": "EXHAUSTIVE",
        "do_copy_in_default_stream": True,
    }),
    "CPUExecutionProvider"
]

# Session options for performance
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 8
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL

# Load model
session = ort.InferenceSession(
    "resnet50.onnx",
    sess_options=opts,
    providers=providers
)

print(f"Running on: {session.get_providers()}")

# Prepare input
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("test_image.jpg").convert("RGB")
img_tensor = transform(img).unsqueeze(0).numpy()

# Run inference
# Run inference — the model outputs raw logits, not probabilities
outputs = session.run(None, {"input": img_tensor})
logits = outputs[0][0]
top5_idx = logits.argsort()[-5:][::-1]
print("Top 5 predictions:", top5_idx, logits[top5_idx])
```
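The values the script above ranks are unnormalized logits. If you want scores that sum to one, apply a softmax before ranking — a minimal, dependency-free sketch (`softmax` and `top_k` are illustrative helpers, not ORT API):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k=5):
    """Return (class index, probability) pairs for the k highest scores."""
    return sorted(enumerate(probs), key=lambda p: p[1], reverse=True)[:k]

# Applied to the session output above:
#   probs = softmax(outputs[0][0].tolist())
#   for idx, p in top_k(probs):
#       print(idx, f"{p:.3f}")
```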

### Batch Inference for Throughput

```python
import onnxruntime as ort
import numpy as np
import time

session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CUDAExecutionProvider"]
)

# Warm up GPU
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
for _ in range(10):
    session.run(None, {"input": dummy})

# Benchmark batch sizes
for batch_size in [1, 4, 8, 16, 32, 64]:
    inputs = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)
    
    start = time.time()
    n_iter = 100
    for _ in range(n_iter):
        session.run(None, {"input": inputs})
    elapsed = time.time() - start
    
    throughput = (batch_size * n_iter) / elapsed
    latency = (elapsed / n_iter) * 1000  # ms
    
    print(f"Batch {batch_size:3d}: {throughput:7.1f} img/sec, {latency:.1f}ms/batch")
```
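The benchmark loop prints a throughput/latency pair per batch size; in production you typically want the highest-throughput batch size whose latency still fits your SLO. A small selection helper over measurements in that shape (the `best_batch` name and the sample numbers are illustrative, not measured):

```python
def best_batch(results, latency_budget_ms):
    """Pick the batch size with the highest throughput whose per-batch
    latency stays within the budget. `results` maps batch size to a
    (throughput_img_per_sec, latency_ms) tuple, as measured by the
    benchmark loop. Returns None if no batch size fits the budget."""
    feasible = {b: r for b, r in results.items() if r[1] <= latency_budget_ms}
    if not feasible:
        return None
    return max(feasible, key=lambda b: feasible[b][0])

# Hypothetical measurements in the shape produced by the benchmark loop:
measurements = {
    1: (900.0, 1.1),
    8: (3100.0, 2.6),
    32: (4100.0, 7.8),
    64: (4300.0, 14.9),
}
print(best_batch(measurements, latency_budget_ms=10.0))  # prints 32
```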

***

## Step 6 — TensorRT Execution Provider (Maximum Performance)

For NVIDIA GPUs, TensorRT EP provides even better performance:

```python
import onnxruntime as ort
import numpy as np

# TensorRT execution provider configuration
tensorrt_provider_options = {
    "trt_max_workspace_size": 4 * 1024 * 1024 * 1024,  # 4GB
    "trt_fp16_enable": True,          # Enable FP16 for faster inference
    "trt_int8_enable": False,
    "trt_engine_cache_enable": True,   # Cache compiled engines
    "trt_engine_cache_path": "/tmp/trt_cache",
    "trt_max_partition_iterations": 1000,
    "trt_min_subgraph_size": 1,
    "trt_timing_cache_enable": True,
}

providers = [
    ("TensorrtExecutionProvider", tensorrt_provider_options),
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider"
]

session = ort.InferenceSession("resnet50.onnx", providers=providers)
print("Active provider:", session.get_providers()[0])

# The first inference compiles the TensorRT engine (can take 1-5 minutes);
# subsequent runs reuse the cached engine and are fast
```

{% hint style="warning" %}
**TensorRT engine compilation** happens on the first inference and can take 1–5 minutes. Enable caching (`trt_engine_cache_enable: True`) so the compiled engine is reused across sessions.
{% endhint %}

***

## Step 7 — INT8 Quantization for Maximum Speed

```python
from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType
import onnxruntime as ort
import numpy as np

# Dynamic INT8 quantization (no calibration data needed; mainly benefits
# MatMul-heavy models such as transformers — for CNNs prefer static quantization)
quantize_dynamic(
    model_input="resnet50.onnx",
    model_output="resnet50_int8_dynamic.onnx",
    weight_type=QuantType.QInt8
)

# Static INT8 quantization (requires calibration data)
from onnxruntime.quantization import CalibrationDataReader

class ImageCalibrationReader(CalibrationDataReader):
    def __init__(self, data_dir, input_name="input"):
        self.data_dir = data_dir
        self.input_name = input_name
        self.images = self._load_images()
        self.idx = 0
    
    def _load_images(self):
        # Load 100 calibration images
        import glob, torchvision.transforms as T
        from PIL import Image
        transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
        images = []
        for path in glob.glob(f"{self.data_dir}/*.jpg")[:100]:
            img = Image.open(path).convert("RGB")
            images.append(transform(img).numpy())
        return images
    
    def get_next(self):
        if self.idx >= len(self.images):
            return None
        # Add the batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
        data = {self.input_name: np.expand_dims(self.images[self.idx], axis=0)}
        self.idx += 1
        return data

from onnxruntime.quantization import QuantFormat
quantize_static(
    model_input="resnet50.onnx",
    model_output="resnet50_int8_static.onnx",
    calibration_data_reader=ImageCalibrationReader("/data/calibration_images"),
    quant_format=QuantFormat.QDQ,
    weight_type=QuantType.QInt8
)
```
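Quantization trades precision for speed, so it pays to measure how often the INT8 model still agrees with the FP32 model on a held-out set. The metric itself is a simple top-1 agreement rate (`top1_agreement` is our own sketch, not an onnxruntime.quantization API):

```python
def top1_agreement(fp32_logits, int8_logits):
    """Fraction of samples where the FP32 and INT8 models pick the same
    top-1 class. A large drop suggests the calibration set was too small
    or unrepresentative."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    same = sum(argmax(a) == argmax(b) for a, b in zip(fp32_logits, int8_logits))
    return same / len(fp32_logits)

# e.g. run both sessions over a held-out set and compare:
#   fp32 = [sess_fp32.run(None, {"input": x})[0][0].tolist() for x in samples]
#   int8 = [sess_int8.run(None, {"input": x})[0][0].tolist() for x in samples]
#   print(f"top-1 agreement: {top1_agreement(fp32, int8):.2%}")
```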

***

## Step 8 — Build an Inference API

```bash
cat > /workspace/onnx_api.py << 'EOF'
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
import onnxruntime as ort
import numpy as np
from PIL import Image
import io
import torchvision.transforms as transforms
import json

app = FastAPI(title="ONNX Runtime Inference API")

# Load model at startup
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Load ImageNet class labels
with open("imagenet_classes.json") as f:
    classes = json.load(f)

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@app.get("/health")
async def health():
    return {"status": "ok", "providers": session.get_providers()}

@app.post("/predict")
async def predict(file: UploadFile = File(...), topk: int = 5):
    image_data = await file.read()
    img = Image.open(io.BytesIO(image_data)).convert("RGB")
    tensor = transform(img).unsqueeze(0).numpy()
    
    logits = session.run(None, {"input": tensor})[0][0]
    # Convert logits to probabilities so scores are comparable across images
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top_indices = probs.argsort()[-topk:][::-1]
    
    results = [
        {"label": classes[str(i)], "score": float(probs[i])}
        for i in top_indices
    ]
    return JSONResponse({"predictions": results})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

python3 /workspace/onnx_api.py &

# Test the API
curl -X POST "http://localhost:8080/predict" \
    -H "accept: application/json" \
    -F "file=@test_image.jpg"
```

***

## Step 9 — Monitor GPU Usage

```bash
# Real-time GPU monitoring during inference
watch -n 0.5 nvidia-smi

# Or use nvitop for a better UI
pip install nvitop
nvitop
```

***

## Performance Benchmarks

| Model     | GPU      | Provider      | Throughput (inf/sec) |
| --------- | -------- | ------------- | -------------------- |
| ResNet50  | RTX 4090 | CUDA          | \~4,200              |
| ResNet50  | RTX 4090 | TensorRT FP16 | \~8,500              |
| BERT Base | RTX 4090 | CUDA          | \~380                |
| BERT Base | RTX 4090 | TensorRT FP16 | \~720                |
| YOLOv8n   | RTX 3090 | CUDA          | \~1,800              |
| YOLOv8x   | A100     | TensorRT FP16 | \~920                |

***

## Troubleshooting

### CUDA Provider Not Available

```bash
# Check CUDA ORT is installed (not CPU-only version)
pip uninstall onnxruntime
pip install onnxruntime-gpu

python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

### TensorRT Compilation Errors

```bash
# Check TensorRT version compatibility
python3 -c "import tensorrt; print(tensorrt.__version__)"

# If TensorRT EP keeps failing, skip it and use the CUDA EP in your Python code:
#   providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
```

### Shape Mismatch Errors

```python
# Check model input/output shapes
for inp in session.get_inputs():
    print(f"Input: {inp.name}, shape: {inp.shape}, type: {inp.type}")

for out in session.get_outputs():
    print(f"Output: {out.name}, shape: {out.shape}, type: {out.type}")
```

***

## Advanced: Multi-Model Pipeline

```python
import onnxruntime as ort
import numpy as np

class MultiModelPipeline:
    def __init__(self):
        providers = ["CUDAExecutionProvider"]
        self.detector = ort.InferenceSession("detector.onnx", providers=providers)
        self.classifier = ort.InferenceSession("classifier.onnx", providers=providers)
    
    def run(self, image: np.ndarray) -> list:
        # Stage 1: Object detection
        boxes = self.detector.run(None, {"image": image})[0]
        
        results = []
        for box in boxes:
            # Crop detected region
            crop = self._crop(image, box)
            
            # Stage 2: Classify each region
            label = self.classifier.run(None, {"input": crop})[0]
            results.append({"box": box.tolist(), "label": int(label.argmax())})
        
        return results
    
    def _crop(self, image, box):
        x1, y1, x2, y2 = box.astype(int)
        return image[:, :, y1:y2, x1:x2]

pipeline = MultiModelPipeline()
```
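One caveat with the `_crop` step above: detectors can emit boxes that extend past the frame, and slicing NumPy arrays with raw coordinates silently yields empty or reversed crops. A defensive clamp (the `clamp_box` helper is our own addition) guards against that:

```python
def clamp_box(box, width, height):
    """Clamp (x1, y1, x2, y2) to image bounds and guarantee a non-empty crop.
    Out-of-range or degenerate boxes would otherwise produce empty slices
    that crash the classifier stage."""
    x1, y1, x2, y2 = box
    x1 = max(0, min(int(x1), width - 1))
    y1 = max(0, min(int(y1), height - 1))
    x2 = max(x1 + 1, min(int(x2), width))
    y2 = max(y1 + 1, min(int(y2), height))
    return x1, y1, x2, y2

# In _crop above (NCHW layout), this would become:
#   x1, y1, x2, y2 = clamp_box(box, image.shape[3], image.shape[2])
#   return image[:, :, y1:y2, x1:x2]
```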

***

## Additional Resources

* [ONNX Runtime GitHub](https://github.com/microsoft/onnxruntime)
* [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)
* [Hugging Face Optimum](https://huggingface.co/docs/optimum/)
* [ONNX Model Zoo](https://github.com/onnx/models) — Pre-exported models
* [Netron](https://netron.app/) — ONNX model visualizer
* [ONNX Runtime Python API](https://onnxruntime.ai/docs/api/python/)

***

*ONNX Runtime on Clore.ai is the ideal choice for production inference services that need to serve models from different frameworks with maximum GPU efficiency.*

***

## Clore.ai GPU Recommendations

| Use Case               | Recommended GPU | Est. Cost on Clore.ai |
| ---------------------- | --------------- | --------------------- |
| Development/Testing    | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Inference   | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Scale Deployment | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
