ONNX Runtime GPU

Cross-platform, hardware-accelerated ML inference — deploy any model from any framework

ONNX Runtime (ORT) is Microsoft's open-source inference engine for ONNX (Open Neural Network Exchange) models. It provides hardware-accelerated inference across CPUs, GPUs, and specialized accelerators through a unified API. Whether your model was trained in PyTorch, TensorFlow, Scikit-learn, or XGBoost — if you can export it to ONNX format, ORT can run it faster.

GitHub: microsoft/onnxruntime — 14K+ ⭐


Why ONNX Runtime?

| Feature | ONNX Runtime | TorchScript | TensorFlow Serving |
| --- | --- | --- | --- |
| Framework-agnostic | ✅ | ❌ PyTorch only | ❌ TF only |
| GPU acceleration | ✅ CUDA/TensorRT | ✅ | ✅ |
| INT8/FP16 quantization | ✅ | Partial | Partial |
| Mobile/Edge deploy | ✅ | Limited | Limited |
| Operator fusion | ✅ | Partial | Partial |
| Easy integration | ✅ Python/C++/Java | Python | Python/gRPC |


Supported Execution Providers

ONNX Runtime supports multiple hardware backends (Execution Providers):

| Provider | Hardware | Use Case |
| --- | --- | --- |
| CUDAExecutionProvider | NVIDIA GPUs | General GPU inference |
| TensorrtExecutionProvider | NVIDIA GPUs | Maximum throughput |
| CPUExecutionProvider | CPU | Fallback / edge |
| ROCMExecutionProvider | AMD GPUs | AMD hardware |
| CoreMLExecutionProvider | Apple Silicon | macOS/iOS |
| OpenVINOExecutionProvider | Intel | Intel CPUs/GPUs |

Prerequisites

  • Clore.ai account with a GPU rental

  • Basic Python knowledge

  • A trained model (PyTorch, TensorFlow, or pre-exported ONNX)


Step 1 — Rent a GPU on Clore.ai

  1. Go to clore.ai → Marketplace

  2. Any NVIDIA GPU works — from RTX 3070 for small models to A100 for large transformers

  3. For transformer models: RTX 4090 or A100 recommended

  4. For computer vision: RTX 3090 or RTX 4090 is sufficient


Step 2 — Deploy Your Container

ONNX Runtime doesn't ship an official pre-built GPU container, but an NVIDIA CUDA base image works well:

Docker Image:

Ports:

Environment Variables:
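
As a concrete sketch, a rental configured like the following works well; the image tag, port, and variable below are illustrative choices, not requirements:

```
Docker Image:          nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
Ports:                 22 (SSH), 8000 (HTTP, for a later inference API)
Environment Variables: NVIDIA_VISIBLE_DEVICES=all
```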

Note: alternatively, use pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime, which includes CUDA and a Python environment ready for ORT installation.


Step 3 — Install ONNX Runtime with GPU Support
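
The GPU build ships as a separate PyPI package; a one-line check then confirms which providers the installed wheel exposes:

```shell
# GPU-enabled build (bundles the CUDA and TensorRT execution providers)
pip install onnxruntime-gpu

# Sanity check: list the providers this installation can use
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

On a correctly configured GPU host, the printed list includes CUDAExecutionProvider.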


Step 4 — Export Your Model to ONNX

PyTorch Model Export

HuggingFace Transformers Export
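
One common route is the Hugging Face Optimum CLI exporter; the model id and task below are placeholders for your own checkpoint:

```shell
pip install "optimum[onnxruntime-gpu]"

# Export a Transformers checkpoint to ONNX in one step
optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english \
    --task text-classification distilbert_onnx/
```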

Export with ORT Optimization


Step 5 — Run Inference with ONNX Runtime

Basic GPU Inference

Batch Inference for Throughput
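
Feeding fixed-size chunks keeps GPU memory bounded while amortizing per-call overhead; a sketch (the batch size is a tuning knob, not a prescribed value):

```python
import numpy as np

def iter_batches(x: np.ndarray, batch_size: int):
    """Yield contiguous row-slices of at most batch_size items."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size]

def batched_predict(session, x: np.ndarray, batch_size: int = 64) -> np.ndarray:
    """Run the session chunk by chunk and stitch the outputs back together."""
    input_name = session.get_inputs()[0].name
    outputs = [
        session.run(None, {input_name: chunk})[0]
        for chunk in iter_batches(x.astype(np.float32), batch_size)
    ]
    return np.concatenate(outputs, axis=0)
```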


Step 6 — TensorRT Execution Provider (Maximum Performance)

For NVIDIA GPUs, the TensorRT Execution Provider delivers even higher throughput than the default CUDA provider:


Step 7 — INT8 Quantization for Maximum Speed


Step 8 — Build an Inference API


Step 9 — Monitor GPU Usage
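
While the service runs, `nvidia-smi` gives a quick read on whether the GPU is actually saturated:

```shell
# Snapshot GPU utilization and memory; add "-l 1" to refresh every second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
```

Low utilization under load usually means the batch size is too small or input preprocessing on the CPU is the bottleneck.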


Performance Benchmarks

| Model | GPU | Provider | Throughput (inf/sec) |
| --- | --- | --- | --- |
| ResNet50 | RTX 4090 | CUDA | ~4,200 |
| ResNet50 | RTX 4090 | TensorRT FP16 | ~8,500 |
| BERT Base | RTX 4090 | CUDA | ~380 |
| BERT Base | RTX 4090 | TensorRT FP16 | ~720 |
| YOLOv8n | RTX 3090 | CUDA | ~1,800 |
| YOLOv8x | A100 | TensorRT FP16 | ~920 |


Troubleshooting

CUDA Provider Not Available

TensorRT Compilation Errors

Shape Mismatch Errors


Advanced: Multi-Model Pipeline


Additional Resources


ONNX Runtime on Clore.ai is well suited to production inference services that need to serve models from different frameworks with maximum GPU efficiency.


Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production Inference | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Scale Deployment | A100 80GB | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
