Triton Inference Server

NVIDIA Triton Inference Server is a production-grade, open-source inference serving platform that supports virtually every major ML framework. Designed for high-throughput, low-latency serving, Triton handles PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, and more — all from a single server process. Deploy it on Clore.ai's GPU cloud for scalable, cost-efficient inference infrastructure.


What is Triton Inference Server?

Triton is NVIDIA's answer to the challenge of serving ML models at scale:

  • Multi-framework: PyTorch, TensorFlow, TensorRT, ONNX, OpenVINO, Python custom backends

  • Concurrent execution: Multiple models, multiple instances per GPU

  • Dynamic batching: Automatically batch requests for higher throughput

  • gRPC + HTTP: Industry-standard protocols out of the box

  • Metrics: Prometheus-compatible metrics endpoint

  • Model repository: File-system based model management

Ports used:

| Port | Protocol | Purpose |
| --- | --- | --- |
| 8000 | HTTP | REST inference API |
| 8001 | gRPC | gRPC inference API |
| 8002 | HTTP | Prometheus metrics |


Prerequisites

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 8 GB | 16–24 GB |
| GPU | Any NVIDIA with CUDA 11+ | RTX 4090 / A100 |
| RAM | 16 GB | 32 GB |
| Storage | 20 GB | 50 GB |

ℹ️ Triton also supports CPU-only inference for non-CUDA workloads. Use the cpu-only variant of the Docker image for cost savings on batch jobs that don't require GPU.


Step 1 — Rent a GPU on Clore.ai

  1. Click Marketplace and filter by VRAM ≥ 16 GB.

  2. Select a server and click Configure.

  3. Set Docker image: nvcr.io/nvidia/tritonserver:24.01-py3

  4. Set open ports: 22 (SSH), 8000 (HTTP), 8001 (gRPC), 8002 (metrics).

  5. Click Rent.


Step 2 — Custom Dockerfile (with SSH)

The official Triton image doesn't include an SSH server. Use this Dockerfile:
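A minimal sketch, assuming the 24.01-py3 tag from Step 1 and key-based SSH authentication:

```docker
# Base: the official Triton image used in Step 1
FROM nvcr.io/nvidia/tritonserver:24.01-py3

# Install and prepare an SSH server
RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-server && \
    mkdir -p /var/run/sshd && \
    rm -rf /var/lib/apt/lists/*

# Root login via keys only (no password auth)
RUN sed -i 's/#\?PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config

EXPOSE 22 8000 8001 8002

# Start sshd, then Triton against the mounted model repository
CMD service ssh start && tritonserver --model-repository=/models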


Step 3 — Understand the Model Repository

Triton loads models from a model repository — a directory with a specific structure:
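For example, a repository containing one TorchScript and one ONNX model (the model names are illustrative):

```
model_repository/
├── resnet50_torch/
│   ├── config.pbtxt
│   └── 1/
│       └── model.pt
└── resnet50_onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```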

Each model needs:

  1. A directory with the model name

  2. A config.pbtxt configuration file

  3. At least one version subdirectory (e.g., 1/) with the model file


Step 4 — Deploy a PyTorch Model

Export Model to TorchScript

Set Up Model Repository
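The layout on disk, created by hand if your export script does not do it (model name is illustrative):

```shell
# Create the version directory for the TorchScript model
mkdir -p model_repository/resnet50_torch/1

# The exported file goes here, named exactly model.pt:
#   model_repository/resnet50_torch/1/model.pt
ls -R model_repository
```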

Create config.pbtxt
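A config matching the TorchScript export above. The `INPUT__0`/`OUTPUT__0` names follow the PyTorch backend's naming convention; the shapes assume a 224×224 ImageNet classifier, and `dims` exclude the batch dimension because `max_batch_size` is set:

```protobuf
name: "resnet50_torch"
backend: "pytorch"
max_batch_size: 8

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

instance_group [
  { kind: KIND_GPU, count: 1 }
]
```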


Step 5 — Deploy an ONNX Model

Export to ONNX

ONNX Config
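The matching config for the ONNX Runtime backend (tensor names must match those used at export):

```protobuf
name: "resnet50_onnx"
backend: "onnxruntime"
max_batch_size: 8

input [
  { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "output", data_type: TYPE_FP32, dims: [ 1000 ] }
]
```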


Step 6 — Deploy a Python Custom Backend

For models that don't fit standard backends (custom preprocessing, ensemble logic):
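The Python backend looks for a model.py at model_repository/&lt;model_name&gt;/1/model.py, with a config.pbtxt that sets backend: "python" and declares the INPUT/OUTPUT tensors. A minimal skeleton (pb_utils is Triton's in-server API, so this file only runs inside Triton; the doubling logic is a placeholder):

```python
import numpy as np
import triton_python_backend_utils as pb_utils  # available only inside Triton


class TritonPythonModel:
    """Entry-point class that Triton's Python backend looks for."""

    def initialize(self, args):
        # Called once at model load; args holds model_name, model_config, etc.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Called per batch of requests; must return one response per request
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            data = in_tensor.as_numpy()

            # Placeholder compute: replace with preprocessing / ensemble logic
            result = (data * 2.0).astype(np.float32)

            out_tensor = pb_utils.Tensor("OUTPUT", result)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        # Called once at model unload
        pass
```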


Step 7 — Start Triton and Test

Start Triton Server
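On Clore.ai the image and ports are already configured at rent time, so inside the container only the `tritonserver` invocation is needed. Running the stock image directly (e.g., locally) looks like this, with paths and tag following the earlier steps:

```shell
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models
```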

Check Available Models
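Once the server is up, the v2 HTTP endpoints report health and loaded models (replace the model name with your own):

```shell
# Server readiness
curl -s http://localhost:8000/v2/health/ready

# List everything in the model repository and its load state
curl -s -X POST http://localhost:8000/v2/repository/index

# Metadata for a single model
curl -s http://localhost:8000/v2/models/resnet50_onnx
```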

Run Inference via HTTP
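A sketch using the official tritonclient package (`pip install tritonclient[http]`). The model and tensor names (resnet50_onnx, input, output) are illustrative; match them to your config.pbtxt:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Batch of one image, matching the dims declared in config.pbtxt
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

inputs = [httpclient.InferInput("input", data.shape, "FP32")]
inputs[0].set_data_from_numpy(data)

result = client.infer(model_name="resnet50_onnx", inputs=inputs)
logits = result.as_numpy("output")
print(logits.shape)
```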

Run Inference via gRPC
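The gRPC client mirrors the HTTP one; only the module and port change (`pip install tritonclient[grpc]`, names again illustrative):

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)

inputs = [grpcclient.InferInput("input", data.shape, "FP32")]
inputs[0].set_data_from_numpy(data)

result = client.infer(model_name="resnet50_onnx", inputs=inputs)
print(result.as_numpy("output").shape)
```

gRPC typically gives lower per-request overhead than HTTP, which matters at high request rates.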


Monitoring with Prometheus

Triton exposes metrics at port 8002:
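For example:

```shell
# Scrape the Prometheus endpoint and peek at the inference counters
curl -s http://localhost:8002/metrics | grep ^nv_inference | head
```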

Key metrics:

  • nv_inference_request_success / nv_inference_request_failure: request counts per model

  • nv_inference_count: total inferences executed (counts each item in a batch)

  • nv_inference_exec_count: model executions (a batched request counts once)

  • nv_inference_queue_duration_us: cumulative time requests spend queued

  • nv_gpu_utilization, nv_gpu_memory_used_bytes: GPU load and memory


Dynamic Batching Configuration
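Dynamic batching is enabled per model in config.pbtxt and requires max_batch_size > 0. Triton waits up to max_queue_delay_microseconds to assemble a preferred batch before executing, trading a small amount of latency for throughput. A typical starting point (values should be tuned per model):

```protobuf
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```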


Troubleshooting

Model Load Failure

Solution: Check directory structure and permissions:
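Assuming the repository is mounted at /models:

```shell
# Expected layout: /models/<model_name>/<version>/<model_file>
find /models -maxdepth 3 -exec ls -ld {} \;

# Restart with verbose logging to see the exact load error
tritonserver --model-repository=/models --log-verbose=1
```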

CUDA Incompatibility

Solution: Match Triton image version to your CUDA driver:
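Each tritonserver yy.mm tag is built against a specific CUDA release (see NVIDIA's framework support matrix), so an older host driver may need an older image tag:

```shell
# Driver and CUDA version supported by the host (top of the output)
nvidia-smi

# If the driver is too old for 24.01, fall back to an earlier tag, e.g.:
docker pull nvcr.io/nvidia/tritonserver:23.10-py3
```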

Port Not Reachable

Solution: Verify all three ports (8000, 8001, 8002) are forwarded in Clore.ai. Test each:
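Substituting the address and forwarded ports shown on your Clore.ai dashboard:

```shell
# HTTP API (port 8000)
curl -s http://<SERVER_IP>:8000/v2/health/ready

# Metrics (port 8002)
curl -s http://<SERVER_IP>:8002/metrics | head -n 3

# gRPC (port 8001), using grpcurl if available
grpcurl -plaintext <SERVER_IP>:8001 list
```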

OOM During Model Loading

Solution: Reduce instance count or use CPU instances for some models:
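Each loaded instance holds its own copy of the model's weights and workspace, so trimming instance_group in config.pbtxt is the quickest fix. A sketch:

```protobuf
instance_group [
  {
    # A single instance instead of several to cut GPU memory use
    count: 1
    kind: KIND_CPU  # or KIND_GPU with a reduced count
  }
]
```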


Cost Estimation

| GPU | VRAM | Est. Price | Throughput (ResNet50) |
| --- | --- | --- | --- |
| RTX 3080 | 10 GB | ~$0.10/hr | ~500 req/sec |
| RTX 4090 | 24 GB | ~$0.35/hr | ~1500 req/sec |
| A100 40GB | 40 GB | ~$0.80/hr | ~3000 req/sec |
| H100 | 80 GB | ~$2.50/hr | ~8000 req/sec |


Useful Resources

  • Triton Inference Server repository: https://github.com/triton-inference-server/server

  • Official documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/

  • Triton containers on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver

Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production Inference | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Models (70B+) | A100 80GB | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
