# Triton Inference Server

**NVIDIA Triton Inference Server** is a production-grade, open-source inference serving platform that supports virtually every major ML framework. Designed for high-throughput, low-latency serving, Triton handles PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, and more — all from a single server process. Deploy it on Clore.ai's GPU cloud for scalable, cost-efficient inference infrastructure.

***

## What is Triton Inference Server?

Triton is NVIDIA's answer to the challenge of serving ML models at scale:

* **Multi-framework:** PyTorch, TensorFlow, TensorRT, ONNX, OpenVINO, Python custom backends
* **Concurrent execution:** Multiple models, multiple instances per GPU
* **Dynamic batching:** Automatically batch requests for higher throughput
* **gRPC + HTTP:** Industry-standard protocols out of the box
* **Metrics:** Prometheus-compatible metrics endpoint
* **Model repository:** File-system based model management

**Ports used:**

| Port | Protocol | Purpose            |
| ---- | -------- | ------------------ |
| 8000 | HTTP     | REST inference API |
| 8001 | gRPC     | gRPC inference API |
| 8002 | HTTP     | Prometheus metrics |

***

## Prerequisites

| Requirement | Minimum                  | Recommended     |
| ----------- | ------------------------ | --------------- |
| GPU VRAM    | 8 GB                     | 16–24 GB        |
| GPU         | Any NVIDIA with CUDA 11+ | RTX 4090 / A100 |
| RAM         | 16 GB                    | 32 GB           |
| Storage     | 20 GB                    | 50 GB           |

{% hint style="info" %}
Triton also supports CPU-only inference for non-CUDA workloads. Use the `cpu-only` variant of the Docker image for cost savings on batch jobs that don't require GPU.
{% endhint %}

***

## Step 1 — Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and filter by VRAM ≥ 16 GB.
3. Select a server and click **Configure**.
4. Set Docker image: **`nvcr.io/nvidia/tritonserver:24.01-py3`**
   * (Replace `24.01` with the latest version — check [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver))
5. Set open ports: `22` (SSH), `8000` (HTTP), `8001` (gRPC), `8002` (metrics).
6. Click **Rent**.

{% hint style="warning" %}
Triton Docker images are large (\~15–20 GB). Allow 3–5 minutes for initial pull on first launch. Subsequent starts are fast.
{% endhint %}

***

## Step 2 — Custom Dockerfile (with SSH)

The official Triton image doesn't include an SSH server. Use this Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/tritonserver:24.01-py3

RUN apt-get update && apt-get install -y \
    openssh-server \
    wget curl \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH (change this default root password before exposing the instance)
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install the Triton Python client and common helper libraries
RUN pip install "tritonclient[all]" numpy Pillow

RUN mkdir -p /models

EXPOSE 22 8000 8001 8002

CMD service ssh start && \
    tritonserver \
        --model-repository=/models \
        --log-verbose=0 \
        --http-port=8000 \
        --grpc-port=8001 \
        --metrics-port=8002
```

***

## Step 3 — Understand the Model Repository

Triton loads models from a **model repository** — a directory with a specific structure:

```
/models/
├── model_name_1/
│   ├── config.pbtxt          # Model configuration
│   ├── 1/                    # Version 1
│   │   └── model.pt          # Model file
│   └── 2/                    # Version 2 (optional)
│       └── model.pt
└── model_name_2/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```

Each model needs:

1. A directory with the model name
2. A `config.pbtxt` configuration file
3. At least one version subdirectory (e.g., `1/`) with the model file

***

## Step 4 — Deploy a PyTorch Model

### Export Model to TorchScript

```python
import torch
import torchvision

# Load a pretrained ResNet50 (the pretrained= argument is deprecated in newer torchvision)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.eval()

# Export to TorchScript
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Save
traced_model.save("/tmp/resnet50.pt")
print("Model exported successfully")
```
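
Optionally, sanity-check the traced module before uploading it — a minimal sketch, run locally with the same `torch`/`torchvision` used for the export:

```python
import torch
import torchvision

# Reload the traced module saved above and compare it with the eager model
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT).eval()
reloaded = torch.jit.load("/tmp/resnet50.pt").eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    diff = (reloaded(x) - model(x)).abs().max().item()

# Outputs should agree to within floating-point tolerance
print(f"Max difference vs. eager model: {diff:.2e}")
```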

### Set Up Model Repository

```bash
# SSH into your Clore.ai instance
ssh root@<clore-host> -p <port>

# Create directory structure
mkdir -p /models/resnet50/1

# Upload model
# (from your local machine)
scp -P <port> /tmp/resnet50.pt root@<clore-host>:/models/resnet50/1/model.pt
```

### Create config.pbtxt

```bash
cat > /models/resnet50/config.pbtxt << 'EOF'
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 32

# dims below exclude the batch dimension because max_batch_size > 0
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
EOF
```

***

## Step 5 — Deploy an ONNX Model

### Export to ONNX

```python
import torch
import torchvision
import torch.onnx

model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "/tmp/resnet50.onnx",
    opset_version=13,
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={
        "images": {0: "batch_size"},
        "logits": {0: "batch_size"}
    }
)
```
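
It's worth validating the exported graph locally before uploading — a minimal sketch, assuming `onnx` and `onnxruntime` are installed on your local machine (`pip install onnx onnxruntime`):

```python
import numpy as np
import onnx
import onnxruntime as ort

# Structural check of the exported graph
onnx.checker.check_model("/tmp/resnet50.onnx")

# Run a test inference with ONNX Runtime and confirm the output shape
session = ort.InferenceSession("/tmp/resnet50.onnx", providers=["CPUExecutionProvider"])
dummy = np.random.rand(4, 3, 224, 224).astype(np.float32)  # batch of 4 exercises the dynamic axis
logits = session.run(["logits"], {"images": dummy})[0]
print("Output shape:", logits.shape)  # expected: (4, 1000)
```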

### ONNX Config

```bash
# On the Clore.ai instance: create the version directory
mkdir -p /models/resnet50_onnx/1

# From your local machine: upload the exported model
scp -P <port> /tmp/resnet50.onnx root@<clore-host>:/models/resnet50_onnx/1/model.onnx

# Back on the instance: write the configuration
cat > /models/resnet50_onnx/config.pbtxt << 'EOF'
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}
EOF
```

***

## Step 6 — Deploy a Python Custom Backend

For models that don't fit standard backends (custom preprocessing, ensemble logic):

```bash
mkdir -p /models/custom_model/1

cat > /models/custom_model/1/model.py << 'EOF'
import triton_python_backend_utils as pb_utils
import numpy as np
import torch

class TritonPythonModel:
    def initialize(self, args):
        self.model = torch.nn.Linear(10, 5).cuda()
        self.model.eval()
    
    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            input_np = input_tensor.as_numpy()
            
            with torch.no_grad():
                inp = torch.from_numpy(input_np).float().cuda()
                out = self.model(inp).cpu().numpy()
            
            output_tensor = pb_utils.Tensor("OUTPUT", out.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        
        return responses
    
    def finalize(self):
        pass
EOF

cat > /models/custom_model/config.pbtxt << 'EOF'
name: "custom_model"
backend: "python"
max_batch_size: 64

input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [10]
  }
]

output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [5]
  }
]
EOF
```
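
Once the server is running (Step 7), the custom backend can be exercised with a small client script — a sketch using the input/output names from the config above and the same host/port placeholders as the rest of this guide:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<clore-host>:<public-port-8000>")

# Shape is (batch, 10): max_batch_size > 0 adds an implicit batch dimension
data = np.random.rand(1, 10).astype(np.float32)
inp = httpclient.InferInput("INPUT", data.shape, "FP32")
inp.set_data_from_numpy(data)

response = client.infer("custom_model", [inp],
                        outputs=[httpclient.InferRequestedOutput("OUTPUT")])
print(response.as_numpy("OUTPUT"))  # expected shape: (1, 5)
```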

***

## Step 7 — Start Triton and Test

### Start Triton Server

```bash
# Start (if using the Dockerfile CMD, it auto-starts)
tritonserver \
    --model-repository=/models \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002 \
    --log-verbose=0 &

# Wait for server to be ready
sleep 5
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
# Expected: 200 (the ready endpoint returns an empty body on success)
```

### Check Available Models

```bash
curl -X POST http://<clore-host>:<public-8000>/v2/repository/index
```
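
The same information is available from the Python client — for example, listing the repository index and inspecting one model's parsed configuration:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<clore-host>:<public-port-8000>")

# List every model Triton found in the repository and its load state
for entry in client.get_model_repository_index():
    print(entry)

# Inspect the parsed configuration of one model
print(client.get_model_config("resnet50_onnx"))
```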

### Run Inference via HTTP

```python
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(
    url="<clore-host>:<public-port-8000>",
    ssl=False
)

# Check server health
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("resnet50_onnx"))

# Create input
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = httpclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

# Run inference
outputs = [httpclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
predicted_class = np.argmax(logits[0])
print(f"Predicted class: {predicted_class}")
```
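
The example above sends random data. Real images need ImageNet-style preprocessing before they're fed to ResNet-50 — a minimal sketch with Pillow, assuming a local `cat.jpg` (the mean/std values are the standard ImageNet normalization constants):

```python
import numpy as np
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    """Resize, center-crop, scale, and normalize an image for ResNet-50 (NCHW, FP32)."""
    # Simple resize; torchvision's reference transforms resize the shorter edge instead
    img = Image.open(path).convert("RGB").resize((256, 256))
    left = (256 - 224) // 2
    img = img.crop((left, left, left + 224, left + 224))
    x = np.asarray(img, dtype=np.float32) / 255.0               # HWC, scaled to [0, 1]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std
    x = x.transpose(2, 0, 1)[np.newaxis, ...]                   # NCHW with batch dimension
    return x.astype(np.float32)

image = preprocess("cat.jpg")  # use in place of the random input above
```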

### Run Inference via gRPC

```python
import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(
    url="<clore-host>:<public-port-8001>"
)

image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = grpcclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

outputs = [grpcclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
print(f"Output shape: {logits.shape}")
```
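
The gRPC client also supports asynchronous requests, which helps keep the dynamic batcher fed under load — a minimal sketch reusing the placeholders above:

```python
import queue

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="<clore-host>:<public-port-8001>")
results = queue.Queue()

def on_complete(result, error):
    # Invoked from a background thread when each request completes
    results.put(error if error else result.as_numpy("logits").shape)

# Fire several requests without waiting for each response
for _ in range(8):
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = grpcclient.InferInput("images", data.shape, "FP32")
    inp.set_data_from_numpy(data)
    client.async_infer("resnet50_onnx", [inp], callback=on_complete,
                       outputs=[grpcclient.InferRequestedOutput("logits")])

# Collect the results (or errors) as they finish
for _ in range(8):
    print(results.get())
```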

***

## Monitoring with Prometheus

Triton exposes metrics at port 8002:

```bash
curl http://<clore-host>:<public-port-8002>/metrics
```

Key metrics:

```
# Successful inference requests (cumulative counter)
nv_inference_request_success{model="resnet50_onnx", version="1"}
# Cumulative compute time in microseconds (divide by the request count for an average)
nv_inference_compute_infer_duration_us{model="resnet50_onnx", version="1"}
# GPU utilization
nv_gpu_utilization{gpu_uuid="..."}
# GPU memory
nv_gpu_memory_used_bytes{gpu_uuid="..."}
```
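
For a quick look without a full Prometheus deployment, you can poll the endpoint and filter the counters you care about — an illustrative sketch using the `requests` library (install it locally if needed):

```python
import requests

# Fetch the Prometheus text exposition and print only inference-related counters
metrics = requests.get("http://<clore-host>:<public-port-8002>/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("nv_inference_"):
        print(line)
```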

***

## Dynamic Batching Configuration

```protobuf
dynamic_batching {
  # Batch sizes the scheduler tries to build from queued requests
  preferred_batch_size: [4, 8, 16, 32]
  # How long a request may wait in the queue for more requests to arrive
  max_queue_delay_microseconds: 5000
  # Return responses in the order requests were received
  preserve_ordering: true

  # Optional request prioritization and queue limits
  priority_levels: 3
  default_priority_level: 2
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 10000
    allow_timeout_override: true
    max_queue_size: 100
  }
}
```
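
Dynamic batching only pays off when requests arrive concurrently — a single-threaded client never gives Triton anything to merge. An illustrative sketch that fires many single-sample requests in parallel so the batcher can combine them (one client per request keeps the example simple, not efficient):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

URL = "<clore-host>:<public-port-8000>"

def single_request(_):
    # One single-sample request per call; Triton's dynamic batcher merges
    # concurrent requests that arrive within max_queue_delay_microseconds
    client = httpclient.InferenceServerClient(url=URL)
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("images", data.shape, "FP32")
    inp.set_data_from_numpy(data)
    result = client.infer("resnet50_onnx", [inp],
                          outputs=[httpclient.InferRequestedOutput("logits")])
    return result.as_numpy("logits").shape

with ThreadPoolExecutor(max_workers=16) as pool:
    shapes = list(pool.map(single_request, range(64)))
print(f"Completed {len(shapes)} requests")
```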

***

## Troubleshooting

### Model Load Failure

```
Failed to load model: could not find model file
```

**Solution:** Check directory structure and permissions:

```bash
ls -la /models/resnet50/1/
# Must contain model.pt (PyTorch) or model.onnx (ONNX)
chmod -R 755 /models/
```

### CUDA Incompatibility

**Solution:** Match Triton image version to your CUDA driver:

```bash
nvidia-smi  # Note CUDA version
# Use matching tritonserver tag, e.g., 23.10 for CUDA 12.2
```

### Port Not Reachable

**Solution:** Verify all three ports (8000, 8001, 8002) are forwarded in Clore.ai. Test each:

```bash
curl http://<host>:<port>/v2/health/live
```

### OOM During Model Loading

**Solution:** Reduce instance count or use CPU instances for some models:

```protobuf
instance_group [
  {
    count: 1       # one model instance per GPU (the earlier example used 2)
    kind: KIND_GPU
  }
]
```

***

## Cost Estimation

| GPU       | VRAM  | Est. Price | Throughput (ResNet50) |
| --------- | ----- | ---------- | --------------------- |
| RTX 3080  | 10 GB | \~$0.10/hr | \~500 req/sec         |
| RTX 4090  | 24 GB | \~$0.35/hr | \~1500 req/sec        |
| A100 40GB | 40 GB | \~$0.80/hr | \~3000 req/sec        |
| H100      | 80 GB | \~$2.50/hr | \~8000 req/sec        |

***

## Useful Resources

* [Triton GitHub](https://github.com/triton-inference-server/server)
* [NGC Container Registry](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
* [Triton Client Libraries](https://github.com/triton-inference-server/client)
* [Triton Model Configuration Reference](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html)
* [Triton Python Backend](https://github.com/triton-inference-server/python_backend)
* [Triton Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/README.md)

***

## Clore.ai GPU Recommendations

| Use Case             | Recommended GPU | Est. Cost on Clore.ai |
| -------------------- | --------------- | --------------------- |
| Development/Testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Inference | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Models (70B+)  | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
