# Triton Inference Server

**NVIDIA Triton Inference Server** is a production-grade, open-source inference serving platform that supports virtually every major ML framework. Designed for high-throughput, low-latency serving, Triton handles PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, and more — all from a single server process. Deploy it on Clore.ai's GPU cloud for scalable, cost-efficient inference infrastructure.

***

## What is Triton Inference Server?

Triton is NVIDIA's answer to the challenge of serving ML models at scale:

* **Multi-framework:** PyTorch, TensorFlow, TensorRT, ONNX, OpenVINO, Python custom backends
* **Concurrent execution:** Multiple models, multiple instances per GPU
* **Dynamic batching:** Automatically batch requests for higher throughput
* **gRPC + HTTP:** Industry-standard protocols out of the box
* **Metrics:** Prometheus-compatible metrics endpoint
* **Model repository:** File-system based model management

**Ports used:**

| Port | Protocol | Purpose            |
| ---- | -------- | ------------------ |
| 8000 | HTTP     | REST inference API |
| 8001 | gRPC     | gRPC inference API |
| 8002 | HTTP     | Prometheus metrics |

***

## Prerequisites

| Requirement | Minimum                  | Recommended     |
| ----------- | ------------------------ | --------------- |
| GPU VRAM    | 8 GB                     | 16–24 GB        |
| GPU         | Any NVIDIA with CUDA 11+ | RTX 4090 / A100 |
| RAM         | 16 GB                    | 32 GB           |
| Storage     | 20 GB                    | 50 GB           |

{% hint style="info" %}
Triton also supports CPU-only inference for non-CUDA workloads — set `kind: KIND_CPU` in a model's `instance_group` to serve it from CPU and save on GPU costs for batch jobs that don't require acceleration.
{% endhint %}

***

## Step 1 — Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and filter by VRAM ≥ 16 GB.
3. Select a server and click **Configure**.
4. Set Docker image: **`nvcr.io/nvidia/tritonserver:24.01-py3`**
   * (Replace `24.01` with the latest version — check [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver))
5. Set open ports: `22` (SSH), `8000` (HTTP), `8001` (gRPC), `8002` (metrics).
6. Click **Rent**.

{% hint style="warning" %}
Triton Docker images are large (\~15–20 GB). The initial pull can take several minutes depending on the host's network speed. Subsequent starts are fast.
{% endhint %}

***

## Step 2 — Custom Dockerfile (with SSH)

The official Triton image doesn't include an SSH server. Use this Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/tritonserver:24.01-py3

RUN apt-get update && apt-get install -y \
    openssh-server \
    wget curl \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH (change this default root password before exposing the instance)
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install the Python client library (quotes keep the extras spec safe from shell globbing)
RUN pip install "tritonclient[all]" numpy Pillow

RUN mkdir -p /models

EXPOSE 22 8000 8001 8002

CMD service ssh start && \
    tritonserver \
        --model-repository=/models \
        --log-verbose=0 \
        --http-port=8000 \
        --grpc-port=8001 \
        --metrics-port=8002
```
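Build and push the image to a registry your Clore.ai instance can pull from, then use that tag in place of the stock image in Step 1 (`<your-registry>` is a placeholder):

```bash
# Build and push the custom image (replace <your-registry> with your own)
docker build -t <your-registry>/tritonserver-ssh:24.01 .
docker push <your-registry>/tritonserver-ssh:24.01
```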

***

## Step 3 — Understand the Model Repository

Triton loads models from a **model repository** — a directory with a specific structure:

```
/models/
├── model_name_1/
│   ├── config.pbtxt          # Model configuration
│   ├── 1/                    # Version 1
│   │   └── model.pt          # Model file
│   └── 2/                    # Version 2 (optional)
│       └── model.pt
└── model_name_2/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```

Each model needs:

1. A directory with the model name
2. A `config.pbtxt` configuration file
3. At least one version subdirectory (e.g., `1/`) with the model file

***

## Step 4 — Deploy a PyTorch Model

### Export Model to TorchScript

```python
import torch
import torchvision

# Load a pretrained ResNet50 (the `weights` API replaces the deprecated `pretrained=True`)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.eval()

# Export to TorchScript
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Save
traced_model.save("/tmp/resnet50.pt")
print("Model exported successfully")
```
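Optionally, sanity-check the traced model before uploading — a minimal sketch that reloads the artifact and compares its output against the eager model:

```python
import torch
import torchvision

# Reload the TorchScript artifact and compare outputs with the eager model
eager = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT).eval()
traced = torch.jit.load("/tmp/resnet50.pt").eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)
    assert torch.allclose(eager(x), traced(x), atol=1e-4)
print("Traced model matches the eager model")
```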

### Set Up Model Repository

```bash
# SSH into your Clore.ai instance and create the directory structure
ssh root@<clore-host> -p <port>
mkdir -p /models/resnet50/1

# Then, from your local machine, upload the exported model
scp -P <port> /tmp/resnet50.pt root@<clore-host>:/models/resnet50/1/model.pt
```

### Create config.pbtxt

```bash
cat > /models/resnet50/config.pbtxt << 'EOF'
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 32

input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
EOF
```
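Once the server is running (Step 7), you can confirm Triton parsed this configuration by querying the model-configuration endpoint:

```bash
curl http://<clore-host>:<public-8000>/v2/models/resnet50/config
```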

***

## Step 5 — Deploy an ONNX Model

### Export to ONNX

```python
import torch
import torchvision
import torch.onnx

model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "/tmp/resnet50.onnx",
    opset_version=13,
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={
        "images": {0: "batch_size"},
        "logits": {0: "batch_size"}
    }
)
```
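Before uploading, it's worth validating the export — a quick sketch using the `onnx` and `onnxruntime` packages (install them locally if needed):

```python
import numpy as np
import onnx
import onnxruntime as ort

# Structural check of the exported graph
onnx.checker.check_model(onnx.load("/tmp/resnet50.onnx"))

# Quick CPU inference to confirm the input/output names and the dynamic batch axis
session = ort.InferenceSession("/tmp/resnet50.onnx", providers=["CPUExecutionProvider"])
batch = np.random.rand(2, 3, 224, 224).astype(np.float32)
logits = session.run(["logits"], {"images": batch})[0]
print("Output shape:", logits.shape)  # expect (2, 1000)
```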

### ONNX Config

```bash
mkdir -p /models/resnet50_onnx/1
scp -P <port> /tmp/resnet50.onnx root@<clore-host>:/models/resnet50_onnx/1/model.onnx

cat > /models/resnet50_onnx/config.pbtxt << 'EOF'
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}
EOF
```

***

## Step 6 — Deploy a Python Custom Backend

For models that don't fit standard backends (custom preprocessing, ensemble logic):

```bash
mkdir -p /models/custom_model/1

cat > /models/custom_model/1/model.py << 'EOF'
import triton_python_backend_utils as pb_utils
import numpy as np
import torch

class TritonPythonModel:
    def initialize(self, args):
        # Demo model: a randomly initialized linear layer (replace with real weights)
        self.model = torch.nn.Linear(10, 5).cuda()
        self.model.eval()
    
    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            input_np = input_tensor.as_numpy()
            
            with torch.no_grad():
                inp = torch.from_numpy(input_np).float().cuda()
                out = self.model(inp).cpu().numpy()
            
            output_tensor = pb_utils.Tensor("OUTPUT", out.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        
        return responses
    
    def finalize(self):
        pass
EOF

cat > /models/custom_model/config.pbtxt << 'EOF'
name: "custom_model"
backend: "python"
max_batch_size: 64

input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [10]
  }
]

output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [5]
  }
]
EOF
```
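Once the server is up (Step 7), the custom backend is called like any other model. A minimal client sketch using the input/output names from the config above:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<clore-host>:<public-port-8000>")

# Batch of 4 vectors matching the [10] input dims declared in config.pbtxt
data = np.random.rand(4, 10).astype(np.float32)
inp = httpclient.InferInput("INPUT", data.shape, "FP32")
inp.set_data_from_numpy(data)

response = client.infer("custom_model", [inp])
print(response.as_numpy("OUTPUT").shape)  # expect (4, 5)
```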

***

## Step 7 — Start Triton and Test

### Start Triton Server

```bash
# Start (if using the Dockerfile CMD, it auto-starts)
tritonserver \
    --model-repository=/models \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002 \
    --log-verbose=0 &

# Wait for the server to come up, then poll the readiness endpoint
sleep 5
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
# Expected: 200 (the endpoint returns an empty body when ready)
```

### Check Available Models

```bash
# The repository index is a POST endpoint in Triton's HTTP API
curl -X POST http://<clore-host>:<public-8000>/v2/repository/index
```

### Run Inference via HTTP

```python
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(
    url="<clore-host>:<public-port-8000>",
    ssl=False
)

# Check server health
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("resnet50_onnx"))

# Create input
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = httpclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

# Run inference
outputs = [httpclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
predicted_class = np.argmax(logits[0])
print(f"Predicted class: {predicted_class}")
```

### Run Inference via gRPC

```python
import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(
    url="<clore-host>:<public-port-8001>"
)

image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = grpcclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

outputs = [grpcclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
print(f"Output shape: {logits.shape}")
```
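### Benchmark with perf_analyzer

To measure throughput under load, the Triton SDK ships `perf_analyzer` (available in the `tritonserver:<version>-py3-sdk` image):

```bash
# Sweep client concurrency 1→8 against the ONNX model at batch size 8
perf_analyzer -m resnet50_onnx -u <clore-host>:<public-8000> \
    --concurrency-range 1:8 -b 8
```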

***

## Monitoring with Prometheus

Triton exposes metrics at port 8002:

```bash
curl http://<clore-host>:<public-port-8002>/metrics
```

Key metrics:

```
# Inference throughput
nv_inference_request_success{model="resnet50_onnx", version="1"}
# Average inference time
nv_inference_compute_infer_duration_us{model="resnet50_onnx", version="1"}
# GPU utilization
nv_gpu_utilization{gpu_uuid="..."}
# GPU memory
nv_gpu_memory_used_bytes{gpu_uuid="..."}
```
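A small Python sketch for pulling just the inference counters out of the exposition format (host placeholder as in the examples above):

```python
import urllib.request

# Fetch the Prometheus text exposition and print the inference success counters
url = "http://<clore-host>:<public-port-8002>/metrics"
metrics = urllib.request.urlopen(url).read().decode()
for line in metrics.splitlines():
    if line.startswith("nv_inference_request_success"):
        print(line)
```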

***

## Dynamic Batching Configuration

Dynamic batching lets Triton combine individual requests into larger batches on the server, trading a small queueing delay for higher GPU throughput:

```protobuf
dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]   # batch sizes the scheduler aims to form
  max_queue_delay_microseconds: 5000     # max wait to fill a preferred batch
  preserve_ordering: true                # return responses in request order

  priority_levels: 3                     # number of request priority levels
  default_priority_level: 2              # priority for requests that don't set one
  default_queue_policy {
    timeout_action: REJECT               # reject (rather than delay) timed-out requests
    default_timeout_microseconds: 10000
    allow_timeout_override: true         # clients may override the timeout per request
    max_queue_size: 100                  # reject new requests beyond this queue depth
  }
}
```

***

## Troubleshooting

### Model Load Failure

```
Failed to load model: could not find model file
```

**Solution:** Check directory structure and permissions:

```bash
ls -la /models/resnet50/1/
# Must contain model.pt (PyTorch) or model.onnx (ONNX)
chmod -R 755 /models/
```

### CUDA Incompatibility

**Solution:** Match Triton image version to your CUDA driver:

```bash
nvidia-smi  # Note CUDA version
# Use matching tritonserver tag, e.g., 23.10 for CUDA 12.2
```

### Port Not Reachable

**Solution:** Verify all three ports (8000, 8001, 8002) are forwarded in Clore.ai. The HTTP and metrics ports can be tested with `curl`; the gRPC port requires a gRPC client such as `tritonclient.grpc`:

```bash
curl http://<host>:<public-port-8000>/v2/health/live
curl http://<host>:<public-port-8002>/metrics
```

### OOM During Model Loading

**Solution:** Reduce instance count or use CPU instances for some models:

```protobuf
instance_group [
  {
    count: 1       # Reduce from the default
    kind: KIND_GPU # or KIND_CPU to keep this model off the GPU
  }
]
```

***

## Cost Estimation

| GPU       | VRAM  | Est. Price | Throughput (ResNet50) |
| --------- | ----- | ---------- | --------------------- |
| RTX 3080  | 10 GB | \~$0.10/hr | \~500 req/sec         |
| RTX 4090  | 24 GB | \~$0.35/hr | \~1500 req/sec        |
| A100 40GB | 40 GB | \~$0.80/hr | \~3000 req/sec        |
| H100      | 80 GB | \~$2.50/hr | \~8000 req/sec        |

***

## Useful Resources

* [Triton GitHub](https://github.com/triton-inference-server/server)
* [NGC Container Registry](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
* [Triton Client Libraries](https://github.com/triton-inference-server/client)
* [Triton Model Configuration Reference](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html)
* [Triton Python Backend](https://github.com/triton-inference-server/python_backend)
* [Triton Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/README.md)

***

## Clore.ai GPU Recommendations

| Use Case             | Recommended GPU | Est. Cost on Clore.ai |
| -------------------- | --------------- | --------------------- |
| Development/Testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Inference | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Models (70B+)  | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

