
# Triton Inference Server

**NVIDIA Triton Inference Server** is a production-grade, open-source inference serving platform that supports nearly all major ML frameworks. Designed for high throughput and low latency, Triton serves PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, and other models from a single server process. Deploy it on the Clore.ai GPU cloud for scalable, cost-effective inference infrastructure.

***

## What is Triton Inference Server?

Triton is NVIDIA's answer to the problem of serving ML models at scale:

* **Multi-framework:** PyTorch, TensorFlow, TensorRT, ONNX, OpenVINO, custom Python backends
* **Concurrent execution:** Multiple models, multiple instances per GPU
* **Dynamic batching:** Automatic request batching for higher throughput
* **gRPC + HTTP:** Industry-standard protocols out of the box
* **Metrics:** Prometheus-compatible metrics endpoint
* **Model repository:** File-system-based model management

**Ports used:**

| Port | Protocol | Purpose                |
| ---- | -------- | ---------------------- |
| 8000 | HTTP     | REST API for inference |
| 8001 | gRPC     | gRPC API for inference |
| 8002 | HTTP     | Prometheus metrics     |

***

## Requirements

| Requirement | Minimum                  | Recommended     |
| ----------- | ------------------------ | --------------- |
| GPU VRAM    | 8 GB                     | 16–24 GB        |
| GPU         | Any NVIDIA with CUDA 11+ | RTX 4090 / A100 |
| RAM         | 16 GB                    | 32 GB           |
| Storage     | 20 GB                    | 50 GB           |

{% hint style="info" %}
Triton also supports CPU-only inference for workloads that do not need CUDA. Use a CPU-only variant of the Docker image to save money on batch jobs that do not require a GPU.
{% endhint %}

***

## Step 1: Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and filter by VRAM ≥ 16 GB.
3. Select a server and click **Configure**.
4. Set the Docker image: **`nvcr.io/nvidia/tritonserver:24.01-py3`**
   * (Replace `24.01` with the latest version; check the [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver))
5. Set the open ports: `22` (SSH), `8000` (HTTP), `8001` (gRPC), `8002` (metrics).
6. Click **Rent**.

{% hint style="warning" %}
Triton Docker images are large (\~15–20 GB). Pulling the image on first launch can take 3–5 minutes. Subsequent launches are fast.
{% endhint %}

***

## Step 2: Custom Dockerfile (with SSH)

The official Triton image does not include an SSH server. Use this Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/tritonserver:24.01-py3

RUN apt-get update && apt-get install -y \
    openssh-server \
    wget curl \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH (change the example root password before exposing port 22)
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install the Python client library
RUN pip install "tritonclient[all]" numpy Pillow

RUN mkdir -p /models

EXPOSE 22 8000 8001 8002

CMD service ssh start && \
    tritonserver \
        --model-repository=/models \
        --log-verbose=0 \
        --http-port=8000 \
        --grpc-port=8001 \
        --metrics-port=8002
```

***

## Step 3: Understanding the model repository

Triton loads models from a **model repository**, a directory with a specific structure:

```
/models/
├── model_name_1/
│   ├── config.pbtxt          # Model configuration
│   ├── 1/                    # Version 1
│   │   └── model.pt          # Model file
│   └── 2/                    # Version 2 (optional)
│       └── model.pt
├── model_name_2/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
```

Each model requires:

1. A directory named after the model
2. A `config.pbtxt` configuration file
3. At least one version subdirectory (for example, `1/`) containing the model file
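
The three requirements above can be checked before starting the server. The following is a minimal sketch, not part of Triton: `check_model_repository` is a hypothetical helper that only tests the layout rules listed in this guide, and it assumes version folders have purely numeric names.

```python
from pathlib import Path
import tempfile


def check_model_repository(root: Path) -> list[str]:
    """Return a list of layout problems found under a model repository root."""
    problems = []
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Requirement 2: a config.pbtxt next to the version folders.
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        # Requirement 3: at least one numeric version folder containing a file.
        versions = [v for v in model_dir.iterdir() if v.is_dir() and v.name.isdigit()]
        if not any(any(f.is_file() for f in v.iterdir()) for v in versions):
            problems.append(f"{model_dir.name}: no version folder with a model file")
    return problems


# Build a throwaway repository: one complete model, one missing its config.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "resnet50" / "1").mkdir(parents=True)
    (repo / "resnet50" / "1" / "model.pt").write_bytes(b"")
    (repo / "resnet50" / "config.pbtxt").write_text('name: "resnet50"\n')
    (repo / "broken" / "1").mkdir(parents=True)
    (repo / "broken" / "1" / "model.onnx").write_bytes(b"")
    report = check_model_repository(repo)

print(report)  # ['broken: missing config.pbtxt']
```

On a real instance, the same function could be pointed at `Path("/models")` before launching `tritonserver`.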

***

## Step 4: Deploy a PyTorch model

### Export the model to TorchScript

```python
import torch
import torchvision

# Load a pretrained ResNet50 (torchvision >= 0.13; replaces the deprecated `pretrained=True`)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

# Export to TorchScript
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Save
traced_model.save("/tmp/resnet50.pt")
print("Model exported successfully")
```

### Set up the model repository

```bash
# On the Clore.ai instance: connect over SSH
ssh root@<clore-host> -p <port>

# Create the directory structure
mkdir -p /models/resnet50/1

# Upload the model
# (run this from your local machine, not from the SSH session)
scp -P <port> /tmp/resnet50.pt root@<clore-host>:/models/resnet50/1/model.pt
```

### Create config.pbtxt

```bash
cat > /models/resnet50/config.pbtxt << 'EOF'
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 32

input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
EOF
```
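
With `max_batch_size` greater than zero, the `dims` in the config exclude the batch dimension, so `[3, 224, 224]` describes requests shaped `[batch, 3, 224, 224]`. The sketch below shows one way to generate the core fields of such a file from Python. `render_config` is a hypothetical helper, not a Triton API, and it covers only the name, platform, batch size, inputs, and outputs.

```python
def render_config(name, platform, inputs, outputs, max_batch_size=32):
    """Render a minimal config.pbtxt; tensors are (name, data_type, dims) tuples."""

    def tensor_block(kind, tensors):
        # One protobuf-text entry per tensor, separated by commas.
        entries = ",\n".join(
            f'  {{\n    name: "{n}"\n    data_type: {t}\n    dims: {list(d)}\n  }}'
            for n, t, d in tensors
        )
        return f"{kind} [\n{entries}\n]"

    return "\n".join([
        f'name: "{name}"',
        f'platform: "{platform}"',
        f"max_batch_size: {max_batch_size}",
        tensor_block("input", inputs),
        tensor_block("output", outputs),
    ]) + "\n"


config_text = render_config(
    "resnet50",
    "pytorch_libtorch",
    inputs=[("input__0", "TYPE_FP32", (3, 224, 224))],
    outputs=[("output__0", "TYPE_FP32", (1000,))],
)
print(config_text)
```

Sections such as `dynamic_batching` and `instance_group` would still need to be appended by hand.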

***

## Step 5: Deploy an ONNX model

### Export to ONNX

```python
import torch
import torchvision
import torch.onnx

model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "/tmp/resnet50.onnx",
    opset_version=13,
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={
        "images": {0: "batch_size"},
        "logits": {0: "batch_size"}
    }
)
```

### ONNX configuration

```bash
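# On the Clore.ai instance: create the model directory
mkdir -p /models/resnet50_onnx/1

# On your local machine: upload the exported model
scp -P <port> /tmp/resnet50.onnx root@<clore-host>:/models/resnet50_onnx/1/model.onnx

# Back on the instance: write the model configuration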
cat > /models/resnet50_onnx/config.pbtxt << 'EOF'
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}
EOF
```

***

## Step 6: Deploy a custom Python backend

For models that do not fit the standard backends (custom preprocessing, ensemble logic):

```bash
mkdir -p /models/custom_model/1

cat > /models/custom_model/1/model.py << 'EOF'
import triton_python_backend_utils as pb_utils
import numpy as np
import torch

class TritonPythonModel:
    def initialize(self, args):
        self.model = torch.nn.Linear(10, 5).cuda()
        self.model.eval()
    
    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            input_np = input_tensor.as_numpy()
            
            with torch.no_grad():
                inp = torch.from_numpy(input_np).float().cuda()
                out = self.model(inp).cpu().numpy()
            
            output_tensor = pb_utils.Tensor("OUTPUT", out.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        
        return responses
    
    def finalize(self):
        pass
EOF

cat > /models/custom_model/config.pbtxt << 'EOF'
name: "custom_model"
backend: "python"
max_batch_size: 64

input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [10]
  }
]

output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [5]
  }
]
EOF
```

***

## Step 7: Start Triton and test

### Start Triton Server

```bash
# Start the server (with the CMD from the Dockerfile, it starts automatically)
tritonserver \
    --model-repository=/models \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002 \
    --log-verbose=0 &

# Wait for the server to become ready
sleep 5
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
# Expected: 200 (the endpoint signals readiness by status code and returns an empty body)
```
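
Loading large models can take longer than a fixed five-second sleep. As an alternative, the sketch below polls the readiness endpoint until it answers with HTTP 200. `http_ready` and `wait_until_ready` are hypothetical helper names, and the timeout and interval values are assumptions to adjust.

```python
import time
import urllib.error
import urllib.request


def http_ready(url: str) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, timeout_s=60.0, interval_s=1.0, probe=http_ready):
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call on the instance:
# wait_until_ready("http://localhost:8000/v2/health/ready")
```

The `probe` parameter exists so the loop can be exercised without a running server.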

### List available models

```bash
curl -X POST http://<clore-host>:<public-port-8000>/v2/repository/index
```

### Run inference over HTTP

```python
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(
    url="<clore-host>:<public-port-8000>",
    ssl=False
)

# Check server and model status
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("resnet50_onnx"))

# Build the input
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = httpclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

# Run inference
outputs = [httpclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
predicted_class = np.argmax(logits[0])
print(f"Predicted class: {predicted_class}")
```
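
The client library wraps a JSON request defined by the v2 inference protocol. For environments where `tritonclient` is not installed, the sketch below builds an equivalent request body with the standard library only. `build_infer_request` is a hypothetical helper, and the random values stand in for a real preprocessed image.

```python
import json
import random


def build_infer_request(input_name, shape, values, output_name):
    """Build a v2 inference request body with FP32 data in row-major order."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": "FP32",
                "data": list(values),
            }
        ],
        "outputs": [{"name": output_name}],
    }


shape = (1, 3, 224, 224)
pixels = [random.random() for _ in range(1 * 3 * 224 * 224)]
body = json.dumps(build_infer_request("images", shape, pixels, "logits"))
# POST `body` to http://<clore-host>:<public-port-8000>/v2/models/resnet50_onnx/infer
print(f"request body: {len(body)} bytes")
```

JSON-encoded tensors are considerably larger than the binary payloads `tritonclient` can send, so this form is better suited to debugging than to production traffic.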

### Run inference over gRPC

```python
import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(
    url="<clore-host>:<public-port-8001>"
)

image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = grpcclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

outputs = [grpcclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
print(f"Output shape: {logits.shape}")
```

***

## Monitoring with Prometheus

Triton exposes metrics on port 8002:

```bash
curl http://<clore-host>:<public-port-8002>/metrics
```

Key metrics:

```
# Successful inference requests (cumulative counter; its rate gives throughput)
nv_inference_request_success{model="resnet50_onnx", version="1"}
# Cumulative inference compute time in microseconds (divide by the request count for an average)
nv_inference_compute_infer_duration_us{model="resnet50_onnx", version="1"}
# GPU utilization
nv_gpu_utilization{gpu_uuid="..."}
# GPU memory in use, in bytes
nv_gpu_memory_used_bytes{gpu_uuid="..."}
```
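
Both inference metrics are cumulative, so an average is obtained by dividing one by the other. The sketch below parses Prometheus text output and computes a rough mean compute time per successful request. The `SAMPLE` values are made up for illustration, and `read_metric` and `mean_compute_us` are hypothetical helper names.

```python
import re

# Made-up sample in Prometheus text format; real values come from /metrics.
SAMPLE = """\
# illustrative numbers only
nv_inference_request_success{model="resnet50_onnx",version="1"} 200
nv_inference_compute_infer_duration_us{model="resnet50_onnx",version="1"} 500000
"""

LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')


def read_metric(text, name, model):
    """Return the value of `name` for `model`, or None if it is absent."""
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match and match["name"] == name and f'model="{model}"' in match["labels"]:
            return float(match["value"])
    return None


def mean_compute_us(text, model):
    """Average compute time per successful request, in microseconds."""
    count = read_metric(text, "nv_inference_request_success", model)
    total = read_metric(text, "nv_inference_compute_infer_duration_us", model)
    if not count or total is None:
        return None
    return total / count


print(mean_compute_us(SAMPLE, "resnet50_onnx"))  # 2500.0
```

In a Prometheus setup the same ratio is usually computed over a time window from the rates of the two counters rather than from their lifetime totals.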

***

## Dynamic batching configuration

```protobuf
dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]
  max_queue_delay_microseconds: 5000
  preserve_ordering: true
  
  priority_levels: 3
  default_priority_level: 2
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 10000
    allow_timeout_override: true
    max_queue_size: 100
  }
}
```
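
The effect of `max_queue_delay_microseconds` can be illustrated with a toy simulation. This is a deliberately simplified model and not Triton's actual scheduler: it assumes single-item requests, ignores execution time and priorities, and flushes a batch either when it is full or when the oldest queued request has waited longer than the delay. It compares the 100 µs delay used in the earlier configs with the 5000 µs delay used here.

```python
def simulate_batches(arrivals_us, max_batch, max_delay_us):
    """Group sorted arrival times into batches (toy model, one-item requests)."""
    batches, current = [], []
    for t in arrivals_us:
        # The oldest queued request has waited long enough: flush before queuing t.
        if current and t - current[0] > max_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # A full batch is dispatched immediately.
        if len(current) == max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times in microseconds.
arrivals = [0, 1000, 2000, 9000, 9500, 20000]
for delay in (100, 5000):
    sizes = [len(b) for b in simulate_batches(arrivals, max_batch=32, max_delay_us=delay)]
    print(delay, sizes)
# 100 [1, 1, 1, 1, 1, 1]
# 5000 [3, 2, 1]
```

Under these assumptions a longer delay trades added queueing latency for larger batches, which is the trade-off the setting controls.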

***

## Troubleshooting

### Model fails to load

```
Failed to load model: model file not found
```

**Solution:** Check the directory structure and permissions:

```bash
ls -la /models/resnet50/1/
# Should contain model.pt (PyTorch) or model.onnx (ONNX)
chmod -R 755 /models/
```

### CUDA incompatibility

**Solution:** Match the Triton image version to the CUDA version your driver supports:

```bash
nvidia-smi  # Shows the highest CUDA version the driver supports
# Use a matching tritonserver tag, for example 23.10 for CUDA 12.2
```

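### Port unreachable

**Solution:** Make sure all three ports (8000, 8001, 8002) are forwarded in Clore.ai. The HTTP and metrics ports can be checked with `curl`; port 8001 speaks gRPC, so test it with a gRPC client such as `tritonclient.grpc`:

```bash
curl -v http://<clore-host>:<public-port-8000>/v2/health/live
curl -s http://<clore-host>:<public-port-8002>/metrics | head
```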
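
A plain TCP connection test works for all three ports regardless of protocol, which helps separate a port-forwarding problem from a server problem. The sketch below is a minimal version using the standard library. `port_open` and `check_ports` are hypothetical helper names, and the port numbers in the usage comment are placeholders.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False


def check_ports(host, ports):
    """Map each label to whether its port accepts TCP connections."""
    return {label: port_open(host, port) for label, port in ports.items()}


# Hypothetical usage; substitute the host and public ports shown by Clore.ai:
# check_ports("<clore-host>", {"http": 30800, "grpc": 30801, "metrics": 30802})
```

A port that accepts connections but fails the `curl` check points at the server or model state rather than at port forwarding.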

### OOM when loading a model

**Solution:** Reduce the instance count, or run some models on CPU instances (`KIND_CPU`):

```protobuf
instance_group [
  {
    count: 1       # Reduced from the 2 used for resnet50 above
    kind: KIND_GPU
  }
]
```

***

## Cost estimate

| GPU       | VRAM  | Approximate price | Throughput (ResNet50) |
| --------- | ----- | ----------------- | --------------------- |
| RTX 3080  | 10 GB | \~$0.10/hr        | \~500 requests/sec    |
| RTX 4090  | 24 GB | \~$0.35/hr        | \~1500 requests/sec   |
| A100 40GB | 40 GB | \~$0.80/hr        | \~3000 requests/sec   |
| H100      | 80 GB | \~$2.50/hr        | \~8000 requests/sec   |
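
Hourly price and throughput can be combined into a cost per million requests, which makes the rows easier to compare. The sketch below does that arithmetic with the approximate figures from the table, assuming the GPU runs at full throughput for the whole hour. `usd_per_million` is a hypothetical helper name.

```python
# (GPU, price in USD per hour, requests per second) copied from the table above.
ROWS = [
    ("RTX 3080", 0.10, 500),
    ("RTX 4090", 0.35, 1500),
    ("A100 40GB", 0.80, 3000),
    ("H100", 2.50, 8000),
]


def usd_per_million(price_per_hour, requests_per_second):
    """Cost of one million requests at sustained full throughput."""
    requests_per_hour = requests_per_second * 3600
    return price_per_hour / requests_per_hour * 1_000_000


for gpu, price, rps in ROWS:
    print(f"{gpu}: ${usd_per_million(price, rps):.3f} per million requests")
# RTX 3080: $0.056 per million requests
# RTX 4090: $0.065 per million requests
# A100 40GB: $0.074 per million requests
# H100: $0.087 per million requests
```

With these approximate numbers the smaller cards come out cheaper per request, while the larger ones offer more throughput per server. Real costs depend on actual utilization and current marketplace prices.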

***

## Useful resources

* [Triton GitHub](https://github.com/triton-inference-server/server)
* [NGC Container Registry](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
* [Triton client libraries](https://github.com/triton-inference-server/client)
* [Triton model configuration reference](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html)
* [Triton Python backend](https://github.com/triton-inference-server/python_backend)
* [Triton Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/README.md)

***

## GPU recommendations for Clore.ai

| Use case             | Recommended GPU | Approximate cost on Clore.ai |
| -------------------- | --------------- | ---------------------------- |
| Development/Testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr               |
| Production inference | RTX 4090 (24GB) | \~$0.70/gpu/hr               |
| Large models (70B+)  | A100 80GB       | \~$1.20/gpu/hr               |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour, with no commitment and full root access.

