# Triton Inference Server

**NVIDIA Triton Inference Server** is a production-grade, open-source inference serving platform that supports nearly every major ML framework. Designed for high throughput and low latency, Triton serves PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, and other models from a single server process. Deploy it on Clore.ai's GPU cloud for scalable, cost-efficient inference infrastructure.

***

## What Is Triton Inference Server?

Triton is NVIDIA's answer to the challenge of serving ML models at scale:

* **Multi-framework:** PyTorch, TensorFlow, TensorRT, ONNX, OpenVINO, and custom Python backends
* **Concurrent execution:** multiple models, and multiple instances of each model, on the same GPU
* **Dynamic batching:** automatically groups individual requests into batches for higher throughput
* **gRPC + HTTP:** industry-standard protocols out of the box
* **Metrics:** Prometheus-compatible metrics endpoint
* **Model repository:** file-system-based model management

**Ports used:**

| Port | Protocol | Purpose            |
| ---- | -------- | ------------------ |
| 8000 | HTTP     | REST inference API |
| 8001 | gRPC     | gRPC inference API |
| 8002 | HTTP     | Prometheus metrics |

***

## Prerequisites

| Requirement | Minimum                              | Recommended     |
| ----------- | ------------------------------------ | --------------- |
| GPU VRAM    | 8 GB                                 | 16–24 GB        |
| GPU         | Any NVIDIA GPU with CUDA 11+ support | RTX 4090 / A100 |
| RAM         | 16 GB                                | 32 GB           |
| Storage     | 20 GB                                | 50 GB           |

{% hint style="info" %}
Triton also supports CPU-only inference for workloads that do not need CUDA. To save cost on batch jobs that do not require a GPU, use a CPU-only variant of the Docker image where one is available for your release.
{% endhint %}

***

## Step 1: Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and filter by VRAM ≥ 16 GB.
3. Select a server and click **Configure**.
4. Set the Docker image: **`nvcr.io/nvidia/tritonserver:24.01-py3`**
   * (Replace `24.01` with the latest release; check the [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).)
5. Set the open ports: `22` (SSH), `8000` (HTTP), `8001` (gRPC), `8002` (metrics).
6. Click **Rent**.

{% hint style="warning" %}
Triton Docker images are large (\~15–20 GB). Allow 3–5 minutes for the initial pull on first launch. Subsequent starts are faster.
{% endhint %}

***

## Step 2: Custom Dockerfile (with SSH)

The official Triton image does not include an SSH server. Use this Dockerfile to add one:

```dockerfile
FROM nvcr.io/nvidia/tritonserver:24.01-py3

RUN apt-get update && apt-get install -y \
    openssh-server \
    wget curl \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH. The password below is a placeholder: change it before deploying,
# or switch to key-based authentication.
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install the Python client libraries
RUN pip install "tritonclient[all]" numpy Pillow

RUN mkdir -p /models

EXPOSE 22 8000 8001 8002

CMD service ssh start && \
    tritonserver \
        --model-repository=/models \
        --log-verbose=0 \
        --http-port=8000 \
        --grpc-port=8001 \
        --metrics-port=8002
```

***

## Step 3: Understand the Model Repository

Triton loads models from a **model repository**, a directory with a specific structure:

```
/models/
├── model_name_1/
│   ├── config.pbtxt          # model configuration
│   ├── 1/                    # version 1
│   │   └── model.pt          # model file
│   └── 2/                    # version 2 (optional)
│       └── model.pt
├── model_name_2/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
```

Each model needs:

1. A directory named after the model
2. A `config.pbtxt` configuration file
3. At least one numeric version subdirectory (e.g., `1/`) containing the model file
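
The three requirements above can be checked before starting the server. The following is a minimal sketch, not part of Triton: `check_model_repository` is a hypothetical helper, and it assumes file-based models (such as `model.pt` or `model.onnx`) and an explicit `config.pbtxt` for every model.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def check_model_repository(repo: Path) -> Dict[str, List[str]]:
    """Map each model directory to a list of layout problems (empty list = looks fine)."""
    report: Dict[str, List[str]] = {}
    for model_dir in sorted(p for p in repo.iterdir() if p.is_dir()):
        issues: List[str] = []
        if not (model_dir / "config.pbtxt").is_file():
            issues.append("missing config.pbtxt")
        # Version directories carry integer names such as 1/ or 2/.
        versions = sorted(
            (d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()),
            key=lambda d: int(d.name),
        )
        if not versions:
            issues.append("no numeric version directory")
        for version in versions:
            if not any(f.is_file() for f in version.iterdir()):
                issues.append(f"version {version.name} contains no model file")
        report[model_dir.name] = issues
    return report


# Demo on a throwaway repository: one complete model, one incomplete one.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "resnet50" / "1").mkdir(parents=True)
    (repo / "resnet50" / "config.pbtxt").write_text('name: "resnet50"\n')
    (repo / "resnet50" / "1" / "model.pt").write_bytes(b"")
    (repo / "broken" / "1").mkdir(parents=True)  # no config, no model file
    report = check_model_repository(repo)

for name, issues in report.items():
    print(name, "->", issues or "OK")
```

On a real instance, the same function could be pointed at `Path("/models")` before launching `tritonserver`.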

***

## Step 4: Deploy a PyTorch Model

### Export the Model to TorchScript

```python
import torch
import torchvision

# Load a pretrained ResNet50.
# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13).
model = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1
)
model.eval()

# Export to TorchScript by tracing with an example input
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Save
traced_model.save("/tmp/resnet50.pt")
print("Model exported successfully")
```

### Set Up the Model Repository

```bash
# SSH into your Clore.ai instance
ssh root@<clore-host> -p <port>

# Create the directory structure (on the instance)
mkdir -p /models/resnet50/1

# Upload the model
# (run this from your local machine, not inside the SSH session)
scp -P <port> /tmp/resnet50.pt root@<clore-host>:/models/resnet50/1/model.pt
```

### Create config.pbtxt

```bash
cat > /models/resnet50/config.pbtxt << 'EOF'
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 32

input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
EOF
```
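
Because `max_batch_size` is greater than zero, the `dims` in this configuration exclude the batch dimension: a client sends tensors shaped `[batch, 3, 224, 224]`, not `[3, 224, 224]`. The sketch below illustrates that relationship; `validate_request_shape` is a hypothetical helper written for this guide, not a Triton API.

```python
from typing import Sequence


def validate_request_shape(
    shape: Sequence[int], dims: Sequence[int], max_batch_size: int
) -> bool:
    """Check a request tensor shape against `dims` and `max_batch_size` from config.pbtxt."""
    if max_batch_size > 0:
        # Batching enabled: the request carries one extra leading batch dimension.
        if len(shape) != len(dims) + 1:
            return False
        batch, rest = shape[0], list(shape[1:])
        if not 1 <= batch <= max_batch_size:
            return False
    else:
        # Batching disabled: the request shape must match dims directly.
        rest = list(shape)
        if len(rest) != len(dims):
            return False
    # A dims entry of -1 marks a variable-size dimension.
    return all(d == -1 or d == s for d, s in zip(dims, rest))


# Values taken from the resnet50 configuration above.
print(validate_request_shape((1, 3, 224, 224), (3, 224, 224), 32))
print(validate_request_shape((64, 3, 224, 224), (3, 224, 224), 32))
print(validate_request_shape((3, 224, 224), (3, 224, 224), 32))
```

Only the first call passes: the second exceeds `max_batch_size`, and the third omits the batch dimension.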

***

## Step 5: Deploy an ONNX Model

### Export to ONNX

```python
import torch
import torchvision
import torch.onnx

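# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
model = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1
)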
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "/tmp/resnet50.onnx",
    opset_version=13,
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={
        "images": {0: "batch_size"},
        "logits": {0: "batch_size"}
    }
)
```

### ONNX Config

```bash
# On the instance: create the model directory
mkdir -p /models/resnet50_onnx/1
# From your local machine: upload the exported model
scp -P <port> /tmp/resnet50.onnx root@<clore-host>:/models/resnet50_onnx/1/model.onnx

cat > /models/resnet50_onnx/config.pbtxt << 'EOF'
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}
EOF
```

***

## Step 6: Deploy a Python Custom Backend

For models that do not fit the standard backends (custom preprocessing, ensemble logic):

```bash
mkdir -p /models/custom_model/1

cat > /models/custom_model/1/model.py << 'EOF'
import triton_python_backend_utils as pb_utils
import numpy as np
import torch

class TritonPythonModel:
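    def initialize(self, args):
        # Assumes the `torch` Python package is installed in the container
        # and that a GPU is visible to this model instance.
        self.model = torch.nn.Linear(10, 5).cuda()
        self.model.eval()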
    
    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            input_np = input_tensor.as_numpy()
            
            with torch.no_grad():
                inp = torch.from_numpy(input_np).float().cuda()
                out = self.model(inp).cpu().numpy()
            
            output_tensor = pb_utils.Tensor("OUTPUT", out.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        
        return responses
    
    def finalize(self):
        pass
EOF

cat > /models/custom_model/config.pbtxt << 'EOF'
name: "custom_model"
backend: "python"
max_batch_size: 64

input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [10]
  }
]

output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [5]
  }
]
EOF
```

***

## Step 7: Start and Test Triton

### Start the Triton Server

```bash
# Start the server (if you use the Dockerfile CMD, it starts automatically)
tritonserver \
    --model-repository=/models \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002 \
    --log-verbose=0 &

# Give the server time to load the models, then check readiness
sleep 5
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
# Expected: 200 (the endpoint returns an empty body; the status code signals readiness)
```
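
Model loading time varies with model size, so polling the readiness endpoint can be more robust than a fixed wait. The following is a minimal sketch using only the Python standard library; `wait_until_ready` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll GET /v2/health/ready until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable or not ready yet; retry after a pause
        time.sleep(interval_s)
    return False


# Example call on the instance (commented out so the snippet has no side effects):
# if not wait_until_ready("http://localhost:8000"):
#     raise SystemExit("Triton did not become ready in time")
```

The function returns `False` rather than raising, so the caller decides whether a slow start is fatal.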

### Check Available Models

```bash
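# List the models in the repository and their load state (model repository extension)
curl -X POST http://<clore-host>:<public-port-8000>/v2/repository/index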
```

### Run Inference over HTTP

```python
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(
    url="<clore-host>:<public-port-8000>",
    ssl=False
)

# Check server and model health
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("resnet50_onnx"))

# Build the input (random data standing in for a preprocessed image batch)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = httpclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

# Run inference
outputs = [httpclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
predicted_class = np.argmax(logits[0])
print(f"Predicted class: {predicted_class}")
```
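
The `logits` returned above are raw scores, not probabilities. A softmax followed by a top-k selection turns them into a ranked list of class indices. The sketch below uses plain Python so it runs without extra dependencies; `top_k` is a hypothetical helper, and with the client code above it would be called as `top_k(logits[0].tolist())`.

```python
import math
from typing import List, Sequence, Tuple


def top_k(logits: Sequence[float], k: int = 5) -> List[Tuple[int, float]]:
    """Return the k most likely (class_index, probability) pairs via a stable softmax."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy logits for three classes (illustrative values, not model output).
example = top_k([2.0, 1.0, 0.1], k=2)
for index, prob in example:
    print(f"class {index}: {prob:.3f}")
```

Mapping class indices to human-readable labels requires the label file that matches the model's training set, which this guide does not cover.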

### Run Inference over gRPC

```python
import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(
    url="<clore-host>:<public-port-8001>"
)

image = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_tensor = grpcclient.InferInput("images", image.shape, "FP32")
input_tensor.set_data_from_numpy(image)

outputs = [grpcclient.InferRequestedOutput("logits")]
response = client.infer("resnet50_onnx", [input_tensor], outputs=outputs)

logits = response.as_numpy("logits")
print(f"Output shape: {logits.shape}")
```

***

## Monitoring with Prometheus

Triton exposes metrics on port 8002:

```bash
curl http://<clore-host>:<public-port-8002>/metrics
```

Key metrics:

```
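# Successful inference requests (cumulative counter; its rate gives throughput)
nv_inference_request_success{model="resnet50_onnx", version="1"}
# Cumulative inference compute time in microseconds (divide by a request count for an average)
nv_inference_compute_infer_duration_us{model="resnet50_onnx", version="1"}
# GPU utilization (0.0 to 1.0)
nv_gpu_utilization{gpu_uuid="..."}
# GPU memory in use, in bytes
nv_gpu_memory_used_bytes{gpu_uuid="..."}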
```
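
Both inference metrics are cumulative, so an average has to be derived from them. The following sketch parses the Prometheus text format and divides total compute time by the number of successful requests. The sample values are made up for illustration, and `metric_value` and `mean_compute_us` are hypothetical helpers; the result is an approximation of per-request compute time, not an official Triton statistic.

```python
import re
from typing import Optional

# Made-up sample in Prometheus text format; real data comes from the /metrics endpoint.
SAMPLE = """\
nv_inference_request_success{model="resnet50_onnx",version="1"} 200
nv_inference_compute_infer_duration_us{model="resnet50_onnx",version="1"} 500000
"""

LINE = re.compile(r"^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")


def metric_value(text: str, name: str, model: str) -> Optional[float]:
    """Return the value of the first sample matching the metric name and model label."""
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match and match["name"] == name and f'model="{model}"' in match["labels"]:
            return float(match["value"])
    return None


def mean_compute_us(text: str, model: str) -> Optional[float]:
    """Approximate mean compute time per successful request, in microseconds."""
    count = metric_value(text, "nv_inference_request_success", model)
    total = metric_value(text, "nv_inference_compute_infer_duration_us", model)
    if not count or total is None:
        return None  # no requests yet, or metric missing
    return total / count


avg = mean_compute_us(SAMPLE, "resnet50_onnx")
print(f"mean compute time: {avg} us")
```

In a Prometheus setup the same ratio is usually computed over a time window from the rates of the two counters, which avoids averaging over the whole server lifetime.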

***

## Dynamic Batching Configuration

```protobuf
dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]
  max_queue_delay_microseconds: 5000
  preserve_ordering: true
  
  priority_levels: 3
  default_priority_level: 2
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 10000
    allow_timeout_override: true
    max_queue_size: 100
  }
}
```
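
`max_queue_delay_microseconds` bounds how long a request may wait for others to join its batch, which trades latency for larger batches. The sketch below is a deliberately simplified model of that trade-off and not Triton's actual scheduler: it ignores preferred batch sizes, priorities, and instance availability, and `simulate_batches` is a hypothetical function.

```python
from typing import List, Sequence


def simulate_batches(
    arrivals_us: Sequence[int], max_batch: int, max_delay_us: int
) -> List[List[int]]:
    """Group sorted arrival times (microseconds) into batches.

    A batch closes when it is full, or when the next request arrives more than
    max_delay_us after the oldest request in the batch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in arrivals_us:
        if current and (len(current) == max_batch or t - current[0] > max_delay_us):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Illustrative arrival times in microseconds (assumed, not measured).
arrivals = [0, 1000, 2000, 9000, 9500, 20000]

long_delay = simulate_batches(arrivals, max_batch=4, max_delay_us=5000)
short_delay = simulate_batches(arrivals, max_batch=4, max_delay_us=100)
print("5000 us delay:", [len(b) for b in long_delay])
print(" 100 us delay:", [len(b) for b in short_delay])
```

Under this toy model, the longer delay lets sparse requests share a batch, while the short delay sends almost every request alone. Real behavior depends on load and should be measured, for example with Triton's Performance Analyzer.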

***

## Troubleshooting

### Model Load Failure

```
failed to load model: model file not found
```

**Solution:** Check the directory structure and permissions:

```bash
ls -la /models/resnet50/1/
# Should contain model.pt (PyTorch) or model.onnx (ONNX)
chmod -R 755 /models/
```

### CUDA Mismatch

**Solution:** Choose a Triton image release that is compatible with the NVIDIA driver on the host:

```bash
nvidia-smi  # note the CUDA version reported by the driver
# Use a matching tritonserver tag, for example 23.10 for CUDA 12.2
```

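### Port Unreachable

**Solution:** Make sure all three ports (8000, 8001, 8002) are forwarded in Clore.ai. The HTTP port can be tested with the liveness endpoint shown here; port 8002 serves `/metrics` instead, and port 8001 speaks gRPC, so it does not answer a plain `curl` request: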

```bash
curl http://<host>:<port>/v2/health/live
```
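
A TCP-level check works for all three ports, including the gRPC port, because it only tests whether a connection can be opened. The following is a minimal sketch with the standard library; `port_open` is a hypothetical helper, and the host and port values in the commented usage are placeholders.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example usage (replace the placeholders with the values from your Clore.ai order):
# for name, port in {"http": 8000, "grpc": 8001, "metrics": 8002}.items():
#     print(name, port_open("<clore-host>", port))
```

An open port only shows that forwarding works; it does not confirm that Triton itself is healthy behind it.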

### OOM While Loading Models

**Solution:** Reduce the instance count, or use CPU instances (`kind: KIND_CPU`) for some models:

```protobuf
instance_group [
  {
    count: 1       # lower this if you previously set a higher value
    kind: KIND_GPU
  }
]
```

***

## Cost Estimates

| GPU       | VRAM  | Approx. price | Throughput (ResNet50) |
| --------- | ----- | ------------- | --------------------- |
| RTX 3080  | 10 GB | \~$0.10/hr    | \~500 req/sec         |
| RTX 4090  | 24 GB | \~$0.35/hr    | \~1500 req/sec        |
| A100 40GB | 40 GB | \~$0.80/hr    | \~3000 req/sec        |
| H100      | 80 GB | \~$2.50/hr    | \~8000 req/sec        |
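
Hourly price and throughput combine into a cost per request, which is often the more useful figure for comparing GPUs. The sketch below works through that arithmetic using the approximate numbers from the table and assumes the server runs at the listed throughput for the whole hour, which real traffic rarely sustains.

```python
# Approximate figures copied from the table: (price in USD per hour, requests per second).
ESTIMATES = {
    "RTX 3080": (0.10, 500),
    "RTX 4090": (0.35, 1500),
    "A100 40GB": (0.80, 3000),
    "H100": (2.50, 8000),
}


def cost_per_million_requests(price_per_hour: float, requests_per_second: float) -> float:
    """USD per one million requests at sustained full throughput."""
    requests_per_hour = requests_per_second * 3600
    return price_per_hour / requests_per_hour * 1_000_000


for gpu, (price, rps) in ESTIMATES.items():
    cost = cost_per_million_requests(price, rps)
    print(f"{gpu}: ~${cost:.3f} per million requests")
```

With these estimates the cheaper cards come out ahead per request, so the larger GPUs are mainly justified by latency targets, model size, or peak load rather than by unit cost.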

***

## Useful Resources

* [Triton GitHub](https://github.com/triton-inference-server/server)
* [NGC container registry](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
* [Triton client libraries](https://github.com/triton-inference-server/client)
* [Triton model configuration reference](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html)
* [Triton Python backend](https://github.com/triton-inference-server/python_backend)
* [Triton Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/README.md)

***

## Clore.ai GPU Recommendations

| Use case             | Recommended GPU | Approx. cost on Clore.ai |
| -------------------- | --------------- | ------------------------ |
| Development/testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr           |
| Production inference | RTX 4090 (24GB) | \~$0.70/gpu/hr           |
| Large models (70B+)  | A100 80GB       | \~$1.20/gpu/hr           |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse the available GPUs and rent by the hour, with no commitment and full root access.

