# BentoML

**BentoML** is a modern, open-source framework for **building, deploying, and scaling AI applications**. It bridges the gap between ML experimentation and production deployment, letting you package any model from any framework into a production-ready API service in minutes. Run BentoML on Clore.ai's GPU cloud for cost-efficient AI application hosting.

***

## What is BentoML?

BentoML makes it easy to turn a trained model into a scalable API service:

* **Framework-agnostic:** PyTorch, TensorFlow, JAX, scikit-learn, HuggingFace, XGBoost, LightGBM, and more
* **Bento:** a self-contained, reproducible artifact (model + code + dependencies)
* **Runner:** a scalable model inference unit with automatic batching
* **Service:** a FastAPI-like HTTP/gRPC service definition
* **BentoCloud:** an optional managed deployment platform
* **Docker-first:** every Bento can be containerized with a single command

**Key features:**

* Adaptive micro-batching for throughput optimization
* Built-in input/output validation with Pydantic
* Automatically generated OpenAPI spec
* Built-in Prometheus metrics
* Streaming response support (LLMs)

***

## Prerequisites

| Requirement | Minimum    | Recommended     |
| ----------- | ---------- | --------------- |
| GPU VRAM    | 8 GB       | 16–24 GB        |
| GPU         | Any NVIDIA | RTX 4090 / A100 |
| RAM         | 8 GB       | 16 GB           |
| Storage     | 20 GB      | 40 GB           |
| Python      | 3.9+       | 3.11+           |

***

## Step 1: Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and select a GPU instance with ≥ 16 GB VRAM.
3. Set the Docker image: we will use a custom build (see Step 2).
4. Set the open ports: `22` (SSH) and `3000` (BentoML service).
5. Click **Rent**.

***

## Step 2: Dockerfile

BentoML does not publish an official GPU Docker image, so we build one:

```dockerfile
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    git wget curl \
    openssh-server \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install BentoML and common ML libraries
RUN pip install --upgrade pip && \
    pip install \
        bentoml \
        transformers \
        accelerate \
        diffusers \
        Pillow \
        numpy \
        scipy \
        tritonclient[all]

WORKDIR /workspace

EXPOSE 22 3000

CMD service ssh start && tail -f /dev/null
```

### Build and push

Build the image and push it to your Docker Hub account (replace `YOUR_DOCKERHUB_USERNAME` with your actual username):

```bash
docker build -t YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest .
docker push YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest
```

{% hint style="info" %}
BentoML does not provide an official GPU Docker image on Docker Hub. The `bentoml/bento-server` images on Docker Hub are meant for serving pre-packaged Bentos and do not include CUDA support. For GPU-enabled deployments on Clore.ai, build the image from the Dockerfile above.
{% endhint %}

***

## Step 3: Connect via SSH

```bash
ssh root@<clore-host> -p <assigned-ssh-port>
```

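Verify the BentoML installation. The services in this guide use the Runner-based API (`bentoml.Runner`, `bentoml.io`), which newer BentoML releases deprecate, so note the version reported here: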

```bash
bentoml --version
# Expected: bentoml, version 1.x.x
```

***

## Step 4: Your first BentoML service

### Simple text classifier

Create a service file:

```bash
mkdir -p /workspace/my-service
cat > /workspace/my-service/service.py << 'EOF'
import bentoml
from bentoml.io import JSON, Text
import numpy as np

# Define a runnable (the model unit)
class TextClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True
    
    def __init__(self):
        import torch
        from transformers import pipeline
        
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return results

# Create the runner
classifier_runner = bentoml.Runner(
    TextClassifierRunnable,
    name="text_classifier",
    max_batch_size=32,
    max_latency_ms=100,
)

# Define the service
svc = bentoml.Service(
    name="text_classifier_service",
    runners=[classifier_runner],
)

@svc.api(input=Text(), output=JSON())
async def classify(text: str) -> dict:
    """Classify the sentiment of the input text."""
    results = await classifier_runner.classify.async_run([text])
    return results[0]
EOF
```

### Start the service

```bash
cd /workspace/my-service

bentoml serve service:svc \
    --host 0.0.0.0 \
    --port 3000 \
    --reload
```

{% hint style="info" %}
The `--reload` flag enables hot reloading during development. Remove it in production for stability.
{% endhint %}

***

## Step 5: Access the service

Open the auto-generated Swagger UI:

```
http://<clore-host>:<public-port-3000>
```

Or test it with `curl`:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: text/plain" \
    -d "This GPU cloud service is amazing!"
```

Expected response:

```json
{"label": "POSITIVE", "score": 0.9986}
```
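
The same request can be issued from Python. The following is a minimal sketch using only the standard library; `build_classify_request` and `classify` are hypothetical helper names, and the host and port arguments stand in for the address and public port that Clore.ai assigns to your order.

```python
import json
import urllib.request


def build_classify_request(host: str, port: int, text: str) -> urllib.request.Request:
    """Build a POST request equivalent to the curl call above."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/classify",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def classify(host: str, port: int, text: str, timeout: float = 30.0) -> dict:
    """Send the text to the service and decode the JSON reply."""
    request = build_classify_request(host, port, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect before anything goes over the network.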

***

## Step 6: Image classification service

### Vision model service

```python
# /workspace/vision-service/service.py
import bentoml
from bentoml.io import Image, JSON
from PIL import Image as PILImage
import numpy as np

class ImageClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        import torch
        import torchvision.transforms as transforms
        from torchvision.models import resnet50, ResNet50_Weights
        
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        weights = ResNet50_Weights.DEFAULT
        self.model = resnet50(weights=weights).to(self.device)
        self.model.eval()
        self.preprocess = weights.transforms()
        self.categories = weights.meta["categories"]
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, images: list) -> list[dict]:
        import torch
        
        batch = torch.stack([self.preprocess(img) for img in images]).to(self.device)
        
        with torch.no_grad():
            predictions = self.model(batch).softmax(dim=1)
        
        results = []
        for pred in predictions:
            top5 = pred.topk(5)
            results.append({
                "predictions": [
                    {"label": self.categories[idx], "score": round(score.item(), 4)}
                    for score, idx in zip(top5.values, top5.indices)
                ]
            })
        return results


image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=16,
)

svc = bentoml.Service(
    name="image_classifier_service",
    runners=[image_runner],
)

@svc.api(input=Image(), output=JSON())
async def classify(image: PILImage.Image) -> dict:
    """Classify an image with ResNet50."""
    results = await image_runner.predict.async_run([image])
    return results[0]
```

```bash
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

Test with an image:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: image/jpeg" \
    --data-binary @/path/to/image.jpg
```
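
To make the response shape concrete, here is a dependency-free sketch of the post-processing that `predict` applies to each row of model outputs: a softmax followed by a top-k selection. The function names and the three-class label list are illustrative assumptions and are not part of the service.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(logits: list[float], labels: list[str], k: int = 5) -> dict:
    """Return the k most probable labels in the service's response shape."""
    probs = softmax(logits)
    ranked = sorted(zip(probs, labels), reverse=True)[:k]
    return {
        "predictions": [
            {"label": label, "score": round(score, 4)} for score, label in ranked
        ]
    }


# Hypothetical three-class example; the real service uses the ImageNet categories.
result = top_k([2.0, 1.0, 0.1], ["cat", "dog", "fish"], k=2)
print(result)
```

The real service does the same work on the GPU with `softmax(dim=1)` and `topk(5)`, one row per image in the batch.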

***

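## Step 7: LLM streaming service

For language models. Note that this example returns the full completion in a single response; token-by-token streaming would need a generator-based endpoint, which is not shown here: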

```python
# /workspace/llm-service/service.py
import bentoml
from bentoml.io import JSON, Text
from typing import AsyncGenerator

class LLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        
        model_name = "microsoft/phi-2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    @bentoml.Runnable.method(batchable=False)
    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        import torch
        
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


llm_runner = bentoml.Runner(LLMRunnable, name="llm")

svc = bentoml.Service("llm_service", runners=[llm_runner])

@svc.api(input=JSON(), output=Text())
async def generate(body: dict) -> str:
    prompt = body.get("prompt", "")
    max_tokens = body.get("max_tokens", 200)
    return await llm_runner.generate.async_run(prompt, max_tokens)
```
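
The endpoint reads `prompt` and `max_tokens` from the JSON body using `dict.get` defaults. A stricter variant might validate the body before it reaches the runner. The sketch below assumes a hypothetical helper `parse_generate_body` and an arbitrary upper bound of 1024 tokens; neither comes from BentoML.

```python
def parse_generate_body(body: dict, default_max_tokens: int = 200,
                        max_allowed_tokens: int = 1024) -> tuple[str, int]:
    """Extract and validate the fields used by the generate endpoint."""
    prompt = body.get("prompt", "")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("'prompt' must be a non-empty string")

    max_tokens = body.get("max_tokens", default_max_tokens)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        raise ValueError("'max_tokens' must be an integer")

    # Clamp to a bounded range so one request cannot monopolize the GPU.
    max_tokens = max(1, min(max_tokens, max_allowed_tokens))
    return prompt, max_tokens
```

A request body such as `{"prompt": "Explain GPUs", "max_tokens": 100}` would pass through unchanged, while an oversized `max_tokens` is clamped to the bound.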

***

## Step 8: Save the model and build a Bento

A **Bento** is a packaged, reproducible artifact:

```python
# /workspace/build_bento.py
import bentoml

# Save the model to the BentoML model store
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()

saved_model = bentoml.pytorch.save_model(
    name="resnet50",
    model=model,
    labels={"framework": "pytorch", "task": "image-classification"},
    metadata={"accuracy": 0.80, "dataset": "ImageNet"}
)
print(f"Model saved: {saved_model.tag}")
```

```bash
python /workspace/build_bento.py

# List saved models
bentoml models list

# Build the Bento (requires bentofile.yaml)
bentoml build
```

### bentofile.yaml

```yaml
service: "service:svc"
labels:
  owner: "ml-team"
  stage: "production"
include:
  - "*.py"
python:
  packages:
    - torch
    - torchvision
    - transformers
    - Pillow
    - numpy
docker:
  python_version: "3.11"
  cuda_version: "12.1"
  system_packages:
    - libgl1
```

```bash
bentoml build

# List built Bentos
bentoml list

# Containerize
bentoml containerize image_classifier_service:latest \
    --image-tag YOUR_DOCKERHUB_USERNAME/my-bento:latest
```

***

## Monitoring and metrics

BentoML exposes Prometheus metrics at `/metrics`:

```bash
curl http://<clore-host>:<public-port-3000>/metrics
```

Key metrics:

```
# Request rate
bentoml_service_request_total{endpoint="classify", http_status_code="200"}
# Latency
bentoml_service_request_duration_seconds{endpoint="classify"}
# Runner throughput
bentoml_runner_request_total{runner_name="image_classifier"}
```
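
The `/metrics` endpoint returns the Prometheus text exposition format, which is simple enough to parse for quick checks without a full Prometheus setup. The sketch below is illustrative only: `sum_metric` is a hypothetical helper, the sample text is made up for the example, and the parser handles only plain `name{labels} value` lines without timestamps.

```python
def sum_metric(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of one metric across its label combinations."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample value is the last whitespace-separated field.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


# Made-up sample in the exposition format, using metric names from this guide.
sample = """\
# TYPE bentoml_service_request_total counter
bentoml_service_request_total{endpoint="classify",http_status_code="200"} 41.0
bentoml_service_request_total{endpoint="classify",http_status_code="500"} 1.0
bentoml_runner_request_total{runner_name="image_classifier"} 40.0
"""
print(sum_metric(sample, "bentoml_service_request_total"))
```

In practice the text would come from the `curl` call above; for dashboards and alerting, a Prometheus server scraping the endpoint is the more robust route.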

***

## Adaptive batching configuration

```python
# Fine-tune batching behavior
image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=64,          # maximum requests per batch
    max_latency_ms=50,          # maximum wait before dispatching
)
```
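
To illustrate what these two parameters trade off, the following is a simplified, standard-library sketch of micro-batching: requests queue up and are dispatched either when `max_batch_size` items are waiting or when `max_latency_ms` has elapsed since the first one arrived. This is not BentoML's implementation, which adapts its behavior dynamically; the `MicroBatcher` class and its methods are hypothetical, and the sketch treats `max_latency_ms` purely as a wait deadline.

```python
import asyncio


class MicroBatcher:
    """Group concurrent requests into batches for a batch function."""

    def __init__(self, batch_fn, max_batch_size: int = 64, max_latency_ms: int = 50):
        self.batch_fn = batch_fn            # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait = max_latency_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Collect requests into batches and dispatch them until cancelled."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]        # block until the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batch_sizes = []

    def double_all(items):
        batch_sizes.append(len(items))      # record how requests were grouped
        return [2 * x for x in items]

    batcher = MicroBatcher(double_all, max_batch_size=4, max_latency_ms=50)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

A larger `max_batch_size` raises throughput on a busy GPU, while a smaller `max_latency_ms` caps how long a lone request waits for company.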

***

## Troubleshooting

### Service won't start

```
ERROR - Failed to initialize runner
```

**Solutions:**

* Check CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
* Verify GPU VRAM: `nvidia-smi`
* Check whether the model download has finished (look for download progress in the logs)

### Port 3000 is not reachable

```bash
# Make sure the service binds to 0.0.0.0 (not localhost)
bentoml serve service:svc --host 0.0.0.0 --port 3000
```
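
Whether the port accepts connections can also be checked from your local machine. This is a small sketch with a hypothetical helper, `is_port_open`; the host and port you pass in are the public address and mapped port shown for your Clore.ai order.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

If this returns `False` from outside while `curl` against `localhost:3000` works inside the container, the usual suspects are the bind address or the port mapping of the order.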

### High latency on the first request

This is expected: the first request triggers model loading (warm-up), and all subsequent requests will be faster. Add a warm-up endpoint call after startup:

```bash
# Warm up after startup by sending one real inference request
sleep 10 && curl -s -o /dev/null -X POST http://localhost:3000/classify \
    -H "Content-Type: text/plain" -d "warm-up"
```
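
A fixed `sleep 10` can be too short for large models and needlessly long for small ones. As an alternative, this sketch polls until the server answers before sending the warm-up request. `wait_until_ready` and `healthz_ok` are hypothetical helpers; the probe uses BentoML's `/healthz` endpoint on the local port from this guide.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 60.0,
                     interval_s: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused or similar: the service is not up yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def healthz_ok() -> bool:
    """Probe the health endpoint of a service listening on localhost:3000."""
    with urllib.request.urlopen("http://localhost:3000/healthz", timeout=2) as response:
        return response.status == 200
```

Calling `wait_until_ready(healthz_ok)` after starting the server returns `True` as soon as the health check passes, at which point a single real inference request completes the warm-up.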

### Import errors

```
ModuleNotFoundError: No module named 'transformers'
```

**Solution:**

```bash
pip install transformers accelerate
```
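
To find every missing dependency in one pass instead of hitting them one at a time, a short check like the following can be run inside the container. It is a sketch: `missing_modules` is a hypothetical helper, and the list of module names is an assumption based on the imports used in this guide.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used by the services in this guide (Pillow is imported as PIL).
required = ["bentoml", "torch", "torchvision", "transformers", "accelerate", "PIL"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

`find_spec` only locates the modules without importing them, so the check is fast even for heavy packages such as `torch`.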

***

## Clore.ai GPU recommendations

BentoML is a serving framework, so GPU requirements depend entirely on the model you deploy. Expect roughly the following for common workloads:

| GPU       | VRAM  | Clore.ai price | LLM (7B Q4) throughput | Diffusion (SDXL) | Vision (ResNet50) |
| --------- | ----- | -------------- | ---------------------- | ---------------- | ----------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | \~80 tok/s             | \~4 img/min      | \~400 req/s       |
| RTX 4090  | 24 GB | \~$0.70/hr     | \~140 tok/s            | \~8 img/min      | \~700 req/s       |
| A100 40GB | 40 GB | \~$1.20/hr     | \~110 tok/s            | \~6 img/min      | \~1200 req/s      |
| A100 80GB | 80 GB | \~$2.00/hr     | \~130 tok/s            | \~7 img/min      | \~1400 req/s      |

**Use case guidance:**

* **LLM API serving (7B–13B):** RTX 3090 (\~$0.12/hr), the best price-performance
* **Image generation APIs:** RTX 3090 or RTX 4090, depending on throughput requirements
* **Large models (34B–70B Q4):** A100 40GB (\~$1.20/hr), fits comfortably
* **Production multi-model serving:** A100 80GB for memory headroom

{% hint style="info" %}
BentoML's **adaptive micro-batching** is particularly effective on A100s: the hardware scheduler handles batching efficiently, yielding more throughput per dollar than simple single-request serving. For high-traffic APIs, an A100 40GB often gives better ROI than two RTX 4090s.
{% endhint %}
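
The price and throughput columns in the table can be combined into a rough cost per million generated tokens. The sketch below uses the approximate table figures as-is and assumes the GPU is fully utilized by a single stream; it ignores batching gains, idle time, and storage or bandwidth costs, so treat the output as a first-order comparison only.

```python
# Approximate figures copied from the table above: (price in $/hour, tokens/second).
gpus = {
    "RTX 3090": (0.12, 80),
    "RTX 4090": (0.70, 140),
    "A100 40GB": (1.20, 110),
    "A100 80GB": (2.00, 130),
}


def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars spent to generate one million tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


for name, (price, tps) in gpus.items():
    print(f"{name}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```

Because batched serving raises effective throughput well above the single-stream numbers, the ranking can shift for high-traffic APIs; rerun the calculation with throughput measured on your own workload.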

***

## Useful resources

* [BentoML official documentation](https://docs.bentoml.com)
* [BentoML GitHub](https://github.com/bentoml/BentoML)
* [BentoML examples](https://github.com/bentoml/BentoML/tree/main/examples)
* [BentoML Slack community](https://l.bentoml.com/join-slack-space)
* [BentoML gallery](https://www.bentoml.com/gallery)
* [Quickstart: Serving LLMs](https://docs.bentoml.com/en/latest/get-started/quickstart.html)

