# BentoML

**BentoML** is a modern, open-source framework for **building, shipping, and scaling AI applications**. It bridges the gap between ML experimentation and production deployment, letting you package any model from any framework into a production-ready API service in minutes. Run BentoML on Clore.ai's GPU cloud for cost-efficient AI application hosting.

***

## What is BentoML?

BentoML makes it easy to take a trained model and turn it into a scalable API service:

* **Framework-agnostic:** PyTorch, TensorFlow, JAX, scikit-learn, HuggingFace, XGBoost, LightGBM, and more
* **Bento:** A self-contained, reproducible artifact (model + code + dependencies)
* **Runner:** Scalable model inference unit with automatic batching
* **Service:** FastAPI-like HTTP/gRPC service definition
* **BentoCloud:** Optional managed deployment platform
* **Docker-first:** Every Bento can be containerized with one command

**Key features:**

* Adaptive micro-batching for throughput optimization
* Built-in input/output validation with Pydantic (see the sketch below)
* Auto-generated OpenAPI spec and Swagger UI
* Built-in Prometheus metrics
* Streaming response support for LLMs
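
As a quick illustration of the validation feature, here is a minimal sketch (service and field names are illustrative) of an endpoint whose JSON input is checked against a Pydantic model before your handler runs:

```python
import bentoml
from bentoml.io import JSON
from pydantic import BaseModel

# Hypothetical request schema; requests that fail validation
# are rejected before reaching the handler
class ScoreRequest(BaseModel):
    text: str
    threshold: float = 0.5

svc = bentoml.Service("validated_service")

@svc.api(input=JSON(pydantic_model=ScoreRequest), output=JSON())
async def score(body: ScoreRequest) -> dict:
    # body arrives as a validated Pydantic object
    return {"length": len(body.text), "threshold": body.threshold}
```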

***

## Prerequisites

| Requirement | Minimum    | Recommended     |
| ----------- | ---------- | --------------- |
| GPU VRAM    | 8 GB       | 16–24 GB        |
| GPU         | Any NVIDIA | RTX 4090 / A100 |
| RAM         | 8 GB       | 16 GB           |
| Storage     | 20 GB      | 40 GB           |
| Python      | 3.9+       | 3.11+           |

***

## Step 1 — Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and select a GPU instance with ≥ 16 GB VRAM.
3. Set Docker image: we'll use a custom build (see Step 2).
4. Set open ports: `22` (SSH) and `3000` (BentoML service).
5. Click **Rent**.

***

## Step 2 — Dockerfile

BentoML doesn't have an official GPU Docker image, so we build one:

```dockerfile
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    git wget curl \
    openssh-server \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install BentoML and common ML libs
RUN pip install --upgrade pip && \
    pip install \
        bentoml \
        transformers \
        accelerate \
        diffusers \
        Pillow \
        numpy \
        scipy \
        "tritonclient[all]"

WORKDIR /workspace

EXPOSE 22 3000

CMD service ssh start && tail -f /dev/null
```

### Build and Push

Build the image and push it to your own Docker Hub account (replace `YOUR_DOCKERHUB_USERNAME` with your actual username):

```bash
docker build -t YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest .
docker push YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest
```

{% hint style="info" %}
BentoML does not provide an official GPU Docker image on Docker Hub. The `bentoml/bento-server` images on Docker Hub are for serving pre-packaged Bentos and do not include CUDA support. Build the image from the Dockerfile above for GPU-enabled deployments on Clore.ai.
{% endhint %}
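
Optionally, smoke-test the image locally before renting (assumes the NVIDIA Container Toolkit is installed on your local machine):

```bash
# Quick GPU smoke test; overrides the container's default CMD
docker run --rm --gpus all YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest nvidia-smi
```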

***

## Step 3 — Connect via SSH

```bash
ssh root@<clore-host> -p <assigned-ssh-port>
```

Verify BentoML:

```bash
bentoml --version
# Expected: bentoml, version 1.x.x
```
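
It is also worth confirming that PyTorch can see the GPU before writing any service code:

```bash
python -c "import torch; print(torch.cuda.is_available())"
# Expected: True
```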

***

## Step 4 — Your First BentoML Service

### Simple Text Classifier

Create a service file:

```bash
mkdir -p /workspace/my-service
cat > /workspace/my-service/service.py << 'EOF'
import bentoml
from bentoml.io import JSON, Text
import numpy as np

# Define a Runner (the model unit)
class TextClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True
    
    def __init__(self):
        import torch
        from transformers import pipeline
        
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return results

# Create Runner
classifier_runner = bentoml.Runner(
    TextClassifierRunnable,
    name="text_classifier",
    max_batch_size=32,
    max_latency_ms=100,
)

# Define Service
svc = bentoml.Service(
    name="text_classifier_service",
    runners=[classifier_runner],
)

@svc.api(input=Text(), output=JSON())
async def classify(text: str) -> dict:
    """Classify sentiment of input text."""
    results = await classifier_runner.classify.async_run([text])
    return results[0]
EOF
```

### Start the Service

```bash
cd /workspace/my-service

bentoml serve service:svc \
    --host 0.0.0.0 \
    --port 3000 \
    --reload
```

{% hint style="info" %}
The `--reload` flag enables hot-reload during development. Remove it in production for stability.
{% endhint %}

***

## Step 5 — Access the Service

Open the auto-generated Swagger UI:

```
http://<clore-host>:<public-port-3000>
```

Or test via `curl`:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: text/plain" \
    -d "This GPU cloud service is amazing!"
```

Expected response:

```json
{"label": "POSITIVE", "score": 0.9986}
```
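
The same call from Python, as a minimal sketch using `requests` (fill in your host and mapped port):

```python
import requests

# Plain-text body, matching the service's Text() input spec
resp = requests.post(
    "http://<clore-host>:<public-port-3000>/classify",
    headers={"Content-Type": "text/plain"},
    data="This GPU cloud service is amazing!",
)
print(resp.json())  # {"label": "POSITIVE", "score": ...}
```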

***

## Step 6 — Image Classification Service

### Vision Model Service

```python
# /workspace/vision-service/service.py
import bentoml
from bentoml.io import Image, JSON
from PIL import Image as PILImage
import numpy as np

class ImageClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        import torch
        import torchvision.transforms as transforms
        from torchvision.models import resnet50, ResNet50_Weights
        
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        weights = ResNet50_Weights.DEFAULT
        self.model = resnet50(weights=weights).to(self.device)
        self.model.eval()
        self.preprocess = weights.transforms()
        self.categories = weights.meta["categories"]
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, images: list) -> list[dict]:
        import torch
        
        batch = torch.stack([self.preprocess(img) for img in images]).to(self.device)
        
        with torch.no_grad():
            predictions = self.model(batch).softmax(dim=1)
        
        results = []
        for pred in predictions:
            top5 = pred.topk(5)
            results.append({
                "predictions": [
                    {"label": self.categories[idx], "score": round(score.item(), 4)}
                    for score, idx in zip(top5.values, top5.indices)
                ]
            })
        return results


image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=16,
)

svc = bentoml.Service(
    name="image_classifier_service",
    runners=[image_runner],
)

@svc.api(input=Image(), output=JSON())
async def classify(image: PILImage.Image) -> dict:
    """Classify an image with ResNet50."""
    results = await image_runner.predict.async_run([image])
    return results[0]
```

```bash
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

Test with an image:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: image/jpeg" \
    --data-binary @/path/to/image.jpg
```

***

## Step 7 — LLM Text Generation Service

For language models served behind a simple JSON endpoint. The minimal example below returns the full completion in one response (BentoML also supports token-by-token streaming, omitted here for brevity):

```python
# /workspace/llm-service/service.py
import bentoml
from bentoml.io import JSON, Text

class LLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        
        model_name = "microsoft/phi-2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    @bentoml.Runnable.method(batchable=False)
    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        import torch
        
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


llm_runner = bentoml.Runner(LLMRunnable, name="llm")

svc = bentoml.Service("llm_service", runners=[llm_runner])

@svc.api(input=JSON(), output=Text())
async def generate(body: dict) -> str:
    prompt = body.get("prompt", "")
    max_tokens = body.get("max_tokens", 200)
    return await llm_runner.generate.async_run(prompt, max_tokens)
```
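
Serve it with `bentoml serve service:svc --host 0.0.0.0 --port 3000` as before, then call the JSON endpoint:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a haiku about GPUs.", "max_tokens": 100}'
```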

***

## Step 8 — Save and Build a Bento

A **Bento** is a packaged, reproducible artifact:

```python
# /workspace/build_bento.py
import bentoml

# Save model to BentoML model store
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()

saved_model = bentoml.pytorch.save_model(
    name="resnet50",
    model=model,
    labels={"framework": "pytorch", "task": "image-classification"},
    metadata={"accuracy": 0.80, "dataset": "ImageNet"}
)
print(f"Model saved: {saved_model.tag}")
```
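
A service can then pull the model from the store by tag instead of loading weights in `__init__`. A minimal sketch, assuming the `resnet50` model saved above and standard ImageNet preprocessing:

```python
# /workspace/store-service/service.py (illustrative)
import bentoml
import torch
import torchvision.transforms as T
from bentoml.io import Image, JSON
from PIL import Image as PILImage

# Load the stored model as a runner; adaptive batching comes for free
resnet_runner = bentoml.pytorch.get("resnet50:latest").to_runner()

svc = bentoml.Service("resnet50_store_service", runners=[resnet_runner])

# Standard ImageNet preprocessing (assumed to match the saved weights)
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@svc.api(input=Image(), output=JSON())
async def predict(image: PILImage.Image) -> dict:
    batch = preprocess(image.convert("RGB")).unsqueeze(0)
    logits = await resnet_runner.async_run(batch)
    return {"class_index": int(torch.argmax(logits, dim=1)[0])}
```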

```bash
python /workspace/build_bento.py

# List saved models
bentoml models list

# Build a Bento (requires bentofile.yaml)
bentoml build
```

### bentofile.yaml

```yaml
service: "service:svc"
labels:
  owner: "ml-team"
  stage: "production"
include:
  - "*.py"
python:
  packages:
    - torch
    - torchvision
    - transformers
    - Pillow
    - numpy
docker:
  python_version: "3.11"
  cuda_version: "12.1"
  system_packages:
    - libgl1
```

```bash
bentoml build

# List built bentos
bentoml list

# Containerize
bentoml containerize image_classifier_service:latest \
    --image-tag YOUR_DOCKERHUB_USERNAME/my-bento:latest
```
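
The result is a standard Docker image that starts the API server on port 3000 by default, so running it is an ordinary `docker run`:

```bash
docker run --rm --gpus all -p 3000:3000 \
    YOUR_DOCKERHUB_USERNAME/my-bento:latest
```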

***

## Monitoring and Metrics

BentoML exposes Prometheus metrics at `/metrics`:

```bash
curl http://<clore-host>:<public-port-3000>/metrics
```

Key metrics:

```
# Request rate
bentoml_service_request_total{endpoint="classify", http_status_code="200"}
# Latency
bentoml_service_request_duration_seconds{endpoint="classify"}
# Runner throughput  
bentoml_runner_request_total{runner_name="image_classifier"}
```
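
To collect these with Prometheus, a minimal scrape config sketch (host and port are the placeholders from above):

```yaml
scrape_configs:
  - job_name: "bentoml"
    metrics_path: /metrics
    static_configs:
      - targets: ["<clore-host>:<public-port-3000>"]
```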

***

## Adaptive Batching Configuration

```python
# Fine-tune batching behavior
image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=64,          # Max requests per batch
    max_latency_ms=50,          # Max wait before dispatching
)
```
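
To see the batcher at work, send concurrent requests and watch the server logs. A quick sketch against the text classifier from Step 4, using standard shell tools:

```bash
# 64 requests, 16 at a time; the runner should group them into batches
seq 64 | xargs -P 16 -I{} curl -s -o /dev/null -X POST \
    http://localhost:3000/classify \
    -H "Content-Type: text/plain" -d "request {}"
```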

***

## Troubleshooting

### Service Won't Start

```
ERROR - Failed to initialize runner
```

**Solutions:**

* Check CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
* Verify GPU VRAM: `nvidia-smi`
* Check model download completed (look for download progress in logs)

### Port 3000 Not Accessible

```bash
# Ensure service binds to 0.0.0.0 (not localhost)
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

### High Latency on First Request

This is normal: the first request triggers lazy initialization (CUDA context setup, kernel compilation, any remaining model download), and subsequent requests are fast. Warm the service up right after it starts by sending one real request:

```bash
# Warm up after starting (adjust the endpoint and payload to your service)
sleep 10 && curl -s -o /dev/null -X POST http://localhost:3000/classify \
    -H "Content-Type: text/plain" -d "warmup"
```

### Import Errors

```
ModuleNotFoundError: No module named 'transformers'
```

**Solution:**

```bash
pip install transformers accelerate
```

***

## Clore.ai GPU Recommendations

BentoML is a serving framework — GPU requirements depend entirely on the model you deploy. Here's what to expect for common workloads:

| GPU       | VRAM  | Clore.ai Price | LLM (7B Q4) Throughput | Diffusion (SDXL) | Vision (ResNet50) |
| --------- | ----- | -------------- | ---------------------- | ---------------- | ----------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | \~80 tok/s             | \~4 img/min      | \~400 req/s       |
| RTX 4090  | 24 GB | \~$0.70/hr     | \~140 tok/s            | \~8 img/min      | \~700 req/s       |
| A100 40GB | 40 GB | \~$1.20/hr     | \~110 tok/s            | \~6 img/min      | \~1200 req/s      |
| A100 80GB | 80 GB | \~$2.00/hr     | \~130 tok/s            | \~7 img/min      | \~1400 req/s      |

**Use case guidance:**

* **LLM API serving (7B–13B):** RTX 3090 (\~$0.12/hr) — optimal price-performance
* **Image generation APIs:** RTX 3090 or RTX 4090 depending on throughput needs
* **Large models (34B–70B Q4):** A100 40GB (\~$1.20/hr) — fits comfortably
* **Production multi-model serving:** A100 80GB for memory headroom

{% hint style="info" %}
BentoML's **adaptive micro-batching** is particularly effective on A100s: the extra VRAM and compute headroom sustain larger batches, extracting more throughput per dollar than serving requests one at a time. For high-traffic APIs, a single A100 40GB often delivers better ROI than two RTX 4090s.
{% endhint %}

***

## Useful Resources

* [BentoML Official Documentation](https://docs.bentoml.com)
* [BentoML GitHub](https://github.com/bentoml/BentoML)
* [BentoML Examples](https://github.com/bentoml/BentoML/tree/main/examples)
* [BentoML Slack Community](https://l.bentoml.com/join-slack-space)
* [BentoML Gallery](https://www.bentoml.com/gallery)
* [Quickstart: Serving LLMs](https://docs.bentoml.com/en/latest/get-started/quickstart.html)

