# BentoML

**BentoML** is a modern, open-source framework for **building, shipping, and scaling AI applications**. It bridges the gap between ML experimentation and production deployment, letting you package any model from any framework into a production-ready API service in minutes. Run BentoML on Clore.ai's GPU cloud for cost-efficient AI application hosting.

***

## What is BentoML?

BentoML makes it easy to take a trained model and turn it into a scalable API service:

* **Framework-agnostic:** PyTorch, TensorFlow, JAX, scikit-learn, HuggingFace, XGBoost, LightGBM, and more
* **Bento:** A self-contained, reproducible artifact (model + code + dependencies)
* **Runner:** Scalable model inference unit with automatic batching
* **Service:** FastAPI-like HTTP/gRPC service definition
* **BentoCloud:** Optional managed deployment platform
* **Docker-first:** Every Bento can be containerized with one command

**Key features:**

* Adaptive micro-batching for throughput optimization
* Built-in input/output validation with Pydantic
* Auto-generated OpenAPI spec (Swagger UI)
* Built-in Prometheus metrics
* Streaming response support (LLMs)
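
These concepts compose in a few lines of code. Below is a minimal smoke-test service with no model and no Runner at all, using the BentoML 1.x `bentoml.io` descriptor API followed throughout this guide; the service and endpoint names are illustrative:

```python
# echo_service.py - the smallest useful BentoML 1.x service (no model, no Runner)
import bentoml
from bentoml.io import JSON

svc = bentoml.Service("echo_service")

@svc.api(input=JSON(), output=JSON())
def echo(payload: dict) -> dict:
    # BentoML validates input/output against the descriptors and exposes
    # this function as POST /echo in the auto-generated OpenAPI spec
    return {"received": payload}
```

Serve it with `bentoml serve echo_service:svc` and POST any JSON body to `/echo`.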

***

## Prerequisites

| Requirement | Minimum    | Recommended     |
| ----------- | ---------- | --------------- |
| GPU VRAM    | 8 GB       | 16–24 GB        |
| GPU         | Any NVIDIA | RTX 4090 / A100 |
| RAM         | 8 GB       | 16 GB           |
| Storage     | 20 GB      | 40 GB           |
| Python      | 3.9+       | 3.11+           |

***

## Step 1 — Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and select a GPU instance with ≥ 16 GB VRAM.
3. Set Docker image: we'll use a custom build (see Step 2).
4. Set open ports: `22` (SSH) and `3000` (BentoML service).
5. Click **Rent**.

***

## Step 2 — Dockerfile

BentoML doesn't have an official GPU Docker image, so we build one:

```dockerfile
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    git wget curl \
    openssh-server \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install BentoML and common ML libs
RUN pip install --upgrade pip && \
    pip install \
        bentoml \
        transformers \
        accelerate \
        diffusers \
        Pillow \
        numpy \
        scipy \
        "tritonclient[all]"

WORKDIR /workspace

EXPOSE 22 3000

CMD service ssh start && tail -f /dev/null
```

### Build and Push

Build the image and push it to your own Docker Hub account (replace `YOUR_DOCKERHUB_USERNAME` with your actual username):

```bash
docker build -t YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest .
docker push YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest
```

{% hint style="info" %}
BentoML does not provide an official GPU Docker image on Docker Hub. The `bentoml/bento-server` images on Docker Hub are for serving pre-packaged Bentos and do not include CUDA support. Build the image from the Dockerfile above for GPU-enabled deployments on Clore.ai.
{% endhint %}

***

## Step 3 — Connect via SSH

```bash
ssh root@<clore-host> -p <assigned-ssh-port>
```

Verify BentoML:

```bash
bentoml --version
# Expected: bentoml, version 1.x.x
```
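
Also confirm the container actually sees the GPU before serving anything:

```bash
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
# Expected: True
```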

***

## Step 4 — Your First BentoML Service

### Simple Text Classifier

Create a service file:

```bash
mkdir -p /workspace/my-service
cat > /workspace/my-service/service.py << 'EOF'
import bentoml
from bentoml.io import JSON, Text

# Define a Runner (the model unit)
class TextClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True
    
    def __init__(self):
        import torch
        from transformers import pipeline
        
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return results

# Create Runner
classifier_runner = bentoml.Runner(
    TextClassifierRunnable,
    name="text_classifier",
    max_batch_size=32,
    max_latency_ms=100,
)

# Define Service
svc = bentoml.Service(
    name="text_classifier_service",
    runners=[classifier_runner],
)

@svc.api(input=Text(), output=JSON())
async def classify(text: str) -> dict:
    """Classify sentiment of input text."""
    results = await classifier_runner.classify.async_run([text])
    return results[0]
EOF
```

### Start the Service

```bash
cd /workspace/my-service

bentoml serve service:svc \
    --host 0.0.0.0 \
    --port 3000 \
    --reload
```

{% hint style="info" %}
The `--reload` flag enables hot-reload during development. Remove it in production for stability.
{% endhint %}

***

## Step 5 — Access the Service

Open the auto-generated Swagger UI:

```
http://<clore-host>:<public-port-3000>
```

Or test via `curl`:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: text/plain" \
    -d "This GPU cloud service is amazing!"
```

Expected response:

```json
{"label": "POSITIVE", "score": 0.9986}
```
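
You can also call the service from Python. BentoML ships a client that generates one method per endpoint; a minimal sketch, assuming BentoML 1.0.13 or newer, where `bentoml.client.Client` was introduced:

```python
from bentoml.client import Client

# a method is generated for each service endpoint (here: classify)
client = Client.from_url("http://<clore-host>:<public-port-3000>")
result = client.classify("This GPU cloud service is amazing!")
print(result)  # {"label": "POSITIVE", "score": ...}
```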

***

## Step 6 — Image Classification Service

### Vision Model Service

```python
# /workspace/vision-service/service.py
import bentoml
from bentoml.io import Image, JSON
from PIL import Image as PILImage

class ImageClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        import torch
        import torchvision.transforms as transforms
        from torchvision.models import resnet50, ResNet50_Weights
        
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        weights = ResNet50_Weights.DEFAULT
        self.model = resnet50(weights=weights).to(self.device)
        self.model.eval()
        self.preprocess = weights.transforms()
        self.categories = weights.meta["categories"]
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, images: list) -> list[dict]:
        import torch
        
        batch = torch.stack([self.preprocess(img) for img in images]).to(self.device)
        
        with torch.no_grad():
            predictions = self.model(batch).softmax(dim=1)
        
        results = []
        for pred in predictions:
            top5 = pred.topk(5)
            results.append({
                "predictions": [
                    {"label": self.categories[idx], "score": round(score.item(), 4)}
                    for score, idx in zip(top5.values, top5.indices)
                ]
            })
        return results


image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=16,
)

svc = bentoml.Service(
    name="image_classifier_service",
    runners=[image_runner],
)

@svc.api(input=Image(), output=JSON())
async def classify(image: PILImage.Image) -> dict:
    """Classify an image with ResNet50."""
    results = await image_runner.predict.async_run([image])
    return results[0]
```

```bash
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

Test with an image:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: image/jpeg" \
    --data-binary @/path/to/image.jpg
```

***

## Step 7 — LLM Streaming Service

For language models, start with a plain (non-streaming) generation runner; a streaming variant follows after this example:

```python
# /workspace/llm-service/service.py
import bentoml
from bentoml.io import JSON, Text

class LLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        
        model_name = "microsoft/phi-2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    @bentoml.Runnable.method(batchable=False)
    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        import torch
        
        # with device_map="auto" the model picks its own device(s),
        # so target self.model.device rather than hard-coding "cuda"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


llm_runner = bentoml.Runner(LLMRunnable, name="llm")

svc = bentoml.Service("llm_service", runners=[llm_runner])

@svc.api(input=JSON(), output=Text())
async def generate(body: dict) -> str:
    prompt = body.get("prompt", "")
    max_tokens = body.get("max_tokens", 200)
    return await llm_runner.generate.async_run(prompt, max_tokens)
```
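
The example above returns the full completion at once. Recent BentoML 1.x releases can also stream tokens: an endpoint defined with `output=Text()` may return an async generator, which BentoML sends as a streaming response (the pattern used in the LLM quickstart linked at the end of this page). A minimal sketch, assuming transformers' `TextIteratorStreamer` and loading the model directly in the API process rather than behind a Runner; the file and service names are illustrative:

```python
# /workspace/llm-service/service_stream.py
# Minimal streaming sketch. Assumes a recent BentoML 1.x release in which an
# endpoint with output=Text() may return an async generator. The model is
# loaded in the API process (bypassing the Runner layer) to keep this short.
import asyncio
import threading

import bentoml
import torch
from bentoml.io import JSON, Text
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

svc = bentoml.Service("llm_stream_service")

@svc.api(input=JSON(), output=Text())
async def generate_stream(body: dict):
    prompt = body.get("prompt", "")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    # model.generate() blocks, so run it in a background thread and
    # yield decoded text chunks as the streamer produces them
    thread = threading.Thread(
        target=model.generate,
        kwargs=dict(
            **inputs,
            max_new_tokens=200,
            do_sample=True,
            temperature=0.7,
            streamer=streamer,
            pad_token_id=tokenizer.eos_token_id,
        ),
    )
    thread.start()
    for chunk in streamer:
        yield chunk
        await asyncio.sleep(0)  # hand control back to the event loop
```

Test it with `curl -N` (disables buffering) to watch tokens arrive incrementally:

```bash
curl -N -X POST http://<clore-host>:<public-port-3000>/generate_stream \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Explain GPU batching in one paragraph."}'
```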

***

## Step 8 — Save and Build a Bento

A **Bento** is a packaged, reproducible artifact:

```python
# /workspace/build_bento.py
import bentoml

# Save model to BentoML model store
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()

saved_model = bentoml.pytorch.save_model(
    name="resnet50",
    model=model,
    labels={"framework": "pytorch", "task": "image-classification"},
    metadata={"accuracy": 0.80, "dataset": "ImageNet"}
)
print(f"Model saved: {saved_model.tag}")
```

```bash
python /workspace/build_bento.py

# List saved models
bentoml models list

# Build a Bento (requires bentofile.yaml)
bentoml build
```

### bentofile.yaml

```yaml
service: "service:svc"
labels:
  owner: "ml-team"
  stage: "production"
include:
  - "*.py"
python:
  packages:
    - torch
    - torchvision
    - transformers
    - Pillow
    - numpy
docker:
  python_version: "3.11"
  cuda_version: "12.1"
  system_packages:
    - libgl1
```

```bash
bentoml build

# List built bentos
bentoml list

# Containerize
bentoml containerize image_classifier_service:latest \
    --image-tag YOUR_DOCKERHUB_USERNAME/my-bento:latest
```
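
The result is a regular Docker image that serves the Bento on port 3000 by default; run it on any GPU host with the NVIDIA Container Toolkit installed:

```bash
docker run --rm --gpus all -p 3000:3000 \
    YOUR_DOCKERHUB_USERNAME/my-bento:latest
```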

***

## Monitoring and Metrics

BentoML exposes Prometheus metrics at `/metrics`:

```bash
curl http://<clore-host>:<public-port-3000>/metrics
```

Key metrics:

```
# Request rate
bentoml_service_request_total{endpoint="classify", http_status_code="200"}
# Latency
bentoml_service_request_duration_seconds{endpoint="classify"}
# Runner throughput  
bentoml_runner_request_total{runner_name="image_classifier"}
```
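
To collect these with an external Prometheus server, point a scrape job at the service's public port (the job name is arbitrary):

```yaml
# prometheus.yml (scrape section)
scrape_configs:
  - job_name: "bentoml"
    metrics_path: /metrics
    static_configs:
      - targets: ["<clore-host>:<public-port-3000>"]
```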

***

## Adaptive Batching Configuration

```python
# Fine-tune batching behavior
image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=64,          # Max requests per batch
    max_latency_ms=50,          # Max wait before dispatching
)
```
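
Per-runner settings (batching limits, GPU allocation) can also be supplied at serve time through BentoML's configuration file, loaded via the `BENTOML_CONFIG` environment variable. A minimal sketch, assuming the BentoML 1.x configuration schema:

```yaml
# bentoml_configuration.yaml
runners:
  image_classifier:
    resources:
      nvidia.com/gpu: 1
    batching:
      enabled: true
      max_batch_size: 64
      max_latency_ms: 50
```

```bash
BENTOML_CONFIG=./bentoml_configuration.yaml \
    bentoml serve service:svc --host 0.0.0.0 --port 3000
```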

***

## Troubleshooting

### Service Won't Start

```
ERROR - Failed to initialize runner
```

**Solutions:**

* Check CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
* Verify GPU VRAM: `nvidia-smi`
* Check model download completed (look for download progress in logs)

### Port 3000 Not Accessible

```bash
# Ensure service binds to 0.0.0.0 (not localhost)
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

### High Latency on First Request

This is normal: the first request triggers model loading (warm-up), and subsequent requests are fast. Trigger a warm-up call right after startup:

```bash
# Warm up after starting
sleep 10 && curl -s -o /dev/null http://localhost:3000/healthz
```

### Import Errors

```
ModuleNotFoundError: No module named 'transformers'
```

**Solution:**

```bash
pip install transformers accelerate
```

***

## Clore.ai GPU Recommendations

BentoML is a serving framework — GPU requirements depend entirely on the model you deploy. Here's what to expect for common workloads:

| GPU       | VRAM  | Clore.ai Price | LLM (7B Q4) Throughput | Diffusion (SDXL) | Vision (ResNet50) |
| --------- | ----- | -------------- | ---------------------- | ---------------- | ----------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | \~80 tok/s             | \~4 img/min      | \~400 req/s       |
| RTX 4090  | 24 GB | \~$0.70/hr     | \~140 tok/s            | \~8 img/min      | \~700 req/s       |
| A100 40GB | 40 GB | \~$1.20/hr     | \~110 tok/s            | \~6 img/min      | \~1200 req/s      |
| A100 80GB | 80 GB | \~$2.00/hr     | \~130 tok/s            | \~7 img/min      | \~1400 req/s      |

**Use case guidance:**

* **LLM API serving (7B–13B):** RTX 3090 (\~$0.12/hr) — optimal price-performance
* **Image generation APIs:** RTX 3090 or RTX 4090 depending on throughput needs
* **Large models (34B–70B Q4):** A100 40GB (\~$1.20/hr) — fits comfortably
* **Production multi-model serving:** A100 80GB for memory headroom

{% hint style="info" %}
BentoML's **adaptive micro-batching** pays off most on A100s: the extra VRAM and memory bandwidth sustain large batch sizes, extracting more throughput per dollar than naive single-request serving. For high-traffic APIs, a single A100 40GB often delivers better ROI than two RTX 4090s.
{% endhint %}

***

## Useful Resources

* [BentoML Official Documentation](https://docs.bentoml.com)
* [BentoML GitHub](https://github.com/bentoml/BentoML)
* [BentoML Examples](https://github.com/bentoml/BentoML/tree/main/examples)
* [BentoML Slack Community](https://l.bentoml.com/join-slack-space)
* [BentoML Gallery](https://www.bentoml.com/gallery)
* [Quickstart: Serving LLMs](https://docs.bentoml.com/en/latest/get-started/quickstart.html)
