# BentoML

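**BentoML** is a modern open-source framework for **building, shipping, and scaling AI applications**. It bridges the gap between machine learning experimentation and production deployment, letting you package a model from any supported framework as a production-ready API service in minutes. Running BentoML on Clore.ai's GPU cloud gives you cost-effective hosting for AI applications.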

***

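## What is BentoML?

BentoML makes it simple to turn a trained model into a scalable API service:

* **Framework-agnostic:** supports PyTorch, TensorFlow, JAX, scikit-learn, HuggingFace, XGBoost, LightGBM, and more
* **Bento:** a self-contained, reproducible artifact (model + code + dependencies)
* **Runner:** a scalable model inference unit with automatic batching
* **Service:** a FastAPI-like HTTP/gRPC service definition
* **BentoCloud:** an optional managed deployment platform
* **Docker-first:** every Bento can be containerized with a single command

**Key features:**

* Adaptive micro-batching for throughput optimization
* Built-in Pydantic-based input/output validation
* Automatic OpenAPI spec generation
* Built-in Prometheus metrics
* Streaming response support (LLMs)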

***

## Prerequisites

| Requirement | Minimum    | Recommended     |
| ----------- | ---------- | --------------- |
| GPU VRAM    | 8 GB       | 16–24 GB        |
| GPU         | Any NVIDIA | RTX 4090 / A100 |
| RAM         | 8 GB       | 16 GB           |
| Storage     | 20 GB      | 40 GB           |
| Python      | 3.9+       | 3.11+           |

***

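## Step 1 — Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and select a GPU instance with ≥ 16 GB of VRAM.
3. Set the Docker image: we will use a custom build (see Step 2).
4. Set the open ports: `22` (SSH) and `3000` (BentoML service).
5. Click **Rent**.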

***

## Step 2 — Dockerfile

BentoML does not provide an official GPU Docker image, so we need to build one:

```dockerfile
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    git wget curl \
    openssh-server \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH (change the default root password before exposing port 22)
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install BentoML and common machine learning libraries
RUN pip install --upgrade pip && \
    pip install \
        bentoml \
        transformers \
        accelerate \
        diffusers \
        Pillow \
        numpy \
        scipy \
        tritonclient[all]

WORKDIR /workspace

EXPOSE 22 3000

CMD service ssh start && tail -f /dev/null
```

### Build and Push

Build the image and push it to your own Docker Hub account (replace `YOUR_DOCKERHUB_USERNAME` with your actual username):

```bash
docker build -t YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest .
docker push YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest
```

{% hint style="info" %}
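BentoML does not publish an official GPU Docker image on Docker Hub. The `bentoml/bento-server` images there are meant for serving pre-packaged Bentos and do not include CUDA support. For GPU-enabled deployments on Clore.ai, build the image from the Dockerfile above.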
{% endhint %}

***

## Step 3 — Connect via SSH

```bash
ssh root@<clore-host> -p <assigned-ssh-port>
```

Verify BentoML:

```bash
bentoml --version
# Expected: bentoml, version 1.x.x
```
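
The same check can be scripted in Python, which is handy when you also want to confirm the libraries that later steps import. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the package list is an assumption based on the Dockerfile above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report each package the guide relies on.
for name in ("bentoml", "torch", "transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```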

***

## Step 4 — Your First BentoML Service

### Simple Text Classifier

Create a service file:

```bash
mkdir -p /workspace/my-service
cat > /workspace/my-service/service.py << 'EOF'
import bentoml
from bentoml.io import JSON, Text
import numpy as np

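# Define a Runnable (the model inference unit)
class TextClassifierRunnable(bentoml.Runnable):
    # BentoML's scheduler recognizes "nvidia.com/gpu" and "cpu" as resource names
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")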
    SUPPORTS_CPU_MULTI_THREADING = True
    
    def __init__(self):
        import torch
        from transformers import pipeline
        
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return results

# Create the Runner
classifier_runner = bentoml.Runner(
    TextClassifierRunnable,
    name="text_classifier",
    max_batch_size=32,
    max_latency_ms=100,
)

# Define the Service
svc = bentoml.Service(
    name="text_classifier_service",
    runners=[classifier_runner],
)

@svc.api(input=Text(), output=JSON())
async def classify(text: str) -> dict:
    """Classify the sentiment of the input text."""
    results = await classifier_runner.classify.async_run([text])
    return results[0]
EOF
```

### Start the Service

```bash
cd /workspace/my-service

bentoml serve service:svc \
    --host 0.0.0.0 \
    --port 3000 \
    --reload
```

{% hint style="info" %}
The `--reload` flag enables hot reloading during development. Remove it in production for stability.
{% endhint %}

***

## Step 5 — Access the Service

Open the auto-generated Swagger UI:

```
http://<clore-host>:<public-port-3000>
```

Or test it with `curl`:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: text/plain" \
    -d "This GPU cloud service is amazing!"
```

Expected response:

```json
{"label": "POSITIVE", "score": 0.9986}
```
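
The same request can be sent from Python. The sketch below uses only the standard library; `build_classify_request` and `classify` are hypothetical helper names, and the host and port arguments stand in for the values Clore.ai assigns to your instance.

```python
import json
import urllib.request


def build_classify_request(host: str, port: int, text: str) -> urllib.request.Request:
    """Build a POST request for the /classify endpoint with a plain-text body."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/classify",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def classify(host: str, port: int, text: str, timeout: float = 30.0) -> dict:
    """Send the text to the service and decode the JSON response."""
    request = build_classify_request(host, port, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `classify(host, port, "This GPU cloud service is amazing!")` with your instance's host and public port should return a dictionary in the same shape as the JSON response shown above.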

***

## Step 6 — Image Classification Service

### Vision Model Service

```python
# /workspace/vision-service/service.py
import bentoml
from bentoml.io import Image, JSON
from PIL import Image as PILImage
import numpy as np

class ImageClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        import torch
        import torchvision.transforms as transforms
        from torchvision.models import resnet50, ResNet50_Weights
        
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        weights = ResNet50_Weights.DEFAULT
        self.model = resnet50(weights=weights).to(self.device)
        self.model.eval()
        self.preprocess = weights.transforms()
        self.categories = weights.meta["categories"]
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, images: list) -> list[dict]:
        import torch
        
        # Convert to RGB so grayscale or RGBA uploads match the 3-channel model input
        batch = torch.stack([self.preprocess(img.convert("RGB")) for img in images]).to(self.device)
        
        with torch.no_grad():
            predictions = self.model(batch).softmax(dim=1)
        
        results = []
        for pred in predictions:
            top5 = pred.topk(5)
            results.append({
                "predictions": [
                    {"label": self.categories[idx], "score": round(score.item(), 4)}
                    for score, idx in zip(top5.values, top5.indices)
                ]
            })
        return results


image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=16,
)

svc = bentoml.Service(
    name="image_classifier_service",
    runners=[image_runner],
)

@svc.api(input=Image(), output=JSON())
async def classify(image: PILImage.Image) -> dict:
    """Classify an image with ResNet50."""
    results = await image_runner.predict.async_run([image])
    return results[0]
```

```bash
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

Test with an image:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: image/jpeg" \
    --data-binary @/path/to/image.jpg
```
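
The service returns a `predictions` list of label and score pairs, so a client often wants to keep only the confident ones. The following is a minimal sketch of that post-processing; `top_labels` is a hypothetical helper, the 0.10 threshold is an arbitrary assumption, and the sample payload values are made up for illustration.

```python
def top_labels(response: dict, min_score: float = 0.10) -> list[tuple[str, float]]:
    """Return (label, score) pairs at or above min_score, highest score first."""
    predictions = response.get("predictions", [])
    kept = [(p["label"], p["score"]) for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Illustrative payload in the shape produced by the service above (values are made up).
sample = {
    "predictions": [
        {"label": "tabby", "score": 0.62},
        {"label": "tiger cat", "score": 0.21},
        {"label": "Egyptian cat", "score": 0.08},
    ]
}
print(top_labels(sample))
```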

***

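## Step 7 — LLM Text Generation Service

For language models, the service below accepts a JSON body with `prompt` and `max_tokens` and returns the complete generated text in a single response: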

```python
# /workspace/llm-service/service.py
import bentoml
from bentoml.io import JSON, Text
from typing import AsyncGenerator

class LLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        
        model_name = "microsoft/phi-2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    @bentoml.Runnable.method(batchable=False)
    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        import torch
        
        # Place inputs on the device that device_map="auto" chose for the model
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


llm_runner = bentoml.Runner(LLMRunnable, name="llm")

svc = bentoml.Service("llm_service", runners=[llm_runner])

@svc.api(input=JSON(), output=Text())
async def generate(body: dict) -> str:
    prompt = body.get("prompt", "")
    max_tokens = body.get("max_tokens", 200)
    return await llm_runner.generate.async_run(prompt, max_tokens)
```
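
The endpoint above passes `prompt` and `max_tokens` straight to the model, so an empty prompt or a very large `max_tokens` goes through unchecked. A minimal sketch of validating the body first is shown below; `parse_generate_body` is a hypothetical helper, and the 1024-token cap is an assumed limit, not a BentoML default.

```python
def parse_generate_body(
    body: dict,
    default_max_tokens: int = 200,
    max_allowed: int = 1024,  # assumed cap, tune for your model and GPU
) -> tuple[str, int]:
    """Validate the JSON body and return (prompt, max_tokens)."""
    prompt = body.get("prompt", "")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("'prompt' must be a non-empty string")

    max_tokens = body.get("max_tokens", default_max_tokens)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        raise ValueError("'max_tokens' must be an integer")

    # Clamp to [1, max_allowed] so a single request cannot monopolize the GPU.
    return prompt, max(1, min(max_tokens, max_allowed))
```

Inside the `generate` endpoint, the two `body.get(...)` lines could be replaced by `prompt, max_tokens = parse_generate_body(body)`.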

***

## Step 8 — Save and Build a Bento

A **Bento** is a packaged, reproducible artifact:

```python
# /workspace/build_bento.py
import bentoml

# Save the model to the BentoML model store
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()

saved_model = bentoml.pytorch.save_model(
    name="resnet50",
    model=model,
    labels={"framework": "pytorch", "task": "image-classification"},
    metadata={"accuracy": 0.80, "dataset": "ImageNet"}
)
print(f"Model saved: {saved_model.tag}")
```

```bash
python /workspace/build_bento.py

# List saved models
bentoml models list

# Build the Bento (requires a bentofile.yaml)
bentoml build
```

### bentofile.yaml

```yaml
service: "service:svc"
labels:
  owner: "ml-team"
  stage: "production"
include:
  - "*.py"
python:
  packages:
    - torch
    - torchvision
    - transformers
    - Pillow
    - numpy
docker:
  python_version: "3.11"
  cuda_version: "12.1"
  system_packages:
    - libgl1
```

```bash
bentoml build

# List built Bentos
bentoml list

# Containerize
bentoml containerize image_classifier_service:latest \
    --image-tag YOUR_DOCKERHUB_USERNAME/my-bento:latest
```

***

## Monitoring and Metrics

BentoML exposes Prometheus metrics at `/metrics`:

```bash
curl http://<clore-host>:<public-port-3000>/metrics
```

Key metrics:

```
# Request rate
bentoml_service_request_total{endpoint="classify", http_status_code="200"}
# Latency
bentoml_service_request_duration_seconds{endpoint="classify"}
# Runner throughput
bentoml_runner_request_total{runner_name="image_classifier"}
```
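
The `/metrics` output is plain text in the Prometheus exposition format, so it can be summarized without a Prometheus server. The sketch below sums one counter grouped by one label; `sum_by_label` is a hypothetical helper, the sample text is made up, and the parser is deliberately simple (it does not handle escaped quotes inside label values).

```python
import re
from collections import defaultdict

# metric name, optional {labels}, value, optional integer timestamp
_SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def sum_by_label(metrics_text: str, metric: str, label: str) -> dict[str, float]:
    """Sum the samples of one metric, grouped by the value of one label."""
    totals: dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        totals[labels.get(label, "")] += float(match.group(3))
    return dict(totals)


# Made-up sample using the metric and label names listed above.
sample = """# TYPE bentoml_service_request_total counter
bentoml_service_request_total{endpoint="classify",http_status_code="200"} 42.0
bentoml_service_request_total{endpoint="classify",http_status_code="500"} 3.0
bentoml_service_request_total{endpoint="healthz",http_status_code="200"} 7.0
"""
print(sum_by_label(sample, "bentoml_service_request_total", "http_status_code"))
```

Grouping by `http_status_code` gives a quick error-rate check; grouping by `endpoint` shows which API receives the most traffic.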

***

## Adaptive Batching Configuration

```python
# Fine-tune batching behavior
image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=64,          # maximum number of requests per batch
    max_latency_ms=50,          # maximum wait time before a batch is dispatched
)
)
```
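
To build intuition for how `max_batch_size` and `max_latency_ms` interact, the sketch below implements a simplified micro-batcher with `asyncio`. It is an illustration only, not BentoML's actual scheduler (which also adapts to observed latency); `MicroBatcher` and `demo` are hypothetical names.

```python
import asyncio


class MicroBatcher:
    """Group concurrent calls into batches, bounded by size and wait time."""

    def __init__(self, batch_fn, max_batch_size: int = 64, max_latency_ms: float = 50):
        self.batch_fn = batch_fn              # takes a list, returns a list of equal length
        self.max_batch_size = max_batch_size
        self.max_wait = max_latency_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: list[int] = []      # recorded so the batching can be inspected

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: collect a batch, run it, fan the results back out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break                     # wait budget spent, dispatch what we have
            self.batch_sizes.append(len(batch))
            outputs = self.batch_fn([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4, max_latency_ms=50)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A larger `max_batch_size` raises GPU utilization under load, while a smaller `max_latency_ms` limits how long a lone request waits for companions before it is dispatched.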

***

## Troubleshooting

### Service Fails to Start

```
ERROR - Failed to initialize runner
```

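**Solution:** reduce the batch size to 1, then work through these checks:

* Check CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
* Verify GPU memory: `nvidia-smi`
* Check that the model download has finished (look for download progress in the logs)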

### Port 3000 Is Unreachable

```bash
# Make sure the service binds to 0.0.0.0 (not localhost)
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

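### High Latency on the First Request

This is normal: the first request triggers model loading (warm-up), and all subsequent requests are fast. Add a warm-up call after startup: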

```bash
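# Warm up after startup with a real inference request (a /healthz call alone does not run the model).
# Adjust the endpoint and payload to match your service.
sleep 10 && curl -s -o /dev/null -X POST http://localhost:3000/classify \
    -H "Content-Type: text/plain" -d "warm-up"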
```
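
A fixed `sleep 10` can be too short for large models and needlessly long for small ones. As an alternative, the sketch below polls the `/healthz` endpoint until the server answers and only then proceeds; `is_healthy` and `wait_until_ready` are hypothetical helpers, and the timeout values are assumptions.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or HTTP error: not ready yet


def wait_until_ready(check, max_wait_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Call check() repeatedly until it returns True or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

For example, `wait_until_ready(lambda: is_healthy("http://localhost:3000/healthz"))` returns `True` once the server responds, after which a first inference request can be sent to warm up the model.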

### Import Errors

```
ModuleNotFoundError: No module named 'transformers'
```

**Solution:**

```bash
pip install transformers accelerate
```

***

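## GPU Recommendations for Clore.ai

BentoML is a serving framework, so GPU requirements depend entirely on the model you deploy. Here is what to expect for common workloads: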

| GPU       | VRAM  | Clore.ai price | LLM (7B Q4) throughput | Diffusion (SDXL) | Vision (ResNet50) |
| --------- | ----- | -------------- | ---------------------- | ---------------- | ----------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | \~80 tok/s             | \~4 img/min      | \~400 req/s       |
| RTX 4090  | 24 GB | \~$0.70/hr     | \~140 tok/s            | \~8 img/min      | \~700 req/s       |
| A100 40GB | 40 GB | \~$1.20/hr     | \~110 tok/s            | \~6 img/min      | \~1200 req/s      |
| A100 80GB | 80 GB | \~$2.00/hr     | \~130 tok/s            | \~7 img/min      | \~1400 req/s      |

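**Use-case guidance:**

* **LLM API serving (7B–13B):** RTX 3090 (\~$0.12/hr) for the best price-performance
* **Image generation API:** RTX 3090 or RTX 4090, depending on throughput needs
* **Large models (34B–70B Q4):** A100 40GB (\~$1.20/hr), which fits them comfortably
* **Production multi-model serving:** A100 80GB to keep memory headroom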

{% hint style="info" %}
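BentoML's **adaptive micro-batching** is particularly effective on the A100: the hardware scheduler handles batches efficiently, extracting more throughput per dollar than naive single-request serving. For high-traffic APIs, an A100 40GB often delivers a better return on investment than two RTX 4090s.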
{% endhint %}
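
To compare the options on cost rather than raw speed, the approximate figures from the table can be converted into cost per million tokens and requests per dollar. This is a rough sketch that assumes full, sustained utilization and takes the last table row to be the A100 80GB; the function names are hypothetical.

```python
# Figures copied from the table above:
# (hourly price in USD, 7B Q4 tokens/s, ResNet50 requests/s)
GPUS = {
    "RTX 3090": (0.12, 80, 400),
    "RTX 4090": (0.70, 140, 700),
    "A100 40GB": (1.20, 110, 1200),
    "A100 80GB": (2.00, 130, 1400),
}


def usd_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at full, sustained utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


def requests_per_usd(price_per_hour: float, requests_per_second: float) -> float:
    """Requests served per dollar at full, sustained utilization."""
    return requests_per_second * 3600 / price_per_hour


for name, (price, tok_s, req_s) in GPUS.items():
    print(
        f"{name:10s} ${usd_per_million_tokens(price, tok_s):.2f} per 1M tokens, "
        f"{requests_per_usd(price, req_s):,.0f} requests per USD"
    )
```

Real workloads rarely keep a GPU fully busy, so treat these numbers as a lower bound on cost and re-run the calculation with your own measured throughput.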

***

## Useful Resources

* [BentoML official documentation](https://docs.bentoml.com)
* [BentoML GitHub](https://github.com/bentoml/BentoML)
* [BentoML examples](https://github.com/bentoml/BentoML/tree/main/examples)
* [BentoML Slack community](https://l.bentoml.com/join-slack-space)
* [BentoML gallery](https://www.bentoml.com/gallery)
* [Quickstart: serving an LLM](https://docs.bentoml.com/en/latest/get-started/quickstart.html)

