BentoML

BentoML is a modern, open-source framework for building, shipping, and scaling AI applications. It bridges the gap between ML experimentation and production deployment, letting you package any model from any framework into a production-ready API service in minutes. Run BentoML on Clore.ai's GPU cloud for cost-efficient AI application hosting.


What is BentoML?

BentoML makes it easy to take a trained model and turn it into a scalable API service:

  • Framework-agnostic: PyTorch, TensorFlow, JAX, scikit-learn, HuggingFace, XGBoost, LightGBM, and more

  • Bento: A self-contained, reproducible artifact (model + code + dependencies)

  • Runner: Scalable model inference unit with automatic batching

  • Service: FastAPI-like HTTP/gRPC service definition

  • BentoCloud: Optional managed deployment platform

  • Docker-first: Every Bento can be containerized with one command

Key features:

  • Adaptive micro-batching for throughput optimization

  • Built-in input/output validation with Pydantic

  • Auto-generated OpenAPI specification

  • Built-in Prometheus metrics

  • Streaming responses for LLM workloads


Prerequisites

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 8 GB | 16–24 GB |
| GPU | Any NVIDIA | RTX 4090 / A100 |
| RAM | 8 GB | 16 GB |
| Storage | 20 GB | 40 GB |
| Python | 3.9+ | 3.11+ |


Step 1 — Rent a GPU on Clore.ai

  1. Click Marketplace and select a GPU instance with ≥ 16 GB VRAM.

  2. Set Docker image: we'll use a custom build (see Step 2).

  3. Set open ports: 22 (SSH) and 3000 (BentoML service).

  4. Click Rent.


Step 2 — Dockerfile

BentoML doesn't have an official GPU Docker image, so we build one:
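A minimal sketch of such a Dockerfile; the CUDA base tag and the preinstalled Python packages are assumptions, so adjust them to your model's requirements:

```dockerfile
# Minimal GPU-enabled image for serving BentoML on Clore.ai (sketch).
# Base tag and package list are assumptions; pin versions for production.
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# Python, pip, and an SSH server (Clore.ai exposes port 22 for access)
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip openssh-server && \
    mkdir -p /var/run/sshd && \
    rm -rf /var/lib/apt/lists/*

# BentoML plus typical model libraries
RUN pip3 install --no-cache-dir bentoml torch transformers

EXPOSE 22 3000
CMD ["/usr/sbin/sshd", "-D"]
```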

Build and Push

Build the image and push it to your own Docker Hub account (replace YOUR_DOCKERHUB_USERNAME with your actual username):
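A sketch of the build-and-push commands, assuming the Dockerfile sits in the current directory and you are already logged in with `docker login`:

```shell
# Tag, build, and push the custom GPU image (replace the username)
IMAGE="YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest"
docker build -t "$IMAGE" .
docker push "$IMAGE"
```

On the Clore.ai rental form, enter the pushed tag as the instance's Docker image.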


BentoML does not provide an official GPU Docker image on Docker Hub. The bentoml/bento-server images on Docker Hub are for serving pre-packaged Bentos and do not include CUDA support. Build the image from the Dockerfile above for GPU-enabled deployments on Clore.ai.


Step 3 — Connect via SSH

Verify BentoML:
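After connecting with `ssh root@<instance-ip> -p <ssh-port>` (both values are shown on the Clore.ai instance card), a quick sanity check might look like this:

```shell
# Confirm BentoML is installed and the GPU is visible to PyTorch
bentoml --version
python3 -c "import torch; print(torch.cuda.is_available())"  # should print True on a GPU instance
nvidia-smi
```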


Step 4 — Your First BentoML Service

Simple Text Classifier

Create a service file:

Start the Service
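Assuming the file above is saved as `service.py` and the Service object is named `svc`:

```shell
# Run from the directory containing service.py; bind all interfaces
# so the API is reachable from outside the instance
bentoml serve service:svc --host 0.0.0.0 --port 3000 --reload
```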


The --reload flag enables hot-reload during development. Remove it in production for stability.


Step 5 — Access the Service

Open the auto-generated Swagger UI, served at http://&lt;instance-ip&gt;:3000, in your browser.

Or test via curl:

Expected response:
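A sketch of the call, assuming the text-classifier service from Step 4 is running; the response shown is illustrative only, since the exact label and score depend on the model:

```shell
# Query the classify endpoint; replace the host with the instance's public IP
HOST="instance-ip"
curl -s -X POST "http://$HOST:3000/classify" \
  -H "Content-Type: text/plain" \
  -d "BentoML makes deployment painless"

# Illustrative response shape:
# {"label": "POSITIVE", "score": 0.9987}
```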


Step 6 — Image Classification Service

Vision Model Service

Test with an image:
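For example, with a hypothetical local file `cat.jpg` (the Image IO descriptor accepts a raw image body with a matching Content-Type):

```shell
# Upload an image to the classify_image endpoint
FILE="cat.jpg"
curl -s -X POST http://localhost:3000/classify_image \
  -H "Content-Type: image/jpeg" \
  --data-binary @"$FILE"
```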


Step 7 — LLM Streaming Service

For language models with streaming responses:


Step 8 — Save and Build a Bento

A Bento is a packaged, reproducible artifact:

bentofile.yaml
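A minimal `bentofile.yaml` sketch; the service path, included files, and package list are assumptions based on the Step 4 example:

```yaml
service: "service:svc"     # module:variable of the Service object
include:
  - "service.py"
python:
  packages:
    - torch
    - transformers
docker:
  cuda_version: "12.1.1"   # generates a CUDA-enabled container image
```

With this file in place, `bentoml build` packages the Bento and `bentoml containerize <bento-tag>` turns it into a Docker image.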


Monitoring and Metrics

BentoML exposes Prometheus metrics at the /metrics endpoint. Key series include request counts, latency histograms, and in-flight request gauges, all carrying a bentoml_ name prefix.
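As a quick sketch, assuming the service is listening locally on port 3000:

```shell
# Scrape the Prometheus endpoint and show the BentoML-specific series
URL="http://localhost:3000/metrics"
curl -s "$URL" | grep "^bentoml_" | head -n 10 || true
```

Point a Prometheus scrape job at the same URL to collect these continuously.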


Adaptive Batching Configuration


Troubleshooting

Service Won't Start

Solutions:

  • Check CUDA availability: python -c "import torch; print(torch.cuda.is_available())"

  • Verify GPU VRAM: nvidia-smi

  • Check model download completed (look for download progress in logs)

Port 3000 Not Accessible
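The most common cause is the server binding only to 127.0.0.1; also confirm that port 3000 appears in the instance's open-ports list on Clore.ai (set in Step 1). A fix sketch:

```shell
# Bind all interfaces so the port is reachable from outside the container
PORT=3000
bentoml serve service:svc --host 0.0.0.0 --port "$PORT"
```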

High Latency on First Request

This is normal — the first request triggers model loading (warm-up). All subsequent requests will be fast. Add a warm-up endpoint call after start:
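For example, a best-effort warm-up request (the endpoint name is service-specific; `/classify` matches the Step 4 example):

```shell
# Fire one throwaway request after startup to trigger model loading
URL="http://localhost:3000/classify"
curl -s -X POST "$URL" -H "Content-Type: text/plain" -d "warm-up" > /dev/null || true
```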

Import Errors

Solution:
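Reinstall BentoML inside the same Python environment the service runs in, then re-add your model libraries (e.g. torch, transformers):

```shell
# Upgrade BentoML and verify it imports cleanly
pip3 install --upgrade bentoml
python3 -c "import bentoml; print(bentoml.__version__)"
```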


Clore.ai GPU Recommendations

BentoML is a serving framework — GPU requirements depend entirely on the model you deploy. Here's what to expect for common workloads:

| GPU | VRAM | Clore.ai Price | LLM (7B Q4) Throughput | Diffusion (SDXL) | Vision (ResNet50) |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 | 24 GB | ~$0.12/hr | ~80 tok/s | ~4 img/min | ~400 req/s |
| RTX 4090 | 24 GB | ~$0.70/hr | ~140 tok/s | ~8 img/min | ~700 req/s |
| A100 40GB | 40 GB | ~$1.20/hr | ~110 tok/s | ~6 img/min | ~1200 req/s |
| A100 80GB | 80 GB | ~$2.00/hr | ~130 tok/s | ~7 img/min | ~1400 req/s |

Use case guidance:

  • LLM API serving (7B–13B): RTX 3090 (~$0.12/hr) — optimal price-performance

  • Image generation APIs: RTX 3090 or RTX 4090 depending on throughput needs

  • Large models (34B–70B Q4): A100 40GB (~$1.20/hr) — fits comfortably

  • Production multi-model serving: A100 80GB for memory headroom


BentoML's adaptive micro-batching is particularly effective on A100s — the hardware scheduler handles batching efficiently, extracting more throughput per dollar than naive single-request serving. For high-traffic APIs, A100 40GB often delivers better ROI than two RTX 4090s.


Useful Resources
