# BentoML

**BentoML** is a modern, open-source framework for **building, deploying, and scaling AI applications**. It bridges the gap between ML experimentation and production deployment by letting you package a model from any supported framework as a production-ready API service, often in minutes. Running BentoML on Clore.ai GPU instances gives you cost-effective hosting for AI applications.

***

## What is BentoML?

BentoML takes a trained model and turns it into a scalable API service:

* **Framework-agnostic:** PyTorch, TensorFlow, JAX, scikit-learn, HuggingFace, XGBoost, LightGBM, and more
* **Bento:** A self-contained, reproducible artifact (model + code + dependencies)
* **Runner:** A scalable unit of model inference with automatic batching
* **Service:** A FastAPI-style HTTP/gRPC service definition
* **BentoCloud:** An optional managed deployment platform
* **Docker-first:** Every Bento can be containerized with a single command

**Key features:**

* Adaptive micro-batching to improve throughput
* Built-in input/output validation with Pydantic
* Auto-generated OpenAPI specification
* Built-in Prometheus metrics
* Streaming response support (LLMs)

***

## Prerequisites

| Requirement | Minimum    | Recommended     |
| ----------- | ---------- | --------------- |
| GPU VRAM    | 8 GB       | 16–24 GB        |
| GPU         | Any NVIDIA | RTX 4090 / A100 |
| RAM         | 8 GB       | 16 GB           |
| Storage     | 20 GB      | 40 GB           |
| Python      | 3.9+       | 3.11+           |

***

## Step 1: Rent a GPU on Clore.ai

1. Log in to [clore.ai](https://clore.ai).
2. Click **Marketplace** and select a GPU instance with ≥ 16 GB of VRAM.
3. Set the Docker image to the custom build described in Step 2.
4. Open ports `22` (SSH) and `3000` (BentoML service).
5. Click **Rent**.

***

## Step 2: Dockerfile

BentoML does not publish an official GPU-enabled Docker image, so we build one:

```dockerfile
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    git wget curl \
    openssh-server \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH (the root password set below is a placeholder; change it before exposing port 22)
RUN mkdir /var/run/sshd && \
    echo 'root:clore123' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install BentoML and common ML libraries
RUN pip install --upgrade pip && \
    pip install \
        bentoml \
        transformers \
        accelerate \
        diffusers \
        Pillow \
        numpy \
        scipy \
        tritonclient[all]

WORKDIR /workspace

EXPOSE 22 3000

CMD service ssh start && tail -f /dev/null
```

### Build and push

Build the image and push it to your Docker Hub account (replace `YOUR_DOCKERHUB_USERNAME` with your actual username):

```bash
docker build -t YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest .
docker push YOUR_DOCKERHUB_USERNAME/bentoml-gpu:latest
```

{% hint style="info" %}
BentoML does not provide an official GPU-enabled image on Docker Hub. The `bentoml/bento-server` images on Docker Hub are meant for serving prepackaged Bentos and do not include CUDA support. Build the image from the Dockerfile above for GPU deployments on Clore.ai.
{% endhint %}

***

## Step 3: Connect via SSH

```bash
ssh root@<clore-host> -p <assigned-ssh-port>
```

Verify the BentoML installation:

```bash
bentoml --version
# Expected: bentoml, version 1.x.x
```
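
Beyond the CLI check, it can help to confirm that the Python packages the services import are present in the interpreter you will run them with. The sketch below uses only the standard library; the helper names `installed_version` and `report` are illustrative and not part of BentoML, and the package list simply mirrors the Dockerfile in this guide.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


# Packages installed by the Dockerfile in this guide.
for name, version in report(["bentoml", "torch", "transformers"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` line here usually means the shell is using a different Python environment than the one the image was built with.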

***

## Step 4: Your first BentoML service

### Simple text classifier

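Create a service file. The examples in this guide use the Runner-based service API from BentoML 1.1; releases from 1.2 onward favor a class-based `@bentoml.service` API and treat Runners as legacy, so if the imports fail, check which version `pip` installed.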

```bash
mkdir -p /workspace/my-service
cat > /workspace/my-service/service.py << 'EOF'
import bentoml
from bentoml.io import JSON, Text
import numpy as np

# Define a Runnable (the unit that wraps the model)
class TextClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True
    
    def __init__(self):
        import torch
        from transformers import pipeline
        
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return results

# Create the Runner
classifier_runner = bentoml.Runner(
    TextClassifierRunnable,
    name="text_classifier",
    max_batch_size=32,
    max_latency_ms=100,
)

# Define the Service
svc = bentoml.Service(
    name="text_classifier_service",
    runners=[classifier_runner],
)

@svc.api(input=Text(), output=JSON())
async def classify(text: str) -> dict:
    """Classify the sentiment of the input text."""
    results = await classifier_runner.classify.async_run([text])
    return results[0]
EOF
```

### Start the service

```bash
cd /workspace/my-service

bentoml serve service:svc \
    --host 0.0.0.0 \
    --port 3000 \
    --reload
```

{% hint style="info" %}
The `--reload` flag enables hot reloading during development. Remove it in production for better stability.
{% endhint %}

***

## Step 5: Access the service

Open the auto-generated Swagger UI:

```
http://<clore-host>:<public-port-3000>
```

Or test it with `curl`:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: text/plain" \
    -d "This GPU cloud service is amazing!"
```

Expected response:

```json
{"label": "POSITIVE", "score": 0.9986}
```
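
The same call can be made from Python without extra dependencies. This is a minimal sketch using `urllib` from the standard library; the function names are illustrative, and it assumes the endpoint is served at `/classify` on the host and public port that Clore.ai assigned to you.

```python
import json
import urllib.request


def build_classify_request(host: str, port: int, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /classify with a plain-text body."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/classify",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def classify(host: str, port: int, text: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response."""
    request = build_classify_request(host, port, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `classify("<clore-host>", <public-port-3000>, "This GPU cloud service is amazing!")` should return a dictionary with the same `label` and `score` keys as the `curl` response. Building the request separately from sending it keeps the payload easy to inspect without a running server.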

***

## Step 6: Image classification service

### Vision model service

```python
# /workspace/vision-service/service.py
import bentoml
from bentoml.io import Image, JSON
from PIL import Image as PILImage
import numpy as np

class ImageClassifierRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        import torch
        import torchvision.transforms as transforms
        from torchvision.models import resnet50, ResNet50_Weights
        
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        weights = ResNet50_Weights.DEFAULT
        self.model = resnet50(weights=weights).to(self.device)
        self.model.eval()
        self.preprocess = weights.transforms()
        self.categories = weights.meta["categories"]
    
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, images: list) -> list[dict]:
        import torch
        
        batch = torch.stack([self.preprocess(img) for img in images]).to(self.device)
        
        with torch.no_grad():
            predictions = self.model(batch).softmax(dim=1)
        
        results = []
        for pred in predictions:
            top5 = pred.topk(5)
            results.append({
                "predictions": [
                    {"label": self.categories[idx], "score": round(score.item(), 4)}
                    for score, idx in zip(top5.values, top5.indices)
                ]
            })
        return results


image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=16,
)

svc = bentoml.Service(
    name="image_classifier_service",
    runners=[image_runner],
)

@svc.api(input=Image(), output=JSON())
async def classify(image: PILImage.Image) -> dict:
    """Classify an image with ResNet50."""
    results = await image_runner.predict.async_run([image])
    return results[0]
```

```bash
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

Test it with an image:

```bash
curl -X POST http://<clore-host>:<public-port-3000>/classify \
    -H "Content-Type: image/jpeg" \
    --data-binary @/path/to/image.jpg
```

***

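## Step 7: LLM service

For language models, the service below exposes a text-generation endpoint. As written, it returns the completed text in a single response; token-by-token streaming needs an endpoint that yields chunks as they are generated.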

```python
# /workspace/llm-service/service.py
import bentoml
from bentoml.io import JSON, Text
from typing import AsyncGenerator

class LLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False
    
    def __init__(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        
        model_name = "microsoft/phi-2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    @bentoml.Runnable.method(batchable=False)
    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        import torch
        
        # Move inputs to wherever device_map="auto" placed the model
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


llm_runner = bentoml.Runner(LLMRunnable, name="llm")

svc = bentoml.Service("llm_service", runners=[llm_runner])

@svc.api(input=JSON(), output=Text())
async def generate(body: dict) -> str:
    prompt = body.get("prompt", "")
    max_tokens = body.get("max_tokens", 200)
    return await llm_runner.generate.async_run(prompt, max_tokens)
```
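
The service above reads `prompt` and `max_tokens` from a JSON body and returns plain text. A client for it could look like the sketch below. It assumes the endpoint is served at `/generate` (the name of the API function); the helper names and the client-side validation are illustrative choices, not part of the service.

```python
import json
import urllib.request


def build_generate_request(
    host: str, port: int, prompt: str, max_tokens: int = 200
) -> urllib.request.Request:
    """Build a POST to /generate with the JSON body the service reads."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host: str, port: int, prompt: str, max_tokens: int = 200) -> str:
    """Send the request; the service replies with plain text."""
    request = build_generate_request(host, port, prompt, max_tokens)
    # Generation can be slow, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8")
```

Rejecting an empty prompt on the client avoids spending GPU time on a request that the service would otherwise run with `prompt=""`.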

***

## Step 8: Save and build a Bento

A **Bento** is a packaged, reproducible artifact:

```python
# /workspace/build_bento.py
import bentoml

# Save the model to the BentoML model store
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()

saved_model = bentoml.pytorch.save_model(
    name="resnet50",
    model=model,
    labels={"framework": "pytorch", "task": "image-classification"},
    metadata={"accuracy": 0.80, "dataset": "ImageNet"}
)
print(f"Model saved: {saved_model.tag}")
```

```bash
python /workspace/build_bento.py

# List saved models
bentoml models list

# Build a Bento (requires the bentofile.yaml shown below)
bentoml build
```

### bentofile.yaml

```yaml
service: "service:svc"
labels:
  owner: "ml-team"
  stage: "production"
include:
  - "*.py"
python:
  packages:
    - torch
    - torchvision
    - transformers
    - Pillow
    - numpy
docker:
  python_version: "3.11"
  cuda_version: "12.1"
  system_packages:
    - libgl1
```

```bash
bentoml build

# List built Bentos
bentoml list

# Containerize
bentoml containerize image_classifier_service:latest \
    --image-tag YOUR_DOCKERHUB_USERNAME/my-bento:latest
```

***

## Monitoring and metrics

BentoML exposes Prometheus metrics at `/metrics`:

```bash
curl http://<clore-host>:<public-port-3000>/metrics
```

Key metrics:

```
# Request count (counter)
bentoml_service_request_total{endpoint="classify", http_status_code="200"}
# Latency
bentoml_service_request_duration_seconds{endpoint="classify"}
# Runner throughput
bentoml_runner_request_total{runner_name="image_classifier"}
```
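
To read these counters from a script rather than a Prometheus server, the text returned by `/metrics` can be parsed directly. The sketch below handles the common `name{labels} value` lines of the Prometheus text format and skips comment lines; it is a simplified parser written for this guide (the function names are illustrative), and the metric names in the usage note are the ones listed above.

```python
import re
from typing import Dict, Tuple

# One sample line: metric_name{label="value",...} number [timestamp]
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> Dict[Tuple[str, frozenset], float]:
    """Parse Prometheus text exposition into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = frozenset(_LABEL.findall(raw_labels or ""))
        samples[(name, labels)] = float(value)
    return samples


def total(samples, name: str, **wanted: str) -> float:
    """Sum every sample of `name` whose labels include all of `wanted`."""
    return sum(
        value
        for (metric, labels), value in samples.items()
        if metric == name and set(wanted.items()) <= labels
    )
```

For example, after fetching the body of `/metrics` into `text`, `total(parse_metrics(text), "bentoml_service_request_total", endpoint="classify")` adds up the request counters for that endpoint across all status codes.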

***

## Adaptive micro-batching configuration

```python
# Tune batching behavior
image_runner = bentoml.Runner(
    ImageClassifierRunnable,
    name="image_classifier",
    max_batch_size=64,          # Max requests per batch
    max_latency_ms=50,          # Max wait before dispatch (ms)
)
```
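
To make the two parameters concrete, here is a small `asyncio` sketch of the general idea behind micro-batching: requests queue up, and a batch is dispatched when it reaches `max_batch_size` or when the wait window set by `max_latency_ms` expires. This is an illustration of the technique only, not BentoML's implementation (which also adapts to observed latency); the `MicroBatcher` class and the `demo` function are hypothetical, and error handling is omitted.

```python
import asyncio
from typing import Awaitable, Callable, List


class MicroBatcher:
    """Group single requests into batches (illustrative, not BentoML's code)."""

    def __init__(self, handler: Callable[[List[str]], Awaitable[List[str]]],
                 max_batch_size: int, max_latency_ms: float):
        self.handler = handler
        self.max_batch_size = max_batch_size
        self.max_wait = max_latency_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: str) -> str:
        """Enqueue one item and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self) -> None:
        """Dispatch a batch when it is full or the wait window expires."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first item arrives
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = await self.handler([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def demo() -> List[int]:
    """Send 10 concurrent requests and return the batch sizes that were used."""
    sizes: List[int] = []

    async def handler(batch: List[str]) -> List[str]:
        sizes.append(len(batch))  # record how many requests shared this call
        return [text.upper() for text in batch]

    batcher = MicroBatcher(handler, max_batch_size=4, max_latency_ms=50)
    worker = asyncio.create_task(batcher.run())
    outputs = await asyncio.gather(*(batcher.submit(f"req{i}") for i in range(10)))
    worker.cancel()
    assert outputs == [f"REQ{i}" for i in range(10)]
    return sizes


print(asyncio.run(demo()))
```

The printed list shows how the ten requests were grouped: no batch exceeds four items, and the last partial batch is sent once the 50 ms window runs out. Raising `max_batch_size` favors throughput, while lowering `max_latency_ms` favors response time under light load.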

***

## Troubleshooting

### The service won't start

```
ERROR - Failed to initialize the runner
```

**Solutions:**

* Check CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
* Check GPU VRAM: `nvidia-smi`
* Confirm that the model download completed (look for download progress in the logs)

### Port 3000 not reachable

```bash
# Make sure the service binds to 0.0.0.0 (not localhost)
bentoml serve service:svc --host 0.0.0.0 --port 3000
```

### High latency on the first request

This is normal: the first request pays a one-time warm-up cost (model loading and GPU initialization), so later requests are faster. Send a warm-up request after startup:

```bash
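# Warm up after startup with a real inference request (/healthz alone does not run the model)
sleep 10 && curl -s -o /dev/null -X POST http://localhost:3000/classify -H "Content-Type: text/plain" -d "warm-up"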
```
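
A fixed `sleep 10` can be too short for large models and needlessly long for small ones. One alternative is to poll a health endpoint until the server answers, then send the warm-up request. The sketch below is a generic retry loop using the standard library; the helper names are illustrative, and the usage note assumes the server exposes `/readyz` on port 3000.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30,
                     delay_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` until it succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no need to sleep after the last try
            sleep(delay_s)
    return False
```

A typical call is `wait_until_ready(lambda: http_ok("http://localhost:3000/readyz"))`, followed by one real request to the model endpoint. Passing the probe and the sleep function as arguments keeps the loop easy to test without a running server.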

### Import errors

```
ModuleNotFoundError: No module named 'transformers'
```

**Solution:**

```bash
pip install transformers accelerate
```

***

## GPU recommendations on Clore.ai

BentoML is a serving framework, so GPU requirements depend entirely on the model you deploy. Here is what to expect for common workloads:

| GPU       | VRAM  | Price on Clore.ai | LLM throughput (7B Q4) | Diffusion (SDXL) | Vision (ResNet50) |
| --------- | ----- | ----------------- | ---------------------- | ---------------- | ----------------- |
| RTX 3090  | 24 GB | \~$0.12/h         | \~80 tok/s             | \~4 img/min      | \~400 req/s       |
| RTX 4090  | 24 GB | \~$0.70/h         | \~140 tok/s            | \~8 img/min      | \~700 req/s       |
| A100 40GB | 40 GB | \~$1.20/h         | \~110 tok/s            | \~6 img/min      | \~1200 req/s      |
| A100 80GB | 80 GB | \~$2.00/h         | \~130 tok/s            | \~7 img/min      | \~1400 req/s      |

**Use-case guide:**

* **LLM API serving (7B–13B):** RTX 3090 (\~$0.12/h), the best price-performance in the table above
* **Image generation APIs:** RTX 3090 or RTX 4090, depending on throughput needs
* **Large models (34B–70B Q4):** A100 40GB (\~$1.20/h) for 34B models; a 70B Q4 model needs roughly 35–40 GB for weights alone, so the A100 80GB is the safer choice there
* **Multi-model serving in production:** A100 80GB for memory headroom

{% hint style="info" %}
BentoML's **adaptive micro-batching** is particularly effective on A100s: the hardware handles large batches efficiently, which yields more throughput per dollar than naively serving requests one at a time. For high-traffic APIs, an A100 40GB often offers better ROI than two RTX 4090s.
{% endhint %}
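
To compare the options on cost rather than raw speed, the hourly price and the LLM throughput column can be combined into a cost per million generated tokens. The sketch below uses the approximate figures from the table in this section and assumes the GPU is fully and continuously utilized at that single throughput figure, which real traffic and batching will change.

```python
# (price in USD per hour, LLM throughput in tokens per second) from the table
GPUS = {
    "RTX 3090": (0.12, 80),
    "RTX 4090": (0.70, 140),
    "A100 40GB": (1.20, 110),
    "A100 80GB": (2.00, 130),
}


def usd_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at full, sustained utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Print the GPUs from cheapest to most expensive per token.
for name, (price, tps) in sorted(GPUS.items(), key=lambda kv: usd_per_million_tokens(*kv[1])):
    print(f"{name:10s} ${usd_per_million_tokens(price, tps):.2f} per 1M tokens")
```

With these inputs the RTX 3090 comes out at about $0.42 per million tokens and the A100 80GB at about $4.27, which is why the smaller card is listed as the price-performance pick for 7B models that fit in 24 GB.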

***

## Useful resources

* [Official BentoML documentation](https://docs.bentoml.com)
* [BentoML on GitHub](https://github.com/bentoml/BentoML)
* [BentoML examples](https://github.com/bentoml/BentoML/tree/main/examples)
* [BentoML community on Slack](https://l.bentoml.com/join-slack-space)
* [BentoML gallery](https://www.bentoml.com/gallery)
* [Quickstart: serving LLMs](https://docs.bentoml.com/en/latest/get-started/quickstart.html)

