# ONNX Runtime GPU

> **Cross-platform, hardware-accelerated ML inference: deploy any model from any framework**

ONNX Runtime (ORT) is Microsoft's open-source inference engine for ONNX (Open Neural Network Exchange) models. It provides hardware-accelerated inference on CPUs, GPUs, and specialized accelerators through a unified API. Whether your model was trained in PyTorch, TensorFlow, Scikit-learn, or XGBoost, if you can export it to the ONNX format, ORT can run it, often faster than the original framework.

**GitHub:** [microsoft/onnxruntime](https://github.com/microsoft/onnxruntime) (14K+ ⭐)

***

## Why ONNX Runtime?

| Feature                | ONNX Runtime      | TorchScript    | TensorFlow Serving |
| ---------------------- | ----------------- | -------------- | ------------------ |
| Framework-agnostic     | ✅                 | ❌ PyTorch only | ❌ TF only          |
| GPU acceleration       | ✅ CUDA/TensorRT   | ✅              | ✅                  |
| INT8/FP16 quantization | ✅                 | Partial        | Partial            |
| Mobile/edge deployment | ✅                 | Limited        | Limited            |
| Operator fusion        | ✅                 | Partial        | ✅                  |
| Easy integration       | ✅ Python/C++/Java | Python         | Python/gRPC        |

{% hint style="success" %}
**Key benefit:** ONNX Runtime with the CUDA execution provider typically delivers a **1.5–3x speedup** over native PyTorch inference for computer vision and NLP models.
{% endhint %}

***

## Supported Execution Providers

ONNX Runtime supports multiple hardware backends, called Execution Providers (EPs):

| Provider                    | Hardware      | Use case              |
| --------------------------- | ------------- | --------------------- |
| `CUDAExecutionProvider`     | NVIDIA GPUs   | General GPU inference |
| `TensorrtExecutionProvider` | NVIDIA GPUs   | Maximum performance   |
| `CPUExecutionProvider`      | CPU           | Fallback / edge       |
| `ROCMExecutionProvider`     | AMD GPUs      | AMD hardware          |
| `CoreMLExecutionProvider`   | Apple Silicon | macOS/iOS             |
| `OpenVINOExecutionProvider` | Intel         | Intel CPUs/GPUs       |

***

## Prerequisites

* Clore.ai account with a GPU rental
* Basic Python knowledge
* A trained model (PyTorch, TensorFlow, or pre-exported ONNX)

***

## Step 1: Rent a GPU on Clore.ai

1. Go to [clore.ai](https://clore.ai) → **Marketplace**
2. Any NVIDIA GPU works, from an RTX 3070 for small models up to an A100 for large transformers
3. **For transformer models:** RTX 4090 or A100 recommended
4. **For computer vision:** RTX 3090 or RTX 4090 is sufficient

***

## Step 2: Deploy Your Container

ONNX Runtime has no official prebuilt container, but the NVIDIA CUDA base image is a good starting point:

**Docker image:**

```
nvcr.io/nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04
```

**Ports:**

```
22
```

**Environment variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
```

{% hint style="info" %}
Alternatively, use `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime`, which includes CUDA and a Python environment ready for installing ORT.
{% endhint %}

***

## Step 3: Install ONNX Runtime with GPU Support

```bash
ssh root@ -p

# Update packages
apt-get update && apt-get install -y \
    python3-pip \
    python3-dev \
    wget \
    git \
    libgomp1

# Install ONNX Runtime with CUDA support.
# onnxruntime-gpu wheels are built against specific CUDA/cuDNN versions; if the
# CUDA provider fails to load, compare the image's versions with the
# requirements table in the ORT CUDA execution provider docs.
pip install onnxruntime-gpu

# Install supporting packages
pip install \
    onnx \
    numpy \
    Pillow \
    transformers \
    torch \
    torchvision \
    fastapi \
    uvicorn

# Verify the installation
python3 << 'EOF'
import onnxruntime as ort
print(f"ORT Version: {ort.__version__}")
print(f"Available providers: {ort.get_available_providers()}")
# Should include: CUDAExecutionProvider, TensorrtExecutionProvider, CPUExecutionProvider
EOF
```

***

## Step 4: Export Your Model to ONNX

### Exporting a PyTorch Model

```python
import torch
import torch.nn as nn
import onnx

# Example: export ResNet50
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
model.eval()

# Create a dummy input (batch=1, 224x224 RGB image)
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    export_params=True,
    opset_version=17,          # Opset 17; pick one your ORT version supports
    do_constant_folding=True,  # Fold constant expressions at export time
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},   # Dynamic batch dimension
        "output": {0: "batch_size"}
    }
)
print("Model exported successfully!")

# Validate the exported model
onnx_model = onnx.load("resnet50.onnx")
onnx.checker.check_model(onnx_model)
print("The ONNX model is valid!")
```

### Exporting HuggingFace Transformers

```bash
# Install optimum for HuggingFace ONNX export (quoted so the brackets survive any shell)
pip install "optimum[exporters]"

# Export BERT for text classification
optimum-cli export onnx \
    --model bert-base-uncased \
    --task text-classification \
    ./bert_onnx/

# Export with optimization
optimum-cli export onnx \
    --model microsoft/phi-2 \
    --task text-generation \
    --optimize O2 \
    ./phi2_onnx/
```

### Export with ORT Optimization
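
ORT can also apply its graph optimizations offline and save the optimized model to disk, so the work is not repeated each time a session starts. The sketch below is one way to do that, assuming the `SessionOptions.optimized_model_filepath` setting described in the ONNX Runtime docs. The helper names `optimized_path` and `optimize_offline` and the `<stem>.<level>.onnx` naming scheme are hypothetical and not part of ORT.

```python
from pathlib import Path

# Maps a short level name to the attribute name on ort.GraphOptimizationLevel.
LEVELS = {
    "basic": "ORT_ENABLE_BASIC",
    "extended": "ORT_ENABLE_EXTENDED",
    "all": "ORT_ENABLE_ALL",
}


def optimized_path(model_path: str, level: str) -> str:
    """Hypothetical naming scheme: resnet50.onnx -> resnet50.extended.onnx."""
    p = Path(model_path)
    return str(p.with_name(f"{p.stem}.{level}{p.suffix}"))


def optimize_offline(model_path: str, level: str = "extended",
                     providers=("CPUExecutionProvider",)) -> str:
    """Run ORT graph optimization once and serialize the optimized model."""
    if level not in LEVELS:
        raise ValueError(f"unknown level {level!r}; expected one of {sorted(LEVELS)}")

    # Imported lazily so the helpers above can be used without ORT installed.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.graph_optimization_level = getattr(ort.GraphOptimizationLevel, LEVELS[level])
    out_path = optimized_path(model_path, level)
    opts.optimized_model_filepath = out_path

    # Creating the session triggers optimization and writes the file.
    ort.InferenceSession(model_path, sess_options=opts, providers=list(providers))
    return out_path
```

Calling `optimize_offline("resnet50.onnx")` would write `resnet50.extended.onnx` next to the original. Models saved at the extended or all levels can contain fusions specific to the provider they were optimized with, so it is safest to optimize on the same kind of hardware you deploy to. The Optimum-based block that follows covers the Hugging Face route.
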
```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Load the model and export it to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)

# Optimize the graph for GPU inference with FP16
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_for_gpu=True,
    fp16=True
)
optimizer.optimize(
    save_dir="./distilbert_optimized",
    optimization_config=optimization_config
)
```

***

## Step 5: Run Inference with ONNX Runtime

### Basic GPU Inference

```python
import onnxruntime as ort
import numpy as np
from PIL import Image
import torchvision.transforms as transforms

# Configure the session with GPU execution providers.
# Providers are tried in order: CUDA first, then CPU as the fallback.
providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB limit
        "cudnn_conv_algo_search": "EXHAUSTIVE",
        "do_copy_in_default_stream": True,
    }),
    "CPUExecutionProvider"
]

# Session options for performance
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 8
# The CUDA provider runs sequentially; ORT falls back to this mode if ORT_PARALLEL is requested
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Load the model
session = ort.InferenceSession(
    "resnet50.onnx",
    sess_options=opts,
    providers=providers
)
print(f"Running on: {session.get_providers()}")

# Prepare the input
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open("test_image.jpg").convert("RGB")
img_tensor = transform(img).unsqueeze(0).numpy()

# Run inference; the model returns raw logits
outputs = session.run(None, {"input": img_tensor})
logits = outputs[0][0]

# Convert logits to probabilities with a numerically stable softmax
exp = np.exp(logits - logits.max())
probabilities = exp / exp.sum()
top5_idx = probabilities.argsort()[-5:][::-1]
print("Top 5 predictions:", top5_idx, probabilities[top5_idx])
```

### Batch Inference for Throughput

```python
import onnxruntime as ort
import numpy as np
import time

session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CUDAExecutionProvider"]
)

# Warm up the GPU
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
for _ in range(10):
    session.run(None, {"input": dummy})

# Measure different batch sizes
for batch_size in [1, 4, 8, 16, 32, 64]:
    inputs = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)
    start = time.time()
    n_iter = 100
    for _ in range(n_iter):
        session.run(None, {"input": inputs})
    elapsed = time.time() - start
    throughput = (batch_size * n_iter) / elapsed
    latency = (elapsed / n_iter) * 1000  # ms
    print(f"Batch {batch_size:3d}: {throughput:7.1f} img/sec, {latency:.1f}ms/batch")
```

***

## Step 6: TensorRT Execution Provider (Maximum Performance)

On NVIDIA GPUs, the TensorRT EP can deliver even better performance:

```python
import onnxruntime as ort
import numpy as np

# TensorRT execution provider configuration
tensorrt_provider_options = {
    "trt_max_workspace_size": 4 * 1024 * 1024 * 1024,  # 4 GB
    "trt_fp16_enable": True,          # Enable FP16 for faster inference
    "trt_int8_enable": False,
    "trt_engine_cache_enable": True,  # Cache compiled engines
    "trt_engine_cache_path": "/tmp/trt_cache",
    "trt_max_partition_iterations": 1000,
    "trt_min_subgraph_size": 1,
    "trt_timing_cache_enable": True,
}

providers = [
    ("TensorrtExecutionProvider", tensorrt_provider_options),
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider"
]

session = ort.InferenceSession("resnet50.onnx", providers=providers)
print("Active provider:", session.get_providers()[0])

# The first run compiles the TensorRT engine (this can take 1–5 minutes).
# Later runs reuse the cached engine and start much faster.
```

{% hint style="warning" %}
**TensorRT engine compilation** happens on the first inference and can take 1–5 minutes. Enable caching (`trt_engine_cache_enable: True`) so the compiled engine is reused across sessions.
{% endhint %}

***

## Step 7: INT8 Quantization for Maximum Speed

```python
import glob

import numpy as np
import torchvision.transforms as T
from PIL import Image
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_dynamic,
    quantize_static,
)

# Dynamic INT8 quantization (no calibration data required)
quantize_dynamic(
    model_input="resnet50.onnx",
    model_output="resnet50_int8_dynamic.onnx",
    weight_type=QuantType.QInt8
)


# Static INT8 quantization (requires calibration data)
class ImageCalibrationReader(CalibrationDataReader):
    def __init__(self, data_dir, input_name="input"):
        self.data_dir = data_dir
        self.input_name = input_name
        self.images = self._load_images()
        self.idx = 0

    def _load_images(self):
        # Load up to 100 calibration images with the same preprocessing used at inference
        transform = T.Compose([
            T.Resize(256),
            T.CenterCrop(224),
            T.ToTensor(),
            T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])
        images = []
        for path in glob.glob(f"{self.data_dir}/*.jpg")[:100]:
            img = Image.open(path).convert("RGB")
            images.append(transform(img).numpy())
        return images

    def get_next(self):
        if self.idx >= len(self.images):
            return None
        # Add the batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
        data = {self.input_name: self.images[self.idx][np.newaxis, ...]}
        self.idx += 1
        return data


quantize_static(
    model_input="resnet50.onnx",
    model_output="resnet50_int8_static.onnx",
    calibration_data_reader=ImageCalibrationReader("/data/calibration_images"),
    quant_format=QuantFormat.QDQ,
    weight_type=QuantType.QInt8
)
```

***

## Step 8: Build an Inference API

```bash
cat > /workspace/onnx_api.py << 'EOF'
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
import onnxruntime as ort
import numpy as np
from PIL import Image
import io
import torchvision.transforms as transforms
import json

app = FastAPI(title="ONNX Runtime Inference API")

# Load the model at startup
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Load the ImageNet class labels
with open("imagenet_classes.json") as f:
    classes = json.load(f)

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@app.get("/health")
async def health():
    return {"status": "ok", "providers": session.get_providers()}

@app.post("/predict")
async def predict(file: UploadFile = File(...), topk: int = 5):
    image_data = await file.read()
    img = Image.open(io.BytesIO(image_data)).convert("RGB")
    tensor = transform(img).unsqueeze(0).numpy()

    # Scores are raw logits, sorted from highest to lowest
    outputs = session.run(None, {"input": tensor})[0][0]
    top_indices = outputs.argsort()[-topk:][::-1]

    results = [
        {"label": classes[str(i)], "score": float(outputs[i])}
        for i in top_indices
    ]
    return JSONResponse({"predictions": results})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

python3 /workspace/onnx_api.py &

# Test the API
curl -X POST "http://localhost:8080/predict" \
    -H "accept: application/json" \
    -F "file=@test_image.jpg"
```

***

## Step 9: Monitor GPU Usage

```bash
# Real-time GPU monitoring during inference
watch -n 0.5 nvidia-smi

# Or use nvitop for a nicer interface
pip install nvitop
nvitop
```

***

## Performance Benchmarks

| Model     | GPU      | Provider      | Throughput (inferences/sec) |
| --------- | -------- | ------------- | --------------------------- |
| ResNet50  | RTX 4090 | CUDA          | \~4,200                     |
| ResNet50  | RTX 4090 | TensorRT FP16 | \~8,500                     |
| BERT Base | RTX 4090 | CUDA          | \~380                       |
| BERT Base | RTX 4090 | TensorRT FP16 | \~720                       |
| YOLOv8n   | RTX 3090 | CUDA          | \~1,800                     |
| YOLOv8x   | A100     | TensorRT FP16 | \~920                       |

***

## Troubleshooting

### CUDA Provider Not Available

```bash
# Make sure the CUDA build of ORT is installed (not the CPU-only package)
pip uninstall -y onnxruntime
pip install onnxruntime-gpu
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

### TensorRT Compilation Errors

```bash
# Check which TensorRT version is installed
python3 -c "import tensorrt; print(tensorrt.__version__)"

# Or fall back to the CUDA provider in your Python code, skipping the TensorRT EP:
#   providers = ["CUDAExecutionProvider"]
```

### Shape Mismatch Errors

```python
# Inspect the model's input/output shapes (session is an existing InferenceSession)
for inp in session.get_inputs():
    print(f"Input: {inp.name}, shape: {inp.shape}, type: {inp.type}")
for out in session.get_outputs():
    print(f"Output: {out.name}, shape: {out.shape}, type: {out.type}")
```

***

## Advanced: Multi-Model Pipeline

```python
import onnxruntime as ort
import numpy as np

class MultiModelPipeline:
    def __init__(self):
        providers = ["CUDAExecutionProvider"]
        self.detector = ort.InferenceSession("detector.onnx", providers=providers)
        self.classifier = ort.InferenceSession("classifier.onnx", providers=providers)

    def run(self, image: np.ndarray) -> list:
        # Stage 1: object detection
        boxes = self.detector.run(None, {"image": image})[0]

        results = []
        for box in boxes:
            # Crop the detected region
            crop = self._crop(image, box)
            # Stage 2: classify each region
            label = self.classifier.run(None, {"input": crop})[0]
            results.append({"box": box.tolist(), "label": int(label.argmax())})
        return results

    def _crop(self, image, box):
        # Assumes an NCHW image and boxes given as (x1, y1, x2, y2) pixel coordinates
        x1, y1, x2, y2 = box.astype(int)
        return image[:, :, y1:y2, x1:x2]

pipeline = MultiModelPipeline()
```

***

## Additional Resources

* [ONNX Runtime on GitHub](https://github.com/microsoft/onnxruntime)
* [ONNX Runtime documentation](https://onnxruntime.ai/docs/)
* [Hugging Face Optimum](https://huggingface.co/docs/optimum/)
* [ONNX Model Zoo](https://github.com/onnx/models): pre-exported models
* [Netron](https://netron.app/): ONNX model visualizer
* [API
reference for ONNX Runtime in Python](https://onnxruntime.ai/docs/api/python/)

***

*ONNX Runtime on Clore.ai is an ideal choice for production inference services that need to serve models from different frameworks with maximum GPU efficiency.*

***

## GPU Recommendations on Clore.ai

| Use case               | Recommended GPU | Estimated cost on Clore.ai |
| ---------------------- | --------------- | -------------------------- |
| Development/Testing    | RTX 3090 (24GB) | \~$0.12/gpu/hr             |
| Production inference   | RTX 4090 (24GB) | \~$0.70/gpu/hr             |
| Large-scale deployment | A100 80GB       | \~$1.20/gpu/hr             |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse the available GPUs and rent by the hour, with no commitments and full root access.
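
The approximate throughput figures from the benchmark table and the hourly prices above can be combined into a rough cost per million inferences. The sketch below shows the arithmetic for the RTX 4090 rows, assuming the GPU is kept fully busy at the quoted throughput. The function name `cost_per_million` is hypothetical, and real costs depend on batch size, model, and utilization.

```python
def cost_per_million(price_per_hour: float, throughput_per_sec: float) -> float:
    """Return the USD cost of one million inferences at a sustained throughput."""
    if throughput_per_sec <= 0:
        raise ValueError("throughput_per_sec must be positive")
    inferences_per_hour = throughput_per_sec * 3600
    return price_per_hour / inferences_per_hour * 1_000_000


# Approximate figures quoted in this guide for an RTX 4090 at ~$0.70/hr.
RTX_4090_PRICE = 0.70
scenarios = {
    "ResNet50 / CUDA": 4200,
    "ResNet50 / TensorRT FP16": 8500,
    "BERT Base / CUDA": 380,
    "BERT Base / TensorRT FP16": 720,
}

for name, throughput in scenarios.items():
    cost = cost_per_million(RTX_4090_PRICE, throughput)
    print(f"{name:28s} ${cost:.3f} per 1M inferences")
```

Because cost scales inversely with throughput, the TensorRT FP16 rows come out at roughly half the cost of the CUDA rows on the same GPU.
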