# DeepSpeed Training

Train large models efficiently with Microsoft DeepSpeed.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit the [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set the ports (TCP for SSH, HTTP for web interfaces)
   * Add environment variables if needed
   * Enter the startup command
5. Select a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment

### Access your server

* Find the connection details under **My Orders**
* Web interfaces: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is DeepSpeed?

DeepSpeed enables:

* Training models that do not fit in the memory of a single GPU
* Multi-GPU and multi-node training
* ZeRO optimization (memory efficiency)
* Mixed-precision training

## ZeRO stages

| Stage         | Memory savings               | Trade-off              |
| ------------- | ---------------------------- | ---------------------- |
| ZeRO-1        | Partitioned optimizer states | Fast                   |
| ZeRO-2        | + Partitioned gradients      | Balanced               |
| ZeRO-3        | + Partitioned parameters     | Maximum memory savings |
| ZeRO-Infinity | Offload to CPU/NVMe          | Fits the largest models |
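
To get a feel for what each stage saves, the sketch below applies the per-GPU model-state accounting commonly used for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for the fp32 optimizer states. The function name and the 7B / 8-GPU figures are illustrative assumptions, and activations, temporary buffers, and fragmentation are not counted, so treat the numbers as a lower bound.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for model states with mixed-precision Adam.

    Hypothetical helper for illustration; stage 0 means plain data parallelism.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params_bytes = 2.0 * num_params   # 16-bit weights
    grads_bytes = 2.0 * num_params    # 16-bit gradients
    optim_bytes = 12.0 * num_params   # fp32 master weights, momentum, variance

    if stage >= 1:
        optim_bytes /= num_gpus       # ZeRO-1 partitions optimizer states
    if stage >= 2:
        grads_bytes /= num_gpus       # ZeRO-2 also partitions gradients
    if stage >= 3:
        params_bytes /= num_gpus      # ZeRO-3 also partitions parameters

    return (params_bytes + grads_bytes + optim_bytes) / 1e9


if __name__ == "__main__":
    # Assumed example: a 7B-parameter model on 8 GPUs
    for stage in range(4):
        gb = zero_model_state_gb(7e9, num_gpus=8, stage=stage)
        print(f"stage {stage}: {gb:.2f} GB per GPU")
```

The estimate shows why ZeRO-3 is the usual choice when the weights alone do not fit on one device: only at stage 3 does the per-GPU footprint shrink in proportion to the number of GPUs.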

## Quick deploy

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install deepspeed transformers datasets accelerate
```

## Installation

```bash
pip install deepspeed

# Verify the installation
ds_report
```

## Basic training

### DeepSpeed configuration

**ds\_config.json:**

```json
{
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 100
        }
    },
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true
    }
}
```
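
DeepSpeed expects `train_batch_size` to equal the per-GPU micro-batch size times `gradient_accumulation_steps` times the number of GPUs. The sketch below works out the micro-batch implied by the config above for a given GPU count; the helper name is hypothetical and the GPU counts are assumed examples.

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, world_size: int) -> int:
    """Return the per-GPU micro-batch size implied by the global batch size.

    Hypothetical helper: raises if the global batch cannot be split evenly.
    """
    denominator = grad_accum_steps * world_size
    if denominator <= 0:
        raise ValueError("grad_accum_steps and world_size must be positive")
    if train_batch_size % denominator != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"grad_accum_steps * world_size = {denominator}"
        )
    return train_batch_size // denominator


if __name__ == "__main__":
    # Values from ds_config.json above: train_batch_size=32, gradient_accumulation_steps=4
    for gpus in (1, 2, 4, 8):
        print(f"{gpus} GPU(s): micro-batch {micro_batch_per_gpu(32, 4, gpus)}")
```

A GPU count that does not divide the global batch evenly (for example 3 GPUs with the values above) fails this check, which is a common cause of batch-size errors at startup.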

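### Training script

```python
import deepspeed
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token; reuse EOS so padding=True works
tokenizer.pad_token = tokenizer.eos_token

# Wrap the model in a DeepSpeed engine (optimizer and scheduler come from the config)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

# Placeholder data (assumption): replace with your own dataset of {"text": ...} records.
# For multi-GPU runs, shard the data across ranks with a DistributedSampler.
train_data = [{"text": "DeepSpeed makes large-model training practical."}] * 256
dataloader = DataLoader(
    train_data,
    batch_size=model_engine.train_micro_batch_size_per_gpu(),
    shuffle=True,
)
num_epochs = 3

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}

        # For causal LM training the labels are the input IDs (the model shifts them)
        outputs = model_engine(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

        # DeepSpeed handles loss scaling, gradient accumulation, and the optimizer step
        model_engine.backward(loss)
        model_engine.step()
```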

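## ZeRO Stage 2 configuration

The `"auto"` values are filled in by the Hugging Face Trainer integration from `TrainingArguments`. When calling `deepspeed.initialize` directly, replace them with explicit numbers.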

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true
    }
}
```

## ZeRO Stage 3 configuration

For large models, with optimizer states and parameters offloaded to CPU:

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
```
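
When several variants of these configs are needed, it can be easier to generate them than to edit JSON by hand. The following is a minimal sketch, not part of DeepSpeed: `build_zero_config` and the output file name are hypothetical, and it only emits keys that already appear in the configs on this page, with explicit numbers instead of `"auto"` so the result also works outside the Hugging Face Trainer.

```python
import json


def build_zero_config(
    stage: int,
    cpu_offload: bool = False,
    use_bf16: bool = False,
    micro_batch_size: int = 1,
    grad_accum_steps: int = 1,
) -> dict:
    """Build a small DeepSpeed config dict for ZeRO stage 1, 2, or 3 (hypothetical helper)."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")

    zero = {"stage": stage, "overlap_comm": True, "contiguous_gradients": True}
    if cpu_offload:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        if stage == 3:
            # Parameter offload only applies to stage 3, which partitions parameters
            zero["offload_param"] = {"device": "cpu", "pin_memory": True}

    precision_key = "bf16" if use_bf16 else "fp16"
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        precision_key: {"enabled": True},
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    config = build_zero_config(stage=3, cpu_offload=True, use_bf16=True, grad_accum_steps=16)
    # Assumed output path; pass it to the launcher via --deepspeed
    with open("ds_config_stage3.json", "w") as f:
        json.dump(config, f, indent=4)
    print(json.dumps(config, indent=4))
```

Keeping the config in code makes it straightforward to sweep over stages or offload settings while tuning for a particular GPU.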

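## With Hugging Face Transformers

### Trainer integration

The example assumes `model`, `tokenizer`, and `train_dataset` are already defined. Values in the DeepSpeed config must agree with `TrainingArguments`; setting them to `"auto"` in the config lets the Trainer fill them in.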

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

trainer.train()
```

## Multi-GPU training

### Launch command

```bash
# Single node, 4 GPUs
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json

# Specific GPUs
deepspeed --include="localhost:0,1,2,3" train.py --deepspeed ds_config.json
```

### With torchrun

```bash
torchrun --nproc_per_node=4 train.py --deepspeed ds_config.json
```

## Multi-node training

### Hostfile

**hostfile:**

```
node1 slots=4
node2 slots=4
```
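
Each hostfile line names a node and the number of GPU slots it contributes, so the total number of processes is the sum of the slots. As a quick sanity check before launching, the sketch below parses a hostfile in the format shown above; `parse_hostfile` is a hypothetical helper, not a DeepSpeed API.

```python
def parse_hostfile(text: str) -> dict[str, int]:
    """Parse 'hostname slots=N' lines into a {hostname: slots} mapping."""
    slots_by_host: dict[str, int] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("slots="):
            raise ValueError(f"unexpected hostfile line: {raw_line!r}")
        slots = parts[1][len("slots="):]
        if not slots.isdigit():
            raise ValueError(f"slots must be a non-negative integer: {raw_line!r}")
        slots_by_host[parts[0]] = int(slots)
    return slots_by_host


if __name__ == "__main__":
    hostfile_text = "node1 slots=4\nnode2 slots=4\n"
    hosts = parse_hostfile(hostfile_text)
    print(f"{len(hosts)} nodes, {sum(hosts.values())} GPUs in total")
```

The total GPU count is the world size to use when checking that `train_batch_size` divides evenly across devices.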

### Launch

```bash
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json
```

### SSH setup

```bash
# Set up passwordless SSH between nodes
ssh-keygen -t rsa
ssh-copy-id user@node2
```

## Memory-efficient configurations

The DeepSpeed config has no `gradient_checkpointing` key, so it is omitted from the JSON in this section; enable gradient checkpointing on the model instead, as shown in the Gradient Checkpointing section.

### 7B model on a 24 GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16
}
```

### 13B model on a 24 GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "stage3_param_persistence_threshold": 0
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32
}
```

## Gradient Checkpointing

Saves memory by recomputing activations during the backward pass instead of storing them, at the cost of extra compute:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()
```

## Saving and loading checkpoints

### Save

```python
# DeepSpeed manages the checkpoint files
model_engine.save_checkpoint("./checkpoints", tag="step_1000")
```

### Load

```python
model_engine.load_checkpoint("./checkpoints", tag="step_1000")
```

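### Save in Hugging Face format

```python
# Convert a DeepSpeed ZeRO checkpoint to Hugging Face format
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Pass the checkpoint root and the tag separately; the tag is the subfolder name
state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints", tag="step_1000")
model = model.cpu()  # the consolidated fp32 weights are loaded on CPU
model.load_state_dict(state_dict)
model.save_pretrained("./hf_model")
```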

## Monitoring

### TensorBoard

```json
{
    "tensorboard": {
        "enabled": true,
        "output_path": "./logs",
        "job_name": "training_run"
    }
}
```

### Weights & Biases

```json
{
    "wandb": {
        "enabled": true,
        "project": "my_project"
    }
}
```

## Common issues

### Out of memory

Try ZeRO Stage 3 with CPU offload and a micro-batch size of 1:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "train_micro_batch_size_per_gpu": 1
}
```

### Slow training

* Reduce CPU offloading
* Increase the batch size
* Use ZeRO Stage 2 instead of Stage 3

### NCCL errors

```bash
# Print NCCL diagnostics to help locate the failure
export NCCL_DEBUG=INFO
# Disable the InfiniBand transport so NCCL falls back to TCP sockets
export NCCL_IB_DISABLE=1
```

## Performance tips

| Tip                           | Effect                     |
| ----------------------------- | -------------------------- |
| Use bf16 instead of fp16      | Better numerical stability |
| Enable gradient checkpointing | Lower memory use           |
| Tune the batch size           | Better throughput          |
| Use NVMe offload              | Larger models              |

## Performance comparison

| Model | GPUs    | ZeRO stage | Training speed  |
| ----- | ------- | ---------- | --------------- |
| 7B    | 1x A100 | ZeRO-3     | \~1000 tokens/s |
| 7B    | 4x A100 | ZeRO-2     | \~4000 tokens/s |
| 13B   | 4x A100 | ZeRO-3     | \~2000 tokens/s |
| 70B   | 8x A100 | ZeRO-3     | \~800 tokens/s  |

## Cost estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check the* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
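
Combining a throughput figure with an hourly rate gives a rough budget for a run. The sketch below is a back-of-the-envelope calculation only: the function name is hypothetical, the hourly rate is assumed to apply per GPU, and the one-billion-token workload is an assumed example. The throughput and rate are the approximate values from the tables on this page.

```python
def estimate_training_cost(
    total_tokens: float,
    tokens_per_second: float,
    num_gpus: int,
    hourly_rate_per_gpu: float,
) -> tuple[float, float]:
    """Return (wall-clock hours, total cost in USD) for processing total_tokens.

    tokens_per_second is the aggregate throughput of the whole job.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    hours = total_tokens / tokens_per_second / 3600
    cost = hours * num_gpus * hourly_rate_per_gpu
    return hours, cost


if __name__ == "__main__":
    # Assumed workload: 1 billion tokens, 7B model on 4x A100 at ~4000 tokens/s,
    # with an A100 80GB rate of ~$0.25 per GPU-hour
    hours, cost = estimate_training_cost(1e9, 4000, num_gpus=4, hourly_rate_per_gpu=0.25)
    print(f"~{hours:.1f} hours, ~${cost:.2f}")
```

Actual cost depends on real throughput for your model and data, so it is worth measuring tokens per second on a short run before committing to a long one.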

**Save money:**

* Use the **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across providers

## Next steps

* [Fine-tune LLMs](/guides/guides_v2-es/entrenamiento/finetune-llm.md) - LoRA training
* vLLM inference - Deploy the trained model
* [Hugging Face guide](/guides/guides_v2-es/entrenamiento/huggingface-transformers.md) - Transformers library

