# DeepSpeed Training

Train large models efficiently with Microsoft DeepSpeed.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is DeepSpeed?

DeepSpeed enables:

* Training models that don't fit in GPU memory
* Multi-GPU and multi-node training
* ZeRO optimization (memory efficiency)
* Mixed precision training

## ZeRO Stages

| Stage         | What Is Partitioned or Offloaded | Trade-off                                 |
| ------------- | -------------------------------- | ----------------------------------------- |
| ZeRO-1        | Optimizer states partitioned     | Fastest, smallest memory saving           |
| ZeRO-2        | + Gradients partitioned          | Balanced speed and memory                 |
| ZeRO-3        | + Parameters partitioned         | Largest memory saving, most communication |
| ZeRO-Infinity | ZeRO-3 with CPU/NVMe offload     | Fits the largest models, slowest          |
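
To make the stages concrete, the sketch below estimates per-GPU memory for model states under each stage. It assumes the mixed-precision Adam accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name `zero_model_state_bytes` is made up for this guide. Activations, temporary buffers, and fragmentation are ignored, so treat the result as a lower bound.

```python
def zero_model_state_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for weights, gradients, and Adam states."""
    weights = 2 * num_params   # fp16 weights
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 weights + momentum + variance
    if stage >= 1:
        optim /= num_gpus      # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus      # ZeRO-2: also partition gradients
    if stage >= 3:
        weights /= num_gpus    # ZeRO-3: also partition parameters
    return weights + grads + optim


if __name__ == "__main__":
    # Stage 0 means plain data parallelism without ZeRO
    for stage in range(4):
        gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this gives about 120 GB without ZeRO, dropping to about 31.4, 16.6, and 1.9 GB for stages 1 to 3.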

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install deepspeed transformers datasets accelerate
```

## Installation

```bash
pip install deepspeed

# Verify installation
ds_report
```

## Basic Training

### DeepSpeed Config

**ds\_config.json:**

```json
{
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 100
        }
    },
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true
    }
}
```
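
DeepSpeed expects `train_batch_size` to equal the micro-batch size per GPU times `gradient_accumulation_steps` times the number of GPUs. When the micro-batch size is omitted, as in the config above, it is derived from the other two values. A minimal sketch of that arithmetic follows; the helper name `micro_batch_per_gpu` is hypothetical and not part of DeepSpeed.

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, world_size: int) -> int:
    """Derive the per-GPU micro-batch size implied by a DeepSpeed config."""
    denom = grad_accum_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"gradient_accumulation_steps * world_size = {denom}"
        )
    return train_batch_size // denom


if __name__ == "__main__":
    # Values from ds_config.json above: train_batch_size=32, accumulation=4
    for gpus in (1, 2, 4):
        print(f"{gpus} GPU(s): micro-batch {micro_batch_per_gpu(32, 4, gpus)}")
```

With the config above, one GPU processes micro-batches of 8 and four GPUs process micro-batches of 2. A GPU count that does not divide evenly (for example 3) is rejected.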

### Training Script

```python
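import deepspeed
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Placeholder data: replace with your own dataset of {"text": ...} samples
texts = ["DeepSpeed makes large-model training practical."] * 64
dataloader = DataLoader([{"text": t} for t in texts], batch_size=8, shuffle=True)
num_epochs = 1

# DeepSpeed initialization (optimizer and scheduler come from ds_config.json)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}

        # Causal LM loss: labels are the input IDs (the model shifts them internally)
        outputs = model_engine(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

        model_engine.backward(loss)  # applies loss scaling and accumulates gradients
        model_engine.step()          # updates weights at accumulation boundaries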
```

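## ZeRO Stage 2 Config

The `"auto"` values are filled in by the Hugging Face `Trainer` from its `TrainingArguments`. Replace them with concrete numbers when calling `deepspeed.initialize` directly.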

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true
    }
}
```

## ZeRO Stage 3 Config

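For models too large for ZeRO-2, Stage 3 also partitions the parameters. This config additionally offloads parameters and optimizer states to pinned CPU memory: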

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
```
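
The three `"auto"` bucket settings are resolved by the Hugging Face integration from the model's hidden size. The sketch below reproduces that substitution under the assumption that the defaults match those described in the Transformers DeepSpeed documentation (`hidden_size²`, 0.9 × `hidden_size²`, and 10 × `hidden_size`). These defaults may change between releases, and `resolve_zero3_auto` is a hypothetical helper, not a library function.

```python
import copy


def resolve_zero3_auto(config: dict, hidden_size: int) -> dict:
    """Return a copy of a ZeRO-3 config with "auto" bucket sizes filled in."""
    resolved = copy.deepcopy(config)
    zero = resolved["zero_optimization"]
    # Assumed defaults, taken from the Transformers DeepSpeed integration docs
    defaults = {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }
    for key, value in defaults.items():
        if zero.get(key) == "auto":
            zero[key] = value  # only overwrite values left as "auto"
    return resolved


if __name__ == "__main__":
    cfg = {
        "zero_optimization": {
            "stage": 3,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
        }
    }
    print(resolve_zero3_auto(cfg, hidden_size=4096)["zero_optimization"])
```

This is useful when you want to run the same config outside `Trainer`, where `"auto"` is not understood and concrete numbers are required.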

## With Hugging Face Transformers

### Trainer Integration

```python
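from transformers import Trainer, TrainingArguments

# Assumes `model`, `tokenizer`, and `train_dataset` are already defined.
# Values set both here and in the DeepSpeed config must agree, so point
# `deepspeed` at a config that uses "auto" (such as the ZeRO Stage 2 config).
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # newer Transformers releases name this `processing_class`
)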

trainer.train()
```

## Multi-GPU Training

### Launch Command

```bash
# Single node, 4 GPUs
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json

# Specific GPUs
deepspeed --include="localhost:0,1,2,3" train.py --deepspeed ds_config.json
```
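
These commands assume `train.py` accepts a `--deepspeed <config>` argument. By default the `deepspeed` launcher also passes `--local_rank` to every process it starts. Scripts built on Hugging Face `TrainingArguments` handle both flags already. For a custom script, a minimal sketch with plain `argparse` could look like this (the `parse_args` helper is illustrative):

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the launch commands pass to train.py."""
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument("--deepspeed", type=str, default=None,
                        help="Path to the DeepSpeed JSON config")
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="GPU index on this node, filled in by the launcher")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"config={args.deepspeed} local_rank={args.local_rank}")
```

The parsed path can then be handed to `deepspeed.initialize(config=args.deepspeed, ...)`.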

### With torchrun

```bash
torchrun --nproc_per_node=4 train.py --deepspeed ds_config.json
```

## Multi-Node Training

### Hostfile

**hostfile:**

```
node1 slots=4
node2 slots=4
```
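
Each hostfile line names a host reachable over SSH and the number of GPU slots on it. The sketch below is a small sanity check that reads this format and totals the slots. It is not DeepSpeed's own parser, and `parse_hostfile` is a hypothetical helper.

```python
def parse_hostfile(text: str) -> dict[str, int]:
    """Map each hostname to its slot (GPU) count."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, slots = line.split()
        key, _, value = slots.partition("=")
        if key != "slots":
            raise ValueError(f"expected 'slots=N', got {slots!r}")
        hosts[host] = int(value)
    return hosts


if __name__ == "__main__":
    hosts = parse_hostfile("node1 slots=4\nnode2 slots=4\n")
    print(hosts, "total GPUs:", sum(hosts.values()))
```

The total (8 for the hostfile above) is the world size to use when checking that `train_batch_size` divides evenly.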

### Launch

```bash
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json
```

### SSH Setup

```bash
# Ensure passwordless SSH between nodes
ssh-keygen -t rsa
ssh-copy-id user@node2
```

## Memory-Efficient Configs

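### 7B Model on 24GB GPU

Gradient checkpointing is not a DeepSpeed config key, so enable it on the model with `model.gradient_checkpointing_enable()` alongside this config.

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },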
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16
}
```

### 13B Model on 24GB GPU

Pair this config with gradient checkpointing enabled on the model (`model.gradient_checkpointing_enable()`); it is not set through the DeepSpeed JSON.

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "stage3_param_persistence_threshold": 0
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32
}
```
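
A quick back-of-the-envelope check shows why these configs rely on offloading. bf16 stores each parameter in 2 bytes, so the weights alone need about 14 GB for 7B parameters and 26 GB for 13B. That is before gradients, optimizer states, and activations are counted. The helper `bf16_weight_gb` is illustrative only.

```python
def bf16_weight_gb(num_params: float) -> float:
    """Size of the bf16 weights in gigabytes (2 bytes per parameter)."""
    return num_params * 2 / 1e9


for name, num_params in [("7B", 7e9), ("13B", 13e9)]:
    gb = bf16_weight_gb(num_params)
    verdict = "fit" if gb < 24 else "do not fit"
    print(f"{name}: {gb:.0f} GB of bf16 weights, which {verdict} in 24 GB")
```

The 13B weights exceed 24 GB on their own, which is why that config keeps parameters on the CPU and sets `stage3_param_persistence_threshold` to 0.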

## Gradient Checkpointing

Save memory by recomputing activations during the backward pass instead of storing them:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()
```

## Save and Load Checkpoints

### Save

```python
# Call on every rank: each process writes its own shard of the ZeRO state
model_engine.save_checkpoint("./checkpoints", tag="step_1000")
```

### Load

```python
model_engine.load_checkpoint("./checkpoints", tag="step_1000")
```
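
`save_checkpoint` also accepts a `client_state` dict, and `load_checkpoint` returns it together with the load path. That makes it easy to resume a step counter. In the sketch below, `engine` stands for the `model_engine` returned by `deepspeed.initialize`, and the helper names `save_step` and `resume_step` are hypothetical.

```python
def save_step(engine, save_dir: str, step: int) -> None:
    """Save a checkpoint tagged with the step and record the step in it."""
    engine.save_checkpoint(save_dir, tag=f"step_{step}", client_state={"step": step})


def resume_step(engine, save_dir: str, tag=None) -> int:
    """Load a checkpoint and return the step to resume from (0 if none found)."""
    load_path, client_state = engine.load_checkpoint(save_dir, tag=tag)
    if load_path is None:
        return 0  # nothing to resume from, start at the beginning
    return client_state.get("step", 0)
```

Call `resume_step(model_engine, "./checkpoints")` before the training loop and skip ahead to the returned step. With `tag=None`, DeepSpeed loads the tag recorded in the `latest` file.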

### Save HuggingFace Format

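```python
# Convert a DeepSpeed ZeRO checkpoint to Hugging Face format
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Pass the checkpoint root and the tag separately. `model` is the plain
# (unwrapped) Hugging Face model, kept on CPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints", tag="step_1000")
model.load_state_dict(state_dict)
model.save_pretrained("./hf_model")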
```

## Monitoring

### TensorBoard

```json
{
    "tensorboard": {
        "enabled": true,
        "output_path": "./logs",
        "job_name": "training_run"
    }
}
```

### Weights & Biases

```json
{
    "wandb": {
        "enabled": true,
        "project": "my_project"
    }
}
```

## Common Issues

### Out of Memory

Try ZeRO Stage 3 with CPU offloading and a micro-batch size of 1:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "train_micro_batch_size_per_gpu": 1
}
```

### Slow Training

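* Reduce or disable CPU offloading when the model fits in GPU memory
* Increase the micro-batch size until GPU memory is nearly full
* Use ZeRO Stage 2 instead of 3 if the parameters fit without partitioning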

### NCCL Errors

```bash
# Print NCCL diagnostics to locate the failing rank or transport
export NCCL_DEBUG=INFO
# Disable InfiniBand if the host has no working IB fabric
export NCCL_IB_DISABLE=1
```

## Performance Tips

| Tip                           | Effect            |
| ----------------------------- | ----------------- |
| Use bf16 over fp16            | Better stability  |
| Enable gradient checkpointing | Less memory       |
| Tune batch size               | Better throughput |
| Use NVMe offload              | Larger models     |

## Performance Comparison

| Model | GPUs    | ZeRO Stage | Training Speed  |
| ----- | ------- | ---------- | --------------- |
| 7B    | 1x A100 | ZeRO-3     | \~1000 tokens/s |
| 7B    | 4x A100 | ZeRO-2     | \~4000 tokens/s |
| 13B   | 4x A100 | ZeRO-3     | \~2000 tokens/s |
| 70B   | 8x A100 | ZeRO-3     | \~800 tokens/s  |

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
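
To turn the rate table into a budget, combine it with a throughput figure. The sketch below makes several assumptions: the throughput is the aggregate across all GPUs, the hourly rate is per GPU, the 100M-token run size is arbitrary, and the 4x A100 setup uses 80GB cards at the \~$0.25 rate listed above. `training_cost_usd` is a hypothetical helper.

```python
def training_cost_usd(total_tokens: float, tokens_per_second: float,
                      num_gpus: int, hourly_rate_per_gpu: float) -> float:
    """Estimate rental cost: wall-clock hours x GPUs x hourly rate per GPU."""
    hours = total_tokens / tokens_per_second / 3600
    return hours * num_gpus * hourly_rate_per_gpu


if __name__ == "__main__":
    # 100M tokens on 4x A100 80GB at ~4000 tokens/s and ~$0.25 per GPU-hour
    cost = training_cost_usd(100e6, 4000, 4, 0.25)
    print(f"Estimated cost: ~${cost:.2f}")
```

Under these assumptions the run takes about 7 hours and costs roughly $7. Actual throughput depends heavily on sequence length, offloading, and interconnect.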

## Next Steps

* [Fine-tune LLMs](https://docs.clore.ai/guides/training/finetune-llm) - LoRA training
* vLLM Inference - Deploy trained model
* [Hugging Face Guide](https://docs.clore.ai/guides/training/huggingface-transformers) - Transformers library
