# DeepSpeed Training

Train large models efficiently with Microsoft DeepSpeed.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is DeepSpeed?

DeepSpeed enables:

* Training models that don't fit in GPU memory
* Multi-GPU and multi-node training
* ZeRO optimization (memory efficiency)
* Mixed precision training

## ZeRO Stages

| Stage         | Memory Saving                | Speed           |
| ------------- | ---------------------------- | --------------- |
| ZeRO-1        | Optimizer states partitioned | Fast            |
| ZeRO-2        | + Gradients partitioned      | Balanced        |
| ZeRO-3        | + Parameters partitioned     | Maximum savings |
| ZeRO-Infinity | CPU/NVMe offload             | Largest models  |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install deepspeed transformers datasets accelerate
```

## Installation

```bash
pip install deepspeed

# Verify installation
ds_report
```

## Basic Training

### DeepSpeed Config

**ds\_config.json:**

```json
{
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 100
        }
    },
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true
    }
}
```

### Training Script

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# DeepSpeed initialization
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}

        outputs = model_engine(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

        model_engine.backward(loss)
        model_engine.step()
```

## ZeRO Stage 2 Config

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true
    }
}
```

## ZeRO Stage 3 Config

For large models:

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
```

## With Hugging Face Transformers

### Trainer Integration

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

trainer.train()
```

## Multi-GPU Training

### Launch Command

```bash

# Single node, 4 GPUs
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json

# Specific GPUs
deepspeed --include="localhost:0,1,2,3" train.py --deepspeed ds_config.json
```

### With torchrun

```bash
torchrun --nproc_per_node=4 train.py --deepspeed ds_config.json
```

## Multi-Node Training

### Hostfile

**hostfile:**

```
node1 slots=4
node2 slots=4
```

### Launch

```bash
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json
```

### SSH Setup

```bash

# Ensure passwordless SSH between nodes
ssh-keygen -t rsa
ssh-copy-id user@node2
```

## Memory-Efficient Configs

### 7B Model on 24GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "gradient_checkpointing": true,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16
}
```

### 13B Model on 24GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "stage3_param_persistence_threshold": 0
    },
    "gradient_checkpointing": true,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32
}
```

## Gradient Checkpointing

Save memory by recomputing activations:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()
```

## Save and Load Checkpoints

### Save

```python

# DeepSpeed handles checkpointing
model_engine.save_checkpoint("./checkpoints", tag="step_1000")
```

### Load

```python
model_engine.load_checkpoint("./checkpoints", tag="step_1000")
```

### Save HuggingFace Format

```python

# Convert DeepSpeed checkpoint to HF format
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints/step_1000")
model.load_state_dict(state_dict)
model.save_pretrained("./hf_model")
```

## Monitoring

### TensorBoard

```json
{
    "tensorboard": {
        "enabled": true,
        "output_path": "./logs",
        "job_name": "training_run"
    }
}
```

### Weights & Biases

```json
{
    "wandb": {
        "enabled": true,
        "project": "my_project"
    }
}
```

## Common Issues

### Out of Memory

```json
// Try:
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "train_micro_batch_size_per_gpu": 1
}
```

### Slow Training

* Reduce CPU offloading
* Increase batch size
* Use ZeRO Stage 2 instead of 3

### NCCL Errors

```bash

# Set environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
```

## Performance Tips

| Tip                           | Effect            |
| ----------------------------- | ----------------- |
| Use bf16 over fp16            | Better stability  |
| Enable gradient checkpointing | Less memory       |
| Tune batch size               | Better throughput |
| Use NVMe offload              | Larger models     |

## Performance Comparison

| Model | GPUs    | ZeRO Stage | Training Speed  |
| ----- | ------- | ---------- | --------------- |
| 7B    | 1x A100 | ZeRO-3     | \~1000 tokens/s |
| 7B    | 4x A100 | ZeRO-2     | \~4000 tokens/s |
| 13B   | 4x A100 | ZeRO-3     | \~2000 tokens/s |
| 70B   | 8x A100 | ZeRO-3     | \~800 tokens/s  |

## Troubleshooting

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [Fine-tune LLMs](/guides/training/finetune-llm.md) - LoRA training
* vLLM Inference - Deploy trained model
* [Hugging Face Guide](/guides/training/huggingface-transformers.md) - Transformers library


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/training/deepspeed-training.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.