# DeepSpeed Training

Train large models efficiently with Microsoft DeepSpeed.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit the [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed price) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter the startup command
5. Select a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment

### Accessing Your Server

* Find the connection details in **My Orders**
* Web interfaces: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What Is DeepSpeed?

DeepSpeed enables:

* Training models that do not fit in a single GPU's memory
* Training across multiple GPUs and multiple nodes
* ZeRO optimization (memory efficiency)
* Mixed-precision training

## ZeRO Stages

| Stage         | Memory savings               | Trade-off       |
| ------------- | ---------------------------- | --------------- |
| ZeRO-1        | Optimizer states partitioned | Fast            |
| ZeRO-2        | + Gradients partitioned      | Balanced        |
| ZeRO-3        | + Parameters partitioned     | Maximum savings |
| ZeRO-Infinity | CPU/NVMe offload             | Largest models  |
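
To make the table more concrete, the following sketch estimates the per-GPU memory taken by model states at each stage. It assumes mixed-precision Adam with the byte counts used in the ZeRO analysis: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The function name `zero_model_state_gb` is hypothetical, and activations, communication buffers, and fragmentation are ignored, so the results are best read as a lower bound.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU memory (GB, 1e9 bytes) for model states only.

    Assumption: mixed-precision Adam, i.e. 2 + 2 + 12 bytes per parameter.
    """
    weights = 2 * num_params   # fp16 parameters
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 master weights, Adam momentum and variance

    if stage >= 1:             # ZeRO-1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:             # ZeRO-2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:             # ZeRO-3 also partitions parameters
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


for stage in range(4):
    gb = zero_model_state_gb(7e9, num_gpus=4, stage=stage)
    print(f"7B model, 4 GPUs, ZeRO-{stage}: {gb:.1f} GB per GPU")
```

Stage 0 here means plain data parallelism, where every GPU holds a full copy of all model states.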

## Quick Deployment

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install deepspeed transformers datasets accelerate
```

## Installation

```bash
pip install deepspeed

# Verify the installation
ds_report
```

## Basic Training

### DeepSpeed Configuration

**ds\_config.json:**

```json
{
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 100
        }
    },
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true
    }
}
```
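
DeepSpeed requires that `train_batch_size` equals the micro-batch size per GPU times `gradient_accumulation_steps` times the number of GPUs. The config above fixes two of these and leaves the micro-batch size to be derived. The sketch below reproduces that arithmetic; the helper name `micro_batch_size_per_gpu` is hypothetical and is not part of the DeepSpeed API.

```python
def micro_batch_size_per_gpu(train_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Derive the per-GPU micro-batch size implied by a DeepSpeed config."""
    samples_divisor = grad_accum_steps * num_gpus
    if train_batch_size % samples_divisor != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"gradient_accumulation_steps * num_gpus = {samples_divisor}"
        )
    return train_batch_size // samples_divisor


# Values from the config above: train_batch_size=32, gradient_accumulation_steps=4
print(micro_batch_size_per_gpu(32, 4, num_gpus=1))  # 8
print(micro_batch_size_per_gpu(32, 4, num_gpus=4))  # 2
```

With three GPUs the division does not come out even, which is the kind of mismatch worth checking before renting a server.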

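### Training Script

```python
import deepspeed
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Initialize DeepSpeed; the optimizer and scheduler come from ds_config.json
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

# Placeholder data: replace with your own dataset of {"text": ...} records
num_epochs = 1
samples = [{"text": "DeepSpeed makes large-model training practical."}] * 64
dataloader = DataLoader(samples, batch_size=model_engine.train_micro_batch_size_per_gpu())

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}

        # Ignore padded positions when computing the loss
        labels = inputs["input_ids"].clone()
        labels[inputs["attention_mask"] == 0] = -100

        outputs = model_engine(**inputs, labels=labels)
        loss = outputs.loss

        model_engine.backward(loss)  # handles loss scaling for fp16
        model_engine.step()          # steps the optimizer at accumulation boundaries
```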

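## ZeRO Stage 2 Configuration

The `"auto"` values are resolved by the Hugging Face Trainer integration from `TrainingArguments`; with a plain `deepspeed.initialize` call, replace them with concrete numbers.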

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true
    }
}
```

## ZeRO Stage 3 Configuration

For large models:

```json
{
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
```

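## With Hugging Face Transformers

### Trainer Integration

The Trainer raises an error when explicit values in the DeepSpeed config disagree with `TrainingArguments`, so a config that uses `"auto"` values (such as the ZeRO stage 2 example) is the safer choice here.

```python
from transformers import Trainer, TrainingArguments

# model, tokenizer, and train_dataset are assumed to be defined earlier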

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,
    deepspeed="ds_config.json",
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

trainer.train()
```

## Multi-GPU Training

### Launch Command

```bash
# Single node, 4 GPUs
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json

# Specific GPUs
deepspeed --include="localhost:0,1,2,3" train.py --deepspeed ds_config.json
```

### With torchrun

```bash
torchrun --nproc_per_node=4 train.py --deepspeed ds_config.json
```
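
Both launchers hand `--deepspeed ds_config.json` to `train.py`, so a hand-written script has to accept it. By default the `deepspeed` launcher also appends a `--local_rank` argument, whereas `torchrun` exports the `LOCAL_RANK` environment variable. The sketch below shows one way a custom `train.py` might handle both; the function `parse_args` is hypothetical, and scripts built on the Hugging Face Trainer already get this behavior from their own argument parser.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a launcher passes to a hypothetical train.py."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--deepspeed", type=str, default=None,
                        help="Path to the DeepSpeed JSON config")
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="GPU index on this node, passed by the deepspeed launcher")
    args = parser.parse_args(argv)

    # torchrun does not pass --local_rank; it sets LOCAL_RANK instead
    if args.local_rank == -1:
        args.local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return args


# Simulate what the deepspeed launcher would pass to the process for GPU 2
args = parse_args(["--deepspeed", "ds_config.json", "--local_rank=2"])
print(args.deepspeed, args.local_rank)  # ds_config.json 2
```

The parsed path can then be passed as the `config` argument of `deepspeed.initialize`.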

## Multi-Node Training

### Hostfile

**hostfile:**

```
node1 slots=4
node2 slots=4
```
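
Each hostfile line names a node and the number of GPU slots it contributes, so the total number of processes is the sum of all slots. The sketch below parses the format shown above to compute that total before launching; `parse_hostfile` is a hypothetical helper, and its tolerance for blank lines and `#` comments is an assumption of this sketch.

```python
def parse_hostfile(text: str) -> dict[str, int]:
    """Map each host name to its GPU slot count."""
    hosts = {}
    for raw_line in text.splitlines():
        # Assumption: blank lines and '#' comments are skipped
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        name, slots_field = line.split()
        key, _, value = slots_field.partition("=")
        if key != "slots":
            raise ValueError(f"expected 'slots=<n>', got {slots_field!r}")
        hosts[name] = int(value)
    return hosts


hosts = parse_hostfile("node1 slots=4\nnode2 slots=4\n")
world_size = sum(hosts.values())
print(world_size)  # 8
```

The resulting world size of 8 is the GPU count to plug into the batch-size relation when choosing `train_batch_size`.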

### Launch

```bash
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json
```

### SSH Setup

```bash
# Set up passwordless SSH between the nodes
ssh-keygen -t rsa
ssh-copy-id user@node2
```

## Memory-Efficient Configurations

Gradient checkpointing is not a DeepSpeed config key, so it is omitted from the configs below; enable it on the model with `model.gradient_checkpointing_enable()` or `TrainingArguments(gradient_checkpointing=True)`.

### 7B Model on a 24 GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16
}
```

### 13B Model on a 24 GB GPU

```json
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "stage3_param_persistence_threshold": 0
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32
}
```

## Gradient Checkpointing

Save memory by recomputing activations during the backward pass instead of storing them:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()
```

## Saving and Loading Checkpoints

### Saving

```python
# DeepSpeed handles checkpointing; call this on every rank
model_engine.save_checkpoint("./checkpoints", tag="step_1000")
```

### Loading

```python
model_engine.load_checkpoint("./checkpoints", tag="step_1000")
```
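
When resuming, it helps to know which step the checkpoint belongs to. `save_checkpoint` accepts a `client_state` dictionary for such metadata, and `load_checkpoint` returns it together with the loaded path (which is `None` when nothing was found). The sketch below wraps both calls; the helper names `save_with_step` and `resume_step` are hypothetical, and `engine` stands for the object returned by `deepspeed.initialize`.

```python
def save_with_step(engine, ckpt_dir: str, step: int) -> None:
    """Save a checkpoint and record the training step alongside it."""
    engine.save_checkpoint(ckpt_dir, tag=f"step_{step}", client_state={"step": step})


def resume_step(engine, ckpt_dir: str, tag=None) -> int:
    """Load a checkpoint and return the step to resume from (0 if none exists).

    With tag=None, DeepSpeed picks the most recently saved tag.
    """
    load_path, client_state = engine.load_checkpoint(ckpt_dir, tag=tag)
    if load_path is None:
        return 0
    return client_state.get("step", 0)
```

A training loop could call `start = resume_step(model_engine, "./checkpoints")` once after initialization and skip ahead to that step.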

### Saving in Hugging Face Format

```python
# Convert a DeepSpeed ZeRO checkpoint into a Hugging Face model
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Pass the checkpoint root directory and the tag separately
state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints", tag="step_1000")
model = model.cpu()  # the consolidated fp32 weights are loaded on the CPU
model.load_state_dict(state_dict)
model.save_pretrained("./hf_model")
```

## Monitoring

### TensorBoard

```json
{
    "tensorboard": {
        "enabled": true,
        "output_path": "./logs",
        "job_name": "training_run"
    }
}
```

### Weights & Biases

```json
{
    "wandb": {
        "enabled": true,
        "project": "my_project"
    }
}
```
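
The monitoring fragments above are top-level sections of the same `ds_config.json`, not standalone files. As a minimal sketch, the helper below merges the TensorBoard section into an existing config file; the function name `enable_tensorboard` is hypothetical, and the default values are the ones shown above.

```python
import json
from pathlib import Path


def enable_tensorboard(config_path: str, output_path: str = "./logs",
                       job_name: str = "training_run") -> dict:
    """Add (or overwrite) the tensorboard section of a DeepSpeed config file."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["tensorboard"] = {
        "enabled": True,
        "output_path": output_path,
        "job_name": job_name,
    }
    # Write the merged config back, keeping all other sections untouched
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

Calling `enable_tensorboard("ds_config.json")` before launching leaves the optimizer, precision, and ZeRO sections as they were.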

## Common Problems

### Out of Memory

Try ZeRO stage 3 with CPU offload and a micro-batch size of 1:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "train_micro_batch_size_per_gpu": 1
}
```

### Slow Training

* Reduce CPU offload
* Increase the batch size
* Use ZeRO stage 2 instead of stage 3

### NCCL Errors

```bash
# Enable verbose NCCL logging to locate the failure
export NCCL_DEBUG=INFO
# Disable the InfiniBand transport if it is unavailable or misconfigured
export NCCL_IB_DISABLE=1
```

## Performance Tips

| Tip                           | Effect                     |
| ----------------------------- | -------------------------- |
| Use bf16 instead of fp16      | Better numerical stability |
| Enable gradient checkpointing | Lower memory use           |
| Tune the batch size           | Higher throughput          |
| Use NVMe offload              | Larger models              |

## Performance Comparison

| Model | GPUs    | ZeRO stage | Training throughput |
| ----- | ------- | ---------- | ------------------- |
| 7B    | 1x A100 | ZeRO-3     | \~1000 tokens/s     |
| 7B    | 4x A100 | ZeRO-2     | \~4000 tokens/s     |
| 13B   | 4x A100 | ZeRO-3     | \~2000 tokens/s     |
| 70B   | 8x A100 | ZeRO-3     | \~800 tokens/s      |

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check the* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current prices.*

**Saving money:**

* Use the **Spot** market for flexible workloads (often 30–50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across providers
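
Combining a throughput figure with an hourly rate gives a rough budget for a run. The sketch below does that arithmetic; the function name `training_cost_usd` is hypothetical, and it assumes the listed rate applies per GPU and that throughput stays constant for the whole run, neither of which is guaranteed on a real marketplace.

```python
def training_cost_usd(total_tokens: float, tokens_per_second: float,
                      num_gpus: int, usd_per_gpu_hour: float) -> tuple[float, float]:
    """Return (wall-clock hours, total cost in USD) for a training run."""
    hours = total_tokens / tokens_per_second / 3600
    cost = hours * num_gpus * usd_per_gpu_hour
    return hours, cost


# Example: 1B tokens on 4x A100 80GB at ~4000 tokens/s and ~$0.25 per GPU-hour
hours, cost = training_cost_usd(1e9, 4000, num_gpus=4, usd_per_gpu_hour=0.25)
print(f"{hours:.1f} h, ${cost:.2f}")  # 69.4 h, $69.44
```

Swapping in a Spot rate or a different throughput shows quickly how sensitive the budget is to each input.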

## Next Steps

* [Fine-tuning LLMs](/guides/guides_v2-de/training/finetune-llm.md) - LoRA training
* vLLM inference - serving the trained model
* [Hugging Face guide](/guides/guides_v2-de/training/huggingface-transformers.md) - Transformers library

