DeepSpeed Training

Train large models efficiently with Microsoft DeepSpeed.


Renting on CLORE.AI

  1. Filter by GPU type, VRAM, and price

  2. Choose On-Demand (fixed rate) or Spot (bid price)

  3. Configure your order:

    • Select Docker image

    • Set ports (TCP for SSH, HTTP for web UIs)

    • Add environment variables if needed

    • Enter startup command

  4. Select payment: CLORE, BTC, or USDT/USDC

  5. Create order and wait for deployment

Access Your Server

  • Find connection details in My Orders

  • Web interfaces: Use the HTTP port URL

  • SSH: ssh -p <port> root@<proxy-address>

What is DeepSpeed?

DeepSpeed enables:

  • Training models that don't fit in GPU memory

  • Multi-GPU and multi-node training

  • ZeRO optimization (memory efficiency)

  • Mixed precision training

ZeRO Stages

| Stage | Memory Saving | Trade-off |
| --- | --- | --- |
| ZeRO-1 | Optimizer states partitioned | Fast |
| ZeRO-2 | + Gradients partitioned | Balanced |
| ZeRO-3 | + Parameters partitioned | Maximum savings |
| ZeRO-Infinity | CPU/NVMe offload | Largest models |
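
To make the stages concrete, here is a rough sketch of per-GPU memory for model states under mixed-precision Adam. It follows the accounting in the ZeRO paper (2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 optimizer states) and ignores activations, buffers, and fragmentation, so treat the results as a lower bound. The function name is illustrative.

```python
def model_state_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam states."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3 also partitions parameters
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * (weights + grads + optim)


for stage in range(4):
    gb = model_state_gb(7, 4, stage)
    print(f"7B model on 4 GPUs, stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 here means plain data parallelism, where every GPU holds a full copy of all model states.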

Quick Deploy

Docker Image:

Ports:

Command:

Installation
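
DeepSpeed is published on PyPI as `deepspeed` and expects PyTorch to be installed already, so `pip install deepspeed` is the usual route. The snippet below is a small sketch for checking that both packages are importable in the current environment; the helper name is illustrative.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the package can be found in this Python environment."""
    return importlib.util.find_spec(package) is not None


for name in ("torch", "deepspeed"):
    if is_installed(name):
        print(f"{name}: found")
    else:
        print(f"{name}: missing (try: pip install {name})")
```

After installing, the `ds_report` command that ships with DeepSpeed prints a summary of the detected CUDA, PyTorch, and compiled-op compatibility.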

Basic Training

DeepSpeed Config

ds_config.json:
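
A minimal sketch of a basic config, built as a Python dict and serialized to JSON. The keys are standard DeepSpeed config fields; the values (batch sizes, learning rate) are placeholder assumptions to adjust for your model.

```python
import json

ds_config = {
    "train_batch_size": 32,                # micro batch * accumulation * GPU count
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,      # 4 * 8 * 1 GPU = 32
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2e-5, "weight_decay": 0.01},
    },
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "gradient_clipping": 1.0,
    "steps_per_print": 100,
}


def save_config(config: dict, path: str = "ds_config.json") -> None:
    """Write the config to disk so the launcher or training script can read it."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)


print(json.dumps(ds_config, indent=2))
```

Call `save_config(ds_config)` to produce the file. DeepSpeed checks that `train_batch_size` equals micro batch size times accumulation steps times the number of GPUs, so update all three together.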

Training Script
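
A minimal sketch of a training loop built on `deepspeed.initialize`. It assumes `model(batch)` returns a scalar loss tensor and that each batch is a single tensor; both are simplifications for illustration, not requirements of DeepSpeed.

```python
def train(model, dataset, ds_config, epochs=1):
    """Run a basic DeepSpeed training loop.

    Assumes `model(batch)` returns a scalar loss (hypothetical interface).
    """
    import deepspeed  # imported lazily so the file can be read without a GPU stack

    # DeepSpeed wraps the model and builds the optimizer and dataloader
    # from the config (dict or path to a JSON file).
    engine, _, loader, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        training_data=dataset,
        config=ds_config,
    )

    for _ in range(epochs):
        for batch in loader:
            batch = batch.to(engine.device)
            loss = engine(batch)     # forward pass through the wrapped model
            engine.backward(loss)    # replaces loss.backward(); handles loss scaling
            engine.step()            # optimizer step, honoring gradient accumulation
    return engine
```

The key difference from a plain PyTorch loop is that `engine.backward` and `engine.step` replace `loss.backward()` and `optimizer.step()`. With fp16 or bf16 enabled, floating-point inputs may need casting to the matching dtype.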

ZeRO Stage 2 Config
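
A sketch of a Stage 2 config, which partitions optimizer states and gradients across GPUs. The keys are standard DeepSpeed fields; the bucket sizes and batch settings are assumptions to tune for your hardware.

```python
import json

zero2_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,            # overlap gradient reduction with backward
        "contiguous_gradients": True,    # reduce memory fragmentation
        "reduce_scatter": True,
        "reduce_bucket_size": 500_000_000,
        "allgather_bucket_size": 500_000_000,
    },
    "gradient_clipping": 1.0,
}

print(json.dumps(zero2_config, indent=2))
```

Smaller bucket sizes lower peak memory at the cost of more communication calls.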

ZeRO Stage 3 Config

For large models:
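
A sketch of a Stage 3 config, which additionally partitions the parameters themselves. The `stage3_*` keys are standard DeepSpeed fields; the numeric values are illustrative assumptions.

```python
import json

zero3_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_prefetch_bucket_size": 50_000_000,
        "stage3_param_persistence_threshold": 100_000,
        "stage3_max_live_parameters": 1_000_000_000,
        "stage3_max_reuse_distance": 1_000_000_000,
        # Gather full 16-bit weights when saving so the checkpoint is usable
        # outside DeepSpeed.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_clipping": 1.0,
}

print(json.dumps(zero3_config, indent=2))
```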

With Hugging Face Transformers

Trainer Integration
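
A sketch of wiring DeepSpeed into the Hugging Face `Trainer` through the `deepspeed` argument of `TrainingArguments`, which accepts a dict or a path to a JSON file. The `"auto"` values tell the integration to copy settings from `TrainingArguments`; the batch size and learning rate here are assumptions.

```python
hf_ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}


def build_training_args(output_dir="./output"):
    """Create TrainingArguments that enable DeepSpeed (needs transformers installed)."""
    from transformers import TrainingArguments  # lazy import

    return TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        bf16=True,
        deepspeed=hf_ds_config,
    )
```

Pass the result to `Trainer(model=..., args=..., train_dataset=...)` and call `trainer.train()` as usual, then start the script with a distributed launcher.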

Multi-GPU Training

Launch Command
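
A sketch that builds a single-node `deepspeed` launcher command. `--num_gpus` is a launcher flag; `--deepspeed_config` is a script argument that only works if `train.py` (a hypothetical script name) parses it, for example via `deepspeed.add_config_arguments`.

```python
import shlex


def deepspeed_command(script, num_gpus, config_path):
    """Build a single-node `deepspeed` launcher command as an argument list."""
    return [
        "deepspeed",
        f"--num_gpus={num_gpus}",
        script,
        "--deepspeed_config",
        config_path,
    ]


cmd = deepspeed_command("train.py", 4, "ds_config.json")
print(shlex.join(cmd))
```

Run the printed command in the server's shell, or pass `cmd` to `subprocess.run(cmd, check=True)`.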

With torchrun
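
DeepSpeed scripts can also be started with PyTorch's `torchrun`, since DeepSpeed reads the rank and world-size environment variables that `torchrun` sets. A sketch, with `train.py` as a hypothetical script name:

```python
import shlex


def torchrun_command(script, nproc_per_node, script_args=()):
    """Build a single-node `torchrun` command as an argument list."""
    return ["torchrun", f"--nproc_per_node={nproc_per_node}", script, *script_args]


cmd = torchrun_command("train.py", 4, ["--deepspeed_config", "ds_config.json"])
print(shlex.join(cmd))
```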

Multi-Node Training

Hostfile

hostfile:
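
A DeepSpeed hostfile lists one node per line as `hostname slots=N`, where N is the number of GPUs on that node. A sketch that renders one; the hostnames `worker-1` and `worker-2` are placeholders for your own servers.

```python
def render_hostfile(nodes):
    """Render a {hostname: gpu_count} mapping as DeepSpeed hostfile text."""
    return "".join(f"{host} slots={gpus}\n" for host, gpus in nodes.items())


hostfile_text = render_hostfile({"worker-1": 4, "worker-2": 4})
print(hostfile_text, end="")
```

Save the text to a file named `hostfile` on the node you launch from.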

Launch
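
A sketch of the multi-node launcher command, run from one node. The flags are standard `deepspeed` launcher options; the script name, node count, and port are assumptions.

```python
import shlex


def multinode_command(script, hostfile, num_nodes, gpus_per_node, master_port=29500):
    """Build a multi-node `deepspeed` launcher command as an argument list."""
    return [
        "deepspeed",
        f"--hostfile={hostfile}",
        f"--num_nodes={num_nodes}",
        f"--num_gpus={gpus_per_node}",
        f"--master_port={master_port}",
        script,
    ]


cmd = multinode_command("train.py", "hostfile", num_nodes=2, gpus_per_node=4)
print(shlex.join(cmd))
```

The launcher starts workers on the other nodes over SSH, so every host in the hostfile must be reachable without a password prompt and must have the same code and environment.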

SSH Setup

Memory-Efficient Configs

7B Model on 24GB GPU
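
In mixed precision, a 7B model needs roughly 112 GB for weights, gradients, and Adam states (16 bytes per parameter), so a 24 GB card can hold it only if most of that lives elsewhere. The sketch below assumes ZeRO-3 with both optimizer states and parameters offloaded to CPU RAM. Batch settings are illustrative, and whether training fits in practice depends on sequence length and available host memory.

```python
import json

config_7b_24gb = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_clipping": 1.0,
}

# Rough host-memory requirement for the offloaded model states.
params_billion = 7
host_ram_gb = params_billion * 16  # 16 bytes per parameter
print(f"Plan for roughly {host_ram_gb} GB of CPU RAM for model states")
print(json.dumps(config_7b_24gb, indent=2))
```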

13B Model on 24GB GPU
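
At 13B parameters the model states reach roughly 208 GB, which often exceeds host RAM as well. One option is ZeRO-Infinity with NVMe offload. The sketch below assumes a fast local NVMe drive mounted at `/local_nvme` (a placeholder path); the `aio` values are starting points, not tuned settings.

```python
import json


def nvme_offload_config(nvme_path):
    """Build a ZeRO-3 config that offloads optimizer states and parameters to NVMe."""
    return {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 32,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": nvme_path,
                "pin_memory": True,
            },
            "offload_param": {
                "device": "nvme",
                "nvme_path": nvme_path,
                "pin_memory": True,
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        # Asynchronous I/O settings used for NVMe reads and writes.
        "aio": {
            "block_size": 1048576,
            "queue_depth": 8,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True,
        },
    }


config_13b_24gb = nvme_offload_config("/local_nvme")
print(json.dumps(config_13b_24gb, indent=2))
```

NVMe offload is much slower than CPU offload, so use it only when the model does not fit otherwise.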

Gradient Checkpointing

Save memory by recomputing activations during the backward pass instead of storing them:
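
A sketch for Hugging Face models, which expose `gradient_checkpointing_enable()`. It assumes `model` is a `transformers` `PreTrainedModel`; for other models, `torch.utils.checkpoint` serves the same purpose.

```python
def enable_gradient_checkpointing(model):
    """Turn on activation recomputation for a Hugging Face model (assumed interface)."""
    model.gradient_checkpointing_enable()
    # The generation key/value cache is not useful during training and
    # conflicts with checkpointing, so switch it off if the config has it.
    config = getattr(model, "config", None)
    if config is not None and hasattr(config, "use_cache"):
        config.use_cache = False
    return model
```

Expect a slower step in exchange for a large drop in activation memory.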

Save and Load Checkpoints

Save
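
A sketch using the engine's `save_checkpoint` method. It assumes `engine` is the object returned by `deepspeed.initialize`; the tag format and the contents of `client_state` are illustrative choices.

```python
def save_training_state(engine, save_dir, step, epoch):
    """Save a DeepSpeed checkpoint plus a small dict of training progress."""
    client_state = {"step": step, "epoch": epoch}
    # Every rank must call this: each process writes its own shard
    # under save_dir/<tag>.
    engine.save_checkpoint(save_dir, tag=f"step_{step}", client_state=client_state)
    return client_state
```

Calling it only on rank 0 is a common mistake that makes the job hang, because saving is a collective operation.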

Load
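
A sketch using the engine's `load_checkpoint` method, which returns the loaded path and the client state stored at save time. It assumes the checkpoint stored a `step` entry in its client state, which is an illustrative convention.

```python
def resume_training_state(engine, load_dir, tag=None):
    """Restore a DeepSpeed checkpoint and return the step to resume from."""
    # With tag=None, DeepSpeed reads the most recent tag from the `latest` file.
    load_path, client_state = engine.load_checkpoint(load_dir, tag=tag)
    if load_path is None:
        return 0  # nothing found, start from scratch
    return (client_state or {}).get("step", 0)
```

Load with the same number of GPUs and ZeRO stage used for saving unless you have confirmed your DeepSpeed version supports resharding.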

Save HuggingFace Format
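
ZeRO checkpoints are sharded across ranks, so they must be consolidated before `from_pretrained` can read them. The sketch below uses DeepSpeed's `get_fp32_state_dict_from_zero_checkpoint` helper and assumes `model` is a plain, unwrapped Hugging Face model with enough CPU RAM for the full fp32 weights.

```python
def export_hf_checkpoint(model, checkpoint_dir, output_dir, tag=None):
    """Consolidate a ZeRO checkpoint and write it in Hugging Face format."""
    # Lazy import: only needed when the export actually runs.
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    # Merge the per-rank shards into one fp32 state dict on the CPU.
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=tag)
    model.save_pretrained(output_dir, state_dict=state_dict)
    return output_dir
```

Run this once after training, in a single process, rather than inside the distributed job.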

Monitoring

TensorBoard
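
DeepSpeed can write TensorBoard event files when the `tensorboard` section is enabled in the config. A sketch, with the output path and job name as placeholder values:

```python
import json

tensorboard_section = {
    "tensorboard": {
        "enabled": True,
        "output_path": "./logs/tensorboard",
        "job_name": "deepspeed_run",
    }
}

# Merge into an existing config dict before passing it to DeepSpeed.
ds_config = {"train_micro_batch_size_per_gpu": 4, **tensorboard_section}
print(json.dumps(ds_config, indent=2))
```

To view the logs remotely, run `tensorboard --logdir ./logs/tensorboard --port 6006 --bind_all` on the server and open the matching HTTP port from your order.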

Weights & Biases
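
DeepSpeed's monitor also supports Weights & Biases through a `wandb` config section. A sketch, with the project and group names as placeholders; it assumes the `wandb` package is installed and `WANDB_API_KEY` is set in the environment.

```python
import json

wandb_section = {
    "wandb": {
        "enabled": True,
        "project": "deepspeed-training",
        "group": "zero2-experiments",
    }
}

# Merge into an existing config dict before passing it to DeepSpeed.
ds_config = {"train_micro_batch_size_per_gpu": 4, **wandb_section}
print(json.dumps(ds_config, indent=2))
```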

Common Issues

Out of Memory
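
A common first response to an out-of-memory error is to shrink the per-GPU micro batch and raise gradient accumulation by the same factor, which keeps the global batch size (micro batch × accumulation steps × GPU count) unchanged. A sketch, with an illustrative helper name:

```python
def halve_micro_batch(config):
    """Return a copy with half the micro batch and double the accumulation steps."""
    micro = config["train_micro_batch_size_per_gpu"]
    if micro < 2 or micro % 2:
        raise ValueError("micro batch size must be an even number >= 2")
    updated = dict(config)
    updated["train_micro_batch_size_per_gpu"] = micro // 2
    updated["gradient_accumulation_steps"] = config.get("gradient_accumulation_steps", 1) * 2
    return updated


before = {"train_micro_batch_size_per_gpu": 8, "gradient_accumulation_steps": 2}
after = halve_micro_batch(before)
print(before, "->", after)
```

If the micro batch is already 1, the remaining options are gradient checkpointing, a higher ZeRO stage, or CPU/NVMe offload.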

Slow Training

  • Reduce CPU offloading

  • Increase batch size

  • Use ZeRO Stage 2 instead of 3

NCCL Errors
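
NCCL problems are usually diagnosed through environment variables. The sketch below sets a few standard ones; which of them helps depends on the host's network and GPU topology, and the interface name `eth0` in the example call is an assumption.

```python
import os


def apply_nccl_debug_env(interface=None, disable_p2p=False, disable_ib=False):
    """Set NCCL environment variables; call before the process group is created."""
    os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logging
    if interface:
        os.environ["NCCL_SOCKET_IFNAME"] = interface  # pin the network interface
    if disable_p2p:
        os.environ["NCCL_P2P_DISABLE"] = "1"          # avoid direct GPU-to-GPU transfers
    if disable_ib:
        os.environ["NCCL_IB_DISABLE"] = "1"           # fall back from InfiniBand to sockets


apply_nccl_debug_env(interface="eth0", disable_ib=True)
print({k: v for k, v in os.environ.items() if k.startswith("NCCL_")})
```

Disabling P2P or InfiniBand lowers throughput, so remove those settings once the root cause is fixed.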

Performance Tips

| Tip | Effect |
| --- | --- |
| Use bf16 over fp16 | Better stability |
| Enable gradient checkpointing | Less memory |
| Tune batch size | Better throughput |
| Use NVMe offload | Larger models |

Performance Comparison

| Model | GPUs | ZeRO Stage | Training Speed |
| --- | --- | --- | --- |
| 7B | 1x A100 | ZeRO-3 | ~1000 tokens/s |
| 7B | 4x A100 | ZeRO-2 | ~4000 tokens/s |
| 13B | 4x A100 | ZeRO-3 | ~2000 tokens/s |
| 70B | 8x A100 | ZeRO-3 | ~800 tokens/s |

Troubleshooting

Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU | Hourly Rate | Daily Rate | 4-Hour Session |
| --- | --- | --- | --- |
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.

Save money:

  • Use Spot market for flexible workloads (often 30-50% cheaper)

  • Pay with CLORE tokens

  • Compare prices across different providers

Next Steps
