# Multi-GPU Setup

Run large AI models across multiple GPUs on CLORE.AI.

{% hint style="success" %}
Find multi-GPU servers at [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## When Do You Need Multi-GPU?

| Model Size | Single GPU Option | Multi-GPU Option |
| ---------- | ----------------- | ---------------- |
| ≤13B       | RTX 3090 (Q4)     | Not needed       |
| 30B        | RTX 4090 (Q4)     | 2x RTX 3090      |
| 70B        | A100 40GB (Q4)    | 2x RTX 4090      |
| 70B FP16   | -                 | 2x A100 80GB     |
| 100B+      | -                 | 4x A100 80GB     |
| 405B       | -                 | 8x A100 80GB     |
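
The rows above follow from weight size. A minimal sketch of the estimate (the 20% overhead factor for KV cache and activations is an assumption; real usage grows with context length):

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM need in GB: weight bytes plus ~20% for KV cache/activations.

    The overhead factor is an assumption; tune it for your context length.
    """
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

# 70B at Q4 (~4.5 bits/weight including quantization scales)
print(f"{estimate_vram_gb(70, 4.5):.0f} GB")  # too big for one 24GB card, fits 2x 24GB
```

The same arithmetic explains the FP16 rows: 70B at 16 bits is ~140 GB of weights alone, hence the 2x A100 80GB requirement.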

***

## Multi-GPU Concepts

### Tensor Parallelism (TP)

Split each layer's weight matrices across GPUs, so every GPU works on every token. Best for low-latency inference.

```
GPU 0: First half of every weight matrix
GPU 1: Second half of every weight matrix
```

**Pros:** Lowest latency, usually a single flag to enable

**Cons:** Heavy GPU-to-GPU traffic, so it needs a high-speed interconnect (NVLink)

### Pipeline Parallelism (PP)

Split the model into sequential stages of layers; micro-batches flow through the stages like an assembly line.

```
GPU 0: Layers 1-40  →  GPU 1: Layers 41-80
Batch 1 ───────────→   Batch 1 (while GPU 0 starts Batch 2)
```

**Pros:** Higher throughput, tolerates slower (PCIe) interconnects

**Cons:** Higher latency, pipeline bubbles, more complex

### Data Parallelism (DP)

Replicate the full model on every GPU; each copy processes a different slice of the data. Best for training.

```
GPU 0: Process batch A
GPU 1: Process batch B
```

**Pros:** Simple, near-linear scaling

**Cons:** Every GPU must hold a full copy of the model

***

## LLM Multi-GPU Setup

### vLLM (Recommended)

**2 GPUs:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --host 0.0.0.0
```

**4 GPUs:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --host 0.0.0.0
```

**8 GPUs (for 405B):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --host 0.0.0.0
```

### Ollama Multi-GPU

Ollama automatically uses multiple GPUs when available:

```bash
# Check available GPUs
nvidia-smi

# Ollama will auto-detect and use all GPUs
ollama run llama3.1:70b
```

**Limit to specific GPUs** (set the variable on the server process; `ollama run` is only a client):

```bash
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```

### Text Generation Inference (TGI)

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-70B-Instruct \
    --num-shard 2
```

### llama.cpp

```bash
# Specify GPU layers per device
./llama-server \
    -m llama-3.1-70b-q4.gguf \
    -ngl 999 \
    --split-mode layer \
    --tensor-split 0.5,0.5
```

***

## Image Generation Multi-GPU

### ComfyUI

Stock ComfyUI runs each workflow on a single GPU (selectable with `--cuda-device`). Using several GPUs usually means running one instance per GPU; per-node placement, such as moving the VAE to a second GPU to reduce VRAM pressure on the main model, requires custom nodes.

```bash
# One ComfyUI instance per GPU, on different ports
CUDA_VISIBLE_DEVICES=0 python main.py --port 8188 &
CUDA_VISIBLE_DEVICES=1 python main.py --port 8189 &
```

### Stable Diffusion WebUI

WebUI cannot split a single generation across GPUs (`--device-id` accepts one GPU). Instead, run one instance per GPU, each pinned in its own `webui-user.sh`:

```bash
# Instance for GPU 0
export COMMANDLINE_ARGS="--device-id 0 --port 7860"

# Instance for GPU 1 (separate launch)
export COMMANDLINE_ARGS="--device-id 1 --port 7861"
```

### FLUX Multi-GPU

```python
from diffusers import FluxPipeline
import torch

# device_map="balanced" shards pipeline components across available GPUs
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    device_map="balanced"
)

# Single-GPU alternatives:
# pipe.enable_model_cpu_offload()  # offload idle components to CPU RAM
# pipe.to("cuda:0")                # keep everything on one GPU
```

***

## Training Multi-GPU

### PyTorch Distributed

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = YourModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop as normal
```

**Launch:**

```bash
torchrun --nproc_per_node=2 train.py
```

### DeepSpeed

```python
import deepspeed

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config={
        "train_batch_size": 32,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2}
    }
)
```

**Launch:**

```bash
deepspeed --num_gpus=2 train.py
```

### Accelerate (HuggingFace)

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)
```

**Configure:**

```bash
accelerate config  # Interactive setup
accelerate launch train.py
```

### Kohya Training (LoRA)

```bash
# Multi-GPU LoRA training
accelerate launch --num_processes=2 train_network.py \
    --pretrained_model_name_or_path="model.safetensors" \
    --train_data_dir="./images" \
    --output_dir="./output"
```

***

## GPU Selection

### Check Available GPUs

```bash
# Overview: utilization, memory, running processes
nvidia-smi

# List GPUs with names and UUIDs
nvidia-smi -L

# Memory usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```

### Select Specific GPUs

**Environment variable:**

```bash
# Use only GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1
python your_script.py

# Use only GPU 2
export CUDA_VISIBLE_DEVICES=2
python your_script.py
```

**In Python:**

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Or with torch
import torch
device = torch.device("cuda:0")  # First visible GPU
device = torch.device("cuda:1")  # Second visible GPU
```

***

## Performance Optimization

### NVLink vs PCIe

| Connection | Bandwidth | Best For           |
| ---------- | --------- | ------------------ |
| NVLink     | 600 GB/s  | Tensor parallelism |
| PCIe 4.0   | 32 GB/s   | Data parallelism   |
| PCIe 5.0   | 64 GB/s   | Mixed workloads    |

**Check NVLink status:**

```bash
nvidia-smi nvlink --status
```

### Optimal Configuration

| GPUs | TP Size | PP Size | Notes                  |
| ---- | ------- | ------- | ---------------------- |
| 2    | 2       | 1       | Simple tensor parallel |
| 4    | 4       | 1       | Requires NVLink        |
| 4    | 2       | 2       | PCIe-friendly          |
| 8    | 8       | 1       | Full tensor parallel   |
| 8    | 4       | 2       | Mixed parallelism      |
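
The PCIe-friendly 4-GPU row (TP=2, PP=2) can be expressed directly in vLLM, which exposes both parallelism degrees; the model name here is illustrative:

```bash
# 4 GPUs without NVLink: 2-way tensor parallel inside each of 2 pipeline stages
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --host 0.0.0.0
```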

### Memory Balancing

**Even split (default):**

```bash
--tensor-parallel-size 2
```

**Custom split (uneven GPUs):**

```bash
# vLLM requires an even split; for mismatched GPUs use llama.cpp:
./llama-server --tensor-split 0.6,0.4
```

***

## Troubleshooting

### "NCCL Error"

```bash
# Set NCCL debug
export NCCL_DEBUG=INFO

# Try different NCCL algorithms
export NCCL_ALGO=Ring
```

### "Out of Memory on GPU X"

```bash
# Check memory per GPU
nvidia-smi

# Reduce concurrency (vLLM flag)
--max-num-seqs 1

# Enable gradient checkpointing (training)
--gradient-checkpointing
```

### "Slow Multi-GPU Performance"

1. Check NVLink connectivity
2. Reduce tensor parallel size
3. Use pipeline parallelism instead
4. Check CPU bottleneck
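
For steps 1 and 2, the interconnect matrix is the quickest check; it shows whether each GPU pair communicates over NVLink or only over PCIe paths:

```bash
# Matrix of GPU-to-GPU links: NV# = NVLink, PIX/PHB/SYS = PCIe/host paths
nvidia-smi topo -m
```

If every pair shows `SYS` or `PHB`, prefer pipeline parallelism or a smaller tensor-parallel size.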

### "GPUs Not Detected"

```bash
# Verify CUDA
nvidia-smi

# Check PyTorch sees GPUs
python -c "import torch; print(torch.cuda.device_count())"

# Reinstall CUDA drivers if needed
```

***

## Cost Optimization

### When Multi-GPU is Worth It

| Scenario           | Single GPU           | Multi-GPU               | Winner          |
| ------------------ | -------------------- | ----------------------- | --------------- |
| 70B occasional use | A100 80GB ($0.25/hr) | 2x RTX 4090 ($0.20/hr)  | Multi           |
| 70B production     | A100 40GB ($0.17/hr) | 2x A100 40GB ($0.34/hr) | Single (Q4)     |
| Training 7B        | RTX 4090 ($0.10/hr)  | 2x RTX 4090 ($0.20/hr)  | Depends on time |
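
The "depends on time" row comes down to scaling efficiency: doubling GPUs doubles the hourly rate but rarely halves wall-clock time. A minimal sketch (the 90% efficiency figure is an assumption; measure your own workload):

```python
def total_cost(rate_per_gpu_hr, n_gpus, single_gpu_hours, scaling_eff=0.9):
    """Job cost when n_gpus cut wall-clock time by a factor of n * scaling_eff."""
    hours = single_gpu_hours / (n_gpus * scaling_eff)
    return rate_per_gpu_hr * n_gpus * hours

# 10-hour training job at $0.10/hr per RTX 4090
single = total_cost(0.10, 1, 10, scaling_eff=1.0)  # $1.00 over 10 h
dual   = total_cost(0.10, 2, 10)                   # ~$1.11 over ~5.6 h
```

At 90% efficiency you pay roughly 11% more to finish in a bit over half the time, so multi-GPU wins only when wall-clock time matters.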

### Cost-Effective Configurations

| Use Case           | Configuration | \~Cost/hr |
| ------------------ | ------------- | --------- |
| 70B inference      | 2x RTX 3090   | $0.12     |
| 70B fast inference | 2x A100 40GB  | $0.34     |
| 70B FP16           | 2x A100 80GB  | $0.50     |
| Training 13B       | 2x RTX 4090   | $0.20     |

***

## Example Configurations

### 70B Chat Server

```bash
# 2x A100 40GB setup
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
```

### DeepSeek-V3 (671B)

```bash
# 8x 80GB-class GPUs at minimum; the full ~671B FP8 weights exceed 640 GB, so a quantized variant is needed
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0
```

### Image + LLM Pipeline

```bash
# GPU 0: Stable Diffusion
CUDA_VISIBLE_DEVICES=0 python comfyui/main.py --port 8188 &

# GPU 1: LLM for prompts
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct --port 8000
```

***

## Next Steps

* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) - Production LLM serving
* [GPU Comparison](https://docs.clore.ai/guides/getting-started/gpu-comparison) - Choose your GPUs
* [API Integration](https://docs.clore.ai/guides/advanced/api-integration) - Build applications
* [Cost Calculator](https://docs.clore.ai/guides/getting-started/cost-calculator) - Estimate costs
