# Multi-GPU Setup

Run large AI models across multiple GPUs on CLORE.AI.

{% hint style="success" %}
Find multi-GPU servers at [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## When Do You Need Multi-GPU?

| Model Size | Single GPU Option | Multi-GPU Option |
| ---------- | ----------------- | ---------------- |
| ≤13B       | RTX 3090 (Q4)     | Not needed       |
| 30B        | RTX 4090 (Q4)     | 2x RTX 3090      |
| 70B        | A100 40GB (Q4)    | 2x RTX 4090      |
| 70B FP16   | -                 | 2x A100 80GB     |
| 100B+      | -                 | 4x A100 80GB     |
| 405B       | -                 | 8x A100 80GB     |
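
A quick way to sanity-check these pairings: weights take roughly `parameters × bytes per parameter`, plus overhead for the KV cache and activations. A rough sketch (the ~30% overhead factor is an assumption, not a measurement):

```python
def estimated_vram_gb(params_billion, bytes_per_param, overhead=1.3):
    """Rough estimate: weights plus ~30% for KV cache and activations."""
    return params_billion * bytes_per_param * overhead

print(estimated_vram_gb(70, 0.5))   # 70B at Q4   -> ~45 GB: 2x 24 GB GPUs (tight on a single 40 GB card)
print(estimated_vram_gb(70, 2.0))   # 70B at FP16 -> ~182 GB: needs 2x A100 80GB or more
```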

***

## Multi-GPU Concepts

### Tensor Parallelism (TP)

Split each layer's weight matrices across GPUs, so every GPU works on every layer at the same time. Best for low-latency inference.

```
GPU 0: first half of each layer's weights
GPU 1: second half of each layer's weights
```

**Pros:** Lower latency, simple setup in vLLM/TGI **Cons:** Requires high-speed interconnect (GPUs communicate on every layer)
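
To make the idea concrete, here is a minimal sketch of a column-split matrix multiply on two GPUs, the core operation tensor-parallel frameworks perform on every layer (assumes two visible CUDA devices; real frameworks gather the halves with NCCL rather than via the CPU):

```python
import torch

x = torch.randn(8, 1024)                    # one batch of activations
w = torch.randn(1024, 4096)                 # full weight matrix of a linear layer
w0, w1 = w.chunk(2, dim=1)                  # split the columns across two GPUs

y0 = x.to("cuda:0") @ w0.to("cuda:0")       # half the output computed on GPU 0
y1 = x.to("cuda:1") @ w1.to("cuda:1")       # other half computed on GPU 1
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)  # gather the halves

assert y.shape == (8, 4096)
```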

### Pipeline Parallelism (PP)

Split the model into sequential stages, one per GPU; batches flow through the stages like an assembly line.

```
GPU 0: Layers 1-20  →  GPU 1: Layers 21-40
(batch 2 enters GPU 0 while batch 1 is still on GPU 1)
```

**Pros:** Higher throughput, tolerates slower PCIe links **Cons:** Higher latency, more complex

### Data Parallelism (DP)

Same model on multiple GPUs, different data.

```
GPU 0: Process batch A
GPU 1: Process batch B
```

**Pros:** Simple, near-linear scaling **Cons:** Each GPU needs a full copy of the model

***

## LLM Multi-GPU Setup

### vLLM (Recommended)

**2 GPUs:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --host 0.0.0.0
```

**4 GPUs:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --host 0.0.0.0
```

**8 GPUs (for 405B):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --host 0.0.0.0
```
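
If you prefer offline batch inference to running a server, vLLM's Python API takes the same `tensor_parallel_size` argument; a minimal sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,          # same meaning as --tensor-parallel-size
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```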

### Ollama Multi-GPU

Ollama detects all visible GPUs and automatically splits a model across them when it doesn't fit on a single GPU:

```bash
# Check available GPUs
nvidia-smi

# Ollama will auto-detect and use all GPUs
ollama run llama3.1:70b
```

**Limit to specific GPUs:**

```bash
CUDA_VISIBLE_DEVICES=0,1 ollama run llama3.1:70b
```

### Text Generation Inference (TGI)

```bash
# Gated models (e.g. Llama) also need a Hugging Face token: add -e HF_TOKEN=<your token>
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-70B-Instruct \
    --num-shard 2
```
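
Once the container is up, you can query it from Python, for example with `huggingface_hub`'s `InferenceClient` pointed at the mapped port (a sketch; prompt and parameters are illustrative):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")   # port mapped by docker run above

reply = client.text_generation(
    "Explain pipeline parallelism in two sentences.",
    max_new_tokens=200,
)
print(reply)
```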

### llama.cpp

```bash
# Offload all layers to GPU (-ngl 999) and split them evenly across two GPUs
./llama-server \
    -m llama-3.1-70b-q4.gguf \
    -ngl 999 \
    --split-mode layer \
    --tensor-split 0.5,0.5
```

***

## Image Generation Multi-GPU

### ComfyUI

ComfyUI can offload different models in a workflow to different GPUs, typically via multi-GPU custom loader nodes that expose a `device` parameter:

* Set the checkpoint loader's device to `cuda:0` (first GPU) and other loaders to `cuda:1` (second GPU).
* **Run the VAE on a separate GPU:** keep the main model on GPU 0 and the VAE on GPU 1 to reduce VRAM pressure during decoding.

### Stable Diffusion WebUI

The WebUI runs each generation on a single GPU, so the usual multi-GPU pattern is one instance per GPU. **Set the GPU in webui-user.sh:**

```bash
# Instance 1 on GPU 0
export COMMANDLINE_ARGS="--device-id 0 --port 7860"

# Instance 2 (separate install/launch) on GPU 1
export COMMANDLINE_ARGS="--device-id 1 --port 7861"
```

### FLUX Multi-GPU

```python
from diffusers import FluxPipeline
import torch

# device_map="balanced" (recent diffusers releases) spreads the pipeline's
# components (transformer, text encoders, VAE) across all visible GPUs
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    device_map="balanced"
)

# Single-GPU alternatives:
# pipe.enable_model_cpu_offload()  # offload idle components to CPU RAM
# pipe.to("cuda:0")                # keep the whole pipeline on one GPU
```

***

## Training Multi-GPU

### PyTorch Distributed

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = YourModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop as normal
```

**Launch:**

```bash
torchrun --nproc_per_node=2 train.py
```
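
Each DDP process should also read a different shard of the dataset, which `DistributedSampler` handles; a minimal sketch with a dummy dataset (runs inside the same `torchrun` setup as above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Dummy dataset; DistributedSampler gives each rank a different shard
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)    # reshuffle shards differently each epoch
    for inputs, labels in loader:
        ...                     # forward/backward exactly as in single-GPU code
```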

### DeepSpeed

```python
import deepspeed

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_batch_size": 32,
        "fp16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": 2}
    }
)
```
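
In the training loop, the returned engine replaces the usual backward/step calls so ZeRO can manage the partitioned optimizer states; a sketch (assumes the wrapped model computes and returns a loss):

```python
# model is the DeepSpeed engine returned by deepspeed.initialize above
for batch in dataloader:
    loss = model(batch)     # assumes the wrapped model returns the loss
    model.backward(loss)    # replaces loss.backward()
    model.step()            # replaces optimizer.step() + optimizer.zero_grad()
```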

**Launch:**

```bash
deepspeed --num_gpus=2 train.py
```

### Accelerate (HuggingFace)

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)
```
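
The remaining change to a standard loop is calling `accelerator.backward()` instead of `loss.backward()`; Accelerate handles device placement and gradient synchronization. A sketch (assumes a HF-style model that returns `.loss`):

```python
# Training loop after accelerator.prepare(); batches arrive on the right device
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(**batch).loss      # assumes a HF-style model that returns .loss
    accelerator.backward(loss)      # replaces loss.backward()
    optimizer.step()
```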

**Configure:**

```bash
accelerate config  # Interactive setup
accelerate launch train.py
```

### Kohya Training (LoRA)

```bash
# Multi-GPU LoRA training
accelerate launch --num_processes=2 train_network.py \
    --pretrained_model_name_or_path="model.safetensors" \
    --train_data_dir="./images" \
    --output_dir="./output"
```

***

## GPU Selection

### Check Available GPUs

```bash
# Overview of all GPUs (utilization, memory, processes)
nvidia-smi

# List GPUs with index, name, and UUID
nvidia-smi -L

# Memory usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```

### Select Specific GPUs

**Environment variable:**

```bash
# Use only GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1
python your_script.py

# Use only GPU 2
export CUDA_VISIBLE_DEVICES=2
python your_script.py
```

**In Python:**

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Or with torch
import torch
device = torch.device("cuda:0")  # First visible GPU
device = torch.device("cuda:1")  # Second visible GPU
```
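
To confirm what a script actually sees after setting `CUDA_VISIBLE_DEVICES`, enumerate the visible devices, for example:

```python
import torch

# Print whatever CUDA_VISIBLE_DEVICES left visible to this process
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GB")
```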

***

## Performance Optimization

### NVLink vs PCIe

| Connection | Bandwidth | Best For           |
| ---------- | --------- | ------------------ |
| NVLink     | 600 GB/s  | Tensor parallelism |
| PCIe 4.0   | 32 GB/s   | Data parallelism   |
| PCIe 5.0   | 64 GB/s   | Mixed workloads    |

**Check NVLink status:**

```bash
nvidia-smi nvlink --status
```
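
From Python, you can also check whether two GPUs have direct peer-to-peer access (over NVLink or PCIe); if not, inter-GPU traffic typically routes through host memory and tensor parallelism slows down. A quick check:

```python
import torch

# Direct GPU 0 <-> GPU 1 peer access (True is what you want for tensor parallelism)
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
```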

### Optimal Configuration

| GPUs | TP Size | PP Size | Notes                  |
| ---- | ------- | ------- | ---------------------- |
| 2    | 2       | 1       | Simple tensor parallel |
| 4    | 4       | 1       | Requires NVLink        |
| 4    | 2       | 2       | PCIe-friendly          |
| 8    | 8       | 1       | Full tensor parallel   |
| 8    | 4       | 2       | Mixed parallelism      |

### Memory Balancing

**Even split (default):**

```bash
--tensor-parallel-size 2
```

**Custom split (uneven GPUs):**

```bash
# vLLM doesn't support uneven splits; use llama.cpp instead:
./llama-server --tensor-split 0.6,0.4
```

***

## Troubleshooting

### "NCCL Error"

```bash
# Set NCCL debug
export NCCL_DEBUG=INFO

# Try different NCCL algorithms
export NCCL_ALGO=Ring
```

### "Out of Memory on GPU X"

```bash
# Check memory per GPU
nvidia-smi

# Reduce the batch size (flag depends on the server, e.g. --max-num-seqs for vLLM)

# For training, enable gradient checkpointing (e.g. --gradient_checkpointing with HF Trainer)
```

### "Slow Multi-GPU Performance"

1. Check NVLink connectivity
2. Reduce tensor parallel size
3. Use pipeline parallelism instead
4. Check CPU bottleneck

### "GPUs Not Detected"

```bash
# Verify CUDA
nvidia-smi

# Check PyTorch sees GPUs
python -c "import torch; print(torch.cuda.device_count())"

# Reinstall CUDA drivers if needed
```

***

## Cost Optimization

### When Multi-GPU is Worth It

| Scenario           | Single GPU           | Multi-GPU               | Winner          |
| ------------------ | -------------------- | ----------------------- | --------------- |
| 70B occasional use | A100 80GB ($0.25/hr) | 2x RTX 4090 ($0.20/hr)  | Multi           |
| 70B production     | A100 40GB ($0.17/hr) | 2x A100 40GB ($0.34/hr) | Single (Q4)     |
| Training 7B        | RTX 4090 ($0.10/hr)  | 2x RTX 4090 ($0.20/hr)  | Depends on time |
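
For the "Depends on time" row, the comparison is just hourly rate × wall-clock hours; multi-GPU wins when the speedup outweighs the higher rate. A worked sketch with an assumed 1.8x speedup on two GPUs (scaling is rarely a perfect 2x):

```python
# Illustrative numbers only: a 10-hour single-GPU run, ~1.8x speedup on 2 GPUs
single_hours, single_rate = 10.0, 0.10        # 1x RTX 4090
multi_hours,  multi_rate  = 10.0 / 1.8, 0.20  # 2x RTX 4090

print(f"1x GPU: ${single_hours * single_rate:.2f} in {single_hours:.1f} h")
print(f"2x GPU: ${multi_hours * multi_rate:.2f} in {multi_hours:.1f} h")
# -> $1.00 in 10.0 h vs $1.11 in 5.6 h: pay ~11% more to finish ~4.4 h sooner
```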

### Cost-Effective Configurations

| Use Case           | Configuration | \~Cost/hr |
| ------------------ | ------------- | --------- |
| 70B inference      | 2x RTX 3090   | $0.12     |
| 70B fast inference | 2x A100 40GB  | $0.34     |
| 70B FP16           | 2x A100 80GB  | $0.50     |
| Training 13B       | 2x RTX 4090   | $0.20     |

***

## Example Configurations

### 70B Chat Server

```bash
# 2x A100 40GB setup
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
```

### DeepSeek-V3 (671B)

```bash
# 8x A100 80GB required
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0
```

### Image + LLM Pipeline

```bash
# GPU 0: Stable Diffusion
CUDA_VISIBLE_DEVICES=0 python comfyui/main.py --port 8188 &

# GPU 1: LLM for prompts
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct --port 8000
```
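
A small glue script can connect the two services, e.g. asking the LLM on GPU 1 (via its OpenAI-compatible endpoint) to write a prompt for the image model; a sketch that stops short of submitting to ComfyUI's API:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server on GPU 1 (started above on port 8000)
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = llm.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user",
               "content": "Write a detailed image prompt: a cyberpunk city at night."}],
)
image_prompt = resp.choices[0].message.content
print(image_prompt)   # paste into ComfyUI on port 8188, or submit via its API
```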

***

## Next Steps

* [vLLM Guide](/guides/language-models/vllm.md) - Production LLM serving
* [GPU Comparison](/guides/getting-started/gpu-comparison.md) - Choose your GPUs
* [API Integration](/guides/advanced/api-integration.md) - Build applications
* [Cost Calculator](/guides/getting-started/cost-calculator.md) - Estimate costs

