Multi-GPU Setup

Run large AI models across multiple GPUs on CLORE.AI.

When Do You Need Multi-GPU?

| Model Size | Single GPU Option | Multi-GPU Option |
|---|---|---|
| ≤13B | RTX 3090 (Q4) | Not needed |
| 30B | RTX 4090 (Q4) | 2x RTX 3090 |
| 70B | A100 40GB (Q4) | 2x RTX 4090 |
| 70B FP16 | - | 2x A100 80GB |
| 100B+ | - | 4x A100 80GB |
| 405B | - | 8x A100 80GB |


Multi-GPU Concepts

Tensor Parallelism (TP)

Split each layer's weight matrices across GPUs, so every GPU computes a slice of every layer. Best for inference.

GPU 0: first half of each layer's weights
GPU 1: second half of each layer's weights

Pros: Lower latency, simple setup. Cons: Requires a high-speed interconnect, since GPUs synchronize after every layer.

Pipeline Parallelism (PP)

Split the model into sequential stages of layers, one stage per GPU; batches flow through the stages in order.

GPU 0: Layers 1-20
GPU 1: Layers 21-40

Pros: Higher throughput, modest interconnect needs. Cons: Higher latency, more complex.

Data Parallelism (DP)

Replicate the same model on every GPU and feed each replica a different slice of the data.

Pros: Simple, near-linear scaling. Cons: Each GPU needs a full copy of the model.


LLM Multi-GPU Setup

2 GPUs:
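The page doesn't pin down a serving engine here; the sketches below assume vLLM, whose `--tensor-parallel-size` flag sets the TP degree, and the model tags are just examples:

```bash
# 30B-class model (AWQ 4-bit) split across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --tensor-parallel-size 2
```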

4 GPUs:
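Same pattern with TP degree 4, e.g. a 70B model in FP16 (a tight fit on 4x A100 40GB):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4
```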

8 GPUs (for 405B):
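At 405B even 8x 80GB cards need quantized weights; the FP8 repo below is one published option:

```bash
# FP8 weights (~400 GB) spread across 8x 80GB GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8
```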

Ollama Multi-GPU

Ollama automatically uses multiple GPUs when available:
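No flags needed; a model too large for one card simply gets its layers spread across all visible GPUs (the tag is an example):

```bash
ollama run llama3.1:70b
```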

Limit to specific GPUs:
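Set `CUDA_VISIBLE_DEVICES` before starting the server:

```bash
# Restrict the Ollama server to the first two GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```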

Text Generation Inference (TGI)
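TGI shards a model across GPUs with `--num-shard`; a sketch using the official Docker image (model tag and cache path are examples):

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $HOME/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --num-shard 2
```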

llama.cpp
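llama.cpp splits GGUF layers across GPUs via `--split-mode` and `--tensor-split` (model path is an example):

```bash
# All layers on GPU (-ngl 99), split evenly between two cards
./llama-server -m models/llama-70b-Q4_K_M.gguf \
  -ngl 99 --split-mode layer --tensor-split 1,1
```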


Image Generation Multi-GPU

ComfyUI

Stock ComfyUI runs each instance on a single device, but community extensions can place different models (UNet, CLIP, VAE) on different GPUs:

Run VAE on separate GPU:
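A sketch: `--cuda-device` is a stock ComfyUI flag that pins the whole process; moving just the VAE to a second GPU needs a multi-GPU custom-node pack (an assumption about your install, e.g. the community ComfyUI-MultiGPU nodes):

```bash
# Pin the main ComfyUI process to GPU 0
python main.py --listen --cuda-device 0

# Placing the VAE on a second GPU happens inside the workflow via a
# multi-GPU custom node; ComfyUI core keeps everything on one device.
```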

Stable Diffusion WebUI

Stable Diffusion WebUI cannot split a single generation across GPUs; instead, pin one instance per GPU in webui-user.sh:
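A sketch using the stock `--device-id` flag:

```bash
# webui-user.sh for the first instance (GPU 0)
export COMMANDLINE_ARGS="--device-id 0 --port 7860"

# second copy of webui-user.sh for GPU 1
# export COMMANDLINE_ARGS="--device-id 1 --port 7861"
```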

FLUX Multi-GPU
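FLUX checkpoints are large enough that spreading the pipeline across GPUs helps; a sketch using diffusers' pipeline-level `device_map` (assumes a recent diffusers and Hugging Face access to the FLUX.1-dev repo):

```python
import torch
from diffusers import FluxPipeline

# "balanced" spreads the transformer, text encoders and VAE
# across all visible GPUs
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

image = pipe("a lighthouse at dawn", num_inference_steps=28).images[0]
image.save("flux.png")
```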


Training Multi-GPU

PyTorch Distributed
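A minimal DDP sketch (`train.py`); the model and batch are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 512, device=local_rank)          # placeholder batch
loss = model(x).square().mean()
loss.backward()          # gradients are all-reduced across GPUs here
opt.step()

dist.destroy_process_group()
```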

Launch:
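With torchrun, one process per GPU:

```bash
torchrun --nproc_per_node=2 train.py
```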

DeepSpeed
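A minimal ZeRO stage-2 config (values are examples) saved as ds_config.json:

```bash
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 4,
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": true }
}
EOF
```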

Launch:
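Assuming train.py wires in DeepSpeed's argument parser:

```bash
deepspeed --num_gpus=2 train.py --deepspeed ds_config.json
```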

Accelerate (HuggingFace)

Configure:
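One-time interactive setup, then launch:

```bash
# Answer the prompts: multi-GPU, number of processes, etc.
accelerate config

# Then launch the same script across both GPUs
accelerate launch --multi_gpu --num_processes 2 train.py
```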

Kohya Training (LoRA)
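Kohya's sd-scripts train through accelerate, so multi-GPU LoRA uses the same launcher with more processes (paths are placeholders; train_network.py is the sd-scripts LoRA entry point):

```bash
accelerate launch --multi_gpu --num_processes 2 train_network.py \
  --pretrained_model_name_or_path /models/sdxl_base.safetensors \
  --train_data_dir /data/my_dataset \
  --network_module networks.lora \
  --output_dir /output/lora
```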


GPU Selection

Check Available GPUs
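```bash
nvidia-smi -L   # one line per GPU: index, name, UUID
nvidia-smi      # live utilization, memory, temperature
```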

Select Specific GPUs

Environment variable:
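(app.py is a stand-in for whatever you launch:)

```bash
# Expose only GPUs 0 and 2 to the process
CUDA_VISIBLE_DEVICES=0,2 python app.py
```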

In Python:
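```python
import os

# Must run before CUDA is initialized (i.e. before the first torch.cuda call)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

print(torch.cuda.device_count())  # -> 2
torch.cuda.set_device(0)          # default device for new tensors
```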


Performance Optimization

| Connection | Bandwidth | Best For |
|---|---|---|
| NVLink | 600 GB/s | Tensor parallelism |
| PCIe 4.0 | 32 GB/s | Data parallelism |
| PCIe 5.0 | 64 GB/s | Mixed workloads |

Check NVLink status:
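```bash
nvidia-smi nvlink --status   # per-link speeds; empty output means no NVLink
nvidia-smi topo -m           # interconnect matrix: NV# = NVLink, SYS = slow PCIe hops
```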

Optimal Configuration

| GPUs | TP Size | PP Size | Notes |
|---|---|---|---|
| 2 | 2 | 1 | Simple tensor parallel |
| 4 | 4 | 1 | Requires NVLink |
| 4 | 2 | 2 | PCIe-friendly |
| 8 | 8 | 1 | Full tensor parallel |
| 8 | 4 | 2 | Mixed parallelism |

Memory Balancing

Even split (default):
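The page doesn't name a backend here; with llama.cpp, for example, equal ratios force an even share:

```bash
# Equal share of layers on each of two identical GPUs
./llama-server -m model.gguf -ngl 99 --tensor-split 1,1
```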

Custom split (uneven GPUs):
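The ratios are relative, so size them to VRAM:

```bash
# 24GB + 12GB cards: give the bigger card two thirds of the layers
./llama-server -m model.gguf -ngl 99 --tensor-split 2,1
```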


Troubleshooting

"NCCL Error"

"Out of Memory on GPU X"

"Slow Multi-GPU Performance"

  1. Check NVLink connectivity

  2. Reduce tensor parallel size

  3. Use pipeline parallelism instead

  4. Check CPU bottleneck

"GPUs Not Detected"


Cost Optimization

When Multi-GPU is Worth It

| Scenario | Single GPU | Multi-GPU | Winner |
|---|---|---|---|
| 70B occasional use | A100 80GB ($0.25/hr) | 2x RTX 4090 ($0.20/hr) | Multi |
| 70B production | A100 40GB ($0.17/hr) | 2x A100 40GB ($0.34/hr) | Single (Q4) |
| Training 7B | RTX 4090 ($0.10/hr) | 2x RTX 4090 ($0.20/hr) | Depends on time |

Cost-Effective Configurations

| Use Case | Configuration | ~Cost/hr |
|---|---|---|
| 70B inference | 2x RTX 3090 | $0.12 |
| 70B fast inference | 2x A100 40GB | $0.34 |
| 70B FP16 | 2x A100 80GB | $0.50 |
| Training 13B | 2x RTX 4090 | $0.20 |


Example Configurations

70B Chat Server
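A sketch matching the 2x A100 40GB row above; the AWQ repo is one published 4-bit quant:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```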

DeepSeek-V3 (671B)
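DeepSeek-V3 is a 671B MoE; even on 8x 80GB cards it needs quantized weights, so treat this as a sketch rather than a guaranteed fit:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --trust-remote-code
```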

Image + LLM Pipeline
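Pin each service to its own card with CUDA_VISIBLE_DEVICES (ports and paths are examples):

```bash
# GPU 0: ComfyUI for image generation
CUDA_VISIBLE_DEVICES=0 python main.py --listen --port 8188 &

# GPU 1: Ollama for the LLM
CUDA_VISIBLE_DEVICES=1 ollama serve &
```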


Next Steps
