# Multi-GPU Setup

Run large AI models across multiple GPUs on CLORE.AI.

{% hint style="success" %}
Find multi-GPU servers at [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## When Do You Need Multi-GPU?

| Model Size | Single GPU Option | Multi-GPU Option |
| ---------- | ----------------- | ---------------- |
| ≤13B       | RTX 3090 (Q4)     | Not needed       |
| 30B        | RTX 4090 (Q4)     | 2x RTX 3090      |
| 70B        | A100 40GB (Q4)    | 2x RTX 4090      |
| 70B FP16   | -                 | 2x A100 80GB     |
| 100B+      | -                 | 4x A100 80GB     |
| 405B       | -                 | 8x A100 80GB     |
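
The rows above follow from weight size. A minimal sketch of the estimate (the 20% overhead factor for KV cache and activations is an assumption; real usage grows with context length):

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM need in GB: weight bytes plus ~20% for KV cache/activations.

    The overhead factor is an assumption; tune it for your context length.
    """
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

# 70B at Q4 (~4.5 bits/weight including quantization scales)
print(f"{estimate_vram_gb(70, 4.5):.0f} GB")  # too big for one 24GB card, fits 2x 24GB
```

The same arithmetic explains the FP16 rows: 70B at 16 bits is ~140 GB of weights alone, hence the 2x A100 80GB requirement.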

***

## Multi-GPU Concepts

### Tensor Parallelism (TP)

Split each layer's weight matrices across GPUs, so every GPU works on every token. Best for low-latency inference.

```
GPU 0: First half of every weight matrix
GPU 1: Second half of every weight matrix
```

**Pros:** Lowest latency, usually a single flag to enable

**Cons:** Heavy GPU-to-GPU traffic, so it needs a high-speed interconnect (NVLink)

### Pipeline Parallelism (PP)

Split the model into sequential stages of layers; micro-batches flow through the stages like an assembly line.

```
GPU 0: Layers 1-40  →  GPU 1: Layers 41-80
Batch 1 ───────────→   Batch 1 (while GPU 0 starts Batch 2)
```

**Pros:** Higher throughput, tolerates slower (PCIe) interconnects

**Cons:** Higher latency, pipeline bubbles, more complex

### Data Parallelism (DP)

Replicate the full model on every GPU; each copy processes a different slice of the data. Best for training.

```
GPU 0: Process batch A
GPU 1: Process batch B
```

**Pros:** Simple, near-linear scaling

**Cons:** Every GPU must hold a full copy of the model

***

## LLM Multi-GPU Setup

### vLLM (Recommended)

**2 GPUs:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --host 0.0.0.0
```

**4 GPUs:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --host 0.0.0.0
```

**8 GPUs (for 405B):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --host 0.0.0.0
```

### Ollama Multi-GPU

Ollama automatically uses multiple GPUs when available:

```bash
# Check available GPUs
nvidia-smi

# Ollama will auto-detect and use all GPUs
ollama run llama3.1:70b
```

**Limit to specific GPUs** (set the variable on the server process; `ollama run` is only a client):

```bash
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```

### Text Generation Inference (TGI)

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-70B-Instruct \
    --num-shard 2
```

### llama.cpp

```bash
# Specify GPU layers per device
./llama-server \
    -m llama-3.1-70b-q4.gguf \
    -ngl 999 \
    --split-mode layer \
    --tensor-split 0.5,0.5
```

***

## Image Generation Multi-GPU

### ComfyUI

Stock ComfyUI runs each workflow on a single GPU (selectable with `--cuda-device`). Using several GPUs usually means running one instance per GPU; per-node placement, such as moving the VAE to a second GPU to reduce VRAM pressure on the main model, requires custom nodes.

```bash
# One ComfyUI instance per GPU, on different ports
CUDA_VISIBLE_DEVICES=0 python main.py --port 8188 &
CUDA_VISIBLE_DEVICES=1 python main.py --port 8189 &
```

### Stable Diffusion WebUI

WebUI cannot split a single generation across GPUs (`--device-id` accepts one GPU). Instead, run one instance per GPU, each pinned in its own `webui-user.sh`:

```bash
# Instance for GPU 0
export COMMANDLINE_ARGS="--device-id 0 --port 7860"

# Instance for GPU 1 (separate launch)
export COMMANDLINE_ARGS="--device-id 1 --port 7861"
```

### FLUX Multi-GPU

```python
from diffusers import FluxPipeline
import torch

# device_map="balanced" shards pipeline components across available GPUs
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    device_map="balanced"
)

# Single-GPU alternatives:
# pipe.enable_model_cpu_offload()  # offload idle components to CPU RAM
# pipe.to("cuda:0")                # keep everything on one GPU
```

***

## Training Multi-GPU

### PyTorch Distributed

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = YourModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop as normal
```

**Launch:**

```bash
torchrun --nproc_per_node=2 train.py
```

### DeepSpeed

```python
import deepspeed

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config={
        "train_batch_size": 32,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2}
    }
)
```

**Launch:**

```bash
deepspeed --num_gpus=2 train.py
```

### Accelerate (HuggingFace)

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)
```

**Configure:**

```bash
accelerate config  # Interactive setup
accelerate launch train.py
```

### Kohya Training (LoRA)

```bash
# Multi-GPU LoRA training
accelerate launch --num_processes=2 train_network.py \
    --pretrained_model_name_or_path="model.safetensors" \
    --train_data_dir="./images" \
    --output_dir="./output"
```

***

## GPU Selection

### Check Available GPUs

```bash
# Overview: utilization, memory, running processes
nvidia-smi

# List GPUs with names and UUIDs
nvidia-smi -L

# Memory usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```

### Select Specific GPUs

**Environment variable:**

```bash
# Use only GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1
python your_script.py

# Use only GPU 2
export CUDA_VISIBLE_DEVICES=2
python your_script.py
```

**In Python:**

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Or with torch
import torch
device = torch.device("cuda:0")  # First visible GPU
device = torch.device("cuda:1")  # Second visible GPU
```

***

## Performance Optimization

### NVLink vs PCIe

| Connection | Bandwidth | Best For           |
| ---------- | --------- | ------------------ |
| NVLink     | 600 GB/s  | Tensor parallelism |
| PCIe 4.0   | 32 GB/s   | Data parallelism   |
| PCIe 5.0   | 64 GB/s   | Mixed workloads    |

**Check NVLink status:**

```bash
nvidia-smi nvlink --status
```

### Optimal Configuration

| GPUs | TP Size | PP Size | Notes                  |
| ---- | ------- | ------- | ---------------------- |
| 2    | 2       | 1       | Simple tensor parallel |
| 4    | 4       | 1       | Requires NVLink        |
| 4    | 2       | 2       | PCIe-friendly          |
| 8    | 8       | 1       | Full tensor parallel   |
| 8    | 4       | 2       | Mixed parallelism      |
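
The PCIe-friendly 4-GPU row (TP=2, PP=2) can be expressed directly in vLLM, which exposes both parallelism degrees; the model name here is illustrative:

```bash
# 4 GPUs without NVLink: 2-way tensor parallel inside each of 2 pipeline stages
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --host 0.0.0.0
```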

### Memory Balancing

**Even split (default):**

```bash
--tensor-parallel-size 2
```

**Custom split (uneven GPUs):**

```bash
# vLLM requires an even split; for mismatched GPUs use llama.cpp:
./llama-server --tensor-split 0.6,0.4
```

***

## Troubleshooting

### "NCCL Error"

```bash
# Set NCCL debug
export NCCL_DEBUG=INFO

# Try different NCCL algorithms
export NCCL_ALGO=Ring
```

### "Out of Memory on GPU X"

```bash
# Check memory per GPU
nvidia-smi

# Reduce concurrency (vLLM flag)
--max-num-seqs 1

# Enable gradient checkpointing (training)
--gradient-checkpointing
```

### "Slow Multi-GPU Performance"

1. Check NVLink connectivity
2. Reduce tensor parallel size
3. Use pipeline parallelism instead
4. Check CPU bottleneck
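
For steps 1 and 2, the interconnect matrix is the quickest check; it shows whether each GPU pair communicates over NVLink or only over PCIe paths:

```bash
# Matrix of GPU-to-GPU links: NV# = NVLink, PIX/PHB/SYS = PCIe/host paths
nvidia-smi topo -m
```

If every pair shows `SYS` or `PHB`, prefer pipeline parallelism or a smaller tensor-parallel size.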

### "GPUs Not Detected"

```bash
# Verify CUDA
nvidia-smi

# Check PyTorch sees GPUs
python -c "import torch; print(torch.cuda.device_count())"

# Reinstall CUDA drivers if needed
```

***

## Cost Optimization

### When Multi-GPU is Worth It

| Scenario           | Single GPU           | Multi-GPU               | Winner          |
| ------------------ | -------------------- | ----------------------- | --------------- |
| 70B occasional use | A100 80GB ($0.25/hr) | 2x RTX 4090 ($0.20/hr)  | Multi           |
| 70B production     | A100 40GB ($0.17/hr) | 2x A100 40GB ($0.34/hr) | Single (Q4)     |
| Training 7B        | RTX 4090 ($0.10/hr)  | 2x RTX 4090 ($0.20/hr)  | Depends on time |
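
The "depends on time" row comes down to scaling efficiency: doubling GPUs doubles the hourly rate but rarely halves wall-clock time. A minimal sketch (the 90% efficiency figure is an assumption; measure your own workload):

```python
def total_cost(rate_per_gpu_hr, n_gpus, single_gpu_hours, scaling_eff=0.9):
    """Job cost when n_gpus cut wall-clock time by a factor of n * scaling_eff."""
    hours = single_gpu_hours / (n_gpus * scaling_eff)
    return rate_per_gpu_hr * n_gpus * hours

# 10-hour training job at $0.10/hr per RTX 4090
single = total_cost(0.10, 1, 10, scaling_eff=1.0)  # $1.00 over 10 h
dual   = total_cost(0.10, 2, 10)                   # ~$1.11 over ~5.6 h
```

At 90% efficiency you pay roughly 11% more to finish in a bit over half the time, so multi-GPU wins only when wall-clock time matters.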

### Cost-Effective Configurations

| Use Case           | Configuration | \~Cost/hr |
| ------------------ | ------------- | --------- |
| 70B inference      | 2x RTX 3090   | $0.12     |
| 70B fast inference | 2x A100 40GB  | $0.34     |
| 70B FP16           | 2x A100 80GB  | $0.50     |
| Training 13B       | 2x RTX 4090   | $0.20     |

***

## Example Configurations

### 70B Chat Server

```bash
# 2x A100 40GB setup
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
```

### DeepSeek-V3 (671B)

```bash
# 8x 80GB-class GPUs at minimum; the full ~671B FP8 weights exceed 640 GB, so a quantized variant is needed
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0
```

### Image + LLM Pipeline

```bash
# GPU 0: Stable Diffusion
CUDA_VISIBLE_DEVICES=0 python comfyui/main.py --port 8188 &

# GPU 1: LLM for prompts
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct --port 8000
```

***

## Next Steps

* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) - Production LLM serving
* [GPU Comparison](https://docs.clore.ai/guides/getting-started/gpu-comparison) - Choose your GPUs
* [API Integration](https://docs.clore.ai/guides/advanced/api-integration) - Build applications
* [Cost Calculator](https://docs.clore.ai/guides/getting-started/cost-calculator) - Estimate costs
