Model Compatibility

Complete guide to which AI models run on which GPUs on CLORE.AI.


Quick Reference

Language Models (LLM)

| Model | Parameters | Min VRAM | Recommended | Quantization |
|---|---|---|---|---|
| Llama 3.2 | 1B | 2GB | 4GB | Q4, Q8, FP16 |
| Llama 3.2 | 3B | 4GB | 6GB | Q4, Q8, FP16 |
| Llama 3.1/3 | 8B | 6GB | 12GB | Q4, Q8, FP16 |
| Mistral | 7B | 6GB | 12GB | Q4, Q8, FP16 |
| Qwen 2.5 | 7B | 6GB | 12GB | Q4, Q8, FP16 |
| Qwen 2.5 | 14B | 12GB | 16GB | Q4, Q8 |
| Qwen 2.5 | 32B | 20GB | 24GB | Q4, Q8 |
| Llama 3.1 | 70B | 40GB | 48GB | Q4, Q8 |
| Qwen 2.5 | 72B | 48GB | 80GB | Q4, Q8 |
| Mixtral | 8x7B | 24GB | 48GB | Q4 |
| DeepSeek-V3 | 671B | 320GB+ | 640GB | FP8 |

Image Generation Models

| Model | Min VRAM | Recommended | Notes |
|---|---|---|---|
| SD 1.5 | 4GB | 8GB | 512x512 native |
| SD 2.1 | 6GB | 8GB | 768x768 native |
| SDXL | 8GB | 12GB | 1024x1024 native |
| SDXL Turbo | 8GB | 12GB | 1-4 steps |
| FLUX.1 schnell | 12GB | 16GB | 4 steps, fast |
| FLUX.1 dev | 16GB | 24GB | 20-50 steps |
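
As a reference point for the Min VRAM figures above, loading SDXL in half precision with diffusers looks roughly like this. A minimal sketch; the prompt and output filename are placeholders:

```python
# Minimal sketch: SDXL in FP16, which is what the ~8-12GB figures above assume.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,   # FP16 halves VRAM versus FP32
).to("cuda")

image = pipe(
    "a photo of a mountain lake at sunrise",  # placeholder prompt
    height=1024, width=1024,                  # SDXL's native resolution
    num_inference_steps=30,
).images[0]
image.save("sdxl_test.png")
```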

Video Generation Models

| Model | Min VRAM | Recommended | Output |
|---|---|---|---|
| Stable Video Diffusion | 16GB | 24GB | 4 sec, 576x1024 |
| AnimateDiff | 12GB | 16GB | 2-4 sec |
| Wan2.1 | 24GB | 40GB | 5 sec, 480p-720p |
| Hunyuan Video | 40GB | 80GB | 5 sec, 720p |
| OpenSora | 24GB | 40GB | Variable |

Audio Models

| Model | Min VRAM | Recommended | Task |
|---|---|---|---|
| Whisper tiny | 1GB | 2GB | Transcription |
| Whisper base | 1GB | 2GB | Transcription |
| Whisper small | 2GB | 4GB | Transcription |
| Whisper medium | 4GB | 6GB | Transcription |
| Whisper large-v3 | 6GB | 10GB | Transcription |
| Bark | 8GB | 12GB | Text-to-Speech |
| Stable Audio | 8GB | 12GB | Music Generation |

Vision Models

| Model | Min VRAM | Recommended | Task |
|---|---|---|---|
| Llama 3.2 Vision 11B | 12GB | 16GB | Image Understanding |
| Llama 3.2 Vision 90B | 48GB | 80GB | Image Understanding |
| LLaVA 7B | 8GB | 12GB | Visual QA |
| LLaVA 13B | 16GB | 24GB | Visual QA |


Detailed Compatibility Tables

LLM by GPU

| GPU | Max Model (Q4) | Max Model (Q8) | Max Model (FP16) |
|---|---|---|---|
| RTX 3060 12GB | 13B | 7B | 3B |
| RTX 3070 8GB | 7B | 3B | 1B |
| RTX 3080 10GB | 7B | 7B | 3B |
| RTX 3090 24GB | 30B | 13B | 7B |
| RTX 4070 Ti 12GB | 13B | 7B | 3B |
| RTX 4080 16GB | 14B | 7B | 7B |
| RTX 4090 24GB | 30B | 13B | 7B |
| RTX 5090 32GB | 70B | 14B | 13B |
| A100 40GB | 70B | 30B | 14B |
| A100 80GB | 70B | 70B | 30B |
| H100 80GB | 70B | 70B | 30B |

Image Generation by GPU

| GPU | SD 1.5 | SDXL | FLUX schnell | FLUX dev |
|---|---|---|---|---|
| RTX 3060 12GB | ✅ 512 | ✅ 768 | ⚠️ 512* | — |
| RTX 3070 8GB | ✅ 512 | ⚠️ 512 | — | — |
| RTX 3080 10GB | ✅ 512 | ✅ 768 | ⚠️ 512* | — |
| RTX 3090 24GB | ✅ 768 | ✅ 1024 | ✅ 1024 | ⚠️ 768* |
| RTX 4070 Ti 12GB | ✅ 512 | ✅ 768 | ⚠️ 512* | — |
| RTX 4080 16GB | ✅ 768 | ✅ 1024 | ✅ 768 | ⚠️ 512* |
| RTX 4090 24GB | ✅ 1024 | ✅ 1024 | ✅ 1024 | ✅ 1024 |
| RTX 5090 32GB | ✅ 1024 | ✅ 1024 | ✅ 1536 | ✅ 1536 |
| A100 40GB | ✅ 1024 | ✅ 1024 | ✅ 1024 | ✅ 1024 |
| A100 80GB | ✅ 2048 | ✅ 2048 | ✅ 1536 | ✅ 1536 |

*With CPU offload or reduced batch size

Video Generation by GPU

| GPU | SVD | AnimateDiff | Wan2.1 | Hunyuan |
|---|---|---|---|---|
| RTX 3060 12GB | ⚠️ short | — | — | — |
| RTX 3090 24GB | ✅ 2-4s | — | ⚠️ 480p | — |
| RTX 4090 24GB | ✅ 4s | — | ✅ 480p | ⚠️ short |
| RTX 5090 32GB | ✅ 6s | — | ✅ 720p | ✅ 5s |
| A100 40GB | ✅ 4s | — | ✅ 720p | ✅ 5s |
| A100 80GB | ✅ 8s | — | ✅ 720p | ✅ 10s |


Quantization Guide

What is Quantization?

Quantization reduces the numerical precision of model weights so the same model fits in less VRAM:

| Format | Bits | VRAM Reduction | Quality Loss |
|---|---|---|---|
| FP32 | 32 | Baseline | None |
| FP16 | 16 | 50% | Minimal |
| BF16 | 16 | 50% | Minimal |
| FP8 | 8 | 75% | Small |
| Q8 | 8 | 75% | Small |
| Q6_K | 6 | 81% | Small |
| Q5_K_M | 5 | 84% | Moderate |
| Q4_K_M | 4 | 87% | Moderate |
| Q3_K_M | 3 | 91% | Noticeable |
| Q2_K | 2 | 94% | Significant |
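
For the GGUF-style formats (Q4_K_M, Q5_K_M, and so on), a common way to run them is llama-cpp-python. A minimal sketch; the model path is a placeholder for whichever quantized file you download:

```python
# Minimal sketch: run a GGUF-quantized model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,        # context window; larger values need more VRAM (see below)
    n_gpu_layers=-1,   # offload all layers to the GPU; lower this if VRAM is tight
)

out = llm("Summarize what Q4_K_M quantization does.", max_tokens=64)
print(out["choices"][0]["text"])
```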

VRAM Calculator

Formula: VRAM (GB) ≈ Parameters (B) × Bytes per Parameter

| Model Size | FP16 | Q8 | Q4 |
|---|---|---|---|
| 1B | 2 GB | 1 GB | 0.5 GB |
| 3B | 6 GB | 3 GB | 1.5 GB |
| 7B | 14 GB | 7 GB | 3.5 GB |
| 8B | 16 GB | 8 GB | 4 GB |
| 13B | 26 GB | 13 GB | 6.5 GB |
| 14B | 28 GB | 14 GB | 7 GB |
| 30B | 60 GB | 30 GB | 15 GB |
| 32B | 64 GB | 32 GB | 16 GB |
| 70B | 140 GB | 70 GB | 35 GB |
| 72B | 144 GB | 72 GB | 36 GB |

*Add ~20% for KV cache and overhead
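
The formula and the ~20% overhead note, written out as a small helper. A sketch; the bytes-per-parameter values follow the table above and are approximations:

```python
# Rough VRAM estimate: parameters (billions) x bytes per parameter, plus ~20% overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_billion: float, fmt: str = "q4", overhead: float = 0.20) -> float:
    weights_gb = params_billion * BYTES_PER_PARAM[fmt]
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(8, "q4"), 1))   # 4.8  -> comfortable on a 6-8GB card
print(round(estimate_vram_gb(70, "q4"), 1))  # 42.0 -> wants a 48GB-class setup
```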

Quantization by Use Case

| Use Case | Recommended | Why |
|---|---|---|
| Chat/General | Q4_K_M | Good balance of speed and quality |
| Coding | Q5_K_M+ | Better accuracy for code |
| Creative Writing | Q4_K_M | Speed matters more |
| Analysis | Q6_K+ | Higher precision needed |
| Production | FP16/BF16 | Maximum quality |


Context Length vs VRAM

How Context Affects VRAM

Each model has a context window (the maximum number of tokens it can attend to). Longer context means more VRAM for the KV cache:

| Model | Default Context | Max Context | VRAM per 1K tokens |
|---|---|---|---|
| Llama 3 8B | 8K | 128K | ~0.3 GB |
| Llama 3 70B | 8K | 128K | ~0.5 GB |
| Qwen 2.5 7B | 8K | 128K | ~0.25 GB |
| Mistral 7B | 8K | 32K | ~0.25 GB |
| Mixtral 8x7B | 32K | 32K | ~0.4 GB |
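
Combining the weight estimate above with the per-1K-token figures gives a rough total. A sketch; the 0.3 GB/1K value is the table's approximation for Llama 3 8B:

```python
# Rough total VRAM = quantized weights + KV cache for the chosen context length.
def total_vram_gb(weights_gb: float, context_tokens: int, gb_per_1k_tokens: float) -> float:
    return weights_gb + (context_tokens / 1000) * gb_per_1k_tokens

# Llama 3 8B at Q4 (~4.8 GB with overhead), 0.3 GB per 1K tokens of context:
print(total_vram_gb(4.8, 16_000, 0.3))  # ~9.6 GB  -> comfortable on a 12GB card
print(total_vram_gb(4.8, 64_000, 0.3))  # ~24.0 GB -> wants a 24GB card
```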

Context by GPU (Llama 3 8B Q4)

| GPU | Comfortable Context | Maximum Context |
|---|---|---|
| RTX 3060 12GB | 16K | 32K |
| RTX 3090 24GB | 64K | 96K |
| RTX 4090 24GB | 64K | 96K |
| RTX 5090 32GB | 96K | 128K |
| A100 40GB | 96K | 128K |
| A100 80GB | 128K | 128K |


Multi-GPU Configurations

Tensor Parallelism

Split one model across multiple GPUs:

| Configuration | Total VRAM | Max Model |
|---|---|---|
| 2x RTX 3090 | 48GB | 30B |
| 2x RTX 4090 | 48GB | 30B |
| 2x RTX 5090 | 64GB | 32B |
| 4x RTX 5090 | 128GB | 70B |
| 2x A100 40GB | 80GB | 70B |
| 4x A100 40GB | 160GB | 100B+ |
| 8x A100 80GB | 640GB | DeepSeek-V3 |

vLLM Multi-GPU
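
A minimal vLLM sketch for sharding one model across two GPUs on a rented server; the model ID and GPU count are illustrative, and max_model_len is capped here only to bound KV-cache memory:

```python
# Minimal sketch: tensor parallelism across 2 GPUs with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # illustrative model ID
    tensor_parallel_size=2,             # shard the model across 2 GPUs on this server
    max_model_len=8192,                 # cap context to bound KV-cache VRAM
)

outputs = llm.generate(["Hello from a multi-GPU rental."], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```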


Specific Model Guides

Llama 3.1/3.2 Family

| Variant | Parameters | Min GPU | Recommended Setup |
|---|---|---|---|
| Llama 3.2 1B | 1B | Any 4GB | RTX 3060 |
| Llama 3.2 3B | 3B | Any 6GB | RTX 3060 |
| Llama 3.1 8B | 8B | RTX 3060 | RTX 3090 |
| Llama 3.1 70B | 70B | A100 40GB | 2x A100 40GB |
| Llama 3.1 405B | 405B | 8x A100 80GB | 8x H100 |

Mistral/Mixtral Family

| Variant | Parameters | Min GPU | Recommended Setup |
|---|---|---|---|
| Mistral 7B | 7B | RTX 3060 | RTX 3090 |
| Mixtral 8x7B | 46.7B | RTX 3090 | A100 40GB |
| Mixtral 8x22B | 141B | 2x A100 80GB | 4x A100 80GB |

Qwen 2.5 Family

| Variant | Parameters | Min GPU | Recommended Setup |
|---|---|---|---|
| Qwen 2.5 0.5B | 0.5B | Any 2GB | Any 4GB |
| Qwen 2.5 1.5B | 1.5B | Any 4GB | RTX 3060 |
| Qwen 2.5 3B | 3B | Any 6GB | RTX 3060 |
| Qwen 2.5 7B | 7B | RTX 3060 | RTX 3090 |
| Qwen 2.5 14B | 14B | RTX 3090 | RTX 4090 |
| Qwen 2.5 32B | 32B | RTX 4090 | A100 40GB |
| Qwen 2.5 72B | 72B | A100 40GB | A100 80GB |

DeepSeek Models

| Variant | Parameters | Min GPU | Recommended Setup |
|---|---|---|---|
| DeepSeek-Coder 6.7B | 6.7B | RTX 3060 | RTX 3090 |
| DeepSeek-Coder 33B | 33B | RTX 4090 | A100 40GB |
| DeepSeek-V2-Lite | 15.7B | RTX 3090 | A100 40GB |
| DeepSeek-V3 | 671B | 8x A100 80GB | 8x H100 |


Troubleshooting

"CUDA out of memory"

  1. Use a more aggressive quantization: drop from Q8 to Q4

  2. Lower the context length: reduce the configured context window and max_tokens

  3. Enable CPU offload: --cpu-offload or enable_model_cpu_offload() (see the sketch after this list)

  4. Use a smaller batch: batch_size=1

  5. Try a different GPU: rent one with more VRAM
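
For image and video pipelines, CPU offload usually means diffusers' enable_model_cpu_offload(). A minimal sketch; the FLUX.1-dev model is illustrative, and the accelerate package must be installed:

```python
# Sketch: model CPU offload keeps only the active sub-model on the GPU.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # illustrative; any diffusers pipeline works
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()       # do not also call .to("cuda")

image = pipe("a test prompt", num_inference_steps=28).images[0]
image.save("offload_test.png")
```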

"Model too large"

  1. Use quantized version: GGUF Q4 models

  2. Use multiple GPUs: Tensor parallelism

  3. Offload to CPU: Slower but works

  4. Choose smaller model: 7B instead of 13B

"Slow generation"

  1. Upgrade GPU: More VRAM = less offloading

  2. Use faster quantization: Q4 is faster than Q8

  3. Reduce context: Shorter = faster

  4. Enable flash attention: --flash-attn (see the sketch after this list)
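
If you load models through transformers rather than a llama.cpp-style server, FlashAttention-2 is enabled at load time instead of via a flag. A sketch, assuming the flash-attn package is installed and the GPU supports it:

```python
# Sketch: enable FlashAttention-2 when loading with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```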

