Model Compatibility

AI model and GPU compatibility matrix for Clore.ai

Complete guide to which AI models run on which GPUs on CLORE.AI.

Find GPUs with the right VRAM at CLORE.AI Marketplace.

Quick Reference

Language Models (LLM)

Model

Parameters

Min VRAM

Recommended

Quantization

Llama 3.2

2GB

4GB

Q4, Q8, FP16

Llama 3.2

4GB

6GB

Q4, Q8, FP16

Llama 3.1/3

6GB

12GB

Q4, Q8, FP16

Mistral

6GB

12GB

Q4, Q8, FP16

Qwen 2.5

6GB

12GB

Q4, Q8, FP16

Qwen 2.5

14B

12GB

16GB

Q4, Q8

Qwen 2.5

32B

20GB

24GB

Q4, Q8

Llama 3.1

70B

40GB

48GB

Q4, Q8

Qwen 2.5

72B

48GB

80GB

Q4, Q8

Mixtral

8x7B

24GB

48GB

DeepSeek-V3

671B

320GB+

640GB

FP8

DeepSeek-R1

671B

320GB+

8x H100

FP8, reasoning model

DeepSeek-R1-Distill

32B

20GB

2x A100 / RTX 5090

Q4/Q8

Image Generation Models

Model

Min VRAM

Recommended

Notes

SD 1.5

4GB

8GB

512x512 native

SD 2.1

6GB

8GB

768x768 native

SDXL

8GB

12GB

1024x1024 native

SDXL Turbo

8GB

12GB

1-4 steps

SD3.5 Large (8B)

16GB

24GB

1024x1024, advanced quality

FLUX.1 schnell

12GB

16GB

4 steps, fast

FLUX.1 dev

16GB

24GB

20-50 steps

TRELLIS

16GB

24GB (RTX 4090)

3D generation from images

Video Generation Models

Model

Min VRAM

Recommended

Output

Stable Video Diffusion

16GB

24GB

4 sec, 576x1024

AnimateDiff

12GB

16GB

2-4 sec

LTX-Video

16GB

24GB (RTX 4090/3090)

5 sec, 768x512, very fast

Wan2.1

24GB

40GB

5 sec, 480p-720p

Hunyuan Video

40GB

80GB

5 sec, 720p

OpenSora

24GB

40GB

Variable

Audio Models

Model

Min VRAM

Recommended

Task

Whisper tiny

1GB

2GB

Transcription

Whisper base

1GB

2GB

Transcription

Whisper small

2GB

4GB

Transcription

Whisper medium

4GB

6GB

Transcription

Whisper large-v3

6GB

10GB

Transcription

Bark

8GB

12GB

Text-to-Speech

Stable Audio

8GB

12GB

Music Generation

Vision & Vision-Language Models

Model

Min VRAM

Recommended

Task

Llama 3.2 Vision 11B

12GB

16GB

Image Understanding

Llama 3.2 Vision 90B

48GB

80GB

Image Understanding

LLaVA 7B

8GB

12GB

Visual QA

LLaVA 13B

16GB

24GB

Visual QA

Qwen2.5-VL 7B

16GB

24GB (RTX 4090)

Image/Video/Doc OCR

Qwen2.5-VL 72B

48GB

2x A100 80GB

Maximum VL capability

Fine-tuning & Training Tools

Tool / Method

Min VRAM

Recommended GPU

Task

Unsloth QLoRA 7B

12GB

RTX 3090 24GB

2x faster QLoRA, low VRAM

Unsloth QLoRA 13B

16GB

RTX 4090 24GB

Fast fine-tuning

LoRA (standard)

12GB

RTX 3090

Parameter-efficient fine-tuning

Full fine-tune 7B

40GB

A100 40GB

Maximum quality training

Detailed Compatibility Tables

LLM by GPU

GPU

Max Model (Q4)

Max Model (Q8)

Max Model (FP16)

RTX 3060 12GB

13B

RTX 3070 8GB

RTX 3080 10GB

RTX 3090 24GB

30B

13B

RTX 4070 Ti 12GB

13B

RTX 4080 16GB

14B

RTX 4090 24GB

30B

13B

RTX 5090 32GB

70B

14B

13B

A100 40GB

70B

30B

14B

A100 80GB

70B

30B

H100 80GB

70B

30B

Image Generation by GPU

GPU

SD 1.5

SDXL

FLUX schnell

FLUX dev

RTX 3060 12GB

✅ 512

✅ 768

⚠️ 512*

❌

RTX 3070 8GB

✅ 512

⚠️ 512

❌

RTX 3080 10GB

✅ 512

✅ 768

⚠️ 512*

❌

RTX 3090 24GB

✅ 768

✅ 1024

⚠️ 768*

RTX 4070 Ti 12GB

✅ 512

✅ 768

⚠️ 512*

❌

RTX 4080 16GB

✅ 768

✅ 1024

✅ 768

⚠️ 512*

RTX 4090 24GB

✅ 1024

RTX 5090 32GB

✅ 1024

✅ 1536

A100 40GB

✅ 1024

A100 80GB

✅ 2048

✅ 1536

*With CPU offload or reduced batch size

Video Generation by GPU

GPU

SVD

AnimateDiff

Wan2.1

Hunyuan

RTX 3060 12GB

❌

⚠️ short

❌

RTX 3090 24GB

✅ 2-4s

✅

⚠️ 480p

❌

RTX 4090 24GB

✅ 4s

✅

✅ 480p

⚠️ short

RTX 5090 32GB

✅ 6s

✅

✅ 720p

✅ 5s

A100 40GB

✅ 4s

✅

✅ 720p

✅ 5s

A100 80GB

✅ 8s

✅

✅ 720p

✅ 10s

Quantization Guide

What is Quantization?

Quantization reduces model precision to fit in less VRAM:

Format

Bits

VRAM Reduction

Quality Loss

FP32

Baseline

None

FP16

50%

Minimal

BF16

50%

Minimal

FP8

75%

Small

75%

Small

Q6_K

81%

Small

Q5_K_M

84%

Moderate

Q4_K_M

87%

Moderate

Q3_K_M

91%

Noticeable

Q2_K

94%

Significant

VRAM Calculator

Formula: VRAM (GB) ≈ Parameters (B) × Bytes per Parameter

Model Size

FP16

2 GB

1 GB

0.5 GB

6 GB

3 GB

1.5 GB

14 GB

7 GB

3.5 GB

16 GB

8 GB

4 GB

13B

26 GB

13 GB

6.5 GB

14B

28 GB

14 GB

7 GB

30B

60 GB

30 GB

15 GB

32B

64 GB

32 GB

16 GB

70B

140 GB

70 GB

35 GB

72B

144 GB

72 GB

36 GB

*Add ~20% for KV cache and overhead

Recommended Quantization by Use Case

Use Case

Recommended

Why

Chat/General

Q4_K_M

Good balance of speed and quality

Coding

Q5_K_M+

Better accuracy for code

Creative Writing

Q4_K_M

Speed matters more

Analysis

Q6_K+

Higher precision needed

Production

FP16/BF16

Maximum quality

Context Length vs VRAM

How Context Affects VRAM

Each model has a context window (max tokens). Longer context = more VRAM:

Model

Default Context

Max Context

VRAM per 1K tokens

Llama 3 8B

128K

~0.3 GB

Llama 3 70B

128K

~0.5 GB

Qwen 2.5 7B

128K

~0.25 GB

Mistral 7B

32K

~0.25 GB

Mixtral 8x7B

32K

~0.4 GB

Context by GPU (Llama 3 8B Q4)

GPU

Comfortable Context

Maximum Context

RTX 3060 12GB

16K

32K

RTX 3090 24GB

64K

96K

RTX 4090 24GB

64K

96K

RTX 5090 32GB

96K

128K

A100 40GB

96K

128K

A100 80GB

128K

Multi-GPU Configurations

Tensor Parallelism

Split one model across multiple GPUs:

Configuration

Total VRAM

Max Model (FP16)

2x RTX 3090

48GB

30B

2x RTX 4090

48GB

30B

2x RTX 5090

64GB

32B

4x RTX 5090

128GB

70B

2x A100 40GB

80GB

70B

4x A100 40GB

160GB

100B+

8x A100 80GB

640GB

DeepSeek-V3

vLLM Multi-GPU

# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

# 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4

Specific Model Guides

Llama 3.1 Family

Variant

Parameters

Min GPU

Recommended Setup

Llama 3.2 1B

Any 4GB

RTX 3060

Llama 3.2 3B

Any 6GB

RTX 3060

Llama 3.1 8B

RTX 3060

RTX 3090

Llama 3.1 70B

70B

A100 40GB

2x A100 40GB

Llama 3.1 405B

405B

8x A100 80GB

8x H100

Mistral/Mixtral Family

Variant

Parameters

Min GPU

Recommended Setup

Mistral 7B

RTX 3060

RTX 3090

Mixtral 8x7B

46.7B

RTX 3090

A100 40GB

Mixtral 8x22B

141B

2x A100 80GB

4x A100 80GB

Qwen 2.5 Family

Variant

Parameters

Min GPU

Recommended Setup

Qwen 2.5 0.5B

0.5B

Any 2GB

Any 4GB

Qwen 2.5 1.5B

1.5B

Any 4GB

RTX 3060

Qwen 2.5 3B

Any 6GB

RTX 3060

Qwen 2.5 7B

RTX 3060

RTX 3090

Qwen 2.5 14B

14B

RTX 3090

RTX 4090

Qwen 2.5 32B

32B

RTX 4090

A100 40GB

Qwen 2.5 72B

72B

A100 40GB

A100 80GB

DeepSeek Models

Variant

Parameters

Min GPU

Recommended Setup

DeepSeek-Coder 6.7B

6.7B

RTX 3060

RTX 3090

DeepSeek-Coder 33B

33B

RTX 4090

A100 40GB

DeepSeek-V2-Lite

15.7B

RTX 3090

A100 40GB

DeepSeek-V3

671B

8x A100 80GB

8x H100

DeepSeek-R1

671B

8x A100 80GB

8x H100 (FP8)

DeepSeek-R1-Distill-Qwen-32B

32B

RTX 5090 32GB

2x A100 40GB

DeepSeek-R1-Distill-Qwen-7B

RTX 3090 24GB

RTX 4090

Troubleshooting

"CUDA out of memory"

Reduce quantization: Q8 → Q4
Lower context length: Reduce max_tokens
Enable CPU offload: --cpu-offload or enable_model_cpu_offload()
Use smaller batch: batch_size=1
Try different GPU: Need more VRAM

"Model too large"

Use quantized version: GGUF Q4 models
Use multiple GPUs: Tensor parallelism
Offload to CPU: Slower but works
Choose smaller model: 7B instead of 13B

"Slow generation"

Upgrade GPU: More VRAM = less offloading
Use faster quantization: Q4 is faster than Q8
Reduce context: Shorter = faster
Enable flash attention: --flash-attn

Next Steps

GPU Comparison Guide - Detailed GPU specs
Docker Images Catalog - Ready-to-deploy images
Quickstart Guide - Get started in 5 minutes

PreviousGPU Comparison NextCost Calculator

Last updated 7 days ago

Was this helpful?

hashtagQuick Reference

hashtagLanguage Models (LLM)

hashtagImage Generation Models

hashtagVideo Generation Models

hashtagAudio Models

hashtagVision & Vision-Language Models

hashtagFine-tuning & Training Tools

hashtagDetailed Compatibility Tables

hashtagLLM by GPU

hashtagImage Generation by GPU

hashtagVideo Generation by GPU

hashtagQuantization Guide

hashtagWhat is Quantization?

hashtagVRAM Calculator

hashtagRecommended Quantization by Use Case

hashtagContext Length vs VRAM

hashtagHow Context Affects VRAM

hashtagContext by GPU (Llama 3 8B Q4)

hashtagMulti-GPU Configurations

hashtagTensor Parallelism

hashtagvLLM Multi-GPU

hashtagSpecific Model Guides

hashtagLlama 3.1 Family

hashtagMistral/Mixtral Family

hashtagQwen 2.5 Family

hashtagDeepSeek Models

hashtagTroubleshooting

hashtag"CUDA out of memory"

hashtag"Model too large"

hashtag"Slow generation"

hashtagNext Steps

Quick Reference

Language Models (LLM)

Image Generation Models

Video Generation Models

Audio Models

Vision & Vision-Language Models

Fine-tuning & Training Tools

Detailed Compatibility Tables

LLM by GPU

Image Generation by GPU

Video Generation by GPU

Quantization Guide

What is Quantization?

VRAM Calculator

Recommended Quantization by Use Case

Context Length vs VRAM

How Context Affects VRAM

Context by GPU (Llama 3 8B Q4)

Multi-GPU Configurations

Tensor Parallelism

vLLM Multi-GPU

Specific Model Guides

Llama 3.1 Family

Mistral/Mixtral Family

Qwen 2.5 Family

DeepSeek Models

Troubleshooting

"CUDA out of memory"

"Model too large"

"Slow generation"

Next Steps