# Model Compatibility

Complete guide to which AI models run on which GPUs on CLORE.AI.

{% hint style="success" %}
Find GPUs with the right VRAM at [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Quick Reference

### Language Models (LLM)

| Model                   | Parameters | Min VRAM   | Recommended            | Quantization             |
| ----------------------- | ---------- | ---------- | ---------------------- | ------------------------ |
| Llama 3.2               | 1B         | 2GB        | 4GB                    | Q4, Q8, FP16             |
| Llama 3.2               | 3B         | 4GB        | 6GB                    | Q4, Q8, FP16             |
| Llama 3.1/3             | 8B         | 6GB        | 12GB                   | Q4, Q8, FP16             |
| Mistral                 | 7B         | 6GB        | 12GB                   | Q4, Q8, FP16             |
| Qwen 2.5                | 7B         | 6GB        | 12GB                   | Q4, Q8, FP16             |
| Qwen 2.5                | 14B        | 12GB       | 16GB                   | Q4, Q8                   |
| Qwen 2.5                | 32B        | 20GB       | 24GB                   | Q4, Q8                   |
| Llama 3.1               | 70B        | 40GB       | 48GB                   | Q4, Q8                   |
| Qwen 2.5                | 72B        | 48GB       | 80GB                   | Q4, Q8                   |
| Mixtral                 | 8x7B       | 24GB       | 48GB                   | Q4                       |
| DeepSeek-V3             | 671B       | 320GB+     | 640GB                  | FP8                      |
| **DeepSeek-R1**         | **671B**   | **320GB+** | **8x H100**            | **FP8, reasoning model** |
| **DeepSeek-R1-Distill** | **32B**    | **20GB**   | **2x A100 / RTX 5090** | **Q4/Q8**                |

### Image Generation Models

| Model                | Min VRAM | Recommended         | Notes                           |
| -------------------- | -------- | ------------------- | ------------------------------- |
| SD 1.5               | 4GB      | 8GB                 | 512x512 native                  |
| SD 2.1               | 6GB      | 8GB                 | 768x768 native                  |
| SDXL                 | 8GB      | 12GB                | 1024x1024 native                |
| SDXL Turbo           | 8GB      | 12GB                | 1-4 steps                       |
| **SD3.5 Large (8B)** | **16GB** | **24GB**            | **1024x1024, advanced quality** |
| FLUX.1 schnell       | 12GB     | 16GB                | 4 steps, fast                   |
| FLUX.1 dev           | 16GB     | 24GB                | 20-50 steps                     |
| **TRELLIS**          | **16GB** | **24GB (RTX 4090)** | **3D generation from images**   |

### Video Generation Models

| Model                  | Min VRAM | Recommended              | Output                        |
| ---------------------- | -------- | ------------------------ | ----------------------------- |
| Stable Video Diffusion | 16GB     | 24GB                     | 4 sec, 576x1024               |
| AnimateDiff            | 12GB     | 16GB                     | 2-4 sec                       |
| **LTX-Video**          | **16GB** | **24GB (RTX 4090/3090)** | **5 sec, 768x512, very fast** |
| Wan2.1                 | 24GB     | 40GB                     | 5 sec, 480p-720p              |
| Hunyuan Video          | 40GB     | 80GB                     | 5 sec, 720p                   |
| OpenSora               | 24GB     | 40GB                     | Variable                      |

### Audio Models

| Model            | Min VRAM | Recommended | Task             |
| ---------------- | -------- | ----------- | ---------------- |
| Whisper tiny     | 1GB      | 2GB         | Transcription    |
| Whisper base     | 1GB      | 2GB         | Transcription    |
| Whisper small    | 2GB      | 4GB         | Transcription    |
| Whisper medium   | 4GB      | 6GB         | Transcription    |
| Whisper large-v3 | 6GB      | 10GB        | Transcription    |
| Bark             | 8GB      | 12GB        | Text-to-Speech   |
| Stable Audio     | 8GB      | 12GB        | Music Generation |

### Vision & Vision-Language Models

| Model                | Min VRAM | Recommended         | Task                      |
| -------------------- | -------- | ------------------- | ------------------------- |
| Llama 3.2 Vision 11B | 12GB     | 16GB                | Image Understanding       |
| Llama 3.2 Vision 90B | 48GB     | 80GB                | Image Understanding       |
| LLaVA 7B             | 8GB      | 12GB                | Visual QA                 |
| LLaVA 13B            | 16GB     | 24GB                | Visual QA                 |
| **Qwen2.5-VL 7B**    | **16GB** | **24GB (RTX 4090)** | **Image/Video/Doc OCR**   |
| **Qwen2.5-VL 72B**   | **48GB** | **2x A100 80GB**    | **Maximum VL capability** |

### Fine-tuning & Training Tools

| Tool / Method        | Min VRAM | Recommended GPU   | Task                            |
| -------------------- | -------- | ----------------- | ------------------------------- |
| **Unsloth QLoRA 7B** | **12GB** | **RTX 3090 24GB** | **2x faster QLoRA, low VRAM**   |
| Unsloth QLoRA 13B    | 16GB     | RTX 4090 24GB     | Fast fine-tuning                |
| LoRA (standard)      | 12GB     | RTX 3090          | Parameter-efficient fine-tuning |
| Full fine-tune 7B    | 40GB     | A100 40GB         | Maximum quality training        |

***

## Detailed Compatibility Tables

### LLM by GPU

| GPU              | Max Model (Q4) | Max Model (Q8) | Max Model (FP16) |
| ---------------- | -------------- | -------------- | ---------------- |
| RTX 3060 12GB    | 13B            | 7B             | 3B               |
| RTX 3070 8GB     | 7B             | 3B             | 1B               |
| RTX 3080 10GB    | 7B             | 7B             | 3B               |
| RTX 3090 24GB    | 30B            | 13B            | 7B               |
| RTX 4070 Ti 12GB | 13B            | 7B             | 3B               |
| RTX 4080 16GB    | 14B            | 7B             | 7B               |
| RTX 4090 24GB    | 30B            | 13B            | 7B               |
| RTX 5090 32GB    | 32B            | 14B            | 13B              |
| A100 40GB        | 70B            | 30B            | 14B              |
| A100 80GB        | 70B            | 70B            | 30B              |
| H100 80GB        | 70B            | 70B            | 30B              |

### Image Generation by GPU

| GPU              | SD 1.5 | SDXL   | FLUX schnell | FLUX dev |
| ---------------- | ------ | ------ | ------------ | -------- |
| RTX 3060 12GB    | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 3070 8GB     | ✅ 512  | ⚠️ 512 | ❌            | ❌        |
| RTX 3080 10GB    | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 3090 24GB    | ✅ 768  | ✅ 1024 | ✅ 1024       | ⚠️ 768\* |
| RTX 4070 Ti 12GB | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 4080 16GB    | ✅ 768  | ✅ 1024 | ✅ 768        | ⚠️ 512\* |
| RTX 4090 24GB    | ✅ 1024 | ✅ 1024 | ✅ 1024       | ✅ 1024   |
| RTX 5090 32GB    | ✅ 1024 | ✅ 1024 | ✅ 1536       | ✅ 1536   |
| A100 40GB        | ✅ 1024 | ✅ 1024 | ✅ 1024       | ✅ 1024   |
| A100 80GB        | ✅ 2048 | ✅ 2048 | ✅ 1536       | ✅ 1536   |

\*With CPU offload or reduced batch size

### Video Generation by GPU

| GPU           | SVD    | AnimateDiff | Wan2.1  | Hunyuan  |
| ------------- | ------ | ----------- | ------- | -------- |
| RTX 3060 12GB | ❌      | ⚠️ short    | ❌       | ❌        |
| RTX 3090 24GB | ✅ 2-4s | ✅           | ⚠️ 480p | ❌        |
| RTX 4090 24GB | ✅ 4s   | ✅           | ✅ 480p  | ⚠️ short |
| RTX 5090 32GB | ✅ 6s   | ✅           | ✅ 720p  | ✅ 5s     |
| A100 40GB     | ✅ 4s   | ✅           | ✅ 720p  | ✅ 5s     |
| A100 80GB     | ✅ 8s   | ✅           | ✅ 720p  | ✅ 10s    |

***

## Quantization Guide

### What is Quantization?

Quantization reduces model precision to fit in less VRAM:

| Format   | Bits | VRAM Reduction | Quality Loss |
| -------- | ---- | -------------- | ------------ |
| FP32     | 32   | Baseline       | None         |
| FP16     | 16   | 50%            | Minimal      |
| BF16     | 16   | 50%            | Minimal      |
| FP8      | 8    | 75%            | Small        |
| Q8       | 8    | 75%            | Small        |
| Q6\_K    | 6    | 81%            | Small        |
| Q5\_K\_M | 5    | 84%            | Moderate     |
| Q4\_K\_M | 4    | 87%            | Moderate     |
| Q3\_K\_M | 3    | 91%            | Noticeable   |
| Q2\_K    | 2    | 94%            | Significant  |
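
As a concrete example, here is one way to serve a Q4\_K\_M build with llama.cpp's `llama-server`; the model path and port are illustrative assumptions:

```bash
# Serve a 4-bit (Q4_K_M) GGUF build with llama.cpp's llama-server.
# The model path is an example -- point it at your own download.
# --n-gpu-layers 99 keeps every layer on the GPU; --ctx-size sets the
# context window (larger values use more VRAM -- see below).
./llama-server \
    -m ./models/llama-3.1-8b-instruct.Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --port 8080
```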

### VRAM Calculator

**Formula:** `VRAM (GB) ≈ Parameters (B) × Bytes per Parameter`

| Model Size | FP16   | Q8    | Q4     |
| ---------- | ------ | ----- | ------ |
| 1B         | 2 GB   | 1 GB  | 0.5 GB |
| 3B         | 6 GB   | 3 GB  | 1.5 GB |
| 7B         | 14 GB  | 7 GB  | 3.5 GB |
| 8B         | 16 GB  | 8 GB  | 4 GB   |
| 13B        | 26 GB  | 13 GB | 6.5 GB |
| 14B        | 28 GB  | 14 GB | 7 GB   |
| 30B        | 60 GB  | 30 GB | 15 GB  |
| 32B        | 64 GB  | 32 GB | 16 GB  |
| 70B        | 140 GB | 70 GB | 35 GB  |
| 72B        | 144 GB | 72 GB | 36 GB  |

Add \~20% for KV cache and overhead.
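
The formula translates directly into a shell helper; this sketch simply restates the table's arithmetic, including the \~20% overhead:

```bash
# Rough VRAM estimate: parameters (billions) x bytes per parameter,
# plus ~20% for KV cache and runtime overhead.
estimate_vram() {
    local params_b=$1   # parameters in billions, e.g. 70
    local bytes=$2      # bytes per parameter: 2 (FP16), 1 (Q8), 0.5 (Q4)
    awk -v p="$params_b" -v b="$bytes" 'BEGIN { printf "%.1f GB\n", p * b * 1.2 }'
}

estimate_vram 70 0.5   # Llama 3.1 70B at Q4 -> ~42.0 GB
```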

### Recommended Quantization by Use Case

| Use Case         | Recommended | Why                               |
| ---------------- | ----------- | --------------------------------- |
| Chat/General     | Q4\_K\_M    | Good balance of speed and quality |
| Coding           | Q5\_K\_M+   | Better accuracy for code          |
| Creative Writing | Q4\_K\_M    | Speed matters more                |
| Analysis         | Q6\_K+      | Higher precision needed           |
| Production       | FP16/BF16   | Maximum quality                   |

***

## Context Length vs VRAM

### How Context Affects VRAM

Each model has a context window (its maximum token count). A longer context means a larger KV cache, which consumes more VRAM:

| Model        | Default Context | Max Context | VRAM per 1K tokens |
| ------------ | --------------- | ----------- | ------------------ |
| Llama 3.1 8B | 8K              | 128K        | \~0.3 GB           |
| Llama 3.1 70B | 8K              | 128K        | \~0.5 GB           |
| Qwen 2.5 7B  | 8K              | 128K        | \~0.25 GB          |
| Mistral 7B   | 8K              | 32K         | \~0.25 GB          |
| Mixtral 8x7B | 32K             | 32K         | \~0.4 GB           |
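
If a model's full window doesn't fit in VRAM, most serving stacks let you cap it. For example, vLLM's `--max-model-len` bounds the KV cache:

```bash
# Cap the context window so the KV cache fits in VRAM.
# At ~0.3 GB per 1K tokens, a 16K window keeps the cache near 5 GB.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 16384
```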

### Context by GPU (Llama 3.1 8B Q4)

| GPU           | Comfortable Context | Maximum Context |
| ------------- | ------------------- | --------------- |
| RTX 3060 12GB | 16K                 | 32K             |
| RTX 3090 24GB | 64K                 | 96K             |
| RTX 4090 24GB | 64K                 | 96K             |
| RTX 5090 32GB | 96K                 | 128K            |
| A100 40GB     | 96K                 | 128K            |
| A100 80GB     | 128K                | 128K            |

***

## Multi-GPU Configurations

### Tensor Parallelism

Split one model across multiple GPUs:

| Configuration | Total VRAM | Max Model (Q8)   |
| ------------- | ---------- | ---------------- |
| 2x RTX 3090   | 48GB       | 30B              |
| 2x RTX 4090   | 48GB       | 30B              |
| 2x RTX 5090   | 64GB       | 32B              |
| 4x RTX 5090   | 128GB      | 70B              |
| 2x A100 40GB  | 80GB       | 70B              |
| 4x A100 40GB  | 160GB      | 100B+            |
| 8x A100 80GB  | 640GB      | DeepSeek-V3 (FP8) |

### vLLM Multi-GPU

```bash
# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

# 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4
```
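
vLLM splits attention heads across GPUs, so `--tensor-parallel-size` must evenly divide the model's attention-head count; Llama 3.1 70B has 64 heads, so 2, 4, and 8 all work.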

***

## Specific Model Guides

### Llama 3.1/3.2 Family

| Variant        | Parameters | Min GPU      | Recommended Setup |
| -------------- | ---------- | ------------ | ----------------- |
| Llama 3.2 1B   | 1B         | Any 4GB      | RTX 3060          |
| Llama 3.2 3B   | 3B         | Any 6GB      | RTX 3060          |
| Llama 3.1 8B   | 8B         | RTX 3060     | RTX 3090          |
| Llama 3.1 70B  | 70B        | A100 40GB    | 2x A100 40GB      |
| Llama 3.1 405B | 405B       | 8x A100 80GB | 8x H100           |

### Mistral/Mixtral Family

| Variant       | Parameters | Min GPU      | Recommended Setup |
| ------------- | ---------- | ------------ | ----------------- |
| Mistral 7B    | 7B         | RTX 3060     | RTX 3090          |
| Mixtral 8x7B  | 46.7B      | RTX 3090     | A100 40GB         |
| Mixtral 8x22B | 141B       | 2x A100 80GB | 4x A100 80GB      |

### Qwen 2.5 Family

| Variant       | Parameters | Min GPU   | Recommended Setup |
| ------------- | ---------- | --------- | ----------------- |
| Qwen 2.5 0.5B | 0.5B       | Any 2GB   | Any 4GB           |
| Qwen 2.5 1.5B | 1.5B       | Any 4GB   | RTX 3060          |
| Qwen 2.5 3B   | 3B         | Any 6GB   | RTX 3060          |
| Qwen 2.5 7B   | 7B         | RTX 3060  | RTX 3090          |
| Qwen 2.5 14B  | 14B        | RTX 3090  | RTX 4090          |
| Qwen 2.5 32B  | 32B        | RTX 4090  | A100 40GB         |
| Qwen 2.5 72B  | 72B        | A100 40GB | A100 80GB         |

### DeepSeek Models

| Variant                          | Parameters | Min GPU           | Recommended Setup |
| -------------------------------- | ---------- | ----------------- | ----------------- |
| DeepSeek-Coder 6.7B              | 6.7B       | RTX 3060          | RTX 3090          |
| DeepSeek-Coder 33B               | 33B        | RTX 4090          | A100 40GB         |
| DeepSeek-V2-Lite                 | 15.7B      | RTX 3090          | A100 40GB         |
| DeepSeek-V3                      | 671B       | 8x A100 80GB      | 8x H100           |
| **DeepSeek-R1**                  | **671B**   | **8x A100 80GB**  | **8x H100 (FP8)** |
| **DeepSeek-R1-Distill-Qwen-32B** | **32B**    | **RTX 5090 32GB** | **2x A100 40GB**  |
| **DeepSeek-R1-Distill-Qwen-7B**  | **7B**     | **RTX 3090 24GB** | **RTX 4090**      |

***

## Troubleshooting

### "CUDA out of memory"

1. **Use a lower-bit quantization:** Q8 → Q4
2. **Lower the context length:** reduce the maximum context (e.g. vLLM's `--max-model-len`)
3. **Enable CPU offload:** e.g. vLLM's `--cpu-offload-gb` or diffusers' `enable_model_cpu_offload()` (see the sketch below)
4. **Use a smaller batch:** batch\_size=1
5. **Rent a GPU with more VRAM:** see the tables above
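
A minimal offload sketch with llama.cpp, assuming a GGUF file is already on disk (the path is illustrative): lowering `--n-gpu-layers` keeps only part of the model on the GPU and runs the rest on the CPU.

```bash
# Partial CPU offload with llama.cpp: keep only 20 transformer layers
# on the GPU and run the remainder from system RAM. Slower, but it
# avoids "CUDA out of memory" on small cards. Path is illustrative.
./llama-server \
    -m ./models/model.Q4_K_M.gguf \
    --n-gpu-layers 20
```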

### "Model too large"

1. **Use a quantized version:** GGUF Q4 models (download sketch below)
2. **Use multiple GPUs:** Tensor parallelism
3. **Offload to CPU:** Slower but works
4. **Choose smaller model:** 7B instead of 13B
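
For step 1, quantized GGUF builds can be pulled straight from Hugging Face with `huggingface-cli`; the repository name below is an illustrative assumption:

```bash
# Download only the Q4_K_M file from a GGUF repository
# (repo name is an example -- substitute the model you need).
huggingface-cli download \
    bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    --include "*Q4_K_M*" \
    --local-dir ./models
```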

### "Slow generation"

1. **Upgrade GPU:** more VRAM means less offloading
2. **Use a lower-bit quantization:** Q4 moves less memory per token than Q8
3. **Reduce context:** shorter prompts decode faster
4. **Enable flash attention:** `--flash-attn` (llama.cpp; see the sketch below)
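
A minimal sketch for step 4 with llama.cpp (flag spelling can vary between builds; the model path is illustrative):

```bash
# Enable flash attention in llama.cpp to cut per-token memory traffic.
./llama-server \
    -m ./models/model.Q4_K_M.gguf \
    --flash-attn
```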

## Next Steps

* [GPU Comparison Guide](https://docs.clore.ai/guides/getting-started/gpu-comparison) - Detailed GPU specs
* [Docker Images Catalog](https://docs.clore.ai/guides/getting-started/docker-images) - Ready-to-deploy images
* [Quickstart Guide](https://docs.clore.ai/guides/quickstart) - Get started in 5 minutes
