# Model Compatibility

Complete guide to which AI models run on which GPUs on CLORE.AI.

{% hint style="success" %}
Find GPUs with the right VRAM at [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Quick Reference

### Language Models (LLM)

| Model                   | Parameters | Min VRAM   | Recommended            | Quantization             |
| ----------------------- | ---------- | ---------- | ---------------------- | ------------------------ |
| Llama 3.2               | 1B         | 2GB        | 4GB                    | Q4, Q8, FP16             |
| Llama 3.2               | 3B         | 4GB        | 6GB                    | Q4, Q8, FP16             |
| Llama 3.1/3             | 8B         | 6GB        | 12GB                   | Q4, Q8, FP16             |
| Mistral                 | 7B         | 6GB        | 12GB                   | Q4, Q8, FP16             |
| Qwen 2.5                | 7B         | 6GB        | 12GB                   | Q4, Q8, FP16             |
| Qwen 2.5                | 14B        | 12GB       | 16GB                   | Q4, Q8                   |
| Qwen 2.5                | 32B        | 20GB       | 24GB                   | Q4, Q8                   |
| Llama 3.1               | 70B        | 40GB       | 48GB                   | Q4, Q8                   |
| Qwen 2.5                | 72B        | 48GB       | 80GB                   | Q4, Q8                   |
| Mixtral                 | 8x7B       | 24GB       | 48GB                   | Q4                       |
| DeepSeek-V3             | 671B       | 320GB+     | 640GB                  | FP8                      |
| **DeepSeek-R1**         | **671B**   | **320GB+** | **8x H100**            | **FP8, reasoning model** |
| **DeepSeek-R1-Distill** | **32B**    | **20GB**   | **2x A100 / RTX 5090** | **Q4/Q8**                |

### Image Generation Models

| Model                | Min VRAM | Recommended         | Notes                           |
| -------------------- | -------- | ------------------- | ------------------------------- |
| SD 1.5               | 4GB      | 8GB                 | 512x512 native                  |
| SD 2.1               | 6GB      | 8GB                 | 768x768 native                  |
| SDXL                 | 8GB      | 12GB                | 1024x1024 native                |
| SDXL Turbo           | 8GB      | 12GB                | 1-4 steps                       |
| **SD3.5 Large (8B)** | **16GB** | **24GB**            | **1024x1024, advanced quality** |
| FLUX.1 schnell       | 12GB     | 16GB                | 4 steps, fast                   |
| FLUX.1 dev           | 16GB     | 24GB                | 20-50 steps                     |
| **TRELLIS**          | **16GB** | **24GB (RTX 4090)** | **3D generation from images**   |

### Video Generation Models

| Model                  | Min VRAM | Recommended              | Output                        |
| ---------------------- | -------- | ------------------------ | ----------------------------- |
| Stable Video Diffusion | 16GB     | 24GB                     | 4 sec, 576x1024               |
| AnimateDiff            | 12GB     | 16GB                     | 2-4 sec                       |
| **LTX-Video**          | **16GB** | **24GB (RTX 4090/3090)** | **5 sec, 768x512, very fast** |
| Wan2.1                 | 24GB     | 40GB                     | 5 sec, 480p-720p              |
| Hunyuan Video          | 40GB     | 80GB                     | 5 sec, 720p                   |
| OpenSora               | 24GB     | 40GB                     | Variable                      |

### Audio Models

| Model            | Min VRAM | Recommended | Task             |
| ---------------- | -------- | ----------- | ---------------- |
| Whisper tiny     | 1GB      | 2GB         | Transcription    |
| Whisper base     | 1GB      | 2GB         | Transcription    |
| Whisper small    | 2GB      | 4GB         | Transcription    |
| Whisper medium   | 4GB      | 6GB         | Transcription    |
| Whisper large-v3 | 6GB      | 10GB        | Transcription    |
| Bark             | 8GB      | 12GB        | Text-to-Speech   |
| Stable Audio     | 8GB      | 12GB        | Music Generation |

### Vision & Vision-Language Models

| Model                | Min VRAM | Recommended         | Task                      |
| -------------------- | -------- | ------------------- | ------------------------- |
| Llama 3.2 Vision 11B | 12GB     | 16GB                | Image Understanding       |
| Llama 3.2 Vision 90B | 48GB     | 80GB                | Image Understanding       |
| LLaVA 7B             | 8GB      | 12GB                | Visual QA                 |
| LLaVA 13B            | 16GB     | 24GB                | Visual QA                 |
| **Qwen2.5-VL 7B**    | **16GB** | **24GB (RTX 4090)** | **Image/Video/Doc OCR**   |
| **Qwen2.5-VL 72B**   | **48GB** | **2x A100 80GB**    | **Maximum VL capability** |

### Fine-tuning & Training Tools

| Tool / Method        | Min VRAM | Recommended GPU   | Task                            |
| -------------------- | -------- | ----------------- | ------------------------------- |
| **Unsloth QLoRA 7B** | **12GB** | **RTX 3090 24GB** | **2x faster QLoRA, low VRAM**   |
| Unsloth QLoRA 13B    | 16GB     | RTX 4090 24GB     | Fast fine-tuning                |
| LoRA (standard)      | 12GB     | RTX 3090          | Parameter-efficient fine-tuning |
| Full fine-tune 7B    | 40GB     | A100 40GB         | Maximum quality training        |

***

## Detailed Compatibility Tables

### LLM by GPU

| GPU              | Max Model (Q4) | Max Model (Q8) | Max Model (FP16) |
| ---------------- | -------------- | -------------- | ---------------- |
| RTX 3060 12GB    | 13B            | 7B             | 3B               |
| RTX 3070 8GB     | 7B             | 3B             | 1B               |
| RTX 3080 10GB    | 7B             | 7B             | 3B               |
| RTX 3090 24GB    | 30B            | 13B            | 7B               |
| RTX 4070 Ti 12GB | 13B            | 7B             | 3B               |
| RTX 4080 16GB    | 14B            | 7B             | 7B               |
| RTX 4090 24GB    | 30B            | 13B            | 7B               |
| RTX 5090 32GB    | 70B            | 14B            | 13B              |
| A100 40GB        | 70B            | 30B            | 14B              |
| A100 80GB        | 70B            | 70B            | 30B              |
| H100 80GB        | 70B            | 70B            | 30B              |

### Image Generation by GPU

| GPU              | SD 1.5 | SDXL   | FLUX schnell | FLUX dev |
| ---------------- | ------ | ------ | ------------ | -------- |
| RTX 3060 12GB    | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 3070 8GB     | ✅ 512  | ⚠️ 512 | ❌            | ❌        |
| RTX 3080 10GB    | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 3090 24GB    | ✅ 768  | ✅ 1024 | ✅ 1024       | ⚠️ 768\* |
| RTX 4070 Ti 12GB | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 4080 16GB    | ✅ 768  | ✅ 1024 | ✅ 768        | ⚠️ 512\* |
| RTX 4090 24GB    | ✅ 1024 | ✅ 1024 | ✅ 1024       | ✅ 1024   |
| RTX 5090 32GB    | ✅ 1024 | ✅ 1024 | ✅ 1536       | ✅ 1536   |
| A100 40GB        | ✅ 1024 | ✅ 1024 | ✅ 1024       | ✅ 1024   |
| A100 80GB        | ✅ 2048 | ✅ 2048 | ✅ 1536       | ✅ 1536   |

\*With CPU offload or reduced batch size

### Video Generation by GPU

| GPU           | SVD    | AnimateDiff | Wan2.1  | Hunyuan  |
| ------------- | ------ | ----------- | ------- | -------- |
| RTX 3060 12GB | ❌      | ⚠️ short    | ❌       | ❌        |
| RTX 3090 24GB | ✅ 2-4s | ✅           | ⚠️ 480p | ❌        |
| RTX 4090 24GB | ✅ 4s   | ✅           | ✅ 480p  | ⚠️ short |
| RTX 5090 32GB | ✅ 6s   | ✅           | ✅ 720p  | ✅ 5s     |
| A100 40GB     | ✅ 4s   | ✅           | ✅ 720p  | ✅ 5s     |
| A100 80GB     | ✅ 8s   | ✅           | ✅ 720p  | ✅ 10s    |

***

## Quantization Guide

### What is Quantization?

Quantization reduces model precision to fit in less VRAM:

| Format   | Bits | VRAM Reduction | Quality Loss |
| -------- | ---- | -------------- | ------------ |
| FP32     | 32   | Baseline       | None         |
| FP16     | 16   | 50%            | Minimal      |
| BF16     | 16   | 50%            | Minimal      |
| FP8      | 8    | 75%            | Small        |
| Q8       | 8    | 75%            | Small        |
| Q6\_K    | 6    | 81%            | Small        |
| Q5\_K\_M | 5    | 84%            | Moderate     |
| Q4\_K\_M | 4    | 87%            | Moderate     |
| Q3\_K\_M | 3    | 91%            | Noticeable   |
| Q2\_K    | 2    | 94%            | Significant  |

### VRAM Calculator

**Formula:** `VRAM (GB) ≈ Parameters (B) × Bytes per Parameter`

| Model Size | FP16   | Q8    | Q4     |
| ---------- | ------ | ----- | ------ |
| 1B         | 2 GB   | 1 GB  | 0.5 GB |
| 3B         | 6 GB   | 3 GB  | 1.5 GB |
| 7B         | 14 GB  | 7 GB  | 3.5 GB |
| 8B         | 16 GB  | 8 GB  | 4 GB   |
| 13B        | 26 GB  | 13 GB | 6.5 GB |
| 14B        | 28 GB  | 14 GB | 7 GB   |
| 30B        | 60 GB  | 30 GB | 15 GB  |
| 32B        | 64 GB  | 32 GB | 16 GB  |
| 70B        | 140 GB | 70 GB | 35 GB  |
| 72B        | 144 GB | 72 GB | 36 GB  |

\*Add \~20% for KV cache and overhead

### Recommended Quantization by Use Case

| Use Case         | Recommended | Why                               |
| ---------------- | ----------- | --------------------------------- |
| Chat/General     | Q4\_K\_M    | Good balance of speed and quality |
| Coding           | Q5\_K\_M+   | Better accuracy for code          |
| Creative Writing | Q4\_K\_M    | Speed matters more                |
| Analysis         | Q6\_K+      | Higher precision needed           |
| Production       | FP16/BF16   | Maximum quality                   |

***

## Context Length vs VRAM

### How Context Affects VRAM

Each model has a context window (max tokens). Longer context = more VRAM:

| Model        | Default Context | Max Context | VRAM per 1K tokens |
| ------------ | --------------- | ----------- | ------------------ |
| Llama 3 8B   | 8K              | 128K        | \~0.3 GB           |
| Llama 3 70B  | 8K              | 128K        | \~0.5 GB           |
| Qwen 2.5 7B  | 8K              | 128K        | \~0.25 GB          |
| Mistral 7B   | 8K              | 32K         | \~0.25 GB          |
| Mixtral 8x7B | 32K             | 32K         | \~0.4 GB           |

### Context by GPU (Llama 3 8B Q4)

| GPU           | Comfortable Context | Maximum Context |
| ------------- | ------------------- | --------------- |
| RTX 3060 12GB | 16K                 | 32K             |
| RTX 3090 24GB | 64K                 | 96K             |
| RTX 4090 24GB | 64K                 | 96K             |
| RTX 5090 32GB | 96K                 | 128K            |
| A100 40GB     | 96K                 | 128K            |
| A100 80GB     | 128K                | 128K            |

***

## Multi-GPU Configurations

### Tensor Parallelism

Split one model across multiple GPUs:

| Configuration | Total VRAM | Max Model (FP16) |
| ------------- | ---------- | ---------------- |
| 2x RTX 3090   | 48GB       | 30B              |
| 2x RTX 4090   | 48GB       | 30B              |
| 2x RTX 5090   | 64GB       | 32B              |
| 4x RTX 5090   | 128GB      | 70B              |
| 2x A100 40GB  | 80GB       | 70B              |
| 4x A100 40GB  | 160GB      | 100B+            |
| 8x A100 80GB  | 640GB      | DeepSeek-V3      |

### vLLM Multi-GPU

```bash
# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

# 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4
```

***

## Specific Model Guides

### Llama 3.1 Family

| Variant        | Parameters | Min GPU      | Recommended Setup |
| -------------- | ---------- | ------------ | ----------------- |
| Llama 3.2 1B   | 1B         | Any 4GB      | RTX 3060          |
| Llama 3.2 3B   | 3B         | Any 6GB      | RTX 3060          |
| Llama 3.1 8B   | 8B         | RTX 3060     | RTX 3090          |
| Llama 3.1 70B  | 70B        | A100 40GB    | 2x A100 40GB      |
| Llama 3.1 405B | 405B       | 8x A100 80GB | 8x H100           |

### Mistral/Mixtral Family

| Variant       | Parameters | Min GPU      | Recommended Setup |
| ------------- | ---------- | ------------ | ----------------- |
| Mistral 7B    | 7B         | RTX 3060     | RTX 3090          |
| Mixtral 8x7B  | 46.7B      | RTX 3090     | A100 40GB         |
| Mixtral 8x22B | 141B       | 2x A100 80GB | 4x A100 80GB      |

### Qwen 2.5 Family

| Variant       | Parameters | Min GPU   | Recommended Setup |
| ------------- | ---------- | --------- | ----------------- |
| Qwen 2.5 0.5B | 0.5B       | Any 2GB   | Any 4GB           |
| Qwen 2.5 1.5B | 1.5B       | Any 4GB   | RTX 3060          |
| Qwen 2.5 3B   | 3B         | Any 6GB   | RTX 3060          |
| Qwen 2.5 7B   | 7B         | RTX 3060  | RTX 3090          |
| Qwen 2.5 14B  | 14B        | RTX 3090  | RTX 4090          |
| Qwen 2.5 32B  | 32B        | RTX 4090  | A100 40GB         |
| Qwen 2.5 72B  | 72B        | A100 40GB | A100 80GB         |

### DeepSeek Models

| Variant                          | Parameters | Min GPU           | Recommended Setup |
| -------------------------------- | ---------- | ----------------- | ----------------- |
| DeepSeek-Coder 6.7B              | 6.7B       | RTX 3060          | RTX 3090          |
| DeepSeek-Coder 33B               | 33B        | RTX 4090          | A100 40GB         |
| DeepSeek-V2-Lite                 | 15.7B      | RTX 3090          | A100 40GB         |
| DeepSeek-V3                      | 671B       | 8x A100 80GB      | 8x H100           |
| **DeepSeek-R1**                  | **671B**   | **8x A100 80GB**  | **8x H100 (FP8)** |
| **DeepSeek-R1-Distill-Qwen-32B** | **32B**    | **RTX 5090 32GB** | **2x A100 40GB**  |
| **DeepSeek-R1-Distill-Qwen-7B**  | **7B**     | **RTX 3090 24GB** | **RTX 4090**      |

***

## Troubleshooting

### "CUDA out of memory"

1. **Reduce quantization:** Q8 → Q4
2. **Lower context length:** Reduce max\_tokens
3. **Enable CPU offload:** `--cpu-offload` or `enable_model_cpu_offload()`
4. **Use smaller batch:** batch\_size=1
5. **Try different GPU:** Need more VRAM

### "Model too large"

1. **Use quantized version:** GGUF Q4 models
2. **Use multiple GPUs:** Tensor parallelism
3. **Offload to CPU:** Slower but works
4. **Choose smaller model:** 7B instead of 13B

### "Slow generation"

1. **Upgrade GPU:** More VRAM = less offloading
2. **Use faster quantization:** Q4 is faster than Q8
3. **Reduce context:** Shorter = faster
4. **Enable flash attention:** `--flash-attn`

## Next Steps

* [GPU Comparison Guide](/guides/getting-started/gpu-comparison.md) - Detailed GPU specs
* [Docker Images Catalog](/guides/getting-started/docker-images.md) - Ready-to-deploy images
* [Quickstart Guide](/guides/quickstart.md) - Get started in 5 minutes


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/getting-started/model-compatibility.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
