# Model Compatibility

A complete guide to which AI models run on which GPUs on CLORE.AI.

{% hint style="success" %}
Find GPUs with the right amount of VRAM on the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Quick reference

### Language models (LLMs)

| Model                   | Parameters | Min VRAM    | Recommended            | Quantization             |
| ----------------------- | ---------- | ----------- | ---------------------- | ------------------------ |
| Llama 3.2               | 1B         | 2 GB        | 4 GB                   | Q4, Q8, FP16             |
| Llama 3.2               | 3B         | 4 GB        | 6 GB                   | Q4, Q8, FP16             |
| Llama 3.1/3             | 8B         | 6 GB        | 12 GB                  | Q4, Q8, FP16             |
| Mistral                 | 7B         | 6 GB        | 12 GB                  | Q4, Q8, FP16             |
| Qwen 2.5                | 7B         | 6 GB        | 12 GB                  | Q4, Q8, FP16             |
| Qwen 2.5                | 14B        | 12 GB       | 16 GB                  | Q4, Q8                   |
| Qwen 2.5                | 32B        | 20 GB       | 24 GB                  | Q4, Q8                   |
| Llama 3.1               | 70B        | 40 GB       | 48 GB                  | Q4, Q8                   |
| Qwen 2.5                | 72B        | 48 GB       | 80 GB                  | Q4, Q8                   |
| Mixtral                 | 8x7B       | 24 GB       | 48 GB                  | Q4                       |
| DeepSeek-V3             | 671B       | 320 GB+     | 640 GB                 | FP8                      |
| **DeepSeek-R1**         | **671B**   | **320 GB+** | **8x H100**            | **FP8, reasoning model** |
| **DeepSeek-R1-Distill** | **32B**    | **20 GB**   | **2x A100 / RTX 5090** | **Q4/Q8**                |

### Image generation models

| Model                | VRAM min  | Recommended          | Notes                          |
| -------------------- | --------- | -------------------- | ------------------------------ |
| SD 1.5               | 4 GB      | 8 GB                 | 512x512 native                 |
| SD 2.1               | 6 GB      | 8 GB                 | 768x768 native                 |
| SDXL                 | 8 GB      | 12 GB                | 1024x1024 native               |
| SDXL Turbo           | 8 GB      | 12 GB                | 1-4 steps                      |
| **SD3.5 Large (8B)** | **16 GB** | **24 GB**            | **1024x1024, advanced quality** |
| FLUX.1 schnell       | 12 GB     | 16 GB                | 4 steps, fast                  |
| FLUX.1 dev           | 16 GB     | 24 GB                | 20-50 steps                    |
| **TRELLIS**          | **16 GB** | **24 GB (RTX 4090)** | **Image-to-3D generation**     |

### Video generation models

| Model                  | Min VRAM  | Recommended               | Output                       |
| ---------------------- | --------- | ------------------------- | ---------------------------- |
| Stable Video Diffusion | 16 GB     | 24 GB                     | 4 s, 576x1024                |
| AnimateDiff            | 12 GB     | 16 GB                     | 2-4 s                        |
| **LTX-Video**          | **16 GB** | **24 GB (RTX 4090/3090)** | **5 s, 768x512, very fast**  |
| Wan2.1                 | 24 GB     | 40 GB                     | 5 s, 480p-720p               |
| Hunyuan Video          | 40 GB     | 80 GB                     | 5 s, 720p                    |
| OpenSora               | 24 GB     | 40 GB                     | Variable                     |

### Audio models

| Model            | Min VRAM | Recommended | Task             |
| ---------------- | -------- | ----------- | ---------------- |
| Whisper tiny     | 1 GB     | 2 GB        | Transcription    |
| Whisper base     | 1 GB     | 2 GB        | Transcription    |
| Whisper small    | 2 GB     | 4 GB        | Transcription    |
| Whisper medium   | 4 GB     | 6 GB        | Transcription    |
| Whisper large-v3 | 6 GB     | 10 GB       | Transcription    |
| Bark             | 8 GB     | 12 GB       | Text-to-speech   |
| Stable Audio     | 8 GB     | 12 GB       | Music generation |

### Vision and vision-language models

| Model                | Min VRAM  | Recommended          | Task                         |
| -------------------- | --------- | -------------------- | ---------------------------- |
| Llama 3.2 Vision 11B | 12 GB     | 16 GB                | Image understanding          |
| Llama 3.2 Vision 90B | 48 GB     | 80 GB                | Image understanding          |
| LLaVA 7B             | 8 GB      | 12 GB                | Visual QA                    |
| LLaVA 13B            | 16 GB     | 24 GB                | Visual QA                    |
| **Qwen2.5-VL 7B**    | **16 GB** | **24 GB (RTX 4090)** | **Image/video/document OCR** |
| **Qwen2.5-VL 72B**   | **48 GB** | **2x A100 80GB**     | **Maximum VL capability**    |

### Fine-tuning and training tools

| Tool / Method        | Min VRAM  | Recommended GPU   | Task                           |
| -------------------- | --------- | ----------------- | ------------------------------ |
| **Unsloth QLoRA 7B** | **12 GB** | **RTX 3090 24GB** | **2x faster QLoRA, low VRAM**  |
| Unsloth QLoRA 13B    | 16 GB     | RTX 4090 24GB     | Fast fine-tuning               |
| LoRA (standard)      | 12 GB     | RTX 3090          | Parameter-efficient fine-tuning |
| Full fine-tune 7B    | 40 GB     | A100 40GB         | Maximum-quality training       |

***

## Detailed compatibility tables

### LLMs by GPU

| GPU              | Max model (Q4) | Max model (Q8) | Max model (FP16) |
| ---------------- | -------------- | -------------- | ---------------- |
| RTX 3060 12GB    | 13B            | 7B             | 3B               |
| RTX 3070 8GB     | 7B             | 3B             | 1B               |
| RTX 3080 10GB    | 7B             | 7B             | 3B               |
| RTX 3090 24GB    | 30B            | 13B            | 7B               |
| RTX 4070 Ti 12GB | 13B            | 7B             | 3B               |
| RTX 4080 16GB    | 14B            | 7B             | 7B               |
| RTX 4090 24GB    | 30B            | 13B            | 7B               |
| RTX 5090 32GB    | 70B            | 14B            | 13B              |
| A100 40GB        | 70B            | 30B            | 14B              |
| A100 80GB        | 70B            | 70B            | 30B              |
| H100 80GB        | 70B            | 70B            | 30B              |

### Image generation by GPU

Each cell shows whether the model runs and the practical output resolution in pixels.

| GPU              | SD 1.5 | SDXL   | FLUX schnell | FLUX dev |
| ---------------- | ------ | ------ | ------------ | -------- |
| RTX 3060 12GB    | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 3070 8GB     | ✅ 512  | ⚠️ 512 | ❌            | ❌        |
| RTX 3080 10GB    | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 3090 24GB    | ✅ 768  | ✅ 1024 | ✅ 1024       | ⚠️ 768\* |
| RTX 4070 Ti 12GB | ✅ 512  | ✅ 768  | ⚠️ 512\*     | ❌        |
| RTX 4080 16GB    | ✅ 768  | ✅ 1024 | ✅ 768        | ⚠️ 512\* |
| RTX 4090 24GB    | ✅ 1024 | ✅ 1024 | ✅ 1024       | ✅ 1024   |
| RTX 5090 32GB    | ✅ 1024 | ✅ 1024 | ✅ 1536       | ✅ 1536   |
| A100 40GB        | ✅ 1024 | ✅ 1024 | ✅ 1024       | ✅ 1024   |
| A100 80GB        | ✅ 2048 | ✅ 2048 | ✅ 1536       | ✅ 1536   |

\*With CPU offloading or a reduced batch size

### Video generation by GPU

| GPU           | SVD    | AnimateDiff | Wan2.1  | Hunyuan  |
| ------------- | ------ | ----------- | ------- | -------- |
| RTX 3060 12GB | ❌      | ⚠️ short    | ❌       | ❌        |
| RTX 3090 24GB | ✅ 2-4s | ✅           | ⚠️ 480p | ❌        |
| RTX 4090 24GB | ✅ 4s   | ✅           | ✅ 480p  | ⚠️ short |
| RTX 5090 32GB | ✅ 6s   | ✅           | ✅ 720p  | ✅ 5s     |
| A100 40GB     | ✅ 4s   | ✅           | ✅ 720p  | ✅ 5s     |
| A100 80GB     | ✅ 8s   | ✅           | ✅ 720p  | ✅ 10s    |

***

## Quantization guide

### What is quantization?

Quantization lowers the numerical precision of a model's weights so that the model fits in less VRAM:

| Format   | Bits | VRAM reduction | Quality loss |
| -------- | ---- | -------------- | ------------ |
| FP32     | 32   | Baseline       | None         |
| FP16     | 16   | 50%            | Minimal      |
| BF16     | 16   | 50%            | Minimal      |
| FP8      | 8    | 75%            | Low          |
| Q8       | 8    | 75%            | Low          |
| Q6\_K    | 6    | 81%            | Low          |
| Q5\_K\_M | 5    | 84%            | Moderate     |
| Q4\_K\_M | 4    | 87%            | Moderate     |
| Q3\_K\_M | 3    | 91%            | Noticeable   |
| Q2\_K    | 2    | 94%            | Significant  |
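
The reduction column follows from the bit width alone. As a minimal sketch (the function name `vram_reduction_pct` is illustrative, and it uses the nominal bit widths from the table; real GGUF K-quants store somewhat more bits per weight on average), the saving relative to FP32 can be computed like this:

```python
def vram_reduction_pct(bits: int, baseline_bits: int = 32) -> float:
    """Percentage of weight memory saved relative to the baseline precision.

    Assumes memory scales linearly with the nominal bits per weight.
    """
    return (1 - bits / baseline_bits) * 100


# Compare a few formats from the table against the FP32 baseline.
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4_K_M", 4)]:
    print(f"{name}: {vram_reduction_pct(bits):.1f}% smaller than FP32")
```

The table's percentages are these values rounded to whole numbers.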

### VRAM calculator

**Formula:** `VRAM (GB) ≈ Parameters (B) × Bytes per parameter`

| Model size | FP16   | Q8    | Q4     |
| ---------- | ------ | ----- | ------ |
| 1B         | 2 GB   | 1 GB  | 0.5 GB |
| 3B         | 6 GB   | 3 GB  | 1.5 GB |
| 7B         | 14 GB  | 7 GB  | 3.5 GB |
| 8B         | 16 GB  | 8 GB  | 4 GB   |
| 13B        | 26 GB  | 13 GB | 6.5 GB |
| 14B        | 28 GB  | 14 GB | 7 GB   |
| 30B        | 60 GB  | 30 GB | 15 GB  |
| 32B        | 64 GB  | 32 GB | 16 GB  |
| 70B        | 140 GB | 70 GB | 35 GB  |
| 72B        | 144 GB | 72 GB | 36 GB  |

\*These figures cover the weights only; add \~20% for the KV cache and runtime overhead
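
The formula and the \~20% allowance can be combined into one estimate. The sketch below is illustrative only: the names `BYTES_PER_PARAM` and `estimate_vram_gb` are made up for this example, and the bytes-per-parameter values are the nominal ones implied by the table (2 for FP16, 1 for Q8, 0.5 for Q4).

```python
# Nominal bytes per parameter for each format, as implied by the table above.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def estimate_vram_gb(params_billions: float, fmt: str, overhead: float = 0.20) -> float:
    """Rough VRAM estimate in GB: weight size plus a fractional overhead.

    `overhead` defaults to the ~20% allowance for KV cache and runtime buffers.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[fmt]
    return weights_gb * (1 + overhead)


# Example: an 8B model at Q4 and a 70B model at FP16.
for size, fmt in [(8, "Q4"), (70, "FP16")]:
    print(f"{size}B {fmt}: about {estimate_vram_gb(size, fmt):.1f} GB")
```

Treat the result as a lower bound for planning: long contexts and large batches can push real usage well above a flat 20%.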

### Recommended quantization by use case

| Use case         | Recommended | Why                               |
| ---------------- | ----------- | --------------------------------- |
| Chat/General     | Q4\_K\_M    | Good balance of speed and quality |
| Coding           | Q5\_K\_M+   | Better accuracy for code          |
| Creative writing | Q4\_K\_M    | Speed matters more                |
| Analysis         | Q6\_K+      | Higher precision needed           |
| Production       | FP16/BF16   | Maximum quality                   |

***

## Context length vs. VRAM

### How context affects VRAM

Every model has a context window (the maximum number of tokens it can attend to). A longer context needs more VRAM, because the KV cache grows with the number of tokens:

| Model        | Default context | Max context | VRAM per 1K tokens |
| ------------ | --------------- | ----------- | ------------------ |
| Llama 3 8B   | 8K              | 128K        | \~0.3 GB           |
| Llama 3 70B  | 8K              | 128K        | \~0.5 GB           |
| Qwen 2.5 7B  | 8K              | 128K        | \~0.25 GB          |
| Mistral 7B   | 8K              | 32K         | \~0.25 GB          |
| Mixtral 8x7B | 32K             | 32K         | \~0.4 GB           |
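
To turn the per-1K figures into a budget, multiply by the context length, or divide the VRAM left after loading the weights. This is a rough sketch under stated assumptions: the function names are hypothetical, the per-1K values are copied from the table above, "1K" is taken as 1,000 tokens, and the cost is assumed to grow linearly with context.

```python
# Approximate VRAM cost per 1K tokens of context, copied from the table above.
GB_PER_1K_TOKENS = {
    "Llama 3 8B": 0.3,
    "Llama 3 70B": 0.5,
    "Qwen 2.5 7B": 0.25,
    "Mistral 7B": 0.25,
    "Mixtral 8x7B": 0.4,
}


def context_vram_gb(model: str, context_tokens: int) -> float:
    """VRAM needed for the context alone, assuming linear growth."""
    return GB_PER_1K_TOKENS[model] * context_tokens / 1000


def max_context_tokens(model: str, gpu_vram_gb: float, weights_gb: float) -> int:
    """Largest context that fits in the VRAM left after the weights are loaded."""
    free_gb = gpu_vram_gb - weights_gb
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb / GB_PER_1K_TOKENS[model] * 1000)


# Example: Llama 3 8B at Q4 (about 4 GB of weights) on a 24 GB card.
print(f"8K context: about {context_vram_gb('Llama 3 8B', 8000):.1f} GB")
print(f"Upper bound on a 24 GB GPU: {max_context_tokens('Llama 3 8B', 24, 4)} tokens")
```

The upper bound leaves no headroom for activations or fragmentation, so a comfortable setting sits noticeably below it.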

### Context by GPU (Llama 3 8B Q4)

| GPU           | Comfortable context  | Maximum context  |
| ------------- | -------------------- | ---------------- |
| RTX 3060 12GB | 16K                  | 32K              |
| RTX 3090 24GB | 64K                  | 96K              |
| RTX 4090 24GB | 64K                  | 96K              |
| RTX 5090 32GB | 96K                  | 128K             |
| A100 40GB     | 96K                  | 128K             |
| A100 80GB     | 128K                 | 128K             |

***

## Multi-GPU configurations

### Tensor parallelism

Tensor parallelism splits a single model across multiple GPUs so that their VRAM is pooled:

| Configuration | Total VRAM | Max model (FP16) |
| ------------- | ---------- | ---------------- |
| 2x RTX 3090   | 48 GB      | 30B              |
| 2x RTX 4090   | 48 GB      | 30B              |
| 2x RTX 5090   | 64 GB      | 32B              |
| 4x RTX 5090   | 128 GB     | 70B              |
| 2x A100 40GB  | 80 GB      | 70B              |
| 4x A100 40GB  | 160 GB     | 100B+            |
| 8x A100 80GB  | 640 GB     | DeepSeek-V3      |

### vLLM Multi-GPU

```bash
# 2 GPUs: shard the model across two devices
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

# 4 GPUs: shard the model across four devices
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4
```
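
Choosing a value for `--tensor-parallel-size` comes down to pooling enough VRAM for the model. The helper below is a hypothetical sketch, not part of vLLM: it only tries powers of two, on the assumption that the size must divide the model's attention-head count evenly, and it ignores per-GPU overhead and interconnect speed.

```python
from typing import Optional


def min_tensor_parallel_size(
    model_vram_gb: float, gpu_vram_gb: float, max_gpus: int = 8
) -> Optional[int]:
    """Smallest power-of-two GPU count whose pooled VRAM covers the model.

    Returns None when even `max_gpus` devices are not enough.
    """
    size = 1
    while size <= max_gpus:
        if size * gpu_vram_gb >= model_vram_gb:
            return size
        size *= 2  # assumption: only 1, 2, 4, 8, ... are considered
    return None


# Example: a model estimated at 42 GB (for instance 70B at Q4 plus ~20%) on 24 GB cards.
print(min_tensor_parallel_size(42, 24))
# Example: a model estimated at 168 GB (70B at FP16 plus ~20%) on 80 GB cards.
print(min_tensor_parallel_size(168, 80))
```

Pass the result as `--tensor-parallel-size`, then confirm with a real launch, since actual memory use depends on context length and batch size.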

***

## Model-specific guides

### Llama 3.x family

| Variant        | Parameters | Min GPU      | Recommended configuration |
| -------------- | ---------- | ------------ | ------------------------- |
| Llama 3.2 1B   | 1B         | Any 4 GB GPU | RTX 3060                  |
| Llama 3.2 3B   | 3B         | Any 6 GB GPU | RTX 3060                  |
| Llama 3.1 8B   | 8B         | RTX 3060     | RTX 3090                  |
| Llama 3.1 70B  | 70B        | A100 40GB    | 2x A100 40GB              |
| Llama 3.1 405B | 405B       | 8x A100 80GB | 8x H100                   |

### Mistral/Mixtral family

| Variant       | Parameters | Min GPU      | Recommended configuration |
| ------------- | ---------- | ------------ | ------------------------- |
| Mistral 7B    | 7B         | RTX 3060     | RTX 3090                  |
| Mixtral 8x7B  | 46.7B      | RTX 3090     | A100 40GB                 |
| Mixtral 8x22B | 141B       | 2x A100 80GB | 4x A100 80GB              |

### Qwen 2.5 family

| Variant       | Parameters | Min GPU      | Recommended configuration |
| ------------- | ---------- | ------------ | ------------------------- |
| Qwen 2.5 0.5B | 0.5B       | Any 2 GB GPU | Any 4 GB GPU              |
| Qwen 2.5 1.5B | 1.5B       | Any 4 GB GPU | RTX 3060                  |
| Qwen 2.5 3B   | 3B         | Any 6 GB GPU | RTX 3060                  |
| Qwen 2.5 7B   | 7B         | RTX 3060     | RTX 3090                  |
| Qwen 2.5 14B  | 14B        | RTX 3090     | RTX 4090                  |
| Qwen 2.5 32B  | 32B        | RTX 4090     | A100 40GB                 |
| Qwen 2.5 72B  | 72B        | A100 40GB    | A100 80GB                 |

### DeepSeek models

| Variant                          | Parameters | Min GPU           | Recommended configuration |
| -------------------------------- | ---------- | ----------------- | ------------------------- |
| DeepSeek-Coder 6.7B              | 6.7B       | RTX 3060          | RTX 3090                  |
| DeepSeek-Coder 33B               | 33B        | RTX 4090          | A100 40GB                 |
| DeepSeek-V2-Lite                 | 15.7B      | RTX 3090          | A100 40GB                 |
| DeepSeek-V3                      | 671B       | 8x A100 80GB      | 8x H100                   |
| **DeepSeek-R1**                  | **671B**   | **8x A100 80GB**  | **8x H100 (FP8)**         |
| **DeepSeek-R1-Distill-Qwen-32B** | **32B**    | **RTX 5090 32GB** | **2x A100 40GB**          |
| **DeepSeek-R1-Distill-Qwen-7B**  | **7B**     | **RTX 3090 24GB** | **RTX 4090**              |

***

## Troubleshooting

### "CUDA out of memory"

1. **Use a lower-bit quantization:** Q8 → Q4
2. **Reduce the context length:** lower max\_tokens
3. **Enable CPU offloading:** `--cpu-offload` or `enable_model_cpu_offload()`
4. **Use a smaller batch:** batch\_size=1
5. **Try a different GPU:** pick one with more VRAM
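
Before applying any of these steps, it helps to see how much VRAM is actually free. A minimal sketch, assuming `nvidia-smi` is on the PATH of the rented machine; the function names are illustrative, and the sample string stands in for real output so the snippet runs without a GPU:

```python
import subprocess

# One CSV row per GPU, values in MiB, without header or unit suffixes.
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Parse rows such as '0, 1024, 24576' into dicts with free memory added."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used_mib, total_mib = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {
                "index": index,
                "used_mib": used_mib,
                "total_mib": total_mib,
                "free_mib": total_mib - used_mib,
            }
        )
    return gpus


def query_gpu_memory() -> list:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(NVIDIA_SMI_QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Made-up sample output for two 24 GB GPUs; call query_gpu_memory() on a real server.
sample = "0, 1024, 24576\n1, 0, 24576\n"
for gpu in parse_gpu_memory(sample):
    print(f"GPU {gpu['index']}: {gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```

If the free memory is already far below the total before the model loads, another process is holding VRAM, and stopping it is the first thing to try.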

### "Model too large"

1. **Use a quantized version:** Q4 GGUF models
2. **Use multiple GPUs:** tensor parallelism
3. **Offload to the CPU:** slower, but it works
4. **Choose a smaller model:** 7B instead of 13B

### "Slow generation"

1. **Upgrade the GPU:** more VRAM means less offloading
2. **Use a faster quantization:** Q4 is faster than Q8
3. **Reduce the context:** shorter contexts generate faster
4. **Enable flash attention:** `--flash-attn`

## Next steps

* [GPU Comparison Guide](/guides/guides_v2-fr/prise-en-main/gpu-comparison.md) - Detailed GPU specifications
* [Docker Image Catalog](/guides/guides_v2-fr/prise-en-main/docker-images.md) - Ready-to-deploy images
* [Quickstart Guide](/guides/guides_v2-fr/quickstart.md) - Get started in 5 minutes

