# Mistral Large 3 (675B MoE)

Mistral Large 3 is Mistral AI's most powerful open-weight model, released in December 2025 under the **Apache 2.0 license**. It's a Mixture-of-Experts (MoE) model with 675B total parameters but only 41B active per token — delivering frontier-class performance at a fraction of the compute of a dense 675B model. With native multimodal support (text + images), a 256K context window, and best-in-class agentic capabilities, it competes directly with GPT-4o and Claude-class models while being fully self-hostable.

**HuggingFace:** [mistralai/Mistral-Large-3-675B-Instruct-2512](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512) **Ollama:** [mistral-large-3:675b](https://ollama.com/library/mistral-large-3) **License:** Apache 2.0

## Key Features

* **675B total / 41B active parameters** — MoE efficiency means you get frontier performance without activating every parameter
* **Apache 2.0 license** — fully open for commercial and personal use, no restrictions
* **Natively multimodal** — understands both text and images via a 2.5B vision encoder
* **256K context window** — handles massive documents, codebases, and long conversations
* **Best-in-class agentic capabilities** — native function calling, JSON mode, tool use (see the JSON-mode sketch after this list)
* **Multiple deployment options** — FP8 on H200/B200, NVFP4 on H100/A100, GGUF quantized for consumer GPUs
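
Of these, JSON mode is the only capability not demonstrated later in this guide, so here is a minimal sketch. It assumes an OpenAI-compatible server is already running (see the vLLM quick start below) and that the server honours the standard `response_format` parameter:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server (see the vLLM quick start below).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[{
        "role": "user",
        "content": "Return the three largest EU countries by population as JSON with keys 'country' and 'population'."
    }],
    response_format={"type": "json_object"},  # constrain the output to valid JSON
)
print(response.choices[0].message.content)
```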

## Model Architecture

| Component         | Details                           |
| ----------------- | --------------------------------- |
| Architecture      | Granular Mixture-of-Experts (MoE) |
| Total Parameters  | 675B                              |
| Active Parameters | 41B (per token)                   |
| Vision Encoder    | 2.5B parameters                   |
| Context Window    | 256K tokens                       |
| Training          | 3,000× H200 GPUs                  |
| Release           | December 2025                     |

## Requirements

| Configuration | Budget (Q4 GGUF) | Standard (NVFP4) | Full (FP8)     |
| ------------- | ---------------- | ---------------- | -------------- |
| GPU           | 4× RTX 4090      | 8× A100 80GB     | 8× H100/H200   |
| VRAM          | 4×24GB (96GB)    | 8×80GB (640GB)   | 8×80GB (640GB) |
| RAM           | 128GB            | 256GB            | 256GB          |
| Disk          | 400GB            | 700GB            | 1.4TB          |
| CUDA          | 12.0+            | 12.0+            | 12.0+          |

**Recommended Clore.ai setup:**

* **Best value:** 4× RTX 4090 (\~$2–8/day) — run Q4 GGUF quantization via llama.cpp or Ollama
* **Production quality:** 8× A100 80GB (\~$16–32/day) — NVFP4 via vLLM (the example below uses a 64K context; raise `--max-model-len` if VRAM allows)
* **Maximum performance:** 8× H100 (\~$24–48/day) — FP8 at the highest quality; the example below uses a 128K context, which can be raised toward the full 256K if memory permits

## Quick Start with Ollama

The fastest way to run Mistral Large 3 on a multi-GPU Clore.ai instance:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run the 675B model (requires multi-GPU, ~96GB+ VRAM for Q4)
ollama run mistral-large-3:675b

# For the smaller dense variants (single GPU):
ollama run mistral3:14b    # 14B dense — fits on RTX 3060+
ollama run mistral3:8b     # 8B dense — fits on any GPU
```
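
Ollama also exposes an OpenAI-compatible endpoint (port 11434 by default), so the model can be queried from Python without any extra serving layer. A minimal sketch, assuming the default Ollama configuration:

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API at /v1 on port 11434 by default.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="mistral-large-3:675b",  # or mistral3:14b / mistral3:8b for the dense variants
    messages=[{"role": "user", "content": "Summarize the trade-offs of MoE models in two sentences."}],
)
print(response.choices[0].message.content)
```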

## Quick Start with vLLM (Production)

For production-grade serving with OpenAI-compatible API:

```bash
# Install vLLM
pip install vllm

# Serve with NVFP4 quantization on 8× A100/H100
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
    --tensor-parallel-size 8 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --host 0.0.0.0 \
    --port 8000

# For FP8 (original weights, highest quality):
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
    --tensor-parallel-size 8 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --max-model-len 131072 \
    --host 0.0.0.0 \
    --port 8000
```
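
Before pointing clients at the server, it is worth confirming that it is up and serving the expected checkpoint. A quick smoke test against the standard `/v1/models` endpoint:

```python
from openai import OpenAI

# Lists the model IDs the vLLM server is actually serving.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model in client.models.list().data:
    print(model.id)
```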

## Usage Examples

### 1. Chat Completion (OpenAI-Compatible API)

Once vLLM is running, use any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python async web scraper using aiohttp and BeautifulSoup."}
    ],
    temperature=0.1,
    max_tokens=4096
)

print(response.choices[0].message.content)
```
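
For long generations, the same endpoint can stream tokens as they are produced. A minimal sketch using the standard `stream=True` flag (client setup as above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[{"role": "user", "content": "Explain tensor parallelism in three short paragraphs."}],
    stream=True,  # chunks arrive as the model generates them
    max_tokens=1024,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```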

### 2. Function Calling / Tool Use

Mistral Large 3 excels at structured tool calling:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="n/a")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Args: {tool_call.function.arguments}")
```
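
The tool call is only half of the agent loop: the tool's result is normally appended as a `tool` message and the model is asked to finish its answer. A sketch continuing the example above (`run_weather_lookup` is a hypothetical stand-in for a real weather API):

```python
def run_weather_lookup(location: str, unit: str = "celsius") -> dict:
    # Hypothetical local implementation of the declared tool.
    return {"location": location, "temperature": 21, "unit": unit}

# `arguments` arrives as a JSON string, hence the json import above.
args = json.loads(tool_call.function.arguments)
result = run_weather_lookup(**args)

followup = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)},
    ],
)
print(followup.choices[0].message.content)
```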

### 3. Vision — Image Analysis

Mistral Large 3 natively understands images:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="n/a")

# Encode image
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this architecture diagram in detail."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }],
    max_tokens=2048
)

print(response.choices[0].message.content)
```
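
Base64 embedding is the safe route for local files. If the serving host is allowed to fetch remote media, the same `image_url` field can usually carry a plain `https` URL instead (the URL below is illustrative; `client` is reused from the block above):

```python
response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            # Hypothetical URL; the server downloads the image itself instead of receiving base64 data.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```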

## Tips for Clore.ai Users

1. **Start with NVFP4 on A100s** — The `Mistral-Large-3-675B-Instruct-2512-NVFP4` checkpoint is specifically designed for A100/H100 nodes and offers near-lossless quality at half the memory footprint of FP8.
2. **Use Ollama for quick experiments** — If you have a 4× RTX 4090 instance, Ollama handles GGUF quantization automatically. Perfect for testing before committing to a vLLM production setup.
3. **Expose the API securely** — When running vLLM on a Clore.ai instance, use SSH tunneling (`ssh -L 8000:localhost:8000 root@<ip>`) rather than exposing port 8000 directly.
4. **Lower `max-model-len` to save VRAM** — If you don't need the full 256K context, set `--max-model-len 32768` or `65536` to significantly reduce KV-cache memory usage.
5. **Consider the dense alternatives** — For single-GPU setups, Mistral 3 14B (`mistral3:14b` in Ollama) delivers excellent performance on a single RTX 4090 and is from the same model family.

## Troubleshooting

| Issue                           | Solution                                                                                                  |
| ------------------------------- | --------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory` on vLLM    | Reduce `--max-model-len` (try 32768), increase `--tensor-parallel-size`, or use NVFP4 checkpoint          |
| Slow generation speed           | Ensure `--tensor-parallel-size` matches your GPU count; enable speculative decoding with Eagle checkpoint |
| Ollama fails to load 675B       | Ensure you have 96GB+ VRAM across GPUs; Ollama needs `OLLAMA_NUM_PARALLEL=1` for large models             |
| `tokenizer_mode mistral` errors | You must pass all three flags: `--tokenizer-mode mistral --config-format mistral --load-format mistral`   |
| Vision not working              | Ensure images are close to 1:1 aspect ratio; avoid very wide/thin images for best results                 |
| Download too slow               | Use `huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4` with `HF_TOKEN` set     |
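
For the slow-download case, the same checkpoint can also be fetched from Python with `huggingface_hub`, which is sometimes easier to script than the CLI. A sketch (set `HF_TOKEN` in the environment if the repository requires it):

```python
from huggingface_hub import snapshot_download

# Downloads the NVFP4 checkpoint into the local Hugging Face cache.
# Expect several hundred GB of data.
snapshot_download(
    repo_id="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    max_workers=8,  # parallel connections can speed things up on fast links
)
```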

## Further Reading

* [Mistral 3 Announcement Blog](https://mistral.ai/news/mistral-3) — Official launch post with benchmarks
* [HuggingFace Model Card](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512) — Deployment instructions and benchmark results
* [NVFP4 Quantized Version](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4) — Optimized for A100/H100
* [GGUF Quantized (Unsloth)](https://huggingface.co/unsloth/Mistral-Large-3-675B-Instruct-2512-GGUF) — For llama.cpp and Ollama
* [vLLM Documentation](https://docs.vllm.ai/) — Production serving framework
* [Red Hat Day-0 Guide](https://developers.redhat.com/articles/2025/12/02/run-mistral-large-3-ministral-3-vllm-red-hat-ai) — Step-by-step vLLM deployment


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/mistral-large3.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
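
For example, a minimal Python sketch of such a query (the question string is illustrative):

```python
import requests

# Ask the documentation page a follow-up question via the `ask` query parameter.
url = "https://docs.clore.ai/guides/language-models/mistral-large3.md"
resp = requests.get(
    url,
    params={"ask": "Which quantization should I use on 8x A100 80GB?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)
```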
