# Mistral Large 3 (675B MoE)

Mistral Large 3 is Mistral AI's most powerful open-weight model, released in December 2025 under the **Apache 2.0 license**. It's a Mixture-of-Experts (MoE) model with 675B total parameters but only 41B active per token — delivering frontier-class performance at a fraction of the compute of a dense 675B model. With native multimodal support (text + images), a 256K context window, and best-in-class agentic capabilities, it competes directly with GPT-4o and Claude-class models while being fully self-hostable.

**HuggingFace:** [mistralai/Mistral-Large-3-675B-Instruct-2512](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512) **Ollama:** [mistral-large-3:675b](https://ollama.com/library/mistral-large-3) **License:** Apache 2.0

## Key Features

* **675B total / 41B active parameters** — MoE efficiency means you get frontier performance without activating every parameter
* **Apache 2.0 license** — fully open for commercial and personal use, no restrictions
* **Natively multimodal** — understands both text and images via a 2.5B vision encoder
* **256K context window** — handles massive documents, codebases, and long conversations
* **Best-in-class agentic capabilities** — native function calling, JSON mode, tool use
* **Multiple deployment options** — FP8 on H200/B200, NVFP4 on H100/A100, GGUF quantized for consumer GPUs

## Model Architecture

| Component         | Details                           |
| ----------------- | --------------------------------- |
| Architecture      | Granular Mixture-of-Experts (MoE) |
| Total Parameters  | 675B                              |
| Active Parameters | 41B (per token)                   |
| Vision Encoder    | 2.5B parameters                   |
| Context Window    | 256K tokens                       |
| Training          | 3,000 H200 GPUs                   |
| Release           | December 2025                     |
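
Two standard rules of thumb turn these numbers into rough hardware estimates: weights occupy roughly `params * bits / 8` bytes, and a forward pass costs about 2 FLOPs per *active* parameter per token. A back-of-the-envelope sketch (approximations only, not measured figures):

```python
# Back-of-the-envelope sizing for a 675B-total / 41B-active MoE model.
# Rule of thumb: weights need params * (bits / 8) bytes; a forward pass
# costs roughly 2 FLOPs per *active* parameter per generated token.

TOTAL_PARAMS = 675e9
ACTIVE_PARAMS = 41e9

def weight_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB at a given quantization width."""
    return params * bits_per_param / 8 / 1e9

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token (~2 * active params)."""
    return 2 * active_params

print(f"FP8 weights:   ~{weight_gb(TOTAL_PARAMS, 8):.0f} GB")    # ~675 GB
print(f"NVFP4 weights: ~{weight_gb(TOTAL_PARAMS, 4):.0f} GB")    # ~338 GB
print(f"Q4 weights:    ~{weight_gb(TOTAL_PARAMS, 4.5):.0f} GB")  # ~380 GB (Q4 formats carry scale metadata)
print(f"Compute:       ~{flops_per_token(ACTIVE_PARAMS) / 1e9:.0f} GFLOPs/token")  # ~82
```

This is why the MoE design matters: you pay full-model prices for memory (all 675B parameters must be resident) but only active-parameter prices for compute. KV cache and activations come on top of the weight figures.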

## Requirements

| Configuration | Budget (Q4 GGUF) | Standard (NVFP4) | Full (FP8)     |
| ------------- | ---------------- | ---------------- | -------------- |
| GPU           | 4× RTX 4090      | 8× A100 80GB     | 8× H100/H200   |
| VRAM          | 4×24GB (96GB)    | 8×80GB (640GB)   | 8×80GB (640GB) |
| RAM           | 128GB            | 256GB            | 256GB          |
| Disk          | 400GB            | 700GB            | 1.4TB          |
| CUDA          | 12.0+            | 12.0+            | 12.0+          |

**Recommended Clore.ai setup:**

* **Best value:** 4× RTX 4090 (\~$2–8/day) — run Q4 GGUF quantization via llama.cpp or Ollama
* **Production quality:** 8× A100 80GB (\~$16–32/day) — NVFP4 with full context via vLLM
* **Maximum performance:** 8× H100 (\~$24–48/day) — FP8, full 256K context

## Quick Start with Ollama

The fastest way to run Mistral Large 3 on a multi-GPU Clore.ai instance:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run the 675B model (requires multi-GPU, ~96GB+ VRAM for Q4)
ollama run mistral-large-3:675b

# For the smaller dense variants (single GPU):
ollama run mistral3:14b    # 14B dense — fits on RTX 3060+
ollama run mistral3:8b     # 8B dense — fits on any GPU
```
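
Ollama also exposes an OpenAI-compatible HTTP API (on port 11434 by default), so you can script against the model without extra dependencies. A minimal smoke test using only the Python standard library; the model tag matches the pull command above, so adjust host, port, or tag if your setup differs:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint (default host/port).
URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "mistral-large-3:675b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Wrap the payload as a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_request(URL, payload)) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
```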

## Quick Start with vLLM (Production)

For production-grade serving with OpenAI-compatible API:

```bash
# Install vLLM
pip install vllm

# Serve with NVFP4 quantization on 8× A100/H100
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
    --tensor-parallel-size 8 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --host 0.0.0.0 \
    --port 8000

# For FP8 (original weights, highest quality):
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
    --tensor-parallel-size 8 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --max-model-len 131072 \
    --host 0.0.0.0 \
    --port 8000
```

## Usage Examples

### 1. Chat Completion (OpenAI-Compatible API)

Once vLLM is running, use any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python async web scraper using aiohttp and BeautifulSoup."}
    ],
    temperature=0.1,
    max_tokens=4096
)

print(response.choices[0].message.content)
```

### 2. Function Calling / Tool Use

Mistral Large 3 excels at structured tool calling:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="n/a")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Args: {tool_call.function.arguments}")
```
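
That snippet only extracts the call. In a full agent loop you execute the function yourself, then send the result back as a `tool` message (keyed by `tool_call_id`) so the model can compose its final answer. A sketch of the messages to append before the follow-up request; `fetch_weather` and its return value are placeholders:

```python
import json

def fetch_weather(location: str, unit: str = "celsius") -> dict:
    """Placeholder tool implementation -- swap in a real weather API."""
    return {"location": location, "temp": 21, "unit": unit}

def tool_result_messages(tool_call: dict) -> list[dict]:
    """Run the requested tool and build the two messages to append
    before the follow-up request: the assistant's tool call, then a
    'tool' message carrying the result, keyed by tool_call_id."""
    args = json.loads(tool_call["function"]["arguments"])
    result = fetch_weather(**args)
    return [
        {"role": "assistant", "tool_calls": [tool_call]},
        {
            "role": "tool",
            "tool_call_id": tool_call["id"],
            "name": tool_call["function"]["name"],
            "content": json.dumps(result),
        },
    ]

# Shape of the tool call the model returns (shown here as a plain dict;
# the OpenAI client gives you an equivalent object):
call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
}
followup = tool_result_messages(call)
# Append `followup` to the original messages list and call
# client.chat.completions.create(...) again (with the same `tools`)
# to get the final natural-language answer.
```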

### 3. Vision — Image Analysis

Mistral Large 3 natively understands images:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="n/a")

# Encode image
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this architecture diagram in detail."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }],
    max_tokens=2048
)

print(response.choices[0].message.content)
```
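
The data-URL boilerplate is easy to get wrong for formats other than PNG (a JPEG needs `image/jpeg`, and so on). A small helper that infers the MIME type from the file extension, standard library only:

```python
import base64
import mimetypes

def image_data_url(path: str) -> str:
    """Read an image file and wrap it as a base64 data URL,
    guessing the MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"
```

Pass the returned string directly as the `url` value in the `image_url` content part above.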

## Tips for Clore.ai Users

1. **Start with NVFP4 on A100s** — The `Mistral-Large-3-675B-Instruct-2512-NVFP4` checkpoint is specifically designed for A100/H100 nodes and offers near-lossless quality at half the memory footprint of FP8.
2. **Use Ollama for quick experiments** — If you have a 4× RTX 4090 instance, Ollama handles GGUF quantization automatically. Perfect for testing before committing to a vLLM production setup.
3. **Expose the API securely** — When running vLLM on a Clore.ai instance, use SSH tunneling (`ssh -L 8000:localhost:8000 root@<ip>`) rather than exposing port 8000 directly.
4. **Lower `max-model-len` to save VRAM** — If you don't need the full 256K context, set `--max-model-len 32768` or `65536` to significantly reduce KV-cache memory usage.
5. **Consider the dense alternatives** — For single-GPU setups, Mistral 3 14B (`mistral3:14b` in Ollama) delivers excellent performance on a single RTX 4090 and is from the same model family.
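
Tip 4 comes down to arithmetic: KV-cache memory grows linearly with context length. The formula below is the standard one for grouped-query-attention transformers, but note that the layer, head, and dimension values are placeholders, since those model-card details are not reproduced here:

```python
# Why tip 4 works: KV-cache memory grows linearly with context length.
# Standard GQA formula; the layer/head/dim defaults below are PLACEHOLDERS,
# not published Mistral Large 3 hyperparameters.

def kv_cache_gb(seq_len: int, layers: int = 60, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV-cache size in GB: 2 (K and V) * layers * kv_heads
    * head_dim * element size * tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len / 1e9

for ctx in (32_768, 65_536, 262_144):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB per sequence")
```

Halving `--max-model-len` halves the per-sequence KV-cache budget, which is often the difference between fitting and OOM on a marginal GPU configuration.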

## Troubleshooting

| Issue                           | Solution                                                                                                  |
| ------------------------------- | --------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory` on vLLM    | Reduce `--max-model-len` (try 32768), increase `--tensor-parallel-size`, or switch to the NVFP4 checkpoint |
| Slow generation speed           | Ensure `--tensor-parallel-size` matches your GPU count; enable speculative decoding with an EAGLE draft checkpoint |
| Ollama fails to load 675B       | Ensure you have 96GB+ VRAM across GPUs; Ollama needs `OLLAMA_NUM_PARALLEL=1` for large models             |
| `tokenizer_mode mistral` errors | You must pass all three flags: `--tokenizer-mode mistral --config-format mistral --load-format mistral`   |
| Vision not working              | Ensure images are close to 1:1 aspect ratio; avoid very wide/thin images for best results                 |
| Download too slow               | Use `huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4` with `HF_HUB_ENABLE_HF_TRANSFER=1` set for accelerated transfers |

## Further Reading

* [Mistral 3 Announcement Blog](https://mistral.ai/news/mistral-3) — Official launch post with benchmarks
* [HuggingFace Model Card](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512) — Deployment instructions and benchmark results
* [NVFP4 Quantized Version](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4) — Optimized for A100/H100
* [GGUF Quantized (Unsloth)](https://huggingface.co/unsloth/Mistral-Large-3-675B-Instruct-2512-GGUF) — For llama.cpp and Ollama
* [vLLM Documentation](https://docs.vllm.ai/) — Production serving framework
* [Red Hat Day-0 Guide](https://developers.redhat.com/articles/2025/12/02/run-mistral-large-3-ministral-3-vllm-red-hat-ai) — Step-by-step vLLM deployment
