Mistral Large 3 (675B MoE)
Run Mistral Large 3 — a 675B MoE frontier model with 41B active parameters on Clore.ai GPUs
Mistral Large 3 is Mistral AI's most powerful open-weight model, released in December 2025 under the Apache 2.0 license. It's a Mixture-of-Experts (MoE) model with 675B total parameters but only 41B active per token — delivering frontier-class performance at a fraction of the compute of a dense 675B model. With native multimodal support (text + images), a 256K context window, and best-in-class agentic capabilities, it competes directly with GPT-4o and Claude-class models while being fully self-hostable.
HuggingFace: mistralai/Mistral-Large-3-675B-Instruct-2512 Ollama: mistral-large-3:675b License: Apache 2.0
Key Features
675B total / 41B active parameters — MoE efficiency means you get frontier performance without activating every parameter
Apache 2.0 license — fully open for commercial and personal use, no restrictions
Natively multimodal — understands both text and images via a 2.5B vision encoder
256K context window — handles massive documents, codebases, and long conversations
Best-in-class agentic capabilities — native function calling, JSON mode, tool use
Multiple deployment options — FP8 on H200/B200, NVFP4 on H100/A100, GGUF quantized for consumer GPUs
Model Architecture
| Spec | Value |
|---|---|
| Architecture | Granular Mixture-of-Experts (MoE) |
| Total Parameters | 675B |
| Active Parameters | 41B (per token) |
| Vision Encoder | 2.5B parameters |
| Context Window | 256K tokens |
| Training | 3,000× H200 GPUs |
| Release | December 2025 |
Requirements
| GPU | 4× RTX 4090 | 8× A100 80GB | 8× H100/H200 |
|---|---|---|---|
| VRAM | 4×24GB (96GB) | 8×80GB (640GB) | 8×80GB (640GB) |
| RAM | 128GB | 256GB | 256GB |
| Disk | 400GB | 700GB | 1.4TB |
| CUDA | 12.0+ | 12.0+ | 12.0+ |
Recommended Clore.ai setup:
Best value: 4× RTX 4090 (~$2–8/day) — run Q4 GGUF quantization via llama.cpp or Ollama
Production quality: 8× A100 80GB (~$16–32/day) — NVFP4 with full context via vLLM
Maximum performance: 8× H100 (~$24–48/day) — FP8, full 256K context
Quick Start with Ollama
The fastest way to run Mistral Large 3 on a multi-GPU Clore.ai instance:
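A minimal sketch of the flow, assuming the `mistral-large-3:675b` tag listed above, a fresh Ubuntu image (the install script is Ollama's standard one), and 96GB+ of combined VRAM:

```shell
# Install Ollama (skip if your Clore.ai image already ships it)
curl -fsSL https://ollama.com/install.sh | sh

# Large models load more reliably with a single parallel slot
export OLLAMA_NUM_PARALLEL=1

# Pull the GGUF-quantized weights (hundreds of GB -- check disk space first)
ollama pull mistral-large-3:675b

# One-shot prompt to confirm the model is up
ollama run mistral-large-3:675b "Summarize the tradeoffs of MoE vs dense models."
```

Ollama splits the quantized weights across all visible GPUs automatically, so no extra parallelism flags are needed on a 4× RTX 4090 instance.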
Quick Start with vLLM (Production)
For production-grade serving with OpenAI-compatible API:
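A sketch of an 8× A100 deployment using the NVFP4 checkpoint; the three `mistral`-format flags are the ones the Troubleshooting section calls out as required, while `--tensor-parallel-size 8` and the 64K context cap are assumptions to match the hardware tier above:

```shell
# Download the NVFP4 checkpoint (sized for 8x A100 80GB)
huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --local-dir ./mistral-large-3-nvfp4

# Serve an OpenAI-compatible API on port 8000
vllm serve ./mistral-large-3-nvfp4 \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral
```

Raise `--max-model-len` toward 262144 only if your KV-cache budget allows it; 65536 keeps memory headroom on 640GB of VRAM.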
Usage Examples
1. Chat Completion (OpenAI-Compatible API)
Once vLLM is running, use any OpenAI-compatible client:
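A minimal sketch using only the Python standard library, assuming vLLM is listening on `localhost:8000` and serving the model under its HuggingFace name (the helper names here are illustrative, not part of any API):

```python
import json
from urllib import request

# Assumptions: adjust the endpoint and served model name to your deployment.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "mistralai/Mistral-Large-3-675B-Instruct-2512"

def build_chat_payload(prompt, system="You are a helpful assistant."):
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
        "max_tokens": 512,
    }

def chat(prompt):
    """POST the payload and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode()
    req = request.Request(API_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any OpenAI SDK works the same way: point its `base_url` at `http://localhost:8000/v1` and use a dummy API key.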
2. Function Calling / Tool Use
Mistral Large 3 excels at structured tool calling:
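A sketch of the request/response shapes, assuming the same local vLLM endpoint; `get_weather` is a hypothetical tool invented for illustration, declared in the standard OpenAI function-calling schema:

```python
import json

MODEL = "mistralai/Mistral-Large-3-675B-Instruct-2512"  # assumed served name

# Hypothetical example tool, described in OpenAI's function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

def build_tool_call_request(prompt):
    """Request body letting the model decide whether to call the tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }

def extract_tool_calls(response):
    """Pull (name, parsed-arguments) pairs out of a chat completion response."""
    calls = response["choices"][0]["message"].get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]
```

After executing a tool locally, append its result as a `{"role": "tool", ...}` message and call the endpoint again so the model can compose the final answer.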
3. Vision — Image Analysis
Mistral Large 3 natively understands images:
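A sketch of how to embed a local image in a request, assuming the OpenAI-style multimodal message format that vLLM exposes (the helper name is illustrative):

```python
import base64

MODEL = "mistralai/Mistral-Large-3-675B-Instruct-2512"  # assumed served name

def build_vision_payload(image_bytes, question, mime="image/jpeg"):
    """Embed an image as a base64 data URI in a multimodal user message.

    Per the troubleshooting note below, images close to a 1:1 aspect
    ratio give the best results.
    """
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
        "max_tokens": 512,
    }
```

Read the file with `open("photo.jpg", "rb").read()`, build the payload, and POST it to the same `/v1/chat/completions` endpoint as a text-only request.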
Tips for Clore.ai Users
Start with NVFP4 on A100s — The `Mistral-Large-3-675B-Instruct-2512-NVFP4` checkpoint is specifically designed for A100/H100 nodes and offers near-lossless quality at half the memory footprint of FP8.
Use Ollama for quick experiments — If you have a 4× RTX 4090 instance, Ollama handles GGUF quantization automatically. Perfect for testing before committing to a vLLM production setup.
Expose the API securely — When running vLLM on a Clore.ai instance, use SSH tunneling (`ssh -L 8000:localhost:8000 root@<ip>`) rather than exposing port 8000 directly.
Lower `max-model-len` to save VRAM — If you don't need the full 256K context, set `--max-model-len 32768` or `65536` to significantly reduce KV-cache memory usage.
Consider the dense alternatives — For single-GPU setups, Mistral 3 14B (`mistral3:14b` in Ollama) delivers excellent performance on a single RTX 4090 and is from the same model family.
Troubleshooting
| Problem | Solution |
|---|---|
| CUDA out of memory on vLLM | Reduce `--max-model-len` (try 32768), increase `--tensor-parallel-size`, or use the NVFP4 checkpoint |
| Slow generation speed | Ensure `--tensor-parallel-size` matches your GPU count; enable speculative decoding with the Eagle checkpoint |
| Ollama fails to load 675B | Ensure you have 96GB+ VRAM across GPUs; Ollama needs `OLLAMA_NUM_PARALLEL=1` for large models |
| `tokenizer_mode mistral` errors | Pass all three flags: `--tokenizer-mode mistral --config-format mistral --load-format mistral` |
| Vision not working | Keep images close to a 1:1 aspect ratio; avoid very wide or thin images |
| Download too slow | Use `huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4` with `HF_TOKEN` set |
Further Reading
Mistral 3 Announcement Blog — Official launch post with benchmarks
HuggingFace Model Card — Deployment instructions and benchmark results
NVFP4 Quantized Version — Optimized for A100/H100
GGUF Quantized (Unsloth) — For llama.cpp and Ollama
vLLM Documentation — Production serving framework
Red Hat Day-0 Guide — Step-by-step vLLM deployment