MiMo-V2.5-Pro (Xiaomi 1T MoE)
Deploy MiMo-V2.5-Pro (1.02T MoE, 42B active, 1M context) by Xiaomi on Clore.ai — the first open-weight Pro tier from the MiMo team, FP8 native, hybrid attention
Status (April 2026): MiMo-V2.5-Pro was released on April 27, 2026 by Xiaomi's AI division as the first open-weight model in their Pro tier — the previous MiMo-V2-Pro was API-only with no public weights. Weights live at huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro under the MIT license. The model card was last updated April 28, 2026, so deployment tooling, community quants, and reproductions are still landing day-by-day.
MiMo-V2.5-Pro is a 1.02-trillion parameter Mixture-of-Experts model that activates only ~42B parameters per token. The MiMo team — led by ex-DeepSeek researcher Luo Fuli — designed it around two ideas: a hybrid attention scheme that blends Sliding Window Attention (SWA) and Global Attention (GA) at a 6:1 ratio (~7× KV-cache reduction with a 128-token window), and 3 lightweight Multi-Token Prediction (MTP) modules that yield roughly 3× output speed on autoregressive workloads. The architecture has 70 layers (1 dense + 69 MoE), hidden size 6144, and ships natively in FP8 E4M3 mixed precision.
Two things matter for Clore.ai users. First, this is the first MiMo Pro release with public weights: previous Pro variants only existed as a hosted API and as the stealth-tested "Hunter Alpha" model on OpenRouter (March 2026 timeline). Second, the MIT license removes commercial restrictions outright — fine-tune, redistribute, run it as a paid endpoint, no caveats. Xiaomi's launch announcement claims V2.5-Pro beats DeepSeek V4 on agentic tasks, but that benchmark is vendor-published only — third-party reproduction has not landed yet, and you should not quote it externally without that caveat.
Key Specs
Total Parameters
1.02T (MoE)
Active Parameters
~42B per forward pass
Context Window
1,000,000 tokens (1M)
Precision
FP8 E4M3 mixed (native)
Architecture
Hybrid SWA + GA (6:1), 70 layers (1 dense + 69 MoE), hidden 6144
KV-Cache
Sliding window 128, ~7× reduction vs full GA
Speculative Decoding
3 lightweight MTP modules, ~3× output speed
License
MIT
Release Date
April 27, 2026
Organization
Xiaomi MiMo team (XiaomiMiMo on HuggingFace)
Primary Tooling
SGLang (first-class), vLLM
Why MiMo-V2.5-Pro?
First open Pro-tier MiMo — predecessor MiMo-V2-Pro was API-only, this is the first time the Pro weights are public
1M-token context — handles entire codebases, long agent traces, or multi-document RAG without chunking
Hybrid attention — SWA + GA at 6:1 cuts KV-cache ~7× vs pure global attention; long contexts stay tractable
Native FP8 — no post-hoc quantization, weights ship in FP8 E4M3 directly from the vendor
MTP speculative decoding — 3 built-in MTP modules give ~3× decode throughput out of the box
MIT license — no commercial restrictions, no field-of-use limits
42B active — you pay 42B-dense inference cost despite the 1.02T headline number
Lineage — lead researcher Luo Fuli was previously at DeepSeek, and the architectural choices show
Requirements
Still a 1T model. "42B active" sounds friendly, but the full 1.02T weights must live in VRAM (or be aggressively offloaded). Native FP8 weights need ~600GB+ VRAM before activation memory and KV cache. Plan for 8×H200 or larger for full-context FP8.
GPU VRAM
~141GB (Q4 + RAM offload, when quants land)
8× H100 80GB (640GB)
8× H200 141GB (1,128GB)
RAM
256GB
512GB
512GB
Disk
700GB NVMe
1.5TB NVMe
2TB NVMe
CUDA
12.4+
12.6+
12.6+
Clore.ai pick: For full FP8 with breathing room on the 1M context, 8×H200 is the natural fit — see clore.ai/rent-h200.html. 8×H100 80GB also runs the FP8 checkpoint but you'll cap --context-length lower (typically 256K) to leave room for KV cache. For Blackwell-class hardware see clore.ai/rent-b200.html.
Option A — Ollama / GGUF (Quantized, community builds)
Heads-up: As of April 28, 2026 (one day after release) community GGUF quants for MiMo-V2.5-Pro are not yet published. Expect Q4_K_M / Q5_K_M / Q6_K builds to appear within 1–2 weeks at huggingface.co/models?search=mimo-v2.5-pro+gguf. Until then, FP8 via SGLang or vLLM is the supported path.
Option B — vLLM (Production API, recommended)
vLLM supports MiMo-V2.5-Pro via --trust-remote-code (the hybrid attention + MTP modules ship as custom code in the repo). Use the vendor sampling defaults: temperature 1.0, top_p 0.95.
On 8×H100 80GB, cap --max-model-len at 262144 (256K) to leave headroom for activations + KV cache. On 8×H200 141GB you can comfortably push to 524288 or higher; 1,048,576 (full 1M) is feasible but expect long prefill times — test before relying on it.
Option C — SGLang (recommended for max throughput)
SGLang is the first-class serving target in the MiMo-V2.5-Pro model card. The vendor publishes the launch command with SGLANG_ENABLE_SPEC_V2=1 to activate the new MTP-aware speculative decoding path, which is where the ~3× decode speedup actually materializes.
For a multi-GPU TP setup on 8×H200, add --tp-size 8 and --mem-fraction-static 0.88. Confirm with nvidia-smi that all 8 cards are populated before sending real traffic — the 1M context is unforgiving if one rank is starved.
Clore.ai GPU Recommendations
4× H100 80GB
320GB
FP8 with heavy offload, max ctx ~64K, ~10–15 tok/s
~$25–35/day
8× H100 80GB
640GB
FP8 full, max ctx ~256K, ~30–45 tok/s
~$45–60/day
8× H200 141GB
1,128GB
FP8 full, max ctx 1M, ~60+ tok/s with MTP
~$80–110/day
8× B200
1,536GB
FP8 full, max ctx 1M, fastest available
marketplace pricing
Best value: 8× H200 141GB on the FP8 checkpoint with SGLANG_ENABLE_SPEC_V2=1. You get the full 1M context window, MTP speculative decoding, and enough KV-cache headroom for real agent loops. See clore.ai/rent-h200.html for live availability.
Use Cases
Long-horizon agents — MiMo team explicitly tunes for sustained tool-calling. The 1M context plus MTP speedup means thousands of tool turns without chunking gymnastics.
Whole-codebase analysis — drop a 500K-token monorepo into context for refactor planning, dependency audits, or migration design
Long-document RAG — entire books, multi-year customer transcripts, or year-long chat histories fit in one prompt
Coding — vendor-claimed HumanEval+ 75.6% and the agentic posture make it a candidate for autonomous SWE workloads (pair with SWE-agent / OpenHands)
Research scratchpad — 1M context tolerates the kind of "paste the whole paper, paste the prior work, ask for synthesis" usage that smaller models truncate
Benchmarks
Vendor-claimed — no third-party reproduction yet. All numbers below come from Xiaomi's April 27, 2026 announcement and the HuggingFace model card. The model is two days old at time of writing — independent reproductions on agentic and long-context benchmarks are still pending. The "beats DeepSeek V4 on agentic tasks" claim in particular is from Xiaomi's own write-up; treat it as marketing until reproduced.
GSM8K
99.6%
Math word problems
HumanEval+
75.6%
Coding (extended)
MMLU
89.4%
General knowledge
GraphWalks (1M ctx) BFS
0.37
Long-context graph traversal
GraphWalks (1M ctx) Parents
0.62
Long-context graph traversal
Agentic tasks vs DeepSeek V4
"outperforms" (vendor)
Unverified — third-party reproduction pending
Troubleshooting
OutOfMemoryError on load
Native FP8 still needs ~600GB+ VRAM. Use 8× H200 or drop --context-length to 65536 on 8× H100.
Slow HuggingFace download
huggingface-cli download XiaomiMiMo/MiMo-V2.5-Pro --local-dir ./weights --resume-download. Expect ~600GB FP8.
--trust-remote-code rejected
Hybrid attention and MTP ship as custom code in the repo. The flag is mandatory for both vLLM and SGLang.
MTP speedup not appearing in SGLang
Confirm SGLANG_ENABLE_SPEC_V2=1 is exported in the same shell as python3 -m sglang.launch_server. The default path does not activate MTP.
Reasoning trace flat / low quality
Use temperature=1.0 and top_p=0.95. Lower temps degrade MiMo's reasoning behavior.
1M context OOMs on 8× H100
8× H100 80GB cannot hold KV cache for 1M tokens. Cap at 256K or move to 8× H200.
Prefill takes minutes
Expected at 1M context. Use --enable-chunked-prefill (vLLM) or batch shorter requests for interactive workloads.
GGUF / Ollama pull fails
Community quants are not published as of April 28, 2026. Wait 1–2 weeks or use FP8 directly.
Next Steps
Predecessor / sibling: MiMo-V2-Flash — 309B MoE, 15B active, 32K ctx, faster but smaller
Vendor's claimed rival: DeepSeek V4 — 1M ctx, multimodal, ~1T params (the model Xiaomi says they beat on agentic tasks)
Open-weight coding rival: GLM-5.1 — 744B MoE, 40B active, MIT, currently #1 on SWE-Bench Pro
Clore.ai H200 rentals: clore.ai/rent-h200.html — best fit for full FP8 1T MoE at 1M context
Clore.ai marketplace: clore.ai/marketplace
Links
Last updated
Was this helpful?