For the complete documentation index, see llms.txt. This page is also available as Markdown.

MiMo-V2.5-Pro (Xiaomi 1T MoE)

Deploy MiMo-V2.5-Pro (1.02T MoE, 42B active, 1M context) by Xiaomi on Clore.ai — the first open-weight Pro tier from the MiMo team, FP8 native, hybrid attention

Status (April 2026): MiMo-V2.5-Pro was released on April 27, 2026 by Xiaomi's AI division as the first open-weight model in their Pro tier — the previous MiMo-V2-Pro was API-only with no public weights. Weights live at huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro under the MIT license. The model card was last updated April 28, 2026, so deployment tooling, community quants, and reproductions are still landing day-by-day.

MiMo-V2.5-Pro is a 1.02-trillion parameter Mixture-of-Experts model that activates only ~42B parameters per token. The MiMo team — led by ex-DeepSeek researcher Luo Fuli — designed it around two ideas: a hybrid attention scheme that blends Sliding Window Attention (SWA) and Global Attention (GA) at a 6:1 ratio (~7× KV-cache reduction with a 128-token window), and 3 lightweight Multi-Token Prediction (MTP) modules that yield roughly 3× output speed on autoregressive workloads. The architecture has 70 layers (1 dense + 69 MoE), hidden size 6144, and ships natively in FP8 E4M3 mixed precision.

Two things matter for Clore.ai users. First, this is the first MiMo Pro release with public weights: previous Pro variants only existed as a hosted API and as the stealth-tested "Hunter Alpha" model on OpenRouter (March 2026 timeline). Second, the MIT license removes commercial restrictions outright — fine-tune, redistribute, run it as a paid endpoint, no caveats. Xiaomi's launch announcement claims V2.5-Pro beats DeepSeek V4 on agentic tasks, but that benchmark is vendor-published only — third-party reproduction has not landed yet, and you should not quote it externally without that caveat.

Key Specs

Property
Value

Total Parameters

1.02T (MoE)

Active Parameters

~42B per forward pass

Context Window

1,000,000 tokens (1M)

Precision

FP8 E4M3 mixed (native)

Architecture

Hybrid SWA + GA (6:1), 70 layers (1 dense + 69 MoE), hidden 6144

KV-Cache

Sliding window 128, ~7× reduction vs full GA

Speculative Decoding

3 lightweight MTP modules, ~3× output speed

License

MIT

Release Date

April 27, 2026

Organization

Xiaomi MiMo team (XiaomiMiMo on HuggingFace)

Primary Tooling

SGLang (first-class), vLLM

Why MiMo-V2.5-Pro?

  • First open Pro-tier MiMo — predecessor MiMo-V2-Pro was API-only, this is the first time the Pro weights are public

  • 1M-token context — handles entire codebases, long agent traces, or multi-document RAG without chunking

  • Hybrid attention — SWA + GA at 6:1 cuts KV-cache ~7× vs pure global attention; long contexts stay tractable

  • Native FP8 — no post-hoc quantization, weights ship in FP8 E4M3 directly from the vendor

  • MTP speculative decoding — 3 built-in MTP modules give ~3× decode throughput out of the box

  • MIT license — no commercial restrictions, no field-of-use limits

  • 42B active — you pay 42B-dense inference cost despite the 1.02T headline number

  • Lineage — lead researcher Luo Fuli was previously at DeepSeek, and the architectural choices show


Requirements

Component
Minimum (Quant + offload, future)
Recommended (FP8)
Full FP8, 1M ctx

GPU VRAM

~141GB (Q4 + RAM offload, when quants land)

8× H100 80GB (640GB)

8× H200 141GB (1,128GB)

RAM

256GB

512GB

512GB

Disk

700GB NVMe

1.5TB NVMe

2TB NVMe

CUDA

12.4+

12.6+

12.6+

Clore.ai pick: For full FP8 with breathing room on the 1M context, 8×H200 is the natural fit — see clore.ai/rent-h200.html. 8×H100 80GB also runs the FP8 checkpoint but you'll cap --context-length lower (typically 256K) to leave room for KV cache. For Blackwell-class hardware see clore.ai/rent-b200.html.


Option A — Ollama / GGUF (Quantized, community builds)


vLLM supports MiMo-V2.5-Pro via --trust-remote-code (the hybrid attention + MTP modules ship as custom code in the repo). Use the vendor sampling defaults: temperature 1.0, top_p 0.95.

On 8×H100 80GB, cap --max-model-len at 262144 (256K) to leave headroom for activations + KV cache. On 8×H200 141GB you can comfortably push to 524288 or higher; 1,048,576 (full 1M) is feasible but expect long prefill times — test before relying on it.


SGLang is the first-class serving target in the MiMo-V2.5-Pro model card. The vendor publishes the launch command with SGLANG_ENABLE_SPEC_V2=1 to activate the new MTP-aware speculative decoding path, which is where the ~3× decode speedup actually materializes.

For a multi-GPU TP setup on 8×H200, add --tp-size 8 and --mem-fraction-static 0.88. Confirm with nvidia-smi that all 8 cards are populated before sending real traffic — the 1M context is unforgiving if one rank is starved.


Clore.ai GPU Recommendations

Setup
VRAM
Expected Performance
Clore.ai Cost

4× H100 80GB

320GB

FP8 with heavy offload, max ctx ~64K, ~10–15 tok/s

~$25–35/day

8× H100 80GB

640GB

FP8 full, max ctx ~256K, ~30–45 tok/s

~$45–60/day

8× H200 141GB

1,128GB

FP8 full, max ctx 1M, ~60+ tok/s with MTP

~$80–110/day

8× B200

1,536GB

FP8 full, max ctx 1M, fastest available

marketplace pricing


Use Cases

  • Long-horizon agents — MiMo team explicitly tunes for sustained tool-calling. The 1M context plus MTP speedup means thousands of tool turns without chunking gymnastics.

  • Whole-codebase analysis — drop a 500K-token monorepo into context for refactor planning, dependency audits, or migration design

  • Long-document RAG — entire books, multi-year customer transcripts, or year-long chat histories fit in one prompt

  • Coding — vendor-claimed HumanEval+ 75.6% and the agentic posture make it a candidate for autonomous SWE workloads (pair with SWE-agent / OpenHands)

  • Research scratchpad — 1M context tolerates the kind of "paste the whole paper, paste the prior work, ask for synthesis" usage that smaller models truncate


Benchmarks

Benchmark
MiMo-V2.5-Pro (vendor)
Notes

GSM8K

99.6%

Math word problems

HumanEval+

75.6%

Coding (extended)

MMLU

89.4%

General knowledge

GraphWalks (1M ctx) BFS

0.37

Long-context graph traversal

GraphWalks (1M ctx) Parents

0.62

Long-context graph traversal

Agentic tasks vs DeepSeek V4

"outperforms" (vendor)

Unverified — third-party reproduction pending


Troubleshooting

Issue
Solution

OutOfMemoryError on load

Native FP8 still needs ~600GB+ VRAM. Use 8× H200 or drop --context-length to 65536 on 8× H100.

Slow HuggingFace download

huggingface-cli download XiaomiMiMo/MiMo-V2.5-Pro --local-dir ./weights --resume-download. Expect ~600GB FP8.

--trust-remote-code rejected

Hybrid attention and MTP ship as custom code in the repo. The flag is mandatory for both vLLM and SGLang.

MTP speedup not appearing in SGLang

Confirm SGLANG_ENABLE_SPEC_V2=1 is exported in the same shell as python3 -m sglang.launch_server. The default path does not activate MTP.

Reasoning trace flat / low quality

Use temperature=1.0 and top_p=0.95. Lower temps degrade MiMo's reasoning behavior.

1M context OOMs on 8× H100

8× H100 80GB cannot hold KV cache for 1M tokens. Cap at 256K or move to 8× H200.

Prefill takes minutes

Expected at 1M context. Use --enable-chunked-prefill (vLLM) or batch shorter requests for interactive workloads.

GGUF / Ollama pull fails

Community quants are not published as of April 28, 2026. Wait 1–2 weeks or use FP8 directly.


Next Steps

  • Predecessor / sibling: MiMo-V2-Flash — 309B MoE, 15B active, 32K ctx, faster but smaller

  • Vendor's claimed rival: DeepSeek V4 — 1M ctx, multimodal, ~1T params (the model Xiaomi says they beat on agentic tasks)

  • Open-weight coding rival: GLM-5.1 — 744B MoE, 40B active, MIT, currently #1 on SWE-Bench Pro

  • Clore.ai H200 rentals: clore.ai/rent-h200.html — best fit for full FP8 1T MoE at 1M context

  • Clore.ai marketplace: clore.ai/marketplace

Last updated

Was this helpful?