# MiMo-V2.5-Pro (Xiaomi 1T MoE)

{% hint style="info" %}
**Status (April 2026):** MiMo-V2.5-Pro was released on **April 27, 2026** by Xiaomi's AI division as the first open-weight model in their **Pro** tier — the previous MiMo-V2-Pro was API-only with no public weights. Weights live at [huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro) under the **MIT license**. The model card was last updated April 28, 2026, so deployment tooling, community quants, and reproductions are still landing day-by-day.
{% endhint %}

MiMo-V2.5-Pro is a **1.02-trillion parameter Mixture-of-Experts** model that activates only **\~42B parameters per token**. The MiMo team — led by ex-DeepSeek researcher **Luo Fuli** — designed it around two ideas: a **hybrid attention scheme** that blends Sliding Window Attention (SWA) and Global Attention (GA) at a 6:1 ratio (\~7× KV-cache reduction with a 128-token window), and **3 lightweight Multi-Token Prediction (MTP) modules** that yield roughly **3× output speed** on autoregressive workloads. The architecture has 70 layers (1 dense + 69 MoE), hidden size 6144, and ships natively in **FP8 E4M3 mixed precision**.

Two things matter for Clore.ai users. First, this is the **first MiMo Pro release with public weights**: previous Pro variants existed only as a hosted API and as the stealth-tested "Hunter Alpha" model on OpenRouter in March 2026. Second, the **MIT license** places no restrictions on commercial use: you can fine-tune, redistribute, or run it as a paid endpoint, provided the copyright and license notice stay with the copies you distribute. Xiaomi's launch announcement claims V2.5-Pro **beats DeepSeek V4 on agentic tasks**, but that benchmark is vendor-published only. Third-party reproduction has not landed yet, so do not quote it externally without that caveat.

### Key Specs

| Property             | Value                                                            |
| -------------------- | ---------------------------------------------------------------- |
| Total Parameters     | 1.02T (MoE)                                                      |
| Active Parameters    | \~42B per forward pass                                           |
| Context Window       | 1,000,000 tokens (1M)                                            |
| Precision            | FP8 E4M3 mixed (native)                                          |
| Architecture         | Hybrid SWA + GA (6:1), 70 layers (1 dense + 69 MoE), hidden 6144 |
| KV-Cache             | Sliding window 128, \~7× reduction vs full GA                    |
| Speculative Decoding | 3 lightweight MTP modules, \~3× output speed                     |
| License              | MIT                                                              |
| Release Date         | April 27, 2026                                                   |
| Organization         | Xiaomi MiMo team (XiaomiMiMo on HuggingFace)                     |
| Primary Tooling      | SGLang (first-class), vLLM                                       |

### Why MiMo-V2.5-Pro?

* **First open Pro-tier MiMo** — predecessor MiMo-V2-Pro was API-only, this is the first time the Pro weights are public
* **1M-token context** — handles entire codebases, long agent traces, or multi-document RAG without chunking
* **Hybrid attention** — SWA + GA at 6:1 cuts KV-cache \~7× vs pure global attention; long contexts stay tractable
* **Native FP8** — no post-hoc quantization, weights ship in FP8 E4M3 directly from the vendor
* **MTP speculative decoding** — 3 built-in MTP modules give \~3× decode throughput out of the box
* **MIT license** — no commercial restrictions, no field-of-use limits
* **42B active** per token, so per-token compute is close to that of a 42B dense model even though all 1.02T weights must stay loaded
* **Lineage** — lead researcher Luo Fuli was previously at DeepSeek, and the architectural choices show

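The \~7× KV-cache figure follows from the 6:1 layer ratio. Below is a minimal sketch of the arithmetic, assuming each group of seven attention layers holds six SWA layers that cache at most 128 tokens and one GA layer that caches the full context, and that every layer stores the same number of bytes per cached token. Both are simplifying assumptions for illustration, not details taken from the model card.

```python
def kv_cache_reduction(context_tokens: int, window: int = 128, swa_per_ga: int = 6) -> float:
    """Ratio of full-global KV-cache size to hybrid SWA+GA KV-cache size."""
    layers_per_group = swa_per_ga + 1
    # Pure global attention: every layer caches the whole context.
    full_tokens = layers_per_group * context_tokens
    # Hybrid: SWA layers cache at most `window` tokens, the GA layer caches everything.
    hybrid_tokens = swa_per_ga * min(window, context_tokens) + context_tokens
    return full_tokens / hybrid_tokens


for ctx in (4_096, 262_144, 1_048_576):
    print(f"{ctx:>9} tokens: {kv_cache_reduction(ctx):.2f}x smaller KV cache")
```

The ratio only approaches 7 once the context is much longer than the window; at short contexts the saving is smaller.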
***

## Requirements

{% hint style="warning" %}
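**Still a 1T model.** "42B active" sounds friendly, but the full 1.02T weights must live in VRAM (or be aggressively offloaded). This guide budgets **\~600GB+ VRAM** for the native FP8 weights before activation memory and KV cache; because FP8 stores about one byte per parameter, 1.02T parameters could come closer to 1TB, so confirm the checkpoint size on the model card before renting. Plan for 8×H200 or larger for full-context FP8.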
{% endhint %}

| Component | Minimum (Quant + offload, future)            | Recommended (FP8)    | Full FP8, 1M ctx        |
| --------- | -------------------------------------------- | -------------------- | ----------------------- |
| GPU VRAM  | \~141GB (Q4 + RAM offload, when quants land) | 8× H100 80GB (640GB) | 8× H200 141GB (1,128GB) |
| RAM       | 256GB                                        | 512GB                | 512GB                   |
| Disk      | 700GB NVMe                                   | 1.5TB NVMe           | 2TB NVMe                |
| CUDA      | 12.4+                                        | 12.6+                | 12.6+                   |

**Clore.ai pick:** For full FP8 with breathing room on the 1M context, **8×H200** is the natural fit — see [clore.ai/rent-h200.html](https://clore.ai/rent-h200.html). 8×H100 80GB also runs the FP8 checkpoint but you'll cap `--context-length` lower (typically 256K) to leave room for KV cache. For Blackwell-class hardware see [clore.ai/rent-b200.html](https://clore.ai/rent-b200.html).

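To see why 8×H100 forces a lower context cap than 8×H200, subtract the weight footprint from the total VRAM. The sketch below takes the weight size as an input and uses this guide's \~600GB working figure as a placeholder; the helper name is illustrative, and the calculation ignores runtime overhead such as CUDA context and activation memory.

```python
def headroom_gb(gpus: int, vram_per_gpu_gb: float, weights_gb: float) -> float:
    """VRAM left for KV cache and activations once the weights are loaded."""
    return gpus * vram_per_gpu_gb - weights_gb


WEIGHTS_GB = 600  # placeholder from this guide; replace with the real checkpoint size

for name, per_gpu in (("8x H100 80GB", 80), ("8x H200 141GB", 141)):
    left = headroom_gb(8, per_gpu, WEIGHTS_GB)
    print(f"{name}: {left:.0f} GB left after weights")
```

A result near zero or below it means the configuration needs offload, a smaller context, or more GPUs.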
***

## Option A — Ollama / GGUF (Quantized, community builds)

{% hint style="warning" %}
**Heads-up:** As of April 28, 2026 (one day after release) **community GGUF quants for MiMo-V2.5-Pro are not yet published**. Expect Q4\_K\_M / Q5\_K\_M / Q6\_K builds to appear within 1–2 weeks at [huggingface.co/models?search=mimo-v2.5-pro+gguf](https://huggingface.co/models?search=mimo-v2.5-pro+gguf). Until then, FP8 via SGLang or vLLM is the supported path.
{% endhint %}

```bash
# Once a Q4_K_M build is available
docker exec ollama ollama pull mimo-v2.5-pro:q4_K_M
docker exec ollama ollama run mimo-v2.5-pro:q4_K_M

# Or with llama.cpp directly on a GGUF file (when published)
docker run --gpus all -it --rm -p 8080:8080 \
  -v $(pwd)/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/mimo-v2.5-pro-q4_k_m.gguf \
  --n-gpu-layers 99 --ctx-size 65536 \
  --port 8080 --host 0.0.0.0
```

***

## Option B — vLLM (Production API, recommended)

vLLM supports MiMo-V2.5-Pro via `--trust-remote-code` (the hybrid attention + MTP modules ship as custom code in the repo). Use the vendor sampling defaults: **temperature 1.0, top\_p 0.95**.

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model XiaomiMiMo/MiMo-V2.5-Pro
      --tensor-parallel-size 8
      --quantization fp8
      --max-model-len 262144
      --gpu-memory-utilization 0.90
      --trust-remote-code
      --served-model-name mimo-v2.5-pro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

```bash
# Test the API (vendor-recommended sampling)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mimo-v2.5-pro",
    "messages": [
      {"role": "system", "content": "You are an autonomous coding agent."},
      {"role": "user", "content": "Walk through this 30K-line monorepo and propose a migration plan from Express 4 to Fastify 5."}
    ],
    "max_tokens": 8192,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```

{% hint style="info" %}
On 8×H100 80GB, cap `--max-model-len` at 262144 (256K) to leave headroom for activations + KV cache. On 8×H200 141GB you can comfortably push to 524288 or higher; 1,048,576 (full 1M) is feasible but expect long prefill times — test before relying on it.
{% endhint %}

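The same test request can be sent from Python. This is a minimal standard-library sketch, assuming the vLLM container above is listening on `localhost:8000` with the served model name `mimo-v2.5-pro`. The helper names are illustrative, and the sampling values are the vendor defaults quoted earlier.

```python
import json
import urllib.request


def build_chat_payload(prompt: str,
                       system: str = "You are an autonomous coding agent.",
                       max_tokens: int = 8192) -> dict:
    """Build an OpenAI-style chat completion body with the vendor sampling defaults."""
    return {
        "model": "mimo-v2.5-pro",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
    }


def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Long prompts can take a while to prefill, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Summarize this repository's build system.")` returns the assistant message as a string once the server is up.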
***

## Option C — SGLang (recommended for max throughput)

SGLang is the **first-class serving target** in the MiMo-V2.5-Pro model card. The vendor publishes the launch command with **`SGLANG_ENABLE_SPEC_V2=1`** to activate the new MTP-aware speculative decoding path, which is where the \~3× decode speedup actually materializes.

```bash
docker pull lmsysorg/sglang:latest

# Verbatim from the HF model card (run inside the container, or on a host with SGLang installed)
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
    --model-path XiaomiMiMo/MiMo-V2.5-Pro \
    --trust-remote-code \
    --quantization fp8 \
    --context-length 1048576 \
    --host 0.0.0.0 --port 9001
```

For a multi-GPU TP setup on 8×H200, add `--tp-size 8` and `--mem-fraction-static 0.88`. Confirm with `nvidia-smi` that all 8 cards are populated before sending real traffic — the 1M context is unforgiving if one rank is starved.

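The `nvidia-smi` check can be scripted. The sketch below parses the CSV output of `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` and flags any card whose memory use falls below a threshold. The 50% threshold and the function names are assumptions for illustration, not SGLang defaults.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def underfilled_gpus(csv_text: str, min_fraction: float = 0.5) -> list[int]:
    """Return indices of GPUs using less than `min_fraction` of their memory."""
    starved = []
    for line in csv_text.strip().splitlines():
        index, used_mib, total_mib = (field.strip() for field in line.split(","))
        if float(used_mib) / float(total_mib) < min_fraction:
            starved.append(int(index))
    return starved


def check_local_gpus(expected: int = 8) -> None:
    """Query nvidia-smi on this host and fail if any card looks empty."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    count = len(output.strip().splitlines())
    starved = underfilled_gpus(output)
    if count != expected or starved:
        raise RuntimeError(f"{count} GPUs visible, underfilled: {starved}")
    print(f"All {count} GPUs are populated")
```

Run `check_local_gpus()` after the server finishes loading weights and before routing traffic to it.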
***

## Clore.ai GPU Recommendations

| Setup         | VRAM    | Expected Performance                                 | Clore.ai Cost       |
| ------------- | ------- | ---------------------------------------------------- | ------------------- |
| 4× H100 80GB  | 320GB   | FP8 with heavy offload, max ctx \~64K, \~10–15 tok/s | \~$25–35/day        |
| 8× H100 80GB  | 640GB   | FP8 full, max ctx \~256K, \~30–45 tok/s              | \~$45–60/day        |
| 8× H200 141GB | 1,128GB | FP8 full, max ctx 1M, \~60+ tok/s with MTP           | \~$80–110/day       |
| 8× B200       | 1,536GB | FP8 full, max ctx 1M, fastest available              | marketplace pricing |

{% hint style="success" %}
**Best value:** 8× H200 141GB on the FP8 checkpoint with `SGLANG_ENABLE_SPEC_V2=1`. You get the full 1M context window, MTP speculative decoding, and enough KV-cache headroom for real agent loops. See [clore.ai/rent-h200.html](https://clore.ai/rent-h200.html) for live availability.
{% endhint %}

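Daily rental price and decode speed combine into a rough cost per million output tokens. The sketch below plugs in the table's estimates for a single request stream running around the clock; it ignores batching and idle time, so treat the result as a rough comparison between setups and not as a quote.

```python
SECONDS_PER_DAY = 86_400


def usd_per_million_tokens(usd_per_day: float, tokens_per_second: float) -> float:
    """Cost of one million output tokens at a sustained decode rate."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    return usd_per_day / tokens_per_day * 1_000_000


# (setup, low $/day, high $/day, tok/s) taken from the table's estimates
for setup, low, high, rate in (("8x H100 80GB", 45, 60, 30),
                               ("8x H200 141GB", 80, 110, 60)):
    print(f"{setup}: ${usd_per_million_tokens(low, rate):.2f}"
          f" to ${usd_per_million_tokens(high, rate):.2f} per 1M tokens")
```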
***

## Use Cases

* **Long-horizon agents** — MiMo team explicitly tunes for sustained tool-calling. The 1M context plus MTP speedup means thousands of tool turns without chunking gymnastics.
* **Whole-codebase analysis** — drop a 500K-token monorepo into context for refactor planning, dependency audits, or migration design
* **Long-document RAG** — entire books, multi-year customer transcripts, or year-long chat histories fit in one prompt
* **Coding** — vendor-claimed HumanEval+ 75.6% and the agentic posture make it a candidate for autonomous SWE workloads (pair with SWE-agent / OpenHands)
* **Research scratchpad** — 1M context tolerates the kind of "paste the whole paper, paste the prior work, ask for synthesis" usage that smaller models truncate

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed, no third-party reproduction yet.** All numbers below come from Xiaomi's April 27, 2026 announcement and the HuggingFace model card. The model is **only days old** at time of writing, so independent reproductions on agentic and long-context benchmarks are still pending. The "beats DeepSeek V4 on agentic tasks" claim in particular is from Xiaomi's own write-up; treat it as marketing until reproduced.
{% endhint %}

| Benchmark                    | MiMo-V2.5-Pro (vendor) | Notes                                             |
| ---------------------------- | ---------------------- | ------------------------------------------------- |
| GSM8K                        | **99.6%**              | Math word problems                                |
| HumanEval+                   | 75.6%                  | Coding (extended)                                 |
| MMLU                         | 89.4%                  | General knowledge                                 |
| GraphWalks (1M ctx) BFS      | 0.37                   | Long-context graph traversal                      |
| GraphWalks (1M ctx) Parents  | 0.62                   | Long-context graph traversal                      |
| Agentic tasks vs DeepSeek V4 | "outperforms" (vendor) | **Unverified — third-party reproduction pending** |

***

## Troubleshooting

| Issue                               | Solution                                                                                                                                      |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on load          | Native FP8 still needs \~600GB+ VRAM. Use 8× H200, or on 8× H100 drop the context limit to 65536 (`--context-length` in SGLang, `--max-model-len` in vLLM). |
| Slow HuggingFace download           | `huggingface-cli download XiaomiMiMo/MiMo-V2.5-Pro --local-dir ./weights`. Re-running the command resumes an interrupted download. Expect \~600GB FP8. |
| `--trust-remote-code` rejected      | Hybrid attention and MTP ship as custom code in the repo. The flag is **mandatory** for both vLLM and SGLang.                                 |
| MTP speedup not appearing in SGLang | Confirm `SGLANG_ENABLE_SPEC_V2=1` is exported in the same shell as `python3 -m sglang.launch_server`. The default path does not activate MTP. |
| Reasoning trace flat / low quality  | Use `temperature=1.0` and `top_p=0.95`. Lower temps degrade MiMo's reasoning behavior.                                                        |
| 1M context OOMs on 8× H100          | 8× H100 80GB cannot hold KV cache for 1M tokens. Cap at 256K or move to 8× H200.                                                              |
| Prefill takes minutes               | Expected at 1M context. Use `--enable-chunked-prefill` (vLLM) or batch shorter requests for interactive workloads.                            |
| GGUF / Ollama pull fails            | Community quants are not published as of April 28, 2026. Wait 1–2 weeks or use FP8 directly.                                                  |

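One way to rule out the environment-variable problem is to launch SGLang from a small wrapper that sets `SGLANG_ENABLE_SPEC_V2` explicitly, so it no longer depends on the calling shell. The sketch below reuses the flags from the launch command in Option C; the wrapper and its function names are illustrative and not part of SGLang.

```python
import os
import subprocess
import sys


def sglang_launch_spec(context_length: int = 1_048_576, tp_size: int = 8,
                       port: int = 9001) -> tuple[list[str], dict[str, str]]:
    """Return the argv and environment for the SGLang server process."""
    # Copy the current environment and force the MTP speculative decoding switch on.
    env = dict(os.environ, SGLANG_ENABLE_SPEC_V2="1")
    argv = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", "XiaomiMiMo/MiMo-V2.5-Pro",
        "--trust-remote-code",
        "--quantization", "fp8",
        "--context-length", str(context_length),
        "--tp-size", str(tp_size),
        "--host", "0.0.0.0", "--port", str(port),
    ]
    return argv, env


def launch() -> int:
    """Start the server and block until it exits, returning its exit code."""
    argv, env = sglang_launch_spec()
    return subprocess.run(argv, env=env).returncode
```

Printing the returned `argv` and `env["SGLANG_ENABLE_SPEC_V2"]` before calling `launch()` confirms what the server process will actually see.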
***

## Next Steps

* **Predecessor / sibling:** [MiMo-V2-Flash](/guides/language-models/mimo-v2-flash.md) — 309B MoE, 15B active, 32K ctx, faster but smaller
* **Vendor's claimed rival:** [DeepSeek V4](/guides/language-models/deepseek-v4.md) — 1M ctx, multimodal, \~1T params (the model Xiaomi says they beat on agentic tasks)
* **Open-weight coding rival:** [GLM-5.1](/guides/language-models/glm-5-1.md) — 744B MoE, 40B active, MIT, currently #1 on SWE-Bench Pro
* **Clore.ai H200 rentals:** [clore.ai/rent-h200.html](https://clore.ai/rent-h200.html) — best fit for full FP8 1T MoE at 1M context
* **Clore.ai marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace)

### Links

* [MiMo-V2.5-Pro on HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro)
* [Xiaomi MiMo HuggingFace org](https://huggingface.co/XiaomiMiMo)
* [SGLang repo](https://github.com/sgl-project/sglang)
* [vLLM docs](https://docs.vllm.ai)


