# MiniMax M2.7 (229B MoE Coding)

{% hint style="info" %}
**Status (April 2026):** MiniMax M2.7 was published to HuggingFace on **April 9, 2026** by MiniMaxAI and reached **496K downloads in three weeks** — by adoption, the largest open-weight release of our April refresh. Weights live at [huggingface.co/MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) under a **custom MiniMax license** (`license: other`). It is **not** Apache/MIT — read [the LICENSE](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) before any commercial deployment.
{% endhint %}

{% hint style="warning" %}
**Correction:** Earlier revisions of our model index listed M2.7 as a proprietary API-only model. That listing has been wrong since April 9, 2026, when the weights became public. This guide replaces it.
{% endhint %}

MiniMax M2.7 is a **229-billion parameter Mixture-of-Experts** model (256 experts, 8 active per token) and the latest entry in MiniMax's M2 family — a line built around **self-evolving / RL-driven post-training** and **agentic coding** workloads. The 2.7 release is the public, self-hostable counterpart to MiniMax's hosted coding agent and is positioned by MiniMax as competitive with Claude Sonnet 4.5 on agentic benchmarks while approaching Claude Opus 4.6 territory on a few of them.

The most relevant feature for agent builders is **Interleaved Thinking** (introduced in M2.1 and refined through 2.5/2.7): the model alternates `<think>` reasoning blocks with normal generation across multi-turn tool calls, so the chain of thought carries across function-call round-trips instead of being discarded each turn. For long-horizon agents, that means the reasoning trace doesn't reset every time you hit a `tool_use` boundary.
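A minimal client-side sketch of that idea follows. It assumes an OpenAI-style message list and that the server returns `<think>` text inline in the assistant `content`; both are assumptions that depend on your serving stack and its reasoning-parser settings. The helper names are hypothetical. The point is to store each assistant turn verbatim in the history and strip reasoning only for display:

```python
import re
from typing import Any

Message = dict[str, Any]

# Matches one <think>...</think> block plus any trailing whitespace.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove reasoning blocks for display only; do not apply this to the history."""
    return _THINK_RE.sub("", text)


def record_assistant_turn(history: list[Message], message: Message) -> None:
    """Append the assistant message unchanged so its <think> text reaches the next turn."""
    history.append(dict(message))


def record_tool_result(history: list[Message], tool_call_id: str, result: str) -> None:
    """Append a tool result using the OpenAI-style 'tool' role (assumed message shape)."""
    history.append({"role": "tool", "tool_call_id": tool_call_id, "content": result})


history: list[Message] = [{"role": "user", "content": "List the failing tests."}]
assistant = {
    "role": "assistant",
    "content": "<think>I need the test report first.</think>Running the test tool.",
}
record_assistant_turn(history, assistant)
record_tool_result(history, "call_1", "2 failed: test_a, test_b")

# The user sees the answer without the reasoning; the history still carries it.
shown_to_user = strip_think(assistant["content"])
```

If your agent framework rewrites or truncates assistant messages before resending them, check that it does not drop the `<think>` text, since that would defeat the purpose of the feature.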

For Clore.ai users the practical news is that M2.7 ships with an **FP8 (float8\_e4m3fn) checkpoint** on the official repo. That puts a single-node deployment within reach on **4× H100 80GB** or **2× H200 141GB**, with no 8× H200 node or 16-GPU rack required. If you've been running [GLM-5.1](/guides/language-models/glm-5-1.md) and want a second open-weight model in your agent stack with a different bias profile, this is the one to pair it with.

### Key Specs

| Property               | Value                                                                                                               |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------- |
| Total Parameters       | 229B (MoE, 256 experts)                                                                                             |
| Experts per Token      | 8 of 256                                                                                                            |
| Active Parameters      | **Not officially published** — see model card. M2 family historically \~10B active; verify before quoting publicly. |
| Hidden size / Layers   | 3,072 / 62                                                                                                          |
| Attention              | 48 heads, 8 KV (GQA)                                                                                                |
| Context Window         | 204,800 tokens (200K)                                                                                               |
| Tensor Types           | F32, BF16, F8\_E4M3                                                                                                 |
| MTP                    | Multi-Token Prediction enabled (3 MTP modules)                                                                      |
| License                | **Custom MiniMax — non-commercial by default**                                                                      |
| Release Date           | April 9, 2026                                                                                                       |
| HF Downloads (3 weeks) | \~496K                                                                                                              |
| Recommended Sampling   | `temperature=1.0`, `top_p=0.95`, `top_k=40`                                                                         |
| Primary Tooling        | vLLM, SGLang, Transformers, KTransformers, MLX-LM                                                                   |

### Why MiniMax M2.7?

* **Open weights at 229B** — one of the largest open-weight coding models that still fits on a single 4×H100 node in FP8
* **Interleaved Thinking** — `<think>` blocks survive across tool-call turns, which is genuinely useful for SWE-style agents
* **Multi-language coding focus** — MiniMax markets strong Rust, Go, Java, Kotlin, Swift, and TypeScript performance, not just Python
* **Adoption signal** — 496K downloads in three weeks is the strongest community pickup of any April 2026 open-weight release we've tracked
* **MTP support** — speculative decoding via Multi-Token Prediction modules is built in, which translates to real throughput on H100/H200
* **Hosted fallback** — if your workload outgrows a single node, MiniMax's hosted endpoint exists; you don't have to choose at architecture time

***

## Requirements

{% hint style="warning" %}
**229B is still 229B.** BF16 weights are \~460GB. The FP8 checkpoint is roughly half that — \~230GB — which is what makes single-node deployment feasible. INT4 community quants land it under \~120GB but are not officially supported.
{% endhint %}
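The sizes above follow from parameters times bytes per parameter. A rough sketch of that arithmetic is below; real checkpoints deviate somewhat because some tensors stay in higher precision and quantized formats carry scale metadata, so treat the output as an estimate, not a measured file size:

```python
def weight_footprint_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billion * bits_per_param / 8


PARAMS_BILLION = 229  # total parameters, from the spec table

for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_footprint_gb(PARAMS_BILLION, bits):.1f} GB")
# BF16: ~458.0 GB
# FP8: ~229.0 GB
# INT4: ~114.5 GB
```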

| Component  | Hobby (INT4 GGUF, offload)       | Recommended (FP8 single-node)     | Full BF16                    |
| ---------- | -------------------------------- | --------------------------------- | ---------------------------- |
| GPU VRAM   | 24–48GB GPU + 128GB+ RAM offload | 4× H100 80GB **or** 2× H200 141GB | 8× H100 80GB / 4× H200 141GB |
| Total VRAM | \~48GB GPU + offload             | 320GB / 282GB                     | 640GB / 564GB                |
| RAM        | 128GB                            | 256GB                             | 512GB                        |
| Disk       | 200GB NVMe                       | 400GB NVMe                        | 600GB NVMe                   |
| CUDA       | 12.0+                            | 12.4+                             | 12.4+                        |

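**Clore.ai pick:** The FP8 checkpoint on **2× H200** is the cleanest deployment target: fewer tensor-parallel splits and fewer NCCL hops. Keep in mind that 2× H200 (282GB) has less aggregate VRAM than 4× H100 (320GB), so after \~230GB of weights the KV-cache budget is tighter and long contexts need testing. **4× H100** is the cheaper alternative if H200 stock is tight.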
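To size context length against these setups, it helps to estimate the KV cache separately from the weights. The sketch below uses the layer and KV-head counts from the spec table; `head_dim=128` and a 2-byte (BF16) KV cache are assumptions that are not stated in this guide, so check the model's `config.json` before relying on the numbers:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache for one token: K and V, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(tokens: int, per_token_bytes: int) -> float:
    """KV cache size in GB (10^9 bytes) for a given number of cached tokens."""
    return tokens * per_token_bytes / 1e9


def kv_budget_gb(total_vram_gb: float, utilization: float, weights_gb: float) -> float:
    """VRAM left for KV cache and activations after loading the weights."""
    return total_vram_gb * utilization - weights_gb


# 62 layers and 8 KV heads come from the spec table.
# head_dim=128 and bytes_per_elem=2 are ASSUMPTIONS, not published in this guide.
per_token = kv_bytes_per_token(layers=62, kv_heads=8, head_dim=128, bytes_per_elem=2)
full_window = kv_cache_gb(204_800, per_token)

budget_4x_h100 = kv_budget_gb(320, 0.88, 230)
budget_2x_h200 = kv_budget_gb(282, 0.90, 230)

print(f"KV cache for one 204,800-token sequence: ~{full_window:.1f} GB")
print(f"Budget on 4x H100 at 0.88 utilization: ~{budget_4x_h100:.1f} GB")
print(f"Budget on 2x H200 at 0.90 utilization: ~{budget_2x_h200:.1f} GB")
```

Under those assumptions a single full-window sequence needs about 52 GB of KV cache, against roughly 51.6 GB of budget on 4× H100 and 23.8 GB on 2× H200. An 8-bit KV cache would halve the estimate, and a different `head_dim` scales it linearly.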

***

## Option A — Ollama / GGUF (Quantized)

{% hint style="warning" %}
**Community quants only.** MiniMax does not publish official GGUF weights for M2.7. Community Q4/Q5 builds typically appear 1–2 weeks after release — search [huggingface.co/models?search=minimax-m2.7+gguf](https://huggingface.co/models?search=minimax-m2.7+gguf) and verify the uploader. Quality varies on MoE quants below Q4.
{% endhint %}

```bash
# Once a community Q4_K_M build lands (check HuggingFace first)
docker exec ollama ollama pull minimax-m2.7:q4_K_M
docker exec ollama ollama run minimax-m2.7:q4_K_M

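# Or with llama.cpp directly on a downloaded GGUF.
# --n-gpu-layers 8 is only a starting point for a 24–48GB card: the ~120GB INT4
# model cannot sit fully in VRAM, so raise the value until memory runs out and
# leave the remaining layers on the CPU.
docker run --gpus all -it --rm -p 8080:8080 \
  -v "$(pwd)/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/minimax-m2.7-q4_k_m.gguf \
  --n-gpu-layers 8 --ctx-size 32768 \
  --temp 1.0 --top-p 0.95 --top-k 40 \
  --port 8080 --host 0.0.0.0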
```

Hobby use only. For real workloads use vLLM or SGLang against the FP8 checkpoint.

***

## Option B — vLLM (Production API, recommended)

vLLM is the first-class serving target. The official FP8 checkpoint is the one to pull: close to BF16 quality at roughly half the VRAM.

### docker-compose.yml — 4× H100 80GB

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model MiniMaxAI/MiniMax-M2.7
      --quantization fp8
      --tensor-parallel-size 4
      --max-model-len 65536
      --gpu-memory-utilization 0.88
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --served-model-name minimax-m2.7
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

### docker-compose.yml — 2× H200 141GB

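Drop `--tensor-parallel-size` to 2. This example also raises `--max-model-len` to 131072; because 2× H200 offers less aggregate VRAM (282GB) than 4× H100 (320GB), lower it again if the server runs out of KV-cache memory at startup: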

```yaml
    command: >
      --model MiniMaxAI/MiniMax-M2.7
      --quantization fp8
      --tensor-parallel-size 2
      --max-model-len 131072
      --gpu-memory-utilization 0.90
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --served-model-name minimax-m2.7
      --trust-remote-code
```

### Smoke test

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7",
    "messages": [
      {"role": "system", "content": "You are a senior engineer. Use Interleaved Thinking when reasoning across tool calls."},
      {"role": "user", "content": "Audit this Rust async handler for tokio cancellation safety: ..."}
    ],
    "max_tokens": 4096,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40
  }'
```

{% hint style="info" %}
**Don't lower `temperature` below 1.0.** MiniMax's recommended sampling is `T=1.0, top_p=0.95, top_k=40`. Greedy decoding silently breaks the `<think>` interleaving on multi-turn tool calls.
{% endhint %}
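For clients written in Python, a small sketch of pinning the recommended sampling in one place is shown below, using only the standard library. The function name is hypothetical, the URL and model name mirror the compose files in this guide, and `top_k` is sent as an extra field beyond the OpenAI schema, which assumes your server accepts it (vLLM does):

```python
import json
import urllib.request

# Recommended sampling from the spec table; keep it in one place so no caller
# accidentally falls back to greedy decoding.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    model: str = "minimax-m2.7",
    max_tokens: int = 4096,
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with the recommended sampling."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **RECOMMENDED_SAMPLING,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Explain tokio cancellation safety in two sentences.")
```

Sending it is then a matter of `urllib.request.urlopen(request)` and decoding the JSON body; building the request separately makes the sampling easy to assert in tests.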

***

## Option C — SGLang

SGLang's MoE scheduler is competitive with vLLM on Hopper and often wins on long-context coding completions thanks to EAGLE speculative decoding stacking with M2.7's MTP modules.

```bash
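docker pull lmsysorg/sglang:latest

# Launch the server inside the image pulled above.
# Port 30000 is SGLang's default; the cache mount avoids re-downloading weights.
docker run --gpus all --ipc=host --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.7 \
  --quantization fp8 \
  --tp-size 4 \
  --mem-fraction-static 0.88 \
  --context-length 65536 \
  --enable-mixed-chunk \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --served-model-name minimax-m2.7 \
  --trust-remote-code \
  --host 0.0.0.0 --port 30000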
```

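With speculative decoding enabled, we'd expect roughly a \~1.5–2× throughput gain over vanilla vLLM on long agent traces, but measure on your own workload before sizing capacity. Drop `--tp-size` to 2 on H200.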

***

## Clore.ai GPU Recommendations

| Setup                          | VRAM         | Expected Performance                             | Clore.ai Cost    |
| ------------------------------ | ------------ | ------------------------------------------------ | ---------------- |
| 1× RTX 4090 24GB + RAM offload | 24GB + 128GB | INT4 hobby, \~5–10 tok/s                         | \~$1–2/day       |
| 4× A100 80GB                   | 320GB        | FP8 weights only (BF16 at \~460GB does not fit; A100 has no native FP8 compute), \~15–25 tok/s | \~$15–22/day     |
| **4× H100 80GB (FP8)**         | **320GB**    | **FP8 production, \~40–60 tok/s**                | **\~$20–28/day** |
| **2× H200 141GB (FP8)**        | **282GB**    | **FP8 production, \~50–70 tok/s, 128K ctx as configured in Option B** | **\~$18–26/day** |
| 8× H100 80GB                   | 640GB        | BF16 full, \~80+ tok/s                           | \~$40–55/day     |

{% hint style="success" %}
**Best value:** 2× H200 with the FP8 checkpoint. Same throughput class as 4× H100 with half the tensor-parallel hops, and often cheaper per day on the marketplace. Aggregate VRAM is lower (282GB vs. 320GB), so validate your target context length before committing.
{% endhint %}
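To compare rows on cost rather than raw speed, the daily price and decode rate can be folded into a price per million generated tokens. The sketch below uses midpoints of the ranges in the table above and assumes the box decodes at that rate around the clock for a single stream; batched serving changes the picture considerably, so treat this as an upper bound on cost per token:

```python
SECONDS_PER_DAY = 86_400


def usd_per_million_tokens(usd_per_day: float, tokens_per_second: float) -> float:
    """Cost of one million generated tokens at a sustained decode rate."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    return usd_per_day / tokens_per_day * 1_000_000


def midpoint(low: float, high: float) -> float:
    """Middle of a quoted range; a simplification, not a measurement."""
    return (low + high) / 2


# Ranges copied from the recommendations table.
setups = {
    "4x H100 (FP8)": (midpoint(20, 28), midpoint(40, 60)),
    "2x H200 (FP8)": (midpoint(18, 26), midpoint(50, 70)),
}
for name, (usd_day, tok_s) in setups.items():
    print(f"{name}: ~${usd_per_million_tokens(usd_day, tok_s):.2f} per million tokens")
# 4x H100 (FP8): ~$5.56 per million tokens
# 2x H200 (FP8): ~$4.24 per million tokens
```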

Rent the boxes here:

* [**Rent H200 GPUs**](https://clore.ai/rent-h200.html) — recommended for the 2× H200 FP8 deployment
* [**Rent H100 GPUs**](https://clore.ai/rent-h100.html) — for the 4× H100 FP8 deployment
* [**Rent A100 80GB**](https://clore.ai/rent-a100-80gb.html) — multi-GPU fallback (BF16 at \~460GB needs 8× 80GB)
* [**Rent RTX 4090**](https://clore.ai/rent-4090.html) — INT4 hobby use only
* [**Marketplace**](https://clore.ai/marketplace) — full inventory, on-demand and spot bidding

***

## Use Cases

* **Multi-language SWE agents** — Rust, Go, Java, Kotlin, Swift, and TypeScript get first-class treatment, not just Python/JS
* **Long-horizon tool-calling loops** — Interleaved Thinking keeps the reasoning trace alive across hundreds of `tool_use` round-trips
* **Codebase audits** — 200K context fits a mid-sized service plus its tests in one prompt
* **Refactor pipelines** — long sequences of file edits, with MTP speculative decoding helping keep per-edit latency down
* **Agent-of-agents orchestration** — pair M2.7 as planner with a smaller model (Qwen3.5, GLM-4.7-Flash) as worker
* **Self-hosted alternative to Claude Sonnet/Opus** for non-commercial coding research — but **read the license first**

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed — verify independently.** Numbers below come from MiniMax's April 9, 2026 release notes. Independent reproductions are still rolling in.
{% endhint %}

| Benchmark        | MiniMax M2.7 | Claude Sonnet 4.5 (vendor ref) | Claude Opus 4.6 (vendor ref) | GPT-5.3-Codex |
| ---------------- | ------------ | ------------------------------ | ---------------------------- | ------------- |
| SWE-Pro          | **56.22%**   | \~55%                          | \~57.3%                      | 56.2%         |
| VIBE-Pro         | **55.6%**    | —                              | \~57%                        | —             |
| Terminal Bench 2 | **57.0%**    | —                              | —                            | —             |
| GDPval-AA (ELO)  | **1495**     | —                              | —                            | —             |

MiniMax's framing: M2.7 matches or beats Claude Sonnet 4.5 on the agentic-coding suite they care about, and lands within a few points of Claude Opus 4.6 on SWE-Pro / VIBE-Pro. Treat this as a directional signal, not a settled ranking — the gap to closed frontier models tightens every release.

***

## MiniMax M2 Family

| Version  | Released        | Architectural Focus                                  | Recommended For                           |
| -------- | --------------- | ---------------------------------------------------- | ----------------------------------------- |
| M2       | Oct 2025        | Initial 229B MoE release, RL-tuned coding            | Reference / historical                    |
| M2.1     | Dec 2025        | **Interleaved Thinking** introduced                  | Earliest version worth running for agents |
| M2.5     | Feb 2026        | Self-evolving RL post-training, longer context       | Solid coding model if disk-constrained    |
| **M2.7** | **Apr 9, 2026** | **Refined multi-language coding, MTP, FP8 official** | **Default choice — use this**             |

If you're starting fresh, skip earlier versions and go straight to M2.7. The architectural deltas compound and the FP8 ergonomics are noticeably better.

***

## Troubleshooting

| Issue                             | Solution                                                                                                                      |
| --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on FP8 load    | The FP8 weights alone need \~230GB of VRAM, so use 4× H100 80GB or 2× H200 141GB. If it still fails on that hardware, drop `--max-model-len` to 32768 first. |
| Slow HuggingFace download         | `huggingface-cli download MiniMaxAI/MiniMax-M2.7 --local-dir ./weights`. Re-running the command resumes an interrupted download. Expect \~230GB FP8 / \~460GB BF16. |
| Tool calls silently dropped       | Set `--enable-auto-tool-choice --tool-call-parser hermes` in vLLM. M2.7 uses Hermes-style tool tags.                          |
| `<think>` blocks empty or garbled | Sampling must be `temperature=1.0, top_p=0.95, top_k=40`. Greedy decoding breaks Interleaved Thinking.                        |
| MTP errors / shape mismatch       | Update vLLM to the latest stable; MTP support landed late and older builds don't ship the modules.                            |
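| 200K context OOMs on H100         | Use `--enable-chunked-prefill`, start at `--max-model-len 65536`, and raise it stepwise. After \~230GB of weights the full 200K window leaves little KV-cache headroom on either recommended setup, so expect to need more aggregate VRAM. |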
| License confusion                 | Default = non-commercial. Email `api@minimax.io` with subject "M2.7 licensing" before any paid product use.                   |

***

## Next Steps

* **Audio sibling:** [MiniMax Speech](/guides/audio-and-voice/minimax-speech.md) — same vendor, audio/voice generation
* **Open-license alternative:** [GLM-5.1](/guides/language-models/glm-5-1.md) — 744B / 40B active, MIT license, top SWE-Bench Pro
* **Massive-context alternative:** [DeepSeek V4](/guides/language-models/deepseek-v4.md) — 1M context, multimodal
* **Cheaper agentic option:** [GLM-4.7 Flash](/guides/language-models/glm-47-flash.md) — fits on single H100, MIT
* **Clore.ai marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace) — H100/H200/A100 from the spot market

### Links

* [MiniMax M2.7 on HuggingFace](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
* [MiniMax M2.7 LICENSE](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) — read before commercial use
* [MiniMax platform](https://www.minimax.io)
* [vLLM docs](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)
* [KTransformers](https://github.com/kvcache-ai/ktransformers)

