MiniMax M2.7 (229B MoE Coding)
Deploy MiniMax M2.7 (229B MoE) on Clore.ai — the open-weight self-hosted release behind MiniMax's coding agent push, with FP8 single-node deployment on H100/H200
Status (April 2026): MiniMax M2.7 was published to HuggingFace on April 9, 2026 by MiniMaxAI and reached 496K downloads in three weeks — by adoption, the largest open-weight release of our April refresh. Weights live at huggingface.co/MiniMaxAI/MiniMax-M2.7 under a custom MiniMax license (license: other). It is not Apache/MIT — read the LICENSE before any commercial deployment.
Correction: Earlier revisions of our model index listed M2.7 as a proprietary API-only model. That was wrong by April 9, 2026 — the weights are public. This guide replaces that listing.
MiniMax M2.7 is a 229-billion parameter Mixture-of-Experts model (256 experts, 8 active per token) and the latest entry in MiniMax's M2 family — a line built around self-evolving / RL-driven post-training and agentic coding workloads. The 2.7 release is the public, self-hostable counterpart to MiniMax's hosted coding agent and is positioned by MiniMax as competitive with Claude Sonnet 4.5 on agentic benchmarks while approaching Claude Opus 4.6 territory on a few of them.
The interesting architectural detail is Interleaved Thinking (introduced in M2.1 and refined through 2.5/2.7): the model alternates <think> reasoning blocks with normal generation across multi-turn tool calls, so the chain of thought survives across function-call round-trips instead of being discarded each turn. That is what makes it interesting for long-horizon agents — the reasoning trace doesn't reset every time you hit a tool_use boundary.
For Clore.ai users the practical news is that M2.7 ships with an FP8 (float8_e4m3fn) checkpoint on the official repo. That puts a single-node deployment within reach on 4× H100 80GB or 2× H200 141GB — no H200 octets or 16-GPU racks required. If you've been running GLM-5.1 and want a second open-weight model in your agent stack with a different bias profile, this is the one to pair it with.
Key Specs
Total Parameters
229B (MoE, 256 experts)
Experts per Token
8 of 256
Active Parameters
Not officially published — see model card. M2 family historically ~10B active; verify before quoting publicly.
Hidden size / Layers
3,072 / 62
Attention
48 heads, 8 KV (GQA)
Context Window
204,800 tokens (200K)
Tensor Types
F32, BF16, F8_E4M3
MTP
Multi-Token Prediction enabled (3 MTP modules)
License
Custom MiniMax — non-commercial by default
Release Date
April 9, 2026
HF Downloads (3 weeks)
~496K
Recommended Sampling
temperature=1.0, top_p=0.95, top_k=40
Primary Tooling
vLLM, SGLang, Transformers, KTransformers, MLX-LM
Why MiniMax M2.7?
Open weights at 229B — biggest "real" open-weight coding model that still fits on a single 4×H100 node in FP8
Interleaved Thinking —
<think>blocks survive across tool-call turns, which is genuinely useful for SWE-style agentsMulti-language coding focus — MiniMax markets strong Rust, Go, Java, Kotlin, Swift, and TypeScript performance, not just Python
Adoption signal — 496K downloads in three weeks is the strongest community pickup of any April 2026 open-weight release we've tracked
MTP support — speculative decoding via Multi-Token Prediction modules is built in, which translates to real throughput on H100/H200
Hosted fallback — if your workload outgrows a single node, MiniMax's hosted endpoint exists; you don't have to choose at architecture time
Requirements
229B is still 229B. BF16 weights are ~460GB. The FP8 checkpoint is roughly half that — ~230GB — which is what makes single-node deployment feasible. INT4 community quants land it under ~120GB but are not officially supported.
GPU VRAM
24–48GB GPU + 128GB+ RAM offload
4× H100 80GB or 2× H200 141GB
8× H100 80GB / 4× H200 141GB
Total VRAM
~48GB GPU + offload
320GB / 282GB
640GB / 564GB
RAM
128GB
256GB
512GB
Disk
200GB NVMe
400GB NVMe
600GB NVMe
CUDA
12.0+
12.4+
12.4+
Clore.ai pick: The FP8 checkpoint on 2× H200 is the cleanest deployment target — minimum tensor-parallel splits, fewer NCCL hops, and the math for 200K context just works. 4× H100 is the cheaper alternative if H200 stock is tight.
Option A — Ollama / GGUF (Quantized)
Community quants only. MiniMax does not publish official GGUF weights for M2.7. Community Q4/Q5 builds typically appear 1–2 weeks after release — search huggingface.co/models?search=minimax-m2.7+gguf and verify the uploader. Quality varies on MoE quants below Q4.
Hobby use only. For real workloads use vLLM or SGLang against the FP8 checkpoint.
Option B — vLLM (Production API, recommended)
vLLM is the first-class serving target. The official FP8 checkpoint is the one to pull — same quality as BF16 at roughly half the VRAM.
docker-compose.yml — 4× H100 80GB
docker-compose.yml — 2× H200 141GB
Drop --tensor-parallel-size to 2 and bump --max-model-len to use the headroom:
Smoke test
Don't lower temperature below 1.0. MiniMax's recommended sampling is T=1.0, top_p=0.95, top_k=40. Greedy decoding silently breaks the <think> interleaving on multi-turn tool calls.
Option C — SGLang
SGLang's MoE scheduler is competitive with vLLM on Hopper and often wins on long-context coding completions thanks to EAGLE speculative decoding stacking with M2.7's MTP modules.
Expect a ~1.5–2× throughput gain over vanilla vLLM on long agent traces. Drop --tp-size to 2 on H200.
Clore.ai GPU Recommendations
1× RTX 4090 24GB + RAM offload
24GB + 128GB
INT4 hobby, ~5–10 tok/s
~$1–2/day
4× A100 80GB
320GB
BF16 sharded, ~15–25 tok/s
~$15–22/day
4× H100 80GB (FP8)
320GB
FP8 production, ~40–60 tok/s
~$20–28/day
2× H200 141GB (FP8)
282GB
FP8 production, ~50–70 tok/s, full 200K ctx
~$18–26/day
8× H100 80GB
640GB
BF16 full, ~80+ tok/s
~$40–55/day
Best value: 2× H200 with the FP8 checkpoint. Same throughput class as 4× H100 with half the tensor-parallel hops, often cheaper per day on the marketplace, and you keep enough VRAM headroom for the full 200K context.
Rent the boxes here:
Rent H200 GPUs — recommended for the 2× H200 FP8 deployment
Rent H100 GPUs — for the 4× H100 FP8 deployment
Rent A100 80GB — BF16 multi-GPU fallback
Rent RTX 4090 — INT4 hobby use only
Marketplace — full inventory, on-demand and spot bidding
Use Cases
Multi-language SWE agents — Rust, Go, Java, Kotlin, Swift, and TypeScript get first-class treatment, not just Python/JS
Long-horizon tool-calling loops — Interleaved Thinking keeps the reasoning trace alive across hundreds of
tool_useround-tripsCodebase audits — 200K context fits a mid-sized service plus its tests in one prompt
Refactor pipelines — sustained correctness across many file edits via the MTP modules
Agent-of-agents orchestration — pair M2.7 as planner with a smaller model (Qwen3.5, GLM-4.7-Flash) as worker
Self-hosted alternative to Claude Sonnet/Opus for non-commercial coding research — but read the license first
Benchmarks
Vendor-claimed — verify independently. Numbers below come from MiniMax's April 9, 2026 release notes. Independent reproductions are still rolling in.
SWE-Pro
56.22%
~55%
~57.3%
56.2%
VIBE-Pro
55.6%
—
~57%
—
Terminal Bench 2
57.0%
—
—
—
GDPval-AA (ELO)
1495
—
—
—
MiniMax's framing: M2.7 matches or beats Claude Sonnet 4.5 on the agentic-coding suite they care about, and lands within a few points of Claude Opus 4.6 on SWE-Pro / VIBE-Pro. Treat this as a directional signal, not a settled ranking — the gap to closed frontier models tightens every release.
MiniMax M2 Family
M2
Oct 2025
Initial 229B MoE release, RL-tuned coding
Reference / historical
M2.1
Dec 2025
Interleaved Thinking introduced
Earliest version worth running for agents
M2.5
Feb 2026
Self-evolving RL post-training, longer context
Solid coding model if disk-constrained
M2.7
Apr 9, 2026
Refined multi-language coding, MTP, FP8 official
Default choice — use this
If you're starting fresh, skip earlier versions and go straight to M2.7. The architectural deltas compound and the FP8 ergonomics are noticeably better.
Troubleshooting
OutOfMemoryError on FP8 load
Need ~230GB VRAM. Use 4× H100 80GB or 2× H200 141GB. Drop --max-model-len to 32768 first.
Slow HuggingFace download
huggingface-cli download MiniMaxAI/MiniMax-M2.7 --local-dir ./weights --resume-download. Expect ~230GB FP8 / ~460GB BF16.
Tool calls silently dropped
Set --enable-auto-tool-choice --tool-call-parser hermes in vLLM. M2.7 uses Hermes-style tool tags.
<think> blocks empty or garbled
Sampling must be temperature=1.0, top_p=0.95, top_k=40. Greedy decoding breaks Interleaved Thinking.
MTP errors / shape mismatch
Update vLLM to the latest stable; MTP support landed late and older builds don't ship the modules.
200K context OOMs on H100
Use --enable-chunked-prefill and start at --max-model-len 65536. Full 200K realistically requires H200.
License confusion
Default = non-commercial. Email api@minimax.io with subject "M2.7 licensing" before any paid product use.
Next Steps
Audio sibling: MiniMax Speech — same vendor, audio/voice generation
Open-license alternative: GLM-5.1 — 744B / 40B active, MIT license, top SWE-Bench Pro
Massive-context alternative: DeepSeek V4 — 1M context, multimodal
Cheaper agentic option: GLM-4.7 Flash — fits on single H100, MIT
Clore.ai marketplace: clore.ai/marketplace — H100/H200/A100 from the spot market
Links
MiniMax M2.7 LICENSE — read before commercial use
Last updated
Was this helpful?