For the complete documentation index, see llms.txt. This page is also available as Markdown.

MiniMax M2.7 (229B MoE Coding)

Deploy MiniMax M2.7 (229B MoE) on Clore.ai — the open-weight self-hosted release behind MiniMax's coding agent push, with FP8 single-node deployment on H100/H200

Status (April 2026): MiniMax M2.7 was published to HuggingFace on April 9, 2026 by MiniMaxAI and reached 496K downloads in three weeks — by adoption, the largest open-weight release of our April refresh. Weights live at huggingface.co/MiniMaxAI/MiniMax-M2.7 under a custom MiniMax license (license: other). It is not Apache/MIT — read the LICENSE before any commercial deployment.

MiniMax M2.7 is a 229-billion parameter Mixture-of-Experts model (256 experts, 8 active per token) and the latest entry in MiniMax's M2 family — a line built around self-evolving / RL-driven post-training and agentic coding workloads. The 2.7 release is the public, self-hostable counterpart to MiniMax's hosted coding agent and is positioned by MiniMax as competitive with Claude Sonnet 4.5 on agentic benchmarks while approaching Claude Opus 4.6 territory on a few of them.

The interesting architectural detail is Interleaved Thinking (introduced in M2.1 and refined through 2.5/2.7): the model alternates <think> reasoning blocks with normal generation across multi-turn tool calls, so the chain of thought survives across function-call round-trips instead of being discarded each turn. That is what makes it interesting for long-horizon agents — the reasoning trace doesn't reset every time you hit a tool_use boundary.

For Clore.ai users the practical news is that M2.7 ships with an FP8 (float8_e4m3fn) checkpoint on the official repo. That puts a single-node deployment within reach on 4× H100 80GB or 2× H200 141GB — no H200 octets or 16-GPU racks required. If you've been running GLM-5.1 and want a second open-weight model in your agent stack with a different bias profile, this is the one to pair it with.

Key Specs

Property
Value

Total Parameters

229B (MoE, 256 experts)

Experts per Token

8 of 256

Active Parameters

Not officially published — see model card. M2 family historically ~10B active; verify before quoting publicly.

Hidden size / Layers

3,072 / 62

Attention

48 heads, 8 KV (GQA)

Context Window

204,800 tokens (200K)

Tensor Types

F32, BF16, F8_E4M3

MTP

Multi-Token Prediction enabled (3 MTP modules)

License

Custom MiniMax — non-commercial by default

Release Date

April 9, 2026

HF Downloads (3 weeks)

~496K

Recommended Sampling

temperature=1.0, top_p=0.95, top_k=40

Primary Tooling

vLLM, SGLang, Transformers, KTransformers, MLX-LM

Why MiniMax M2.7?

  • Open weights at 229B — biggest "real" open-weight coding model that still fits on a single 4×H100 node in FP8

  • Interleaved Thinking<think> blocks survive across tool-call turns, which is genuinely useful for SWE-style agents

  • Multi-language coding focus — MiniMax markets strong Rust, Go, Java, Kotlin, Swift, and TypeScript performance, not just Python

  • Adoption signal — 496K downloads in three weeks is the strongest community pickup of any April 2026 open-weight release we've tracked

  • MTP support — speculative decoding via Multi-Token Prediction modules is built in, which translates to real throughput on H100/H200

  • Hosted fallback — if your workload outgrows a single node, MiniMax's hosted endpoint exists; you don't have to choose at architecture time


Requirements

Component
Hobby (INT4 GGUF, offload)
Recommended (FP8 single-node)
Full BF16

GPU VRAM

24–48GB GPU + 128GB+ RAM offload

4× H100 80GB or 2× H200 141GB

8× H100 80GB / 4× H200 141GB

Total VRAM

~48GB GPU + offload

320GB / 282GB

640GB / 564GB

RAM

128GB

256GB

512GB

Disk

200GB NVMe

400GB NVMe

600GB NVMe

CUDA

12.0+

12.4+

12.4+

Clore.ai pick: The FP8 checkpoint on 2× H200 is the cleanest deployment target — minimum tensor-parallel splits, fewer NCCL hops, and the math for 200K context just works. 4× H100 is the cheaper alternative if H200 stock is tight.


Option A — Ollama / GGUF (Quantized)

Hobby use only. For real workloads use vLLM or SGLang against the FP8 checkpoint.


vLLM is the first-class serving target. The official FP8 checkpoint is the one to pull — same quality as BF16 at roughly half the VRAM.

docker-compose.yml — 4× H100 80GB

docker-compose.yml — 2× H200 141GB

Drop --tensor-parallel-size to 2 and bump --max-model-len to use the headroom:

Smoke test

Don't lower temperature below 1.0. MiniMax's recommended sampling is T=1.0, top_p=0.95, top_k=40. Greedy decoding silently breaks the <think> interleaving on multi-turn tool calls.


Option C — SGLang

SGLang's MoE scheduler is competitive with vLLM on Hopper and often wins on long-context coding completions thanks to EAGLE speculative decoding stacking with M2.7's MTP modules.

Expect a ~1.5–2× throughput gain over vanilla vLLM on long agent traces. Drop --tp-size to 2 on H200.


Clore.ai GPU Recommendations

Setup
VRAM
Expected Performance
Clore.ai Cost

1× RTX 4090 24GB + RAM offload

24GB + 128GB

INT4 hobby, ~5–10 tok/s

~$1–2/day

4× A100 80GB

320GB

BF16 sharded, ~15–25 tok/s

~$15–22/day

4× H100 80GB (FP8)

320GB

FP8 production, ~40–60 tok/s

~$20–28/day

2× H200 141GB (FP8)

282GB

FP8 production, ~50–70 tok/s, full 200K ctx

~$18–26/day

8× H100 80GB

640GB

BF16 full, ~80+ tok/s

~$40–55/day

Rent the boxes here:


Use Cases

  • Multi-language SWE agents — Rust, Go, Java, Kotlin, Swift, and TypeScript get first-class treatment, not just Python/JS

  • Long-horizon tool-calling loops — Interleaved Thinking keeps the reasoning trace alive across hundreds of tool_use round-trips

  • Codebase audits — 200K context fits a mid-sized service plus its tests in one prompt

  • Refactor pipelines — sustained correctness across many file edits via the MTP modules

  • Agent-of-agents orchestration — pair M2.7 as planner with a smaller model (Qwen3.5, GLM-4.7-Flash) as worker

  • Self-hosted alternative to Claude Sonnet/Opus for non-commercial coding research — but read the license first


Benchmarks

Benchmark
MiniMax M2.7
Claude Sonnet 4.5 (vendor ref)
Claude Opus 4.6 (vendor ref)
GPT-5.3-Codex

SWE-Pro

56.22%

~55%

~57.3%

56.2%

VIBE-Pro

55.6%

~57%

Terminal Bench 2

57.0%

GDPval-AA (ELO)

1495

MiniMax's framing: M2.7 matches or beats Claude Sonnet 4.5 on the agentic-coding suite they care about, and lands within a few points of Claude Opus 4.6 on SWE-Pro / VIBE-Pro. Treat this as a directional signal, not a settled ranking — the gap to closed frontier models tightens every release.


MiniMax M2 Family

Version
Released
Architectural Focus
Recommended For

M2

Oct 2025

Initial 229B MoE release, RL-tuned coding

Reference / historical

M2.1

Dec 2025

Interleaved Thinking introduced

Earliest version worth running for agents

M2.5

Feb 2026

Self-evolving RL post-training, longer context

Solid coding model if disk-constrained

M2.7

Apr 9, 2026

Refined multi-language coding, MTP, FP8 official

Default choice — use this

If you're starting fresh, skip earlier versions and go straight to M2.7. The architectural deltas compound and the FP8 ergonomics are noticeably better.


Troubleshooting

Issue
Solution

OutOfMemoryError on FP8 load

Need ~230GB VRAM. Use 4× H100 80GB or 2× H200 141GB. Drop --max-model-len to 32768 first.

Slow HuggingFace download

huggingface-cli download MiniMaxAI/MiniMax-M2.7 --local-dir ./weights --resume-download. Expect ~230GB FP8 / ~460GB BF16.

Tool calls silently dropped

Set --enable-auto-tool-choice --tool-call-parser hermes in vLLM. M2.7 uses Hermes-style tool tags.

<think> blocks empty or garbled

Sampling must be temperature=1.0, top_p=0.95, top_k=40. Greedy decoding breaks Interleaved Thinking.

MTP errors / shape mismatch

Update vLLM to the latest stable; MTP support landed late and older builds don't ship the modules.

200K context OOMs on H100

Use --enable-chunked-prefill and start at --max-model-len 65536. Full 200K realistically requires H200.

License confusion

Default = non-commercial. Email api@minimax.io with subject "M2.7 licensing" before any paid product use.


Next Steps

  • Audio sibling: MiniMax Speech — same vendor, audio/voice generation

  • Open-license alternative: GLM-5.1 — 744B / 40B active, MIT license, top SWE-Bench Pro

  • Massive-context alternative: DeepSeek V4 — 1M context, multimodal

  • Cheaper agentic option: GLM-4.7 Flash — fits on single H100, MIT

  • Clore.ai marketplace: clore.ai/marketplace — H100/H200/A100 from the spot market

Last updated

Was this helpful?