
# MiniMax M2.7 (229B MoE Coding)

{% hint style="info" %}
**Status (April 2026):** MiniMax M2.7 was released by MiniMaxAI on HuggingFace on **April 9, 2026** and reached **496K downloads in three weeks**, making it the most widely adopted open-weight release in our April refresh. The weights are at [huggingface.co/MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) under a **custom MiniMax license** (`license: other`). It is **not** Apache/MIT: read [the LICENSE](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) before any commercial use.
{% endhint %}

{% hint style="warning" %}
**Correction:** Earlier versions of our model index listed M2.7 as a proprietary, API-only model. As of April 9, 2026 that is wrong: the weights are public. This guide replaces that entry.
{% endhint %}

MiniMax M2.7 is a **229-billion-parameter Mixture-of-Experts** model (256 experts, 8 active per token) and the newest entry in MiniMax's M2 family, a line built around **self-evolving / RL-driven post-training** and **agentic coding** workloads. The 2.7 release is the public, self-hostable counterpart to MiniMax's hosted coding agent. MiniMax positions it as competitive with Claude Sonnet 4.5 on agentic benchmarks and close to Claude Opus 4.6 on some of them.

The interesting architectural detail is **Interleaved Thinking** (introduced in M2.1 and refined through 2.5/2.7): the model alternates between `<think>` reasoning blocks and normal generation across multiple tool calls, so the chain of thought survives function-call round trips instead of being discarded on every pass. That makes it attractive for long-horizon agents: the reasoning trace is not reset each time you hit a `tool_use` boundary.
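On the client side, keeping the reasoning trace alive usually means sending the assistant turn back unmodified. The sketch below shows one way to build the next request's history after a tool round. It assumes the server returns the `<think>` text inline in the assistant message `content` and uses OpenAI-style `tool_calls`; the helper name `append_tool_round` and the `tool_results` mapping are hypothetical, not part of any MiniMax or vLLM API.

```python
import json


def append_tool_round(messages, assistant_message, tool_results):
    """Return a new history: prior messages, the assistant turn, then tool outputs.

    ASSUMPTION: the assistant message carries its <think>...</think> text inline
    in `content`. It is appended verbatim so the reasoning is not stripped.
    `tool_results` maps tool_call id -> JSON-serializable result (hypothetical shape).
    """
    history = list(messages)  # copy, so the caller's list is left untouched
    history.append(assistant_message)
    for call in assistant_message.get("tool_calls", []):
        history.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(tool_results[call["id"]]),
            }
        )
    return history
```

The returned list is what you would pass as `messages` in the next chat-completions call. Stripping or summarizing the assistant `content` before resending would drop the trace that Interleaved Thinking is meant to preserve.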

For Clore.ai users, the practical news is that M2.7 ships with an **FP8 (float8\_e4m3fn) checkpoint** in the official repo. That puts a single-node deployment within reach on **4× H100 80GB** or **2× H200 141GB**, with no 8× H200 nodes or 16-GPU racks required. If you have been running [GLM-5.1](/guides/guides_v2-de/sprachmodelle/glm-5-1.md) and want a second open-weight model with a different bias profile in your agent stack, this is the one to pick.

### Key specifications

| Property                      | Value                                                                                                                                     |
| ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| Total parameters              | 229B (MoE, 256 experts)                                                                                                                   |
| Experts per token             | 8 of 256                                                                                                                                  |
| Active parameters             | **Not officially published**; see the model card. The M2 family has historically been around \~10B active; verify before citing publicly. |
| Hidden size / layers          | 3,072 / 62                                                                                                                                |
| Attention                     | 48 heads, 8 KV heads (GQA)                                                                                                                |
| Context window                | 204,800 tokens (200K)                                                                                                                     |
| Tensor types                  | F32, BF16, F8\_E4M3                                                                                                                       |
| MTP                           | Multi-Token Prediction enabled (3 MTP modules)                                                                                            |
| License                       | **Custom MiniMax license, non-commercial by default**                                                                                     |
| Release date                  | April 9, 2026                                                                                                                             |
| HF downloads (3 weeks)        | \~496K                                                                                                                                    |
| Recommended sampling settings | `temperature=1.0`, `top_p=0.95`, `top_k=40`                                                                                               |
| Primary tooling               | vLLM, SGLang, Transformers, KTransformers, MLX-LM                                                                                         |

### Why MiniMax M2.7?

* **Open weights at 229B**: the largest "true" open-weight coding model that still fits on a single 4×H100 node in FP8
* **Interleaved Thinking**: `<think>` blocks persist across tool-call rounds, which is genuinely useful for SWE-style agents
* **Multilingual coding focus**: MiniMax advertises strong performance on Rust, Go, Java, Kotlin, Swift, and TypeScript, not just Python
* **Adoption signal**: 496K downloads in three weeks is the strongest community response of any open-weight release we tracked in April 2026
* **MTP support**: speculative decoding via Multi-Token Prediction modules is built in, which translates into real throughput gains on H100/H200
* **Hosted fallback**: if your workload outgrows a single node, MiniMax's hosted endpoint is available, so you are not locked in at the architecture level

***

## Requirements

{% hint style="warning" %}
**229B is still 229B.** BF16 weights are \~460GB. The FP8 checkpoint is roughly half that (\~230GB), which is what makes a single-node deployment practical. Community INT4 quants come in under \~120GB but are not officially supported.
{% endhint %}
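The sizes above follow directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 229B total parameters and ignoring KV cache, activations, and runtime overhead (the function names are illustrative):

```python
# Rough weight-memory estimate. Ignores KV cache, activations, and framework overhead.
TOTAL_PARAMS = 229e9  # total parameter count from the spec table

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(dtype: str, params: float = TOTAL_PARAMS) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(dtype: str, gpus: int, vram_gb_per_gpu: float) -> bool:
    """True if the weights alone fit into the node's combined VRAM."""
    return weight_gb(dtype) <= gpus * vram_gb_per_gpu


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_gb(dtype):.1f} GB of weights")
```

`weights_fit("fp8", 4, 80)` and `weights_fit("fp8", 2, 141)` are both true, while `weights_fit("bf16", 4, 80)` is false, which matches the hardware tiers in the table below. Fitting the weights is a necessary condition only; the KV cache still needs its own share.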

| Component  | Hobby (INT4 GGUF, offload)       | Recommended (FP8 single-node)      | Full BF16                    |
| ---------- | -------------------------------- | ---------------------------------- | ---------------------------- |
| GPU VRAM   | 24–48GB GPU + 128GB+ RAM offload | 4× H100 80GB **or** 2× H200 141GB  | 8× H100 80GB / 4× H200 141GB |
| Total VRAM | \~48GB GPU + offload             | 320GB / 282GB                      | 640GB / 564GB                |
| RAM        | 128GB                            | 256GB                              | 512GB                        |
| Disk       | 200GB NVMe                       | 400GB NVMe                         | 600GB NVMe                   |
| CUDA       | 12.0+                            | 12.4+                              | 12.4+                        |

**Clore.ai pick:** The FP8 checkpoint on **2× H200** is the cleanest deployment target: fewer tensor-parallel splits, fewer NCCL hops, and the memory math for 200K context works out. **4× H100** is the cheaper alternative when H200 inventory is tight.
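To sanity-check the context-length budget yourself, the per-sequence KV cache can be estimated from the spec table. This is a rough sketch: the layer count (62) and KV head count (8) come from the table, but the head dimension of 128 is an assumption that is not stated on this page, so check the model's `config.json` before relying on the numbers. It also ignores the MTP modules and any paged-allocation overhead.

```python
LAYERS = 62     # from the spec table
KV_HEADS = 8    # from the spec table (GQA)
HEAD_DIM = 128  # ASSUMPTION: not given on this page; verify in config.json


def kv_cache_gb(tokens: int, bytes_per_value: float = 2.0, head_dim: int = HEAD_DIM) -> float:
    """Approximate KV cache size in GB for one sequence of `tokens` tokens.

    Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
    bytes_per_value=2.0 models a BF16 cache, 1.0 an FP8 cache.
    """
    per_token_bytes = 2 * LAYERS * KV_HEADS * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


for ctx in (32_768, 65_536, 131_072, 204_800):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB (BF16 cache)")
```

Compare the result with what remains after subtracting the \~230GB of FP8 weights from the VRAM that `--gpu-memory-utilization` actually lets the server use. Concurrent requests multiply the cache requirement.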

***

## Option A: Ollama / GGUF (quantized)

{% hint style="warning" %}
**Community quants only.** MiniMax does not publish official GGUF weights for M2.7. Community Q4/Q5 builds typically appear 1–2 weeks after release: search [huggingface.co/models?search=minimax-m2.7+gguf](https://huggingface.co/models?search=minimax-m2.7+gguf) and vet the uploader. Quality varies for MoE quants below Q4.
{% endhint %}

```bash
# Once a community Q4_K_M build is available (check HuggingFace first)
docker exec ollama ollama pull minimax-m2.7:q4_K_M
docker exec ollama ollama run minimax-m2.7:q4_K_M

# Or run llama.cpp directly against a downloaded GGUF
docker run --gpus all -it --rm -p 8080:8080 \
  -v "$(pwd)/models:/models" \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/minimax-m2.7-q4_k_m.gguf \
  --n-gpu-layers 80 --ctx-size 32768 \
  --temp 1.0 --top-p 0.95 --top-k 40 \
  --port 8080 --host 0.0.0.0
```

Hobby use only. For real workloads, run vLLM or SGLang against the FP8 checkpoint.

***

## Option B: vLLM (production API, recommended)

vLLM is the primary serving target. Pull the official FP8 checkpoint: quality is close to BF16 at roughly half the VRAM.

### docker-compose.yml — 4× H100 80GB

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model MiniMaxAI/MiniMax-M2.7
      --quantization fp8
      --tensor-parallel-size 4
      --max-model-len 65536
      --gpu-memory-utilization 0.88
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --served-model-name minimax-m2.7
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

### docker-compose.yml — 2× H200 141GB

Lower `--tensor-parallel-size` to 2 and raise `--max-model-len` to use the extra headroom:

```yaml
    command: >
      --model MiniMaxAI/MiniMax-M2.7
      --quantization fp8
      --tensor-parallel-size 2
      --max-model-len 131072
      --gpu-memory-utilization 0.90
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --served-model-name minimax-m2.7
      --trust-remote-code
```

### Smoke test

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7",
    "messages": [
      {"role": "system", "content": "You are a senior engineer. Use Interleaved Thinking when reasoning across tool calls."},
      {"role": "user", "content": "Audit this Rust async handler for tokio cancellation safety: ..."}
    ],
    "max_tokens": 4096,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```

{% hint style="info" %}
**Do not lower `temperature` below 1.0.** MiniMax's recommended sampling settings are `temperature=1.0, top_p=0.95, top_k=40`. Greedy decoding silently breaks `<think>` interleaving in multi-turn tool calls.
{% endhint %}
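If you prefer to script the smoke test, the sketch below does the same thing from Python using only the standard library, with the recommended sampling values pinned in one place. It assumes the vLLM container from the compose file is listening on `localhost:8000` and serving the model as `minimax-m2.7`. Note that `top_k` is not part of the OpenAI schema; to our knowledge vLLM accepts it as an extra sampling field in the request body, so treat that as an assumption to verify against your vLLM version. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches the port mapping in the compose file
MODEL = "minimax-m2.7"                 # matches --served-model-name

# Recommended sampling settings; top_k is a vLLM extension field (assumption, see text).
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_payload(prompt: str, max_tokens: int = 4096) -> dict:
    """Build a chat-completions request body with the recommended sampling values."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **SAMPLING,
    }


def chat(prompt: str, timeout_s: float = 600.0) -> str:
    """POST the prompt to the local server and return the assistant text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Audit this Rust async handler for tokio cancellation safety: ...")` against a running server returns the reply text. Keeping the sampling values in a single constant makes it harder for a caller to fall back to greedy decoding by accident.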

***

## Option C: SGLang

SGLang's MoE scheduler is competitive with vLLM on Hopper and often wins on long-context coding completions, thanks to EAGLE speculative decoding combined with M2.7's MTP modules.

```bash
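# Pull the image; run the launch command below inside that container
# (or in any environment where sglang is installed)
docker pull lmsysorg/sglang:latest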

python3 -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.7 \
  --quantization fp8 \
  --tp-size 4 \
  --mem-fraction-static 0.88 \
  --context-length 65536 \
  --enable-mixed-chunk \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --served-model-name minimax-m2.7 \
  --trust-remote-code
```

Expect a throughput gain of \~1.5–2× over vanilla vLLM on long agent traces. On H200, lower `--tp-size` to 2.
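For intuition on where a gain of that size can come from, the standard speculative-decoding estimate gives the expected number of tokens produced per verification pass as a geometric sum over the draft length. The sketch below is a simplified model, not a description of SGLang's scheduler: it assumes each draft token is accepted independently with the same probability and ignores the cost of running the draft modules, so real speedups will be lower.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per target-model pass under i.i.d. acceptance.

    Computes sum_{i=0}^{k} a^i = (1 - a^(k+1)) / (1 - a), where a is the
    per-token acceptance rate and k the number of drafted tokens.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_tokens + 1)  # every draft accepted, plus the bonus token
    return (1.0 - acceptance_rate ** (draft_tokens + 1)) / (1.0 - acceptance_rate)


for rate in (0.5, 0.7, 0.9):
    print(f"acceptance {rate:.1f}, 3 drafts: {expected_tokens_per_step(rate, 3):.2f} tokens/step")
```

The acceptance rate depends heavily on the workload; measure it on your own agent traces instead of assuming a value.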

***

## Clore.ai GPU recommendations

| Setup                          | VRAM         | Expected performance                                  | Cost on Clore.ai |
| ------------------------------ | ------------ | ----------------------------------------------------- | ---------------- |
| 1× RTX 4090 24GB + RAM offload | 24GB + 128GB | INT4 hobby, \~5–10 tok/s                              | \~$1–2/day       |
| 4× A100 80GB                   | 320GB        | BF16 sharded, \~15–25 tok/s                           | \~$15–22/day     |
| **4× H100 80GB (FP8)**         | **320GB**    | **FP8 production, \~40–60 tok/s**                     | **\~$20–28/day** |
| **2× H200 141GB (FP8)**        | **282GB**    | **FP8 production, \~50–70 tok/s, full 200K context**  | **\~$18–26/day** |
| 8× H100 80GB                   | 640GB        | Full BF16, \~80+ tok/s                                | \~$40–55/day     |

{% hint style="success" %}
**Best value:** 2× H200 with the FP8 checkpoint. Same throughput class as 4× H100 with half as many tensor-parallel hops, often cheaper per day on the marketplace, and you keep enough VRAM headroom for the full 200K context.
{% endhint %}
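To compare rows in the table on a per-token basis, you can convert a daily rental price and a decode rate into dollars per million tokens. A minimal sketch, assuming the node is busy around the clock at the quoted single-stream rate (batched serving will produce more tokens per dollar); the example inputs are midpoints of the ranges in the table, not measurements:

```python
SECONDS_PER_DAY = 86_400


def usd_per_million_tokens(usd_per_day: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a sustained decode rate."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    return usd_per_day / tokens_per_day * 1_000_000


# Midpoints of the table ranges (illustrative only).
for name, price, rate in (("4x H100 FP8", 24.0, 50.0), ("2x H200 FP8", 22.0, 60.0)):
    print(f"{name}: ~${usd_per_million_tokens(price, rate):.2f} per 1M tokens")
```

Idle time dominates real costs for agent workloads, so treat the result as a lower bound per token.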

Rent the machines here:

* [**Rent H200 GPUs**](https://clore.ai/rent-h200.html): recommended for the 2× H200 FP8 deployment
* [**Rent H100 GPUs**](https://clore.ai/rent-h100.html): for the 4× H100 FP8 deployment
* [**Rent A100 80GB**](https://clore.ai/rent-a100-80gb.html): BF16 multi-GPU fallback
* [**Rent RTX 4090**](https://clore.ai/rent-4090.html): INT4 hobby use only
* [**Marketplace**](https://clore.ai/marketplace): full inventory, on-demand and spot bids

***

## Use cases

* **Multilingual SWE agents**: Rust, Go, Java, Kotlin, Swift, and TypeScript get first-class treatment, not just Python/JS
* **Long-horizon tool-calling loops**: Interleaved Thinking keeps the reasoning trace alive across hundreds of `tool_use` round trips
* **Codebase audits**: 200K context fits a mid-sized service component plus its tests in a single prompt
* **Refactoring pipelines**: long runs of file edits, where MTP-based speculative decoding keeps throughput up
* **Agent-of-agents orchestration**: use M2.7 as the planner and a smaller model (Qwen3.5, GLM-4.7-Flash) as the worker
* **Self-hosted alternative to Claude Sonnet/Opus** for non-commercial coding research, but **read the license first**

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed: verify independently.** The numbers below come from MiniMax's release notes of April 9, 2026. Independent reproductions are still coming in.
{% endhint %}

| Benchmark        | MiniMax M2.7 | Claude Sonnet 4.5 (vendor reference)  | Claude Opus 4.6 (vendor reference)  | GPT-5.3-Codex |
| ---------------- | ------------ | ------------------------------------- | ----------------------------------- | ------------- |
| SWE-Pro          | **56.22%**   | \~55%                                 | \~57.3%                             | 56.2%         |
| VIBE-Pro         | **55.6%**    | —                                     | \~57%                               | —             |
| Terminal Bench 2 | **57.0%**    | —                                     | —                                   | —             |
| GDPval-AA (ELO)  | **1495**     | —                                     | —                                   | —             |

MiniMax's framing: M2.7 matches or beats Claude Sonnet 4.5 on the agentic coding suite they care about, and lands within a few points of Claude Opus 4.6 on SWE-Pro / VIBE-Pro. Treat this as a directional signal, not a final ranking; the gap to closed frontier models is shrinking with each release.

***

## MiniMax M2 family

| Version  | Released         | Architectural focus                                   | Recommended for                               |
| -------- | ---------------- | ----------------------------------------------------- | --------------------------------------------- |
| M2       | Oct 2025         | First 229B MoE release, RL-tuned coding               | Reference / historical                        |
| M2.1     | Dec 2025         | **Interleaved Thinking** introduced                   | Earliest version worth using for agents       |
| M2.5     | Feb 2026         | Self-evolving RL post-training, longer context        | Solid coding model when disk space is limited |
| **M2.7** | **Apr 9, 2026**  | **Refined multilingual coding, MTP, official FP8**    | **Default choice: pick this one**             |

If you are starting fresh, skip the earlier versions and go straight to M2.7. The architectural improvements compound, and the FP8 ergonomics are noticeably better.

***

## Troubleshooting

| Problem                                | Fix                                                                                                                                      |
| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` while loading FP8   | The weights alone need \~230GB of VRAM. Use 4× H100 80GB or 2× H200 141GB, and first try lowering `--max-model-len` to 32768.            |
| Slow HuggingFace download              | `huggingface-cli download MiniMaxAI/MiniMax-M2.7 --local-dir ./weights --resume-download`. Expect \~230GB for FP8 / \~460GB for BF16.    |
| Tool calls silently dropped            | Set `--enable-auto-tool-choice --tool-call-parser hermes` in vLLM. M2.7 uses Hermes-style tool tags.                                     |
| `<think>` blocks empty or garbled      | Sampling must be `temperature=1.0, top_p=0.95, top_k=40`. Greedy decoding breaks Interleaved Thinking.                                   |
| MTP errors / shape mismatch            | Upgrade vLLM to the latest stable release; MTP support landed late, and older builds do not ship the modules.                            |
| 200K context causes OOMs on H100       | Use `--enable-chunked-prefill` and start at `--max-model-len 65536`. The full 200K context realistically requires H200.                  |
| License confusion                      | Default = non-commercial. Email `api@minimax.io` with the subject "M2.7 licensing" before any paid product use.                          |

***

## Next steps

* **Audio sibling model:** [MiniMax Speech](/guides/guides_v2-de/audio-and-sprache/minimax-speech.md) (same vendor, audio/voice generation)
* **Open-license alternative:** [GLM-5.1](/guides/guides_v2-de/sprachmodelle/glm-5-1.md) (744B / 40B active, MIT license, top SWE-Bench Pro)
* **Huge-context alternative:** [DeepSeek V4](/guides/guides_v2-de/sprachmodelle/deepseek-v4.md) (1M context, multimodal)
* **Cheaper agentic option:** [GLM-4.7 Flash](/guides/guides_v2-de/sprachmodelle/glm-47-flash.md) (fits on a single H100, MIT)
* **Clore.ai marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace) (H100/H200/A100 from the spot market)

### Links

* [MiniMax M2.7 on HuggingFace](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
* [MiniMax M2.7 LICENSE](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) (read before commercial use)
* [MiniMax platform](https://www.minimax.io)
* [vLLM documentation](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)
* [KTransformers](https://github.com/kvcache-ai/ktransformers)

