# MiMo-V2-Flash

> MiMo-V2-Flash is a **309-billion-parameter Mixture-of-Experts** language model that activates 15B parameters per token. Built with advanced speculative decoding (EAGLE/MTP), it delivers **150+ tokens/second** on 8×H100 while maintaining frontier-level performance. Released under **MIT license**, it represents the cutting edge of efficient large-scale inference.

## At a Glance

* **Model Size**: 309B total / 15B active parameters (MoE)
* **License**: MIT (fully commercial)
* **Context**: 32K tokens
* **Performance**: State-of-the-art on reasoning benchmarks
* **VRAM**: \~320GB for model weights (minimum 4×A100 80GB)
* **Speed**: 150+ tok/s on 8×H100 with speculative decoding

## Why MiMo-V2-Flash?

**Breakthrough Speed**: MiMo-V2-Flash achieves its high inference speeds through EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) and MTP (Multi-Token Prediction). Where traditional models generate one token per forward pass, MiMo-V2 drafts and verifies multiple tokens in parallel.
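
The draft-and-verify loop behind this is easy to see in miniature. The sketch below is plain Python with toy stand-ins (random acceptance, fake token strings), not MiMo's actual implementation: a cheap draft step proposes several tokens, the full model verifies them in one pass, and the longest accepted prefix plus one corrected token is kept.

```python
# Toy sketch of draft-and-verify speculative decoding (illustrative only).
import random

def draft_tokens(prefix, k):
    # A cheap draft head proposes k candidate tokens.
    return [f"draft_{len(prefix) + i}" for i in range(k)]

def verify(prefix, drafts):
    # The full model scores every draft in a single forward pass and keeps
    # the longest prefix it agrees with, plus one corrected token.
    accepted = 0
    for _ in drafts:
        if random.random() < 0.8:   # stand-in for per-token acceptance
            accepted += 1
        else:
            break
    correction = f"model_{len(prefix) + accepted}"
    return drafts[:accepted], correction

prefix, k = [], 8
while len(prefix) < 32:
    drafts = draft_tokens(prefix, k)
    accepted, correction = verify(prefix, drafts)
    # Up to k+1 tokens are emitted per full-model pass, versus exactly 1
    # in ordinary autoregressive decoding.
    prefix += accepted + [correction]
print(f"generated {len(prefix)} tokens")
```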

**Production-Ready Scale**: At 309B parameters, MiMo-V2-Flash competes with the largest frontier models while remaining deployable on realistic hardware configurations. The 15B active parameters ensure efficient inference despite the massive parameter count.
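
The 309B-total / 15B-active split is a consequence of top-k expert routing: for each token, a small router selects a handful of experts, and only those experts' weights take part in the forward pass. A minimal NumPy sketch of the idea follows; the expert count, top-k, and hidden size are illustrative placeholders, not MiMo's published configuration.

```python
# Toy top-k MoE routing: per token, only the chosen experts' weights are used,
# which is why active parameters are far smaller than total parameters.
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d_model = 64, 2, 16   # placeholder sizes

experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts))

def moe_layer(x):
    logits = x @ router                       # router score per expert
    chosen = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                      # softmax over the chosen experts
    # Only top_k of num_experts weight matrices are touched for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (16,) computed from 2 of 64 experts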

**Advanced Architecture**: Beyond standard MoE, MiMo-V2-Flash incorporates speculative decoding natively in the model architecture. This isn't a post-training add-on; because the draft mechanism is built into the foundation, speedups don't depend on pairing the model with a separately trained external draft model.

**Enterprise Quality**: MIT licensing with no usage restrictions. Deploy at scale, fine-tune, or integrate into commercial products without licensing concerns.

## GPU Recommendations

| Setup           | VRAM  | Performance    | Daily Cost\* |
| --------------- | ----- | -------------- | ------------ |
| **4×A100 80GB** | 320GB | \~80 tok/s     | \~$16.00     |
| **8×A100 40GB** | 320GB | \~70 tok/s     | \~$28.00     |
| **2×H100**      | 160GB | \~90 tok/s†    | \~$12.00     |
| **8×H100**      | 640GB | **150+ tok/s** | \~$48.00     |
| 4×H200          | 564GB | \~120 tok/s    | \~$32.00     |

**Best Value**: 4×A100 80GB provides excellent performance per dollar. **Maximum Performance**: 8×H100 unleashes full speculative decoding potential.

\*Estimated Clore.ai marketplace prices

†Below the \~320GB the full model weights require; this setup assumes a quantized variant.
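
To compare setups on cost per generated token rather than raw speed, fold the table's throughput and daily-price estimates together. The numbers below simply restate the estimates above and assume fully utilized, sustained generation, which real workloads rarely reach:

```python
# Rough $/1M-token figures from the table above (estimates only).
setups = {
    "4xA100 80GB": (80, 16.00),   # (tok/s, $/day)
    "8xA100 40GB": (70, 28.00),
    "8xH100":      (150, 48.00),
    "4xH200":      (120, 32.00),
}
for name, (tok_s, usd_per_day) in setups.items():
    tokens_per_day = tok_s * 86_400
    print(f"{name}: ~${usd_per_day / tokens_per_day * 1e6:.2f} per 1M tokens")
```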

## Deploy with SGLang (Recommended)

SGLang provides the best support for MiMo-V2-Flash's speculative decoding features:

### Install SGLang

```bash
pip install "sglang[all]>=0.3.0"
# or latest
pip install git+https://github.com/sgl-project/sglang.git
```
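
Before launching the server, it can be worth confirming which version actually got installed (the check below just reads package metadata and assumes the PyPI package name `sglang`, which is what the commands above install):

```python
# Confirm the installed sglang version meets the >=0.3.0 requirement.
from importlib.metadata import version
print("sglang", version("sglang"))
```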

### Multi-GPU Setup with MTP

```bash
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 8 \
  --enable-mtp \
  --mtp-max-draft-tokens 8 \
  --mtp-acceptance-rate 0.8 \
  --mem-fraction-static 0.85 \
  --dtype float16 \
  --context-length 32768 \
  --served-model-name mimo-v2-flash
```
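
Once the server logs show it is ready, a quick way to confirm the OpenAI-compatible endpoint is serving the model under the expected name (port 30000 is the default used throughout this guide):

```python
# List the models exposed by the running SGLang server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])   # expect ['mimo-v2-flash']
```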

### Query with OpenAI API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1", 
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[
        {"role": "system", "content": "You are an expert AI researcher."},
        {"role": "user", "content": "Explain the EAGLE speculative decoding algorithm and why it enables faster inference"}
    ],
    max_tokens=1024,
    temperature=0.7,
    stream=True  # Recommended for best latency
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
```

## Deploy with vLLM

vLLM also supports MiMo-V2-Flash with speculative decoding:

```bash
pip install "vllm>=0.6.0"

vllm serve mimo-ai/MiMo-V2-Flash \
  --tensor-parallel-size 8 \
  --speculative-model mimo-ai/MiMo-V2-Flash-Draft \
  --num-speculative-tokens 8 \
  --speculative-max-model-len 32768 \
  --speculative-draft-tensor-parallel-size 2 \
  --use-v2-block-manager \
  --dtype float16 \
  --served-model-name mimo-v2-flash \
  --trust-remote-code
```
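
vLLM exposes the same OpenAI-compatible API, so the Python client shown earlier works unchanged apart from the port (`vllm serve` listens on 8000 by default):

```python
# Same client code as before, pointed at vLLM's default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```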

## Docker Template

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies
RUN apt-get update && \
    apt-get install -y python3.10 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

# Install SGLang with MTP support
RUN pip install "sglang[all]>=0.3.0" transformers

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Pre-download model (optional, saves startup time)
# RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('mimo-ai/MiMo-V2-Flash')"

EXPOSE 30000

CMD ["python", "-m", "sglang.launch_server", \
     "--model-path", "mimo-ai/MiMo-V2-Flash", \
     "--host", "0.0.0.0", \
     "--port", "30000", \
     "--tp-size", "8", \
     "--enable-mtp", \
     "--mtp-max-draft-tokens", "8", \
     "--dtype", "float16"]
```

Run with all GPUs:

```bash
docker build -t mimo-v2-flash .
docker run --gpus all -p 30000:30000 \
  --shm-size=64g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  mimo-v2-flash
```

## Advanced Configuration

### Optimizing Speculative Decoding

Fine-tune speculative parameters based on your workload:

```bash
# For code generation (typically high acceptance; pair with a low request
# temperature such as 0.1 — temperature is set per request, not at launch)
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 8 \
  --enable-mtp \
  --mtp-max-draft-tokens 12 \
  --mtp-acceptance-rate 0.9

# For creative writing (typically lower acceptance; pair with a higher
# request temperature such as 0.8)
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 8 \
  --enable-mtp \
  --mtp-max-draft-tokens 6 \
  --mtp-acceptance-rate 0.7
```
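
The reason draft length and acceptance rate interact this way: in the standard idealized analysis of speculative decoding (independent per-token acceptance probability alpha, draft cost ignored), a draft of k tokens yields on average (1 - alpha^(k+1)) / (1 - alpha) emitted tokens per full-model pass. The sketch below tabulates that estimate; real speedups are lower because drafting isn't free.

```python
# Idealized tokens emitted per full-model forward pass for draft length k
# and per-token acceptance probability alpha (draft overhead ignored).
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.7, 0.8, 0.9):
    cells = [f"k={k}: {expected_tokens_per_step(alpha, k):.2f}" for k in (6, 8, 12)]
    print(f"alpha={alpha}:  " + "  ".join(cells))
# High-acceptance workloads (e.g. code) keep gaining from longer drafts;
# low-acceptance workloads flatten out after a few draft tokens.
```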

### Memory Optimization

For memory-constrained setups:

```bash
# Reduce memory usage (slower but fits 4×A100)
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 4 \
  --mem-fraction-static 0.75 \
  --context-length 16384 \
  --dtype float16 \
  --disable-cuda-graph  # Saves VRAM
```
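
Dropping `--context-length` helps because KV-cache memory grows linearly with context length and with the number of concurrent sequences. The estimate below uses placeholder architecture numbers (layer count, KV heads, and head dimension are assumptions for illustration, not MiMo-V2-Flash's published config):

```python
# Rough KV-cache size for a single sequence (placeholder architecture numbers).
def kv_cache_gb(context_len, layers=60, kv_heads=8, head_dim=128, kv_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V, all layers
    return per_token * context_len / 1e9

for ctx in (32_768, 16_384):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB KV cache per sequence")
```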

## Benchmarking Example

Test MiMo-V2-Flash's speed advantage:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def benchmark_generation():
    start_time = time.time()
    
    response = client.chat.completions.create(
        model="mimo-v2-flash",
        messages=[
            {"role": "user", "content": "Write a detailed explanation of quantum computing in exactly 500 words"}
        ],
        max_tokens=600,
        temperature=0.1,
        stream=False
    )
    
    end_time = time.time()
    content = response.choices[0].message.content
    
    # Prefer the server-reported token count; fall back to a rough word-based estimate
    usage = getattr(response, "usage", None)
    tokens = usage.completion_tokens if usage else len(content.split())
    duration = end_time - start_time
    tokens_per_second = tokens / duration
    
    print(f"Generated {tokens} tokens in {duration:.2f}s")
    print(f"Speed: {tokens_per_second:.1f} tokens/second")
    
    return tokens_per_second

# Run benchmark
speed = benchmark_generation()
print(f"\nMiMo-V2-Flash achieved {speed:.1f} tok/s")
```
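
For interactive workloads, time-to-first-token matters as much as raw throughput, and it is easy to measure with the same client by switching to streaming:

```python
# Measure time-to-first-token (TTFT) and post-TTFT chunk rate via streaming.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.time()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "List ten uses of speculative decoding."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.time()
        chunks += 1

total = time.time() - start
ttft = first_token_at - start
print(f"TTFT: {ttft:.2f}s, ~{chunks / max(total - ttft, 1e-6):.1f} chunks/s after first token")
```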

## Tips for Clore.ai Users

* **Multi-GPU Essential**: MiMo-V2-Flash requires minimum 4×A100 80GB. Single-GPU deployment isn't feasible.
* **NVLink Advantage**: Choose Clore.ai hosts with NVLink between GPUs for optimal multi-GPU communication.
* **RAM Requirements**: Ensure 256GB+ system RAM for smooth operation with 8 GPUs.
* **Speculative Tuning**: Adjust `mtp-max-draft-tokens` based on your use case — higher for repetitive tasks, lower for creative work.
* **Context Length**: The model supports up to 32K tokens. Longer contexts reduce speculative decoding effectiveness, so keep prompts as short as the task allows.

## Troubleshooting

| Issue                         | Solution                                                            |
| ----------------------------- | ------------------------------------------------------------------- |
| `OutOfMemoryError` on startup | Reduce `mem-fraction-static`, lower `context-length`, or increase `tp-size` |
| Slow inter-GPU communication  | Verify NVLink topology with `nvidia-smi topo -m` or `nvidia-smi nvlink -s`   |
| MTP not accelerating          | Check `mtp-acceptance-rate` — setting it too high effectively disables speculation |
| Model loading timeout         | Pre-download: `huggingface-cli download mimo-ai/MiMo-V2-Flash`      |
| Poor token acceptance         | Verify temperature settings — very low/high temps reduce acceptance |

## Performance Comparison

| Model                          | Size     | Speed          | Quality |
| ------------------------------ | -------- | -------------- | ------- |
| GPT-4 Turbo (hosted API)       | \~1.7T   | \~15-25 tok/s  | ★★★★★   |
| Claude 3.5 Sonnet (hosted API) | \~200B   | \~25-35 tok/s  | ★★★★★   |
| **MiMo-V2-Flash** (8×H100)     | **309B** | **150+ tok/s** | ★★★★☆   |
| Llama 3.1 405B (8×H100)        | 405B     | \~30-45 tok/s  | ★★★★☆   |

MiMo-V2-Flash achieves 3-5x speedup over comparable models while maintaining competitive quality.

## Resources

* [MiMo-V2-Flash on Hugging Face](https://huggingface.co/mimo-ai/MiMo-V2-Flash)
* [EAGLE Paper](https://arxiv.org/abs/2401.15077)
* [SGLang Documentation](https://sgl-project.github.io/start/install.html)
* [Multi-Token Prediction](https://arxiv.org/abs/2404.19737)
* [Speculative Decoding Guide](https://huggingface.co/blog/assisted-generation)
