# MiMo-V2-Flash

> MiMo-V2-Flash is a **309-billion-parameter Mixture-of-Experts** language model that activates 15B parameters per token. Built with advanced speculative decoding (EAGLE/MTP), it delivers **150+ tokens/second** on 8×H100 while maintaining frontier-level performance. Released under **MIT license**, it represents the cutting edge of efficient large-scale inference.

## At a Glance

* **Model Size**: 309B total / 15B active parameters (MoE)
* **License**: MIT (fully commercial)
* **Context**: 32K tokens
* **Performance**: State-of-the-art on reasoning benchmarks
* **VRAM**: \~320GB for model weights (minimum 4×A100 80GB)
* **Speed**: 150+ tok/s on 8×H100 with speculative decoding

## Why MiMo-V2-Flash?

**Breakthrough Speed**: MiMo-V2-Flash achieves its high inference speeds through EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) and MTP (Multi-Token Prediction). Where traditional models generate one token per forward pass, MiMo-V2 drafts and verifies multiple tokens in parallel.
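
The draft-and-verify loop behind this is easy to see in miniature. The sketch below is plain Python with toy stand-ins (random acceptance, fake token strings), not MiMo's actual implementation: a cheap draft step proposes several tokens, the full model verifies them in one pass, and the longest accepted prefix plus one corrected token is kept.

```python
# Toy sketch of draft-and-verify speculative decoding (illustrative only).
import random

def draft_tokens(prefix, k):
    # A cheap draft head proposes k candidate tokens.
    return [f"draft_{len(prefix) + i}" for i in range(k)]

def verify(prefix, drafts):
    # The full model scores every draft in a single forward pass and keeps
    # the longest prefix it agrees with, plus one corrected token.
    accepted = 0
    for _ in drafts:
        if random.random() < 0.8:   # stand-in for per-token acceptance
            accepted += 1
        else:
            break
    correction = f"model_{len(prefix) + accepted}"
    return drafts[:accepted], correction

prefix, k = [], 8
while len(prefix) < 32:
    drafts = draft_tokens(prefix, k)
    accepted, correction = verify(prefix, drafts)
    # Up to k+1 tokens are emitted per full-model pass, versus exactly 1
    # in ordinary autoregressive decoding.
    prefix += accepted + [correction]
print(f"generated {len(prefix)} tokens")
```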

**Production-Ready Scale**: At 309B parameters, MiMo-V2-Flash competes with the largest frontier models while remaining deployable on realistic hardware configurations. The 15B active parameters ensure efficient inference despite the massive parameter count.
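
The 309B-total / 15B-active split is a consequence of top-k expert routing: for each token, a small router selects a handful of experts, and only those experts' weights take part in the forward pass. A minimal NumPy sketch of the idea follows; the expert count, top-k, and hidden size are illustrative placeholders, not MiMo's published configuration.

```python
# Toy top-k MoE routing: per token, only the chosen experts' weights are used,
# which is why active parameters are far smaller than total parameters.
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d_model = 64, 2, 16   # placeholder sizes

experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts))

def moe_layer(x):
    logits = x @ router                       # router score per expert
    chosen = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                      # softmax over the chosen experts
    # Only top_k of num_experts weight matrices are touched for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (16,) computed from 2 of 64 experts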

**Advanced Architecture**: Beyond standard MoE, MiMo-V2-Flash incorporates speculative decoding natively in the model architecture. This isn't a post-training add-on; because the draft mechanism is built into the foundation, speedups don't depend on pairing the model with a separately trained external draft model.

**Enterprise Quality**: MIT licensing with no usage restrictions. Deploy at scale, fine-tune, or integrate into commercial products without licensing concerns.

## GPU Recommendations

| Setup           | VRAM  | Performance    | Daily Cost\* |
| --------------- | ----- | -------------- | ------------ |
| **4×A100 80GB** | 320GB | \~80 tok/s     | \~$16.00     |
| **8×A100 40GB** | 320GB | \~70 tok/s     | \~$28.00     |
| **2×H100**      | 160GB | \~90 tok/s†    | \~$12.00     |
| **8×H100**      | 640GB | **150+ tok/s** | \~$48.00     |
| 4×H200          | 564GB | \~120 tok/s    | \~$32.00     |

**Best Value**: 4×A100 80GB provides excellent performance per dollar. **Maximum Performance**: 8×H100 unleashes full speculative decoding potential.

\*Estimated Clore.ai marketplace prices

†Below the \~320GB the full model weights require; this setup assumes a quantized variant.
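
To compare setups on cost per generated token rather than raw speed, fold the table's throughput and daily-price estimates together. The numbers below simply restate the estimates above and assume fully utilized, sustained generation, which real workloads rarely reach:

```python
# Rough $/1M-token figures from the table above (estimates only).
setups = {
    "4xA100 80GB": (80, 16.00),   # (tok/s, $/day)
    "8xA100 40GB": (70, 28.00),
    "8xH100":      (150, 48.00),
    "4xH200":      (120, 32.00),
}
for name, (tok_s, usd_per_day) in setups.items():
    tokens_per_day = tok_s * 86_400
    print(f"{name}: ~${usd_per_day / tokens_per_day * 1e6:.2f} per 1M tokens")
```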

## Deploy with SGLang (Recommended)

SGLang provides the best support for MiMo-V2-Flash's speculative decoding features:

### Install SGLang

```bash
pip install "sglang[all]>=0.3.0"
# or latest
pip install git+https://github.com/sgl-project/sglang.git
```
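
Before launching the server, it can be worth confirming which version actually got installed (the check below just reads package metadata and assumes the PyPI package name `sglang`, which is what the commands above install):

```python
# Confirm the installed sglang version meets the >=0.3.0 requirement.
from importlib.metadata import version
print("sglang", version("sglang"))
```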

### Multi-GPU Setup with MTP

```bash
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 8 \
  --enable-mtp \
  --mtp-max-draft-tokens 8 \
  --mtp-acceptance-rate 0.8 \
  --mem-fraction-static 0.85 \
  --dtype float16 \
  --context-length 32768 \
  --served-model-name mimo-v2-flash
```
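
Once the server logs show it is ready, a quick way to confirm the OpenAI-compatible endpoint is serving the model under the expected name (port 30000 is the default used throughout this guide):

```python
# List the models exposed by the running SGLang server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])   # expect ['mimo-v2-flash']
```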

### Query with OpenAI API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1", 
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[
        {"role": "system", "content": "You are an expert AI researcher."},
        {"role": "user", "content": "Explain the EAGLE speculative decoding algorithm and why it enables faster inference"}
    ],
    max_tokens=1024,
    temperature=0.7,
    stream=True  # Recommended for best latency
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
```

## Deploy with vLLM

vLLM also supports MiMo-V2-Flash with speculative decoding:

```bash
pip install "vllm>=0.6.0"

vllm serve mimo-ai/MiMo-V2-Flash \
  --tensor-parallel-size 8 \
  --speculative-model mimo-ai/MiMo-V2-Flash-Draft \
  --num-speculative-tokens 8 \
  --speculative-max-model-len 32768 \
  --speculative-draft-tensor-parallel-size 2 \
  --use-v2-block-manager \
  --dtype float16 \
  --served-model-name mimo-v2-flash \
  --trust-remote-code
```
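
vLLM exposes the same OpenAI-compatible API, so the Python client shown earlier works unchanged apart from the port (`vllm serve` listens on 8000 by default):

```python
# Same client code as before, pointed at vLLM's default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```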

## Docker Template

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies
RUN apt-get update && \
    apt-get install -y python3.10 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

# Install SGLang with MTP support
RUN pip install "sglang[all]>=0.3.0" transformers

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Pre-download model (optional, saves startup time)
# RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('mimo-ai/MiMo-V2-Flash')"

EXPOSE 30000

CMD ["python", "-m", "sglang.launch_server", \
     "--model-path", "mimo-ai/MiMo-V2-Flash", \
     "--host", "0.0.0.0", \
     "--port", "30000", \
     "--tp-size", "8", \
     "--enable-mtp", \
     "--mtp-max-draft-tokens", "8", \
     "--dtype", "float16"]
```

Run with all GPUs:

```bash
docker build -t mimo-v2-flash .
docker run --gpus all -p 30000:30000 \
  --shm-size=64g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  mimo-v2-flash
```

## Advanced Configuration

### Optimizing Speculative Decoding

Fine-tune speculative parameters based on your workload:

```bash
# For code generation (typically high acceptance; pair with a low request
# temperature such as 0.1 — temperature is set per request, not at launch)
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 8 \
  --enable-mtp \
  --mtp-max-draft-tokens 12 \
  --mtp-acceptance-rate 0.9

# For creative writing (typically lower acceptance; pair with a higher
# request temperature such as 0.8)
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 8 \
  --enable-mtp \
  --mtp-max-draft-tokens 6 \
  --mtp-acceptance-rate 0.7
```
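
The reason draft length and acceptance rate interact this way: in the standard idealized analysis of speculative decoding (independent per-token acceptance probability alpha, draft cost ignored), a draft of k tokens yields on average (1 - alpha^(k+1)) / (1 - alpha) emitted tokens per full-model pass. The sketch below tabulates that estimate; real speedups are lower because drafting isn't free.

```python
# Idealized tokens emitted per full-model forward pass for draft length k
# and per-token acceptance probability alpha (draft overhead ignored).
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.7, 0.8, 0.9):
    cells = [f"k={k}: {expected_tokens_per_step(alpha, k):.2f}" for k in (6, 8, 12)]
    print(f"alpha={alpha}:  " + "  ".join(cells))
# High-acceptance workloads (e.g. code) keep gaining from longer drafts;
# low-acceptance workloads flatten out after a few draft tokens.
```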

### Memory Optimization

For memory-constrained setups:

```bash
# Reduce memory usage (slower but fits 4×A100)
python -m sglang.launch_server \
  --model-path mimo-ai/MiMo-V2-Flash \
  --tp-size 4 \
  --mem-fraction-static 0.75 \
  --context-length 16384 \
  --dtype float16 \
  --disable-cuda-graph  # Saves VRAM
```
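
Dropping `--context-length` helps because KV-cache memory grows linearly with context length and with the number of concurrent sequences. The estimate below uses placeholder architecture numbers (layer count, KV heads, and head dimension are assumptions for illustration, not MiMo-V2-Flash's published config):

```python
# Rough KV-cache size for a single sequence (placeholder architecture numbers).
def kv_cache_gb(context_len, layers=60, kv_heads=8, head_dim=128, kv_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V, all layers
    return per_token * context_len / 1e9

for ctx in (32_768, 16_384):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB KV cache per sequence")
```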

## Benchmarking Example

Test MiMo-V2-Flash's speed advantage:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def benchmark_generation():
    start_time = time.time()
    
    response = client.chat.completions.create(
        model="mimo-v2-flash",
        messages=[
            {"role": "user", "content": "Write a detailed explanation of quantum computing in exactly 500 words"}
        ],
        max_tokens=600,
        temperature=0.1,
        stream=False
    )
    
    end_time = time.time()
    content = response.choices[0].message.content
    
    # Prefer the server-reported token count; fall back to a rough word-based estimate
    usage = getattr(response, "usage", None)
    tokens = usage.completion_tokens if usage else len(content.split())
    duration = end_time - start_time
    tokens_per_second = tokens / duration
    
    print(f"Generated {tokens} tokens in {duration:.2f}s")
    print(f"Speed: {tokens_per_second:.1f} tokens/second")
    
    return tokens_per_second

# Run benchmark
speed = benchmark_generation()
print(f"\nMiMo-V2-Flash achieved {speed:.1f} tok/s")
```
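
For interactive workloads, time-to-first-token matters as much as raw throughput, and it is easy to measure with the same client by switching to streaming:

```python
# Measure time-to-first-token (TTFT) and post-TTFT chunk rate via streaming.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.time()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "List ten uses of speculative decoding."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.time()
        chunks += 1

total = time.time() - start
ttft = first_token_at - start
print(f"TTFT: {ttft:.2f}s, ~{chunks / max(total - ttft, 1e-6):.1f} chunks/s after first token")
```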

## Tips for Clore.ai Users

* **Multi-GPU Essential**: MiMo-V2-Flash requires minimum 4×A100 80GB. Single-GPU deployment isn't feasible.
* **NVLink Advantage**: Choose Clore.ai hosts with NVLink between GPUs for optimal multi-GPU communication.
* **RAM Requirements**: Ensure 256GB+ system RAM for smooth operation with 8 GPUs.
* **Speculative Tuning**: Adjust `mtp-max-draft-tokens` based on your use case — higher for repetitive tasks, lower for creative work.
* **Context Length**: The model supports up to 32K tokens. Longer contexts reduce speculative decoding effectiveness, so keep prompts as short as the task allows.

## Troubleshooting

| Issue                         | Solution                                                            |
| ----------------------------- | ------------------------------------------------------------------- |
| `OutOfMemoryError` on startup | Reduce `mem-fraction-static`, lower `context-length`, or increase `tp-size` |
| Slow inter-GPU communication  | Verify NVLink topology with `nvidia-smi topo -m` or `nvidia-smi nvlink -s`   |
| MTP not accelerating          | Check `mtp-acceptance-rate` — setting it too high effectively disables speculation |
| Model loading timeout         | Pre-download: `huggingface-cli download mimo-ai/MiMo-V2-Flash`      |
| Poor token acceptance         | Verify temperature settings — very low/high temps reduce acceptance |

## Performance Comparison

| Model                          | Size     | Speed          | Quality |
| ------------------------------ | -------- | -------------- | ------- |
| GPT-4 Turbo (hosted API)       | \~1.7T   | \~15-25 tok/s  | ★★★★★   |
| Claude 3.5 Sonnet (hosted API) | \~200B   | \~25-35 tok/s  | ★★★★★   |
| **MiMo-V2-Flash** (8×H100)     | **309B** | **150+ tok/s** | ★★★★☆   |
| Llama 3.1 405B (8×H100)        | 405B     | \~30-45 tok/s  | ★★★★☆   |

MiMo-V2-Flash achieves 3-5x speedup over comparable models while maintaining competitive quality.

## Resources

* [MiMo-V2-Flash on Hugging Face](https://huggingface.co/mimo-ai/MiMo-V2-Flash)
* [EAGLE Paper](https://arxiv.org/abs/2401.15077)
* [SGLang Documentation](https://sgl-project.github.io/start/install.html)
* [Multi-Token Prediction](https://arxiv.org/abs/2404.19737)
* [Speculative Decoding Guide](https://huggingface.co/blog/assisted-generation)
