# Kimi K2.5

Kimi K2.5, released January 27, 2026 by Moonshot AI, is a **1-trillion-parameter Mixture-of-Experts multimodal model** with 32B active parameters per token. Built through continual pretraining on \~15 trillion mixed visual and text tokens atop Kimi-K2-Base, it natively understands text, images, and video. K2.5 introduces **Agent Swarm** technology — coordinating up to 100 specialized agents in parallel — and achieves frontier-level performance on coding (76.8% SWE-bench Verified), vision, and agentic tasks. The weights are available under an **open-weight license** on HuggingFace.

## Key Features

* **1T total / 32B active** — 384-expert MoE architecture with MLA attention and SwiGLU
* **Native multimodal** — pre-trained on vision–language tokens; understands images, video, and text
* **Agent Swarm** — decomposes complex tasks into parallel sub-tasks via dynamically spawned agents
* **256K context window** — fits entire codebases, long documents, and video transcripts
* **Hybrid reasoning** — supports both instant mode (fast) and thinking mode (deep reasoning)
* **Strong coding** — 76.8% SWE-bench Verified, 73.0% SWE-bench Multilingual

## Requirements

Kimi K2.5 is a massive model — the FP8 checkpoint is \~630GB. Self-hosting requires serious hardware.

| Component | Quantized (GGUF Q2)     | FP8 Full      |
| --------- | ----------------------- | ------------- |
| GPU       | 1× RTX 4090 + 256GB RAM | 8× H200 141GB |
| VRAM      | 24GB + CPU offload      | 1,128GB       |
| RAM       | 256GB+                  | 256GB         |
| Disk      | 400GB SSD               | 700GB NVMe    |
| CUDA      | 12.0+                   | 12.0+         |

**Clore.ai recommendation**: For full-precision serving, rent 8× H200 (\~$24–48/day). For quantized local inference, a single H100 80GB or even RTX 4090 + heavy CPU offloading works at reduced speed.
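The disk figures above can be sanity-checked with back-of-envelope arithmetic: checkpoint size ≈ total parameters × average bits per weight. The \~3 bits/weight figure used below for Q2\_K\_XL is an assumption (Unsloth's dynamic quants mix several quant types per layer), but it lands right on the published shard total:

```python
def gguf_size_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a quantized checkpoint in GB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# 1T parameters at an assumed ~3 bits/weight average:
size = gguf_size_gb(1000, 3.0)
print(f"~{size:.0f} GB")  # → ~375 GB, matching the Q2_K_XL shard total
```

The same formula explains the 400GB SSD recommendation in the table: the quantized shards plus a little headroom.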

## Quick Start with llama.cpp (Quantized)

The most accessible way to run K2.5 locally — using Unsloth's GGUF quantizations:

```bash
# Clone and build llama.cpp
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

# Download quantized model (Q2_K_XL — 375GB, good quality/size balance)
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  Kimi-K2.5-UD-Q2_K_XL-00001-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00002-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00003-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00004-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00005-of-00005.gguf \
  --local-dir ./models

# Run inference — point at the first shard; llama.cpp loads the rest automatically
# (adjust --n-gpu-layers for your VRAM)
./build/bin/llama-server \
  -m ./models/Kimi-K2.5-UD-Q2_K_XL-00001-of-00005.gguf \
  --n-gpu-layers 10 \
  --threads 32 \
  --ctx-size 16384 \
  --host 0.0.0.0 --port 8080
```

> **Note**: Vision is not yet supported in GGUF/llama.cpp for K2.5. For multimodal features, use vLLM.

## vLLM Setup (Production — Full Model)

For production serving with full multimodal support:

```bash
# Install vLLM nightly (K2.5 requires the latest build).
# Note: --index-strategy is a uv flag, so use `uv pip`, not plain pip.
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match
```

### Serve on 8× H200 GPUs

```bash
vllm serve moonshotai/Kimi-K2.5 \
  -tp 8 \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90
```

### Query with Text

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "Write a FastAPI service with WebSocket support for real-time chat"}
    ],
    temperature=0.6,
    max_tokens=4096
)
print(response.choices[0].message.content)
```

### Query with Image (Multimodal)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY", timeout=3600)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/diagram.png"}
            },
            {
                "type": "text",
                "text": "Describe this diagram in detail and extract all text."
            }
        ]
    }],
    max_tokens=2048
)
print(response.choices[0].message.content)
```
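The example above uses a public URL; OpenAI-compatible servers also accept base64 data URLs in the same `image_url` slot, which is how you send local files. A small helper (the MIME-type guess falls back to PNG; adjust for your files):

```python
import base64
import mimetypes
from pathlib import Path

def to_data_url(path: str) -> str:
    """Encode a local image as a data URL for an image_url content part."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Drop the result into the same message structure as above:
# {"type": "image_url", "image_url": {"url": to_data_url("diagram.png")}}
```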

## API Access (No GPU Required)

If self-hosting is overkill, use Moonshot's official API:

```python
from openai import OpenAI

# Moonshot Platform — OpenAI-compatible API
client = OpenAI(
    api_key="your-moonshot-api-key",
    base_url="https://api.moonshot.ai/v1"
)

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {"role": "user", "content": "Explain the Agent Swarm architecture in Kimi K2.5"}
    ],
    temperature=0.6,
    max_tokens=2048
)
print(response.choices[0].message.content)
```

## Tool Calling

K2.5 excels at agentic tool use:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Search a codebase for relevant files and functions",
        "parameters": {
            "type": "object",
            "required": ["query"],
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "Find all authentication-related code in the project"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.6
)

for tool_call in (response.choices[0].message.tool_calls or []):
    print(f"Function: {tool_call.function.name}")
    print(f"Args: {json.loads(tool_call.function.arguments)}")
```
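After the model emits a tool call, you execute the function yourself and feed the result back as a `role: "tool"` message so the model can compose a final answer. A sketch of that second leg (`run_search_code` is a hypothetical local implementation, not part of any API):

```python
import json

def tool_result_message(tool_call_id: str, result: object) -> dict:
    """Wrap a locally computed tool result in the message shape the API expects."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }

# For each tool_call in the first response:
#   args = json.loads(tool_call.function.arguments)
#   result = run_search_code(**args)   # hypothetical local function
#   messages.append(tool_result_message(tool_call.id, result))
# then call client.chat.completions.create(...) again with the extended messages.
```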

## Docker Quick Start

```bash
# Using vLLM Docker with 8 GPUs
docker run --gpus all -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --trust-remote-code
```

## Tips for Clore.ai Users

* **API vs self-hosting tradeoff**: Full K2.5 needs 8× H200 at \~$24–48/day. Moonshot's API is free-tier or pay-per-token — use API for exploration, self-host for sustained production loads.
* **Quantized on single GPU**: The Unsloth GGUF Q2\_K\_XL (\~375GB) can run on an RTX 4090 ($0.5–2/day) with 256GB RAM via CPU offloading — expect \~5–10 tok/s. Good enough for personal use and development.
* **Text-only K2 for budget setups**: If you don't need vision, `moonshotai/Kimi-K2-Instruct` is the text-only predecessor — same 1T MoE but lighter to deploy (no vision encoder overhead).
* **Set temperature correctly**: Use `temperature=0.6` for instant mode, `temperature=1.0` for thinking mode. Wrong temperature causes repetition or incoherence.
* **Expert Parallelism for throughput**: On multi-node setups, use `--enable-expert-parallel` in vLLM for higher throughput. Check vLLM docs for EP configuration.
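The temperature rule of thumb above is easy to codify so callers can't mix the modes up. The mode names here are just labels for the two recommended settings; `min_p` comes from the troubleshooting table below:

```python
# Recommended sampling settings per K2.5 mode (per the tips above)
SAMPLING = {
    "instant":  {"temperature": 0.6, "min_p": 0.01},
    "thinking": {"temperature": 1.0, "min_p": 0.01},
}

def sampling_params(mode: str) -> dict:
    """Return a copy of the recommended sampling settings for a K2.5 mode."""
    if mode not in SAMPLING:
        raise ValueError(f"unknown mode: {mode!r} (use 'instant' or 'thinking')")
    return dict(SAMPLING[mode])
```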

## Troubleshooting

| Issue                              | Solution                                                                           |
| ---------------------------------- | ---------------------------------------------------------------------------------- |
| `OutOfMemoryError` with full model | Need 8× H200 (1128GB total). Use FP8 weights, set `--gpu-memory-utilization 0.90`. |
| GGUF inference very slow           | Ensure enough RAM for the quant size. Q2\_K\_XL needs \~375GB RAM+VRAM combined.   |
| Vision not working in llama.cpp    | Vision support for K2.5 GGUF is not available yet — use vLLM for multimodal.       |
| Repetitive output                  | Set `temperature=0.6` (instant) or `1.0` (thinking). Add `min_p=0.01`.             |
| Model download takes forever       | \~630GB FP8 checkpoint. Use `huggingface-cli download` with `--resume-download`.   |
| Tool calls not parsed              | Add `--tool-call-parser kimi_k2 --enable-auto-tool-choice` to vLLM serve command.  |

## Further Reading

* [Kimi K2.5 on HuggingFace](https://huggingface.co/moonshotai/Kimi-K2.5)
* [Kimi K2.5 Tech Blog](https://www.kimi.com/blog/kimi-k2-5.html)
* [Kimi K2.5 Paper](https://arxiv.org/abs/2602.02276)
* [vLLM K2.5 Recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html)
* [Unsloth GGUF Quantizations](https://huggingface.co/unsloth/Kimi-K2.5-GGUF)
* [Moonshot API Platform](https://platform.moonshot.ai)
* [Kimi K2 GitHub](https://github.com/MoonshotAI/Kimi-K2)
