# Kimi K2.5

Kimi K2.5, released on January 27, 2026, by Moonshot AI, is a **1-trillion-parameter Mixture-of-Experts multimodal model** with 32B active parameters per token. Built through continual pretraining on \~15 trillion mixed visual and text tokens atop Kimi-K2-Base, it natively understands text, images, and video. K2.5 introduces **Agent Swarm** technology, coordinating up to 100 specialized AI agents simultaneously, and achieves frontier-level performance on coding (76.8% SWE-bench Verified), vision, and agentic tasks. The weights are available under an **open-weight license** on HuggingFace.

## Key Features

* **1T total / 32B active** — 384-expert MoE architecture with MLA attention and SwiGLU
* **Native multimodal** — pre-trained on vision–language tokens; understands images, video, and text
* **Agent Swarm** — decomposes complex tasks into parallel sub-tasks via dynamically spawned agents
* **256K context window** — fits entire codebases, long documents, and video transcripts
* **Hybrid reasoning** — supports both instant mode (fast) and thinking mode (deep reasoning)
* **Strong coding** — 76.8% SWE-bench Verified, 73.0% SWE-bench Multilingual

## Requirements

Kimi K2.5 is a massive model — the FP8 checkpoint is \~630GB. Self-hosting requires serious hardware.

| Component | Quantized (GGUF Q2)     | FP8 Full      |
| --------- | ----------------------- | ------------- |
| GPU       | 1× RTX 4090 + 256GB RAM | 8× H200 141GB |
| VRAM      | 24GB + CPU offload      | 1,128GB       |
| RAM       | 256GB+                  | 256GB         |
| Disk      | 400GB SSD               | 700GB NVMe    |
| CUDA      | 12.0+                   | 12.0+         |

**Clore.ai recommendation**: For full-precision serving, rent 8× H200 (\~$24–48/day). For quantized local inference, a single H100 80GB or even RTX 4090 + heavy CPU offloading works at reduced speed.
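
As a rough back-of-envelope check before renting hardware, you can compare checkpoint size against available VRAM plus RAM. This is a sketch, not an official sizing tool: the `fits` helper and the 10% overhead factor are assumptions, and real usage also depends on context length and KV-cache precision.

```python
# Rough sizing helper using the numbers from the table above (illustrative only).
def fits(checkpoint_gb: float, vram_gb: float, ram_gb: float, overhead: float = 1.1) -> bool:
    """True if VRAM + system RAM can plausibly hold the weights plus ~10% runtime overhead."""
    return (vram_gb + ram_gb) >= checkpoint_gb * overhead

# FP8 full model on 8x H200: ~630GB of weights into 1,128GB of VRAM
print(fits(630, 8 * 141, 0))    # True

# Q2_K_XL GGUF (~375GB) on one RTX 4090 with 256GB RAM: falls short, so llama.cpp
# relies on mmap/CPU offload and runs at reduced speed
print(fits(375, 24, 256))       # False
```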

## Quick Start with llama.cpp (Quantized)

The most accessible way to run K2.5 locally is with Unsloth's GGUF quantizations and llama.cpp:

```bash
# Clone and build llama.cpp
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

# Download quantized model (Q2_K_XL — 375GB, good quality/size balance)
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  Kimi-K2.5-UD-Q2_K_XL-00001-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00002-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00003-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00004-of-00005.gguf \
  Kimi-K2.5-UD-Q2_K_XL-00005-of-00005.gguf \
  --local-dir ./models

# Run inference (adjust --n-gpu-layers for your VRAM)
./build/bin/llama-server \
  -m ./models/Kimi-K2.5-UD-Q2_K_XL-00001-of-00005.gguf \
  --n-gpu-layers 10 \
  --threads 32 \
  --ctx-size 16384 \
  --host 0.0.0.0 --port 8080
```

> **Note**: Vision is not yet supported in GGUF/llama.cpp for K2.5. For multimodal features, use vLLM.
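
Once `llama-server` is running, it exposes an OpenAI-compatible endpoint on the port configured above, so the same client code shown in the vLLM sections below works against it. A minimal text-only sketch (llama.cpp serves whichever single model it loaded, so the `model` name is just a label):

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server started above (port 8080)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="kimi-k2.5",  # name is not checked against the loaded GGUF
    messages=[{"role": "user", "content": "Summarize Mixture-of-Experts routing in two sentences."}],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```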

## vLLM Setup (Production — Full Model)

For production serving with full multimodal support:

```bash
# Install vLLM nightly with uv (K2.5 needs a recent build; --index-strategy is a uv flag)
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match
```

### Serve on 8× H200 GPUs

```bash
vllm serve moonshotai/Kimi-K2.5 \
  -tp 8 \
  --mm-encoder-tp-mode data \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90
```
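
Loading \~630GB of weights across 8 GPUs takes a while. Before sending real traffic, you can confirm the server is ready with the standard OpenAI-compatible `/v1/models` endpoint (default vLLM port 8000 assumed):

```python
from openai import OpenAI

# Lists the models the server has registered; errors until loading finishes.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])  # expect ['moonshotai/Kimi-K2.5']
```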

### Query with Text

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "Write a FastAPI service with WebSocket support for real-time chat"}
    ],
    temperature=0.6,
    max_tokens=4096
)
print(response.choices[0].message.content)
```

### Query with Image (Multimodal)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY", timeout=3600)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/diagram.png"}
            },
            {
                "type": "text",
                "text": "Describe this diagram in detail and extract all text."
            }
        ]
    }],
    max_tokens=2048
)
print(response.choices[0].message.content)
```
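
For images that are not publicly reachable by URL, the OpenAI-compatible API also accepts base64 data URLs. A sketch using a local file (the `diagram.png` path is a placeholder):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local image as a data URL
with open("diagram.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": "What does this diagram show?"},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```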

## API Access (No GPU Required)

If self-hosting is overkill, use Moonshot's official API:

```python
from openai import OpenAI

# Moonshot Platform — OpenAI-compatible API
client = OpenAI(
    api_key="your-moonshot-api-key",
    base_url="https://api.moonshot.ai/v1"
)

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {"role": "user", "content": "Explain the Agent Swarm architecture in Kimi K2.5"}
    ],
    temperature=0.6,
    max_tokens=2048
)
print(response.choices[0].message.content)
```
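
Because the endpoint is OpenAI-compatible, streaming works with the standard `stream=True` flag. A short sketch reusing the client above, useful for long answers where you want tokens as they arrive:

```python
# Stream the reply token-by-token instead of waiting for the full completion
stream = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Outline a migration plan from REST to gRPC."}],
    temperature=0.6,
    max_tokens=2048,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```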

## Tool Calling

K2.5 excels at agentic tool use:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Search a codebase for relevant files and functions",
        "parameters": {
            "type": "object",
            "required": ["query"],
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "Find all authentication-related code in the project"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.6
)

# tool_calls is None when the model answers directly instead of calling a tool
for tool_call in response.choices[0].message.tool_calls or []:
    print(f"Function: {tool_call.function.name}")
    print(f"Args: {json.loads(tool_call.function.arguments)}")
```
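
To close the agentic loop, execute the requested function locally and send the result back as a `tool` message, then ask the model again. This sketch continues the snippet above and follows the standard OpenAI tool-calling flow; the `search_code` implementation is a stub for illustration:

```python
# Hypothetical local implementation of the search_code tool
def search_code(query: str) -> str:
    return json.dumps({"matches": ["auth/middleware.py", "auth/jwt.py"]})

messages = [{"role": "user", "content": "Find all authentication-related code in the project"}]
assistant_msg = response.choices[0].message
messages.append(assistant_msg)  # keep the assistant turn that requested the tool

for tool_call in assistant_msg.tool_calls or []:
    args = json.loads(tool_call.function.arguments)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": search_code(**args),
    })

# Second round-trip: the model now sees the tool results and can answer in natural language
final = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    tools=tools,
    temperature=0.6,
)
print(final.choices[0].message.content)
```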

## Docker Quick Start

```bash
# Using vLLM Docker with 8 GPUs
docker run --gpus all -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code
```

## Tips for Clore.ai Users

* **API vs self-hosting tradeoff**: Full-precision K2.5 needs 8× H200 at \~$24–48/day, while Moonshot's API offers a free tier and pay-per-token pricing. Use the API for exploration and self-host for sustained production loads.
* **Quantized on single GPU**: The Unsloth GGUF Q2\_K\_XL (\~375GB) can run on an RTX 4090 ($0.5–2/day) with 256GB RAM via CPU offloading — expect \~5–10 tok/s. Good enough for personal use and development.
* **Text-only K2 for budget setups**: If you don't need vision, `moonshotai/Kimi-K2-Instruct` is the text-only predecessor — same 1T MoE but lighter to deploy (no vision encoder overhead).
* **Set temperature correctly**: Use `temperature=0.6` for instant mode and `temperature=1.0` for thinking mode; the wrong setting causes repetition or incoherence. See the sketch after this list.
* **Expert Parallelism for throughput**: On multi-node setups, use `--enable-expert-parallel` in vLLM for higher throughput. Check vLLM docs for EP configuration.
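
A minimal sketch of the temperature guidance above. The mode names and `max_tokens` values are illustrative choices, not documented API parameters:

```python
from openai import OpenAI

# Sampling presets per mode, following the temperature guidance above
SAMPLING = {
    "instant": {"temperature": 0.6, "max_tokens": 2048},
    "thinking": {"temperature": 1.0, "max_tokens": 8192},
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt: str, mode: str = "instant") -> str:
    """Send a single-turn prompt using the sampling preset for the chosen mode."""
    response = client.chat.completions.create(
        model="moonshotai/Kimi-K2.5",
        messages=[{"role": "user", "content": prompt}],
        **SAMPLING[mode],
    )
    return response.choices[0].message.content

print(ask("Summarize the tradeoffs of Mixture-of-Experts models.", mode="thinking"))
```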

## Troubleshooting

| Issue                              | Solution                                                                           |
| ---------------------------------- | ---------------------------------------------------------------------------------- |
| `OutOfMemoryError` with full model | Need 8× H200 (1128GB total). Use FP8 weights, set `--gpu-memory-utilization 0.90`. |
| GGUF inference very slow           | Ensure enough RAM for the quant size. Q2\_K\_XL needs \~375GB RAM+VRAM combined.   |
| Vision not working in llama.cpp    | Vision support for K2.5 GGUF is not available yet — use vLLM for multimodal.       |
| Repetitive output                  | Set `temperature=0.6` (instant) or `1.0` (thinking). Add `min_p=0.01`.             |
| Model download takes forever       | \~630GB FP8 checkpoint. Use `huggingface-cli download` with `--resume-download`, or the Python sketch below. |
| Tool calls not parsed              | Add `--tool-call-parser kimi_k2 --enable-auto-tool-choice` to vLLM serve command.  |
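
As an alternative to the CLI, the `huggingface_hub` Python API can fetch the checkpoint with resumable transfers (interrupted downloads pick up where they left off on re-run). A sketch; the local directory and worker count are illustrative:

```python
from huggingface_hub import snapshot_download

# Download the full FP8 checkpoint (~630GB); safe to re-run after an interruption
snapshot_download(
    repo_id="moonshotai/Kimi-K2.5",
    local_dir="./models/Kimi-K2.5",
    max_workers=8,  # tune to your network bandwidth
)
```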

## Further Reading

* [Kimi K2.5 on HuggingFace](https://huggingface.co/moonshotai/Kimi-K2.5)
* [Kimi K2.5 Tech Blog](https://www.kimi.com/blog/kimi-k2-5.html)
* [Kimi K2.5 Paper](https://arxiv.org/abs/2602.02276)
* [vLLM K2.5 Recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html)
* [Unsloth GGUF Quantizations](https://huggingface.co/unsloth/Kimi-K2.5-GGUF)
* [Moonshot API Platform](https://platform.moonshot.ai)
* [Kimi K2 GitHub](https://github.com/MoonshotAI/Kimi-K2)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/kimi-k2.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
