Kimi K2.5

Deploy Kimi K2.5 (1T MoE multimodal) by Moonshot AI on Clore.ai GPUs

Kimi K2.5, released January 27, 2026 by Moonshot AI, is a 1-trillion-parameter Mixture-of-Experts multimodal model with 32B active parameters per token. Built through continual pretraining on ~15 trillion mixed visual and text tokens atop Kimi-K2-Base, it natively understands text, images, and video. K2.5 introduces Agent Swarm technology — coordinating up to 100 specialized AI agents simultaneously — and achieves frontier-level performance on coding (76.8% SWE-bench Verified), vision, and agentic tasks. The weights are available under an open-weight license on HuggingFace.

Key Features

  • 1T total / 32B active — 384-expert MoE architecture with MLA attention and SwiGLU

  • Native multimodal — pre-trained on vision–language tokens; understands images, video, and text

  • Agent Swarm — decomposes complex tasks into parallel sub-tasks via dynamically spawned agents

  • 256K context window — process entire codebases, long documents, and video transcripts

  • Hybrid reasoning — supports both instant mode (fast) and thinking mode (deep reasoning)

  • Strong coding — 76.8% SWE-bench Verified, 73.0% SWE-bench Multilingual

Requirements

Kimi K2.5 is a massive model — the FP8 checkpoint is ~630GB. Self-hosting requires serious hardware.

| Component | Quantized (GGUF Q2) | FP8 Full |
| --- | --- | --- |
| GPU | 1× RTX 4090 + 256GB RAM | 8× H200 141GB |
| VRAM | 24GB + CPU offload | 1,128GB |
| RAM | 256GB+ | 256GB |
| Disk | 400GB SSD | 700GB NVMe |
| CUDA | 12.0+ | 12.0+ |

Clore.ai recommendation: For full-precision serving, rent 8× H200 (~$24–48/day). For quantized local inference, a single H100 80GB or even RTX 4090 + heavy CPU offloading works at reduced speed.

Quick Start with llama.cpp (Quantized)

The most accessible way to run K2.5 locally is with Unsloth's GGUF quantizations:
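A minimal sketch, assuming Unsloth publishes the K2.5 quants under a repo named like `unsloth/Kimi-K2.5-GGUF` (verify the exact repo and shard filenames on their HuggingFace page before downloading):

```shell
# Repo name and filenames below are assumptions; check HuggingFace first.
pip install -U "huggingface_hub[cli]"

# Download only the Q2_K_XL quant (~375GB), resumable
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "*Q2_K_XL*" \
  --local-dir ./kimi-k2.5-gguf

# Point llama-cli at the first shard; llama.cpp loads the remaining shards
# automatically. Offload as many layers as fit in 24GB VRAM, the rest runs
# from system RAM.
./llama-cli \
  -m ./kimi-k2.5-gguf/<first-Q2_K_XL-shard>.gguf \
  --n-gpu-layers 20 \
  --ctx-size 16384 \
  --temp 0.6 \
  -p "Explain MoE expert routing in two sentences."
```

Tune `--n-gpu-layers` to your VRAM; too high a value causes CUDA out-of-memory errors at load time.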

Note: Vision is not yet supported in GGUF/llama.cpp for K2.5. For multimodal features, use vLLM.

vLLM Setup (Production — Full Model)

For production serving with full multimodal support:

Serve on 8× H200 GPUs
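A serving sketch for the full FP8 checkpoint; the model ID and exact flag set are assumptions, so check the HuggingFace model card and vLLM docs for the authoritative invocation:

```shell
# Sketch: tensor-parallel across all 8 GPUs, full 256K context.
# Tool-calling flags match the troubleshooting note below.
vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --trust-remote-code
```

This exposes an OpenAI-compatible API at `http://localhost:8000/v1`.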

Query with Text
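A minimal stdlib-only client sketch against a local vLLM server at `http://localhost:8000/v1`; the model name is an assumption and must match whatever name vLLM is serving under:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # your vLLM server


def build_request(prompt: str, thinking: bool = False) -> dict:
    """Build an OpenAI-compatible chat payload for Kimi K2.5."""
    return {
        "model": "moonshotai/Kimi-K2.5",  # assumption: match your served model name
        "messages": [{"role": "user", "content": prompt}],
        # 0.6 for instant mode, 1.0 for thinking mode (see Tips below)
        "temperature": 1.0 if thinking else 0.6,
    }


def chat(prompt: str, thinking: bool = False) -> str:
    """Send the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(prompt, thinking)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```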

Query with Image (Multimodal)
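A multimodal sketch in the same OpenAI-compatible format: vLLM accepts `image_url` content parts for vision models, and the model name here is again an assumption:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # your vLLM server


def build_image_request(image_url: str, question: str) -> dict:
    """OpenAI-style multimodal payload: one image part plus one text part."""
    return {
        "model": "moonshotai/Kimi-K2.5",  # assumption: match your served model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "temperature": 0.6,
    }


def ask_about_image(image_url: str, question: str) -> str:
    """POST the multimodal payload and return the model's answer."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_image_request(image_url, question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Local files can be sent the same way as base64 `data:` URLs in the `url` field.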

API Access (No GPU Required)

If self-hosting is overkill, use Moonshot's official API:
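A hedged sketch against Moonshot's OpenAI-compatible endpoint; the base URL and model ID are assumptions, so check the official platform docs, and export `MOONSHOT_API_KEY` in your environment first:

```python
import json
import os
import urllib.request

# Assumption: Moonshot's international OpenAI-compatible endpoint.
MOONSHOT_URL = "https://api.moonshot.ai/v1/chat/completions"


def build_payload(prompt: str, model: str = "kimi-k2.5") -> dict:
    """Chat payload; the model ID is an assumption, verify it in Moonshot's docs."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }


def moonshot_chat(prompt: str) -> str:
    """Call the hosted API with bearer-token auth and return the reply text."""
    req = urllib.request.Request(
        MOONSHOT_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```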

Tool Calling

K2.5 excels at agentic tool use:
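A sketch of OpenAI-style function calling against a vLLM server started with `--tool-call-parser kimi_k2 --enable-auto-tool-choice`; the `get_weather` tool and its stub executor are hypothetical:

```python
import json

# Hypothetical weather tool, declared in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def build_tool_request(prompt: str) -> dict:
    """Chat payload advertising the tools; model name is an assumption."""
    return {
        "model": "moonshotai/Kimi-K2.5",
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "tool_choice": "auto",
        "temperature": 0.6,
    }


def handle_response(message: dict) -> list:
    """When the reply carries tool_calls, parse the arguments and run each tool.
    The executor here is a stub; a real agent loop would feed the results back
    as role="tool" messages."""
    results = []
    for call in message.get("tool_calls", []):
        args = json.loads(call["function"]["arguments"])
        if call["function"]["name"] == "get_weather":
            results.append(f"Weather lookup for {args['city']}")
    return results
```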

Docker Quick Start
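A one-container sketch using vLLM's official image; the image tag, model ID, and flags are assumptions to adapt to your rig:

```shell
# Sketch: serve K2.5 from the vLLM OpenAI-compatible image on all 8 GPUs.
# Everything after the image name is passed to the vLLM server.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --trust-remote-code
```

Mounting the HuggingFace cache avoids re-downloading the ~630GB checkpoint on container restarts.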

Tips for Clore.ai Users

  • API vs self-hosting tradeoff: Full K2.5 needs 8× H200 at ~$24–48/day. Moonshot's API is free-tier or pay-per-token — use API for exploration, self-host for sustained production loads.

  • Quantized on single GPU: The Unsloth GGUF Q2_K_XL (~375GB) can run on an RTX 4090 ($0.5–2/day) with 256GB RAM via CPU offloading — expect ~5–10 tok/s. Good enough for personal use and development.

  • Text-only K2 for budget setups: If you don't need vision, moonshotai/Kimi-K2-Instruct is the text-only predecessor — same 1T MoE but lighter to deploy (no vision encoder overhead).

  • Set temperature correctly: Use temperature=0.6 for instant mode, temperature=1.0 for thinking mode. Wrong temperature causes repetition or incoherence.

  • Expert Parallelism for throughput: On multi-node setups, use --enable-expert-parallel in vLLM for higher throughput. Check vLLM docs for EP configuration.

Troubleshooting

| Issue | Solution |
| --- | --- |
| OutOfMemoryError with full model | Needs 8× H200 (1,128GB total). Use FP8 weights and set --gpu-memory-utilization 0.90. |
| GGUF inference very slow | Ensure enough RAM for the quant size; Q2_K_XL needs ~375GB RAM+VRAM combined. |
| Vision not working in llama.cpp | Vision support for K2.5 GGUF is not available yet — use vLLM for multimodal. |
| Repetitive output | Set temperature=0.6 (instant) or 1.0 (thinking). Add min_p=0.01. |
| Model download takes forever | The FP8 checkpoint is ~630GB. Use huggingface-cli download with --resume-download. |
| Tool calls not parsed | Add --tool-call-parser kimi_k2 --enable-auto-tool-choice to the vLLM serve command. |
