Kimi K2.5
Deploy Kimi K2.5 (1T MoE multimodal) by Moonshot AI on Clore.ai GPUs
Kimi K2.5, released January 27, 2026 by Moonshot AI, is a 1-trillion parameter Mixture-of-Experts multimodal model with 32B active parameters per token. Built through continual pretraining on ~15 trillion mixed visual and text tokens atop the Kimi-K2-Base, it natively understands text, images, and video. K2.5 introduces Agent Swarm technology — coordinating up to 100 specialized AI agents simultaneously — and achieves frontier-level performance on coding (76.8% SWE-bench Verified), vision, and agentic tasks. Available under an open-weight license on HuggingFace.
Key Features
1T total / 32B active — 384-expert MoE architecture with MLA attention and SwiGLU
Native multimodal — pre-trained on vision–language tokens; understands images, video, and text
Agent Swarm — decomposes complex tasks into parallel sub-tasks via dynamically spawned agents
256K context window — process entire codebases, long documents, and video transcripts
Hybrid reasoning — supports both instant mode (fast) and thinking mode (deep reasoning)
Strong coding — 76.8% SWE-bench Verified, 73.0% SWE-bench Multilingual
Requirements
Kimi K2.5 is a massive model — the FP8 checkpoint is ~630GB. Self-hosting requires serious hardware.
| Requirement | Minimum (quantized) | Recommended (full FP8) |
| --- | --- | --- |
| GPU | 1× RTX 4090 + 256GB RAM | 8× H200 141GB |
| VRAM | 24GB + CPU offload | 1,128GB |
| RAM | 256GB+ | 256GB |
| Disk | 400GB SSD | 700GB NVMe |
| CUDA | 12.0+ | 12.0+ |
Clore.ai recommendation: For full-precision serving, rent 8× H200 (~$24–48/day). For quantized local inference, a single H100 80GB or even RTX 4090 + heavy CPU offloading works at reduced speed.
Quick Start with llama.cpp (Quantized)
The most accessible way to run K2.5 locally is with Unsloth's GGUF quantizations:
Note: Vision is not yet supported in GGUF/llama.cpp for K2.5. For multimodal features, use vLLM.
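A typical download-and-serve flow looks like the following. The repo and shard file names are assumptions (check Unsloth's HuggingFace page for the actual K2.5 GGUF repo), and `-ngl` should be tuned to however many layers fit in your VRAM:

```shell
# Download the Q2_K_XL quant (~375GB). Repo and file names are assumptions —
# verify the exact names on Unsloth's HuggingFace page before running.
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "*Q2_K_XL*" \
  --local-dir ./kimi-k2.5-gguf

# Serve with partial GPU offload on a 24GB card: -ngl sets how many layers
# go to the GPU; the rest stay in system RAM. Point -m at the first shard.
llama-server \
  -m ./kimi-k2.5-gguf/Kimi-K2.5-Q2_K_XL-00001-of-00008.gguf \
  -ngl 20 \
  --ctx-size 32768 \
  --host 0.0.0.0 --port 8080
```

`llama-server` exposes an OpenAI-compatible endpoint on port 8080, so the same client code used for vLLM works here for text-only queries.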
vLLM Setup (Production — Full Model)
For production serving with full multimodal support:
Serve on 8× H200 GPUs
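A plausible serve command, assuming the HuggingFace model id is `moonshotai/Kimi-K2.5` (verify before use). The max context length matches the advertised 256K window:

```shell
# Full FP8 serving across 8× H200. Model id is an assumption — confirm on
# HuggingFace. Tool-calling flags enable K2-style function calling.
vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 262144 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice
```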
Query with Text
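A minimal text query can be sketched with the standard library alone. The model id and port are assumptions; match whatever you passed to `vllm serve`:

```python
import json
from urllib import request

# Chat-completion payload for vLLM's OpenAI-compatible
# /v1/chat/completions endpoint.
payload = {
    "model": "moonshotai/Kimi-K2.5",  # assumption; match your serve args
    "messages": [
        {"role": "user", "content": "Summarize MoE routing in two sentences."}
    ],
    "temperature": 0.6,  # instant mode; use 1.0 for thinking mode
    "max_tokens": 256,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Requires a running server — uncomment to actually send the request:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```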
Query with Image (Multimodal)
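Images go into the same endpoint using the OpenAI-style multimodal content list, typically as a base64 data URL. A sketch (the stand-in bytes below replace a real image file; model id is again an assumption):

```python
import base64
import json

# In a real deployment, read an actual image:
#   image_bytes = open("chart.png", "rb").read()
image_bytes = b"\x89PNG\r\n\x1a\n"  # stand-in bytes for illustration
data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()

# Multimodal message: a content list mixing text and image_url parts.
payload = {
    "model": "moonshotai/Kimi-K2.5",  # assumption; match your serve args
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
    "temperature": 0.6,
}
print(json.dumps(payload)[:80])
```

POST this payload exactly as in the text-only case; the server decodes the data URL and routes it through the vision encoder.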
API Access (No GPU Required)
If self-hosting is overkill, use Moonshot's official API:
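The API is OpenAI-compatible. A curl sketch, with the base URL and model name as assumptions (confirm both on Moonshot's platform docs before use):

```shell
# Base URL and model name are assumptions — verify on platform.moonshot.ai.
export MOONSHOT_API_KEY="sk-..."  # your key from the Moonshot console

curl https://api.moonshot.ai/v1/chat/completions \
  -H "Authorization: Bearer $MOONSHOT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kimi-k2.5",
        "messages": [{"role": "user", "content": "Hello, Kimi!"}],
        "temperature": 0.6
      }'
```

Because the API speaks the OpenAI protocol, the Python examples above work unchanged by swapping the base URL and adding the `Authorization` header.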
Tool Calling
K2.5 excels at agentic tool use:
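Tools are declared in the OpenAI function-calling schema, which is what the `kimi_k2` tool-call parser in vLLM emits results against. A sketch with a purely illustrative weather tool (nothing here is a real Moonshot-provided function):

```python
import json

# OpenAI-style tool declaration; the get_weather tool is illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "moonshotai/Kimi-K2.5",  # assumption; match your deployment
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide when to call the tool
    "temperature": 0.6,
}
print(json.dumps(payload["tools"], indent=2))
```

When the model decides to call the tool, the response contains a `tool_calls` entry with the function name and JSON arguments; your code executes the call and appends the result as a `role: "tool"` message.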
Docker Quick Start
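The same vLLM deployment can run from vLLM's official OpenAI-server image; model id and `--shm-size` are assumptions to adjust for your setup:

```shell
# vLLM's official server image. Mount the HF cache so the ~630GB download
# survives container restarts; --shm-size matters for tensor parallelism.
docker run --gpus all --shm-size 16g -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  vllm/vllm-openai:latest \
  --model moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --trust-remote-code
```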
Tips for Clore.ai Users
API vs self-hosting tradeoff: Full K2.5 needs 8× H200 at ~$24–48/day. Moonshot's API is free-tier or pay-per-token — use API for exploration, self-host for sustained production loads.
Quantized on single GPU: The Unsloth GGUF Q2_K_XL (~375GB) can run on an RTX 4090 ($0.5–2/day) with 256GB RAM via CPU offloading — expect ~5–10 tok/s. Good enough for personal use and development.
Text-only K2 for budget setups: If you don't need vision, moonshotai/Kimi-K2-Instruct is the text-only predecessor — same 1T MoE but lighter to deploy (no vision encoder overhead).
Set temperature correctly: Use temperature=0.6 for instant mode, temperature=1.0 for thinking mode. Wrong temperature causes repetition or incoherence.
Expert Parallelism for throughput: On multi-node setups, use --enable-expert-parallel in vLLM for higher throughput. Check vLLM docs for EP configuration.
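The expert-parallelism tip above can be sketched as a serve command; treat this as a single-node starting point (model id assumed, and multi-node EP needs additional cluster configuration per the vLLM docs):

```shell
# Expert parallelism shards the 384 MoE experts across GPUs instead of
# replicating them, which can raise throughput for large-expert models.
vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --trust-remote-code
```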
Troubleshooting
OutOfMemoryError with full model
Need 8× H200 (1128GB total). Use FP8 weights, set --gpu-memory-utilization 0.90.
GGUF inference very slow
Ensure enough RAM for the quant size. Q2_K_XL needs ~375GB RAM+VRAM combined.
Vision not working in llama.cpp
Vision support for K2.5 GGUF is not available yet — use vLLM for multimodal.
Repetitive output
Set temperature=0.6 (instant) or 1.0 (thinking). Add min_p=0.01.
Model download takes forever
~630GB FP8 checkpoint. Use huggingface-cli download with --resume-download.
Tool calls not parsed
Add --tool-call-parser kimi_k2 --enable-auto-tool-choice to vLLM serve command.