GLM-5
Deploy GLM-5 (744B MoE) by Zhipu AI on Clore.ai — API access and self-hosting with vLLM
GLM-5, released February 2026 by Zhipu AI (Z.AI), is a 744-billion parameter Mixture-of-Experts language model that activates only 40B parameters per token. It achieves best-in-class open-source performance on reasoning, coding, and agentic tasks — scoring 77.8% on SWE-bench Verified and rivaling frontier models like Claude Opus 4.5 and GPT-5.2. The model is available under the MIT license on HuggingFace.
Key Features
744B total / 40B active — 256-expert MoE with highly efficient routing
Frontier coding performance — 77.8% SWE-bench Verified, 73.3% SWE-bench Multilingual
Deep reasoning — 92.7% on AIME 2026, 96.9% on HMMT Nov 2025, built-in thinking mode
Agentic capabilities — native tool calling, function execution, and long-horizon task planning
200K+ context window — handles massive codebases and long documents
MIT license — fully open weights, commercial use permitted
Requirements
Self-hosting GLM-5 is a serious undertaking — the FP8 checkpoint requires ~860GB VRAM.
|      | Minimum       | Recommended    |
| ---- | ------------- | -------------- |
| GPU  | 8× H100 80GB  | 8× H200 141GB  |
| VRAM | 640GB         | 1,128GB        |
| RAM  | 256GB         | 512GB          |
| Disk | 1.5TB NVMe    | 2TB NVMe       |
| CUDA | 12.0+         | 12.4+          |
Clore.ai recommendation: For most users, access GLM-5 via API (Z.AI, OpenRouter). Self-hosting only makes sense if you can rent 8× H100/H200 (~$24–48/day on Clore.ai).
API Access (Recommended for Most Users)
For most users, the most practical way to use GLM-5, whether from a Clore.ai machine or anywhere else, is through a hosted API:
Via Z.AI Platform
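A minimal sketch using only Python's standard library against Z.AI's OpenAI-compatible chat endpoint. The base URL and the model id `glm-5` are assumptions; confirm the exact values in the Z.AI platform docs before use.

```python
import json
import urllib.request

# Assumed OpenAI-compatible endpoint and model id -- verify against Z.AI docs.
ZAI_URL = "https://api.z.ai/api/paas/v4/chat/completions"

def zai_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat-completion request for the Z.AI API."""
    payload = {
        "model": "glm-5",  # assumed model id
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ZAI_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# To actually send (needs a valid API key):
# with urllib.request.urlopen(zai_request("Hello, GLM-5", "YOUR_API_KEY")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the official `openai` Python client also works if you point its `base_url` at the Z.AI API.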
Via OpenRouter
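The same pattern works via OpenRouter; the model slug `z-ai/glm-5` below is an assumption, so check the model's OpenRouter page for the exact id.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def openrouter_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat-completion request via OpenRouter's OpenAI-compatible API."""
    payload = {
        "model": "z-ai/glm-5",  # assumed slug -- check the OpenRouter model page
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```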
vLLM Setup (Self-Hosting)
For those with access to high-end multi-GPU machines on Clore.ai:
Serve FP8 on 8× H200 GPUs
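A sketch of the launch command, assuming a nightly vLLM build (see Troubleshooting below) and the flag names of current vLLM releases; verify with `vllm serve --help` on your installed version.

```shell
# Serve the FP8 checkpoint across all 8 GPUs with tensor parallelism.
# --gpu-memory-utilization 0.85 leaves headroom for long-context spikes.
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.85
```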
Query the Server
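Once the server is up, it exposes the standard OpenAI-compatible API on port 8000. A stdlib-only sketch; the `chat_template_kwargs` nesting for the thinking toggle is an assumption based on vLLM conventions and may differ by server version.

```python
import json
import urllib.request

LOCAL_URL = "http://localhost:8000/v1/chat/completions"

def local_request(prompt: str, thinking: bool = True) -> urllib.request.Request:
    """Build a chat request for the local vLLM server."""
    payload = {
        # Default served model name is the HF repo path.
        "model": "zai-org/GLM-5-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # thinking mode needs non-zero temperature
        "chat_template_kwargs": {"enable_thinking": thinking},  # assumed nesting
    }
    return urllib.request.Request(
        LOCAL_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To send against a running server:
# with urllib.request.urlopen(local_request("Write a binary search in Python")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```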
SGLang Alternative
SGLang also supports GLM-5 and may offer better performance on some hardware:
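A sketch of an SGLang launch under the same 8-GPU assumption; flag names follow current SGLang conventions, and whether SGLang accepts the `glm47` parser name is an assumption to verify in its docs.

```shell
# Launch SGLang's OpenAI-compatible server with 8-way tensor parallelism.
python -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp 8 \
  --host 0.0.0.0 --port 30000 \
  --tool-call-parser glm47
```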
Docker Quick Start
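A sketch using vLLM's official `vllm/vllm-openai` image. Since GLM-5 needs a nightly vLLM build (see Troubleshooting), pick an image tag recent enough to include it; the tag below is a placeholder.

```shell
# Arguments after the image name are passed straight to the vLLM server.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 --enable-auto-tool-choice
```

Mounting the HuggingFace cache avoids re-downloading the ~800GB checkpoint on every container restart.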
Tool Calling Example
GLM-5 has native tool-calling support — ideal for building agentic applications:
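A sketch against a local vLLM server started with the tool-calling flags above. The `get_weather` tool is a hypothetical example; any OpenAI-style function schema works the same way.

```python
import json
import urllib.request

# Hypothetical tool definition in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def tool_call_request(prompt: str) -> urllib.request.Request:
    """Chat request that lets the model decide whether to call get_weather."""
    payload = {
        "model": "zai-org/GLM-5-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "tool_choice": "auto",
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# When the model chooses to call the tool, the response carries the function
# name and JSON arguments in choices[0].message.tool_calls.
```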
Tips for Clore.ai Users
API first, self-host second: GLM-5 requires 8× H200 (~$24–48/day on Clore.ai). For occasional use, the Z.AI API or OpenRouter is far more cost-effective. Self-host only if you need sustained throughput or data privacy.
Consider GLM-4.7 instead: If 8× H200 is too much, the predecessor GLM-4.7 (355B, 32B active) runs on 4× H200 or 4× H100 (~$12–24/day) and still delivers excellent performance.
Use FP8 weights: Always use zai-org/GLM-5-FP8 — same quality as BF16 but nearly half the memory footprint. The BF16 version requires 16× GPUs.
Monitor VRAM usage: watch nvidia-smi — long-context queries can spike memory. Set --gpu-memory-utilization 0.85 to leave headroom.
Thinking mode tradeoff: Thinking mode produces better results for complex tasks but uses more tokens and time. Disable it for simple queries with enable_thinking: false.
Troubleshooting
OutOfMemoryError on startup
Ensure you have 8× H200 (141GB each). FP8 needs ~860GB total VRAM.
Slow downloads (~800GB)
Use huggingface-cli download zai-org/GLM-5-FP8 with --local-dir to resume.
vLLM version mismatch
GLM-5 requires vLLM nightly. Install via pip install -U vllm --pre.
Tool calls not working
Add --tool-call-parser glm47 --enable-auto-tool-choice to serve command.
DeepGEMM errors
Install DeepGEMM for FP8: use the install_deepgemm.sh script from vLLM repo.
Thinking mode output empty
Set temperature=1.0 — thinking mode requires non-zero temperature.
Further Reading