GLM-5

Deploy GLM-5 (744B MoE) by Zhipu AI on Clore.ai — API access and self-hosting with vLLM

GLM-5, released February 2026 by Zhipu AI (Z.AI), is a 744-billion parameter Mixture-of-Experts language model that activates only 40B parameters per token. It achieves best-in-class open-source performance on reasoning, coding, and agentic tasks — scoring 77.8% on SWE-bench Verified and rivaling frontier models like Claude Opus 4.5 and GPT-5.2. The model is available under the MIT license on HuggingFace.

Key Features

  • 744B total / 40B active — 256-expert MoE with highly efficient routing

  • Frontier coding performance — 77.8% SWE-bench Verified, 73.3% SWE-bench Multilingual

  • Deep reasoning — 92.7% on AIME 2026, 96.9% on HMMT Nov 2025, built-in thinking mode

  • Agentic capabilities — native tool calling, function execution, and long-horizon task planning

  • 200K+ token context window — handles massive codebases and long documents

  • MIT license — fully open weights, commercial use permitted

Requirements

Self-hosting GLM-5 is a serious undertaking — the FP8 checkpoint requires ~860GB VRAM.

| Component | Minimum (FP8) | Recommended |
| --- | --- | --- |
| GPU | 8× H100 80GB | 8× H200 141GB |
| VRAM | 640GB | 1,128GB |
| RAM | 256GB | 512GB |
| Disk | 1.5TB NVMe | 2TB NVMe |
| CUDA | 12.0+ | 12.4+ |

Clore.ai recommendation: For most users, access GLM-5 via API (Z.AI, OpenRouter). Self-hosting only makes sense if you can rent 8× H100/H200 (~$24–48/day on Clore.ai).

The most practical way to use GLM-5, whether from a Clore.ai machine or anywhere else, is via a hosted API:

Via Z.AI Platform
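A minimal sketch of a chat completion request against Z.AI's OpenAI-compatible API. The endpoint path and the glm-5 model id are assumptions; check the Z.AI platform docs for the exact values for your account.

```shell
# Sketch: chat completion via Z.AI's OpenAI-compatible API.
# The endpoint path and "glm-5" model id are assumptions -- verify
# against the current Z.AI platform documentation.
export ZAI_API_KEY="your-key-here"

curl https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5",
    "messages": [{"role": "user", "content": "Write a Python quicksort."}]
  }'
```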

Via OpenRouter
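OpenRouter exposes the same OpenAI-style chat completions interface. The z-ai/glm-5 model slug below is an assumption; confirm the exact id on the OpenRouter models page.

```shell
# Sketch: GLM-5 via OpenRouter. The "z-ai/glm-5" slug is an assumption.
export OPENROUTER_API_KEY="your-key-here"

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-ai/glm-5",
    "messages": [{"role": "user", "content": "Summarize the GLM-5 architecture in two sentences."}]
  }'
```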

vLLM Setup (Self-Hosting)

For those with access to high-end multi-GPU machines on Clore.ai:

Serve FP8 on 8× H200 GPUs
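A sketch of the serve command, assuming a recent vLLM nightly build. The tool-call flags come from the Troubleshooting section below; exact flag support may differ by vLLM version.

```shell
# Sketch: serve the FP8 checkpoint across 8 GPUs with tensor parallelism.
# Flag names follow current vLLM conventions; verify against your version.
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.85 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice
```

The --gpu-memory-utilization 0.85 setting leaves headroom for long-context KV cache spikes, as noted in the tips below.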

Query the Server
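vLLM serves an OpenAI-compatible endpoint on port 8000 by default, so plain curl (or any OpenAI client) works. The model name must match what the server registered, which is the repo id by default.

```shell
# Sketch: query the local vLLM server via its OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "Explain MoE routing in one paragraph."}],
    "temperature": 1.0
  }'
```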

SGLang Alternative

SGLang also supports GLM-5 and may offer better performance on some hardware:
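A minimal launch sketch, assuming an SGLang release with GLM-5 support; check the SGLang docs for any model-specific flags.

```shell
# Sketch: serve GLM-5 FP8 with SGLang across 8 GPUs.
python -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp 8 \
  --host 0.0.0.0 \
  --port 30000
```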

Docker Quick Start
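A sketch using vLLM's official OpenAI-compatible server image. Since GLM-5 requires a vLLM nightly (see Troubleshooting), the :latest tag may be too old; the correct tag is an assumption to verify.

```shell
# Sketch: containerized vLLM server. Arguments after the image name are
# passed to vLLM. The image tag needed for GLM-5 support is an assumption.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8
```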

Tool Calling Example

GLM-5 has native tool-calling support — ideal for building agentic applications:
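A hedged sketch of an OpenAI-style tool definition sent to a locally served model. The get_weather function is hypothetical, used only to illustrate the request shape; it assumes the server was started with the tool-call flags shown in Troubleshooting.

```shell
# Sketch: tool-calling request. "get_weather" is a hypothetical tool
# defined only for illustration; the model should respond with a
# tool_calls entry rather than a plain message.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```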

Tips for Clore.ai Users

  • API first, self-host second: GLM-5 requires 8× H200 (~$24–48/day on Clore.ai). For occasional use, the Z.AI API or OpenRouter is far more cost-effective. Self-host only if you need sustained throughput or data privacy.

  • Consider GLM-4.7 instead: If 8× H200 is too much, the predecessor GLM-4.7 (355B, 32B active) runs on 4× H200 or 4× H100 (~$12–24/day) and still delivers excellent performance.

  • Use FP8 weights: Always use zai-org/GLM-5-FP8 — same quality as BF16 at nearly half the memory footprint. The BF16 version needs a 16-GPU setup.

  • Monitor VRAM usage: watch nvidia-smi — long context queries can spike memory. Set --gpu-memory-utilization 0.85 to leave headroom.

  • Thinking mode tradeoff: Thinking mode produces better results for complex tasks but uses more tokens and time. Disable it for simple queries with enable_thinking: false.
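The thinking-mode toggle above can be sketched as a per-request option. Whether GLM-5 accepts enable_thinking via chat_template_kwargs (as some vLLM-served models do) is an assumption; check the model card for the exact mechanism.

```shell
# Sketch: disable thinking mode for a simple query.
# The chat_template_kwargs mechanism is an assumption to verify.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```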

Troubleshooting

| Issue | Solution |
| --- | --- |
| OutOfMemoryError on startup | Ensure you have 8× H200 (141GB each). FP8 needs ~860GB total VRAM. |
| Slow downloads (~800GB) | Use huggingface-cli download zai-org/GLM-5-FP8 with --local-dir to resume. |
| vLLM version mismatch | GLM-5 requires vLLM nightly. Install via pip install -U vllm --pre. |
| Tool calls not working | Add --tool-call-parser glm47 --enable-auto-tool-choice to the serve command. |
| DeepGEMM errors | Install DeepGEMM for FP8: use the install_deepgemm.sh script from the vLLM repo. |
| Thinking mode output empty | Set temperature=1.0; thinking mode requires non-zero temperature. |
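The resumable download fix above can be sketched as follows; the target path is only an example.

```shell
# Download the ~800GB FP8 checkpoint to fast local NVMe. Re-running the
# same command resumes an interrupted download instead of starting over.
huggingface-cli download zai-org/GLM-5-FP8 \
  --local-dir /mnt/nvme/models/GLM-5-FP8
```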
