GLM-5.1 (744B MoE, #1 SWE-Bench Pro)
Deploy GLM-5.1 (744B MoE, 40B active) by Z.ai on Clore.ai — the open-weight model that topped SWE-Bench Pro in April 2026
Status (April 2026): GLM-5.1 was released on April 7, 2026 by Z.ai (formerly Zhipu AI) as an incremental-but-serious upgrade to GLM-5. It is the first open-weight model to top SWE-Bench Pro (58.4%), edging out GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) according to vendor-published numbers. Weights live at huggingface.co/zai-org/GLM-5.1 under the MIT license.
GLM-5.1 is a 744-billion parameter Mixture-of-Experts language model that activates only ~40B parameters per token. Compared to its predecessor GLM-5, the 5.1 release keeps the same MoE skeleton but ships refined expert routing, a 200K-token context window, a 131K-token max output, and training focused on long-horizon agentic coding — the model is explicitly tuned to sustain thousands of tool calls and hundreds of refactor rounds without drifting.
For Clore.ai users, the interesting part is the 40B active number: you don't need a full 8×H200 rack to serve it. A tensor-parallel setup across 2×H100 80GB (FP8) or 4×A100 80GB (BF16 with sharding) is enough for practical throughput — putting frontier-class coding within reach at ~$12–24/day on the marketplace.
Key Specs
Total Parameters
744B (MoE)
Active Parameters
~40B per forward pass
Context Window
200,000 tokens
Max Output
131,072 tokens
License
MIT
Release Date
April 7, 2026
Organization
Z.ai (zai-org on HuggingFace)
Primary Tooling
vLLM, SGLang, llama.cpp (GGUF), xLLM, KTransformers
Why GLM-5.1?
#1 on SWE-Bench Pro — 58.4% vendor-claimed, ahead of GPT-5.4 and Claude Opus 4.6
Long-horizon agents — sustains optimization across hundreds of rounds and thousands of tool calls
200K context — enough for an entire mid-sized codebase plus test suite
40B active MoE — you pay the inference cost of a 40B dense model, not a 744B one
MIT license — fully open weights, no restrictions on commercial use or fine-tuning
Open training stack — Z.ai published the model, reportedly trained without Nvidia data-center GPUs
Requirements
Still a big model. While "40B active" sounds friendly, the full 744B weights must be loaded into VRAM (or offloaded). FP8 weights are ~860GB; BF16 is ~1.5TB. Plan accordingly.
GPU VRAM
~80GB (Q4 + RAM offload)
2× H100 80GB active, 8× total
8× H200 141GB
RAM
256GB
256GB
512GB
Disk
500GB NVMe
1TB NVMe
2TB NVMe
CUDA
12.4+
12.4+
12.6+
Clore.ai pick: For most teams, 2× H100 80GB running the FP8 checkpoint with aggressive offloading is the sweet spot (~$12–16/day). If you need full BF16 throughput, jump to 8× H200 or use the Z.ai API for occasional calls.
Option A — Ollama / GGUF (Quantized, community builds)
Heads-up: Community GGUF quants typically land 1–2 weeks after a Z.ai release. If ollama pull fails, check huggingface.co/models?search=glm-5.1+gguf and point llama.cpp at the file directly.
Option B — vLLM (Production API, recommended)
vLLM is Z.ai's first-class serving target. The FP8 checkpoint (zai-org/GLM-5.1-FP8) is the one you want — same quality as BF16, roughly half the memory.
Use --tensor-parallel-size 2 on 2× H100 if you're running tight on GPU count, but plan for slower prefill on 200K contexts. --enable-chunked-prefill helps a lot.
Option C — SGLang (alternative, often faster on Hopper)
SGLang's EAGLE speculative decoding typically gives a 1.5–2× throughput boost on long coding completions.
Clore.ai GPU Recommendations
2× H100 80GB
160GB
FP8 with offload, ~15–25 tok/s
~$12–16/day
4× A100 80GB
320GB
BF16 sharded, ~20–30 tok/s
~$15–22/day
8× H100 80GB
640GB
FP8 full, ~60+ tok/s
~$40–55/day
8× H200 141GB
1,128GB
BF16 full, maximum throughput
~$70+/day
Best value: 2× H100 80GB with the FP8 checkpoint. You get frontier-class coding performance for roughly the price of a Claude Opus subscription — and the weights stay on your box.
Use Cases
Autonomous SWE agents — GLM-5.1 is explicitly trained for long tool-calling loops; pair it with something like SWE-agent or OpenHands
Codebase understanding — drop 100K+ tokens of Go/Rust/Python into context and ask for architectural reviews
Long-context RAG — 200K ctx handles entire product docs + support tickets in one shot
Refactor pipelines — sustained correctness across hundreds of file edits
Agent-of-agents orchestration — use GLM-5.1 as a planner and smaller models (Qwen3.5-35B, GLM-4.7) as workers
Benchmarks
Vendor-claimed — verify independently. The numbers below come from Z.ai's April 7, 2026 announcement. Independent reproductions on SWE-Bench Pro are still rolling in.
SWE-Bench Pro
58.4%
57.7%
57.3%
~52%
SWE-Bench Verified
~79%
~78%
~80%
77.8%
HumanEval
~94%
~95%
~94%
~93%
LiveCodeBench
~72%
~73%
~70%
~68%
Troubleshooting
OutOfMemoryError on load
FP8 checkpoint needs ~860GB total VRAM. Use 8× H100/H200 or drop to GGUF Q4 with RAM offload.
Slow HuggingFace download
Use huggingface-cli download zai-org/GLM-5.1-FP8 --local-dir ./weights --resume-download. Expect 800GB+.
Tool calls silently dropped
Ensure --tool-call-parser glm47 and --enable-auto-tool-choice are both set in vLLM.
Thinking mode empty
Requires temperature=1.0 — zero-temp sampling breaks the reasoning trace.
vLLM rejects the config
GLM-5.1 needs vLLM ≥ 0.7.x (April 2026 release). Use pip install -U vllm --pre if on older versions.
200K context OOMs
Start with --max-model-len 65536 and add --enable-chunked-prefill; raise once stable.
Next Steps
Predecessor: GLM-5 — same MoE shape, slightly less coding-focused
Cheaper alternative: Qwen3.5 — 35B dense fits on a single RTX 4090
Massive-context alternative: DeepSeek V4 — 1M ctx, multimodal, ~1T params
Clore.ai Marketplace: clore.ai/marketplace — rent H100/H200/A100 from $0.50/day
Links
Last updated
Was this helpful?