For the complete documentation index, see llms.txt. This page is also available as Markdown.

GLM-5.1 (744B MoE, #1 SWE-Bench Pro)

Deploy GLM-5.1 (744B MoE, 40B active) by Z.ai on Clore.ai — the open-weight model that topped SWE-Bench Pro in April 2026

Status (April 2026): GLM-5.1 was released on April 7, 2026 by Z.ai (formerly Zhipu AI) as an incremental-but-serious upgrade to GLM-5. It is the first open-weight model to top SWE-Bench Pro (58.4%), edging out GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) according to vendor-published numbers. Weights live at huggingface.co/zai-org/GLM-5.1 under the MIT license.

GLM-5.1 is a 744-billion parameter Mixture-of-Experts language model that activates only ~40B parameters per token. Compared to its predecessor GLM-5, the 5.1 release keeps the same MoE skeleton but ships refined expert routing, a 200K-token context window, a 131K-token max output, and training focused on long-horizon agentic coding — the model is explicitly tuned to sustain thousands of tool calls and hundreds of refactor rounds without drifting.

For Clore.ai users, the interesting part is the 40B active number: you don't need a full 8×H200 rack to serve it. A tensor-parallel setup across 2×H100 80GB (FP8) or 4×A100 80GB (BF16 with sharding) is enough for practical throughput — putting frontier-class coding within reach at ~$12–24/day on the marketplace.

Key Specs

Property
Value

Total Parameters

744B (MoE)

Active Parameters

~40B per forward pass

Context Window

200,000 tokens

Max Output

131,072 tokens

License

MIT

Release Date

April 7, 2026

Organization

Z.ai (zai-org on HuggingFace)

Primary Tooling

vLLM, SGLang, llama.cpp (GGUF), xLLM, KTransformers

Why GLM-5.1?

  • #1 on SWE-Bench Pro — 58.4% vendor-claimed, ahead of GPT-5.4 and Claude Opus 4.6

  • Long-horizon agents — sustains optimization across hundreds of rounds and thousands of tool calls

  • 200K context — enough for an entire mid-sized codebase plus test suite

  • 40B active MoE — you pay the inference cost of a 40B dense model, not a 744B one

  • MIT license — fully open weights, no restrictions on commercial use or fine-tuning

  • Open training stack — Z.ai published the model, reportedly trained without Nvidia data-center GPUs


Requirements

Component
Minimum (Q4 GGUF, offload)
Recommended (FP8)
Full BF16

GPU VRAM

~80GB (Q4 + RAM offload)

2× H100 80GB active, 8× total

8× H200 141GB

RAM

256GB

256GB

512GB

Disk

500GB NVMe

1TB NVMe

2TB NVMe

CUDA

12.4+

12.4+

12.6+

Clore.ai pick: For most teams, 2× H100 80GB running the FP8 checkpoint with aggressive offloading is the sweet spot (~$12–16/day). If you need full BF16 throughput, jump to 8× H200 or use the Z.ai API for occasional calls.


Option A — Ollama / GGUF (Quantized, community builds)


vLLM is Z.ai's first-class serving target. The FP8 checkpoint (zai-org/GLM-5.1-FP8) is the one you want — same quality as BF16, roughly half the memory.

Use --tensor-parallel-size 2 on 2× H100 if you're running tight on GPU count, but plan for slower prefill on 200K contexts. --enable-chunked-prefill helps a lot.


Option C — SGLang (alternative, often faster on Hopper)

SGLang's EAGLE speculative decoding typically gives a 1.5–2× throughput boost on long coding completions.


Clore.ai GPU Recommendations

Setup
VRAM
Expected Performance
Clore.ai Cost

2× H100 80GB

160GB

FP8 with offload, ~15–25 tok/s

~$12–16/day

4× A100 80GB

320GB

BF16 sharded, ~20–30 tok/s

~$15–22/day

8× H100 80GB

640GB

FP8 full, ~60+ tok/s

~$40–55/day

8× H200 141GB

1,128GB

BF16 full, maximum throughput

~$70+/day


Use Cases

  • Autonomous SWE agents — GLM-5.1 is explicitly trained for long tool-calling loops; pair it with something like SWE-agent or OpenHands

  • Codebase understanding — drop 100K+ tokens of Go/Rust/Python into context and ask for architectural reviews

  • Long-context RAG — 200K ctx handles entire product docs + support tickets in one shot

  • Refactor pipelines — sustained correctness across hundreds of file edits

  • Agent-of-agents orchestration — use GLM-5.1 as a planner and smaller models (Qwen3.5-35B, GLM-4.7) as workers


Benchmarks

Benchmark
GLM-5.1
GPT-5.4
Claude Opus 4.6
GLM-5

SWE-Bench Pro

58.4%

57.7%

57.3%

~52%

SWE-Bench Verified

~79%

~78%

~80%

77.8%

HumanEval

~94%

~95%

~94%

~93%

LiveCodeBench

~72%

~73%

~70%

~68%


Troubleshooting

Issue
Solution

OutOfMemoryError on load

FP8 checkpoint needs ~860GB total VRAM. Use 8× H100/H200 or drop to GGUF Q4 with RAM offload.

Slow HuggingFace download

Use huggingface-cli download zai-org/GLM-5.1-FP8 --local-dir ./weights --resume-download. Expect 800GB+.

Tool calls silently dropped

Ensure --tool-call-parser glm47 and --enable-auto-tool-choice are both set in vLLM.

Thinking mode empty

Requires temperature=1.0 — zero-temp sampling breaks the reasoning trace.

vLLM rejects the config

GLM-5.1 needs vLLM ≥ 0.7.x (April 2026 release). Use pip install -U vllm --pre if on older versions.

200K context OOMs

Start with --max-model-len 65536 and add --enable-chunked-prefill; raise once stable.


Next Steps

  • Predecessor: GLM-5 — same MoE shape, slightly less coding-focused

  • Cheaper alternative: Qwen3.5 — 35B dense fits on a single RTX 4090

  • Massive-context alternative: DeepSeek V4 — 1M ctx, multimodal, ~1T params

  • Clore.ai Marketplace: clore.ai/marketplace — rent H100/H200/A100 from $0.50/day

Last updated

Was this helpful?