# GLM-5.1 (744B MoE, #1 SWE-Bench Pro)

{% hint style="info" %}
**Status (April 2026):** GLM-5.1 was released on **April 7, 2026** by Z.ai (formerly Zhipu AI) as an incremental-but-serious upgrade to [GLM-5](https://docs.clore.ai/guides/language-models/glm5). It is the first open-weight model to top **SWE-Bench Pro (58.4%)**, edging out GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) according to vendor-published numbers. Weights live at [huggingface.co/zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) under the **MIT license**.
{% endhint %}

GLM-5.1 is a **744-billion parameter Mixture-of-Experts** language model that activates only **\~40B parameters per token**. Compared to its predecessor [GLM-5](https://docs.clore.ai/guides/language-models/glm5), the 5.1 release keeps the same MoE skeleton but ships refined expert routing, a **200K-token context window**, a **131K-token max output**, and training focused on **long-horizon agentic coding** — the model is explicitly tuned to sustain thousands of tool calls and hundreds of refactor rounds without drifting.

For Clore.ai users, the interesting part is the **40B active** number: you don't need a full 8×H200 rack to serve it. A tensor-parallel setup across **2×H100 80GB** (FP8 with aggressive CPU offload) or **4×A100 80GB** (BF16, sharded and offloaded) is enough for practical throughput — putting frontier-class coding within reach at \~$12–24/day on the marketplace.

### Key Specs

| Property          | Value                                               |
| ----------------- | --------------------------------------------------- |
| Total Parameters  | 744B (MoE)                                          |
| Active Parameters | \~40B per forward pass                              |
| Context Window    | 200,000 tokens                                      |
| Max Output        | 131,072 tokens                                      |
| License           | MIT                                                 |
| Release Date      | April 7, 2026                                       |
| Organization      | Z.ai (zai-org on HuggingFace)                       |
| Primary Tooling   | vLLM, SGLang, llama.cpp (GGUF), xLLM, KTransformers |

### Why GLM-5.1?

* **#1 on SWE-Bench Pro** — 58.4% vendor-claimed, ahead of GPT-5.4 and Claude Opus 4.6
* **Long-horizon agents** — sustains optimization across hundreds of rounds and thousands of tool calls
* **200K context** — enough for an entire mid-sized codebase plus test suite
* **40B active MoE** — you pay the inference cost of a 40B dense model, not a 744B one
* **MIT license** — fully open weights, no restrictions on commercial use or fine-tuning
* **Open training stack** — Z.ai published the model, reportedly trained without Nvidia data-center GPUs

***

## Requirements

{% hint style="warning" %}
**Still a big model.** While "40B active" sounds friendly, the full 744B weights must be loaded into VRAM (or offloaded). FP8 weights are \~860GB; BF16 is \~1.5TB. Plan accordingly.
{% endhint %}
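
The sizes in the warning above follow from simple parameter-count math. A rough weight-only back-of-envelope, assuming \~1 byte per parameter at FP8 and 2 bytes at BF16 (KV cache, activations, and the layers kept in higher precision push the real numbers up):

```bash
# Weight-only footprint in GB; real checkpoints run larger because some layers
# stay in higher precision and the runtime adds overhead.
echo "FP8  (~1 byte/param):  $(( 744 * 1 )) GB"
echo "BF16 (~2 bytes/param): $(( 744 * 2 )) GB"
```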

| Component | Minimum (Q4 GGUF, offload) | Recommended (FP8)             | Full BF16     |
| --------- | -------------------------- | ----------------------------- | ------------- |
| GPU VRAM  | \~80GB (Q4 + RAM offload)  | 8× H100 80GB (2× works with CPU offload) | 8× H200 141GB |
| RAM       | 256GB                      | 256GB                         | 512GB         |
| Disk      | 500GB NVMe                 | 1TB NVMe                      | 2TB NVMe      |
| CUDA      | 12.4+                      | 12.4+                         | 12.6+         |

**Clore.ai pick:** For most teams, 2× H100 80GB running the FP8 checkpoint with aggressive offloading is the sweet spot (\~$12–16/day). If you need full BF16 throughput, jump to 8× H200 or use the Z.ai API for occasional calls.

***

## Option A — Ollama / GGUF (Quantized, community builds)

{% hint style="warning" %}
**Heads-up:** Community GGUF quants typically land 1–2 weeks after a Z.ai release. If `ollama pull` fails, check [huggingface.co/models?search=glm-5.1+gguf](https://huggingface.co/models?search=glm-5.1+gguf) and point llama.cpp at the file directly.
{% endhint %}

```bash
# Once a Q4_K_M build is available
docker exec ollama ollama pull glm-5.1:q4_K_M
docker exec ollama ollama run glm-5.1:q4_K_M

# Or with llama.cpp directly on a GGUF file
docker run --gpus all -it --rm -p 8080:8080 \
  -v $(pwd)/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/glm-5.1-q4_k_m.gguf \
  --n-gpu-layers 80 --ctx-size 32768 \
  --port 8080 --host 0.0.0.0
```
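
Once the llama.cpp server is up, a quick smoke test. Recent server builds expose an OpenAI-compatible `/v1/chat/completions` route; older ones only have `/completion`:

```bash
# llama.cpp's server treats the model name as optional, so none is passed here
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    "max_tokens": 512
  }'
```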

***

## Option B — vLLM (Production API, recommended)

vLLM is Z.ai's first-class serving target. The FP8 checkpoint (`zai-org/GLM-5.1-FP8`) is the one you want — same quality as BF16, roughly half the memory.

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model zai-org/GLM-5.1-FP8
      --tensor-parallel-size 8
      --max-model-len 65536
      --gpu-memory-utilization 0.88
      --tool-call-parser glm47
      --reasoning-parser glm45
      --enable-auto-tool-choice
      --served-model-name glm-5.1
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```
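
Bring the stack up and watch the logs. Loading the FP8 weights takes a while, so wait for the server to report it is listening before sending requests (exact log wording varies by vLLM version):

```bash
docker compose up -d
docker compose logs -f vllm   # wait until the OpenAI-compatible server reports it is listening on :8000
```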

```bash
# Test the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [
      {"role": "system", "content": "You are a senior software engineer."},
      {"role": "user", "content": "Refactor this Go handler to use context.Context properly and add retries."}
    ],
    "max_tokens": 4096,
    "temperature": 1.0
  }'
```
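
Tool calling goes through the same endpoint. A minimal request exercising the parser flags above; the `run_tests` schema is just an illustrative example, not a tool GLM-5.1 ships with:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Run the unit tests and summarize any failures."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "run_tests",
        "description": "Run the project test suite and return the output",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string", "description": "Test file or directory"}},
          "required": ["path"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 1024
  }'
```

If the parser flags are set correctly, the structured call comes back in the response's `message.tool_calls` array rather than as raw text.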

{% hint style="info" %}
Use `--tensor-parallel-size 2` on 2× H100 if you're running tight on GPU count, but plan for slower prefill on 200K contexts. `--enable-chunked-prefill` helps a lot; a 2-GPU launch sketch follows below.
{% endhint %}
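
As a sketch of the 2-GPU launch described in the hint, here is the equivalent plain `docker run`. Whether 2× 80GB plus host RAM is enough for this checkpoint depends on your offload headroom, so treat the offload size and context length as starting points to tune rather than known-good values:

```bash
docker run --gpus all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --shm-size 16g \
  vllm/vllm-openai:latest \
  --model zai-org/GLM-5.1-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --cpu-offload-gb 128 \
  --gpu-memory-utilization 0.88 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5.1 \
  --trust-remote-code
```

`--cpu-offload-gb` is per GPU; size it to the host RAM you actually have, and expect this layout to trade throughput for fitting the weights at all.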

***

## Option C — SGLang (alternative, often faster on Hopper)

```bash
docker pull lmsysorg/sglang:latest

# The pull alone starts nothing; launch the server inside the container
docker run --gpus all -it --rm -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path zai-org/GLM-5.1-FP8 \
    --tp-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.88 \
    --context-length 65536 \
    --served-model-name glm-5.1 \
    --host 0.0.0.0 --port 30000
```

SGLang's EAGLE speculative decoding typically gives a 1.5–2× throughput boost on long coding completions.
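
SGLang serves the same OpenAI-compatible surface, so the vLLM test request works here too; just point it at SGLang's port (30000 in the launch above):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Explain the difference between a mutex and a semaphore in two sentences."}],
    "max_tokens": 256
  }'
```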

***

## Clore.ai GPU Recommendations

| Setup         | VRAM    | Expected Performance            | Clore.ai Cost |
| ------------- | ------- | ------------------------------- | ------------- |
| 2× H100 80GB  | 160GB   | FP8 with offload, \~15–25 tok/s | \~$12–16/day  |
| 4× A100 80GB  | 320GB   | BF16 sharded, \~20–30 tok/s     | \~$15–22/day  |
| 8× H100 80GB  | 640GB   | FP8 full, \~60+ tok/s           | \~$40–55/day  |
| 8× H200 141GB | 1,128GB | BF16 full, maximum throughput   | \~$70+/day    |

{% hint style="success" %}
**Best value:** 2× H100 80GB with the FP8 checkpoint. You get frontier-class coding performance for roughly the price of a Claude Opus subscription — and the weights stay on your box.
{% endhint %}

***

## Use Cases

* **Autonomous SWE agents** — GLM-5.1 is explicitly trained for long tool-calling loops; pair it with something like SWE-agent or OpenHands
* **Codebase understanding** — drop 100K+ tokens of Go/Rust/Python into context and ask for architectural reviews
* **Long-context RAG** — 200K ctx handles entire product docs + support tickets in one shot
* **Refactor pipelines** — sustained correctness across hundreds of file edits
* **Agent-of-agents orchestration** — use GLM-5.1 as a planner and smaller models (Qwen3.5-35B, GLM-4.7) as workers

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed — verify independently.** The numbers below come from Z.ai's April 7, 2026 announcement. Independent reproductions on SWE-Bench Pro are still rolling in.
{% endhint %}

| Benchmark          | GLM-5.1   | GPT-5.4 | Claude Opus 4.6 | GLM-5 |
| ------------------ | --------- | ------- | --------------- | ----- |
| SWE-Bench Pro      | **58.4%** | 57.7%   | 57.3%           | \~52% |
| SWE-Bench Verified | \~79%     | \~78%   | \~80%           | 77.8% |
| HumanEval          | \~94%     | \~95%   | \~94%           | \~93% |
| LiveCodeBench      | \~72%     | \~73%   | \~70%           | \~68% |

***

## Troubleshooting

| Issue                       | Solution                                                                                                   |
| --------------------------- | ---------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on load  | FP8 checkpoint needs \~860GB total VRAM. Use 8× H100/H200 or drop to GGUF Q4 with RAM offload.             |
| Slow HuggingFace download   | Use `huggingface-cli download zai-org/GLM-5.1-FP8 --local-dir ./weights --resume-download`. Expect 800GB+; a faster variant is shown below the table. |
| Tool calls silently dropped | Ensure `--tool-call-parser glm47` and `--enable-auto-tool-choice` are both set in vLLM.                    |
| Thinking mode empty         | Requires `temperature=1.0` — zero-temp sampling breaks the reasoning trace.                                |
| vLLM rejects the config     | GLM-5.1 needs vLLM ≥ 0.7.x (April 2026 release). Use `pip install -U vllm --pre` if on older versions.     |
| 200K context OOMs           | Start with `--max-model-len 65536` and add `--enable-chunked-prefill`; raise once stable.                  |
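
For the slow-download row above, the optional `hf_transfer` backend usually saturates a fast link better than the default downloader. A sketch, assuming you can install the extra package on the host:

```bash
# hf_transfer is an optional accelerated downloader for huggingface_hub
pip install -U "huggingface_hub[hf_transfer]"

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download zai-org/GLM-5.1-FP8 \
  --local-dir ./weights
```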

***

## Next Steps

* **Predecessor:** [GLM-5](https://docs.clore.ai/guides/language-models/glm5) — same MoE shape, slightly less coding-focused
* **Cheaper alternative:** [Qwen3.5](https://docs.clore.ai/guides/language-models/qwen35) — 35B dense fits on a single RTX 4090
* **Massive-context alternative:** [DeepSeek V4](https://docs.clore.ai/guides/language-models/deepseek-v4) — 1M ctx, multimodal, \~1T params
* **Clore.ai Marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace) — rent H100/H200/A100 from $0.50/day

### Links

* [GLM-5.1 on HuggingFace](https://huggingface.co/zai-org/GLM-5.1)
* [Z.ai Blog — GLM-5.1 announcement](https://z.ai/blog/glm-5.1)
* [Z.ai Platform (hosted API)](https://chat.z.ai)
* [vLLM docs](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/glm-5-1.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
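
For example (URL-encode the question):

```bash
curl "https://docs.clore.ai/guides/language-models/glm-5-1.md?ask=Which%20quantized%20GLM-5.1%20builds%20fit%20on%20a%20single%2080GB%20GPU%3F"
```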
