# GLM-5

GLM-5, released February 2026 by Zhipu AI (Z.AI), is a **744-billion parameter Mixture-of-Experts** language model that activates only 40B parameters per token. It achieves best-in-class open-source performance on reasoning, coding, and agentic tasks — scoring 77.8% on SWE-bench Verified and rivaling frontier models like Claude Opus 4.5 and GPT-5.2. The model is available under the **MIT license** on HuggingFace.

## Key Features

* **744B total / 40B active** — 256-expert MoE with highly efficient routing
* **Frontier coding performance** — 77.8% SWE-bench Verified, 73.3% SWE-bench Multilingual
* **Deep reasoning** — 92.7% on AIME 2026, 96.9% on HMMT Nov 2025, built-in thinking mode
* **Agentic capabilities** — native tool calling, function execution, and long-horizon task planning
* **200K+ context window** — handles massive codebases and long documents
* **MIT license** — fully open weights, commercial use permitted

## Requirements

Self-hosting GLM-5 is a serious undertaking — serving the FP8 checkpoint takes roughly **\~860GB of VRAM**.

| Component | Requirement (FP8)                  |
| --------- | ---------------------------------- |
| GPU       | 8× H200 141GB (1,128GB total VRAM) |
| RAM       | 256GB (512GB recommended)          |
| Disk      | 1.5TB NVMe (weights are \~800GB)   |
| CUDA      | 12.0+ (12.4+ recommended)          |
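The \~860GB serving figure can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming one byte per parameter for FP8 weights plus a flat overhead allowance for KV cache and runtime buffers (the 15% overhead fraction is an illustrative assumption, not a published number):

```python
def estimate_vram_gb(total_params_b: float, bytes_per_param: float = 1.0,
                     overhead_fraction: float = 0.15) -> float:
    """Rough serving-memory estimate: weights plus a fractional
    allowance for KV cache, activations, and runtime buffers."""
    weights_gb = total_params_b * bytes_per_param  # 1B params @ 1 byte ~= 1GB
    return weights_gb * (1 + overhead_fraction)

# FP8: 744B params at 1 byte each is ~744GB of weights alone
print(round(estimate_vram_gb(744)))       # -> 856, close to the ~860GB figure
# BF16 (2 bytes/param) roughly doubles it
print(round(estimate_vram_gb(744, 2.0)))  # -> 1711
```

The same arithmetic at BF16 lands around 1.7TB, which is why the full-precision checkpoint needs twice the GPUs.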

**Clore.ai recommendation**: For most users, **access GLM-5 via API** (Z.AI, OpenRouter). Self-hosting only makes sense if you can rent an 8× H200 machine (\~$24–48/day on Clore.ai).

## API Access (Recommended for Most Users)

The most practical way to use GLM-5, whether from a Clore.ai machine or anywhere else, is through a hosted API:

### Via Z.AI Platform

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-zai-api-key",
    base_url="https://api.z.ai/v1"
)

response = client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python async web scraper using aiohttp and BeautifulSoup"}
    ],
    temperature=1.0,
    max_tokens=4096
)
print(response.choices[0].message.content)
```
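Hosted endpoints occasionally return transient errors (rate limits, timeouts). A small backoff wrapper helps; the retry count and delays below are arbitrary choices, and it accepts any zero-argument callable so it composes with the call above:

```python
import time

def with_retries(call, max_attempts: int = 3, base_delay: float = 1.0):
    """Invoke call() and retry on any exception with exponential
    backoff; re-raise after the final attempt fails."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Usage with the client above:
# reply = with_retries(lambda: client.chat.completions.create(...))
```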

### Via OpenRouter

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-key",
    base_url="https://openrouter.ai/api/v1"
)

response = client.chat.completions.create(
    model="zai-org/glm-5",
    messages=[
        {"role": "user", "content": "Explain the MoE architecture used in GLM-5"}
    ],
    max_tokens=2048
)
print(response.choices[0].message.content)
```
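The two providers differ only in base URL and model ID. A tiny helper (hypothetical, using the IDs from this guide) keeps the rest of the client code identical:

```python
# Hypothetical helper; the endpoint/model pairs are the ones used in this guide.
PROVIDERS = {
    "zai":        {"base_url": "https://api.z.ai/v1",          "model": "glm-5"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1", "model": "zai-org/glm-5"},
}

def provider_config(name: str) -> dict:
    """Look up the base_url/model pair for a known provider."""
    try:
        return PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None

# cfg = provider_config("openrouter")
# client = OpenAI(api_key="your-key", base_url=cfg["base_url"])
```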

## vLLM Setup (Self-Hosting)

For those with access to high-end multi-GPU machines on Clore.ai:

```bash
# Install vLLM (nightly required for GLM-5 support)
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

# Install latest transformers (required)
pip install git+https://github.com/huggingface/transformers.git
```

### Serve FP8 on 8× H200 GPUs

```bash
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8 \
  --gpu-memory-utilization 0.85
```

### Query the Server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With thinking mode (default)
response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Solve: find all primes p where p^2 + 2 is also prime"}
    ],
    temperature=1.0,
    max_tokens=4096
)
print(response.choices[0].message.content)

# Without thinking mode (faster, shorter responses)
response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[
        {"role": "user", "content": "Write a quicksort in Rust"}
    ],
    temperature=1.0,
    max_tokens=4096,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}
    }
)
print(response.choices[0].message.content)
```
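The only difference between the two requests above is the `extra_body` payload, so a small builder makes the toggle explicit; it assumes vLLM's `chat_template_kwargs` passthrough shown above:

```python
def thinking_kwargs(enabled: bool = True) -> dict:
    """Extra kwargs for client.chat.completions.create() toggling GLM's
    thinking mode via vLLM's chat_template_kwargs passthrough.
    Thinking is on by default, so enabling it needs no extra kwargs."""
    if enabled:
        return {}
    return {"extra_body": {"chat_template_kwargs": {"enable_thinking": False}}}

# client.chat.completions.create(model="glm-5-fp8", messages=msgs, **thinking_kwargs(False))
```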

## SGLang Alternative

SGLang also supports GLM-5 and may offer better performance on some hardware:

```bash
# Using Docker (Hopper GPUs)
docker pull lmsysorg/sglang:glm5-hopper

# Launch the server (run this inside the container)
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --served-model-name glm-5-fp8
```

## Docker Quick Start

```bash
# vLLM Docker image with GLM-5 support
docker run --gpus all -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm5 \
  --model zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm5 \
  --trust-remote-code
```

## Tool Calling Example

GLM-5 has native tool-calling support — ideal for building agentic applications:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {"type": "string", "description": "City name"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)
print(response.choices[0].message.tool_calls)
```
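Note that the server only *proposes* the tool call; your code has to parse the JSON arguments, run the function, and append the result as a `tool` message. A minimal dispatcher sketch (the weather stub in the usage comment is hypothetical):

```python
import json

def dispatch_tool_call(tool_call, registry: dict):
    """Run a returned tool call against a dict of local functions and
    build the follow-up 'tool' message for the next request."""
    fn = registry[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = fn(**args)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}

# Hypothetical stub implementation for the schema above:
# registry = {"get_weather": lambda city: {"city": city, "temp_c": 18}}
# messages.append(dispatch_tool_call(response.choices[0].message.tool_calls[0], registry))
```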

## Tips for Clore.ai Users

* **API first, self-host second**: GLM-5 requires 8× H200 (\~$24–48/day on Clore.ai). For occasional use, the Z.AI API or OpenRouter is far more cost-effective. Self-host only if you need sustained throughput or data privacy.
* **Consider GLM-4.7 instead**: If 8× H200 is too much, the predecessor GLM-4.7 (355B, 32B active) runs on 4× H200 or 4× H100 (\~$12–24/day) and still delivers excellent performance.
* **Use FP8 weights**: Always use `zai-org/GLM-5-FP8` — near-identical quality to BF16 at roughly half the memory footprint. The BF16 checkpoint requires twice the GPUs.
* **Monitor VRAM usage**: `watch nvidia-smi` — long context queries can spike memory. Set `--gpu-memory-utilization 0.85` to leave headroom.
* **Thinking mode tradeoff**: Thinking mode produces better results for complex tasks but uses more tokens and time. Disable it for simple queries with `enable_thinking: false`.
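The "API first" advice can be quantified with simple break-even arithmetic. In this sketch the per-token API price is a placeholder assumption, not Z.AI's actual rate; only the $24–48/day rental range comes from this guide:

```python
def breakeven_mtok_per_day(rental_usd_per_day: float, api_usd_per_mtok: float) -> float:
    """Millions of tokens per day at which GPU rental matches API spend."""
    return rental_usd_per_day / api_usd_per_mtok

# Placeholder API price of $2 per million tokens:
low = breakeven_mtok_per_day(24, 2.0)   # 12.0
high = breakeven_mtok_per_day(48, 2.0)  # 24.0
print(f"Self-hosting breaks even above {low:.0f}-{high:.0f}M tokens/day")
```

Below that daily volume, the API is cheaper even before counting setup time and download bandwidth.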

## Troubleshooting

| Issue                         | Solution                                                                       |
| ----------------------------- | ------------------------------------------------------------------------------ |
| `OutOfMemoryError` on startup | Ensure you have 8× H200 (141GB each). FP8 needs \~860GB total VRAM.            |
| Slow downloads (\~800GB)      | Use `huggingface-cli download zai-org/GLM-5-FP8` with `--local-dir` to resume. |
| vLLM version mismatch         | GLM-5 requires vLLM nightly. Install via `pip install -U vllm --pre`.          |
| Tool calls not working        | Add `--tool-call-parser glm47 --enable-auto-tool-choice` to serve command.     |
| DeepGEMM errors               | Install DeepGEMM for FP8: use the `install_deepgemm.sh` script from vLLM repo. |
| Thinking mode output empty    | Set `temperature=1.0` — thinking mode requires non-zero temperature.           |

## Further Reading

* [GLM-5 on HuggingFace](https://huggingface.co/zai-org/GLM-5)
* [GLM-5 FP8 Checkpoint](https://huggingface.co/zai-org/GLM-5-FP8)
* [Z.AI Platform](https://chat.z.ai)
* [Z.AI API Docs](https://docs.z.ai/guides/llm/glm-5)
* [vLLM GLM-5 Recipe](https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html)
* [GLM-5 Technical Blog](https://z.ai/blog/glm-5)
* [Slime RL Infrastructure](https://github.com/THUDM/slime)
