# GLM-5

GLM-5, released February 2026 by Zhipu AI (Z.AI), is a **744-billion parameter Mixture-of-Experts** language model that activates only 40B parameters per token. It achieves best-in-class open-source performance on reasoning, coding, and agentic tasks — scoring 77.8% on SWE-bench Verified and rivaling frontier models like Claude Opus 4.5 and GPT-5.2. The model is available under the **MIT license** on HuggingFace.

## Key Features

* **744B total / 40B active** — 256-expert MoE with highly efficient routing (see the toy routing sketch after this list)
* **Frontier coding performance** — 77.8% SWE-bench Verified, 73.3% SWE-bench Multilingual
* **Deep reasoning** — 92.7% on AIME 2026, 96.9% on HMMT Nov 2025, built-in thinking mode
* **Agentic capabilities** — native tool calling, function execution, and long-horizon task planning
* **200K+ context window** — handles massive codebases and long documents
* **MIT license** — fully open weights, commercial use permitted
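
To make the active-parameter idea concrete, here is a toy top-k router in PyTorch. It is purely illustrative: GLM-5's actual router design and per-token expert count are not documented here, and every size below is invented for the demo.

```python
import torch

# Toy mixture-of-experts layer: each token is routed to k of E experts,
# so only a small slice of the total parameters is active per token
# (the principle behind GLM-5's 40B-active / 744B-total split)
E, k, d = 256, 8, 16  # experts, experts per token, hidden dim (toy values)
router = torch.nn.Linear(d, E)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(E))

x = torch.randn(4, d)                         # a batch of 4 token embeddings
weights, idx = router(x).softmax(-1).topk(k)  # k highest-scoring experts per token
weights = weights / weights.sum(-1, keepdim=True)

# Each token's output is the weighted sum of its k selected experts
y = torch.stack([
    sum(w * experts[int(e)](t) for w, e in zip(ws, es))
    for t, ws, es in zip(x, weights, idx)
])
print(y.shape)  # torch.Size([4, 16])
```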

## Requirements

Self-hosting GLM-5 is a serious undertaking — the FP8 checkpoint requires **\~860GB of VRAM** for the weights alone, which already exceeds what an 8× H100 node (640GB total) can hold.

| Component | Requirement (FP8)                   |
| --------- | ----------------------------------- |
| GPU       | 8× H200 141GB (1,128GB VRAM total)  |
| RAM       | 256GB minimum, 512GB recommended    |
| Disk      | 1.5TB NVMe minimum, 2TB recommended |
| CUDA      | 12.0+ (12.4+ recommended)           |
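
As a sanity check on the \~860GB figure: FP8 stores one byte per parameter, so the weights alone come to about 744GB, and the rest is serving overhead. A rough estimate (the 15% overhead factor is an assumption, not a published number):

```python
# Back-of-envelope VRAM estimate for the FP8 checkpoint
total_params = 744e9                 # 744B total parameters (MoE)
weights_gb = total_params * 1 / 1e9  # 1 byte per parameter at FP8 -> 744 GB
overhead_gb = 0.15 * weights_gb      # assumed ~15% for KV cache and buffers
print(f"weights ~{weights_gb:.0f} GB, serving total ~{weights_gb + overhead_gb:.0f} GB")
# weights ~744 GB, serving total ~856 GB, consistent with the ~860 GB figure
```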

**Clore.ai recommendation**: For most users, **access GLM-5 via API** (Z.AI, OpenRouter). Self-hosting only makes sense if you can rent 8× H200 (\~$24–48/day on Clore.ai).

## API Access (Recommended for Most Users)

The most practical way to use GLM-5, whether from a Clore.ai machine or anywhere else:

### Via Z.AI Platform

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-zai-api-key",
    base_url="https://api.z.ai/v1"
)

response = client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python async web scraper using aiohttp and BeautifulSoup"}
    ],
    temperature=1.0,
    max_tokens=4096
)
print(response.choices[0].message.content)
```
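
For long outputs, especially with thinking mode enabled, streaming shows tokens as they arrive and avoids client timeouts. Standard OpenAI-style streaming works with the same client:

```python
# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="glm-5",
    messages=[{"role": "user", "content": "Summarize the GLM-5 architecture"}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```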

### Via OpenRouter

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-key",
    base_url="https://openrouter.ai/api/v1"
)

response = client.chat.completions.create(
    model="zai-org/glm-5",
    messages=[
        {"role": "user", "content": "Explain the MoE architecture used in GLM-5"}
    ],
    max_tokens=2048
)
print(response.choices[0].message.content)
```

## vLLM Setup (Self-Hosting)

For those with access to high-end multi-GPU machines on Clore.ai:

```bash
# Install vLLM (nightly required for GLM-5 support)
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

# Install latest transformers (required)
pip install git+https://github.com/huggingface/transformers.git
```
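
Before serving, consider pre-downloading the checkpoint (roughly 800GB, so fast NVMe matters) instead of letting `vllm serve` fetch it on first start. A sketch using `huggingface_hub`; the local path is a placeholder:

```python
from huggingface_hub import snapshot_download

# Resumable download of the FP8 checkpoint (~800GB); interrupted runs
# pick up where they left off
snapshot_download(
    repo_id="zai-org/GLM-5-FP8",
    local_dir="/models/GLM-5-FP8",  # placeholder path; point `vllm serve` here
)
```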

### Serve FP8 on 8× H200 GPUs

```bash
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8 \
  --gpu-memory-utilization 0.85
```

### Query the Server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With thinking mode (default)
response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Solve: find all primes p where p^2 + 2 is also prime"}
    ],
    temperature=1.0,
    max_tokens=4096
)
print(response.choices[0].message.content)

# Without thinking mode (faster, shorter responses)
response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[
        {"role": "user", "content": "Write a quicksort in Rust"}
    ],
    temperature=1.0,
    max_tokens=4096,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}
    }
)
print(response.choices[0].message.content)
```

## SGLang Alternative

SGLang also supports GLM-5 and may offer better performance on some hardware:

```bash
# Pull the SGLang image with GLM-5 support (Hopper GPUs)
docker pull lmsysorg/sglang:glm5-hopper

# Launch the server (inside the container, or from a local SGLang install)
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --served-model-name glm-5-fp8
```
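
SGLang exposes the same OpenAI-compatible API as vLLM, so the client snippets above carry over with only the base URL changed (SGLang defaults to port 30000):

```python
from openai import OpenAI

# Same OpenAI-compatible interface, just a different default port
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[{"role": "user", "content": "Sanity check: reply with OK"}],
    max_tokens=16,
)
print(response.choices[0].message.content)
```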

## Docker Quick Start

```bash
# vLLM Docker image with GLM-5 support
docker run --gpus all -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm5 --model zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm5 \
  --trust-remote-code
```
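
Loading \~860GB of weights takes a while after the container starts. A minimal readiness check that polls the OpenAI-compatible `/v1/models` endpoint until the server responds:

```python
import time

import requests

# Poll until the model finishes loading and the API starts answering
while True:
    try:
        r = requests.get("http://localhost:8000/v1/models", timeout=5)
        if r.ok:
            print(r.json())
            break
    except requests.ConnectionError:
        pass
    time.sleep(30)
```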

## Tool Calling Example

GLM-5 has native tool-calling support — ideal for building agentic applications:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {"type": "string", "description": "City name"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)
print(response.choices[0].message.tool_calls)
```
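
To close the agentic loop, execute the tool yourself and hand its output back in a `tool` message so the model can answer in prose. The `get_weather` body below is a stub for illustration:

```python
import json

# Hypothetical stub for the declared tool; swap in a real weather API
def get_weather(city: str) -> str:
    return f"Sunny, 22°C in {city}"

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

followup = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": get_weather(**args)},
    ],
)
print(followup.choices[0].message.content)
```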

## Tips for Clore.ai Users

* **API first, self-host second**: GLM-5 requires 8× H200 (\~$24–48/day on Clore.ai). For occasional use, the Z.AI API or OpenRouter is far more cost-effective. Self-host only if you need sustained throughput or data privacy.
* **Consider GLM-4.7 instead**: If 8× H200 is too much, the predecessor GLM-4.7 (355B, 32B active) runs on 4× H200 or 4× H100 (\~$12–24/day) and still delivers excellent performance.
* **Use FP8 weights**: Always use `zai-org/GLM-5-FP8` — near-identical quality to BF16 at roughly half the memory footprint. The BF16 version requires 16× GPUs.
* **Monitor VRAM usage**: `watch nvidia-smi` — long context queries can spike memory. Set `--gpu-memory-utilization 0.85` to leave headroom.
* **Thinking mode tradeoff**: Thinking mode produces better results for complex tasks but uses more tokens and time. Disable it for simple queries with `enable_thinking: false`.

## Troubleshooting

| Issue                         | Solution                                                                       |
| ----------------------------- | ------------------------------------------------------------------------------ |
| `OutOfMemoryError` on startup | Ensure you have 8× H200 (141GB each). FP8 needs \~860GB total VRAM.            |
| Slow downloads (\~800GB)      | Use `huggingface-cli download zai-org/GLM-5-FP8` with `--local-dir` to resume. |
| vLLM version mismatch         | GLM-5 requires vLLM nightly. Install via `pip install -U vllm --pre`.          |
| Tool calls not working        | Add `--tool-call-parser glm47 --enable-auto-tool-choice` to serve command.     |
| DeepGEMM errors               | Install DeepGEMM for FP8: use the `install_deepgemm.sh` script from the vLLM repo. |
| Thinking mode output empty    | Set `temperature=1.0` — thinking mode requires non-zero temperature.           |

## Further Reading

* [GLM-5 on HuggingFace](https://huggingface.co/zai-org/GLM-5)
* [GLM-5 FP8 Checkpoint](https://huggingface.co/zai-org/GLM-5-FP8)
* [Z.AI Platform](https://chat.z.ai)
* [Z.AI API Docs](https://docs.z.ai/guides/llm/glm-5)
* [vLLM GLM-5 Recipe](https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html)
* [GLM-5 Technical Blog](https://z.ai/blog/glm-5)
* [Slime RL Infrastructure](https://github.com/THUDM/slime)

