# GLM-4.7-Flash

> GLM-4.7-Flash is a **30-billion-parameter Mixture-of-Experts** language model by Zhipu AI that activates only 3B parameters per token. It delivers strong performance on coding and reasoning tasks, achieving 59.2% on SWE-bench while requiring only 10-12GB VRAM for FP16 inference. Released under the **MIT license**, it's an ideal choice for developers seeking frontier-model quality at affordable single-GPU costs.

## At a Glance

* **Model Size**: 30B total / 3B active parameters (MoE)
* **License**: MIT (fully commercial)
* **Context**: 128K tokens
* **Performance**: 59.2% SWE-bench, 75.4% HumanEval
* **VRAM**: \~10-12GB FP16, \~6GB INT8
* **Speed**: \~45-60 tok/s on RTX 4090

## Why GLM-4.7-Flash?

**Efficient Performance**: GLM-4.7-Flash punches above its weight class. Despite using only 3B active parameters, it outperforms many 70B+ dense models on coding benchmarks. The MoE architecture provides 30B model quality at 7B model inference cost.

**Single-GPU Friendly**: Unlike massive models requiring multi-GPU setups, GLM-4.7-Flash runs comfortably on a single RTX 4090 or A100 40GB. This makes it perfect for development, fine-tuning, and cost-effective production deployments.

**Coding Specialist**: With 59.2% SWE-bench performance, GLM-4.7-Flash excels at software engineering tasks — code generation, debugging, refactoring, and technical documentation. It understands 20+ programming languages with deep context awareness.

**MIT Licensed**: No usage restrictions. Deploy commercially, fine-tune, or modify without licensing concerns. The complete weights and training recipes are freely available.

## GPU Recommendations

| GPU          | VRAM | Performance | Daily Cost\* |
| ------------ | ---- | ----------- | ------------ |
| **RTX 4090** | 24GB | \~50 tok/s  | \~$2.10      |
| **RTX 3090** | 24GB | \~35 tok/s  | \~$1.10      |
| A100 40GB    | 40GB | \~80 tok/s  | \~$3.50      |
| A100 80GB    | 80GB | \~90 tok/s  | \~$4.00      |
| H100         | 80GB | \~120 tok/s | \~$6.00      |

**Best Value**: RTX 4090 offers the sweet spot of performance and cost for GLM-4.7-Flash.

\*Estimated Clore.ai marketplace prices

## Deploy with vLLM

### Install vLLM

```bash
pip install "vllm>=0.6.0"
# or latest
pip install git+https://github.com/vllm-project/vllm.git
```

### Single GPU Setup

```bash
vllm serve THUDM/glm-4-flash \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --max-model-len 32768 \
  --served-model-name glm-4.7-flash \
  --trust-remote-code
```

### Query the Server

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1", 
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[
        {"role": "system", "content": "You are an expert Python developer."},
        {"role": "user", "content": "Write a FastAPI app with async SQLAlchemy and JWT auth"}
    ],
    max_tokens=2048,
    temperature=0.7
)

print(response.choices[0].message.content)
```

## Deploy with SGLang

SGLang often provides better throughput for MoE models:

```bash
pip install "sglang[all]>=0.3.0"

# Launch server
python -m sglang.launch_server \
  --model-path THUDM/glm-4-flash \
  --port 30000 \
  --host 0.0.0.0 \
  --dtype float16 \
  --tp-size 1 \
  --context-length 32768
```
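
SGLang exposes an OpenAI-compatible API on the configured port, so the same client pattern used for vLLM works with only the base URL changed. A minimal sketch, assuming the server launched above is running on port 30000 and registers the model under its Hugging Face path:

```python
from openai import OpenAI

# Point the OpenAI client at the SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/glm-4-flash",  # assumption: SGLang serves the model under its path
    messages=[{"role": "user", "content": "Summarize the token bucket algorithm in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)
```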

## Deploy with Ollama

Simple setup for local development:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model (check the Ollama library for the exact GLM-4 tag;
# download size depends on the quantization behind the tag)
ollama pull glm4

# Run interactively
ollama run glm4

# Start the API server (skip if Ollama is already running as a service)
ollama serve
```

Then query via REST API:

```python
import requests

response = requests.post('http://localhost:11434/api/generate',
    json={
'model': 'glm4',
        'prompt': 'Explain the MoE architecture in GLM-4.7-Flash',
        'stream': False
    }
)

print(response.json()['response'])
```

## Docker Template

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install Python 3.10
RUN apt-get update && apt-get install -y python3.10 python3-pip curl

# Install vLLM
RUN pip install "vllm>=0.6.0" transformers

# Pre-download model (optional)
# RUN python3 -c "from transformers import AutoModel; AutoModel.from_pretrained('THUDM/glm-4-flash', trust_remote_code=True)"

EXPOSE 8000

CMD ["vllm", "serve", "THUDM/glm-4-flash", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "1", \
     "--dtype", "float16", \
     "--trust-remote-code"]
```

Build and run:

```bash
docker build -t glm-4.7-flash .
docker run --gpus all -p 8000:8000 glm-4.7-flash
```
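
Once the container is up, a quick smoke test confirms the OpenAI-compatible endpoint is reachable. A minimal sketch, assuming the port mapping above:

```python
import requests

# Ask the vLLM server which models it has loaded; a successful response
# means the container booted and the weights finished loading.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```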

## Code Generation Example

GLM-4.7-Flash excels at complex code generation:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[
        {"role": "user", 
         "content": """Create a Python class for a rate limiter with:
- Token bucket algorithm
- Async/await support  
- Redis backend
- Decorator for function rate limiting
- Proper error handling"""}
    ],
    max_tokens=2048,
    temperature=0.3
)

print(response.choices[0].message.content)
```

## Tips for Clore.ai Users

* **Memory Optimization**: Use `--dtype float16` to reduce VRAM usage. For 16GB GPUs, add `--max-model-len 16384` to limit context (see the sketch after this list).
* **Batch Processing**: Increase `--max-num-seqs` for higher throughput when serving multiple requests.
* **Quantization**: For RTX 3060/4060 (12GB), use AWQ or GPTQ quantized versions for \~6GB VRAM usage.
* **Preemption**: GLM-4.7-Flash handles interruptions gracefully — good for preemptible Clore.ai instances.
* **Context Length**: Default 128K context may be overkill. Set `--max-model-len 32768` for most applications.
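
As a rough illustration of how the memory-related tips above map onto code, here is a sketch using vLLM's offline Python API; the server flags have direct keyword-argument equivalents. The model path and the specific values are assumptions to tune for your GPU:

```python
from vllm import LLM, SamplingParams

# Memory-conscious single-GPU configuration (illustrative values).
llm = LLM(
    model="THUDM/glm-4-flash",    # assumption: same checkpoint as the serve examples
    trust_remote_code=True,
    dtype="float16",              # mirrors --dtype float16
    max_model_len=16384,          # mirrors --max-model-len 16384
    max_num_seqs=32,              # mirrors --max-num-seqs for batched requests
    gpu_memory_utilization=0.90,  # cap VRAM usage, leaving headroom for the CUDA context
)

params = SamplingParams(max_tokens=512, temperature=0.3)
outputs = llm.generate(["Write a Python function that validates an email address."], params)
print(outputs[0].outputs[0].text)
```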

## Troubleshooting

| Issue              | Solution                                                    |
| ------------------ | ----------------------------------------------------------- |
| `OutOfMemoryError` | Reduce `--max-model-len` or use `--dtype float16`           |
| Slow model loading | Pre-cache with `huggingface-cli download THUDM/glm-4-flash` (see the sketch below) |
| Import errors      | Update transformers: `pip install "transformers>=4.40.0"`   |
| Poor performance   | Enable Flash Attention: `pip install flash-attn`            |
| Connection refused | Check firewall: `ufw allow 8000`                            |
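
For the slow-loading case, the `huggingface_hub` Python API offers an equivalent to the CLI command in the table, letting you pre-cache the weights before launching the server (a sketch; repo id taken from the table above):

```python
from huggingface_hub import snapshot_download

# Download all model files into the local Hugging Face cache ahead of time,
# so the first server launch does not block on the network.
snapshot_download("THUDM/glm-4-flash")
```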

## Alternative Models

If GLM-4.7-Flash doesn't fit your needs:

* **Qwen2.5-Coder-7B**: Better pure coding, smaller footprint
* **CodeQwen1.5-7B**: Chinese + English coding specialist
* **GLM-4-9B**: Larger sibling with better reasoning
* **DeepSeek-V3**: 671B MoE for ultimate performance (multi-GPU)

## Resources

* [GLM-4-Flash on Hugging Face](https://huggingface.co/THUDM/glm-4-flash)
* [GLM-4 Technical Report](https://arxiv.org/abs/2406.12793)
* [vLLM Documentation](https://docs.vllm.ai/)
* [SGLang GitHub](https://github.com/sgl-project/sglang)
* [Zhipu AI Platform](https://open.bigmodel.cn/)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/glm-47-flash.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
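
For example, a minimal sketch of issuing such a query from Python (the question text is illustrative):

```python
import requests

# Query this documentation page with a natural-language question.
resp = requests.get(
    "https://docs.clore.ai/guides/language-models/glm-47-flash.md",
    params={"ask": "What quantization options let GLM-4.7-Flash run on a 12GB GPU?"},
    timeout=30,
)
print(resp.text)
```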
