# GLM-4.7-Flash

> GLM-4.7-Flash is a **30-billion-parameter Mixture-of-Experts** language model by Zhipu AI that activates only 3B parameters per token. It delivers exceptional performance on coding and reasoning tasks, achieving 59.2% on SWE-bench while requiring only 10-12GB VRAM for FP16 inference. Released under the **MIT license**, it's an ideal choice for developers seeking frontier model quality at affordable single-GPU costs.

## At a Glance

* **Model Size**: 30B total / 3B active parameters (MoE)
* **License**: MIT (fully commercial)
* **Context**: 128K tokens
* **Performance**: 59.2% SWE-bench, 75.4% HumanEval
* **VRAM**: \~10-12GB FP16, \~6GB INT8
* **Speed**: \~45-60 tok/s on RTX 4090

## Why GLM-4.7-Flash?

**Efficient Performance**: GLM-4.7-Flash punches above its weight class. Despite using only 3B active parameters, it outperforms many 70B+ dense models on coding benchmarks. The MoE architecture provides 30B model quality at 7B model inference cost.

**Single-GPU Friendly**: Unlike massive models requiring multi-GPU setups, GLM-4.7-Flash runs comfortably on a single RTX 4090 or A100 40GB. This makes it perfect for development, fine-tuning, and cost-effective production deployments.

**Coding Specialist**: With 59.2% SWE-bench performance, GLM-4.7-Flash excels at software engineering tasks — code generation, debugging, refactoring, and technical documentation. It understands 20+ programming languages with deep context awareness.

**MIT Licensed**: No usage restrictions. Deploy commercially, fine-tune, or modify without licensing concerns. The complete weights and training recipes are freely available.

## GPU Recommendations

| GPU          | VRAM | Performance | Daily Cost\* |
| ------------ | ---- | ----------- | ------------ |
| **RTX 4090** | 24GB | \~50 tok/s  | \~$2.10      |
| **RTX 3090** | 24GB | \~35 tok/s  | \~$1.10      |
| A100 40GB    | 40GB | \~80 tok/s  | \~$3.50      |
| A100 80GB    | 80GB | \~90 tok/s  | \~$4.00      |
| H100         | 80GB | \~120 tok/s | \~$6.00      |

**Best Value**: RTX 4090 offers the sweet spot of performance and cost for GLM-4.7-Flash.

\*Estimated Clore.ai marketplace prices
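
To turn the table into a per-token figure, divide the daily cost by the tokens a card can generate in a day. A quick back-of-the-envelope calculation using the RTX 4090 row:

```python
# Back-of-the-envelope: cost per million tokens on an RTX 4090,
# assuming continuous generation at the table's estimated rate.
daily_cost_usd = 2.10       # estimated marketplace price per day
tokens_per_second = 50      # approximate throughput from the table

tokens_per_day = tokens_per_second * 86_400        # ~4.32M tokens/day
cost_per_million = daily_cost_usd * 1_000_000 / tokens_per_day
print(f"~${cost_per_million:.2f} per 1M tokens")   # ~$0.49
```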

## Deploy with vLLM

### Install vLLM

```bash
pip install "vllm>=0.6.0"
# or install the latest development build
pip install git+https://github.com/vllm-project/vllm.git
```

### Single GPU Setup

```bash
vllm serve THUDM/glm-4-flash \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --max-model-len 32768 \
  --served-model-name glm-4.7-flash \
  --trust-remote-code
```
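
Once the server is up, a quick smoke test confirms the model is being served (these are the standard vLLM OpenAI-compatible endpoints):

```bash
# Should list "glm-4.7-flash" as an available model
curl http://localhost:8000/v1/models

# Minimal chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```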

### Query the Server

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1", 
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[
        {"role": "system", "content": "You are an expert Python developer."},
        {"role": "user", "content": "Write a FastAPI app with async SQLAlchemy and JWT auth"}
    ],
    max_tokens=2048,
    temperature=0.7
)

print(response.choices[0].message.content)
```
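
For long generations, streaming avoids waiting on the full response; the same client supports it with `stream=True`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Explain Python's GIL in three sentences."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```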

## Deploy with SGLang

SGLang often provides better throughput for MoE models:

```bash
pip install "sglang[all]>=0.3.0"

# Launch server
python -m sglang.launch_server \
  --model-path THUDM/glm-4-flash \
  --port 30000 \
  --host 0.0.0.0 \
  --dtype float16 \
  --tp-size 1 \
  --context-length 32768
```
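
The server exposes an OpenAI-compatible API, so the same client code works with the base URL pointed at port 30000. Here the model name is assumed to match the served path; check the server log or `/v1/models` if it differs:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/glm-4-flash",  # assumed to match --model-path
    messages=[{"role": "user", "content": "Write an iterative binary search in Python"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```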

## Deploy with Ollama

Simple setup for local development:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull model (will download ~18GB)
ollama pull glm4:7b-chat

# Run interactively
ollama run glm4:7b-chat

# Start the API server (skip if the Ollama service is already running)
ollama serve
```

Then query via REST API:

```python
import requests

response = requests.post('http://localhost:11434/api/generate',
    json={
        'model': 'glm4:7b-chat',
        'prompt': 'Explain the MoE architecture in GLM-4.7-Flash',
        'stream': False
    }
)

print(response.json()['response'])
```
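
Ollama also exposes a chat endpoint that takes an explicit message history, which is more convenient for multi-turn use:

```python
import requests

# /api/chat accepts role-tagged messages, unlike the single-prompt /api/generate
response = requests.post('http://localhost:11434/api/chat',
    json={
        'model': 'glm4:7b-chat',
        'messages': [
            {'role': 'system', 'content': 'You are a concise coding assistant.'},
            {'role': 'user', 'content': 'Show a Python context manager that times a block.'}
        ],
        'stream': False
    }
)

print(response.json()['message']['content'])
```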

## Docker Template

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install Python 3.10
RUN apt-get update && apt-get install -y python3.10 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM
RUN pip install "vllm>=0.6.0" transformers

# Pre-download model (optional)
# RUN python3 -c "from transformers import AutoModel; AutoModel.from_pretrained('THUDM/glm-4-flash', trust_remote_code=True)"

EXPOSE 8000

CMD ["vllm", "serve", "THUDM/glm-4-flash", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "1", \
     "--dtype", "float16", \
     "--trust-remote-code"]
```

Build and run:

```bash
docker build -t glm-4.7-flash .
docker run --gpus all -p 8000:8000 glm-4.7-flash
```
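
To avoid re-downloading the weights on every container start, mount your local Hugging Face cache into the container (the path below assumes the default cache location):

```bash
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  glm-4.7-flash
```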

## Code Generation Example

GLM-4.7-Flash excels at complex code generation:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[
        {"role": "user", 
         "content": """Create a Python class for a rate limiter with:
- Token bucket algorithm
- Async/await support  
- Redis backend
- Decorator for function rate limiting
- Proper error handling"""}
    ],
    max_tokens=2048,
    temperature=0.3
)

print(response.choices[0].message.content)
```

## Tips for Clore.ai Users

* **Memory Optimization**: Use `--dtype float16` to reduce VRAM usage. For 16GB GPUs, add `--max-model-len 16384` to limit context.
* **Batch Processing**: Increase `--max-num-seqs` for higher throughput when serving multiple requests.
* **Quantization**: For RTX 3060/4060 (12GB), use AWQ or GPTQ quantized versions for \~6GB VRAM usage; see the sketch after this list.
* **Preemption**: The small footprint makes restarts after an interruption fast, a good fit for preemptible Clore.ai instances.
* **Context Length**: Default 128K context may be overkill. Set `--max-model-len 32768` for most applications.
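
A rough sketch of how these flags combine for a ~12GB card. The AWQ repo name below is a placeholder; check Hugging Face for an actual quantized upload of this model:

```bash
# Hypothetical AWQ checkpoint -- substitute a real quantized repo name
vllm serve TheBloke/glm-4-flash-AWQ \
  --quantization awq \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --trust-remote-code
```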

## Troubleshooting

| Issue              | Solution                                                    |
| ------------------ | ----------------------------------------------------------- |
| `OutOfMemoryError` | Reduce `--max-model-len` or use `--dtype float16`           |
| Slow model loading | Pre-cache with `huggingface-cli download THUDM/glm-4-flash` |
| Import errors      | Update transformers: `pip install transformers>=4.40.0`     |
| Poor performance   | Enable Flash Attention: `pip install flash-attn`            |
| Connection refused | Check firewall: `ufw allow 8000`                            |
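
When debugging memory or performance issues, first check what the runtime actually sees:

```bash
# Confirm the GPU is visible and check VRAM headroom
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv

# Confirm CUDA is available inside the serving environment
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```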

## Alternative Models

If GLM-4.7-Flash doesn't fit your needs:

* **Qwen2.5-Coder-7B**: Better pure coding, smaller footprint
* **CodeQwen1.5-7B**: Chinese + English coding specialist
* **GLM-4-9B**: Larger sibling with better reasoning
* **DeepSeek-V3**: 671B MoE for ultimate performance (multi-GPU)

## Resources

* [GLM-4-Flash on Hugging Face](https://huggingface.co/THUDM/glm-4-flash)
* [GLM-4 Technical Report](https://arxiv.org/abs/2406.12793)
* [vLLM Documentation](https://docs.vllm.ai/)
* [SGLang GitHub](https://github.com/sgl-project/sglang)
* [Zhipu AI Platform](https://open.bigmodel.cn/)
