# Gemma 4 (26B MoE, 4B active)

{% hint style="info" %}
**Status (April 2026):** Gemma 4 was released on **April 2, 2026** by Google as the next generation of the Gemma open-weight family. Two variants ship: a **31B dense** model (`google/gemma-4-31b-it`) and a **26B MoE with \~4B active parameters** (`google/gemma-4-26b-it`). Both are published under the standard **Gemma terms of use** at [huggingface.co/google/gemma-4-26b-it](https://huggingface.co/google/gemma-4-26b-it) and [huggingface.co/google/gemma-4-31b-it](https://huggingface.co/google/gemma-4-31b-it).
{% endhint %}

Gemma 4 is Google's first MoE entry in the Gemma line and the first Gemma release that climbed into the top of the LMSYS Arena (vendor reports **#3 overall at release**, edging out several closed models on factuality and instruction-following). The headline number is the MoE variant: **26B total parameters, \~4B active per token**, which gives you near-frontier instruction-following at the inference cost of a small dense model.

For Clore.ai users the practical takeaway is simple: the 26B MoE fits on a single **RTX 4090 (24GB)** with 4-bit quantization (\~10–15 tok/s) and hits production-grade throughput on a single **H100 80GB** with FP8 (\~40+ tok/s), putting Gemma-quality instruction-following within reach at roughly $0.5–2/day on the marketplace. The 31B dense variant is the larger and more expensive sibling, needing 2× RTX 4090 (FP8, tensor-parallel) or 1× H100 to serve.
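
To put the throughput and rental figures above on one scale, the arithmetic below converts a steady decode rate and a daily price into a cost per million generated tokens. It is a rough sketch: it assumes a single request stream running around the clock, and the pairing of each price with a specific GPU is illustrative rather than a quoted marketplace rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_day(tokens_per_second: float, utilization: float = 1.0) -> float:
    """Tokens generated in 24 hours at a steady decode rate."""
    return tokens_per_second * SECONDS_PER_DAY * utilization


def cost_per_million_tokens(
    price_per_day_usd: float, tokens_per_second: float, utilization: float = 1.0
) -> float:
    """Rental cost attributed to each million generated tokens."""
    return price_per_day_usd / tokens_per_day(tokens_per_second, utilization) * 1_000_000


if __name__ == "__main__":
    # (label, tok/s from this guide, assumed daily price in USD)
    scenarios = [
        ("RTX 4090, 4-bit", 10, 0.5),
        ("H100, FP8", 40, 2.0),
    ]
    for label, tps, price in scenarios:
        cost = cost_per_million_tokens(price, tps)
        print(f"{label}: {tokens_per_day(tps):,.0f} tok/day, ${cost:.2f} per 1M tokens")
```

Batched serving raises aggregate throughput well above the single-stream rate, so real per-token cost under concurrent load is usually lower than this estimate.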

## Key Features

* **MoE architecture (26B variant)** — 26B total parameters, \~4B activated per token; pay 4B-class inference cost for 26B-class quality
* **Dense fallback (31B variant)** — for teams that prefer the predictability and tooling maturity of dense inference
* **128K context window** — long-document Q\&A, RAG over mid-sized codebases, multi-turn agent loops
* **Strong instruction-following** — Gemma 4 is explicitly tuned for tool use, structured output, and faithful constraint following
* **Multilingual** — Gemma 3's multilingual coverage is carried forward, and Google evaluates it on an expanded non-English benchmark suite
* **Open weights, Gemma terms** — free for most commercial use; review the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy) before shipping
* **First-class tooling** — supported out of the box in vLLM, SGLang, Ollama, and Hugging Face Transformers

## Choose Your Variant

| Variant                                  | Total Params | Active         | Context | Recommended Quant | Recommended Clore GPU                                                                                                   |
| ---------------------------------------- | ------------ | -------------- | ------- | ----------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Gemma 4 26B MoE** (`gemma-4-26b-it`)   | 26B          | \~4B per token | 128K    | 4-bit GPTQ (24GB) or FP8 (40GB+) | 1× [RTX 4090](https://clore.ai/rent-4090.html?utm_source=docs\&utm_medium=guide\&utm_campaign=gemma4) (24GB, 4-bit) |
| **Gemma 4 31B Dense** (`gemma-4-31b-it`) | 31B          | 31B (all)      | 128K    | FP8 or BF16       | 1× [H100](https://clore.ai/rent-h100.html?utm_source=docs\&utm_medium=guide\&utm_campaign=gemma4) (80GB, BF16)          |

{% hint style="success" %}
**Practical pick:** For 90% of single-GPU deployments, go with the **Gemma 4 26B MoE**: a 4-bit build on a 4090 (\~10–15 tok/s) or FP8 on an H100 (\~40+ tok/s). You get the headline Arena quality without the latency cost of dense 31B inference.
{% endhint %}

***

## Server Requirements

| Component  | 26B MoE (4-bit, 4090) | 26B MoE (FP8, H100) | 31B Dense (BF16, H100) |
| ---------- | --------------------- | ------------------- | ---------------------- |
| GPU VRAM   | 24GB                  | 80GB                | 80GB                   |
| System RAM | 32GB                  | 64GB                | 64GB                   |
| Disk       | 60GB NVMe             | 80GB NVMe           | 90GB NVMe              |
| Network    | 100 Mbps for HF pull  | 1 Gbps preferred    | 1 Gbps preferred       |
| CUDA       | 12.1+                 | 12.4+               | 12.4+                  |
| Driver     | 550+                  | 555+                | 555+                   |

Plan for an extra \~20% VRAM headroom on top of the static weight footprint to cover KV cache at long contexts. Setting `--gpu-memory-utilization 0.90` in vLLM is a good default.
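
The headroom rule can be checked with simple arithmetic. The sketch below estimates the static weight footprint from parameter count and precision, then tests whether it fits a card with 20% headroom. It assumes decimal gigabytes and ignores quantization scales, activations, and runtime overhead, so treat the result as a lower bound.

```python
# Approximate bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Static weight footprint in GB (1e9 params * bytes per param / 1e9)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, vram_gb: float, headroom: float = 0.20) -> bool:
    """True if weights plus the headroom fraction fit in the given VRAM."""
    return weight_gb(params_billions, precision) * (1 + headroom) <= vram_gb


if __name__ == "__main__":
    checks = [
        ("26B MoE, 4-bit, 1x RTX 4090", 26, "int4", 24),
        ("26B MoE, FP8, 1x RTX 4090", 26, "fp8", 24),
        ("26B MoE, FP8, 1x H100", 26, "fp8", 80),
        ("31B Dense, BF16, 1x H100", 31, "bf16", 80),
        ("31B Dense, FP8, 2x RTX 4090", 31, "fp8", 48),
    ]
    for label, params, precision, vram in checks:
        verdict = "fits" if fits(params, precision, vram) else "does not fit"
        print(f"{label}: {weight_gb(params, precision):.0f} GB weights, {verdict}")
```

Note that an MoE model still has to hold all 26B parameters in memory; only the per-token compute scales with the \~4B active parameters.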

***

## Quick Deploy on CLORE.AI

The fastest path: rent a single GPU, pull the standard `vllm/vllm-openai` image, and serve the model with an OpenAI-compatible API. Below is the docker-compose layout used by the rest of these guides; adjust the model name and tensor-parallel size based on the variant you picked above. The FP8 configuration in Option A assumes a card with 40GB or more of VRAM. On a 24GB RTX 4090, point `--model` at a 4-bit build instead.

### Option A — Gemma 4 26B MoE on a single GPU (vLLM, FP8)

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model google/gemma-4-26b-it
      --quantization fp8
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --served-model-name gemma-4-26b
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: "8gb"

volumes:
  hf_cache:
```

```bash
# Bring it up
HF_TOKEN=hf_xxx docker compose up -d

# Tail logs while the weights download
docker compose logs -f vllm
```
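
Instead of watching the logs by hand, a script can poll the server until the model is listed. The sketch below queries the OpenAI-compatible `/v1/models` endpoint and waits for the served model name to appear. The helper names, retry count, and delay are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.request


def list_models(base_url: str, timeout: float = 5.0) -> list:
    """Return the model ids reported by an OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(
    base_url: str,
    model_name: str,
    fetch=list_models,
    attempts: int = 60,
    delay_s: float = 10.0,
    sleep=time.sleep,
) -> bool:
    """Poll until model_name is served, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if model_name in fetch(base_url):
                return True
        except OSError:
            # Connection refused or timed out: the server is still starting.
            pass
        sleep(delay_s)
    return False


if __name__ == "__main__":
    ready = wait_until_ready("http://localhost:8000", "gemma-4-26b")
    print("server ready" if ready else "server did not come up in time")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a live server or real delays.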

{% hint style="info" %}
**License gating:** Gemma models on Hugging Face require accepting Google's terms once per account. Visit the model page in a browser, click "Acknowledge license", then export `HF_TOKEN` so the container can pull the weights.
{% endhint %}

### Option B — Gemma 4 31B Dense on H100 (vLLM, BF16)

```bash
docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model google/gemma-4-31b-it \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --served-model-name gemma-4-31b
```

### Option C — Gemma 4 31B Dense on 2× RTX 4090 (FP8, tensor-parallel)

```bash
docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model google/gemma-4-31b-it \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --served-model-name gemma-4-31b
```

### Option D — Quick local testing with Ollama

For laptop-class experimentation, Ollama wraps the GGUF community builds. Expect quants to land a few days after the official release.

```bash
# Once a community GGUF is published
ollama pull gemma4:26b-moe-q4_k_m
ollama run gemma4:26b-moe-q4_k_m

# OpenAI-compatible API on :11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b-moe-q4_k_m",
    "messages": [{"role":"user","content":"Summarize the MoE routing approach in two sentences."}]
  }'
```

See the [Ollama guide](/guides/language-models/ollama.md) for general setup, model management, and persistence tips.

***

## Usage Examples

The vLLM container exposes an OpenAI-compatible API on `:8000`. Anything that speaks the OpenAI chat-completions schema works directly.

### Curl chat completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b",
    "messages": [
      {"role": "system", "content": "You are a careful technical writer."},
      {"role": "user", "content": "Explain MoE routing in three sentences without using analogies."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

### Python (OpenAI client)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="gemma-4-26b",
    messages=[
        {"role": "system", "content": "You answer in plain text, no markdown."},
        {"role": "user", "content": "Give me a 5-bullet code review checklist for a Go HTTP handler."},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```

### Streaming responses

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="gemma-4-26b",
    messages=[{"role": "user", "content": "Write a haiku about distributed inference."}],
    stream=True,
    max_tokens=128,
)

for chunk in stream:
    if not chunk.choices:  # some servers send a final usage-only chunk
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```
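
The tok/s figures quoted in this guide can be sanity-checked on your own rental by timing a stream. The sketch below measures time to first chunk and the chunk rate after it. It assumes each streamed chunk carries roughly one token, which is common but not guaranteed, and `measure_stream` is a hypothetical helper rather than part of the OpenAI client.

```python
import time
from typing import Callable, Iterable


def measure_stream(deltas: Iterable, clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a stream of text deltas; empty or None deltas are skipped."""
    start = clock()
    first = None
    last = None
    count = 0
    for delta in deltas:
        if not delta:
            continue
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return {"chunks": 0, "ttft_s": None, "chunks_per_s": None}
    decode_time = last - first
    # Rate over the decode phase only, excluding the wait for the first chunk.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"chunks": count, "ttft_s": first - start, "chunks_per_s": rate}
```

With the streaming client above, a call would look like `measure_stream(c.choices[0].delta.content for c in stream if c.choices)`.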

### Hugging Face Transformers (offline use)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/gemma-4-26b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights via bitsandbytes (pip install bitsandbytes) to fit the MoE on a single 24GB card
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Refactor this Python function for readability:\n\ndef f(x): return [i for i in x if i%2==0 and i>10]"},
]
# add_generation_prompt=True appends the assistant turn header so the model writes a reply
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

***

## Performance Tips

* **Use FP8 on Hopper.** On H100, FP8 weights take roughly half the memory of BF16, typically with little quality loss on instruction-following tasks. Pass `--quantization fp8` to vLLM.
* **Use 4-bit GPTQ on Ada (RTX 4090).** For the MoE variant on a single 4090, a community GPTQ 4-bit build is the practical sweet spot — expect \~10–15 tok/s. Ollama's Q4\_K\_M GGUF builds give similar quality with simpler ops.
* **Tensor parallelism for 31B Dense.** Across 2× RTX 4090, pass `--tensor-parallel-size 2`. Pin the context to what you actually need (`--max-model-len 16384`) — every doubling of context roughly doubles the KV cache footprint.
* **Expert parallelism for the MoE.** On multi-GPU setups for the 26B MoE, vLLM's `--enable-expert-parallel` can give a meaningful throughput bump at higher batch sizes. It's overkill for single-GPU.
* **Chunked prefill for long contexts.** When pushing past 32K, add `--enable-chunked-prefill` to vLLM. This keeps prefill latency manageable and prevents stalls on the decode path.
* **Pre-pull weights.** For ephemeral Clore rentals, mount a persistent volume at `/root/.cache/huggingface` so subsequent runs skip the 50–60GB download.
* **Pick the right serving backend.** vLLM is the safe default. SGLang often wins on Hopper for high-concurrency workloads; see the [vLLM guide](/guides/language-models/vllm.md) for the broader comparison.
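
The context-length advice above follows from how the KV cache scales. The sketch below uses the standard estimate for a transformer with grouped-query attention. The layer count, KV-head count, and head dimension are placeholders, not Gemma 4's published configuration, so read the real values from the model's `config.json`. It also ignores sliding-window or other cache-saving attention variants.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes for an FP16/BF16 cache
) -> int:
    """Estimate KV cache size: one key and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Placeholder architecture values, chosen only for illustration.
    layers, kv_heads, head_dim = 48, 8, 128
    for context in (16384, 32768, 65536, 131072):
        gib = kv_cache_bytes(layers, kv_heads, head_dim, context) / 1024**3
        print(f"context {context:>6}: {gib:.1f} GiB per sequence")
```

The estimate is linear in both `seq_len` and `batch_size`, which is why capping `--max-model-len` frees memory for more concurrent sequences.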

***

## Benchmarks

{% hint style="warning" %}
**Vendor-published numbers — independent verification pending.** The figures below come from Google's April 2, 2026 launch materials. Independent reproductions on private evals are still rolling in. Treat the Arena ranking and factuality scores as directional, not absolute.
{% endhint %}

| Benchmark                       | Gemma 4 26B MoE                               | Gemma 4 31B Dense                        | Reference       |
| ------------------------------- | --------------------------------------------- | ---------------------------------------- | --------------- |
| LMSYS Arena (overall)           | #3 at release                                 | \~#5 at release                          | vendor-reported |
| Instruction-following (IFEval)  | vendor reports strong gains over Gemma 3      | vendor reports strong gains over Gemma 3 | vendor-reported |
| Factuality (SimpleQA / similar) | beats several closed models per Google        | comparable                               | vendor-reported |
| Multilingual (Global-MMLU)      | vendor reports parity with much larger models | best Gemma score to date                 | vendor-reported |

Gemma 4's positioning argument is "more useful per active parameter," not "raw HumanEval king." If you need pure code generation, compare against [GLM-5.1](/guides/language-models/glm-5-1.md) (frontier coding) or [Qwen3.5](/guides/language-models/qwen35.md) (best 35B-class dense). If you need long-horizon agentic loops, GLM-5.1 is still the sharper tool.

***

## Troubleshooting

| Issue                                          | Solution                                                                                                                                                                                  |
| ---------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
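| `OutOfMemoryError` loading the 26B MoE on 24GB | FP8 weights alone are about 26GB (26B parameters × 1 byte), so they cannot fit on a 24GB card. Use a 4-bit build instead (GPTQ in vLLM, or `load_in_4bit=True` in a Transformers `BitsAndBytesConfig`) and drop `--max-model-len` to 16384 to shrink the KV cache. |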
| `OutOfMemoryError` loading 31B Dense on H100   | BF16 at 32K context is right at the edge on 80GB. Lower `--max-model-len` to 16384 or move to FP8.                                                                                        |
| Hugging Face download fails with 403           | You have not accepted the Gemma license on the model page. Open the URL in a browser, acknowledge the terms, then re-pull with a token that has `read` scope.                             |
| Very slow first token                          | Cold weight load (\~30–60s on first request) plus prefill on long inputs. Run a dummy warm-up request after the server starts. Add `--enable-chunked-prefill` for long-context workloads. |
| Garbled output / repetition loops              | Check the chat template — `tokenizer.apply_chat_template` is required; do not concatenate `system`+`user` strings manually. Set `temperature=0.7` and `top_p=0.95` for general use.       |
| Tool / JSON output unreliable                  | Use vLLM's `--guided-decoding-backend` or pass a JSON schema via `response_format`. The model follows constraints well but unstructured prompts will still drift.                         |
| `unsupported quantization` error in vLLM       | Update to a vLLM version released after April 2026 (`pip install -U vllm --pre`). The Gemma 4 architecture needs the latest config parsers.                                               |
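
For the structured-output row above, a request that carries a JSON schema in `response_format` might look like the sketch below. The schema, field names, and helper functions are hypothetical, and whether a given vLLM build honors the `json_schema` response format is an assumption to verify against your installed version.

```python
import json

# Hypothetical schema for illustration: a two-field ticket triage record.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        "summary": {"type": "string"},
    },
    "required": ["severity", "summary"],
    "additionalProperties": False,
}


def build_request(user_text: str, model: str = "gemma-4-26b") -> dict:
    """Build chat-completions keyword arguments that request schema-constrained JSON."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 0.0,
        "max_tokens": 256,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "ticket", "schema": TICKET_SCHEMA},
        },
    }


def parse_reply(content: str) -> dict:
    """Parse the reply and confirm the required keys are present."""
    data = json.loads(content)
    missing = [key for key in TICKET_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

With the OpenAI client from the usage examples, the call would be `client.chat.completions.create(**build_request("Disk is full on node 3"))`, followed by `parse_reply(resp.choices[0].message.content)`. Validating on the client side is still worthwhile, since a reply truncated by `max_tokens` can be invalid JSON even under constrained decoding.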

***

## FAQ

**Gemma 4 vs Llama 4?** Different shapes for different jobs. [Llama 4 Scout](/guides/language-models/llama4.md) is 109B/17B-active with a headline 10M context — great when you need to dump huge inputs at the model. Gemma 4 26B MoE is much smaller in total params (26B vs 109B), activates fewer params per token (4B vs 17B), and is tuned harder for instruction-following and factuality. For tight VRAM budgets and quality-per-parameter, Gemma 4 wins. For absurd context length, Llama 4 Scout wins.

**How much VRAM for Gemma 4 26B MoE?**

* 4-bit GGUF / GPTQ: fits in **24GB** (single RTX 4090), \~10–15 tok/s.
* FP8: comfortable on **40GB**, fast on **80GB** (H100) at \~40+ tok/s.
* BF16 full: \~55GB of weights plus KV cache — plan for an **80GB** card.

**Can I use Gemma 4 commercially?** Yes, under the standard Gemma terms of use. Review the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy) before deploying — there are restrictions around specific use cases (deception, generating CSAM, illegal activity), and you must pass downstream license notices to your users. It is not an Apache 2.0 / MIT model — it is open-weight under a usage policy. If you need a fully unrestricted license, [Qwen3.5](/guides/language-models/qwen35.md) (Apache 2.0) or [GLM-5.1](/guides/language-models/glm-5-1.md) (MIT) are alternatives.

**Gemma 4 vs DeepSeek-V4?** [DeepSeek-V4](/guides/language-models/deepseek-v4.md) is a different weight class — \~1T params, multimodal, 1M context. Use DeepSeek-V4 when you need raw capability and have a serious GPU rack. Use Gemma 4 26B MoE when you want strong instruction-following on a **single GPU** and care about \~$1–2/day rentals on Clore. Gemma 4 is the "best model that fits on a 4090" candidate; DeepSeek-V4 is the "I will pay for 8× H200" candidate.

**Does Gemma 4 support vision / multimodal inputs?** Gemma 4's headline release is text-only instruction-tuned (`*-it`). Google has historically followed text releases with PaliGemma vision variants — track [huggingface.co/google](https://huggingface.co/google) for updates. For an image-capable open model today, look at [Kimi K2.5](/guides/language-models/kimi-k2.md) or [Llama 4 Scout](/guides/language-models/llama4.md).

***

## Related Guides

* [vLLM](/guides/language-models/vllm.md) — production serving backend used in this guide
* [Ollama](/guides/language-models/ollama.md) — quickest path to local testing with GGUF builds
* [Llama 4](/guides/language-models/llama4.md) — Meta's MoE alternative with 10M context
* [GLM-5.1](/guides/language-models/glm-5-1.md) — frontier-class coding MoE (744B/40B-active) when Gemma's size class is not enough
* [Qwen3.5](/guides/language-models/qwen35.md) — Apache-2.0 35B dense, the other strong single-GPU option
* [Gemma 3](/guides/language-models/gemma3.md) — the predecessor generation, useful baseline for migration

### Links

* [Gemma 4 26B MoE on Hugging Face](https://huggingface.co/google/gemma-4-26b-it)
* [Gemma 4 31B Dense on Hugging Face](https://huggingface.co/google/gemma-4-31b-it)
* [Gemma terms of use](https://ai.google.dev/gemma/terms)
* [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy)
* [vLLM docs](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)


