# Gemma 3

Gemma 3, released in March 2025 by Google DeepMind, is built on the same research and technology as Gemini 2.0. Its headline result: **the 27B model outranks Llama 3.1 405B** on the LMArena human-preference leaderboard, a model 15 times its size. The 4B, 12B, and 27B models are natively multimodal (text and images, plus short videos) with a 128K context window; the 1B model is text-only with a 32K window. With quantization, the 27B model runs on a single RTX 4090.

## Key Features

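* **Punches way above its weight**: 27B outranks Llama 3.1 405B on LMArena, though it trails the larger Llama models on MMLU and HumanEval (see Benchmark Highlights)
* **Natively multimodal**: Text and image understanding built in, including short videos (4B, 12B, and 27B; the 1B model is text-only)
* **128K context window**: Process long documents, codebases, conversations (32K on the 1B model)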
* **Four sizes**: 1B, 4B, 12B, 27B — something for every GPU budget
* **QAT versions**: Quantization-Aware Training variants let 27B run on consumer GPUs
* **Wide framework support**: Ollama, vLLM, Transformers, Keras, JAX, PyTorch

## Model Variants

| Model           | Parameters | VRAM (Q4) | VRAM (FP16) | Best For                    |
| --------------- | ---------- | --------- | ----------- | --------------------------- |
| Gemma 3 1B      | 1B         | 1.5GB     | 3GB         | Edge, mobile, testing       |
| Gemma 3 4B      | 4B         | 4GB       | 9GB         | Budget GPUs, fast tasks     |
| Gemma 3 12B     | 12B        | 10GB      | 25GB        | Balanced quality/speed      |
| Gemma 3 27B     | 27B        | 18GB      | 54GB        | Best quality, production    |
| Gemma 3 27B QAT | 27B        | 14GB      | —           | Optimized for consumer GPUs |
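
The VRAM columns follow roughly from parameter count times bytes per weight. The sketch below shows that arithmetic, assuming weights dominate memory and ignoring the KV cache, activations, and runtime overhead, which is why the table's figures sit somewhat above the raw weight size. The function name `weight_memory_gb` is illustrative, not part of any library.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for name, size in [("1B", 1), ("4B", 4), ("12B", 12), ("27B", 27)]:
        fp16 = weight_memory_gb(size, 16)
        q4 = weight_memory_gb(size, 4)
        print(f"Gemma 3 {name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit (weights only)")
```

Treat the result as a lower bound when picking a GPU: leave headroom for context length and the serving framework.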

## Requirements

| Component | Gemma 3 4B | Gemma 3 27B (Q4) | Gemma 3 27B (FP16) |
| --------- | ---------- | ---------------- | ------------------ |
| GPU       | RTX 3060   | RTX 4090         | A100 80GB          |
| VRAM      | 6GB        | 24GB             | 54GB+              |
| RAM       | 16GB       | 32GB             | 64GB               |
| Disk      | 10GB       | 25GB             | 55GB               |
| CUDA      | 11.8+      | 11.8+            | 12.0+              |

**Recommended Clore.ai GPU**: RTX 4090 24GB (\~$0.5–2/day) for 27B quantized — the sweet spot

## Quick Start with Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run different sizes
ollama run gemma3:1b     # Tiny — 1.5GB VRAM
ollama run gemma3:4b     # Small — 4GB VRAM
ollama run gemma3:12b    # Medium — 10GB VRAM
ollama run gemma3:27b    # Large — 18-20GB VRAM (quantized)

# QAT version (quantization-aware trained, instruction-tuned)
ollama run gemma3:27b-it-qat
```

### Ollama API Server

```bash
ollama serve &

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:27b",
    "messages": [{"role": "user", "content": "Compare REST vs GraphQL for a new API"}]
  }'
```
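
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the Ollama server from the curl example is listening on `localhost:11434`; the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

# OpenAI-compatible endpoint, same URL as the curl example
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "gemma3:27b") -> urllib.request.Request:
    """Build a POST request carrying a single-turn chat payload."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "gemma3:27b") -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model), timeout=300) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Compare REST vs GraphQL for a new API"))` prints the model's answer. Because the endpoint follows the OpenAI chat format, an OpenAI-compatible client library pointed at the same base URL should also work.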

### Vision with Ollama

```bash
# Analyze an image: include the file path in the prompt
ollama run gemma3:27b "Describe this image in detail: ./photo.jpg"
```
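
Images can also be sent programmatically. The sketch below targets Ollama's native `/api/chat` endpoint, which accepts base64-encoded images in an `images` list on a message. It assumes a local server on the default port; the helper names are illustrative.

```python
import base64
import json
import urllib.request
from pathlib import Path

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint


def build_vision_payload(image_bytes: bytes, prompt: str, model: str = "gemma3:27b") -> dict:
    """Attach a base64-encoded image to a single user message."""
    return {
        "model": model,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
    }


def describe_image(path: str, prompt: str = "Describe this image in detail") -> str:
    """Send the image file at `path` to the local Ollama server and return the reply."""
    payload = build_vision_payload(Path(path).read_bytes(), prompt)
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)["message"]["content"]
```

Calling `describe_image("./photo.jpg")` returns the description as a string. Use a vision-capable size (4B or larger), since the 1B model is text-only.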

## vLLM Setup (Production)

```bash
pip install vllm

# Serve the 27B model in bf16 (weights alone take ~54GB, so this needs an 80GB-class GPU)
vllm serve google/gemma-3-27b-it \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# Serve with longer context on 2 GPUs
vllm serve google/gemma-3-27b-it \
  --tensor-parallel-size 2 \
  --max-model-len 65536

# Serve 4B for budget setups
vllm serve google/gemma-3-4b-it \
  --max-model-len 32768
```
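
`vllm serve` exposes an OpenAI-compatible API, on port 8000 by default. The sketch below streams a reply token by token using only the standard library. It assumes the 27B server from the first command is running locally; `iter_stream_text` and `stream_chat` are illustrative names.

```python
import json
import urllib.request
from typing import Iterable, Iterator

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumes vLLM's default port


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        choices = json.loads(data).get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(prompt: str, model: str = "google/gemma-3-27b-it") -> Iterator[str]:
    """POST a streaming chat request and yield the reply as it arrives."""
    payload = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    request = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        yield from iter_stream_text(resp)
```

A typical loop is `for piece in stream_chat("Hello"): print(piece, end="", flush=True)`. The parsing is kept separate from the network call so it can be checked without a running server.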

## HuggingFace Transformers

### Text Generation

```python
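import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/gemma-3-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit bitsandbytes quantization (requires `pip install bitsandbytes`); fits on a 24GB GPU
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

messages = [
    {"role": "user", "content": "Write a Python class for a binary search tree with insert, search, and delete methods"}
]

# add_generation_prompt=True appends the model-turn marker so the model answers the user
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, not the echoed prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))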
```

### Vision (Image Understanding)

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image

model_name = "google/gemma-3-27b-it"
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load image
image = Image.open("screenshot.png")

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What does this screenshot show? List all UI elements."}
    ]}
]

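inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the model-turn marker so the model starts its answer
    tokenize=True,               # return token ids and pixel values, not a prompt string
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, not the echoed prompt
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))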
```

## Docker Quick Start

```bash
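# HF_TOKEN must belong to an account that has accepted the Gemma license
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model google/gemma-3-27b-it \
  --max-model-len 8192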
```
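
The container takes a while to download and load the weights before it accepts requests. A small readiness poll avoids sending traffic too early. This sketch assumes the port mapping from the command above and uses the vLLM server's `/health` route; the function names and default timings are illustrative.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error status
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/health",
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Call `wait_until_ready()` right after `docker run` and start sending requests only once it returns `True`. The `probe` parameter exists so the loop can be exercised without a live server.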

## Benchmark Highlights

| Benchmark         | Gemma 3 27B    | Llama 3.1 70B | Llama 3.1 405B |
| ----------------- | -------------- | ------------- | -------------- |
| LMArena Elo       | 1354           | 1298          | 1337           |
| MMLU              | 75.6           | 79.3          | 85.2           |
| HumanEval         | 72.0           | 72.6          | 80.5           |
| VRAM (Q4)         | 18GB           | 40GB          | 200GB+         |
| **Cost on Clore** | **$0.5–2/day** | **$3–6/day**  | **$12–24/day** |

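On LMArena's conversational ranking, the 27B matches or exceeds 405B-class quality at roughly one-tenth of the VRAM, although the larger Llama models still lead on MMLU and HumanEval.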

## Tips for Clore.ai Users

* **27B QAT is the sweet spot**: Quantization-Aware Training means less quality loss than post-training quantization — run it on a single RTX 4090
* **Vision is free**: No extra setup needed — Gemma 3 understands images natively. Great for document parsing, screenshot analysis, chart reading
* **Start with short context**: Use `--max-model-len 8192` initially; increase only when needed to save VRAM
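* **4B for budget runs**: If you're on RTX 3060/3070 ($0.15–0.3/day), the 4B model is still a strong choice; Google reports it as competitive with the previous-generation Gemma 2 27B
* **Google auth not required**: No Google account is needed. On HuggingFace the weights are gated only by a license click-through, so accept the license once and log in with a HuggingFace token; Ollama pulls need no login at all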
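
The advice to start with a short context comes down to KV cache size, which grows linearly with sequence length. The sketch below gives an upper-bound estimate for a model that uses full attention in every layer. The layer and head counts in the example are hypothetical placeholders, not Gemma 3's actual configuration, and Gemma 3 interleaves sliding-window layers, so its real cache is smaller than this bound.

```python
def kv_cache_gb(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16 cache entries
) -> float:
    """Upper-bound KV cache size in GB for one sequence, full attention in every layer."""
    # Factor 2 covers the separate key and value tensors stored per layer
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bytes_per_value / 1e9


if __name__ == "__main__":
    # Hypothetical architecture, chosen only to illustrate the scaling
    config = dict(num_layers=48, num_kv_heads=8, head_dim=128)
    for seq_len in (4096, 8192, 32768, 131072):
        print(f"{seq_len:>7} tokens: ~{kv_cache_gb(seq_len, **config):.2f} GB")
```

Doubling `--max-model-len` doubles this bound, which is why trimming context is usually the quickest fix for out-of-memory errors.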

## Troubleshooting

| Issue                        | Solution                                                                   |
| ---------------------------- | -------------------------------------------------------------------------- |
| `OutOfMemoryError` on 27B    | Use QAT version or reduce `--max-model-len` to 4096                        |
| Vision not working in Ollama | Update Ollama to latest: `curl -fsSL https://ollama.com/install.sh \| sh`  |
| Slow generation speed        | Check you're using bfloat16, not float32. Use `--dtype bfloat16`           |
| Model outputs garbage        | Ensure you're using the `-it` (instruct-tuned) variant, not the base model |
| Download 403 error           | Accept the Gemma license at <https://huggingface.co/google/gemma-3-27b-it> |

## Further Reading

* [Gemma documentation](https://ai.google.dev/gemma)
* [HuggingFace Model Card](https://huggingface.co/google/gemma-3-27b-it)
* [Ollama Library](https://ollama.com/library/gemma3)
* [Google AI Studio](https://aistudio.google.com/) — try Gemma 3 online before renting a GPU

