# Gemma 3

Gemma 3, released in March 2025 by Google DeepMind, is built on the same research and technology as Gemini 2.0. Its standout achievement: **the 27B model beats Llama 3.1 405B** on the LMArena leaderboard — a model 15 times its size. It's natively multimodal (text, images, and short video), supports a 128K-token context window, and runs on a single RTX 4090 with quantization.

## Key Features

* **Punches way above its weight**: 27B beats 405B-class models on major benchmarks
* **Natively multimodal**: Text, image, and video understanding built-in
* **128K context window**: Process long documents, codebases, conversations
* **Four sizes**: 1B, 4B, 12B, 27B — something for every GPU budget
* **QAT versions**: Quantization-Aware Training variants let 27B run on consumer GPUs
* **Wide framework support**: Ollama, vLLM, Transformers, Keras, JAX, PyTorch

## Model Variants

| Model           | Parameters | VRAM (Q4) | VRAM (FP16) | Best For                    |
| --------------- | ---------- | --------- | ----------- | --------------------------- |
| Gemma 3 1B      | 1B         | 1.5GB     | 3GB         | Edge, mobile, testing       |
| Gemma 3 4B      | 4B         | 4GB       | 9GB         | Budget GPUs, fast tasks     |
| Gemma 3 12B     | 12B        | 10GB      | 25GB        | Balanced quality/speed      |
| Gemma 3 27B     | 27B        | 18GB      | 54GB        | Best quality, production    |
| Gemma 3 27B QAT | 27B        | 14GB      | —           | Optimized for consumer GPUs |
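The FP16 column above is essentially weights-only (two bytes per parameter); the quantized columns include a few GB of runtime overhead on top of the raw weights. A quick sanity-check sketch of the arithmetic:

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only footprint: 1e9 params at 8 bits/weight = 1 GB.
    Real usage adds a few GB for KV cache, activations, and buffers."""
    return params_billion * bits_per_weight / 8

for size in (1, 4, 12, 27):
    print(f"Gemma 3 {size}B: {weights_gb(size, 4):.1f} GB at Q4, "
          f"{weights_gb(size, 16):.1f} GB at FP16")
```

For the 27B this gives 13.5 GB of Q4 weights and 54 GB at FP16, which is why 24GB cards need quantization and FP16 needs two of them.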

## Requirements

| Component | Gemma 3 4B | Gemma 3 27B (Q4) | Gemma 3 27B (FP16) |
| --------- | ---------- | ---------------- | ------------------ |
| GPU       | RTX 3060   | RTX 4090         | 2× RTX 4090 / A100 |
| VRAM      | 6GB        | 24GB             | 48GB+              |
| RAM       | 16GB       | 32GB             | 64GB               |
| Disk      | 10GB       | 25GB             | 55GB               |
| CUDA      | 11.8+      | 11.8+            | 12.0+              |

**Recommended Clore.ai GPU**: RTX 4090 24GB (\~$0.5–2/day) for 27B quantized — the sweet spot

## Quick Start with Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run different sizes
ollama run gemma3:1b     # Tiny — 1.5GB VRAM
ollama run gemma3:4b     # Small — 4GB VRAM
ollama run gemma3:12b    # Medium — 10GB VRAM
ollama run gemma3:27b    # Large — 18-20GB VRAM (quantized)

# QAT version (optimized quantization)
ollama run gemma3:27b-qat
```

### Ollama API Server

```bash
ollama serve &

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:27b",
    "messages": [{"role": "user", "content": "Compare REST vs GraphQL for a new API"}]
  }'
```
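The same endpoint can be called from Python using only the standard library — the model tag and prompt below are examples, and the identical client works against a vLLM server if you swap the base URL for `http://localhost:8000/v1`:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    """OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, model: str = "gemma3:27b",
         base_url: str = "http://localhost:11434/v1") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running:
# print(chat("Compare REST vs GraphQL for a new API"))
```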

### Vision with Ollama

```bash
# Analyze an image — include the file path in the prompt;
# Ollama detects the path and attaches the image
ollama run gemma3:27b "Describe this image in detail: ./photo.jpg"
```
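Images can also be sent programmatically through Ollama's native `/api/generate` endpoint, which accepts base64-encoded images. A standard-library sketch (`./photo.jpg` is a placeholder path):

```python
import base64
import json
import urllib.request

def encode_image(path: str) -> str:
    """Base64-encode an image file for the API's `images` field."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def describe_image(path: str, model: str = "gemma3:27b",
                   url: str = "http://localhost:11434/api/generate") -> str:
    payload = {
        "model": model,
        "prompt": "Describe this image in detail",
        "images": [encode_image(path)],
        "stream": False,  # return one JSON object instead of a token stream
    }
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# With the server running:
# print(describe_image("./photo.jpg"))
```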

## vLLM Setup (Production)

```bash
pip install vllm

# Serve 27B model
vllm serve google/gemma-3-27b-it \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# Serve with longer context on 2 GPUs
vllm serve google/gemma-3-27b-it \
  --tensor-parallel-size 2 \
  --max-model-len 65536

# Serve 4B for budget setups
vllm serve google/gemma-3-4b-it \
  --max-model-len 32768
```

## HuggingFace Transformers

### Text Generation

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/gemma-3-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit quantization so the 27B fits on a 24GB GPU
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

messages = [
    {"role": "user", "content": "Write a Python class for a binary search tree with insert, search, and delete methods"}
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=2048, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

### Vision (Image Understanding)

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image

model_name = "google/gemma-3-27b-it"
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # ~54GB in bf16; pass a 4-bit BitsAndBytesConfig for a 24GB card
    device_map="auto"
)

# Load image
image = Image.open("screenshot.png")

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What does this screenshot show? List all UI elements."}
    ]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, not the echoed prompt
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Docker Quick Start

```bash
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model google/gemma-3-27b-it \
  --max-model-len 8192
```

## Benchmark Highlights

| Benchmark         | Gemma 3 27B    | Llama 3.1 70B | Llama 3.1 405B |
| ----------------- | -------------- | ------------- | -------------- |
| LMArena Elo       | 1354           | 1298          | 1337           |
| MMLU              | 75.6           | 79.3          | 85.2           |
| HumanEval         | 72.0           | 72.6          | 80.5           |
| VRAM (Q4)         | 18GB           | 40GB          | 200GB+         |
| **Cost on Clore** | **$0.5–2/day** | **$3–6/day**  | **$12–24/day** |

The 27B delivers 405B-class conversational quality at 1/10th the VRAM cost.

## Tips for Clore.ai Users

* **27B QAT is the sweet spot**: Quantization-Aware Training means less quality loss than post-training quantization — run it on a single RTX 4090
* **Vision is free**: No extra setup needed — Gemma 3 understands images natively. Great for document parsing, screenshot analysis, chart reading
* **Start with short context**: Use `--max-model-len 8192` initially; increase only when needed to save VRAM
* **4B for budget runs**: If you're on an RTX 3060/3070 ($0.15–0.3/day), the 4B model roughly matches the previous-generation Gemma 2 27B
* **Google auth not required**: Unlike some models, Gemma 3 downloads without gating (just accept license on HuggingFace)
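The "start with short context" tip comes down to KV-cache growth: cache memory scales linearly with context length. A rough upper-bound sketch — the layer and head counts below are approximate values for the 27B and should be treated as assumptions, and Gemma 3's interleaved local/global attention makes the real footprint smaller than this estimate:

```python
def kv_cache_gb(seq_len: int, layers: int = 62, kv_heads: int = 16,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """Upper-bound KV-cache size: K and V (hence the factor 2) per layer,
    per KV head, per token, stored in bf16 (2 bytes)."""
    total = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val
    return round(total / 1024**3, 2)

for n in (8_192, 65_536, 131_072):
    print(f"{n:>7} tokens: ~{kv_cache_gb(n)} GB KV cache")
```

By this estimate a 65K-token context can cost more cache than the quantized weights themselves, which is why capping `--max-model-len` is the first lever to pull on a 24GB card.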

## Troubleshooting

| Issue                        | Solution                                                                   |
| ---------------------------- | -------------------------------------------------------------------------- |
| `OutOfMemoryError` on 27B    | Use QAT version or reduce `--max-model-len` to 4096                        |
| Vision not working in Ollama | Update Ollama to latest: `curl -fsSL https://ollama.com/install.sh \| sh`  |
| Slow generation speed        | Check you're using bfloat16, not float32. Use `--dtype bfloat16`           |
| Model outputs garbage        | Ensure you're using the `-it` (instruct-tuned) variant, not the base model |
| Download 403 error           | Accept the Gemma license at <https://huggingface.co/google/gemma-3-27b-it> |

## Further Reading

* [Gemma 3 Technical Report](https://ai.google.dev/gemma)
* [HuggingFace Model Card](https://huggingface.co/google/gemma-3-27b-it)
* [Ollama Library](https://ollama.com/library/gemma3)
* [Google AI Studio](https://aistudio.google.com/) — try Gemma 3 online before renting a GPU
