Gemma 3

Run Google Gemma 3 multimodal models on Clore.ai — beats Llama 3.1 405B at 1/15th the size

Gemma 3, released March 2025 by Google DeepMind, is built on the same technology as Gemini 2.0. Its standout achievement: the 27B model beats Llama 3.1 405B on LMArena benchmarks — a model 15 times its size. It's natively multimodal (text + images + video), supports 128K context, and runs on a single RTX 4090 with quantization.

Key Features

  • Punches way above its weight: 27B beats 405B-class models on major benchmarks

  • Natively multimodal: Text, image, and video understanding built-in

  • 128K context window: Process long documents, codebases, conversations

  • Four sizes: 1B, 4B, 12B, 27B — something for every GPU budget

  • QAT versions: Quantization-Aware Training variants let 27B run on consumer GPUs

  • Wide framework support: Ollama, vLLM, Transformers, Keras, JAX, PyTorch

Model Variants

| Model | Parameters | VRAM (Q4) | VRAM (FP16) | Best For |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 1.5GB | 3GB | Edge, mobile, testing |
| Gemma 3 4B | 4B | 4GB | 9GB | Budget GPUs, fast tasks |
| Gemma 3 12B | 12B | 10GB | 25GB | Balanced quality/speed |
| Gemma 3 27B | 27B | 18GB | 54GB | Best quality, production |
| Gemma 3 27B QAT | 27B | 14GB | — | Optimized for consumer GPUs |
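
The VRAM figures above follow a simple rule of thumb: weight memory is parameter count times bytes per parameter, plus headroom for the KV cache and activations. A ballpark sketch (the 15% headroom fraction is an assumption; real usage varies with context length and serving stack):

```python
def vram_estimate_gb(params_billion: float, bits: int, overhead: float = 0.15) -> float:
    """Ballpark VRAM need: weights (params x bits/8) plus runtime overhead."""
    weights_gb = params_billion * bits / 8  # 1B params at 8 bits ~ 1 GB
    return round(weights_gb * (1 + overhead), 1)

print(vram_estimate_gb(27, 4))   # Q4 weights for 27B with headroom
print(vram_estimate_gb(27, 16))  # FP16 weights for 27B with headroom
```

Engines that preallocate the KV cache (such as vLLM) will reserve noticeably more than this estimate.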

Requirements

| Component | Gemma 3 4B | Gemma 3 27B (Q4) | Gemma 3 27B (FP16) |
|---|---|---|---|
| GPU | RTX 3060 | RTX 4090 | 2× RTX 4090 / A100 |
| VRAM | 6GB | 24GB | 48GB+ |
| RAM | 16GB | 32GB | 64GB |
| Disk | 10GB | 25GB | 55GB |
| CUDA | 11.8+ | 11.8+ | 12.0+ |

Recommended Clore.ai GPU: RTX 4090 24GB (~$0.5–2/day) for 27B quantized — the sweet spot

Quick Start with Ollama
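
A minimal path from a fresh instance to a running model, assuming the `gemma3` model tags that Ollama publishes:

```shell
# Install Ollama (Linux) and run Gemma 3 interactively.
curl -fsSL https://ollama.com/install.sh | sh

# Pick the size that fits your GPU (see the variants table above):
ollama run gemma3:4b      # budget GPUs
ollama run gemma3:12b     # mid-range
ollama run gemma3:27b     # RTX 4090 class (Q4 by default)

# Or a one-off prompt instead of an interactive session:
ollama run gemma3:27b "Explain quantization-aware training in two sentences."
```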

Ollama API Server
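
To serve over HTTP, a sketch using Ollama's standard REST API on its default port 11434:

```shell
# Bind to all interfaces so remote clients (e.g. your Clore.ai port mapping) can reach it.
OLLAMA_HOST=0.0.0.0 ollama serve &

# Non-streaming completion via the native /api/generate endpoint:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Write a haiku about GPUs",
  "stream": false
}'
```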

Vision with Ollama
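
Image input goes through the same endpoint; the 4B, 12B, and 27B variants accept images (the 1B is text-only). A sketch, assuming a local `photo.jpg`:

```shell
# Images are passed base64-encoded in the "images" array.
# base64 -w0 is GNU coreutils; on macOS use: base64 -i photo.jpg
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Describe this image in detail.",
  "images": ["'"$(base64 -w0 photo.jpg)"'"],
  "stream": false
}'
```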

vLLM Setup (Production)
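
For production throughput, a sketch using vLLM's OpenAI-compatible server (flag names are standard vLLM options; a recent vLLM release with Gemma 3 support is assumed):

```shell
pip install vllm

# Serve the instruct model; cap context initially to keep the KV cache small.
vllm serve google/gemma-3-27b-it \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --port 8000

# Query the OpenAI-compatible endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-27b-it",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```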

HuggingFace Transformers

Text Generation
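
A minimal text-generation sketch with the `pipeline` API (assumes you have accepted the Gemma license and installed a transformers release with Gemma 3 support):

```python
import torch
from transformers import pipeline

# device_map="auto" spreads the weights across available GPUs.
generator = pipeline(
    "text-generation",
    model="google/gemma-3-27b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize Gemma 3 in one sentence."}]
result = generator(messages, max_new_tokens=128)

# The chat pipeline returns the full conversation; the last turn is the reply.
print(result[0]["generated_text"][-1]["content"])
```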

Vision (Image Understanding)
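
For image understanding, a sketch following the shape of the HuggingFace model-card example (class and processor names as in the transformers Gemma 3 integration; the image URL is a placeholder):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-27b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # or a local path
        {"type": "text", "text": "What does this chart show?"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```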

Docker Quick Start
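
A sketch using the official `ollama/ollama` image with GPU passthrough (requires the NVIDIA Container Toolkit on the host):

```shell
# Start Ollama with GPU access and a persistent model volume.
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

# Pull and chat with Gemma 3 inside the container:
docker exec -it ollama ollama run gemma3:27b
```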

Benchmark Highlights

| Benchmark | Gemma 3 27B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|
| LMArena ELO | 1354 | 1298 | 1337 |
| MMLU | 75.6 | 79.3 | 85.2 |
| HumanEval | 72.0 | 72.6 | 80.5 |
| VRAM (Q4) | 18GB | 40GB | 200GB+ |
| Cost on Clore | $0.5–2/day | $3–6/day | $12–24/day |

The 27B delivers 405B-class conversational quality at 1/10th the VRAM cost.

Tips for Clore.ai Users

  • 27B QAT is the sweet spot: Quantization-Aware Training means less quality loss than post-training quantization — run it on a single RTX 4090

  • Vision is free: No extra setup needed — Gemma 3 understands images natively. Great for document parsing, screenshot analysis, chart reading

  • Start with short context: Use --max-model-len 8192 initially; increase only when needed to save VRAM

  • 4B for budget runs: On an RTX 3060/3070 ($0.15–0.3/day), the 4B model still rivals the previous-generation Gemma 2 27B

  • License acceptance only: Accept the Gemma license once on HuggingFace; there is no approval queue or separate Google auth step

Troubleshooting

| Issue | Solution |
|---|---|
| OutOfMemoryError on 27B | Use the QAT version or reduce `--max-model-len` to 4096 |
| Vision not working in Ollama | Update Ollama to latest: `curl -fsSL https://ollama.com/install.sh \| sh` |
| Slow generation speed | Make sure you're running bfloat16, not float32: pass `--dtype bfloat16` |
| Model outputs garbage | Use the `-it` (instruction-tuned) variant, not the base model |
| Download 403 error | Accept the Gemma license at https://huggingface.co/google/gemma-3-27b-it |
