Gemma 2

Run Google's Gemma 2 models efficiently on Clore.ai GPUs

Newer version available! Google released Gemma 3 in March 2025. Google reports that its 27B model outranks Llama 3.1 405B in LMArena human-preference evaluations, and the family adds native multimodal support. Consider upgrading.


All examples can be run on GPU servers rented through CLORE.AI Marketplace.

Renting on CLORE.AI

1. Visit CLORE.AI Marketplace
2. Filter by GPU type, VRAM, and price
3. Choose On-Demand (fixed rate) or Spot (bid price)
4. Configure your order:
   - Select Docker image
   - Set ports (TCP for SSH, HTTP for web UIs)
   - Add environment variables if needed
   - Enter startup command
5. Select payment: CLORE, BTC, or USDT/USDC
6. Create order and wait for deployment

Access Your Server

Find connection details in My Orders
Web interfaces: Use the HTTP port URL
SSH: ssh -p <port> root@<proxy-address>

What is Gemma 2?

Gemma 2 from Google offers:

Models from 2B to 27B parameters
Excellent performance per size
Strong instruction following
Efficient architecture

Model Variants

| Model | Parameters | VRAM | Context |
|---|---|---|---|
| Gemma-2-2B | 2B | 3GB | 8K |
| Gemma-2-9B | 9B | 12GB | 8K |
| Gemma-2-27B | 27B | 32GB | 8K |
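As a rough cross-check on the VRAM figures, weight memory can be approximated as parameter count times bytes per parameter. The sketch below assumes the nominal sizes from the model names and counts weights only; the KV cache, activations, and framework overhead come on top, and the figures listed above may rest on different assumptions.

```python
# Rough weight-memory estimate: billions of parameters x bytes per parameter = GB.
# Nominal sizes are taken from the model names; real checkpoints differ slightly.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for name, size in [("Gemma-2-2B", 2), ("Gemma-2-9B", 9), ("Gemma-2-27B", 27)]:
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name} -> {row}")
```

By this estimate the 27B weights need about 13.5 GB in 4-bit, which is why the quantized variant can fit on a 24GB card.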

Quick Deploy

Docker Image:

pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

Ports:

22/tcp
8000/http

Command:

pip install vllm && \
vllm serve google/gemma-2-9b-it --port 8000
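The first start can take a while, because vLLM has to download the model weights before it begins serving. A minimal polling sketch, assuming the server answers on vLLM's /health endpoint at the port configured above; the helper names `is_ready` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet
        return False


def wait_until_ready(base_url: str, attempts: int = 60, delay: float = 10.0, probe=is_ready) -> bool:
    """Call probe(base_url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False


# Example (run on the server itself):
# wait_until_ready("http://localhost:8000")
```

The `probe` parameter exists only so the loop can be exercised without a running server.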

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

Go to My Orders page
Click on your order
Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in examples below.
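To avoid hand-editing each example, the substitution can be wrapped in a small helper. This is a sketch: `api_base_url` is a hypothetical name, and the `/v1` suffix assumes the OpenAI-compatible vLLM server used on this page.

```python
def api_base_url(http_pub: str) -> str:
    """Turn an http_pub host (with or without a scheme) into an OpenAI-style base URL."""
    host = http_pub.strip().rstrip("/")
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    return host + "/v1"


print(api_base_url("abc123.clorecloud.net"))  # https://abc123.clorecloud.net/v1
```

The result can be passed as `base_url` to the OpenAI client in place of the localhost address.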

Using Ollama


# Run Gemma 2
ollama run gemma2

# Specific sizes
ollama run gemma2:2b
ollama run gemma2:9b
ollama run gemma2:27b
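Once a model has been pulled, Ollama can also be called over its local REST API. A minimal sketch using only the standard library, assuming Ollama listens on its default port 11434 and the `gemma2:9b` tag is available; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default Ollama address


def build_chat_request(prompt: str, model: str = "gemma2:9b") -> urllib.request.Request:
    """Build a non-streaming chat request for a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "gemma2:9b") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        return json.loads(resp.read())["message"]["content"]


# Example (requires a running Ollama instance):
# print(chat("Explain gravity in one sentence."))
```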

Installation

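# bitsandbytes, openai, and gradio are only needed for the 4-bit, API client, and Gradio examples
pip install transformers accelerate torch bitsandbytes openai gradio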

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-2-9b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain how neural networks learn."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
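It can help to see what `apply_chat_template` does to the messages before they reach the model. The sketch below rebuilds the Gemma turn format by hand; it is an illustration based on the published Gemma prompt format, not a replacement for the tokenizer, and `format_gemma_prompt` is a hypothetical helper.

```python
def format_gemma_prompt(messages, add_generation_prompt=True):
    """Render user/assistant messages in Gemma's turn format."""
    prompt = "<bos>"
    for m in messages:
        # Gemma names the assistant role "model"
        role = "model" if m["role"] == "assistant" else m["role"]
        prompt += f"<start_of_turn>{role}\n{m['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues from here
        prompt += "<start_of_turn>model\n"
    return prompt


print(format_gemma_prompt([{"role": "user", "content": "Explain how neural networks learn."}]))
```

To confirm the format for a given checkpoint, compare against `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.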

Gemma 2 2B (Lightweight)

For edge/mobile deployment:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-2-2b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Fast inference for simple tasks
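messages = [{"role": "user", "content": "Summarize in one sentence: AI is transforming industries."}]

# Apply the chat template, generate, and decode only the newly generated tokens
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))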

Gemma 2 27B (Best Quality)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "google/gemma-2-27b-it"

# Use 4-bit to fit in 24GB VRAM
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

vLLM Server

vllm serve google/gemma-2-9b-it \
    --port 8000 \
    --dtype bfloat16 \
    --max-model-len 8192

OpenAI-Compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

response = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[
        {"role": "user", "content": "Write a haiku about programming"}
    ],
    temperature=0.8
)

print(response.choices[0].message.content)

Streaming

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

stream = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Tell me a short story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
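When the full text is needed after streaming (for logging or post-processing), the chunks can be accumulated while they are printed. A small sketch: `collect_stream` is a hypothetical helper, and `fake_chunk` only mimics the shape of the client's chunk objects so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Join streamed deltas into one string, optionally forwarding each piece."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_text:
                on_text(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute layout as a streamed chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


sample = [fake_chunk("Once "), fake_chunk(None), fake_chunk("upon a time")]
print(collect_stream(sample))  # Once upon a time
```

With a real stream, `collect_stream(stream, on_text=lambda t: print(t, end="", flush=True))` prints tokens as they arrive and returns the complete reply.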

Gradio Interface

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def chat(message, history, temperature):
    messages = []
    for h in history:
        messages.append({"role": "user", "content": h[0]})
        messages.append({"role": "assistant", "content": h[1]})
    messages.append({"role": "user", "content": message})

    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
    outputs = model.generate(inputs, max_new_tokens=512, temperature=temperature, do_sample=True)

    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

demo = gr.ChatInterface(
    fn=chat,
    additional_inputs=[gr.Slider(0.1, 1.5, value=0.7, label="Temperature")],
    title="Gemma 2 Chat"
)

demo.launch(server_name="0.0.0.0", server_port=7860)

Batch Processing

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
# Fall back to EOS only if the tokenizer ships without a pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = [
    "Explain gravity in one sentence.",
    "What is photosynthesis?",
    "Define machine learning.",
    "What is the speed of light?"
]

messages_batch = [[{"role": "user", "content": p}] for p in prompts]

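# return_dict=True also returns the attention mask, which padded batches need
inputs = tokenizer.apply_chat_template(
    messages_batch,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
    return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, pad_token_id=tokenizer.pad_token_id)

# With left padding every prompt has the same length, so slice it off before decoding
prompt_len = inputs["input_ids"].shape[1]
for i, output in enumerate(outputs):
    response = tokenizer.decode(output[prompt_len:], skip_special_tokens=True)
    print(f"Q: {prompts[i]}")
    print(f"A: {response.strip()}\n")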
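Batch generation pads on the left because a decoder-only model continues from the last position of each row, so the real tokens have to end at the same column. The toy sketch below shows the resulting layout with plain lists; `left_pad` is a hypothetical helper and `pad_id=0` is an arbitrary placeholder value.

```python
def left_pad(sequences, pad_id=0):
    """Left-pad token id lists to equal length and build the matching attention mask."""
    width = max(len(s) for s in sequences)
    input_ids = [[pad_id] * (width - len(s)) + list(s) for s in sequences]
    # 0 marks padding the model should ignore, 1 marks real tokens
    attention_mask = [[0] * (width - len(s)) + [1] * len(s) for s in sequences]
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```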

Performance

| Model | GPU | Tokens/sec |
|---|---|---|
| Gemma-2-2B | RTX 3060 | ~100 |
| Gemma-2-9B | RTX 3090 | ~60 |
| Gemma-2-9B | RTX 4090 | ~85 |
| Gemma-2-27B | A100 | ~45 |
| Gemma-2-27B (4-bit) | RTX 4090 | ~30 |
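These throughput figures translate directly into waiting time for a single response. A back-of-the-envelope sketch that uses the approximate numbers listed above and ignores prompt processing and network latency:

```python
def generation_seconds(new_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `new_tokens` at a steady decoding rate."""
    return new_tokens / tokens_per_sec


# Approximate throughput values copied from the performance figures above
for label, tps in [
    ("Gemma-2-9B on RTX 3090", 60),
    ("Gemma-2-9B on RTX 4090", 85),
    ("Gemma-2-27B (4-bit) on RTX 4090", 30),
]:
    print(f"{label}: ~{generation_seconds(512, tps):.1f} s for 512 tokens")
```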

Comparison

| Model | MMLU | Quality | Speed |
|---|---|---|---|
| Gemma-2-9B | 71.3% | Great | Fast |
| Llama-3.1-8B | 69.4% | Good | Fast |
| Mistral-7B | 62.5% | Good | Fast |

Troubleshooting

CUDA out of memory

For the 27B model, use 4-bit quantization with BitsAndBytesConfig
Reduce max_new_tokens
Clear GPU cache: torch.cuda.empty_cache()

Slow generation

Use vLLM for production deployment
Enable Flash Attention
Try 9B model for faster inference

Output quality issues

Use instruction-tuned version (-it suffix)
Adjust temperature (0.7-0.9 recommended)
Add context at the start of the first user message (Gemma 2's chat template has no system role)
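One way to supply system-style instructions is to merge them into the first user message before applying the chat template. A sketch, assuming Gemma 2's template rejects the system role; `fold_system_prompt` is a hypothetical helper, not part of any library.

```python
def fold_system_prompt(messages):
    """Return a copy of `messages` with system content merged into the first user turn."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    if rest and rest[0]["role"] == "user":
        rest[0]["content"] = "\n\n".join(system_parts + [rest[0]["content"]])
    else:
        # No leading user turn to merge into: add one that carries the instructions
        rest.insert(0, {"role": "user", "content": "\n\n".join(system_parts)})
    return rest


messages = [
    {"role": "system", "content": "Answer in one short paragraph."},
    {"role": "user", "content": "Explain how neural networks learn."},
]
print(fold_system_prompt(messages))
```

The returned list can be passed to `tokenizer.apply_chat_template` or to the OpenAI-compatible endpoint in place of the original messages.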

Tokenizer warnings

Update transformers to latest version
Use padding_side="left" for batch inference

Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU | Hourly Rate | Daily Rate | 4-Hour Session |
|---|---|---|---|
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |

Prices vary by provider and demand. Check CLORE.AI Marketplace for current rates.
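Combining an hourly rate with a throughput figure gives a rough cost per million generated tokens. The sketch below assumes a single request stream that keeps the GPU busy for the whole hour, using the approximate rates and speeds quoted on this page; real costs depend on utilization and current prices.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """Approximate USD cost to generate one million tokens at a steady rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Approximate figures from this page: (hourly rate in USD, tokens/sec)
for label, rate, tps in [
    ("Gemma-2-9B on RTX 3090", 0.06, 60),
    ("Gemma-2-9B on RTX 4090", 0.10, 85),
]:
    print(f"{label}: ~${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")
```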

Save money:

Use Spot market for flexible workloads (often 30-50% cheaper)
Pay with CLORE tokens
Compare prices across different providers

Next Steps

Llama 3.2 - Meta's model
Qwen2.5 - Alibaba's model
vLLM Inference - Production serving


