Gemma 2

Run Google's Gemma 2 models for efficient inference.

Renting on CLORE.AI

  1. Filter by GPU type, VRAM, and price

  2. Choose On-Demand (fixed rate) or Spot (bid price)

  3. Configure your order:

    • Select Docker image

    • Set ports (TCP for SSH, HTTP for web UIs)

    • Add environment variables if needed

    • Enter startup command

  4. Select payment: CLORE, BTC, or USDT/USDC

  5. Create order and wait for deployment

Access Your Server

  • Find connection details in My Orders

  • Web interfaces: Use the HTTP port URL

  • SSH: `ssh -p <port> root@<proxy-address>`

What is Gemma 2?

Gemma 2 from Google offers:

  • Models from 2B to 27B parameters

  • Excellent performance for their size

  • Strong instruction following

  • Efficient architecture

Model Variants

| Model | Parameters | VRAM | Context |
|-------|------------|------|---------|
| Gemma-2-2B | 2B | 3GB | 8K |
| Gemma-2-9B | 9B | 12GB | 8K |
| Gemma-2-27B | 27B | 32GB | 8K |

Quick Deploy

For an Ollama-based deployment, example values are (adjust to your workflow):

  • Docker Image: `ollama/ollama`, or a CUDA base image if you install Ollama yourself

  • Ports: `22` TCP for SSH, `11434` HTTP for the Ollama API

  • Command: `ollama serve`

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

  1. Go to My Orders page

  2. Click on your order

  3. Find the http_pub URL (e.g., abc123.clorecloud.net)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.
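For example, to check that an Ollama server on the instance is reachable through the public URL (the hostname below is a placeholder):

```bash
# List the models the Ollama API is serving (replace the hostname with your http_pub URL)
curl https://abc123.clorecloud.net/api/tags
```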

Using Ollama

Installation
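Ollama ships an official install script; run it over SSH on your rented server:

```bash
# Download and run the official Ollama install script
curl -fsSL https://ollama.com/install.sh | sh
```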

Basic Usage
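Pull a model from the Ollama library and chat with it interactively. The `gemma2` tags below are the official Ollama model names:

```bash
# Download the 9B instruction-tuned model
ollama pull gemma2:9b

# Start an interactive chat session
ollama run gemma2:9b
```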

Gemma 2 2B (Lightweight)

For edge/mobile deployment:
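A minimal run (Ollama's `gemma2:2b` is a quantized build that fits comfortably in the 3GB VRAM listed above):

```bash
# Runs on small GPUs; pulls the model automatically on first use
ollama run gemma2:2b
```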

Gemma 2 27B (Best Quality)
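The 27B model gives the best output quality in the family. Ollama's default quantized build is considerably smaller than the 32GB full-precision figure above and generally fits on a 24GB card:

```bash
# Pulls and runs the quantized 27B build
ollama run gemma2:27b
```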

vLLM Server
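For production serving, vLLM exposes an OpenAI-compatible HTTP API. A launch sketch for the 9B instruct model (Gemma 2 is gated on Hugging Face, so an access token is required; the flags shown are standard vLLM options):

```bash
pip install vllm

# Gemma 2 weights are gated; use a Hugging Face token with access granted
export HF_TOKEN=your_token_here

# Serve google/gemma-2-9b-it on port 8000 with an OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-2-9b-it \
    --dtype bfloat16 \
    --max-model-len 8192
```

Map port 8000 as an HTTP port in your order so the API is reachable through your http_pub URL.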

OpenAI-Compatible API
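Any OpenAI client library can talk to the vLLM server; a minimal sketch, assuming the server above is exposed through your http_pub URL:

```python
from openai import OpenAI

# vLLM ignores the API key; the base URL is your http_pub URL
client = OpenAI(base_url="https://YOUR_HTTP_PUB_URL/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```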

Streaming
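The same endpoint supports token-by-token streaming via `stream=True`:

```python
from openai import OpenAI

client = OpenAI(base_url="https://YOUR_HTTP_PUB_URL/v1", api_key="not-needed")

# Print tokens as they arrive instead of waiting for the full reply
stream = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```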

Gradio Interface
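A minimal chat UI sketch with Gradio (assumes a recent Gradio version and the vLLM server above; expose port 7860 as a second HTTP port in your order):

```python
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="https://YOUR_HTTP_PUB_URL/v1", api_key="not-needed")

def chat(message, history):
    # With type="messages", history is already a list of {"role", "content"} dicts
    messages = history + [{"role": "user", "content": message}]
    response = client.chat.completions.create(
        model="google/gemma-2-9b-it", messages=messages
    )
    return response.choices[0].message.content

# Listen on all interfaces so the CLORE.AI port mapping can reach it
gr.ChatInterface(chat, type="messages").launch(server_name="0.0.0.0", server_port=7860)
```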

Batch Processing
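For offline batch jobs, vLLM's Python API can generate completions for many prompts in a single call, which is far more efficient than looping over one request at a time. A sketch:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the benefits of model quantization.",
    "Explain KV caching in one paragraph.",
    "List three use cases for small language models.",
]
params = SamplingParams(temperature=0.7, max_tokens=200)

# Load the model once, then generate for the whole batch
llm = LLM(model="google/gemma-2-9b-it", dtype="bfloat16")
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```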

Performance

| Model | GPU | Tokens/sec |
|-------|-----|------------|
| Gemma-2-2B | RTX 3060 | ~100 |
| Gemma-2-9B | RTX 3090 | ~60 |
| Gemma-2-9B | RTX 4090 | ~85 |
| Gemma-2-27B | A100 | ~45 |
| Gemma-2-27B (4-bit) | RTX 4090 | ~30 |

Comparison

| Model | MMLU | Quality | Speed |
|-------|------|---------|-------|
| Gemma-2-9B | 71.3% | Great | Fast |
| Llama-3.1-8B | 69.4% | Good | Fast |
| Mistral-7B | 62.5% | Good | Fast |

Troubleshooting

Out of memory for 27B

  • Use 4-bit quantization with BitsAndBytesConfig (see the sketch after this list)

  • Reduce `max_new_tokens`

  • Clear GPU cache: `torch.cuda.empty_cache()`
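A minimal 4-bit loading sketch with Transformers and bitsandbytes (standard `BitsAndBytesConfig` options; the model ID is the official Hugging Face repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization cuts the 27B weights to roughly a quarter of full precision
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
```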

Slow generation

  • Use vLLM for production deployment

  • Enable Flash Attention

  • Try 9B model for faster inference

Output quality issues

  • Use the instruction-tuned version (`-it` suffix)

  • Adjust temperature (0.7-0.9 recommended)

  • Add system prompt for context

Tokenizer warnings

  • Update transformers to latest version

  • Use `padding_side="left"` for batch inference

Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU | Hourly Rate | Daily Rate | 4-Hour Session |
|-----|-------------|------------|----------------|
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.

Save money:

  • Use Spot market for flexible workloads (often 30-50% cheaper)

  • Pay with CLORE tokens

  • Compare prices across different providers

Next Steps

  • Llama 3.2 - Meta's model

  • Qwen2.5 - Alibaba's model

  • vLLM Inference - Production serving
