# Gemini 3.1 Flash Lite

> **Gemini 3.1 Flash Lite** is Google's cheapest and fastest production model as of March 2026, released March 3, 2026. It's the API-optimized tier of the Gemini 3.1 family, designed for high-throughput, cost-sensitive workloads like real-time chatbots, classification pipelines, and RAG retrieval layers. The weights are not public, so the model itself is reached through the Google API; for maximum cost control, self-host a comparable open model via Ollama or vLLM on Clore.ai GPUs.

## What is Gemini 3.1 Flash Lite?

Released March 3, 2026 as the lightweight entry to the Gemini 3.1 family (which also includes Gemini 3.1 Pro from February 19, 2026), Flash Lite trades some reasoning depth for dramatically lower latency and cost. It's Google's answer to the "fast and cheap" tier — competing directly with GPT-5.4's mini variants and Claude Sonnet in price-performance.

**Key specs:**

* **Multimodal**: text, image, audio, video inputs
* **Context window**: 1M tokens (same as Gemini 3.1 Pro)
* **Output**: up to 8K tokens per request
* **Latency**: \~120ms time-to-first-token for short prompts (API)
* **Architecture**: Distilled from Gemini 3.1 Pro with speculative decoding

> **Note:** Gemini 3.1 Flash Lite is a **Google API-only** model — weights are not publicly released. This guide covers (a) using the Google Gemini API on Clore.ai infrastructure, and (b) comparable open-source alternatives you can fully self-host.

## Option A: Use Gemini 3.1 Flash Lite API on a Clore.ai Server

Even if you can't run the weights locally, hosting your API-consuming application on Clore.ai's cheap servers makes sense for long-running processes, automation pipelines, and batch jobs.

### Setup: API Proxy + FastAPI on Clore.ai

```bash
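# Rent a CPU-only server on Clore.ai: an API proxy runs no local inference
# A GPU such as an RTX 3060 (~$0.25/hr) is only needed if you also host local models

# httpx and pillow are required by the /vision endpoint
pip install google-generativeai fastapi uvicorn httpx pillow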

cat > gemini_proxy.py << 'EOF'
import google.generativeai as genai
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-3.1-flash-lite")

app = FastAPI(title="Gemini 3.1 Flash Lite Proxy")

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful assistant."
    max_tokens: int = 2048

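@app.post("/chat")
async def chat(req: ChatRequest):
    try:
        # Build a per-request model so the system prompt is sent as a
        # system instruction instead of being mixed into the user content
        chat_model = genai.GenerativeModel(
            "gemini-3.1-flash-lite",
            system_instruction=req.system_prompt,
        )
        # Use the async variant so the FastAPI event loop is not blocked
        response = await chat_model.generate_content_async(
            req.message,
            generation_config=genai.GenerationConfig(
                max_output_tokens=req.max_tokens,
                temperature=0.7,
            ),
        )
        return {"response": response.text, "model": "gemini-3.1-flash-lite"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))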

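@app.post("/vision")
async def vision_chat(image_url: str, prompt: str):
    # image_url and prompt are read from the query string
    import io
    import httpx
    import PIL.Image

    try:
        async with httpx.AsyncClient(timeout=30) as client:
            img_data = await client.get(image_url)
            img_data.raise_for_status()  # fail early on 4xx/5xx downloads

        image = PIL.Image.open(io.BytesIO(img_data.content))
        # Async variant keeps the event loop free while the API call runs
        response = await model.generate_content_async([prompt, image])
        return {"response": response.text}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))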

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

GOOGLE_API_KEY=your-key uvicorn gemini_proxy:app --host 0.0.0.0 --port 8080
```
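Once the proxy is running, any HTTP client can call it. The snippet below is a minimal sketch of a standard-library client for the `/chat` endpoint; it assumes the proxy is reachable at a base URL you supply, and the helper names `build_chat_request` and `chat` are illustrative, not part of any SDK.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    message: str,
    system_prompt: str = "You are a helpful assistant.",
    max_tokens: int = 2048,
) -> urllib.request.Request:
    """Build a POST request whose JSON body mirrors the proxy's ChatRequest model."""
    payload = {
        "message": message,
        "system_prompt": system_prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, message: str, timeout: float = 30.0, **kwargs) -> str:
    """Send the request and return the 'response' field of the proxy's JSON reply."""
    req = build_chat_request(base_url, message, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

For example, `chat("http://localhost:8080", "Explain what a reverse proxy does.")` returns the model's reply as a string when the proxy runs on the same machine.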

### High-Throughput Batch Processing

```python
import google.generativeai as genai
import asyncio
from typing import List

genai.configure(api_key="YOUR_API_KEY")

async def batch_classify(texts: List[str], batch_size: int = 50) -> List[str]:
    """Classify texts, sending up to `batch_size` requests concurrently."""
    model = genai.GenerativeModel("gemini-3.1-flash-lite")

    def label_of(r) -> str:
        # A failed request, or a blocked/empty response, is reported as "ERROR"
        if isinstance(r, Exception):
            return "ERROR"
        try:
            return r.text.strip()
        except ValueError:  # .text raises when no valid candidate was returned
            return "ERROR"

    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        tasks = [
            model.generate_content_async(
                f"Classify this text as POSITIVE, NEGATIVE, or NEUTRAL. Reply with one word only.\n\nText: {text}"
            )
            for text in batch
        ]
        # return_exceptions=True keeps one failed request from aborting the batch
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        results.extend(label_of(r) for r in responses)
    return results

# Example
texts = ["Great product!", "Terrible service.", "It's okay I guess."]
labels = asyncio.run(batch_classify(texts))
print(list(zip(texts, labels)))
```
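Fixed-size batches wait for the slowest request in each batch before starting the next one. A common alternative is to cap the number of in-flight requests with a semaphore and retry transient failures with exponential backoff. The sketch below shows that pattern with plain `asyncio`; `run_with_limit` and its parameters are hypothetical names, and the retry policy is an assumption, not a documented recommendation for the Gemini API.

```python
import asyncio
import random
from typing import Awaitable, Callable, List


async def run_with_limit(
    jobs: List[Callable[[], Awaitable[object]]],
    max_concurrency: int = 10,
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> List[object]:
    """Run zero-argument coroutine factories with bounded concurrency.

    Results come back in input order; a job that still fails after
    max_retries retries is represented by its last exception.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(job: Callable[[], Awaitable[object]]) -> object:
        for attempt in range(max_retries + 1):
            try:
                async with sem:  # at most max_concurrency jobs run at once
                    return await job()
            except Exception as exc:
                if attempt == max_retries:
                    return exc
                # Exponential backoff with jitter, outside the semaphore
                delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
                await asyncio.sleep(delay)

    return await asyncio.gather(*(run_one(job) for job in jobs))
```

To use it with the classifier, pass one factory per text, for example `[lambda t=t: model.generate_content_async(f"...{t}") for t in texts]`, so each retry creates a fresh request.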

## Option B: Open-Source Alternatives (Self-Host on Clore.ai)

If you want fully local inference with no per-token API costs, these open-weight models target the same "fast/cheap" tier as Gemini 3.1 Flash Lite:

### Gemma 3 4B (Google's open lightweight model)

```bash
# Runs on any GPU with 6GB+ VRAM — even RTX 3060
docker run --gpus all -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama

# The container is named, so commands can address it explicitly
docker exec -it ollama ollama pull gemma3:4b
docker exec -it ollama ollama run gemma3:4b "Explain quantum entanglement simply."
```

### Qwen3.5 7B (Faster, higher quality for the size)

```bash
docker exec -it $(docker ps -q) ollama pull qwen3.5:7b
# ~3.8GB VRAM, ~45 tok/s on RTX 3080
```

### Speed Comparison on Clore.ai Hardware

| Model                       | VRAM | Tokens/sec (RTX 4090) | Cost per 1M tokens           |
| --------------------------- | ---- | --------------------- | ---------------------------- |
| Gemini 3.1 Flash Lite (API) | N/A  | \~200 (API)           | \~$0.25 input / $1.50 output |
| Gemma 3 4B (local)          | 4GB  | 95 tok/s              | \~$5.85 (at $2/hr)           |
| Qwen3.5 7B (local)          | 8GB  | 78 tok/s              | \~$7.12 (at $2/hr)           |
| Gemma 3 12B (local)         | 12GB | 55 tok/s              | \~$10.10 (at $2/hr)          |
| Gemma 3 27B (local)         | 20GB | 32 tok/s              | \~$17.36 (at $2/hr)          |

> **Takeaway:** The local rows are single-stream figures: at 95 tok/s, a $2/hr GPU needs about 2.9 hours (\~$5.85) to generate 1M tokens, which is more than the Gemini API charges. Self-hosting Gemma 3 / Qwen3.5 on Clore.ai becomes cheaper only for high-volume workloads that keep a lower-priced GPU saturated with concurrent requests.
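The per-token cost of a rented GPU follows directly from its hourly price and its sustained throughput. The sketch below works through that arithmetic with the $2/hr and tokens/sec figures from the table; the function names are illustrative, and the throughput is assumed to be the total rate the GPU actually sustains.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD to generate 1M tokens at a sustained rate on a GPU billed hourly."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def tokens_per_sec_for_target(hourly_usd: float, target_usd_per_million: float) -> float:
    """Sustained throughput needed to reach a target price per 1M tokens."""
    return hourly_usd / (target_usd_per_million * 3600) * 1_000_000


# Single-stream figures from the table, all at $2/hr
for name, tps in [("Gemma 3 4B", 95), ("Qwen3.5 7B", 78),
                  ("Gemma 3 12B", 55), ("Gemma 3 27B", 32)]:
    print(f"{name}: ${cost_per_million_tokens(2.0, tps):.2f} per 1M tokens")

# Aggregate throughput a $2/hr GPU needs to match $0.25 per 1M tokens
print(f"{tokens_per_sec_for_target(2.0, 0.25):.0f} tok/s")
```

The second function shows why batching matters: the price per token only drops when many requests share the GPU and push aggregate throughput well above a single stream.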

## Deploy on Clore.ai

### Recommended GPU for Flash Lite-tier workloads

| Use Case               | Recommended GPU            | Price on Clore.ai |
| ---------------------- | -------------------------- | ----------------- |
| API proxy / automation | No GPU needed (CPU server) | \~$0.05/hr        |
| Local 4B model         | RTX 3060 12GB              | \~$0.25/hr        |
| Local 7B model         | RTX 3080 10GB              | \~$0.35/hr        |
| Local 27B model        | RTX 4090 24GB              | \~$1.20/hr (spot) |

### One-Click Ollama Launch on Clore.ai

In the Clore.ai dashboard, select **Ollama** from the templates:

```bash
# Or manually via SSH:
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull gemma3:4b
ollama run gemma3:4b
```
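With `ollama serve` running, the model is also reachable over Ollama's local REST API, which is how an application would normally call it. The following is a minimal sketch using only the standard library; it assumes the default port 11434 and the `gemma3:4b` tag pulled above, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port


def build_generate_request(prompt: str, model: str = "gemma3:4b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma3:4b", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the 'response' field."""
    req = build_generate_request(prompt, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain quantum entanglement simply.")` on the server returns the full completion as one string, because `stream` is set to `False`.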

## Use Cases Best Suited for Flash Lite-Tier

1. **RAG retrieval layer** — fast context ranking, not final generation
2. **Real-time chatbot responses** — sub-200ms for short queries
3. **Document classification** — process thousands of docs per minute
4. **Code autocomplete** — low-latency suggestion generation
5. **Translation pipelines** — batch-translate content at low cost
6. **Content moderation** — classify user content at scale
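
As an illustration of the first use case, a fast model can reorder retrieved passages before a larger model writes the final answer. The sketch below covers only the model-independent parts: building a ranking prompt and parsing the reply defensively. The prompt wording and both function names are hypothetical, and the reply format is an assumption about what the model is asked to return.

```python
import re
from typing import List


def build_rerank_prompt(query: str, passages: List[str]) -> str:
    """Ask the model to return passage numbers ordered from most to least relevant."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Rank the passages by relevance to the query. "
        "Reply with the passage numbers only, comma-separated, best first.\n\n"
        f"Query: {query}\n\nPassages:\n{numbered}"
    )


def parse_ranking(reply: str, num_passages: int) -> List[int]:
    """Turn a reply such as '3, 1' into zero-based indices.

    Out-of-range and repeated numbers are ignored, and passages the
    model left out are appended in their original order.
    """
    order: List[int] = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    order.extend(i for i in range(num_passages) if i not in order)
    return order
```

The prompt can be sent through any of the clients on this page; `parse_ranking` then maps the reply back to positions in the original passage list.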

## Cost Estimator

| Monthly Volume | Google API Cost (50/50 input/output) | Clore.ai (Gemma 3 4B)          |
| -------------- | ------------------------------------ | ------------------------------ |
| 10M tokens     | \~$8.75                              | \~$12.50 (50hr/mo RTX 3060)    |
| 100M tokens    | \~$87.50                             | \~$180 (continuous RTX 3060)   |
| 1B tokens      | \~$875                               | \~$180 (continuous RTX 3060)   |

> For volumes above \~200M tokens/month, self-hosting on Clore.ai beats the Gemini API cost.
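
The break-even volume can be reproduced from figures already on this page. The sketch below assumes the API prices of $0.25 input / $1.50 output per 1M tokens, an even split between input and output tokens, and an RTX 3060 rented continuously at \~$0.25/hr for a 720-hour month; the function names are illustrative.

```python
HOURS_PER_MONTH = 720  # 30 days of continuous rental


def blended_api_price(input_usd_per_m: float, output_usd_per_m: float,
                      output_share: float = 0.5) -> float:
    """Average API price per 1M tokens for a given share of output tokens."""
    return input_usd_per_m * (1 - output_share) + output_usd_per_m * output_share


def break_even_million_tokens(gpu_usd_per_hr: float, api_usd_per_m: float,
                              hours: float = HOURS_PER_MONTH) -> float:
    """Monthly volume (in millions of tokens) where GPU rental equals API spend."""
    return gpu_usd_per_hr * hours / api_usd_per_m


def required_tokens_per_sec(million_tokens: float, hours: float = HOURS_PER_MONTH) -> float:
    """Average throughput the GPU must sustain to serve that volume in a month."""
    return million_tokens * 1_000_000 / (hours * 3600)


api_price = blended_api_price(0.25, 1.50)            # 0.875 USD per 1M tokens
volume = break_even_million_tokens(0.25, api_price)  # about 206M tokens/month
print(f"Break-even: {volume:.0f}M tokens/month")
print(f"Sustained rate needed: {required_tokens_per_sec(volume):.0f} tok/s")
```

A workload that is mostly input tokens lowers the blended API price and pushes the break-even volume higher, so adjust `output_share` to match your own traffic.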

## Monitoring API Usage

```python
# Track Gemini API usage and costs
import google.generativeai as genai
import json
from datetime import datetime, timezone

genai.configure(api_key="YOUR_API_KEY")

# Per-1M-token prices from the speed comparison table; update if pricing changes
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 1.50

def tracked_generate(prompt: str, log_file: str = "usage.jsonl"):
    model = genai.GenerativeModel("gemini-3.1-flash-lite")
    response = model.generate_content(prompt)
    meta = response.usage_metadata

    # Log usage; input and output tokens are billed at different rates
    usage = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_tokens": meta.prompt_token_count,
        "output_tokens": meta.candidates_token_count,
        "total_tokens": meta.total_token_count,
        "estimated_cost_usd": (
            meta.prompt_token_count * INPUT_USD_PER_M
            + meta.candidates_token_count * OUTPUT_USD_PER_M
        ) / 1_000_000,
    }

    # Append one JSON object per line (JSONL) so the log is easy to aggregate
    with open(log_file, "a") as f:
        f.write(json.dumps(usage) + "\n")

    return response.text

# Usage
result = tracked_generate("What is the capital of France?")
print(result)
```
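
The JSONL log can then be rolled up into daily totals. The following is a minimal sketch that reads the fields written by `tracked_generate` (`timestamp`, `total_tokens`, `estimated_cost_usd`); the function name `summarize_usage` is illustrative.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict


def summarize_usage(log_file: str = "usage.jsonl") -> Dict[str, dict]:
    """Aggregate requests, tokens, and estimated cost per UTC day."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"requests": 0, "total_tokens": 0, "estimated_cost_usd": 0.0}
    )
    path = Path(log_file)
    if not path.exists():
        return {}

    with path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            day = record["timestamp"][:10]  # ISO timestamp starts with YYYY-MM-DD
            totals[day]["requests"] += 1
            totals[day]["total_tokens"] += record["total_tokens"]
            totals[day]["estimated_cost_usd"] += record["estimated_cost_usd"]

    return dict(totals)
```

Running `summarize_usage()` in a daily cron job gives a quick check on spend without opening the Google Cloud console.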

## Related Guides

* [Gemma 3 on Clore.ai](/guides/language-models/gemma3.md) — Google's open-source model family
* [Ollama Guide](/guides/language-models/ollama.md) — run any LLM locally with one command
* [RAGFlow](/guides/rag-and-vector-databases/ragflow.md) — RAG pipeline that works well with fast models
* [vLLM Serving](/guides/language-models/vllm.md) — high-throughput OpenAI-compatible server
* [GPU Comparison](/guides/getting-started/gpu-comparison.md) — find the cheapest GPU for your needs

***

*Last updated: March 16, 2026 | Gemini 3.1 Flash Lite released: March 3, 2026 | Weights: API-only (Google)*


