# Gemini 3.1 Flash Lite

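> **Gemini 3.1 Flash Lite**, released on March 3, 2026, is Google's cheapest and fastest production model as of March 2026. It is the API-optimized tier of the Gemini 3.1 family, designed for high-throughput, cost-sensitive workloads such as real-time chatbots, classification pipelines, and RAG retrieval layers. Its weights are not public, so the model itself cannot be self-hosted; for maximum cost control, comparable open models can be self-hosted on Clore.ai GPUs through Ollama or vLLM.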

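## What is Gemini 3.1 Flash Lite?

Released on March 3, 2026 as the lightweight entry-level model in the Gemini 3.1 family (which also includes Gemini 3.1 Pro, released on February 19, 2026), Flash Lite trades some reasoning depth for significantly lower latency and cost. It is Google's answer to the "fast and cheap" tier, competing directly on price-performance with the mini variant of GPT-5.4 and with Claude Sonnet.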

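**Key specifications:**

* **Multimodal**: text, image, audio, and video input
* **Context window**: 1 million tokens (same as Gemini 3.1 Pro)
* **Output**: up to 8K tokens per request
* **Latency**: time to first token of about 120 ms for short prompts (API)
* **Architecture**: distilled from Gemini 3.1 Pro, with speculative decoding

> **Note:** Gemini 3.1 Flash Lite is available **only through the Google API**; the weights have not been publicly released. This guide covers (a) using the Google Gemini API from Clore.ai infrastructure, and (b) comparable open-source alternatives that you can fully self-host.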

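## Option A: Using the Gemini 3.1 Flash Lite API from a Clore.ai server

Even though you cannot run the weights locally, hosting your API-consuming application on an inexpensive Clore.ai server makes sense for long-running processes, automated pipelines, and batch jobs.

### Setup: API proxy with FastAPI on Clore.ai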

```bash
# Rent a CPU or lightweight GPU server on Clore.ai
# An RTX 3060 (~$0.25/hour) is more than enough for an API proxy workload

pip install google-generativeai fastapi uvicorn httpx pillow

cat > gemini_proxy.py << 'EOF'
import io
import os

import google.generativeai as genai
import httpx
import PIL.Image
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

MODEL_NAME = "gemini-3.1-flash-lite"

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

app = FastAPI(title="Gemini 3.1 Flash Lite proxy")

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful assistant."
    max_tokens: int = 2048

@app.post("/chat")
async def chat(req: ChatRequest):
    try:
        # Send the system prompt as a system instruction, not as user content
        model = genai.GenerativeModel(MODEL_NAME, system_instruction=req.system_prompt)
        # Use the async call so the event loop is not blocked while waiting
        response = await model.generate_content_async(
            req.message,
            generation_config=genai.GenerationConfig(
                max_output_tokens=req.max_tokens,
                temperature=0.7,
            ),
        )
        return {"response": response.text, "model": MODEL_NAME}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/vision")
async def vision_chat(image_url: str, prompt: str):
    try:
        # Download the image, then send it to the model together with the prompt
        async with httpx.AsyncClient() as client:
            img_data = await client.get(image_url)
            img_data.raise_for_status()
        image = PIL.Image.open(io.BytesIO(img_data.content))
        model = genai.GenerativeModel(MODEL_NAME)
        response = await model.generate_content_async([prompt, image])
        return {"response": response.text}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
EOF

GOOGLE_API_KEY=your-key uvicorn gemini_proxy:app --host 0.0.0.0 --port 8080
```

### High-throughput batch processing

```python
import asyncio
import os
from typing import List

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

async def batch_classify(texts: List[str], batch_size: int = 50) -> List[str]:
    """Classify texts concurrently, sending batch_size requests at a time.

    Cost scales with the length of each text, since input tokens are billed.
    """
    model = genai.GenerativeModel("gemini-3.1-flash-lite")

    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        tasks = [
            model.generate_content_async(
                "Classify this text as POSITIVE, NEGATIVE, or NEUTRAL. "
                f"Reply with a single word.\n\nText: {text}"
            )
            for text in batch
        ]
        # return_exceptions=True keeps one failed request from aborting the batch
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        # Failed requests become "ERROR" so results stay aligned with texts
        results.extend([
            r.text.strip() if not isinstance(r, Exception) else "ERROR"
            for r in responses
        ])
    return results

# Example
texts = ["Great product!", "Terrible service.", "I think it's okay."]
labels = asyncio.run(batch_classify(texts))
print(list(zip(texts, labels)))
```
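The batch loop above waits for every request in a batch before starting the next one. A semaphore is a common alternative that keeps a fixed number of requests in flight at all times. The sketch below shows the pattern in isolation: `classify_bounded` and `fake_classifier` are hypothetical names, and the fake classifier stands in for a real API call so the example runs without a key.

```python
import asyncio
from typing import Awaitable, Callable, List


async def classify_bounded(
    texts: List[str],
    classify_one: Callable[[str], Awaitable[str]],
    max_in_flight: int = 10,
) -> List[str]:
    """Run classify_one over texts with at most max_in_flight concurrent calls."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(text: str) -> str:
        # The semaphore blocks here once max_in_flight calls are running
        async with semaphore:
            try:
                return await classify_one(text)
            except Exception:
                # Keep results aligned with inputs even when a call fails
                return "ERROR"

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(t) for t in texts))


async def fake_classifier(text: str) -> str:
    """Hypothetical stand-in for a real API call, for demonstration only."""
    await asyncio.sleep(0)
    if not text:
        raise ValueError("empty text")
    return "POSITIVE" if "great" in text.lower() else "NEUTRAL"


labels = asyncio.run(
    classify_bounded(["Great product!", "", "It is fine."], fake_classifier, max_in_flight=2)
)
print(labels)
```

To use it against the API, pass a coroutine that wraps `generate_content_async` and returns the stripped response text in place of `fake_classifier`.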

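## Option B: Open-source alternatives (self-hosted on Clore.ai)

If you want fully local inference with no API costs, these models are comparable to Gemini 3.1 Flash Lite in the "fast and cheap" tier: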

### Gemma 3 4B (Google's open lightweight model)

```bash
# Runs on any GPU with 6 GB+ of VRAM, even an RTX 3060
docker run --gpus all -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama

# Address the container by name so other running containers do not interfere
docker exec -it ollama ollama pull gemma3:4b
docker exec -it ollama ollama run gemma3:4b "Explain quantum entanglement in simple terms."
```

### Qwen3.5 7B (faster, higher quality for its size)

```bash
docker exec -it ollama ollama pull qwen3.5:7b
# About 3.8 GB of VRAM, roughly 45 tok/s on an RTX 3080
```

### Speed comparison on Clore.ai hardware

| Model                       | VRAM  | Tokens/s (RTX 4090) | Cost per 1M tokens                    |
| --------------------------- | ----- | ------------------- | ------------------------------------- |
| Gemini 3.1 Flash Lite (API) | N/A   | \~200 (API)         | \~$0.25 input / $1.50 output          |
| Gemma 3 4B (local)          | 4 GB  | 95 tok/s            | \~$0.002 (Clore.ai, at $2/hour)       |
| Qwen3.5 7B (local)          | 8 GB  | 78 tok/s            | \~$0.005 (Clore.ai, at $2/hour)       |
| Gemma 3 12B (local)         | 12 GB | 55 tok/s            | \~$0.008 (Clore.ai, at $2/hour)       |
| Gemma 3 27B (local)         | 20 GB | 32 tok/s            | \~$0.014 (Clore.ai, at $2/hour)       |

> **Bottom line:** For high-volume workloads (more than 100 million tokens per month), self-hosting Gemma 3 or Qwen3.5 on Clore.ai is **35–50x cheaper** than the Gemini API.
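For your own hardware and workload, the self-hosted cost per million tokens follows from two numbers: the hourly rental price and the sustained throughput. The sketch below shows that arithmetic; `cost_per_million_tokens` is a hypothetical helper, and the example inputs are illustrative assumptions, not measurements. The result is very sensitive to the throughput figure: a server batching many concurrent requests reaches a far higher aggregate rate than a single stream does.

```python
def cost_per_million_tokens(price_per_hour_usd: float, tokens_per_second: float) -> float:
    """Return the USD cost of generating 1M tokens on a rented GPU.

    tokens_per_second should be the sustained aggregate throughput across
    all concurrent requests, not the speed of a single stream.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour_usd / tokens_per_hour * 1_000_000


# Illustrative inputs (assumptions): a $1.20/hour GPU sustaining 2,000 tok/s in aggregate
example_cost = cost_per_million_tokens(price_per_hour_usd=1.20, tokens_per_second=2000)
print(f"${example_cost:.3f} per 1M tokens")
```

Measuring throughput under your real request mix, then plugging it in here, gives a more reliable comparison than a single benchmark figure.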

## Deploying on Clore.ai

### Recommended GPUs for Flash Lite-tier workloads

| Use case               | Recommended GPU            | Clore.ai price       |
| ---------------------- | -------------------------- | -------------------- |
| API proxy / automation | No GPU needed (CPU server) | \~$0.05/hour         |
| Local 4B model         | RTX 3060 12GB              | \~$0.25/hour         |
| Local 7B model         | RTX 3080 10GB              | \~$0.35/hour         |
| Local 27B model        | RTX 4090 24GB              | \~$1.20/hour (spot)  |

### One-click Ollama launch on Clore.ai

In the Clore.ai console, select the **Ollama** template:

```bash
# Or set it up manually over SSH:
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull gemma3:4b
ollama run gemma3:4b
```
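Once `ollama serve` is running, applications usually talk to it over its local HTTP API instead of the interactive CLI. The sketch below uses only the Python standard library and assumes Ollama is listening on its default port, 11434, on the same machine; `build_generate_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Default local endpoint; adjust the host if Ollama runs on another machine
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a non-streaming generate call."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send one prompt to Ollama and return the generated text."""
    body = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the full completion arrives in the "response" field
    return payload["response"]
```

With the model pulled, `print(generate("gemma3:4b", "Explain quantum entanglement in simple terms."))` should print the completion.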

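## Best use cases for the Flash Lite tier

1. **RAG retrieval layer**: fast context ranking rather than final generation
2. **Real-time chatbot replies**: under 200 ms for short queries
3. **Document classification**: thousands of documents per minute
4. **Code autocompletion**: low-latency suggestion generation
5. **Translation pipelines**: bulk content translation at low cost
6. **Content moderation**: classifying user content at scale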

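## Cost estimator

| Monthly usage | Google API cost (50/50 input/output split) | Clore.ai (Gemma 3 4B)                  |
| ------------- | ------------------------------------------ | -------------------------------------- |
| 10M tokens    | \~$8.75                                    | \~$3.60 (50 hours/month on an RTX 3060) |
| 100M tokens   | \~$87.50                                   | \~$3.60 (running continuously)          |
| 1B tokens     | \~$875.00                                  | \~$26 (RTX 3060 running continuously)   |

> Above roughly 200 million tokens per month, self-hosting on Clore.ai costs less than the Gemini API.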
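Monthly API spend follows from the per-token prices quoted in the speed comparison table. The sketch below computes it for a given token volume and finds the volume at which a fixed monthly rental breaks even. The even input/output split, the helper names, and the $180 rental figure (720 hours at the RTX 3060 price of about $0.25/hour) are assumptions for illustration.

```python
# USD per 1M tokens, as quoted in the speed comparison table
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 1.50


def monthly_api_cost(total_tokens: float, output_share: float = 0.5) -> float:
    """Estimate API spend for total_tokens, of which output_share are output tokens."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    input_tokens = total_tokens * (1.0 - output_share)
    output_tokens = total_tokens * output_share
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


def break_even_tokens(fixed_monthly_usd: float, output_share: float = 0.5) -> float:
    """Return the monthly token volume at which API spend equals a fixed rental cost."""
    blended_usd_per_m = monthly_api_cost(1_000_000, output_share)
    return fixed_monthly_usd / blended_usd_per_m * 1_000_000


print(f"10M tokens via API: ${monthly_api_cost(10_000_000):.2f}")
# Assumed rental: one RTX 3060 for 720 hours at $0.25/hour = $180/month
print(f"Break-even volume: {break_even_tokens(180.0) / 1e6:.0f}M tokens/month")
```

Workloads with a larger share of output tokens reach break-even sooner, because output tokens are billed at six times the input rate.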

## Monitoring API usage

```python
# Track Gemini API usage and estimated cost
import json
import os
from datetime import datetime, timezone

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# USD per 1M tokens, as listed in the speed comparison table
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 1.50

def tracked_generate(prompt: str, log_file: str = "usage.jsonl"):
    model = genai.GenerativeModel("gemini-3.1-flash-lite")
    response = model.generate_content(prompt)
    meta = response.usage_metadata

    # Record token counts; input and output tokens are priced separately
    usage = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_tokens": meta.prompt_token_count,
        "output_tokens": meta.candidates_token_count,
        "total_tokens": meta.total_token_count,
        "estimated_cost_usd": (
            meta.prompt_token_count * INPUT_USD_PER_M
            + meta.candidates_token_count * OUTPUT_USD_PER_M
        ) / 1_000_000,
    }

    # Append one JSON object per line so the log is easy to aggregate later
    with open(log_file, "a") as f:
        f.write(json.dumps(usage) + "\n")

    return response.text

# Usage
result = tracked_generate("What is the capital of France?")
print(result)
```
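Because the log holds one JSON object per line, totals can be computed with a short script. The sketch below assumes the field names written by `tracked_generate`; `summarize_usage` is a hypothetical helper, and the demo writes two made-up records to a temporary file so it runs without calling the API.

```python
import json
import os
import tempfile
from typing import Dict


def summarize_usage(log_file: str) -> Dict[str, float]:
    """Sum request count, token counts, and estimated cost from a usage.jsonl log."""
    totals = {"requests": 0, "prompt_tokens": 0, "output_tokens": 0, "estimated_cost_usd": 0.0}
    with open(log_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the log
            record = json.loads(line)
            totals["requests"] += 1
            totals["prompt_tokens"] += record.get("prompt_tokens", 0)
            totals["output_tokens"] += record.get("output_tokens", 0)
            totals["estimated_cost_usd"] += record.get("estimated_cost_usd", 0.0)
    return totals


# Demo with made-up records written to a temporary file
sample_records = [
    {"prompt_tokens": 100, "output_tokens": 20, "total_tokens": 120, "estimated_cost_usd": 0.5},
    {"prompt_tokens": 300, "output_tokens": 80, "total_tokens": 380, "estimated_cost_usd": 0.25},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for rec in sample_records:
        tmp.write(json.dumps(rec) + "\n")
    sample_path = tmp.name

summary = summarize_usage(sample_path)
os.remove(sample_path)
print(summary)
```

Running it against the real `usage.jsonl` on a schedule gives a simple running total to compare against your budget.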

## Related guides

* [Gemma 3 on Clore.ai](/guides/guides_v2-zh/yu-yan-mo-xing/gemma3.md): Google's open model family
* [Ollama guide](/guides/guides_v2-zh/yu-yan-mo-xing/ollama.md): run any LLM locally with a single command
* [RAGFlow](/guides/guides_v2-zh/rag-yu-xiang-liang-shu-ju-ku/ragflow.md): RAG pipelines suited to fast models
* [vLLM serving](/guides/guides_v2-zh/yu-yan-mo-xing/vllm.md): high-throughput, OpenAI-compatible server
* [GPU comparison](/guides/guides_v2-zh/kai-shi-shi-yong/gpu-comparison.md): find the cheapest GPU for your needs

***

*Last updated: March 16, 2026 | Gemini 3.1 Flash Lite released: March 3, 2026 | Weights: API only (Google)*

