Llama 3.2 Vision

Run Meta's Llama 3.2 Vision on Clore.ai for image understanding

Run Meta's multimodal Llama 3.2 Vision models on CLORE.AI GPUs for image understanding.

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.

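Why Llama 3.2 Vision?

Multimodal - understands both text and images
Multiple sizes - 11B and 90B parameter versions
Versatile - OCR, visual question answering, image captioning, document analysis
Open weights - released by Meta under the Llama 3.2 Community License
Llama ecosystem - compatible with Ollama, vLLM, and transformers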

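Model Variants

| Model | Parameters | VRAM (FP16) | Context | Best for |
|---|---|---|---|---|
| Llama-3.2-11B-Vision | 11B | 24GB | 128K | General use, single GPU |
| Llama-3.2-90B-Vision | 90B | 180GB | 128K | Highest quality |
| Llama-3.2-11B-Vision-Instruct | 11B | 24GB | 128K | Chat/assistant |
| Llama-3.2-90B-Vision-Instruct | 90B | 180GB | 128K | Production |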
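The FP16 figures in the table follow roughly from parameter count times bytes per parameter. The sketch below shows that arithmetic only; it assumes 1 GB = 10^9 bytes and ignores activations, the KV cache, and framework overhead, and the function name is illustrative rather than part of any library.

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_param


for name, params in [("11B", 11), ("90B", 90)]:
    fp16 = estimate_weight_memory_gb(params, 16)
    int4 = estimate_weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB in FP16, ~{int4:.1f} GB at 4 bits per weight")
```

For the 11B model this gives about 22 GB of weights in FP16, which is consistent with the 24GB listed once some headroom is allowed for.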

Quick Deploy on CLORE.AI

Docker image:

vllm/vllm-openai:latest

Ports:

22/tcp
8000/http

Command:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192

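Accessing Your Service

After deployment, find your http_pub URL in My Orders:

Go to the My Orders page
Click your order
Find the http_pub URL (for example, abc123.clorecloud.net)

In the examples below, use https://YOUR_HTTP_PUB_URL instead of localhost.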
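When pointing an OpenAI-compatible client at the deployed server, the base URL is the http_pub address with /v1 appended. A small sketch of that substitution follows; the helper name is hypothetical, and abc123.clorecloud.net is just the placeholder hostname used above.

```python
def openai_base_url(http_pub_host: str) -> str:
    """Turn an http_pub hostname (or URL) into an OpenAI-compatible base URL."""
    host = http_pub_host.strip().rstrip("/")
    # Accept either a bare hostname or a full URL with a scheme.
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    return host + "/v1"


print(openai_base_url("abc123.clorecloud.net"))
```

The returned string can be passed as base_url when constructing the client in place of http://localhost:8000/v1.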

Hardware Requirements

| Model | Minimum GPU |
|---|---|

Installation

Using Ollama (easiest)

# Pull the model
ollama pull llama3.2-vision:11b

# Run interactively
ollama run llama3.2-vision:11b

Using vLLM

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000

Using Transformers

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

Basic Usage

Image Understanding

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load the image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the prompt: one image placeholder followed by the question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image? Describe it in detail."}
        ]
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# The chat template already adds the special tokens, so do not add them twice
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generate and decode the response
output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))

Using Ollama

# Describe an image
ollama run llama3.2-vision:11b "Describe this image: /path/to/image.jpg"

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision:11b",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_here"]
}'
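The images field in the request above takes base64 strings rather than file paths. As a rough sketch, the request body could be assembled in Python like this; the helper name is hypothetical, the sample bytes stand in for a real image file, and "stream": false is included on the assumption that a single JSON response is preferred over a streamed one.

```python
import base64
import json


def build_generate_payload(image_bytes: bytes, prompt: str,
                           model: str = "llama3.2-vision:11b") -> str:
    """Return a JSON body for POST /api/generate with one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    body = {
        "model": model,
        "prompt": prompt,
        "images": [encoded],  # plain base64 strings, no "data:" prefix
        "stream": False,      # request one JSON object instead of a stream
    }
    return json.dumps(body)


# Placeholder bytes; in practice read them with open(path, "rb").read()
payload = build_generate_payload(b"not-a-real-image", "What is in this image?")
print(payload[:60] + "...")
```

The resulting string can be sent as the request body with curl -d or any HTTP client.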

Using the vLLM API

from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Encode the image as base64
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
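The request above hard-codes image/jpeg in the data URL. For other formats it is probably safer to derive the MIME type from the file name. The sketch below does that with the standard library; the function name is illustrative and the bytes are placeholders rather than a real image.

```python
import base64
import mimetypes


def image_to_data_url(filename: str, data: bytes) -> str:
    """Build a data: URL whose MIME type matches the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"{filename!r} does not look like an image file")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Placeholder bytes; in practice use open(filename, "rb").read()
print(image_to_data_url("chart.png", b"abc"))
```

The returned string can be used as the value of image_url["url"] in the chat completion request.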

Use Cases

OCR / Text Extraction

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all text from this image. Output it as markdown."}
        ]
    }
]

Document Analysis

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this document. Summarize the key points."}
        ]
    }
]

Visual Question Answering

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many people are in this photo? What are they doing?"}
        ]
    }
]

Image Captioning

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Write a detailed, descriptive caption for this image suitable for social media."}
        ]
    }
]

Code from Screenshot

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this UI screenshot into HTML/CSS code."}
        ]
    }
]

Multiple Images

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these two images. What are the differences?"}
        ]
    }
]

# Pass both images, in the same order as the placeholders
inputs = processor(
    images=[image1, image2],
    text=input_text,
    return_tensors="pt"
).to(model.device)
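Each {"type": "image"} placeholder in the messages should correspond to one image handed to the processor, in the same order. A quick sanity check before calling the processor might look like the sketch below; both helper names are hypothetical, and the string stand-ins take the place of real PIL images.

```python
def count_image_placeholders(messages: list) -> int:
    """Count {"type": "image"} entries across all messages."""
    total = 0
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            continue  # plain-text messages carry no image placeholders
        total += sum(1 for part in content if part.get("type") == "image")
    return total


def check_images_match(messages: list, images: list) -> None:
    """Raise early if placeholders and supplied images disagree."""
    expected = count_image_placeholders(messages)
    if expected != len(images):
        raise ValueError(f"{expected} placeholder(s) but {len(images)} image(s)")


messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "Compare these two images."},
    ],
}]
check_images_match(messages, ["image1", "image2"])  # stand-ins for PIL images
print("placeholders:", count_image_placeholders(messages))
```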

Batch Processing

# Describe a list of images one by one (reuses the model and processor loaded earlier)
import os
import torch
from PIL import Image

def process_images(image_paths, prompt):
    results = []

    for path in image_paths:
        image = Image.open(path).convert("RGB")

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt}
                ]
            }
        ]

        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

        # Generate and decode the description
        output = model.generate(**inputs, max_new_tokens=500)
        result = processor.decode(output[0], skip_special_tokens=True)

        results.append({"file": path, "description": result})

        # Release cached GPU memory between images
        torch.cuda.empty_cache()

    return results

# Process every image in a folder
images = [f"./images/{f}" for f in os.listdir("./images") if f.endswith(('.jpg', '.jpeg', '.png'))]
results = process_images(images, "Describe this image in one paragraph.")
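A plain endswith check on the file name skips upper-case extensions such as .JPG and returns files in arbitrary order. One possible alternative, sketched with pathlib, matches extensions case-insensitively and sorts the result; the function name and the set of suffixes are assumptions for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(folder: str) -> list[str]:
    """Return sorted paths of image files in a folder, matching case-insensitively."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Only list the folder if it exists, so the sketch runs anywhere
if Path("./images").is_dir():
    print(list_images("./images"))
```

The returned list can be passed directly as image_paths to a batch function like the one above.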

Gradio Interface

import gradio as gr
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def analyze_image(image, question):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=analyze_image,
    inputs=[
        gr.Image(type="pil", label="Upload image"),
        gr.Textbox(label="Question", placeholder="What is in this image?")
    ],
    outputs=gr.Textbox(label="Response"),
    title="Llama 3.2 Vision - Image Analysis",
    description="Upload an image and ask questions about it. Running on CLORE.AI."
)

demo.launch(server_name="0.0.0.0", server_port=7860)

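Performance

| Task | Model | GPU | Time |
|---|---|---|---|
| Single image description | 11B | | ~3s |
| Single image description | 11B | A100 40GB | ~2s |
| OCR (1 page) | 11B | | ~5s |
| Document analysis | 11B | A100 40GB | ~8s |
| Batch (10 images) | 11B | A100 40GB | ~25s |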

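Quantization

4-bit with bitsandbytes

import torch
from transformers import MllamaForConditionalGeneration, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Store weights in 4-bit, compute in bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)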

GGUF with Ollama

# 4-bit quantization (runs in 8GB of VRAM)
ollama pull llama3.2-vision:11b-instruct-q4_K_M

# 8-bit quantization
ollama pull llama3.2-vision:11b-instruct-q8_0

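Cost Estimate

Typical CLORE.AI marketplace prices:

| GPU | Typical price | Best for |
|---|---|---|
| RTX 4090 24GB | ~$0.10 | 11B models |
| A100 40GB | ~$0.17 | 11B with long context |
| A100 80GB | ~$0.25 | 11B, optimized |
| 4x A100 80GB | ~$1.00 | 90B models |

Prices vary. Check the CLORE.AI Marketplace for current rates.

Save money:
Use Spot orders for batch processing
Pay with CLORE
Use quantized models (4-bit) during development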
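To get a feel for what a batch job might cost, throughput and price can be combined in a back-of-the-envelope calculation. The sketch below assumes the listed prices are per hour and that throughput scales linearly with batch size, neither of which is stated here; the function name is illustrative. It uses the ~$0.17 price and the figure of about 25 seconds per 10 images quoted in this guide.

```python
def batch_cost_usd(num_images: int, seconds_per_image: float,
                   hourly_rate_usd: float) -> float:
    """Estimate rental cost for processing images sequentially."""
    hours = num_images * seconds_per_image / 3600
    return hours * hourly_rate_usd


# ~25 s per 10 images -> about 2.5 s per image (assumed to scale linearly)
cost = batch_cost_usd(num_images=10_000, seconds_per_image=2.5, hourly_rate_usd=0.17)
print(f"Estimated cost for 10,000 images: ${cost:.2f}")
```

Under these assumptions the job takes roughly seven hours of GPU time. Model loading time and idle time on the rented server are not included.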

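Troubleshooting

Out of Memory

# Load the model in 4-bit to reduce VRAM use
from transformers import BitsAndBytesConfig

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Or reduce max_new_tokens
output = model.generate(**inputs, max_new_tokens=256)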

Slow Generation

Make sure the GPU is actually in use (check nvidia-smi)
Use bfloat16 instead of float32
Reduce image resolution before processing
Use vLLM for higher throughput

Image Fails to Load

from PIL import Image
import requests
from io import BytesIO

# From a URL
url = "https://example.com/image.jpg"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
image = Image.open(BytesIO(response.content)).convert("RGB")

# From a file
image = Image.open("path/to/image.jpg").convert("RGB")

# Downscale if the image is too large
max_size = 1024
if max(image.size) > max_size:
    image.thumbnail((max_size, max_size))
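Downscaling to fit a maximum edge length keeps the aspect ratio intact. The arithmetic behind that can be sketched without PIL as follows; this illustrates the proportional scaling only and is not PIL's exact rounding logic, and the function name is hypothetical.

```python
def fit_within(width: int, height: int, max_size: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_size."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already small enough, never upscale
    scale = max_size / longest
    # Keep at least one pixel on each side for extreme aspect ratios
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))  # (1024, 768)
print(fit_within(800, 600))    # (800, 600)
```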

HuggingFace Token Required

# Set a token for gated models
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx

# Or log in
huggingface-cli login
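A script can check for a token before attempting a multi-gigabyte download, so a missing credential fails fast. The sketch below looks at HF_TOKEN as well as the HUGGING_FACE_HUB_TOKEN variable used above; the helper name is hypothetical, and it does not detect a token saved by huggingface-cli login.

```python
import os


def find_hf_token(env=None):
    """Return the first non-empty Hugging Face token found in the environment."""
    if env is None:
        env = os.environ
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if find_hf_token() is None:
    print("No token in the environment; gated models may fail to download.")
else:
    print("Token found in the environment.")
```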

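Llama Vision vs. Other Models

| Feature | Llama 3.2 Vision | LLaVA 1.6 | GPT-4V |
|---|---|---|---|
| Parameters | 11B / 90B | 7B / 34B | Unknown |
| Open weights | Yes | Yes | No |
| OCR quality | Excellent | Good | Excellent |
| Context | 128K | 32K | 128K |
| Multi-image | Yes | Limited | Yes |
| License | Llama 3.2 Community License | Apache 2.0 | Proprietary |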

When to use Llama 3.2 Vision:

You need an open-weight multimodal model
OCR and document analysis
Integration with the Llama ecosystem
Long-context understanding

Next Steps

LLaVA - alternative vision model
Florence-2 - Microsoft's vision model
Ollama - easy deployment
vLLM - production serving
