LLaVA

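Chat with images on Clore.ai using the LLaVA vision-language model

LLaVA lets you chat with images and is an open-source alternative to GPT-4V.

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.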

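Renting on CLORE.AI

Visit the CLORE.AI Marketplace
Filter by GPU type, VRAM, and price
Choose On-Demand (fixed rate) or Spot (bid price)
Configure your order:
- Select a Docker image
- Set ports (TCP for SSH, HTTP for web interfaces)
- Add environment variables if needed
- Enter a startup command
Select a payment method: CLORE, BTC, or USDT/USDC
Create the order and wait for deployment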

Access your server

Find connection details in My Orders
Web interfaces: use the HTTP port URL
SSH: ssh -p <port> root@<proxy-address>

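What is LLaVA?

LLaVA (Large Language and Vision Assistant) can:

Understand and describe images
Answer questions about visual content
Analyze charts, diagrams, and screenshots
Perform OCR and document understanding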

Model variants

Model            Size     VRAM      Quality
LLaVA-1.5-7B     7B       8GB       Good
LLaVA-1.5-13B    13B      16GB      Better
LLaVA-1.6-34B    34B      40GB      Best
LLaVA-NeXT       7-34B    8-40GB    -
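As a rough illustration of how the VRAM column can guide model choice, the sketch below picks the largest listed model that fits a given card. The figures are copied from the table; the dictionary and function names are hypothetical, and the numbers should be treated as approximate minimums rather than guarantees.

```python
# VRAM figures copied from the model table; treat them as rough minimums.
MODEL_VRAM_GB = {
    "LLaVA-1.5-7B": 8,
    "LLaVA-1.5-13B": 16,
    "LLaVA-1.6-34B": 40,
}

def largest_model_that_fits(vram_gb):
    """Return the largest listed model whose VRAM need fits, or None."""
    fitting = [(need, name) for name, need in MODEL_VRAM_GB.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(largest_model_that_fits(24))
```

A 24 GB card leaves headroom for the 13B model but not for the 34B one, which is why the helper stops at LLaVA-1.5-13B in that case.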

Quick deploy

Docker image:

pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

Ports:

22/tcp
8000/http

Command:

pip install llava torch transformers accelerate gradio && \
python -m llava.serve.cli --model-path liuhaotian/llava-v1.5-7b --load-4bit

Access your service

After deployment, find your http_pub URL in My Orders:

Go to the My Orders page
Click your order
Find the http_pub URL (for example, abc123.clorecloud.net)

In the examples below, use https://YOUR_HTTP_PUB_URL instead of localhost.
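One way to avoid hard-coding the address in every example is a small helper that turns the http_pub value into a full HTTPS URL. This is only a sketch: the function name is hypothetical, and it assumes the service is reachable over HTTPS at the http_pub hostname, as the instruction above implies.

```python
from urllib.parse import urlsplit

def service_url(http_pub, path="/"):
    """Build an https URL from an http_pub value such as 'abc123.clorecloud.net'."""
    # Accept either a bare hostname or a value that already includes a scheme.
    host = urlsplit(http_pub).netloc if "://" in http_pub else http_pub
    return f"https://{host.strip('/')}/{path.lstrip('/')}"

print(service_url("abc123.clorecloud.net", "/api/generate"))
```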

Installation

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
pip install flash-attn --no-build-isolation

Basic usage

Python API

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
from PIL import Image

model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

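# Simple inference. eval_model loads the model from model_path on its own
# and prints the answer, so the load_pretrained_model call above is only
# needed when you write a custom inference loop.
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image in detail",
    "conv_mode": None,
    "image_file": "photo.jpg",
    "sep": ",",
    "temperature": 0.2,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

# eval_model prints the result itself and returns nothing
eval_model(args)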

Using Transformers

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the image
image = Image.open("photo.jpg")

# Build the conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Keyword arguments avoid depending on the processor's positional order
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
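The decoded string contains the prompt as well as the answer. A minimal sketch of trimming it, assuming the Mistral-style template of this checkpoint where the prompt ends with "[/INST]" (the helper name is hypothetical, and other checkpoints may use a different marker):

```python
def extract_answer(decoded, marker="[/INST]"):
    """Return the text after the last prompt marker, or the whole string if absent."""
    return decoded.split(marker)[-1].strip()

sample = "[INST] <image>\nWhat is shown in this image? [/INST] A cat on a sofa."
print(extract_answer(sample))
```

Applied to the example above, this would be called as extract_answer(response).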

Ollama integration (recommended)

The easiest way to run LLaVA on CLORE.AI:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the LLaVA model
ollama pull llava:7b

# Run with an image (CLI)
ollama run llava:7b "Describe this image: /path/to/image.jpg"

LLaVA API via Ollama

Important: LLaVA's vision features only work through the /api/generate endpoint with the images parameter. /api/chat and the OpenAI-compatible endpoints do not support images for LLaVA.

Works: /api/generate

# Encode the image as base64 first
BASE64_IMAGE=$(base64 -i photo.jpg | tr -d '\n')

# Send the vision request
curl https://your-http-pub.clorecloud.net/api/generate -d "{
  \"model\": \"llava:7b\",
  \"prompt\": \"What do you see in this image? Describe it in detail.\",
  \"images\": [\"$BASE64_IMAGE\"],
  \"stream\": false
}"

Response:

{
  "model": "llava:7b",
  "response": "The image shows a beautiful mountain sunset...",
  "done": true
}

Does not work: /api/chat (returns null for vision)

# The following does not work for vision queries:
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llava:7b",
  "messages": [{"role": "user", "content": "describe", "images": ["..."]}]
}'
# Returns null for image-related responses
Python example with Ollama

import requests
import base64

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Use /api/generate for vision (do not use /api/chat!)
response = requests.post(
    "https://your-http-pub.clorecloud.net/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "What do you see in this image?",
        "images": [encode_image("photo.jpg")],
        "stream": False
    }
)
print(response.json()["response"])

Complete working example

import requests
import base64

def analyze_image(ollama_url, image_path, question):
    """Analyze an image with LLaVA via Ollama."""
    # Encode the image
    with open(image_path, "rb") as f:
        image_base64 = base64.b64encode(f.read()).decode()

    # Use /api/generate (the only endpoint that works for vision)
    response = requests.post(
        f"{ollama_url}/api/generate",
        json={
            "model": "llava:7b",
            "prompt": question,
            "images": [image_base64],
            "stream": False
        }
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()["response"]

# Usage
url = "https://your-http-pub.clorecloud.net"
result = analyze_image(url, "photo.jpg", "Describe this image in detail")
print(result)
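Building the request body separately from the HTTP call makes it possible to check the payload offline before spending GPU time. The sketch below is one possible way to do that; build_generate_payload is a hypothetical helper, and the field names are the ones used in the /api/generate requests above.

```python
import base64
import json

def build_generate_payload(image_bytes, prompt, model="llava:7b"):
    """Build the JSON body for a non-streaming /api/generate vision request."""
    return {
        "model": model,
        "prompt": prompt,
        # The images field is a list of base64-encoded image strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

payload = build_generate_payload(b"\x89PNG fake bytes", "Describe this image")
print(json.dumps(payload)[:80])
```

The resulting dictionary can be passed as the json argument of requests.post.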

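Use cases

Image description

prompt = "Describe this image in detail, including colors, objects, and mood."

OCR / Text extraction

prompt = "Extract all text visible in this image. Format it clearly."

Chart analysis

prompt = "Analyze this chart. What are the key trends and insights?"

Code from screenshots

prompt = "Extract the code shown in this screenshot. Provide only the code."

Object detection

prompt = "List all objects visible in this image and their approximate locations."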
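If several of these use cases are needed in one script, the prompts can be kept in a small lookup table. This is only a sketch: the task keys and the helper name are made up for illustration, and the prompt strings are the ones from the use cases above.

```python
# Prompts taken from the use cases above; the task keys are made-up labels.
TASK_PROMPTS = {
    "describe": "Describe this image in detail, including colors, objects, and mood.",
    "ocr": "Extract all text visible in this image. Format it clearly.",
    "chart": "Analyze this chart. What are the key trends and insights?",
    "code": "Extract the code shown in this screenshot. Provide only the code.",
    "objects": "List all objects visible in this image and their approximate locations.",
}

def prompt_for(task):
    """Look up the prompt for a task key, failing loudly on unknown keys."""
    try:
        return TASK_PROMPTS[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_PROMPTS)}") from None

print(prompt_for("ocr"))
```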

Gradio interface

import gradio as gr
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

def analyze_image(image, question):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    response = processor.decode(output[0], skip_special_tokens=True)

    # Extract the assistant's response
    return response.split("[/INST]")[-1].strip()

demo = gr.Interface(
    fn=analyze_image,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(label="Question", value="Describe this image in detail")
    ],
    outputs=gr.Textbox(label="Response"),
    title="LLaVA Vision Assistant"
)

demo.launch(server_name="0.0.0.0", server_port=8000)

API server

from fastapi import FastAPI, UploadFile, File, Form
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import io

app = FastAPI()

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

@app.post("/analyze")
async def analyze(
    image: UploadFile = File(...),
    question: str = Form(default="Describe this image")
):
    # Convert to RGB so uploads with an alpha channel or palette also work
    img = Image.open(io.BytesIO(await image.read())).convert("RGB")

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=img, text=prompt, return_tensors="pt").to(model.device)

    # The token limit here is an assumed value; adjust it to your needs
    output = model.generate(**inputs, max_new_tokens=500)
    response = processor.decode(output[0], skip_special_tokens=True)

    # Extract the assistant's response
    return {"response": response.split("[/INST]")[-1].strip()}

# Run: uvicorn server:app --host 0.0.0.0 --port 8000

Batch processing

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import json
import os

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

def analyze_image(image_path, question):
    image = Image.open(image_path)

    conversation = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ]}
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=300)
    # Extract the assistant's response
    return processor.decode(output[0], skip_special_tokens=True).split("[/INST]")[-1].strip()

# Process the images in a folder
image_folder = "./images"
results = []

for filename in os.listdir(image_folder):
    if filename.endswith(('.jpg', '.png', '.jpeg')):
        path = os.path.join(image_folder, filename)
        description = analyze_image(path, "Describe this image briefly")
        results.append({"file": filename, "description": description})
        print(f"{filename}: {description[:100]}...")

# Save the results
with open("descriptions.json", "w") as f:
    json.dump(results, f, indent=2)
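The extension check in the loop above is case-sensitive, so a file named photo.JPG would be skipped, and os.listdir returns names in arbitrary order. A possible variant that handles both is sketched below; list_images and IMAGE_EXTENSIONS are hypothetical names, not part of the original script.

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def list_images(folder):
    """Return sorted paths of image files in folder, matching extensions case-insensitively."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
```

The batch loop could then iterate over list_images(image_folder) and pass each path to analyze_image.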

Memory optimization

4-bit quantization

import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

CPU offloading

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload"
)
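To see why 4-bit quantization helps, a back-of-the-envelope estimate of weight memory can be sketched as follows. It counts the weights only (parameter count times bits per parameter); activations, the KV cache, and quantization overhead come on top, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

for bits in (16, 4):
    print(f"7B at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

By this estimate a 7B model needs roughly a quarter of the weight memory at 4-bit compared with float16.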

Performance

Model            GPU         Tokens/sec
LLaVA-1.5-7B     RTX 3090    ~30
LLaVA-1.5-7B     RTX 4090    ~45
LLaVA-1.6-7B     RTX 4090    ~40
LLaVA-1.5-13B    A100        ~35

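Troubleshooting

Out of memory

# Use 4-bit quantization
# Or use a smaller model (7B instead of 13B)
# Or process smaller images
image = image.resize((336, 336))

Slow generation

Use flash attention
Reduce max_new_tokens
Use a quantized model

Poor quality

Use a larger model
Write better prompts with context
Use higher-resolution images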
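The out-of-memory tip resizes to a fixed 336x336 square, which stretches non-square images. If distortion matters, one option is to shrink the image while keeping its aspect ratio. The sketch below only computes the target size; fit_within is a hypothetical helper, and 336 is the value from the tip above.

```python
def fit_within(width, height, max_side=336):
    """Scale (width, height) down so the longer side is at most max_side."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(1920, 1080))
```

With Pillow this could be used as image = image.resize(fit_within(*image.size)); images that are already small are left at their original size.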

Cost estimate

Typical CLORE.AI Marketplace rates (as of 2024):

GPU          Hourly rate    Daily rate    4-hour session
RTX 3060     ~$0.03         ~$0.70        ~$0.12
RTX 3090     ~$0.06         ~$1.50        ~$0.25
RTX 4090     ~$0.10         ~$2.30        ~$0.40
A100 40GB    ~$0.17         ~$4.00        ~$0.70
A100 80GB    ~$0.25         ~$6.00        ~$1.00

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.

Save money:

Use the Spot market for flexible workloads (often 30-50% cheaper)
Pay with CLORE
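The daily and 4-hour figures follow from the hourly rate, so a session budget can be estimated with simple multiplication. The sketch below uses three of the approximate hourly rates listed above; the pairing of rates with GPU names is read from the table and should be treated as an assumption, and the helper name is hypothetical.

```python
# Approximate hourly rates (USD) from the cost table; actual prices vary.
HOURLY_RATE_USD = {"RTX 3060": 0.03, "A100 40GB": 0.17, "A100 80GB": 0.25}

def session_cost(gpu, hours):
    """Estimate the cost of renting a GPU for a number of hours."""
    return round(HOURLY_RATE_USD[gpu] * hours, 2)

print(session_cost("A100 40GB", 4))
```

The exact product differs slightly from the table because the table values are rounded.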

Next steps

Ollama LLMs - Run LLaVA with Ollama
RAG + LangChain - Vision + RAG
vLLM Inference - Production deployment
