# Qwen2.5-VL Vision Language Model

Qwen2.5-VL from Alibaba's Qwen team (released January 2025) is among the top-performing open-weight vision-language models (VLMs). Available in 3B, 7B, and 72B parameter sizes, it understands images, video frames, PDFs, charts, and complex visual layouts. The 7B variant hits the sweet spot — it outperforms many larger models on benchmarks while running comfortably on a single 24 GB GPU.

On [Clore.ai](https://clore.ai/) you can rent the exact GPU you need — from an RTX 3090 for the 7B model to multi-GPU setups for the 72B variant — and start analyzing visual content in minutes.

## Key Features

* **Multimodal input** — images, video, PDFs, screenshots, charts, and diagrams in a single model.
* **Three scales** — 3B (edge/mobile), 7B (production sweet spot), 72B (SOTA quality).
* **Dynamic resolution** — processes images at their native resolution; no forced resize to 224×224.
* **Video understanding** — accepts multi-frame video input with temporal reasoning.
* **Document OCR** — extracts text from scanned documents, receipts, and handwritten notes.
* **Multilingual** — strong performance across English, Chinese, and 20+ other languages.
* **Ollama support** — run locally with `ollama run qwen2.5vl:7b` for zero-code deployment.
* **Transformers integration** — `Qwen2_5_VLForConditionalGeneration` in HuggingFace `transformers`.

## Requirements

| Component  | 3B    | 7B       | 72B                |
| ---------- | ----- | -------- | ------------------ |
| GPU VRAM   | 8 GB  | 16–24 GB | 80+ GB (multi-GPU) |
| System RAM | 16 GB | 32 GB    | 128 GB             |
| Disk       | 10 GB | 20 GB    | 150 GB             |
| Python     | 3.10+ | 3.10+    | 3.10+              |
| CUDA       | 12.1+ | 12.1+    | 12.1+              |

**Clore.ai GPU recommendation:** For the **7B model**, an **RTX 4090** (24 GB, \~$0.5–2/day) or **RTX 3090** (24 GB, \~$0.3–1/day) is ideal. For **72B**, filter the marketplace for **A100 80 GB** or multi-GPU setups.

## Quick Start

### Option A: Ollama (Simplest)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the 7B vision model
ollama run qwen2.5vl:7b
```

Then, at the Ollama prompt:

```
>>> Describe this image: /path/to/photo.jpg
```

### Option B: Python / Transformers

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate qwen-vl-utils pillow
```

## Usage Examples

### Image Understanding with Transformers

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"},
            {"type": "text", "text": "What species is this insect? Describe its key identifying features."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]

print(response)
```

### Video Analysis

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///workspace/clip.mp4", "max_pixels": 360 * 420, "fps": 1.0},
            {"type": "text", "text": "Summarize what happens in this video. List the key events in order."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
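To budget VRAM before running video inference, it helps to estimate how many frames and visual tokens the `fps` and `max_pixels` settings produce. The sketch below assumes Qwen2.5-VL's roughly 28×28-pixels-per-visual-token patch merging; `video_frame_budget` is a hypothetical helper, not part of `qwen_vl_utils`.

```python
def video_frame_budget(duration_s: float, fps: float, max_pixels: int,
                       pixels_per_token: int = 28 * 28) -> tuple[int, int]:
    """Return (sampled_frames, approx_visual_tokens_per_frame).

    At fps=1.0 the processor samples one frame per second of video, and
    max_pixels caps the area each frame is resized to. Each visual token
    covers roughly a 28x28-pixel region (2x2 merged 14x14 ViT patches).
    """
    frames = max(1, int(duration_s * fps))
    tokens_per_frame = max_pixels // pixels_per_token
    return frames, tokens_per_frame

# A 60-second clip at the settings used above (fps=1.0, max_pixels=360*420):
frames, tokens = video_frame_budget(60, 1.0, 360 * 420)
print(frames, tokens)  # 60 frames, ~192 visual tokens per frame
```

Doubling `fps` or `max_pixels` roughly doubles the visual-token count, which is why the troubleshooting table suggests lowering both when video processing is slow.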

### Document OCR and Extraction

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///workspace/receipt.jpg"},
            {"type": "text", "text": "Extract all items, quantities, and prices from this receipt. Return as JSON."},
        ],
    }
]

# Process using the same model/processor setup from above
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
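Models asked to "Return as JSON" often wrap the answer in a markdown code fence, so a tolerant parser saves post-processing. `extract_json` below is a hypothetical helper, not part of the Qwen tooling — a minimal sketch:

```python
import json
import re

def extract_json(response: str):
    """Parse a JSON payload from model output, tolerating ```json fences."""
    # Strip a markdown code fence if the model wrapped its answer in one
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", response, re.DOTALL)
    payload = match.group(1) if match else response.strip()
    return json.loads(payload)

raw = '```json\n{"items": [{"name": "coffee", "qty": 2, "price": 3.5}]}\n```'
print(extract_json(raw)["items"][0]["name"])  # coffee
```

If parsing still fails, re-prompt with "Return only valid JSON, no markdown" — the model follows that instruction reliably.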

### Ollama API for Batch Processing

```python
import ollama
import base64
from pathlib import Path

def analyze_image(image_path: str, question: str) -> str:
    """Send an image to Qwen2.5-VL via Ollama API."""
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model="qwen2.5vl:7b",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_data],
        }],
    )
    return response["message"]["content"]

# Batch process a folder of images
for img in sorted(Path("./photos").glob("*.jpg")):
    result = analyze_image(str(img), "Describe this image in one sentence.")
    print(f"{img.name}: {result}")
```

## Tips for Clore.ai Users

1. **Ollama for quick deployment** — `ollama run qwen2.5vl:7b` is the fastest path to a working VLM. No Python code needed for interactive use.
2. **7B is the sweet spot** — the 7B Instruct variant runs in bfloat16 on a 24 GB card (and in well under 16 GB with 4-bit quantization) while delivering quality competitive with much larger models.
3. **Dynamic resolution matters** — Qwen2.5-VL processes images at native resolution. For large images (>4K), resize to 1920px max width to avoid excessive VRAM usage.
4. **Video fps setting** — for video input, set `fps=1.0` to sample 1 frame per second. Higher values eat VRAM fast; 1 fps is enough for most analysis tasks.
5. **Persistent storage** — set `HF_HOME=/workspace/hf_cache`; the 7B model is \~15 GB. For Ollama, models go to `~/.ollama/models/`.
6. **Structured output** — Qwen2.5-VL follows JSON formatting instructions well. Ask for "Return as JSON" and you'll get parseable output most of the time.
7. **Multi-image comparison** — you can pass multiple images in a single message for comparison tasks (e.g., "Which of these two products looks more premium?").
8. **tmux** — always run inside `tmux` on Clore.ai rentals so long downloads and batch jobs survive SSH disconnects.
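The downscaling from tip 3 is easy to get right with a small aspect-ratio helper. `fit_to_max_width` below is a hypothetical pure function; pass its result to Pillow's `Image.resize` (Pillow is already in the install list above).

```python
def fit_to_max_width(width: int, height: int, max_width: int = 1920) -> tuple[int, int]:
    """Scale (width, height) down so width <= max_width, preserving aspect ratio.

    Returns the original size unchanged if it already fits.
    """
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, round(height * scale)

print(fit_to_max_width(3840, 2160))  # (1920, 1080)
```

With Pillow this becomes something like `img.resize(fit_to_max_width(*img.size))` before saving the image that goes into the message.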

## Troubleshooting

| Problem                                     | Fix                                                                                       |
| ------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `OutOfMemoryError` with 7B                  | Pass `quantization_config=BitsAndBytesConfig(load_in_4bit=True)` to `from_pretrained()` (requires `bitsandbytes`); or use the 3B variant |
| Ollama model not found                      | `ollama pull qwen2.5vl:7b` — ensure you have the correct tag                              |
| Slow video processing                       | Reduce `fps` to 0.5 and `max_pixels` to `256 * 256`; fewer frames = faster inference      |
| Garbled or empty output                     | Increase `max_new_tokens`; the default may be too low for detailed descriptions           |
| `ImportError: qwen_vl_utils`                | `pip install qwen-vl-utils` — required for `process_vision_info()`                        |
| 72B model doesn't fit                       | Use 2× A100 80 GB with `device_map="auto"` or apply AWQ quantization                      |
| Image path not found                        | For local files in messages, use `file:///absolute/path` format                           |
| Chinese in output when prompting in English | Add "Respond in English only." to your prompt                                             |
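For the `OutOfMemoryError` fix, a minimal 4-bit loading sketch, assuming a recent `transformers` with `bitsandbytes` installed (parameter names as in the HuggingFace `BitsAndBytesConfig` API):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig

# 4-bit NF4 quantization cuts the 7B model's weight memory to roughly a quarter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The rest of the pipeline (processor, `process_vision_info`, `generate`) is unchanged.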
