Qwen2.5-VL Vision Language Model
Run Qwen2.5-VL, the leading open vision-language model, for image/video/document understanding on Clore.ai GPUs.
Qwen2.5-VL from Alibaba (January 2025) is among the top-performing open-weight vision-language models (VLMs). Available in 3B, 7B, and 72B parameter sizes, it understands images, video frames, PDFs, charts, and complex visual layouts. The 7B variant hits the sweet spot — it outperforms many larger models on benchmarks while running comfortably on a single 24 GB GPU.
On Clore.ai you can rent the exact GPU you need — from an RTX 3090 for the 7B model to multi-GPU setups for the 72B variant — and start analyzing visual content in minutes.
Key Features
Multimodal input — images, video, PDFs, screenshots, charts, and diagrams in a single model.
Three scales — 3B (edge/mobile), 7B (production sweet spot), 72B (SOTA quality).
Dynamic resolution — processes images at their native resolution; no forced resize to 224×224.
Video understanding — accepts multi-frame video input with temporal reasoning.
Document OCR — extracts text from scanned documents, receipts, and handwritten notes.
Multilingual — strong performance across English, Chinese, and 20+ other languages.
Ollama support — run locally with ollama run qwen2.5vl:7b for zero-code deployment.
Transformers integration — Qwen2_5_VLForConditionalGeneration in Hugging Face transformers.
Requirements
| Requirement | 3B | 7B | 72B |
| --- | --- | --- | --- |
| GPU VRAM | 8 GB | 16–24 GB | 80+ GB (multi-GPU) |
| System RAM | 16 GB | 32 GB | 128 GB |
| Disk | 10 GB | 20 GB | 150 GB |
| Python | 3.10+ | 3.10+ | 3.10+ |
| CUDA | 12.1+ | 12.1+ | 12.1+ |
Clore.ai GPU recommendation: For the 7B model, an RTX 4090 (24 GB, ~$0.5–2/day) or RTX 3090 (24 GB, ~$0.3–1/day) is ideal. For 72B, filter the marketplace for A100 80 GB or multi-GPU setups.
Quick Start
Option A: Ollama (Simplest)
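A minimal sketch for a fresh Ubuntu rental, assuming Ollama is not preinstalled on your Clore.ai image (skip the install line if your template already ships it):

```bash
# Install Ollama via the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model and start an interactive session (weights download on first run)
ollama run qwen2.5vl:7b
```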
Then in the ollama prompt:
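Ollama's CLI detects local image paths included in the prompt and attaches them to the request; the path below is a placeholder:

```
>>> Describe this image: /workspace/photo.jpg
```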
Option B: Python / Transformers
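A typical environment setup, assuming a CUDA-enabled PyTorch build is already present on the rental image (exact versions may vary):

```bash
pip install transformers accelerate qwen-vl-utils
pip install bitsandbytes  # optional, only needed for 4-bit loading on smaller GPUs
```

The full inference code is shown under Usage Examples below.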
Usage Examples
Image Understanding with Transformers
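A sketch following the standard Qwen2.5-VL Transformers pattern; the image path is a placeholder and max_new_tokens is an arbitrary choice:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load in bfloat16 and let accelerate place the weights on the GPU
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Local files must use the file:///absolute/path format (see Troubleshooting)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///workspace/photo.jpg"},  # placeholder path
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens so only the model's answer is decoded
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```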
Video Analysis
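Video input follows the same pattern, reusing model and processor from the image example. The clip path is a placeholder; depending on your qwen-vl-utils version you may also need to forward the sampled-frame metadata to the processor:

```python
# Reuses `model`, `processor`, and `process_vision_info` from the image example above.
messages = [{
    "role": "user",
    "content": [
        # fps=1.0 samples one frame per second (see the tips below); path is a placeholder
        {"type": "video", "video": "file:///workspace/clip.mp4",
         "fps": 1.0, "max_pixels": 360 * 420},
        {"type": "text", "text": "Summarize what happens in this video, step by step."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```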
Document OCR and Extraction
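An OCR-style extraction sketch, again reusing model and processor from the image example; the receipt path and field list are illustrative:

```python
# Reuses `model`, `processor`, and `process_vision_info` from the image example above.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///workspace/receipt.jpg"},  # placeholder scan
        {"type": "text", "text": (
            "Extract the merchant name, date, line items, and total from this receipt. "
            "Return the result as JSON only."
        )},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```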
Ollama API for Batch Processing
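A batch-processing sketch against Ollama's local REST API (default port 11434). The image folder is a placeholder, and qwen2.5vl:7b must already be pulled:

```python
import base64
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
IMAGE_DIR = Path("/workspace/images")               # placeholder input folder

for path in sorted(IMAGE_DIR.glob("*.jpg")):
    # Ollama expects images as base64 strings in the "images" field
    b64 = base64.b64encode(path.read_bytes()).decode()
    resp = requests.post(OLLAMA_URL, json={
        "model": "qwen2.5vl:7b",
        "prompt": "Describe this image in one sentence.",
        "images": [b64],
        "stream": False,
    }, timeout=300)
    resp.raise_for_status()
    print(path.name, "->", resp.json()["response"].strip())
```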
Tips for Clore.ai Users
Ollama for quick deployment — ollama run qwen2.5vl:7b is the fastest path to a working VLM. No Python code needed for interactive use.
7B is the sweet spot — the 7B Instruct variant fits in 16 GB VRAM with 4-bit quantization and delivers quality competitive with much larger models.
Dynamic resolution matters — Qwen2.5-VL processes images at native resolution. For large images (>4K), resize to 1920px max width to avoid excessive VRAM usage (a resize sketch follows this list).
Video fps setting — for video input, set fps=1.0 to sample 1 frame per second. Higher values eat VRAM fast; 1 fps is enough for most analysis tasks.
Persistent storage — set HF_HOME=/workspace/hf_cache; the 7B model is ~15 GB. For Ollama, models go to ~/.ollama/models/.
Structured output — Qwen2.5-VL follows JSON formatting instructions well. Ask for "Return as JSON" and you'll get parseable output most of the time.
Multi-image comparison — you can pass multiple images in a single message for comparison tasks (e.g., "Which of these two products looks more premium?").
tmux — always run inside tmux on Clore.ai rentals so long-running jobs survive SSH disconnects.
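As referenced in the dynamic-resolution tip above, a minimal Pillow sketch for capping image width before inference (the 1920 px limit comes from that tip; paths are placeholders):

```python
from PIL import Image

MAX_WIDTH = 1920  # cap suggested in the dynamic-resolution tip

def cap_width(src: str, dst: str, max_width: int = MAX_WIDTH) -> str:
    """Downscale an image to at most `max_width` pixels wide, keeping aspect ratio."""
    img = Image.open(src)
    if img.width > max_width:
        new_height = round(img.height * max_width / img.width)
        img = img.resize((max_width, new_height), Image.LANCZOS)
    img.save(dst)
    return dst

# Example: shrink a very wide scan before passing it to the model
cap_width("/workspace/big_scan.jpg", "/workspace/big_scan_1920.jpg")
```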
Troubleshooting
| Problem | Fix |
| --- | --- |
| OutOfMemoryError with 7B | Use load_in_4bit=True in from_pretrained() with bitsandbytes (see the sketch below this table), or use the 3B variant |
| Ollama model not found | ollama pull qwen2.5vl:7b — ensure you have the correct tag |
| Slow video processing | Reduce fps to 0.5 and max_pixels to 256 * 256; fewer frames = faster inference |
| Garbled or empty output | Increase max_new_tokens; the default may be too low for detailed descriptions |
| ImportError: qwen_vl_utils | pip install qwen-vl-utils — required for process_vision_info() |
| 72B model doesn't fit | Use 2× A100 80 GB with device_map="auto" or apply AWQ quantization |
| Image path not found | For local files in messages, use the file:///absolute/path format |
| Chinese in output when prompting in English | Add "Respond in English only." to your prompt |