# Qwen2.5-VL Vision Language Model

Qwen2.5-VL from Alibaba's Qwen team (released January 2025) is among the top-performing open-weight vision-language models (VLMs). Available in 3B, 7B, and 72B parameter sizes, it understands images, video frames, PDFs, charts, and complex visual layouts. The 7B variant hits the sweet spot — it outperforms many larger models on benchmarks while running comfortably on a single 24 GB GPU.

On [Clore.ai](https://clore.ai/) you can rent the exact GPU you need — from an RTX 3090 for the 7B model to multi-GPU setups for the 72B variant — and start analyzing visual content in minutes.

## Key Features

* **Multimodal input** — images, video, PDFs, screenshots, charts, and diagrams in a single model.
* **Three scales** — 3B (edge/mobile), 7B (production sweet spot), 72B (SOTA quality).
* **Dynamic resolution** — processes images at their native resolution; no forced resize to 224×224.
* **Video understanding** — accepts multi-frame video input with temporal reasoning.
* **Document OCR** — extracts text from scanned documents, receipts, and handwritten notes.
* **Multilingual** — strong performance across English, Chinese, and 20+ other languages.
* **Ollama support** — run locally with `ollama run qwen2.5vl:7b` for zero-code deployment.
* **Transformers integration** — `Qwen2_5_VLForConditionalGeneration` in HuggingFace `transformers`.

## Requirements

| Component  | 3B    | 7B       | 72B                |
| ---------- | ----- | -------- | ------------------ |
| GPU VRAM   | 8 GB  | 16–24 GB | 80+ GB (multi-GPU) |
| System RAM | 16 GB | 32 GB    | 128 GB             |
| Disk       | 10 GB | 20 GB    | 150 GB             |
| Python     | 3.10+ | 3.10+    | 3.10+              |
| CUDA       | 12.1+ | 12.1+    | 12.1+              |

**Clore.ai GPU recommendation:** For the **7B model**, an **RTX 4090** (24 GB, \~$0.5–2/day) or **RTX 3090** (24 GB, \~$0.3–1/day) is ideal. For **72B**, filter the marketplace for **A100 80 GB** or multi-GPU setups.

## Quick Start

### Option A: Ollama (Simplest)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the 7B vision model
ollama run qwen2.5vl:7b
```

Then, at the Ollama prompt:

```
>>> Describe this image: /path/to/photo.jpg
```

### Option B: Python / Transformers

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate qwen-vl-utils pillow
```

## Usage Examples

### Image Understanding with Transformers

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"},
            {"type": "text", "text": "What species is this insect? Describe its key identifying features."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]

print(response)
```

### Video Analysis

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///workspace/clip.mp4", "max_pixels": 360 * 420, "fps": 1.0},
            {"type": "text", "text": "Summarize what happens in this video. List the key events in order."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
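To budget VRAM before running video inference, it helps to estimate how many frames and visual tokens the `fps` and `max_pixels` settings produce. The sketch below assumes Qwen2.5-VL's roughly 28×28-pixels-per-visual-token patch merging; `video_frame_budget` is a hypothetical helper, not part of `qwen_vl_utils`.

```python
def video_frame_budget(duration_s: float, fps: float, max_pixels: int,
                       pixels_per_token: int = 28 * 28) -> tuple[int, int]:
    """Return (sampled_frames, approx_visual_tokens_per_frame).

    At fps=1.0 the processor samples one frame per second of video, and
    max_pixels caps the area each frame is resized to. Each visual token
    covers roughly a 28x28-pixel region (2x2 merged 14x14 ViT patches).
    """
    frames = max(1, int(duration_s * fps))
    tokens_per_frame = max_pixels // pixels_per_token
    return frames, tokens_per_frame

# A 60-second clip at the settings used above (fps=1.0, max_pixels=360*420):
frames, tokens = video_frame_budget(60, 1.0, 360 * 420)
print(frames, tokens)  # 60 frames, ~192 visual tokens per frame
```

Doubling `fps` or `max_pixels` roughly doubles the visual-token count, which is why the troubleshooting table suggests lowering both when video processing is slow.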

### Document OCR and Extraction

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///workspace/receipt.jpg"},
            {"type": "text", "text": "Extract all items, quantities, and prices from this receipt. Return as JSON."},
        ],
    }
]

# Process using the same model/processor setup from above
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
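Models asked to "Return as JSON" often wrap the answer in a markdown code fence, so a tolerant parser saves post-processing. `extract_json` below is a hypothetical helper, not part of the Qwen tooling — a minimal sketch:

```python
import json
import re

def extract_json(response: str):
    """Parse a JSON payload from model output, tolerating ```json fences."""
    # Strip a markdown code fence if the model wrapped its answer in one
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", response, re.DOTALL)
    payload = match.group(1) if match else response.strip()
    return json.loads(payload)

raw = '```json\n{"items": [{"name": "coffee", "qty": 2, "price": 3.5}]}\n```'
print(extract_json(raw)["items"][0]["name"])  # coffee
```

If parsing still fails, re-prompt with "Return only valid JSON, no markdown" — the model follows that instruction reliably.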

### Ollama API for Batch Processing

```python
import ollama
import base64
from pathlib import Path

def analyze_image(image_path: str, question: str) -> str:
    """Send an image to Qwen2.5-VL via Ollama API."""
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model="qwen2.5vl:7b",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_data],
        }],
    )
    return response["message"]["content"]

# Batch process a folder of images
for img in sorted(Path("./photos").glob("*.jpg")):
    result = analyze_image(str(img), "Describe this image in one sentence.")
    print(f"{img.name}: {result}")
```

## Tips for Clore.ai Users

1. **Ollama for quick deployment** — `ollama run qwen2.5vl:7b` is the fastest path to a working VLM. No Python code needed for interactive use.
2. **7B is the sweet spot** — the 7B Instruct variant runs in bfloat16 on a 24 GB card (and in well under 16 GB with 4-bit quantization) while delivering quality competitive with much larger models.
3. **Dynamic resolution matters** — Qwen2.5-VL processes images at native resolution. For large images (>4K), resize to 1920px max width to avoid excessive VRAM usage.
4. **Video fps setting** — for video input, set `fps=1.0` to sample 1 frame per second. Higher values eat VRAM fast; 1 fps is enough for most analysis tasks.
5. **Persistent storage** — set `HF_HOME=/workspace/hf_cache`; the 7B model is \~15 GB. For Ollama, models go to `~/.ollama/models/`.
6. **Structured output** — Qwen2.5-VL follows JSON formatting instructions well. Ask for "Return as JSON" and you'll get parseable output most of the time.
7. **Multi-image comparison** — you can pass multiple images in a single message for comparison tasks (e.g., "Which of these two products looks more premium?").
8. **tmux** — always run inside `tmux` on Clore.ai rentals so long downloads and batch jobs survive SSH disconnects.
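The downscaling from tip 3 is easy to get right with a small aspect-ratio helper. `fit_to_max_width` below is a hypothetical pure function; pass its result to Pillow's `Image.resize` (Pillow is already in the install list above).

```python
def fit_to_max_width(width: int, height: int, max_width: int = 1920) -> tuple[int, int]:
    """Scale (width, height) down so width <= max_width, preserving aspect ratio.

    Returns the original size unchanged if it already fits.
    """
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, round(height * scale)

print(fit_to_max_width(3840, 2160))  # (1920, 1080)
```

With Pillow this becomes something like `img.resize(fit_to_max_width(*img.size))` before saving the image that goes into the message.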

## Troubleshooting

| Problem                                     | Fix                                                                                       |
| ------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `OutOfMemoryError` with 7B                  | Pass `quantization_config=BitsAndBytesConfig(load_in_4bit=True)` to `from_pretrained()` (requires `bitsandbytes`); or use the 3B variant |
| Ollama model not found                      | `ollama pull qwen2.5vl:7b` — ensure you have the correct tag                              |
| Slow video processing                       | Reduce `fps` to 0.5 and `max_pixels` to `256 * 256`; fewer frames = faster inference      |
| Garbled or empty output                     | Increase `max_new_tokens`; the default may be too low for detailed descriptions           |
| `ImportError: qwen_vl_utils`                | `pip install qwen-vl-utils` — required for `process_vision_info()`                        |
| 72B model doesn't fit                       | Use 2× A100 80 GB with `device_map="auto"` or apply AWQ quantization                      |
| Image path not found                        | For local files in messages, use `file:///absolute/path` format                           |
| Chinese in output when prompting in English | Add "Respond in English only." to your prompt                                             |
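For the `OutOfMemoryError` fix, a minimal 4-bit loading sketch, assuming a recent `transformers` with `bitsandbytes` installed (parameter names as in the HuggingFace `BitsAndBytesConfig` API):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig

# 4-bit NF4 quantization cuts the 7B model's weight memory to roughly a quarter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The rest of the pipeline (processor, `process_vision_info`, `generate`) is unchanged.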
