# Qwen2.5-VL Vision Language Model

Qwen2.5-VL from Alibaba (released January 2025) is one of the top-performing open-weight vision-language models (VLMs). Available in 3B, 7B, and 72B parameter sizes, it understands images, video frames, PDFs, charts, and complex visual layouts. The 7B variant hits the sweet spot: it outperforms many larger models on benchmarks while running comfortably on a single 24 GB GPU.

On [Clore.ai](https://clore.ai/) you can rent the exact GPU you need — from an RTX 3090 for the 7B model to multi-GPU setups for the 72B variant — and start analyzing visual content in minutes.

## Key Features

* **Multimodal input** — images, video, PDFs, screenshots, charts, and diagrams in a single model.
* **Three scales** — 3B (edge/mobile), 7B (production sweet spot), 72B (SOTA quality).
* **Dynamic resolution** — processes images at their native resolution; no forced resize to 224×224.
* **Video understanding** — accepts multi-frame video input with temporal reasoning.
* **Document OCR** — extracts text from scanned documents, receipts, and handwritten notes.
* **Multilingual** — strong performance across English, Chinese, and 20+ other languages.
* **Ollama support** — run locally with `ollama run qwen2.5vl:7b` for zero-code deployment.
* **Transformers integration** — `Qwen2_5_VLForConditionalGeneration` in HuggingFace `transformers`.

## Requirements

| Component  | 3B    | 7B       | 72B                |
| ---------- | ----- | -------- | ------------------ |
| GPU VRAM   | 8 GB  | 16–24 GB | 80+ GB (multi-GPU) |
| System RAM | 16 GB | 32 GB    | 128 GB             |
| Disk       | 10 GB | 20 GB    | 150 GB             |
| Python     | 3.10+ | 3.10+    | 3.10+              |
| CUDA       | 12.1+ | 12.1+    | 12.1+              |

**Clore.ai GPU recommendation:** For the **7B model**, an **RTX 4090** (24 GB, \~$0.5–2/day) or **RTX 3090** (24 GB, \~$0.3–1/day) is ideal. For **72B**, filter the marketplace for **A100 80 GB** or multi-GPU setups.

## Quick Start

### Option A: Ollama (Simplest)

```bash
# Install ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run the 7B vision model
ollama run qwen2.5vl:7b
```

Then in the ollama prompt:

```
>>> Describe this image: /path/to/photo.jpg
```
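
Ollama also exposes a local REST API (by default on port 11434), so you can script one-off queries without the interactive prompt. A minimal sketch using the `requests` package (assumed to be installed), with the 7B model already pulled:

```python
import base64
import requests

# Read and base64-encode the image for the Ollama API.
with open("/path/to/photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# POST to the local Ollama server (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5vl:7b",
        "prompt": "Describe this image.",
        "images": [image_b64],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```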

### Option B: Python / Transformers

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate qwen-vl-utils pillow
```
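
Before loading the model, it helps to confirm that CUDA is visible and the rented card has enough memory for the variant you picked. A quick sanity check (the VRAM thresholds below are rough guides, not hard limits):

```python
import torch
import transformers

print("torch:", torch.__version__, "| transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {vram_gb:.1f} GB VRAM")
    # Rough guide: ~16 GB+ for 7B in bf16, ~8 GB for 3B.
    if vram_gb < 16:
        print("Consider the 3B variant or 4-bit quantization.")
```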

## Usage Examples

### Image Understanding with Transformers

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"},
            {"type": "text", "text": "What species is this insect? Describe its key identifying features."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]

print(response)
```
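
If the bf16 weights do not fit in your card's VRAM (for example on a 16 GB GPU), the same checkpoint can be loaded in 4-bit via `bitsandbytes`. A sketch, assuming `pip install bitsandbytes` has been run; the quantization settings shown are common defaults, not tuned values:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# 4-bit NF4 quantization with bf16 compute.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
# The processor setup and generation code above stay unchanged.
```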

### Video Analysis

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///workspace/clip.mp4", "max_pixels": 360 * 420, "fps": 1.0},
            {"type": "text", "text": "Summarize what happens in this video. List the key events in order."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
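
If you have already extracted frames, or want precise control over which frames are analyzed, `qwen_vl_utils` also accepts a list of image paths as the `video` entry; each path is treated as one frame, in order. A sketch assuming hypothetical frame files `frame_000.jpg`, `frame_001.jpg`, and so on under `/workspace/frames/`:

```python
# Pre-extracted frames, passed as an ordered list instead of a video file.
frame_paths = [f"file:///workspace/frames/frame_{i:03d}.jpg" for i in range(8)]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frame_paths},
            {"type": "text", "text": "Describe how the scene changes across these frames."},
        ],
    }
]
# Processing and generation are identical to the video example above.
```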

### Document OCR and Extraction

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///workspace/receipt.jpg"},
            {"type": "text", "text": "Extract all items, quantities, and prices from this receipt. Return as JSON."},
        ],
    }
]

# Process using the same model/processor setup from above
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
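
The model often wraps the JSON in a Markdown code fence, so strip that before parsing. A minimal post-processing sketch, assuming `response` holds the decoded output string (as in the first Transformers example):

```python
import json
import re

def parse_json_response(response: str):
    """Extract and parse JSON from a response that may use ```json fences."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", response, re.DOTALL)
    payload = match.group(1) if match else response
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None  # Fall back to manual inspection or re-prompting

items = parse_json_response(response)
print(items)
```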

### Ollama API for Batch Processing

```python
import ollama
import base64
from pathlib import Path

def analyze_image(image_path: str, question: str) -> str:
    """Send an image to Qwen2.5-VL via Ollama API."""
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model="qwen2.5vl:7b",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_data],
        }],
    )
    return response["message"]["content"]

# Batch process a folder of images (Path is already imported above)
for img in sorted(Path("./photos").glob("*.jpg")):
    result = analyze_image(str(img), "Describe this image in one sentence.")
    print(f"{img.name}: {result}")
```

## Tips for Clore.ai Users

1. **Ollama for quick deployment** — `ollama run qwen2.5vl:7b` is the fastest path to a working VLM. No Python code needed for interactive use.
2. **7B is the sweet spot** — the 7B Instruct variant fits in 16 GB VRAM with 4-bit quantization and delivers quality competitive with much larger models.
3. **Dynamic resolution matters** — Qwen2.5-VL processes images at native resolution. For very large images (above 4K), resize to roughly 1920 px max width, or cap the processor's pixel budget, to avoid excessive VRAM usage (see the sketch after this list).
4. **Video fps setting** — for video input, set `fps=1.0` to sample 1 frame per second. Higher values eat VRAM fast; 1 fps is enough for most analysis tasks.
5. **Persistent storage** — set `HF_HOME=/workspace/hf_cache`; the 7B model is \~15 GB. For ollama, models go to `~/.ollama/models/`.
6. **Structured output** — Qwen2.5-VL follows JSON formatting instructions well. Ask for "Return as JSON" and you'll get parseable output most of the time.
7. **Multi-image comparison** — you can pass multiple images in a single message for comparison tasks (e.g., "Which of these two products looks more premium?").
8. **tmux** — always run long jobs inside `tmux` on Clore.ai rentals so they survive SSH disconnects.
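
For the pixel-budget control mentioned in tip 3, the Qwen VL processor accepts `min_pixels` and `max_pixels` arguments that bound how many visual tokens each image produces. A sketch following the official usage pattern; the exact budget values here are illustrative:

```python
from transformers import AutoProcessor

# Each image is resized so its visual-token count stays within this pixel budget.
min_pixels = 256 * 28 * 28    # lower bound
max_pixels = 1280 * 28 * 28   # upper bound; lower this to save VRAM on large images

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```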

## Troubleshooting

| Problem                                     | Fix                                                                                       |
| ------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `OutOfMemoryError` with 7B                  | Load in 4-bit via `BitsAndBytesConfig(load_in_4bit=True)` with `bitsandbytes`; or use the 3B variant |
| Ollama model not found                      | `ollama pull qwen2.5vl:7b` — ensure you have the correct tag                              |
| Slow video processing                       | Reduce `fps` to 0.5 and `max_pixels` to `256 * 256`; fewer frames = faster inference      |
| Garbled or empty output                     | Increase `max_new_tokens`; the default may be too low for detailed descriptions           |
| `ImportError: qwen_vl_utils`                | `pip install qwen-vl-utils` — required for `process_vision_info()`                        |
| 72B model doesn't fit                       | Use 2× A100 80 GB with `device_map="auto"` or apply AWQ quantization                      |
| Image path not found                        | For local files in messages, use `file:///absolute/path` format                           |
| Chinese in output when prompting in English | Add "Respond in English only." to your prompt                                             |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/vision-models/qwen-vl.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
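
For example, a scripted query against this endpoint might look like the following (the question text is illustrative, and the response format is whatever the docs server returns):

```python
import requests
from urllib.parse import quote

question = "Which Clore.ai GPU should I rent for the Qwen2.5-VL 72B model?"
url = f"https://docs.clore.ai/guides/vision-models/qwen-vl.md?ask={quote(question)}"

resp = requests.get(url, timeout=30)
print(resp.text)
```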
