# Llama 3.2 Vision

Run Meta's multimodal Llama 3.2 Vision models for image understanding on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Llama 3.2 Vision?

* **Multimodal** - Understands both text and images
* **Multiple sizes** - 11B and 90B parameter versions
* **Versatile** - OCR, visual QA, image captioning, document analysis
* **Open weights** - Fully open source from Meta
* **Llama ecosystem** - Compatible with Ollama, vLLM, transformers

## Model Variants

| Model                         | Parameters | VRAM (FP16) | Context | Best For                |
| ----------------------------- | ---------- | ----------- | ------- | ----------------------- |
| Llama-3.2-11B-Vision          | 11B        | 24GB        | 128K    | General use, single GPU |
| Llama-3.2-90B-Vision          | 90B        | 180GB       | 128K    | Maximum quality         |
| Llama-3.2-11B-Vision-Instruct | 11B        | 24GB        | 128K    | Chat/assistant          |
| Llama-3.2-90B-Vision-Instruct | 90B        | 180GB       | 128K    | Production              |
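
The FP16 figures above follow directly from the parameter count at 2 bytes per weight, rounded up for runtime overhead. A quick back-of-envelope check:

```python
# Rough estimate of VRAM needed for the weights alone;
# KV cache, vision-encoder activations, and framework overhead come on top.
def weight_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"11B @ FP16:  ~{weight_vram_gb(11):.0f} GB")       # ~20 GB
print(f"90B @ FP16:  ~{weight_vram_gb(90):.0f} GB")       # ~168 GB
print(f"11B @ 4-bit: ~{weight_vram_gb(11, 0.5):.0f} GB")  # ~5 GB
```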

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command** (the model weights are gated on Hugging Face, so set `HUGGING_FACE_HUB_TOKEN` in the container environment; see Troubleshooting below):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
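
A quick smoke test against the public endpoint (the hostname below is the example placeholder from above) confirms the vLLM server is reachable:

```python
import requests

base_url = "https://abc123.clorecloud.net"  # replace with your http_pub URL

# vLLM's OpenAI-compatible server lists loaded models at /v1/models
resp = requests.get(f"{base_url}/v1/models", timeout=30)
resp.raise_for_status()
print(resp.json())  # should include meta-llama/Llama-3.2-11B-Vision-Instruct
```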

## Hardware Requirements

| Model      | Minimum GPU   | Recommended  | Optimal   |
| ---------- | ------------- | ------------ | --------- |
| 11B Vision | RTX 4090 24GB | A100 40GB    | A100 80GB |
| 90B Vision | 4x A100 40GB  | 4x A100 80GB | 8x H100   |

## Installation

### Using Ollama (Easiest)

```bash
# Pull the model
ollama pull llama3.2-vision:11b

# Run an interactive session
ollama run llama3.2-vision:11b
```

### Using vLLM

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Using Transformers

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```

## Basic Usage

### Image Understanding

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image? Describe in detail."}
        ]
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
```

### With Ollama

```bash
# Describe an image
ollama run llama3.2-vision:11b "Describe this image: /path/to/image.jpg"

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision:11b",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_here"],
  "stream": false
}'
```
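
The same API call from Python, as a minimal sketch (assumes Ollama on its default port 11434 and a local `image.jpg`):

```python
import base64
import requests

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2-vision:11b",
        "prompt": "What is in this image?",
        "images": [image_b64],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])
```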

### With vLLM API

```python
from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Encode image to base64
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
```
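
vLLM can also fetch the image itself when you pass a regular HTTP(S) URL, so the base64 step is only needed for local files:

```python
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
            ]
        }
    ],
    max_tokens=500
)
print(response.choices[0].message.content)
```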

## Use Cases

### OCR / Text Extraction

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all text from this image. Format as markdown."}
        ]
    }
]
```

### Document Analysis

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this document. Summarize the key points."}
        ]
    }
]
```

### Visual Question Answering

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many people are in this photo? What are they doing?"}
        ]
    }
]
```

### Image Captioning

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Write a detailed caption for this image suitable for social media."}
        ]
    }
]
```

### Code from Screenshots

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this UI screenshot to HTML/CSS code."}
        ]
    }
]
```
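
All five use cases share the same generation boilerplate, so in practice it is worth wrapping it once. A minimal helper, assuming the `model` and `processor` loaded in Basic Usage (the name `ask` is just illustrative):

```python
def ask(image, question, max_new_tokens=500):
    """One image + question turn through Llama 3.2 Vision."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output[0], skip_special_tokens=True)

print(ask(image, "Extract all text from this image. Format as markdown."))
```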

## Multiple Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these two images. What are the differences?"}
        ]
    }
]

# Build the prompt text, then pass both images to the processor
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    images=[image1, image2],
    text=input_text,
    return_tensors="pt"
).to(model.device)
```

## Batch Processing

```python
import os
from PIL import Image

def process_images(image_paths, prompt):
    results = []

    for path in image_paths:
        image = Image.open(path)

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt}
                ]
            }
        ]

        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(image, input_text, return_tensors="pt").to(model.device)

        output = model.generate(**inputs, max_new_tokens=300)
        result = processor.decode(output[0], skip_special_tokens=True)

        results.append({"file": path, "description": result})

        # Clear cache between images
        torch.cuda.empty_cache()

    return results

# Process folder
images = [f"./images/{f}" for f in os.listdir("./images") if f.lower().endswith((".jpg", ".jpeg", ".png"))]
results = process_images(images, "Describe this image in one paragraph.")
```
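
If you want to keep the batch output, a plain JSON dump of `results` is enough (the filename is illustrative):

```python
import json

with open("descriptions.json", "w") as f:
    json.dump(results, f, indent=2)
```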

## Gradio Interface

```python
import gradio as gr
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def analyze_image(image, question):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=analyze_image,
    inputs=[
        gr.Image(type="pil", label="Upload Image"),
        gr.Textbox(label="Question", placeholder="What's in this image?")
    ],
    outputs=gr.Textbox(label="Response"),
    title="Llama 3.2 Vision - Image Analysis",
    description="Upload an image and ask questions about it. Running on CLORE.AI."
)

demo.launch(server_name="0.0.0.0", server_port=7860)  # expose 7860/http in your CLORE port config
```

## Performance

| Task                     | Model | GPU       | Time  |
| ------------------------ | ----- | --------- | ----- |
| Single image description | 11B   | RTX 4090  | \~3s  |
| Single image description | 11B   | A100 40GB | \~2s  |
| OCR (1 page)             | 11B   | RTX 4090  | \~5s  |
| Document analysis        | 11B   | A100 40GB | \~8s  |
| Batch (10 images)        | 11B   | A100 40GB | \~25s |

## Quantization

### 4-bit with bitsandbytes

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
```
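
You can verify the savings with `get_memory_footprint()`, a standard transformers model method; expect roughly a quarter of the FP16 figure for the weights:

```python
print(f"Model weights: {model.get_memory_footprint() / 1024**3:.1f} GB")
```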

### GGUF with Ollama

```bash
# 4-bit quantized (weights fit in ~8GB VRAM)
ollama pull llama3.2-vision:11b-q4_K_M

# 8-bit quantized
ollama pull llama3.2-vision:11b-q8_0
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | Best For              |
| ------------- | ----------- | --------------------- |
| RTX 4090 24GB | \~$0.10     | 11B model             |
| A100 40GB     | \~$0.17     | 11B with long context |
| A100 80GB     | \~$0.25     | 11B optimal           |
| 4x A100 80GB  | \~$1.00     | 90B model             |

*Prices vary. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** orders for batch processing
* Pay with **CLORE** tokens
* Use quantized models (4-bit) for development

## Troubleshooting

### Out of Memory

```python
# Use 4-bit quantization (see the Quantization section above)
from transformers import BitsAndBytesConfig

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Or reduce max_new_tokens
output = model.generate(**inputs, max_new_tokens=256)
```

### Slow Generation

* Ensure GPU is being used (check `nvidia-smi`)
* Use bfloat16 instead of float32
* Reduce image resolution before processing
* Use vLLM for better throughput

### Image Not Loading

```python
from PIL import Image
import requests
from io import BytesIO

# From URL
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# From file
image = Image.open("path/to/image.jpg").convert("RGB")

# Resize if too large
max_size = 1024
if max(image.size) > max_size:
    image.thumbnail((max_size, max_size))
```

### HuggingFace Token Required

```bash
# Set token for gated models
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx

# Or login
huggingface-cli login
```
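
If you prefer to keep everything in one script, the same login works from Python via `huggingface_hub`:

```python
from huggingface_hub import login

login(token="hf_xxxxx")  # token from https://huggingface.co/settings/tokens
```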

## Llama Vision vs Others

| Feature     | Llama 3.2 Vision | LLaVA 1.6  | GPT-4V      |
| ----------- | ---------------- | ---------- | ----------- |
| Parameters  | 11B / 90B        | 7B / 34B   | Unknown     |
| Open Source | Yes              | Yes        | No          |
| OCR Quality | Excellent        | Good       | Excellent   |
| Context     | 128K             | 32K        | 128K        |
| Multi-image | Yes              | Limited    | Yes         |
| License     | Llama 3.2        | Apache 2.0 | Proprietary |

**Use Llama 3.2 Vision when:**

* Need an open-source multimodal model
* OCR and document analysis
* Integration with the Llama ecosystem
* Long-context understanding

## Next Steps

* [LLaVA](https://docs.clore.ai/guides/vision-models/llava-vision-language) - Alternative vision model
* [Florence-2](https://docs.clore.ai/guides/vision-models/florence2) - Microsoft's vision model
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy deployment
* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production serving

