# Llama 3.2 Vision

Run Meta's multimodal Llama 3.2 Vision models for image understanding on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Llama 3.2 Vision?

* **Multimodal** - Understands both text and images
* **Multiple sizes** - 11B and 90B parameter versions
* **Versatile** - OCR, visual QA, image captioning, document analysis
* **Open weights** - Fully open source from Meta
* **Llama ecosystem** - Compatible with Ollama, vLLM, transformers

## Model Variants

| Model                         | Parameters | VRAM (FP16) | Context | Best For                |
| ----------------------------- | ---------- | ----------- | ------- | ----------------------- |
| Llama-3.2-11B-Vision          | 11B        | 24GB        | 128K    | General use, single GPU |
| Llama-3.2-90B-Vision          | 90B        | 180GB       | 128K    | Maximum quality         |
| Llama-3.2-11B-Vision-Instruct | 11B        | 24GB        | 128K    | Chat/assistant          |
| Llama-3.2-90B-Vision-Instruct | 90B        | 180GB       | 128K    | Production              |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
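
For the OpenAI-compatible examples below, that substitution is just the client's `base_url`. A minimal sketch (the helper name is ours; `abc123.clorecloud.net` is a placeholder hostname):

```python
def api_base(http_pub_host: str) -> str:
    """Build the OpenAI-compatible base URL from your http_pub hostname."""
    # vLLM serves its OpenAI-compatible API under /v1
    return f"https://{http_pub_host}/v1"

# e.g. OpenAI(base_url=api_base("abc123.clorecloud.net"), api_key="not-needed")
```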

## Hardware Requirements

| Model      | Minimum GPU   | Recommended  | Optimal   |
| ---------- | ------------- | ------------ | --------- |
| 11B Vision | RTX 4090 24GB | A100 40GB    | A100 80GB |
| 90B Vision | 4x A100 40GB  | 4x A100 80GB | 8x H100   |
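
For the multi-GPU rows, vLLM must be told to shard the model across GPUs with `--tensor-parallel-size`. A sketch for a 4x A100 rental, reusing the Quick Deploy flags:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-90B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 8192
```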

## Installation

### Using Ollama (Easiest)

```bash
# Pull the model
ollama pull llama3.2-vision:11b

# Run an interactive session

ollama run llama3.2-vision:11b
```

### Using vLLM

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Using Transformers

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```

## Basic Usage

### Image Understanding

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image? Describe in detail."}
        ]
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
```

### With Ollama

```bash
# Describe an image
ollama run llama3.2-vision:11b "Describe this image: /path/to/image.jpg"

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision:11b",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_here"]
}'
```
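
The `images` field takes base64-encoded file contents. The same call from Python, as a stdlib-only sketch (the helper names are ours; assumes Ollama on its default port; `"stream": false` returns one JSON object instead of streamed chunks):

```python
import base64
import json
from urllib.request import Request, urlopen

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def describe_image(path: str, prompt: str,
                   host: str = "http://localhost:11434") -> str:
    """POST an image + prompt to Ollama's /api/generate and return the reply."""
    payload = json.dumps({
        "model": "llama3.2-vision:11b",
        "prompt": prompt,
        "images": [encode_image(path)],
        "stream": False,  # single JSON response instead of NDJSON chunks
    }).encode()
    req = Request(f"{host}/api/generate", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```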

### With vLLM API

```python
from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Encode image to base64
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
```

## Use Cases

### OCR / Text Extraction

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all text from this image. Format as markdown."}
        ]
    }
]
```

### Document Analysis

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this document. Summarize the key points."}
        ]
    }
]
```

### Visual Question Answering

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many people are in this photo? What are they doing?"}
        ]
    }
]
```

### Image Captioning

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Write a detailed caption for this image suitable for social media."}
        ]
    }
]
```

### Code from Screenshots

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this UI screenshot to HTML/CSS code."}
        ]
    }
]
```
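
These prompts differ only in their text, so the surrounding boilerplate can be factored out. A sketch, assuming the `model` and `processor` loaded in Basic Usage (the helper names are ours):

```python
def vision_messages(prompt: str, n_images: int = 1) -> list:
    """Build the chat message list shared by all the use cases above."""
    content = [{"type": "image"} for _ in range(n_images)]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

def ask(model, processor, image, prompt: str, max_new_tokens: int = 500) -> str:
    """Run one image + prompt through the model and decode the reply."""
    input_text = processor.apply_chat_template(vision_messages(prompt),
                                               add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output[0], skip_special_tokens=True)
```

`ask(model, processor, image, "Extract all text from this image.")` then covers any of the cases above.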

## Multiple Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these two images. What are the differences?"}
        ]
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Pass the images in the same order as the image placeholders above
inputs = processor(
    images=[image1, image2],
    text=input_text,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Batch Processing

```python
import os
import torch
from PIL import Image

def process_images(image_paths, prompt):
    results = []

    for path in image_paths:
        image = Image.open(path)

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt}
                ]
            }
        ]

        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(image, input_text, return_tensors="pt").to(model.device)

        output = model.generate(**inputs, max_new_tokens=300)
        result = processor.decode(output[0], skip_special_tokens=True)

        results.append({"file": path, "description": result})

        # Clear cache between images
        torch.cuda.empty_cache()

    return results

# Process folder
images = [f"./images/{f}" for f in os.listdir("./images") if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
results = process_images(images, "Describe this image in one paragraph.")
```
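
Since `process_images` returns plain dicts, persisting a run takes one `json.dump`. A minimal sketch (the helper name and filename are arbitrary):

```python
import json

def save_results(results: list, path: str) -> None:
    """Write batch results ({'file', 'description'} dicts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

# save_results(results, "descriptions.json")
```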

## Gradio Interface

```python
import gradio as gr
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def analyze_image(image, question):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=analyze_image,
    inputs=[
        gr.Image(type="pil", label="Upload Image"),
        gr.Textbox(label="Question", placeholder="What's in this image?")
    ],
    outputs=gr.Textbox(label="Response"),
    title="Llama 3.2 Vision - Image Analysis",
    description="Upload an image and ask questions about it. Running on CLORE.AI."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Task                     | Model | GPU       | Time  |
| ------------------------ | ----- | --------- | ----- |
| Single image description | 11B   | RTX 4090  | \~3s  |
| Single image description | 11B   | A100 40GB | \~2s  |
| OCR (1 page)             | 11B   | RTX 4090  | \~5s  |
| Document analysis        | 11B   | A100 40GB | \~8s  |
| Batch (10 images)        | 11B   | A100 40GB | \~25s |

## Quantization

### 4-bit with bitsandbytes

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
```

### GGUF with Ollama

```bash
# 4-bit quantized (fits in 8GB VRAM)
ollama pull llama3.2-vision:11b-q4_K_M

# 8-bit quantized
ollama pull llama3.2-vision:11b-q8_0
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | Best For              |
| ------------- | ----------- | --------------------- |
| RTX 4090 24GB | \~$0.10     | 11B model             |
| A100 40GB     | \~$0.17     | 11B with long context |
| A100 80GB     | \~$0.25     | 11B optimal           |
| 4x A100 80GB  | \~$1.00     | 90B model             |

*Prices vary. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** orders for batch processing
* Pay with **CLORE** tokens
* Use quantized models (4-bit) for development

## Troubleshooting

### Out of Memory

```python
from transformers import BitsAndBytesConfig

# Use 4-bit quantization (pass a BitsAndBytesConfig; the bare
# load_in_4bit kwarg is deprecated in recent transformers releases)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Or reduce max_new_tokens
output = model.generate(**inputs, max_new_tokens=256)
```

### Slow Generation

* Ensure GPU is being used (check `nvidia-smi`)
* Use bfloat16 instead of float32
* Reduce image resolution before processing
* Use vLLM for better throughput
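
The first three checks can be scripted. A sketch with hypothetical helper names (`model` is the loaded model from Basic Usage; the 1120px cap is our assumption, not a model limit):

```python
from PIL import Image

def report_compute_setup(model):
    """Return (device, dtype) of a loaded model; slow runs are often CPU or float32."""
    p = next(model.parameters())
    return p.device, p.dtype

def shrink(image: Image.Image, max_side: int = 1120) -> Image.Image:
    """Downscale oversized images before the processor; large inputs dominate latency."""
    out = image.copy()
    out.thumbnail((max_side, max_side))  # preserves aspect ratio
    return out
```

If `report_compute_setup(model)` shows `cpu` or `torch.float32`, reload with `device_map="auto"` and `torch_dtype=torch.bfloat16`.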

### Image Not Loading

```python
from PIL import Image
import requests
from io import BytesIO

# From URL
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# From file
image = Image.open("path/to/image.jpg").convert("RGB")

# Resize if too large
max_size = 1024
if max(image.size) > max_size:
    image.thumbnail((max_size, max_size))
```

### HuggingFace Token Required

```bash
# Set token for gated models
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx

# Or login
huggingface-cli login
```

## Llama Vision vs Others

| Feature     | Llama 3.2 Vision | LLaVA 1.6  | GPT-4V      |
| ----------- | ---------------- | ---------- | ----------- |
| Parameters  | 11B / 90B        | 7B / 34B   | Unknown     |
| Open Source | Yes              | Yes        | No          |
| OCR Quality | Excellent        | Good       | Excellent   |
| Context     | 128K             | 32K        | 128K        |
| Multi-image | Yes              | Limited    | Yes         |
| License     | Llama 3.2        | Apache 2.0 | Proprietary |

**Use Llama 3.2 Vision when:**

* You need an open-source multimodal model
* You're doing OCR or document analysis
* You want integration with the Llama ecosystem
* You need long-context understanding

## Next Steps

* [LLaVA](https://docs.clore.ai/guides/vision-models/llava-vision-language) - Alternative vision model
* [Florence-2](https://docs.clore.ai/guides/vision-models/florence2) - Microsoft's vision model
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy deployment
* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production serving
