# Llama 3.2 Vision

Run Meta's multimodal Llama 3.2 Vision models for image understanding on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Llama 3.2 Vision?

* **Multimodal** - Understands both text and images
* **Multiple sizes** - 11B and 90B parameter versions
* **Versatile** - OCR, visual QA, image captioning, document analysis
* **Open weights** - Fully open source from Meta
* **Llama ecosystem** - Compatible with Ollama, vLLM, transformers

## Model Variants

| Model                         | Parameters | VRAM (FP16) | Context | Best For                |
| ----------------------------- | ---------- | ----------- | ------- | ----------------------- |
| Llama-3.2-11B-Vision          | 11B        | 24GB        | 128K    | General use, single GPU |
| Llama-3.2-90B-Vision          | 90B        | 180GB       | 128K    | Maximum quality         |
| Llama-3.2-11B-Vision-Instruct | 11B        | 24GB        | 128K    | Chat/assistant          |
| Llama-3.2-90B-Vision-Instruct | 90B        | 180GB       | 128K    | Production              |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
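
For the OpenAI-compatible examples below, that substitution is just the client's `base_url`. A minimal sketch (the helper name is ours; `abc123.clorecloud.net` is a placeholder hostname):

```python
def api_base(http_pub_host: str) -> str:
    """Build the OpenAI-compatible base URL from your http_pub hostname."""
    # vLLM serves its OpenAI-compatible API under /v1
    return f"https://{http_pub_host}/v1"

# e.g. OpenAI(base_url=api_base("abc123.clorecloud.net"), api_key="not-needed")
```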

## Hardware Requirements

| Model      | Minimum GPU   | Recommended  | Optimal   |
| ---------- | ------------- | ------------ | --------- |
| 11B Vision | RTX 4090 24GB | A100 40GB    | A100 80GB |
| 90B Vision | 4x A100 40GB  | 4x A100 80GB | 8x H100   |
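
For the multi-GPU rows, vLLM must be told to shard the model across GPUs with `--tensor-parallel-size`. A sketch for a 4x A100 rental, reusing the Quick Deploy flags:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-90B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 8192
```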

## Installation

### Using Ollama (Easiest)

```bash
# Pull the model
ollama pull llama3.2-vision:11b

# Run an interactive session

ollama run llama3.2-vision:11b
```

### Using vLLM

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Using Transformers

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```

## Basic Usage

### Image Understanding

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image? Describe in detail."}
        ]
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
```

### With Ollama

```bash
# Describe an image
ollama run llama3.2-vision:11b "Describe this image: /path/to/image.jpg"

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision:11b",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_here"]
}'
```
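
The `images` field takes base64-encoded file contents. The same call from Python, as a stdlib-only sketch (the helper names are ours; assumes Ollama on its default port; `"stream": false` returns one JSON object instead of streamed chunks):

```python
import base64
import json
from urllib.request import Request, urlopen

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def describe_image(path: str, prompt: str,
                   host: str = "http://localhost:11434") -> str:
    """POST an image + prompt to Ollama's /api/generate and return the reply."""
    payload = json.dumps({
        "model": "llama3.2-vision:11b",
        "prompt": prompt,
        "images": [encode_image(path)],
        "stream": False,  # single JSON response instead of NDJSON chunks
    }).encode()
    req = Request(f"{host}/api/generate", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```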

### With vLLM API

```python
from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Encode image to base64
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
```

## Use Cases

### OCR / Text Extraction

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all text from this image. Format as markdown."}
        ]
    }
]
```

### Document Analysis

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this document. Summarize the key points."}
        ]
    }
]
```

### Visual Question Answering

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many people are in this photo? What are they doing?"}
        ]
    }
]
```

### Image Captioning

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Write a detailed caption for this image suitable for social media."}
        ]
    }
]
```

### Code from Screenshots

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this UI screenshot to HTML/CSS code."}
        ]
    }
]
```
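
These prompts differ only in their text, so the surrounding boilerplate can be factored out. A sketch, assuming the `model` and `processor` loaded in Basic Usage (the helper names are ours):

```python
def vision_messages(prompt: str, n_images: int = 1) -> list:
    """Build the chat message list shared by all the use cases above."""
    content = [{"type": "image"} for _ in range(n_images)]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

def ask(model, processor, image, prompt: str, max_new_tokens: int = 500) -> str:
    """Run one image + prompt through the model and decode the reply."""
    input_text = processor.apply_chat_template(vision_messages(prompt),
                                               add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output[0], skip_special_tokens=True)
```

`ask(model, processor, image, "Extract all text from this image.")` then covers any of the cases above.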

## Multiple Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these two images. What are the differences?"}
        ]
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Pass the images in the same order as the image placeholders above
inputs = processor(
    images=[image1, image2],
    text=input_text,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Batch Processing

```python
import os
import torch
from PIL import Image

def process_images(image_paths, prompt):
    results = []

    for path in image_paths:
        image = Image.open(path)

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt}
                ]
            }
        ]

        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(image, input_text, return_tensors="pt").to(model.device)

        output = model.generate(**inputs, max_new_tokens=300)
        result = processor.decode(output[0], skip_special_tokens=True)

        results.append({"file": path, "description": result})

        # Clear cache between images
        torch.cuda.empty_cache()

    return results

# Process folder
images = [f"./images/{f}" for f in os.listdir("./images") if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
results = process_images(images, "Describe this image in one paragraph.")
```
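
Since `process_images` returns plain dicts, persisting a run takes one `json.dump`. A minimal sketch (the helper name and filename are arbitrary):

```python
import json

def save_results(results: list, path: str) -> None:
    """Write batch results ({'file', 'description'} dicts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

# save_results(results, "descriptions.json")
```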

## Gradio Interface

```python
import gradio as gr
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def analyze_image(image, question):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=analyze_image,
    inputs=[
        gr.Image(type="pil", label="Upload Image"),
        gr.Textbox(label="Question", placeholder="What's in this image?")
    ],
    outputs=gr.Textbox(label="Response"),
    title="Llama 3.2 Vision - Image Analysis",
    description="Upload an image and ask questions about it. Running on CLORE.AI."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Task                     | Model | GPU       | Time  |
| ------------------------ | ----- | --------- | ----- |
| Single image description | 11B   | RTX 4090  | \~3s  |
| Single image description | 11B   | A100 40GB | \~2s  |
| OCR (1 page)             | 11B   | RTX 4090  | \~5s  |
| Document analysis        | 11B   | A100 40GB | \~8s  |
| Batch (10 images)        | 11B   | A100 40GB | \~25s |

## Quantization

### 4-bit with bitsandbytes

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
```

### GGUF with Ollama

```bash
# 4-bit quantized (fits in 8GB VRAM)
ollama pull llama3.2-vision:11b-q4_K_M

# 8-bit quantized
ollama pull llama3.2-vision:11b-q8_0
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | Best For              |
| ------------- | ----------- | --------------------- |
| RTX 4090 24GB | \~$0.10     | 11B model             |
| A100 40GB     | \~$0.17     | 11B with long context |
| A100 80GB     | \~$0.25     | 11B optimal           |
| 4x A100 80GB  | \~$1.00     | 90B model             |

*Prices vary. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** orders for batch processing
* Pay with **CLORE** tokens
* Use quantized models (4-bit) for development

## Troubleshooting

### Out of Memory

```python
from transformers import BitsAndBytesConfig

# Use 4-bit quantization (pass a BitsAndBytesConfig; the bare
# load_in_4bit kwarg is deprecated in recent transformers releases)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Or reduce max_new_tokens
output = model.generate(**inputs, max_new_tokens=256)
```

### Slow Generation

* Ensure GPU is being used (check `nvidia-smi`)
* Use bfloat16 instead of float32
* Reduce image resolution before processing
* Use vLLM for better throughput
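
The first three checks can be scripted. A sketch with hypothetical helper names (`model` is the loaded model from Basic Usage; the 1120px cap is our assumption, not a model limit):

```python
from PIL import Image

def report_compute_setup(model):
    """Return (device, dtype) of a loaded model; slow runs are often CPU or float32."""
    p = next(model.parameters())
    return p.device, p.dtype

def shrink(image: Image.Image, max_side: int = 1120) -> Image.Image:
    """Downscale oversized images before the processor; large inputs dominate latency."""
    out = image.copy()
    out.thumbnail((max_side, max_side))  # preserves aspect ratio
    return out
```

If `report_compute_setup(model)` shows `cpu` or `torch.float32`, reload with `device_map="auto"` and `torch_dtype=torch.bfloat16`.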

### Image Not Loading

```python
from PIL import Image
import requests
from io import BytesIO

# From URL
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# From file
image = Image.open("path/to/image.jpg").convert("RGB")

# Resize if too large
max_size = 1024
if max(image.size) > max_size:
    image.thumbnail((max_size, max_size))
```

### HuggingFace Token Required

```bash
# Set token for gated models
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx

# Or login
huggingface-cli login
```

## Llama Vision vs Others

| Feature     | Llama 3.2 Vision | LLaVA 1.6  | GPT-4V      |
| ----------- | ---------------- | ---------- | ----------- |
| Parameters  | 11B / 90B        | 7B / 34B   | Unknown     |
| Open Source | Yes              | Yes        | No          |
| OCR Quality | Excellent        | Good       | Excellent   |
| Context     | 128K             | 32K        | 128K        |
| Multi-image | Yes              | Limited    | Yes         |
| License     | Llama 3.2        | Apache 2.0 | Proprietary |

**Use Llama 3.2 Vision when:**

* You need an open-source multimodal model
* You're doing OCR or document analysis
* You want integration with the Llama ecosystem
* You need long-context understanding

## Next Steps

* [LLaVA](https://docs.clore.ai/guides/vision-models/llava-vision-language) - Alternative vision model
* [Florence-2](https://docs.clore.ai/guides/vision-models/florence2) - Microsoft's vision model
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy deployment
* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production serving
