# Llama 3.2 Vision

Run Meta's multimodal Llama 3.2 Vision models for image understanding on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Llama 3.2 Vision?

* **Multimodal** - Understands both text and images
* **Multiple sizes** - 11B and 90B parameter versions
* **Versatile** - OCR, visual question answering, image captioning, document analysis
* **Open weights** - Downloadable from Meta under the Llama 3.2 Community License
* **Llama ecosystem** - Compatible with Ollama, vLLM, and transformers

## Model Variants

| Model                         | Parameters | VRAM (FP16) | Context | Best for                |
| ----------------------------- | ---------- | ----------- | ------- | ----------------------- |
| Llama-3.2-11B-Vision          | 11B        | 24GB        | 128K    | General use, single GPU |
| Llama-3.2-90B-Vision          | 90B        | 180GB       | 128K    | Maximum quality         |
| Llama-3.2-11B-Vision-Instruct | 11B        | 24GB        | 128K    | Chat/assistant          |
| Llama-3.2-90B-Vision-Instruct | 90B        | 180GB       | 128K    | Production              |
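
The FP16 column can be roughly cross-checked with simple arithmetic: weight memory is about the parameter count times the bytes per parameter. The sketch below assumes only that; it ignores activations, the KV cache, and framework overhead, which is one reason the table lists a little more than the raw weight size. The function name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param

for name, size in [("11B", 11), ("90B", 90)]:
    fp16 = weight_memory_gb(size)        # 16 bits per parameter
    int4 = weight_memory_gb(size, 4)     # 4 bits per parameter
    print(f"{name}: ~{fp16:.0f} GB in FP16, ~{int4:.1f} GB at 4-bit")
```

Treat the result as a lower bound when choosing a GPU, since real usage grows with context length and image count.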

## Quick Deployment on CLORE.AI

**Docker image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192
```

## Accessing Your Service

After deployment, find your `http_pub` URL under **My Orders**:

1. Go to the **My Orders** page
2. Click your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.
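
As a quick connectivity check, the sketch below turns an `http_pub` value into a base URL and asks the server which models it serves. It assumes the vLLM OpenAI-compatible server from the deployment command is running and reachable over HTTPS; the helper names `api_base_url` and `list_models` are illustrative, and `abc123.clorecloud.net` is the placeholder host from the steps above.

```python
import json
import urllib.request

def api_base_url(http_pub: str) -> str:
    """Turn an http_pub value into an OpenAI-compatible base URL."""
    host = http_pub.strip().removeprefix("https://").removeprefix("http://").rstrip("/")
    return f"https://{host}/v1"

def list_models(base_url: str, timeout: float = 10.0) -> list[str]:
    """Return the model IDs reported by the server's /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload["data"]]

# Placeholder host; replace with the http_pub value from your order
print(api_base_url("abc123.clorecloud.net"))
```

Calling `list_models(api_base_url(...))` with your real host should include the model name passed to `--model` once the server has finished loading.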

## Hardware Requirements

| Model      | Minimum GPU   | Recommended  | Optimal   |
| ---------- | ------------- | ------------ | --------- |
| 11B Vision | RTX 4090 24GB | A100 40GB    | A100 80GB |
| 90B Vision | 4x A100 40GB (quantized) | 4x A100 80GB | 8x H100   |

## Installation

### With Ollama (easiest)

```bash
# Download the model
ollama pull llama3.2-vision:11b

# Start an interactive session
ollama run llama3.2-vision:11b
```

### With vLLM

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### With Transformers

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```

## Basic Usage

### Image Understanding

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load the image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image? Describe in detail."}
        ]
    }
]

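input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# The chat template already adds the BOS token, so do not add special tokens again
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)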

output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
```
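
`generate` returns the prompt tokens followed by the newly generated ones, so decoding `output[0]` as above prints the prompt text along with the answer. If only the answer is wanted, one option is to slice off the prompt before decoding. The helper name `new_token_ids` is illustrative; it works on any sliceable sequence, including a 1-D tensor.

```python
def new_token_ids(output_ids, prompt_length):
    """Drop the first `prompt_length` IDs, keeping only the generated tokens."""
    return output_ids[prompt_length:]

# Stand-in example with plain lists: 3 prompt tokens followed by 2 generated tokens
print(new_token_ids([101, 102, 103, 7, 8], 3))  # [7, 8]

# With the objects from the example above (assumes `inputs` and `output` exist):
# prompt_length = inputs["input_ids"].shape[-1]
# answer = processor.decode(new_token_ids(output[0], prompt_length), skip_special_tokens=True)
```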

### With Ollama

```bash
# Describe an image
ollama run llama3.2-vision:11b "Describe this image: /path/to/image.jpg"

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision:11b",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_here"]
}'
```
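
The `images` field expects the raw file contents encoded as Base64. A minimal Python sketch of the same request follows, assuming an Ollama server on the default port; the helper names are illustrative, and `"stream": False` is set so the reply arrives as a single JSON object rather than a stream of chunks.

```python
import base64
import json
import urllib.request

def build_generate_payload(image_bytes: bytes, prompt: str,
                           model: str = "llama3.2-vision:11b") -> dict:
    """Build the JSON body for /api/generate with one Base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # request one JSON object instead of streamed chunks
    }

def describe_image(path: str, prompt: str,
                   url: str = "http://localhost:11434/api/generate") -> str:
    """Send the image at `path` to a running Ollama server and return its reply."""
    with open(path, "rb") as f:
        payload = build_generate_payload(f.read(), prompt)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["response"]
```

For example, `describe_image("image.jpg", "What is in this image?")` would return the model's text answer once the server responds.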

### With the vLLM API

```python
from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM ignores the key unless started with --api-key
)

# Encode the image as Base64
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
```
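
The example above hardcodes `image/jpeg` in the data URL. For mixed file types, a small helper can guess the MIME type from the file extension instead. This is a sketch using only the standard library; the function name `image_to_data_url` is illustrative.

```python
import base64
import mimetypes

def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL with a guessed MIME type."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed directly as the `url` value of an `image_url` content part.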

## Use Cases

### OCR / Text Extraction

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all text from this image. Format as markdown."}
        ]
    }
]
```

### Document Analysis

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this document. Summarize the key points."}
        ]
    }
]
```

### Visual Question Answering

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many people are in this photo? What are they doing?"}
        ]
    }
]
```

### Image Captioning

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Write a detailed caption for this image suitable for social media."}
        ]
    }
]
```

### Code from Screenshots

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this UI screenshot to HTML/CSS code."}
        ]
    }
]
```
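
The use cases above differ only in the prompt text, so they can be kept in one lookup table. The sketch below is one way to organize them; the names `TASK_PROMPTS` and `messages_for` and the task keys are illustrative choices, not part of any library.

```python
# Prompts copied from the use cases above, keyed by a short task name
TASK_PROMPTS = {
    "ocr": "Extract all text from this image. Format as markdown.",
    "document": "Analyze this document. Summarize the key points.",
    "vqa": "How many people are in this photo? What are they doing?",
    "caption": "Write a detailed caption for this image suitable for social media.",
    "ui_to_code": "Convert this UI screenshot to HTML/CSS code.",
}

def messages_for(task: str) -> list[dict]:
    """Build a single-image chat message list for one of the tasks above."""
    if task not in TASK_PROMPTS:
        raise KeyError(f"Unknown task {task!r}; choose from {sorted(TASK_PROMPTS)}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": TASK_PROMPTS[task]},
            ],
        }
    ]

print(messages_for("ocr")[0]["content"][1]["text"])
```

The result can be passed to `processor.apply_chat_template` in the same way as the hand-written `messages` lists.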

## Multiple Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these two images. What are the differences?"}
        ]
    }
]

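# Build the prompt, then process it together with both images
# (image1 and image2 are PIL images loaded beforehand)
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    images=[image1, image2],
    text=input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)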
```

## Batch Processing

```python
import os

import torch
from PIL import Image

# Assumes `model` and `processor` are already loaded as shown above

def process_images(image_paths, prompt):
    results = []

    for path in image_paths:
        image = Image.open(path)

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt}
                ]
            }
        ]

        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        # add_special_tokens=False: the chat template already adds the BOS token
        inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

        output = model.generate(**inputs, max_new_tokens=300)
        result = processor.decode(output[0], skip_special_tokens=True)

        results.append({"file": path, "description": result})

        # Clear the CUDA cache between images
        torch.cuda.empty_cache()

    return results

# Process a folder
images = [f"./images/{f}" for f in os.listdir("./images") if f.endswith(('.jpg', '.png'))]
results = process_images(images, "Describe this image in one paragraph.")
```
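
The `os.listdir` filter above is case-sensitive, so files such as `photo.JPG` or `scan.jpeg` are skipped. A slightly more forgiving sketch using `pathlib` follows; the helper name `find_images` and its default extension list are illustrative.

```python
from pathlib import Path

def find_images(folder: str, extensions=(".jpg", ".jpeg", ".png")) -> list[str]:
    """Return sorted paths of files in `folder` whose suffix matches, ignoring case."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```

The returned list can be passed to `process_images` in place of the list comprehension.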

## Gradio Interface

```python
import gradio as gr
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def analyze_image(image, question):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # add_special_tokens=False: the chat template already adds the BOS token
    inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=analyze_image,
    inputs=[
        gr.Image(type="pil", label="Upload Image"),
        gr.Textbox(label="Question", placeholder="What's in this image?")
    ],
    outputs=gr.Textbox(label="Answer"),
    title="Llama 3.2 Vision - Image Analysis",
    description="Upload an image and ask questions about it. Running on CLORE.AI."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Task                     | Model | GPU       | Time  |
| ------------------------ | ----- | --------- | ----- |
| Single image description | 11B   | RTX 4090  | \~3s  |
| Single image description | 11B   | A100 40GB | \~2s  |
| OCR (1 page)             | 11B   | RTX 4090  | \~5s  |
| Document analysis        | 11B   | A100 40GB | \~8s  |
| Batch (10 images)        | 11B   | A100 40GB | \~25s |

## Quantization

### 4-bit with bitsandbytes

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
```

### GGUF with Ollama

```bash
# 4-bit quantized (fits in 8GB VRAM)
ollama pull llama3.2-vision:11b-instruct-q4_K_M

# 8-bit quantized
ollama pull llama3.2-vision:11b-instruct-q8_0
```

## Cost Estimate

Typical CLORE.AI marketplace prices:

| GPU           | Hourly rate | Best for              |
| ------------- | ----------- | --------------------- |
| RTX 4090 24GB | \~$0.10     | 11B model             |
| A100 40GB     | \~$0.17     | 11B with long context |
| A100 80GB     | \~$0.25     | 11B, optimal          |
| 4x A100 80GB  | \~$1.00     | 90B model             |

*Prices vary. Check the* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
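
Hourly rates become easier to compare once they are combined with throughput. The sketch below estimates the rental cost of a sequential batch job from the approximate figures on this page (about 25 s per 10 images on an A100 40GB at about $0.17 per hour). It assumes the GPU is busy the whole time and ignores model download and loading time; the function name `batch_cost_usd` is illustrative.

```python
def batch_cost_usd(hourly_rate: float, seconds_per_image: float, num_images: int) -> float:
    """Estimated rental cost for processing `num_images` one after another."""
    hours = seconds_per_image * num_images / 3600
    return hourly_rate * hours

# Approximate figures from the tables on this page: ~25 s per 10 images, ~$0.17/hour
cost = batch_cost_usd(hourly_rate=0.17, seconds_per_image=25 / 10, num_images=1000)
print(f"${cost:.2f} per 1,000 images")
```

With these inputs the estimate comes to roughly $0.12 per 1,000 images; actual costs depend on current marketplace prices and on how long the instance sits idle.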

**Save money:**

* Use **Spot** orders for batch processing
* Pay with **CLORE** tokens
* Use quantized (4-bit) models for development

## Troubleshooting

### Out of Memory

```python
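from transformers import BitsAndBytesConfig

# Use 4-bit quantization (passing load_in_4bit directly is deprecated in recent transformers)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Or reduce max_new_tokens
output = model.generate(**inputs, max_new_tokens=256)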
```

### Slow Generation

* Make sure the GPU is being used (check with `nvidia-smi`)
* Use bfloat16 instead of float32
* Reduce the image resolution before processing
* Use vLLM for better throughput

### Image Fails to Load

```python
from PIL import Image
import requests
from io import BytesIO

# From a URL
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# From a file
image = Image.open("path/to/image.jpg").convert("RGB")

# Resize if too large
max_size = 1024
if max(image.size) > max_size:
    image.thumbnail((max_size, max_size))
```

### Hugging Face Token Required

```bash
# Set the token for gated models
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx

# Or log in
huggingface-cli login
```
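
Before starting a long download, it can help to confirm that a token is actually visible to the Python process. The sketch below checks the two environment variable names commonly read by `huggingface_hub`, `HF_TOKEN` and the older `HUGGING_FACE_HUB_TOKEN`. It does not detect a token saved by `huggingface-cli login`, which is stored in a file instead; the helper name `hf_token_from_env` is illustrative.

```python
import os

def hf_token_from_env(env=None):
    """Return the first Hugging Face token found in the environment, or None."""
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name)
        if value:
            return value
    return None

if hf_token_from_env() is None:
    print("No Hugging Face token in the environment; gated downloads may fail "
          "unless you have logged in with huggingface-cli.")
```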

## Llama Vision vs. Others

| Feature         | Llama 3.2 Vision            | LLaVA 1.6  | GPT-4V      |
| --------------- | --------------------------- | ---------- | ----------- |
| Parameters      | 11B / 90B                   | 7B / 34B   | Unknown     |
| Open weights    | Yes                         | Yes        | No          |
| OCR quality     | Excellent                   | Good       | Excellent   |
| Context         | 128K                        | 32K        | 128K        |
| Multiple images | Yes                         | Limited    | Yes         |
| License         | Llama 3.2 Community License | Apache 2.0 | Proprietary |

**Use Llama 3.2 Vision when you need:**

* An open-weight multimodal model
* OCR and document analysis
* Integration with the Llama ecosystem
* Long-context understanding

## Next Steps

* [LLaVA](/guides/guides_v2-de/vision-modelle/llava-vision-language.md) - Alternative vision model
* [Florence-2](/guides/guides_v2-de/vision-modelle/florence2.md) - Microsoft's vision model
* [Ollama](/guides/guides_v2-de/sprachmodelle/ollama.md) - Simple deployment
* [vLLM](/guides/guides_v2-de/sprachmodelle/vllm.md) - Production deployment

