# LLaVA

Chat with images using LLaVA, an open-source alternative to GPT-4V.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is LLaVA?

LLaVA (Large Language and Vision Assistant) can:

* Understand and describe images
* Answer questions about visual content
* Analyze charts, diagrams, screenshots
* OCR and document understanding

## Model Variants

| Model         | Size  | VRAM   | Quality |
| ------------- | ----- | ------ | ------- |
| LLaVA-1.5-7B  | 7B    | 8GB    | Good    |
| LLaVA-1.5-13B | 13B   | 16GB   | Better  |
| LLaVA-1.6-34B | 34B   | 40GB   | Best    |
| LLaVA-NeXT    | 7-34B | 8-40GB | Latest  |
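
As an illustration, the table above can be turned into a small helper that picks the largest variant fitting a given card. This is a sketch only: the thresholds come from the VRAM column, and `pick_variant` is a name invented here, not part of LLaVA.

```python
# Illustrative only: (variant, minimum VRAM in GB) taken from the table above
VARIANTS = [
    ("llava-v1.5-7b", 8),
    ("llava-v1.5-13b", 16),
    ("llava-v1.6-34b", 40),
]

def pick_variant(vram_gb: float):
    """Return the largest variant that fits in the given VRAM, or None."""
    fitting = [name for name, need in VARIANTS if need <= vram_gb]
    return fitting[-1] if fitting else None

print(pick_variant(24))  # a 24GB card (e.g. RTX 3090/4090) fits the 13B model
```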

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
# Install LLaVA from the official repo (the PyPI "llava" package is unrelated)
pip install git+https://github.com/haotian-liu/LLaVA.git && \
python -m llava.serve.cli --model-path liuhaotian/llava-v1.5-7b --image-file photo.jpg --load-4bit
```

`llava.serve.cli` opens an interactive chat about a single image, so `--image-file` is required.

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
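
If you script against the service, a tiny helper can do that substitution for you. This is a convenience sketch; `to_public_url` and the `abc123.clorecloud.net` hostname are placeholders, not part of any CLORE API.

```python
from urllib.parse import urlparse, urlunparse

def to_public_url(local_url: str, http_pub_host: str) -> str:
    """Swap localhost:port for the CLORE http_pub hostname, forcing HTTPS."""
    p = urlparse(local_url)
    return urlunparse(("https", http_pub_host, p.path, p.params, p.query, p.fragment))

# "abc123.clorecloud.net" is a placeholder for your own http_pub value
print(to_public_url("http://localhost:8000/analyze", "abc123.clorecloud.net"))
# https://abc123.clorecloud.net/analyze
```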

## Installation

```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
pip install flash-attn --no-build-isolation
```

## Basic Usage

### Python API

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
from PIL import Image

model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Simple inference
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image in detail",
    "conv_mode": None,
    "image_file": "photo.jpg",
    "sep": ",",
    "temperature": 0.2,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

# eval_model runs generation and prints the answer to stdout
eval_model(args)
```

### Using Transformers

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load image
image = Image.open("photo.jpg")

# Create conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
```

## Ollama Integration (Recommended)

The easiest way to run LLaVA on CLORE.AI:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull LLaVA model
ollama pull llava:7b

# Run with image (CLI)
ollama run llava:7b "Describe this image: /path/to/image.jpg"
```

### LLaVA API via Ollama

{% hint style="warning" %}
**Important:** The examples below send images through the `/api/generate` endpoint's `images` parameter, which works across Ollama versions. Older Ollama builds did not accept images on `/api/chat` or the OpenAI-compatible endpoint; if those return empty responses for vision queries, fall back to `/api/generate`.
{% endhint %}

#### Working Method: /api/generate

```bash
# Encode image to base64 first
BASE64_IMAGE=$(base64 -i photo.jpg | tr -d '\n')

# Send vision request
curl https://your-http-pub.clorecloud.net/api/generate -d "{
  \"model\": \"llava:7b\",
  \"prompt\": \"What do you see in this image? Describe in detail.\",
  \"images\": [\"$BASE64_IMAGE\"],
  \"stream\": false
}"
```

Response:

```json
{
  "model": "llava:7b",
  "response": "The image shows a beautiful sunset over mountains...",
  "done": true
}
```

#### Caveat: /api/chat on older Ollama versions

```bash
# On older Ollama builds this returned null for vision queries;
# recent versions also accept images inside chat messages:
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llava:7b",
  "messages": [{"role": "user", "content": "describe", "images": ["..."]}]
}'
# If this returns null, use /api/generate instead
```

### Python with Ollama

```python
import requests
import base64

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# /api/generate with the "images" field works for vision across Ollama versions
response = requests.post(
    "https://your-http-pub.clorecloud.net/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "What do you see in this image?",
        "images": [encode_image("photo.jpg")],
        "stream": False
    }
)

print(response.json()["response"])
```
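
For long answers you can set `"stream": true`; Ollama then returns one JSON object per line, each carrying a `response` fragment, until a final object with `"done": true`. A minimal sketch of the accumulation logic (the network request itself is shown commented out, since it needs a live server; `collect_stream` is a name invented here):

```python
import json

def collect_stream(lines):
    """Join 'response' fragments from Ollama's NDJSON stream until done."""
    parts = []
    for raw in lines:
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With requests against a live server:
# with requests.post(f"{url}/api/generate",
#                    json={"model": "llava:7b", "prompt": question,
#                          "images": [image_b64], "stream": True},
#                    stream=True) as r:
#     text = collect_stream(line for line in r.iter_lines() if line)
```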

### Complete Working Example

```python
import requests
import base64
import sys

def analyze_image(ollama_url, image_path, question):
    """Analyze an image using LLaVA via Ollama"""

    # Encode image
    with open(image_path, "rb") as f:
        image_base64 = base64.b64encode(f.read()).decode()

    # /api/generate is the most reliable vision endpoint across Ollama versions
    response = requests.post(
        f"{ollama_url}/api/generate",
        json={
            "model": "llava:7b",
            "prompt": question,
            "images": [image_base64],
            "stream": False
        }
    )

    return response.json()["response"]

# Usage
url = "https://your-http-pub.clorecloud.net"
result = analyze_image(url, "photo.jpg", "Describe this image in detail")
print(result)
```

## Use Cases

### Image Description

```python
prompt = "Describe this image in detail, including colors, objects, and atmosphere."
```

### OCR / Text Extraction

```python
prompt = "Extract all text visible in this image. Format it clearly."
```

### Chart Analysis

```python
prompt = "Analyze this chart. What are the key trends and insights?"
```

### Code from Screenshot

```python
prompt = "Extract the code shown in this screenshot. Provide only the code."
```

### Object Detection

```python
prompt = "List all objects visible in this image with their approximate locations."
```
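
These prompts can be kept in one lookup so scripts dispatch by task name. A minimal sketch: `USE_CASE_PROMPTS` and `prompt_for` are names invented here, and the commented call assumes the `analyze_image` helper from the Complete Working Example above.

```python
# Canned prompts collected from the use cases above
USE_CASE_PROMPTS = {
    "describe": "Describe this image in detail, including colors, objects, and atmosphere.",
    "ocr": "Extract all text visible in this image. Format it clearly.",
    "chart": "Analyze this chart. What are the key trends and insights?",
    "code": "Extract the code shown in this screenshot. Provide only the code.",
    "objects": "List all objects visible in this image with their approximate locations.",
}

def prompt_for(task: str) -> str:
    """Look up the canned prompt for a use case."""
    return USE_CASE_PROMPTS[task]

# e.g. analyze_image(url, "invoice.png", prompt_for("ocr"))
```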

## Gradio Interface

```python
import gradio as gr
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

def analyze_image(image, question):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    response = processor.decode(output[0], skip_special_tokens=True)

    # Extract assistant response
    return response.split("[/INST]")[-1].strip()

demo = gr.Interface(
    fn=analyze_image,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(label="Question", value="Describe this image in detail")
    ],
    outputs=gr.Textbox(label="Response"),
    title="LLaVA Vision Assistant"
)

demo.launch(server_name="0.0.0.0", server_port=8000)
```

## API Server

```python
from fastapi import FastAPI, UploadFile, File, Form
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import io

app = FastAPI()

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

@app.post("/analyze")
async def analyze(
    image: UploadFile = File(...),
    question: str = Form(default="Describe this image")
):
    img = Image.open(io.BytesIO(await image.read()))

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=img, text=prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=500)
    response = processor.decode(output[0], skip_special_tokens=True)

    return {"response": response.split("[/INST]")[-1].strip()}

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```

## Batch Processing

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import os

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

def analyze_image(image_path, question):
    image = Image.open(image_path)

    conversation = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ]}
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=300)
    return processor.decode(output[0], skip_special_tokens=True).split("[/INST]")[-1].strip()

# Process folder of images
image_folder = "./images"
results = []

for filename in os.listdir(image_folder):
    if filename.endswith(('.jpg', '.png', '.jpeg')):
        path = os.path.join(image_folder, filename)
        description = analyze_image(path, "Describe this image briefly")
        results.append({"file": filename, "description": description})
        print(f"{filename}: {description[:100]}...")

# Save results
import json
with open("descriptions.json", "w") as f:
    json.dump(results, f, indent=2)
```

## Memory Optimization

### 4-bit Quantization

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)
```

### CPU Offload

```python
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload"
)
```

## Performance

| Model         | GPU      | Tokens/sec |
| ------------- | -------- | ---------- |
| LLaVA-1.5-7B  | RTX 3090 | \~30       |
| LLaVA-1.5-7B  | RTX 4090 | \~45       |
| LLaVA-1.6-7B  | RTX 4090 | \~40       |
| LLaVA-1.5-13B | A100     | \~35       |
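
These throughput figures translate directly into wall-clock time per answer (tokens ÷ tokens-per-second), ignoring the one-off prompt and image encoding. A quick sketch using the sample rates above:

```python
def gen_time_s(tokens: int, tok_per_s: float) -> float:
    """Seconds to generate a response, excluding prompt/image encoding."""
    return tokens / tok_per_s

# A 512-token answer at the RTX 4090 rate from the table (~45 tok/s):
print(round(gen_time_s(512, 45), 1))  # ~11.4 seconds
```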

## Troubleshooting

### Out of Memory

```python
# Options when you hit CUDA out-of-memory:
# 1. Load with 4-bit quantization (see Memory Optimization above)
# 2. Switch to a smaller model (7B instead of 13B)
# 3. Downscale images before inference:
image = image.resize((336, 336))
```

### Slow Generation

* Use flash attention
* Reduce max\_new\_tokens
* Use quantized model

### Poor Quality

* Use larger model
* Better prompts with context
* Higher resolution images

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
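
The session figures in the table are simply hourly rate × hours. A tiny calculator using the sample 2024 rates above (illustrative values, not quotes):

```python
def session_cost(hourly_rate_usd: float, hours: float) -> float:
    """Estimated cost of a rental session at a fixed hourly rate."""
    return round(hourly_rate_usd * hours, 2)

print(session_cost(0.10, 4))   # RTX 4090, 4-hour session -> 0.4
print(session_cost(0.25, 24))  # A100 80GB for a full day -> 6.0
```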

## Next Steps

* Ollama LLMs - Run LLaVA with Ollama
* RAG + LangChain - Vision + RAG
* vLLM Inference - Production serving
