# Florence-2

Microsoft's powerful vision model for captioning, detection, segmentation, and more.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Florence-2?

Florence-2 by Microsoft is a vision foundation model that handles:

* Image captioning (brief and detailed)
* Object detection and localization
* Dense region captioning
* Referring expression comprehension
* OCR and text recognition
* Visual question answering

## Resources

* **HuggingFace:** [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)
* **Paper:** [Florence-2 Paper](https://arxiv.org/abs/2311.06242)
* **GitHub:** [microsoft/Florence-2](https://github.com/microsoft/Florence-2)
* **Demo:** [HuggingFace Space](https://huggingface.co/spaces/microsoft/Florence-2)

## Recommended Hardware

| Component | Minimum       | Recommended   | Optimal       |
| --------- | ------------- | ------------- | ------------- |
| GPU       | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM      | 8GB           | 12GB          | 16GB          |
| CPU       | 4 cores       | 8 cores       | 16 cores      |
| RAM       | 16GB          | 32GB          | 64GB          |
| Storage   | 30GB SSD      | 50GB NVMe     | 100GB NVMe    |
| Internet  | 100 Mbps      | 500 Mbps      | 1 Gbps        |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install transformers accelerate einops timm gradio && \
python -c "
import gradio as gr
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
from PIL import Image

model = AutoModelForCausalLM.from_pretrained('microsoft/Florence-2-large', torch_dtype=torch.float16, trust_remote_code=True).to('cuda')
processor = AutoProcessor.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)

def process(image, task):
    inputs = processor(text=task, images=image, return_tensors='pt').to('cuda', torch.float16)
    generated_ids = model.generate(input_ids=inputs['input_ids'], pixel_values=inputs['pixel_values'], max_new_tokens=1024)
    result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(result, task=task, image_size=image.size)

gr.Interface(fn=process, inputs=[gr.Image(type='pil'), gr.Dropdown(['<CAPTION>', '<DETAILED_CAPTION>', '<OD>'])], outputs='json').launch(server_name='0.0.0.0')
"
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash
pip install transformers accelerate einops timm
pip install flash-attn --no-build-isolation  # Optional, for faster inference
```

## What You Can Create

### Content Analysis

* Auto-generate image descriptions
* Extract text from images (OCR)
* Analyze visual content at scale

### Data Annotation

* Auto-label datasets with captions
* Generate bounding boxes for objects
* Create dense annotations

### Accessibility

* Generate alt-text for images
* Describe images for visually impaired users
* Create audio descriptions

### Search & Discovery

* Index images by content
* Build visual search systems
* Content moderation

### Document Processing

* Extract text from documents
* Understand charts and diagrams
* Process scanned materials

## Basic Usage

### Image Captioning

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large",
    trust_remote_code=True
)

image = Image.open("photo.jpg")

# Brief caption
task = "<CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(result, task=task, image_size=image.size)
print(caption)

# Output: {'<CAPTION>': 'A dog playing in the park'}

# Detailed caption
task = "<DETAILED_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
detailed = processor.post_process_generation(result, task=task, image_size=image.size)
print(detailed)
```

### Object Detection

```python
task = "<OD>"  # Object Detection
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
detections = processor.post_process_generation(result, task=task, image_size=image.size)

# Output: {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['dog', 'ball', ...]}}
```
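The parsed result stores boxes and labels as parallel lists. A small helper (hypothetical, not part of the Florence-2 API) can pair them into records and drop tiny boxes; the sample dict below uses made-up values in the same shape as the output above:

```python
def detections_to_records(parsed, task="<OD>", min_area=0.0):
    """Pair the parallel 'bboxes'/'labels' lists into records, dropping boxes below min_area."""
    data = parsed.get(task, {})
    records = []
    for box, label in zip(data.get("bboxes", []), data.get("labels", [])):
        x1, y1, x2, y2 = box
        area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if area >= min_area:
            records.append({"label": label, "box": box, "area": area})
    return records

# Illustrative parsed output (values are made up)
sample = {"<OD>": {"bboxes": [[10, 20, 110, 220], [0, 0, 5, 5]],
                   "labels": ["dog", "ball"]}}
print(detections_to_records(sample, min_area=100))
```

Filtering by area is useful when downstream steps (cropping, counting) should ignore near-degenerate boxes.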

### OCR (Text Recognition)

```python
task = "<OCR>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
text = processor.post_process_generation(result, task=task, image_size=image.size)
print(text)

# Output: {'<OCR>': 'Text found in the image...'}
```

### Dense Region Captioning

```python
task = "<DENSE_REGION_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
regions = processor.post_process_generation(result, task=task, image_size=image.size)

# Output: {'<DENSE_REGION_CAPTION>': {'bboxes': [...], 'labels': ['a brown dog running', 'green grass', ...]}}
```

### Referring Expression Comprehension

Find objects based on text descriptions:

```python
task = "<CAPTION_TO_PHRASE_GROUNDING>"
text_input = "the red car on the left"

inputs = processor(
    text=task + text_input,
    images=image,
    return_tensors="pt"
).to("cuda", torch.float16)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
grounding = processor.post_process_generation(result, task=task, image_size=image.size)

# Returns bounding box of "the red car on the left"
```
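A grounded box is often the input to a second pass (e.g. OCR on just that region). Model outputs can extend slightly past the frame, so it helps to clamp the box to the image bounds before cropping. A minimal sketch with a hypothetical helper and made-up coordinates:

```python
def clamp_box(box, image_size, pad=0):
    """Clamp an [x1, y1, x2, y2] box to the image bounds, optionally padding it first."""
    x1, y1, x2, y2 = box
    w, h = image_size
    return (max(0, int(x1) - pad), max(0, int(y1) - pad),
            min(w, int(x2) + pad), min(h, int(y2) + pad))

# Illustrative box from a grounding result (values are made up)
box = clamp_box([100, 50, 700, 250], (640, 480), pad=10)
print(box)  # (90, 40, 640, 260)
```

The clamped tuple can be passed straight to `image.crop(box)` for the follow-up task.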

## All Available Tasks

```python
TASKS = [
    "<CAPTION>",                    # Brief caption
    "<DETAILED_CAPTION>",           # Detailed description
    "<MORE_DETAILED_CAPTION>",      # Very detailed description
    "<OD>",                          # Object detection
    "<DENSE_REGION_CAPTION>",       # Region descriptions
    "<REGION_PROPOSAL>",            # Propose regions of interest
    "<CAPTION_TO_PHRASE_GROUNDING>", # Find objects from text
    "<REFERRING_EXPRESSION_SEGMENTATION>", # Segment from text
    "<REGION_TO_SEGMENTATION>",     # Segment specified region
    "<OPEN_VOCABULARY_DETECTION>",  # Detect with text labels
    "<REGION_TO_CATEGORY>",         # Classify region
    "<REGION_TO_DESCRIPTION>",      # Describe region
    "<OCR>",                         # Extract text
    "<OCR_WITH_REGION>",            # Extract text with locations
]
```
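Some of these tasks expect a free-text argument appended after the task token (as in the phrase-grounding example earlier), while plain tasks take the token alone. A small validation helper makes that explicit; the set below is based on the model card and is a starting point, not an exhaustive list:

```python
# Tasks that expect a text prompt appended after the task token
TEXT_INPUT_TASKS = {
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<OPEN_VOCABULARY_DETECTION>",
}

def build_prompt(task, text=None):
    """Build the string passed to the processor, validating the text argument."""
    if task in TEXT_INPUT_TASKS:
        if not text:
            raise ValueError(f"{task} requires a text input")
        return task + text
    if text:
        raise ValueError(f"{task} does not take a text input")
    return task

print(build_prompt("<CAPTION>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "the red car"))
```

Catching a missing text input before calling `generate` gives a clearer error than a confusing model output.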

## Batch Processing

```python
import os
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch
import json

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

def process_image(image_path, task):
    image = Image.open(image_path).convert("RGB")  # normalize mode (e.g. RGBA PNGs)
    inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024
    )
    result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(result, task=task, image_size=image.size)

# Process directory
input_dir = "./images"
results = {}

for filename in os.listdir(input_dir):
    if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
        continue

    path = os.path.join(input_dir, filename)
    results[filename] = {
        "caption": process_image(path, "<CAPTION>"),
        "objects": process_image(path, "<OD>"),
        "text": process_image(path, "<OCR>")
    }
    print(f"Processed: {filename}")

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```
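For large directories, it helps to make the run resumable: load any previously saved `results.json` at startup and skip files already in it. A minimal sketch (the helper name is illustrative, not part of any library):

```python
import json
import os

def load_existing(results_path="results.json"):
    """Return previously saved results so an interrupted run can resume."""
    if os.path.exists(results_path):
        with open(results_path) as f:
            return json.load(f)
    return {}

results = load_existing()
print(f"Resuming with {len(results)} already-processed images")
# Inside the processing loop, skip files seen before:
# if filename in results: continue
```

Writing `results.json` periodically (not only at the end) makes the resume point more granular.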

## Gradio Interface

```python
import gradio as gr
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image, ImageDraw
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

def run_task(image, task):
    inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024
    )
    result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(result, task=task, image_size=image.size)

    # Draw boxes if detection task
    output_image = image.copy()
    if task in ["<OD>", "<DENSE_REGION_CAPTION>"]:
        draw = ImageDraw.Draw(output_image)
        if "bboxes" in parsed.get(task, {}):
            for box, label in zip(parsed[task]["bboxes"], parsed[task]["labels"]):
                draw.rectangle(box, outline="red", width=2)
                draw.text((box[0], box[1]-15), label, fill="red")

    return output_image, str(parsed)

demo = gr.Interface(
    fn=run_task,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Dropdown(
            choices=["<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<DENSE_REGION_CAPTION>", "<OCR>"],
            value="<CAPTION>",
            label="Task"
        )
    ],
    outputs=[
        gr.Image(label="Result"),
        gr.Textbox(label="Output", lines=10)
    ],
    title="Florence-2 Vision AI",
    description="Multi-task vision model. Running on CLORE.AI GPU servers."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Task             | Resolution | GPU      | Latency |
| ---------------- | ---------- | -------- | ------- |
| Caption          | 768x768    | RTX 3090 | 200ms |
| Caption          | 768x768    | RTX 4090 | 120ms |
| Object Detection | 768x768    | RTX 4090 | 150ms |
| OCR              | 768x768    | RTX 4090 | 180ms |
| Dense Caption    | 768x768    | A100     | 100ms |
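These per-image latencies translate directly into single-stream throughput. A quick back-of-envelope calculation (assuming one image at a time with no batching or I/O overhead):

```python
def images_per_hour(latency_ms):
    """Single-stream throughput implied by a per-image latency (no batching assumed)."""
    return int(3600_000 / latency_ms)

# Using the caption latencies from the table above
print(images_per_hour(120))  # RTX 4090: 30000 images/hour
print(images_per_hour(200))  # RTX 3090: 18000 images/hour
```

Batching multiple images per `generate` call typically pushes effective throughput higher than this single-stream figure.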

## Model Variants

| Model               | Parameters | VRAM | Speed  |
| ------------------- | ---------- | ---- | ------ |
| Florence-2-base     | 232M       | 4GB  | Fast   |
| Florence-2-large    | 771M       | 8GB  | Medium |
| Florence-2-base-ft  | 232M       | 4GB  | Fast   |
| Florence-2-large-ft | 771M       | 8GB  | Medium |

## Common Problems & Solutions

### Out of Memory

**Problem:** CUDA OOM error

**Solutions:**

```python
# Use base model instead of large
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

# Or enable CPU offload
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)
```
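To sanity-check whether a model fits before renting, a weights-only estimate is parameter count times bytes per parameter (2 for fp16). Activations and generation buffers come on top, so treat this as a floor, not a budget:

```python
def weight_vram_gb(params_millions, bytes_per_param=2):
    """Rough weights-only VRAM estimate (fp16 = 2 bytes/param); activations come on top."""
    return params_millions * 1e6 * bytes_per_param / 1024**3

print(f"{weight_vram_gb(771):.2f} GB")  # Florence-2-large weights in fp16
print(f"{weight_vram_gb(232):.2f} GB")  # Florence-2-base weights in fp16
```

The gap between these figures and the ~4GB/~8GB guidance in the variants table is the working memory the model needs at inference time.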

### Slow Inference

**Problem:** Processing takes too long

**Solutions:**

* Use Florence-2-base for faster inference
* Install flash-attention for speedup
* Batch multiple images together
* Use A100 GPU for production

```bash
pip install flash-attn --no-build-isolation
```

### Poor OCR Results

**Problem:** Text recognition is inaccurate

**Solutions:**

* Ensure image is high resolution (at least 768px)
* Use `<OCR_WITH_REGION>` for better localization
* Pre-process: enhance contrast, deskew image
* Crop to text regions before OCR
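The pre-processing steps above can be sketched with Pillow: upscale so the shorter side meets the resolution floor, convert to grayscale, and boost contrast. The helper name and the 768px default are illustrative choices, not Florence-2 requirements:

```python
from PIL import Image, ImageOps

def prep_for_ocr(image, min_side=768):
    """Upscale so the shorter side is at least min_side, then boost contrast."""
    image = image.convert("L")  # grayscale often helps printed text
    w, h = image.size
    scale = max(1.0, min_side / min(w, h))
    if scale > 1.0:
        image = image.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    return ImageOps.autocontrast(image)

# Illustrative: a small blank image gets upscaled to meet the 768px floor
img = prep_for_ocr(Image.new("RGB", (400, 300)))
print(img.size)  # (1024, 768)
```

Deskewing is harder to do well in pure Pillow; for heavily rotated scans, a dedicated tool is usually worth the extra dependency.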

### Detection Missing Objects

**Problem:** Objects not detected

**Solutions:**

* Use `<DENSE_REGION_CAPTION>` for more regions
* Try `<OPEN_VOCABULARY_DETECTION>` with specific labels
* Combine with GroundingDINO for specific objects

## Troubleshooting

### Task not working

* Check exact task name syntax
* Some tasks need specific input format
* Verify model version matches task

### Output format unexpected

* Different tasks return different formats
* Parse output according to task type
* Check documentation for task outputs

### CUDA memory issues

* Florence-2-large needs \~8GB VRAM
* Use Florence-2-base for less memory
* Gradient checkpointing helps only during fine-tuning; for inference, lower `max_new_tokens` instead

### Slow processing

* Use batch inference when possible
* Enable FP16 mode
* Consider TensorRT optimization

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
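Putting the table and the spot discount together, estimating a session cost is simple arithmetic. A sketch using the approximate RTX 4090 rate from the table above (actual rates vary by provider):

```python
def session_cost(hourly_rate, hours, spot_discount=0.0):
    """Estimated rental cost; spot_discount is a fraction, e.g. 0.3 for 30% off."""
    return hourly_rate * hours * (1 - spot_discount)

# RTX 4090 at ~$0.10/hr for a 4-hour session, on-demand vs. ~30% spot discount
print(f"${session_cost(0.10, 4):.2f}")
print(f"${session_cost(0.10, 4, 0.3):.2f}")
```

For batch workloads, combining this with the throughput estimate above gives a rough cost per thousand images.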

## Next Steps

* [LLaVA](https://docs.clore.ai/guides/vision-models/llava-vision-language) - Vision chat and QA
* [GroundingDINO](https://docs.clore.ai/guides/vision-models/groundingdino) - Zero-shot detection
* [SAM2](https://docs.clore.ai/guides/vision-models/sam2-video) - Segment detected objects
