# Florence-2

Microsoft's powerful vision model for captioning, detection, segmentation, and more.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Florence-2?

Florence-2 by Microsoft is a vision foundation model that handles:

* Image captioning (brief and detailed)
* Object detection and localization
* Dense region captioning
* Referring expression comprehension
* OCR and text recognition
* Visual question answering

## Resources

* **HuggingFace:** [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)
* **Paper:** [Florence-2 Paper](https://arxiv.org/abs/2311.06242)
* **GitHub:** [microsoft/Florence-2](https://github.com/microsoft/Florence-2)
* **Demo:** [HuggingFace Space](https://huggingface.co/spaces/microsoft/Florence-2)

## Recommended Hardware

| Component | Minimum       | Recommended   | Optimal       |
| --------- | ------------- | ------------- | ------------- |
| GPU       | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM      | 8GB           | 12GB          | 16GB          |
| CPU       | 4 cores       | 8 cores       | 16 cores      |
| RAM       | 16GB          | 32GB          | 64GB          |
| Storage   | 30GB SSD      | 50GB NVMe     | 100GB NVMe    |
| Internet  | 100 Mbps      | 500 Mbps      | 1 Gbps        |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install transformers accelerate einops timm gradio && \
python -c "
import gradio as gr
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
from PIL import Image

model = AutoModelForCausalLM.from_pretrained('microsoft/Florence-2-large', torch_dtype=torch.float16, trust_remote_code=True).to('cuda')
processor = AutoProcessor.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)

def process(image, task):
    inputs = processor(text=task, images=image, return_tensors='pt').to('cuda', torch.float16)
    generated_ids = model.generate(input_ids=inputs['input_ids'], pixel_values=inputs['pixel_values'], max_new_tokens=1024)
    result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(result, task=task, image_size=image.size)

gr.Interface(fn=process, inputs=[gr.Image(type='pil'), gr.Dropdown(['<CAPTION>', '<DETAILED_CAPTION>', '<OD>'])], outputs='json').launch(server_name='0.0.0.0')
"
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash
pip install transformers accelerate einops timm
pip install flash-attn --no-build-isolation  # Optional, for faster inference
```

## What You Can Create

### Content Analysis

* Auto-generate image descriptions
* Extract text from images (OCR)
* Analyze visual content at scale

### Data Annotation

* Auto-label datasets with captions
* Generate bounding boxes for objects
* Create dense annotations

### Accessibility

* Generate alt-text for images
* Describe images for visually impaired users
* Create audio descriptions

### Search & Discovery

* Index images by content
* Build visual search systems
* Content moderation

### Document Processing

* Extract text from documents
* Understand charts and diagrams
* Process scanned materials

## Basic Usage

### Image Captioning

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large",
    trust_remote_code=True
)

image = Image.open("photo.jpg")

# Brief caption
task = "<CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(result, task=task, image_size=image.size)
print(caption)

# Output: {'<CAPTION>': 'A dog playing in the park'}

# Detailed caption
task = "<DETAILED_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
detailed = processor.post_process_generation(result, task=task, image_size=image.size)
print(detailed)
```

### Object Detection

```python
task = "<OD>"  # Object Detection
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
detections = processor.post_process_generation(result, task=task, image_size=image.size)

# Output: {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['dog', 'ball', ...]}}
```
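The parsed result stores boxes and labels as parallel lists. A small helper (hypothetical, not part of the Florence-2 API) can pair them into records and drop tiny boxes; the sample dict below uses made-up values in the same shape as the output above:

```python
def detections_to_records(parsed, task="<OD>", min_area=0.0):
    """Pair the parallel 'bboxes'/'labels' lists into records, dropping boxes below min_area."""
    data = parsed.get(task, {})
    records = []
    for box, label in zip(data.get("bboxes", []), data.get("labels", [])):
        x1, y1, x2, y2 = box
        area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if area >= min_area:
            records.append({"label": label, "box": box, "area": area})
    return records

# Illustrative parsed output (values are made up)
sample = {"<OD>": {"bboxes": [[10, 20, 110, 220], [0, 0, 5, 5]],
                   "labels": ["dog", "ball"]}}
print(detections_to_records(sample, min_area=100))
```

Filtering by area is useful when downstream steps (cropping, counting) should ignore near-degenerate boxes.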

### OCR (Text Recognition)

```python
task = "<OCR>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
text = processor.post_process_generation(result, task=task, image_size=image.size)
print(text)

# Output: {'<OCR>': 'Text found in the image...'}
```

### Dense Region Captioning

```python
task = "<DENSE_REGION_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
regions = processor.post_process_generation(result, task=task, image_size=image.size)

# Output: {'<DENSE_REGION_CAPTION>': {'bboxes': [...], 'labels': ['a brown dog running', 'green grass', ...]}}
```

### Referring Expression Comprehension

Find objects based on text descriptions:

```python
task = "<CAPTION_TO_PHRASE_GROUNDING>"
text_input = "the red car on the left"

inputs = processor(
    text=task + text_input,
    images=image,
    return_tensors="pt"
).to("cuda", torch.float16)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024
)
result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
grounding = processor.post_process_generation(result, task=task, image_size=image.size)

# Returns bounding box of "the red car on the left"
```
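A grounded box is often the input to a second pass (e.g. OCR on just that region). Model outputs can extend slightly past the frame, so it helps to clamp the box to the image bounds before cropping. A minimal sketch with a hypothetical helper and made-up coordinates:

```python
def clamp_box(box, image_size, pad=0):
    """Clamp an [x1, y1, x2, y2] box to the image bounds, optionally padding it first."""
    x1, y1, x2, y2 = box
    w, h = image_size
    return (max(0, int(x1) - pad), max(0, int(y1) - pad),
            min(w, int(x2) + pad), min(h, int(y2) + pad))

# Illustrative box from a grounding result (values are made up)
box = clamp_box([100, 50, 700, 250], (640, 480), pad=10)
print(box)  # (90, 40, 640, 260)
```

The clamped tuple can be passed straight to `image.crop(box)` for the follow-up task.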

## All Available Tasks

```python
TASKS = [
    "<CAPTION>",                    # Brief caption
    "<DETAILED_CAPTION>",           # Detailed description
    "<MORE_DETAILED_CAPTION>",      # Very detailed description
    "<OD>",                          # Object detection
    "<DENSE_REGION_CAPTION>",       # Region descriptions
    "<REGION_PROPOSAL>",            # Propose regions of interest
    "<CAPTION_TO_PHRASE_GROUNDING>", # Find objects from text
    "<REFERRING_EXPRESSION_SEGMENTATION>", # Segment from text
    "<REGION_TO_SEGMENTATION>",     # Segment specified region
    "<OPEN_VOCABULARY_DETECTION>",  # Detect with text labels
    "<REGION_TO_CATEGORY>",         # Classify region
    "<REGION_TO_DESCRIPTION>",      # Describe region
    "<OCR>",                         # Extract text
    "<OCR_WITH_REGION>",            # Extract text with locations
]
```
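Some of these tasks expect a free-text argument appended after the task token (as in the phrase-grounding example earlier), while plain tasks take the token alone. A small validation helper makes that explicit; the set below is based on the model card and is a starting point, not an exhaustive list:

```python
# Tasks that expect a text prompt appended after the task token
TEXT_INPUT_TASKS = {
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<OPEN_VOCABULARY_DETECTION>",
}

def build_prompt(task, text=None):
    """Build the string passed to the processor, validating the text argument."""
    if task in TEXT_INPUT_TASKS:
        if not text:
            raise ValueError(f"{task} requires a text input")
        return task + text
    if text:
        raise ValueError(f"{task} does not take a text input")
    return task

print(build_prompt("<CAPTION>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "the red car"))
```

Catching a missing text input before calling `generate` gives a clearer error than a confusing model output.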

## Batch Processing

```python
import os
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch
import json

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

def process_image(image_path, task):
    image = Image.open(image_path).convert("RGB")  # normalize mode (e.g. RGBA PNGs)
    inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024
    )
    result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(result, task=task, image_size=image.size)

# Process directory
input_dir = "./images"
results = {}

for filename in os.listdir(input_dir):
    if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
        continue

    path = os.path.join(input_dir, filename)
    results[filename] = {
        "caption": process_image(path, "<CAPTION>"),
        "objects": process_image(path, "<OD>"),
        "text": process_image(path, "<OCR>")
    }
    print(f"Processed: {filename}")

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```
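For large directories, it helps to make the run resumable: load any previously saved `results.json` at startup and skip files already in it. A minimal sketch (the helper name is illustrative, not part of any library):

```python
import json
import os

def load_existing(results_path="results.json"):
    """Return previously saved results so an interrupted run can resume."""
    if os.path.exists(results_path):
        with open(results_path) as f:
            return json.load(f)
    return {}

results = load_existing()
print(f"Resuming with {len(results)} already-processed images")
# Inside the processing loop, skip files seen before:
# if filename in results: continue
```

Writing `results.json` periodically (not only at the end) makes the resume point more granular.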

## Gradio Interface

```python
import gradio as gr
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image, ImageDraw
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

def run_task(image, task):
    inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024
    )
    result = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(result, task=task, image_size=image.size)

    # Draw boxes if detection task
    output_image = image.copy()
    if task in ["<OD>", "<DENSE_REGION_CAPTION>"]:
        draw = ImageDraw.Draw(output_image)
        if "bboxes" in parsed.get(task, {}):
            for box, label in zip(parsed[task]["bboxes"], parsed[task]["labels"]):
                draw.rectangle(box, outline="red", width=2)
                draw.text((box[0], box[1]-15), label, fill="red")

    return output_image, str(parsed)

demo = gr.Interface(
    fn=run_task,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Dropdown(
            choices=["<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<DENSE_REGION_CAPTION>", "<OCR>"],
            value="<CAPTION>",
            label="Task"
        )
    ],
    outputs=[
        gr.Image(label="Result"),
        gr.Textbox(label="Output", lines=10)
    ],
    title="Florence-2 Vision AI",
    description="Multi-task vision model. Running on CLORE.AI GPU servers."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Task             | Resolution | GPU      | Latency |
| ---------------- | ---------- | -------- | ------- |
| Caption          | 768x768    | RTX 3090 | 200ms |
| Caption          | 768x768    | RTX 4090 | 120ms |
| Object Detection | 768x768    | RTX 4090 | 150ms |
| OCR              | 768x768    | RTX 4090 | 180ms |
| Dense Caption    | 768x768    | A100     | 100ms |
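These per-image latencies translate directly into single-stream throughput. A quick back-of-envelope calculation (assuming one image at a time with no batching or I/O overhead):

```python
def images_per_hour(latency_ms):
    """Single-stream throughput implied by a per-image latency (no batching assumed)."""
    return int(3600_000 / latency_ms)

# Using the caption latencies from the table above
print(images_per_hour(120))  # RTX 4090: 30000 images/hour
print(images_per_hour(200))  # RTX 3090: 18000 images/hour
```

Batching multiple images per `generate` call typically pushes effective throughput higher than this single-stream figure.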

## Model Variants

| Model               | Parameters | VRAM | Speed  |
| ------------------- | ---------- | ---- | ------ |
| Florence-2-base     | 232M       | 4GB  | Fast   |
| Florence-2-large    | 771M       | 8GB  | Medium |
| Florence-2-base-ft  | 232M       | 4GB  | Fast   |
| Florence-2-large-ft | 771M       | 8GB  | Medium |

## Common Problems & Solutions

### Out of Memory

**Problem:** CUDA OOM error

**Solutions:**

```python
# Use base model instead of large
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

# Or enable CPU offload
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)
```
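To sanity-check whether a model fits before renting, a weights-only estimate is parameter count times bytes per parameter (2 for fp16). Activations and generation buffers come on top, so treat this as a floor, not a budget:

```python
def weight_vram_gb(params_millions, bytes_per_param=2):
    """Rough weights-only VRAM estimate (fp16 = 2 bytes/param); activations come on top."""
    return params_millions * 1e6 * bytes_per_param / 1024**3

print(f"{weight_vram_gb(771):.2f} GB")  # Florence-2-large weights in fp16
print(f"{weight_vram_gb(232):.2f} GB")  # Florence-2-base weights in fp16
```

The gap between these figures and the ~4GB/~8GB guidance in the variants table is the working memory the model needs at inference time.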

### Slow Inference

**Problem:** Processing takes too long

**Solutions:**

* Use Florence-2-base for faster inference
* Install flash-attention for speedup
* Batch multiple images together
* Use A100 GPU for production

```bash
pip install flash-attn --no-build-isolation
```

### Poor OCR Results

**Problem:** Text recognition is inaccurate

**Solutions:**

* Ensure image is high resolution (at least 768px)
* Use `<OCR_WITH_REGION>` for better localization
* Pre-process: enhance contrast, deskew image
* Crop to text regions before OCR
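The pre-processing steps above can be sketched with Pillow: upscale so the shorter side meets the resolution floor, convert to grayscale, and boost contrast. The helper name and the 768px default are illustrative choices, not Florence-2 requirements:

```python
from PIL import Image, ImageOps

def prep_for_ocr(image, min_side=768):
    """Upscale so the shorter side is at least min_side, then boost contrast."""
    image = image.convert("L")  # grayscale often helps printed text
    w, h = image.size
    scale = max(1.0, min_side / min(w, h))
    if scale > 1.0:
        image = image.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    return ImageOps.autocontrast(image)

# Illustrative: a small blank image gets upscaled to meet the 768px floor
img = prep_for_ocr(Image.new("RGB", (400, 300)))
print(img.size)  # (1024, 768)
```

Deskewing is harder to do well in pure Pillow; for heavily rotated scans, a dedicated tool is usually worth the extra dependency.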

### Detection Missing Objects

**Problem:** Objects not detected

**Solutions:**

* Use `<DENSE_REGION_CAPTION>` for more regions
* Try `<OPEN_VOCABULARY_DETECTION>` with specific labels
* Combine with GroundingDINO for specific objects

## Troubleshooting

### Task not working

* Check exact task name syntax
* Some tasks need specific input format
* Verify model version matches task

### Output format unexpected

* Different tasks return different formats
* Parse output according to task type
* Check documentation for task outputs

### CUDA memory issues

* Florence-2-large needs \~8GB VRAM
* Use Florence-2-base for less memory
* Gradient checkpointing helps only during fine-tuning; for inference, lower `max_new_tokens` instead

### Slow processing

* Use batch inference when possible
* Enable FP16 mode
* Consider TensorRT optimization

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
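Putting the table and the spot discount together, estimating a session cost is simple arithmetic. A sketch using the approximate RTX 4090 rate from the table above (actual rates vary by provider):

```python
def session_cost(hourly_rate, hours, spot_discount=0.0):
    """Estimated rental cost; spot_discount is a fraction, e.g. 0.3 for 30% off."""
    return hourly_rate * hours * (1 - spot_discount)

# RTX 4090 at ~$0.10/hr for a 4-hour session, on-demand vs. ~30% spot discount
print(f"${session_cost(0.10, 4):.2f}")
print(f"${session_cost(0.10, 4, 0.3):.2f}")
```

For batch workloads, combining this with the throughput estimate above gives a rough cost per thousand images.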

## Next Steps

* [LLaVA](https://docs.clore.ai/guides/vision-models/llava-vision-language) - Vision chat and QA
* [GroundingDINO](https://docs.clore.ai/guides/vision-models/groundingdino) - Zero-shot detection
* [SAM2](https://docs.clore.ai/guides/vision-models/sam2-video) - Segment detected objects
