Florence-2

Microsoft's powerful vision model for captioning, detection, segmentation, and more.


All examples in this guide can be run on GPU servers rented through the CLORE.AI Marketplace.

Renting on CLORE.AI

  1. Filter by GPU type, VRAM, and price

  2. Choose On-Demand (fixed rate) or Spot (bid price)

  3. Configure your order:

    • Select Docker image

    • Set ports (TCP for SSH, HTTP for web UIs)

    • Add environment variables if needed

    • Enter startup command

  4. Select payment: CLORE, BTC, or USDT/USDC

  5. Create order and wait for deployment

Access Your Server

  • Find connection details in My Orders

  • Web interfaces: Use the HTTP port URL

  • SSH: ssh -p <port> root@<proxy-address>

What is Florence-2?

Florence-2 by Microsoft is a vision foundation model that handles:

  • Image captioning (brief and detailed)

  • Object detection and localization

  • Dense region captioning

  • Referring expression comprehension

  • OCR and text recognition

  • Visual question answering

Resources

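| Component | Minimum       | Recommended   | Optimal       |
|-----------|---------------|---------------|---------------|
| GPU       | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM      | 8GB           | 12GB          | 16GB          |
| CPU       | 4 cores       | 8 cores       | 16 cores      |
| RAM       | 16GB          | 32GB          | 64GB          |
| Storage   | 30GB SSD      | 50GB NVMe     | 100GB NVMe    |
| Internet  | 100 Mbps      | 500 Mbps      | 1 Gbps        |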

Quick Deploy on CLORE.AI

Docker Image:

Ports:

Command:

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

  1. Go to My Orders page

  2. Click on your order

  3. Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in examples below.

Installation
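The original install commands did not survive on this page. As a stand-in, the sketch below checks from Python which packages are importable and prints a matching pip command. The package list (torch, transformers, pillow, einops, timm, plus optional flash-attn) is an assumption based on a typical Florence-2 setup, not something this guide specifies.

```python
import importlib.util

# Import name -> pip package name. This list is an assumption about a
# typical Florence-2 environment; adjust it to your Docker image.
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "PIL": "pillow",
    "einops": "einops",
    "timm": "timm",
}
OPTIONAL = {"flash_attn": "flash-attn"}  # speedup only, not required


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


def install_hint(missing):
    """Build the pip command that would install the missing packages."""
    if not missing:
        return "All packages present."
    return "pip install " + " ".join(missing)


print(install_hint(missing_packages()))
```

Run it once after SSH-ing into the server; if it prints a pip command, run that command in the same environment.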

What You Can Create

Content Analysis

  • Auto-generate image descriptions

  • Extract text from images (OCR)

  • Analyze visual content at scale

Data Annotation

  • Auto-label datasets with captions

  • Generate bounding boxes for objects

  • Create dense annotations

Accessibility

  • Generate alt-text for images

  • Describe images for visually impaired users

  • Create audio descriptions

Search & Discovery

  • Index images by content

  • Build visual search systems

  • Content moderation

Document Processing

  • Extract text from documents

  • Understand charts and diagrams

  • Process scanned materials

Basic Usage

Image Captioning
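A minimal captioning sketch, assuming the Hugging Face `transformers` loading path with `trust_remote_code=True` and the `microsoft/Florence-2-large` checkpoint. The helper names `load_florence`, `run_task`, and `caption` are illustrative, not part of any library.

```python
# Caption detail level -> Florence-2 task prompt.
CAPTION_TASKS = {
    "brief": "<CAPTION>",
    "detailed": "<DETAILED_CAPTION>",
    "more_detailed": "<MORE_DETAILED_CAPTION>",
}


def load_florence(model_id="microsoft/Florence-2-large"):
    """Load model and processor (FP16 on GPU, FP32 on CPU)."""
    import torch  # imported lazily so this file can be read without a GPU stack
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device).eval()
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor


def run_task(model, processor, image, task, text_input=""):
    """Run one task prompt on a PIL image and return the parsed result dict."""
    inputs = processor(text=task + text_input, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, model.dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )


def caption(model, processor, image, level="brief"):
    """Return a caption string at the requested detail level."""
    task = CAPTION_TASKS[level]
    return run_task(model, processor, image, task)[task]
```

Typical use: `model, processor = load_florence()`, then `caption(model, processor, Image.open("photo.jpg").convert("RGB"), "detailed")`, where `photo.jpg` is a placeholder path.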

Object Detection
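A sketch of detection with the `<OD>` task prompt, assuming `model` and `processor` were loaded through `transformers` with `trust_remote_code=True`. The result is expected to hold parallel `bboxes` and `labels` lists; `detect`, `to_records`, and `draw_boxes` are hypothetical helper names.

```python
def to_records(od_result):
    """Pair each bounding box [x1, y1, x2, y2] with its label."""
    return [
        {"label": label, "box": list(box)}
        for box, label in zip(od_result["bboxes"], od_result["labels"])
    ]


def detect(model, processor, image):
    """Run <OD> on a PIL image and return a list of label/box records."""
    task = "<OD>"
    inputs = processor(text=task, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, model.dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )
    return to_records(parsed[task])


def draw_boxes(image, records, color="red"):
    """Return a copy of the image with boxes and labels drawn on it."""
    from PIL import ImageDraw  # lazy import: only needed for visualization

    out = image.copy()
    draw = ImageDraw.Draw(out)
    for record in records:
        x1, y1, x2, y2 = record["box"]
        draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
        draw.text((x1 + 3, y1 + 3), record["label"], fill=color)
    return out
```

Box coordinates are in pixels of the original image, so the annotated copy can be saved directly with `draw_boxes(image, records).save(...)`.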

OCR (Text Recognition)
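Florence-2 offers two OCR prompts: `<OCR>` returns plain text, and `<OCR_WITH_REGION>` returns text with one quadrilateral per line. The sketch below post-processes the second form, assuming `region_result` is the value stored under `"<OCR_WITH_REGION>"` in the dict returned by the processor's `post_process_generation`, with `quad_boxes` and `labels` keys. The helper names are illustrative.

```python
def quad_to_rect(quad):
    """Convert [x1, y1, x2, y2, x3, y3, x4, y4] to an axis-aligned [x1, y1, x2, y2]."""
    xs, ys = quad[0::2], quad[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def ocr_lines(region_result):
    """Return recognized lines as {"text", "rect"} dicts in rough reading order."""
    lines = []
    for quad, label in zip(region_result["quad_boxes"], region_result["labels"]):
        # Defensive cleanup: a stray end-of-sequence token can appear in a label.
        text = label.replace("</s>", "").strip()
        if text:
            lines.append({"text": text, "rect": quad_to_rect(quad)})
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    return sorted(lines, key=lambda line: (line["rect"][1], line["rect"][0]))


def joined_text(region_result):
    """Join all recognized lines into one newline-separated string."""
    return "\n".join(line["text"] for line in ocr_lines(region_result))
```

The sort is a simple heuristic; multi-column layouts would need column detection before ordering.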

Dense Region Captioning
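The `<DENSE_REGION_CAPTION>` prompt is expected to return the same shape as object detection, except that each label is a short caption for its region. A small sketch for ranking those regions by size, assuming `dense_result` is the value stored under `"<DENSE_REGION_CAPTION>"` in the parsed output; `regions_by_area` is a hypothetical helper.

```python
def regions_by_area(dense_result, min_area=0.0):
    """Return captioned regions sorted from largest to smallest box area."""
    regions = []
    for (x1, y1, x2, y2), label in zip(dense_result["bboxes"], dense_result["labels"]):
        # Clamp to zero so a degenerate box never yields a negative area.
        area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if area >= min_area:
            regions.append({"caption": label, "box": [x1, y1, x2, y2], "area": area})
    return sorted(regions, key=lambda region: region["area"], reverse=True)
```

Setting `min_area` filters out tiny regions, which helps when dense captions are used for dataset annotation.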

Referring Expression Comprehension

Find objects based on text descriptions:
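One way to do this is the `<CAPTION_TO_PHRASE_GROUNDING>` prompt, which takes a phrase after the task token and returns matching boxes. Using it here is an assumption about what the lost example showed; `<REFERRING_EXPRESSION_SEGMENTATION>` is the variant that returns polygons instead. `grounding_prompt` and `find_phrase` are illustrative names.

```python
GROUNDING_TASK = "<CAPTION_TO_PHRASE_GROUNDING>"


def grounding_prompt(phrase):
    """Build the full prompt: task token followed by the cleaned phrase."""
    phrase = " ".join(phrase.split())  # collapse repeated whitespace
    if not phrase:
        raise ValueError("phrase must not be empty")
    return GROUNDING_TASK + phrase


def find_phrase(model, processor, image, phrase):
    """Return (label, [x1, y1, x2, y2]) pairs for regions matching the phrase."""
    inputs = processor(
        text=grounding_prompt(phrase), images=image, return_tensors="pt"
    ).to(model.device, model.dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        text, task=GROUNDING_TASK, image_size=(image.width, image.height)
    )
    result = parsed[GROUNDING_TASK]
    return list(zip(result["labels"], result["bboxes"]))
```

For example, `find_phrase(model, processor, image, "the red car on the left")` should return the boxes the model associates with that description, given a loaded model and processor.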

All Available Tasks
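The original task list is missing from this page. The registry below is a sketch of the task prompts published with the Florence-2 checkpoints, with a flag for whether extra text must follow the prompt. `TASKS` and `build_prompt` are hypothetical names; check the model card for the checkpoint you use before relying on a specific prompt.

```python
# Task prompt -> True if the prompt must be followed by extra text
# (a phrase, label list, or <loc_...> region tokens).
TASKS = {
    "<CAPTION>": False,
    "<DETAILED_CAPTION>": False,
    "<MORE_DETAILED_CAPTION>": False,
    "<OD>": False,
    "<DENSE_REGION_CAPTION>": False,
    "<REGION_PROPOSAL>": False,
    "<OCR>": False,
    "<OCR_WITH_REGION>": False,
    "<CAPTION_TO_PHRASE_GROUNDING>": True,
    "<REFERRING_EXPRESSION_SEGMENTATION>": True,
    "<OPEN_VOCABULARY_DETECTION>": True,
    "<REGION_TO_SEGMENTATION>": True,
    "<REGION_TO_CATEGORY>": True,
    "<REGION_TO_DESCRIPTION>": True,
}


def build_prompt(task, text_input=""):
    """Validate a task name and return the full prompt string."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASKS)}")
    if TASKS[task] and not text_input:
        raise ValueError(f"{task} needs text input after the prompt")
    if not TASKS[task] and text_input:
        raise ValueError(f"{task} does not take text input")
    return task + text_input
```

Validating the prompt before calling the model catches the most common failure, a misspelled or malformed task token, without spending GPU time.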

Batch Processing
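A batching sketch, assuming the processor accepts a list of identical prompts alongside a list of PIL images. `chunked` and `run_batch` are illustrative names, and the batch size that fits depends on the GPU, so treat the default of 8 as a starting guess.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_batch(model, processor, images, task="<CAPTION>", max_new_tokens=256):
    """Run one task on a list of PIL images in a single generate call."""
    inputs = processor(
        text=[task] * len(images), images=images, return_tensors="pt"
    ).to(model.device, model.dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=3,
    )
    texts = processor.batch_decode(generated_ids, skip_special_tokens=False)
    results = []
    for text, image in zip(texts, images):
        # Defensive: shorter sequences in a batch are padded, so drop pad tokens.
        text = text.replace("<pad>", "")
        parsed = processor.post_process_generation(
            text, task=task, image_size=(image.width, image.height)
        )
        results.append(parsed[task])
    return results


def run_all(model, processor, images, task="<CAPTION>", batch_size=8):
    """Process any number of images in fixed-size batches."""
    results = []
    for batch in chunked(images, batch_size):
        results.extend(run_batch(model, processor, batch, task))
    return results
```

If a batch triggers an out-of-memory error, halve `batch_size` and retry.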

Gradio Interface
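A minimal Gradio sketch. It assumes a callable `run_fn(image, task)` that returns the parsed Florence-2 result for a PIL image; that callable, the task shortlist, and port 7860 (Gradio's default) are assumptions, since the original interface code and port setting are not shown on this page.

```python
TASK_CHOICES = ["<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<OCR>"]


def make_handler(run_fn):
    """Wrap run_fn so the UI gets a readable error instead of a traceback."""
    def handler(image, task):
        if image is None:
            return {"error": "Please upload an image."}
        if task not in TASK_CHOICES:
            return {"error": f"Unsupported task: {task}"}
        return run_fn(image, task)
    return handler


def build_demo(run_fn):
    """Create the Gradio interface; call .launch() on the returned object."""
    import gradio as gr  # lazy import: only needed when serving the UI

    return gr.Interface(
        fn=make_handler(run_fn),
        inputs=[
            gr.Image(type="pil", label="Image"),
            gr.Dropdown(choices=TASK_CHOICES, value="<CAPTION>", label="Task"),
        ],
        outputs=gr.JSON(label="Result"),
        title="Florence-2",
    )


# To serve on all interfaces so the CLORE.AI HTTP port can reach it:
# build_demo(run_fn).launch(server_name="0.0.0.0", server_port=7860)
```

Binding to `0.0.0.0` matters on a rented server: the default of `127.0.0.1` would only be reachable from inside the container.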

Performance

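| Task             | Resolution | GPU      | Speed |
|------------------|------------|----------|-------|
| Caption          | 768x768    | RTX 3090 | 200ms |
| Caption          | 768x768    | RTX 4090 | 120ms |
| Object Detection | 768x768    | RTX 4090 | 150ms |
| OCR              | 768x768    | RTX 4090 | 180ms |
| Dense Caption    | 768x768    | A100     | 100ms |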

Model Variants

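| Model               | Parameters | VRAM | Speed  |
|---------------------|------------|------|--------|
| Florence-2-base     | 232M       | 4GB  | Fast   |
| Florence-2-large    | 771M       | 8GB  | Medium |
| Florence-2-base-ft  | 232M       | 4GB  | Fast   |
| Florence-2-large-ft | 771M       | 8GB  | Medium |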

Common Problems & Solutions

Out of Memory

Problem: CUDA OOM error

Solutions:
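The original solutions are missing here. The sketch below shows common mitigations as an assumption about what they covered: pick the smaller checkpoint when VRAM is tight (thresholds taken from the Model Variants table), generate with fewer beams and tokens, and release cached GPU memory between jobs. All helper names are hypothetical.

```python
import gc

# Approximate VRAM needs from the Model Variants table, largest model first.
MODEL_VRAM_GB = {
    "microsoft/Florence-2-large": 8,
    "microsoft/Florence-2-base": 4,
}


def pick_model(free_vram_gb):
    """Return the largest checkpoint expected to fit in the given VRAM."""
    for model_id, needed_gb in MODEL_VRAM_GB.items():
        if free_vram_gb >= needed_gb:
            return model_id
    raise ValueError(f"{free_vram_gb} GB is below the ~4 GB needed for Florence-2-base")


def lighter_generation_kwargs():
    """Generation settings that use less memory than 3-beam, 1024-token decoding."""
    return {"max_new_tokens": 256, "num_beams": 1, "do_sample": False}


def free_gpu_memory():
    """Drop unreferenced tensors and return cached blocks to the driver."""
    import torch  # lazy import so the helpers above work without torch

    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Loading the model in FP16 (`torch_dtype=torch.float16`) and downscaling very large input images are two further options that usually help.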

Slow Inference

Problem: Processing takes too long

Solutions:

  • Use Florence-2-base for faster inference

  • Install flash-attention for speedup

  • Batch multiple images together

  • Use A100 GPU for production

Poor OCR Results

Problem: Text recognition is inaccurate

Solutions:

  • Ensure image is high resolution (at least 768px)

  • Use <OCR_WITH_REGION> for better localization

  • Pre-process: enhance contrast, deskew image

  • Crop to text regions before OCR
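The resolution and contrast steps above can be sketched with Pillow. This assumes "at least 768px" refers to the shorter side, which the guide does not state, and it leaves deskewing out because that needs an angle estimate from another tool. `target_size` and `prepare_for_ocr` are illustrative names.

```python
def target_size(width, height, min_side=768):
    """Return a size whose shorter side is at least min_side, keeping aspect ratio."""
    scale = max(1.0, min_side / min(width, height))
    return round(width * scale), round(height * scale)


def prepare_for_ocr(image, min_side=768):
    """Stretch contrast and upscale a PIL image before running OCR on it."""
    from PIL import Image, ImageOps  # lazy import: Pillow only needed here

    image = ImageOps.autocontrast(image.convert("RGB"))
    size = target_size(image.width, image.height, min_side)
    if size != image.size:
        image = image.resize(size, Image.LANCZOS)
    return image
```

Upscaling cannot restore detail that was never captured, so rescanning or re-photographing at higher resolution is preferable when possible.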

Detection Missing Objects

Problem: Objects not detected

Solutions:

  • Use <DENSE_REGION_CAPTION> for more regions

  • Try <OPEN_VOCABULARY_DETECTION> with specific labels

  • Combine with GroundingDINO for specific objects

Troubleshooting

Task not working

  • Check exact task name syntax

  • Some tasks need specific input format

  • Verify model version matches task

Output format unexpected

  • Different tasks return different formats

  • Parse output according to task type

  • Check documentation for task outputs

CUDA memory issues

  • Florence-2-large needs ~8GB VRAM

  • Use Florence-2-base for less memory

  • Enable gradient checkpointing when fine-tuning (it does not reduce inference memory)

Slow processing

  • Use batch inference when possible

  • Enable FP16 mode

  • Consider TensorRT optimization

Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

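| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
|-----------|-------------|------------|----------------|
| RTX 3060  | ~$0.03      | ~$0.70     | ~$0.12         |
| RTX 3090  | ~$0.06      | ~$1.50     | ~$0.25         |
| RTX 4090  | ~$0.10      | ~$2.30     | ~$0.40         |
| A100 40GB | ~$0.17      | ~$4.00     | ~$0.70         |
| A100 80GB | ~$0.25      | ~$6.00     | ~$1.00         |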

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.
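For budgeting, the approximate hourly rates above can be turned into a small estimator. This is a rough sketch: the rates are the guide's 2024 figures, and the per-image estimate assumes images are processed one at a time with no loading or idle overhead.

```python
# Approximate hourly rates in USD, copied from the table above.
HOURLY_USD = {
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}


def session_cost(gpu, hours):
    """Estimated cost in USD of renting `gpu` for `hours`, rounded to cents."""
    return round(HOURLY_USD[gpu] * hours, 2)


def cost_per_images(gpu, latency_ms, images=1000):
    """Estimated cost in USD of processing `images` sequentially at `latency_ms` each."""
    hours = images * latency_ms / 1000 / 3600
    return HOURLY_USD[gpu] * hours
```

For instance, `cost_per_images("RTX 4090", 120)` combines the 120ms caption latency from the Performance table with the RTX 4090 rate to estimate the cost of captioning 1,000 images.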

Save money:

  • Use Spot market for flexible workloads (often 30-50% cheaper)

  • Pay with CLORE tokens

  • Compare prices across different providers

Next Steps
