GroundingDINO

Detect any object using text descriptions with GroundingDINO.

All examples in this guide can be run on GPU servers rented through the CLORE.AI Marketplace.

Renting on CLORE.AI

Visit CLORE.AI Marketplace
Filter by GPU type, VRAM, and price
Choose On-Demand (fixed rate) or Spot (bid price)
Configure your order:
- Select Docker image
- Set ports (TCP for SSH, HTTP for web UIs)
- Add environment variables if needed
- Enter startup command
Select payment: CLORE, BTC, or USDT/USDC
Create order and wait for deployment

Access Your Server

Find connection details in My Orders
Web interfaces: Use the HTTP port URL
SSH: ssh -p <port> root@<proxy-address>

What is GroundingDINO?

GroundingDINO by IDEA-Research enables:

Zero-shot object detection with text prompts
Detect any object without training
High-accuracy bounding box localization
Combine with SAM for automatic segmentation

Resources

GitHub: IDEA-Research/GroundingDINO
Paper: GroundingDINO Paper
HuggingFace: IDEA-Research/grounding-dino
Demo: HuggingFace Space

Recommended Hardware

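| Component | Minimum | Recommended | Optimal |
|-----------|---------|-------------|---------|
| GPU | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM | 6GB | 12GB | 16GB |
| CPU | 4 cores | 8 cores | 16 cores |
| RAM | 16GB | 32GB | 64GB |
| Storage | 20GB SSD | 50GB NVMe | 100GB NVMe |
| Internet | 100 Mbps | 500 Mbps | 1 Gbps |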

Quick Deploy on CLORE.AI

Docker Image:

pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

Ports:

22/tcp
7860/http

Command:

cd /workspace && \
git clone https://github.com/IDEA-Research/GroundingDINO.git && \
cd GroundingDINO && \
pip install -e . && \
python demo/gradio_app.py

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

Go to My Orders page
Click on your order
Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in examples below.

Installation

git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -e .

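# Download weights into ./weights, then return to the repository root
# (the examples below use paths relative to the repository root)
mkdir -p weights
cd weights
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..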

What You Can Create

Automated Labeling

Auto-annotate datasets for ML training
Generate bounding boxes from descriptions
Speed up data labeling pipelines

Visual Search

Find specific objects in image databases
Content moderation systems
Product recognition in retail

Robotics & Automation

Object localization for robot arms
Inventory management systems
Quality control inspection

Creative Applications

Auto-crop subjects from photos
Generate object masks with SAM
Content-aware image editing

Analytics

Count objects in images
Track inventory from photos
Wildlife monitoring

Basic Usage

from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# Load model
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# Load image
image_source, image = load_image("input.jpg")

# Detect objects
TEXT_PROMPT = "cat . dog . person"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

# Annotate image
annotated_frame = annotate(
    image_source=image_source,
    boxes=boxes,
    logits=logits,
    phrases=phrases
)

cv2.imwrite("output.jpg", annotated_frame)

GroundingDINO + SAM (Grounded-SAM)

Combine detection with segmentation:

import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Load GroundingDINO
dino_model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# Load SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
sam_predictor = SamPredictor(sam)

# Load image
image_source, image = load_image("input.jpg")

# Detect with GroundingDINO
boxes, logits, phrases = predict(
    model=dino_model,
    image=image,
    caption="person . car",
    box_threshold=0.35,
    text_threshold=0.25
)

# Segment with SAM
sam_predictor.set_image(image_source)

# Convert boxes to SAM format: predict() returns normalized (cx, cy, w, h),
# while SAM expects absolute-pixel (x1, y1, x2, y2)
H, W = image_source.shape[:2]
boxes_xyxy = box_convert(
    boxes * torch.tensor([W, H, W, H]), in_fmt="cxcywh", out_fmt="xyxy"
)

masks = []
for box in boxes_xyxy:
    mask, _, _ = sam_predictor.predict(
        box=box.numpy(),
        multimask_output=False
    )
    masks.append(mask)
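
The conversion step matters because the library's own annotate helper treats the boxes returned by predict as normalized (cx, cy, w, h) values, while SAM's box prompt expects absolute (x1, y1, x2, y2) pixels. The arithmetic can be sketched in pure Python as follows; cxcywh_to_xyxy is a hypothetical helper name introduced here for illustration, not part of either library:

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2) pixels."""
    cx, cy, w, h = box
    # Scale the normalized values to pixel units
    cx, w = cx * image_width, w * image_width
    cy, h = cy * image_height, h * image_height
    # Move from center/size to corner coordinates
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A box centered in a 100x200 image, covering 25% of the width and 50% of the height
print(cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), image_width=100, image_height=200))
# (37.5, 50.0, 62.5, 150.0)
```

Skipping the center-to-corner step and only multiplying by the image size would hand SAM a box whose first two values are the center point, so the mask would land in the wrong place.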

Batch Processing

import os
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

input_dir = "./images"
output_dir = "./detected"
os.makedirs(output_dir, exist_ok=True)

TEXT_PROMPT = "product . price tag . barcode"

for filename in os.listdir(input_dir):
    if not filename.endswith(('.jpg', '.png')):
        continue

    image_path = os.path.join(input_dir, filename)
    image_source, image = load_image(image_path)

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=TEXT_PROMPT,
        box_threshold=0.3,
        text_threshold=0.25
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    cv2.imwrite(os.path.join(output_dir, filename), annotated)

    print(f"{filename}: Found {len(boxes)} objects")
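
For the object-counting and inventory use cases, the phrases list returned for each image can be tallied per label. A minimal sketch, assuming phrases are plain strings as in the loop above; count_phrases is a hypothetical helper, and the sample dictionary stands in for real predict output:

```python
from collections import Counter


def count_phrases(phrases_per_image):
    """Tally how often each detected phrase occurs across a batch of images."""
    totals = Counter()
    for phrases in phrases_per_image.values():
        totals.update(phrases)
    return totals


# Stand-in for the per-image `phrases` lists collected in a batch loop
sample = {
    "shelf_1.jpg": ["product", "product", "price tag"],
    "shelf_2.jpg": ["product", "barcode"],
}
totals = count_phrases(sample)
for phrase, count in totals.most_common():
    print(f"{phrase}: {count}")
```

In the batch loop, this would mean storing phrases under each filename instead of only printing the count.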

Custom Detection Pipeline

import json

import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_and_export(image_path, prompt, output_json):
    image_source, image = load_image(image_path)
    H, W = image_source.shape[:2]

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=prompt,
        box_threshold=0.35,
        text_threshold=0.25
    )

    # predict() returns normalized (cx, cy, w, h) boxes;
    # convert them to absolute-pixel (x1, y1, x2, y2) corners
    boxes_xyxy = box_convert(
        boxes * torch.tensor([W, H, W, H]), in_fmt="cxcywh", out_fmt="xyxy"
    )

    detections = []
    for box, logit, phrase in zip(boxes_xyxy, logits, phrases):
        x1, y1, x2, y2 = box.tolist()
        detections.append({
            "label": phrase,
            "confidence": float(logit),
            "bbox": {
                "x1": int(x1),
                "y1": int(y1),
                "x2": int(x2),
                "y2": int(y2)
            }
        })

    with open(output_json, "w") as f:
        json.dump(detections, f, indent=2)

    return detections

# Detect cars and people
results = detect_and_export(
    "street.jpg",
    "car . person . bicycle . traffic light",
    "detections.json"
)
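
For the automated-labeling use case, the exported detections can be reshaped into COCO-style annotations, where a box is stored as [x, y, width, height]. A rough sketch, assuming the dictionary layout written by detect_and_export; to_coco_annotations and the category_ids mapping are hypothetical names introduced for illustration:

```python
def to_coco_annotations(detections, image_id, category_ids):
    """Convert corner-format detections to COCO-style annotation dicts."""
    annotations = []
    for det in detections:
        label = det["label"]
        if label not in category_ids:
            # Skip phrases that have no category in the target dataset
            continue
        box = det["bbox"]
        width = box["x2"] - box["x1"]
        height = box["y2"] - box["y1"]
        annotations.append({
            "id": len(annotations) + 1,
            "image_id": image_id,
            "category_id": category_ids[label],
            "bbox": [box["x1"], box["y1"], width, height],
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations


# Hand-written detections in the same shape as the exported JSON
example = [
    {"label": "car", "confidence": 0.9,
     "bbox": {"x1": 10, "y1": 20, "x2": 40, "y2": 60}},
    {"label": "tree", "confidence": 0.8,
     "bbox": {"x1": 0, "y1": 0, "x2": 5, "y2": 5}},
]
print(to_coco_annotations(example, image_id=1, category_ids={"car": 3}))
```

Auto-generated boxes are usually worth a manual review pass before they are used as training labels.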

Gradio Interface

import gradio as gr
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate
import tempfile
import numpy as np

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_objects(image, text_prompt, box_threshold, text_threshold):
    # Save temp image
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        cv2.imwrite(f.name, cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR))
        image_source, img = load_image(f.name)

    boxes, logits, phrases = predict(
        model=model,
        image=img,
        caption=text_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    annotated_rgb = cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB)

    return annotated_rgb, f"Found {len(boxes)} objects: {', '.join(phrases)}"

demo = gr.Interface(
    fn=detect_objects,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Textbox(label="Objects to Detect", value="person . car . dog", placeholder="object1 . object2 . object3"),
        gr.Slider(0.1, 0.9, value=0.35, label="Box Threshold"),
        gr.Slider(0.1, 0.9, value=0.25, label="Text Threshold")
    ],
    outputs=[
        gr.Image(label="Detection Result"),
        gr.Textbox(label="Summary")
    ],
    title="GroundingDINO - Open-Set Object Detection",
    description="Detect any object by describing it in text. Running on CLORE.AI GPU servers."
)

demo.launch(server_name="0.0.0.0", server_port=7860)

Performance

| Task | Resolution | GPU | Speed |
|------|------------|-----|-------|
| Single image | 800x600 | RTX 3090 | 120ms |
| Single image | 800x600 | RTX 4090 | 80ms |
| Single image | 1920x1080 | RTX 4090 | 150ms |
| Batch (10 images) | 800x600 | RTX 4090 | 600ms |
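
These latencies translate directly into throughput. A small worked calculation using the table's own RTX 4090 figures; images_per_second is a hypothetical helper name:

```python
def images_per_second(total_ms, num_images=1):
    """Throughput implied by processing `num_images` in `total_ms` milliseconds."""
    return num_images * 1000.0 / total_ms


# Figures taken from the performance table (RTX 4090, 800x600)
single = images_per_second(80)        # one image in 80 ms
batched = images_per_second(600, 10)  # ten images in 600 ms
print(f"single: {single:.1f} img/s, batched: {batched:.1f} img/s")
# single: 12.5 img/s, batched: 16.7 img/s
```

By these numbers, batching brings the per-image time from 80 ms down to 60 ms, roughly a one-third gain in throughput.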

Common Problems & Solutions

Low Detection Accuracy

Problem: Objects not being detected

Solutions:

Lower box_threshold to 0.2-0.3
Lower text_threshold to 0.15-0.2
Use more specific object descriptions
Separate objects with " . " not commas


# Good prompt format
TEXT_PROMPT = "red car . person wearing hat . wooden chair"

# Bad prompt format
TEXT_PROMPT = "red car, person wearing hat, wooden chair"
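
Prompts that start out as a Python list or a comma-separated string can be normalized into the " . " format before calling predict. A minimal sketch; build_prompt and split_comma_prompt are hypothetical helper names, and lowercasing is an assumption made here for consistency rather than something this guide requires:

```python
def build_prompt(labels):
    """Join object descriptions into a ' . '-separated prompt."""
    cleaned = [label.strip().lower() for label in labels if label.strip()]
    return " . ".join(cleaned)


def split_comma_prompt(text):
    """Rewrite a comma-separated prompt into the ' . ' format."""
    return build_prompt(text.split(","))


print(split_comma_prompt("red car, person wearing hat, wooden chair"))
# red car . person wearing hat . wooden chair
```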

Out of Memory

Problem: CUDA OOM on large images

Solutions:


# Resize large images before detection
from PIL import Image

def resize_if_needed(image_path, max_size=1280):
    img = Image.open(image_path)
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
        img.save(image_path)

Slow Inference

Problem: Detection takes too long

Solutions:

Use smaller input images
Batch process multiple images
Use FP16 inference
Rent faster GPU (RTX 4090, A100)

False Positives

Problem: Detecting wrong objects

Solutions:

Increase box_threshold to 0.4-0.5
Be more specific in prompts
Use negative prompts (filter results post-detection)


# Filter low-confidence detections
filtered = [(b, l, p) for b, l, p in zip(boxes, logits, phrases) if l > 0.5]
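
The post-detection "negative prompt" idea amounts to dropping unwanted phrases after the model has run. A sketch that combines it with the confidence filter, assuming phrases are plain strings, that scores can be cast with float (true for Python floats and single-element tensors), and that exact phrase matching is good enough; filter_detections and its blocked parameter are hypothetical:

```python
def filter_detections(boxes, scores, phrases, min_score=0.5, blocked=()):
    """Keep detections above `min_score` whose phrase is not in `blocked`."""
    blocked_set = {name.lower() for name in blocked}
    return [
        (box, score, phrase)
        for box, score, phrase in zip(boxes, scores, phrases)
        if float(score) > min_score and phrase.lower() not in blocked_set
    ]


# Stand-in values; real boxes and scores would come from predict()
boxes = [(0.5, 0.5, 0.2, 0.2), (0.3, 0.3, 0.1, 0.1), (0.7, 0.7, 0.2, 0.3)]
scores = [0.82, 0.41, 0.77]
phrases = ["car", "car", "shadow"]
kept = filter_detections(boxes, scores, phrases, min_score=0.5, blocked=["shadow"])
print(kept)
```

Only the first detection survives here: the second falls below the score cutoff and the third carries a blocked phrase.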

Troubleshooting

Objects not detected

Use more specific text descriptions
Try different phrasings
Lower confidence threshold

Bounding boxes wrong

Be more specific in text prompt
Use "." to separate multiple objects
Check image quality

Out of memory

Reduce image resolution
Process images one at a time
Use smaller model variant

Slow inference

Use TensorRT for speedup
Batch similar-sized images
Enable FP16 inference

Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

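| GPU | Hourly Rate | Daily Rate | 4-Hour Session |
|-----|-------------|------------|----------------|
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |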

Prices vary by provider and demand. Check CLORE.AI Marketplace for current rates.

Save money:

Use Spot market for flexible workloads (often 30-50% cheaper)
Pay with CLORE tokens
Compare prices across different providers
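
Session costs follow from the hourly rate by simple multiplication, and a Spot discount can be applied on top. A small sketch using the approximate RTX 4090 rate from the table above; session_cost is a hypothetical helper, and the 30% discount is only the lower end of the range quoted above, not a guaranteed price:

```python
def session_cost(hourly_rate, hours, spot_discount=0.0):
    """Estimated cost in USD for `hours` of rental at `hourly_rate`."""
    return hourly_rate * hours * (1.0 - spot_discount)


# RTX 4090 at roughly $0.10/hour
on_demand = session_cost(0.10, 4)
spot = session_cost(0.10, 4, spot_discount=0.30)
print(f"4 hours on-demand: ${on_demand:.2f}")
print(f"4 hours at a 30% Spot discount: ${spot:.2f}")
```

The table values are rounded, so results computed this way may differ from them by a few cents.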

Next Steps

SAM2 - Segment detected objects
Florence-2 - More vision tasks
YOLO - Faster detection for known classes
