# GroundingDINO

Detect any object using text descriptions with GroundingDINO.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
All examples in this guide can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace) marketplace.
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is GroundingDINO?

GroundingDINO by IDEA-Research enables:

* Zero-shot object detection with text prompts
* Detect any object without training
* High-accuracy bounding box localization
* Combine with SAM for automatic segmentation

## Resources

* **GitHub:** [IDEA-Research/GroundingDINO](https://github.com/IDEA-Research/GroundingDINO)
* **Paper:** [GroundingDINO Paper](https://arxiv.org/abs/2303.05499)
* **HuggingFace:** [IDEA-Research/grounding-dino](https://huggingface.co/IDEA-Research/grounding-dino-base)
* **Demo:** [HuggingFace Space](https://huggingface.co/spaces/IDEA-Research/Grounding_DINO_Demo)

## Recommended Hardware

| Component | Minimum       | Recommended   | Optimal       |
| --------- | ------------- | ------------- | ------------- |
| GPU       | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM      | 6GB           | 12GB          | 16GB          |
| CPU       | 4 cores       | 8 cores       | 16 cores      |
| RAM       | 16GB          | 32GB          | 64GB          |
| Storage   | 20GB SSD      | 50GB NVMe     | 100GB NVMe    |
| Internet  | 100 Mbps      | 500 Mbps      | 1 Gbps        |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
cd /workspace && \
git clone https://github.com/IDEA-Research/GroundingDINO.git && \
cd GroundingDINO && \
pip install -e . && \
python demo/gradio_demo.py
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -e .

# Download weights
mkdir weights
cd weights
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```

## What You Can Create

### Automated Labeling

* Auto-annotate datasets for ML training
* Generate bounding boxes from descriptions
* Speed up data labeling pipelines

### Visual Search

* Find specific objects in image databases
* Content moderation systems
* Product recognition in retail

### Robotics & Automation

* Object localization for robot arms
* Inventory management systems
* Quality control inspection

### Creative Applications

* Auto-crop subjects from photos
* Generate object masks with SAM
* Content-aware image editing

### Analytics

* Count objects in images
* Track inventory from photos
* Wildlife monitoring

## Basic Usage

```python
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# Load model
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# Load image
image_source, image = load_image("input.jpg")

# Detect objects
TEXT_PROMPT = "cat . dog . person"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

# Annotate image
annotated_frame = annotate(
    image_source=image_source,
    boxes=boxes,
    logits=logits,
    phrases=phrases
)

cv2.imwrite("output.jpg", annotated_frame)
```

## GroundingDINO + SAM (Grounded-SAM)

Combine detection with segmentation:

```python
import torch
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Load GroundingDINO
dino_model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# Load SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
sam_predictor = SamPredictor(sam)

# Load image
image_source, image = load_image("input.jpg")

# Detect with GroundingDINO
boxes, logits, phrases = predict(
    model=dino_model,
    image=image,
    caption="person . car",
    box_threshold=0.35,
    text_threshold=0.25
)

# Segment with SAM
sam_predictor.set_image(image_source)

# Convert boxes to SAM format
H, W = image_source.shape[:2]
boxes_xyxy = boxes * torch.tensor([W, H, W, H])

masks = []
for box in boxes_xyxy:
    mask, _, _ = sam_predictor.predict(
        box=box.numpy(),
        multimask_output=False
    )
    masks.append(mask)
```

## Batch Processing

```python
import os
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

input_dir = "./images"
output_dir = "./detected"
os.makedirs(output_dir, exist_ok=True)

TEXT_PROMPT = "product . price tag . barcode"

for filename in os.listdir(input_dir):
    if not filename.endswith(('.jpg', '.png')):
        continue

    image_path = os.path.join(input_dir, filename)
    image_source, image = load_image(image_path)

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=TEXT_PROMPT,
        box_threshold=0.3,
        text_threshold=0.25
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    cv2.imwrite(os.path.join(output_dir, filename), annotated)

    print(f"{filename}: Found {len(boxes)} objects")
```

## Custom Detection Pipeline

```python
from groundingdino.util.inference import load_model, load_image, predict
import json

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_and_export(image_path, prompt, output_json):
    image_source, image = load_image(image_path)
    H, W = image_source.shape[:2]

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=prompt,
        box_threshold=0.35,
        text_threshold=0.25
    )

    # Convert to absolute coordinates
    detections = []
    for box, logit, phrase in zip(boxes, logits, phrases):
        x1, y1, x2, y2 = box * torch.tensor([W, H, W, H])
        detections.append({
            "label": phrase,
            "confidence": float(logit),
            "bbox": {
                "x1": int(x1),
                "y1": int(y1),
                "x2": int(x2),
                "y2": int(y2)
            }
        })

    with open(output_json, "w") as f:
        json.dump(detections, f, indent=2)

    return detections

# Detect cars and people
results = detect_and_export(
    "street.jpg",
    "car . person . bicycle . traffic light",
    "detections.json"
)
```

## Gradio Interface

```python
import gradio as gr
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate
import tempfile
import numpy as np

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_objects(image, text_prompt, box_threshold, text_threshold):
    # Save temp image
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        cv2.imwrite(f.name, cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR))
        image_source, img = load_image(f.name)

    boxes, logits, phrases = predict(
        model=model,
        image=img,
        caption=text_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    annotated_rgb = cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB)

    return annotated_rgb, f"Found {len(boxes)} objects: {', '.join(phrases)}"

demo = gr.Interface(
    fn=detect_objects,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Textbox(label="Objects to Detect", value="person . car . dog", placeholder="object1 . object2 . object3"),
        gr.Slider(0.1, 0.9, value=0.35, label="Box Threshold"),
        gr.Slider(0.1, 0.9, value=0.25, label="Text Threshold")
    ],
    outputs=[
        gr.Image(label="Detection Result"),
        gr.Textbox(label="Summary")
    ],
    title="GroundingDINO - Open-Set Object Detection",
    description="Detect any object by describing it in text. Running on CLORE.AI GPU servers."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Task              | Resolution | GPU      | Speed |
| ----------------- | ---------- | -------- | ----- |
| Single image      | 800x600    | RTX 3090 | 120ms |
| Single image      | 800x600    | RTX 4090 | 80ms  |
| Single image      | 1920x1080  | RTX 4090 | 150ms |
| Batch (10 images) | 800x600    | RTX 4090 | 600ms |

## Common Problems & Solutions

### Low Detection Accuracy

**Problem:** Objects not being detected

**Solutions:**

* Lower `box_threshold` to 0.2-0.3
* Lower `text_threshold` to 0.15-0.2
* Use more specific object descriptions
* Separate objects with " . " not commas

```python

# Good prompt format
TEXT_PROMPT = "red car . person wearing hat . wooden chair"

# Bad prompt format
TEXT_PROMPT = "red car, person wearing hat, wooden chair"
```

### Out of Memory

**Problem:** CUDA OOM on large images

**Solutions:**

```python

# Resize large images before detection
from PIL import Image

def resize_if_needed(image_path, max_size=1280):
    img = Image.open(image_path)
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
        img.save(image_path)
```

### Slow Inference

**Problem:** Detection takes too long

**Solutions:**

* Use smaller input images
* Batch process multiple images
* Use FP16 inference
* Rent faster GPU (RTX 4090, A100)

### False Positives

**Problem:** Detecting wrong objects

**Solutions:**

* Increase `box_threshold` to 0.4-0.5
* Be more specific in prompts
* Use negative prompts (filter results post-detection)

```python

# Filter low-confidence detections
filtered = [(b, l, p) for b, l, p in zip(boxes, logits, phrases) if l > 0.5]
```

## Troubleshooting

### Objects not detected

* Use more specific text descriptions
* Try different phrasings
* Lower confidence threshold

### Bounding boxes wrong

* Be more specific in text prompt
* Use "." to separate multiple objects
* Check image quality

{% hint style="danger" %}
**Out of memory**
{% endhint %}

* Reduce image resolution
* Process images one at a time
* Use smaller model variant

### Slow inference

* Use TensorRT for speedup
* Batch similar-sized images
* Enable FP16 inference

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [SAM2](/guides/vision-models/sam2-video.md) - Segment detected objects
* [Florence-2](/guides/vision-models/florence2.md) - More vision tasks
* [YOLO](/guides/computer-vision/yolov8-detection.md) - Faster detection for known classes


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/vision-models/groundingdino.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
