# GroundingDINO

Detect any object using text descriptions with GroundingDINO.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is GroundingDINO?

GroundingDINO by IDEA-Research enables:

* Zero-shot object detection with text prompts
* Detect any object without training
* High-accuracy bounding box localization
* Combine with SAM for automatic segmentation

## Resources

* **GitHub:** [IDEA-Research/GroundingDINO](https://github.com/IDEA-Research/GroundingDINO)
* **Paper:** [GroundingDINO Paper](https://arxiv.org/abs/2303.05499)
* **HuggingFace:** [IDEA-Research/grounding-dino-base](https://huggingface.co/IDEA-Research/grounding-dino-base)
* **Demo:** [HuggingFace Space](https://huggingface.co/spaces/IDEA-Research/Grounding_DINO_Demo)

## Recommended Hardware

| Component | Minimum       | Recommended   | Optimal       |
| --------- | ------------- | ------------- | ------------- |
| GPU       | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM      | 6GB           | 12GB          | 16GB          |
| CPU       | 4 cores       | 8 cores       | 16 cores      |
| RAM       | 16GB          | 32GB          | 64GB          |
| Storage   | 20GB SSD      | 50GB NVMe     | 100GB NVMe    |
| Internet  | 100 Mbps      | 500 Mbps      | 1 Gbps        |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
cd /workspace && \
git clone https://github.com/IDEA-Research/GroundingDINO.git && \
cd GroundingDINO && \
pip install -e . && \
python demo/gradio_app.py
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -e .

# Download weights, then return to the repository root
# (the examples below use paths relative to it)
mkdir -p weights
cd weights
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
```

## What You Can Create

### Automated Labeling

* Auto-annotate datasets for ML training
* Generate bounding boxes from descriptions
* Speed up data labeling pipelines
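
As a rough illustration of auto-annotation, the sketch below formats detections as YOLO-style label lines (`class_id cx cy w h`, all normalized). This lines up with the normalized center-format boxes that `predict` returns in the Basic Usage section. The `to_yolo_lines` helper and the `class_names` list are hypothetical names introduced here, and boxes are passed as plain lists so the snippet runs without a GPU.

```python
def to_yolo_lines(boxes, phrases, class_names):
    """Format detections as YOLO label lines: 'class_id cx cy w h'.

    boxes: iterable of normalized [cx, cy, w, h] values in the 0..1 range.
    phrases: detected phrase for each box.
    class_names: ordered list of labels; the list index becomes the class id.
    Detections whose phrase is not in class_names are skipped.
    """
    index = {name.lower(): i for i, name in enumerate(class_names)}
    lines = []
    for box, phrase in zip(boxes, phrases):
        class_id = index.get(phrase.strip().lower())
        if class_id is None:
            continue  # phrase does not map to a known class
        cx, cy, w, h = (float(v) for v in box)
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical detections: the "tree" box is dropped because it is not a known class
example = to_yolo_lines(
    boxes=[[0.5, 0.5, 0.2, 0.4], [0.1, 0.2, 0.05, 0.05]],
    phrases=["dog", "tree"],
    class_names=["cat", "dog"],
)
print("\n".join(example))
```

With real model output you would pass `boxes.tolist()` and `phrases`, then write the lines to a `.txt` file named after the image.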

### Visual Search

* Find specific objects in image databases
* Content moderation systems
* Product recognition in retail

### Robotics & Automation

* Object localization for robot arms
* Inventory management systems
* Quality control inspection

### Creative Applications

* Auto-crop subjects from photos
* Generate object masks with SAM
* Content-aware image editing

### Analytics

* Count objects in images
* Track inventory from photos
* Wildlife monitoring
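
Counting can be sketched with the standard library alone: the `phrases` list returned by `predict` holds one label per detected box, so tallying it gives per-class counts. `count_by_label` is a hypothetical helper name, and the phrase list below is made-up sample data.

```python
from collections import Counter


def count_by_label(phrases):
    """Tally detections per label, ignoring case and surrounding whitespace."""
    return Counter(p.strip().lower() for p in phrases)


# Made-up phrases standing in for the output of predict()
counts = count_by_label(["car", "person", "car", "Car "])
for label, n in counts.most_common():
    print(f"{label}: {n}")
```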

## Basic Usage

```python
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# Load model
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# Load image
image_source, image = load_image("input.jpg")

# Detect objects
TEXT_PROMPT = "cat . dog . person"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

# Annotate image
annotated_frame = annotate(
    image_source=image_source,
    boxes=boxes,
    logits=logits,
    phrases=phrases
)

cv2.imwrite("output.jpg", annotated_frame)
```
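
The boxes returned by `predict` are normalized `(cx, cy, w, h)` values, not pixel corners, so they need converting before you crop or draw with other tools. The following is a minimal pure-Python sketch of that conversion for a single box; `cxcywh_to_xyxy` is a hypothetical helper name, and the clamping step is an added assumption to keep results inside the image.

```python
def cxcywh_to_xyxy(box, width, height):
    """Convert one normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = (float(v) for v in box)
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    # Clamp to the image so downstream crops never index out of bounds
    x1, x2 = (min(max(v, 0), width) for v in (x1, x2))
    y1, y2 = (min(max(v, 0), height) for v in (y1, y2))
    return round(x1), round(y1), round(x2), round(y2)


# A centered box covering half the width and height of an 800x600 image
pixel_box = cxcywh_to_xyxy([0.5, 0.5, 0.5, 0.5], width=800, height=600)
print(pixel_box)
```

For whole tensors, `torchvision.ops.box_convert` with `in_fmt="cxcywh"` and `out_fmt="xyxy"` does the same conversion in one call.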

## GroundingDINO + SAM (Grounded-SAM)

Combine detection with segmentation:

```python
import torch
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Load GroundingDINO
dino_model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# Load SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
sam_predictor = SamPredictor(sam)

# Load image
image_source, image = load_image("input.jpg")

# Detect with GroundingDINO
boxes, logits, phrases = predict(
    model=dino_model,
    image=image,
    caption="person . car",
    box_threshold=0.35,
    text_threshold=0.25
)

# Segment with SAM
sam_predictor.set_image(image_source)

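# predict() returns normalized (cx, cy, w, h) boxes; SAM expects pixel (x1, y1, x2, y2)
from torchvision.ops import box_convert

H, W = image_source.shape[:2]
boxes_xyxy = box_convert(
    boxes * torch.tensor([W, H, W, H]), in_fmt="cxcywh", out_fmt="xyxy"
)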

masks = []
for box in boxes_xyxy:
    mask, _, _ = sam_predictor.predict(
        box=box.numpy(),
        multimask_output=False
    )
    masks.append(mask)
```

## Batch Processing

```python
import os
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

input_dir = "./images"
output_dir = "./detected"
os.makedirs(output_dir, exist_ok=True)

TEXT_PROMPT = "product . price tag . barcode"

for filename in os.listdir(input_dir):
    # Match extensions case-insensitively so files like IMG_001.JPG are included
    if not filename.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue

    image_path = os.path.join(input_dir, filename)
    image_source, image = load_image(image_path)

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=TEXT_PROMPT,
        box_threshold=0.3,
        text_threshold=0.25
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    cv2.imwrite(os.path.join(output_dir, filename), annotated)

    print(f"{filename}: Found {len(boxes)} objects")
```

## Custom Detection Pipeline

```python
from groundingdino.util.inference import load_model, load_image, predict
from torchvision.ops import box_convert
import json
import torch

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_and_export(image_path, prompt, output_json):
    image_source, image = load_image(image_path)
    H, W = image_source.shape[:2]

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=prompt,
        box_threshold=0.35,
        text_threshold=0.25
    )

    # predict() returns normalized (cx, cy, w, h) boxes:
    # scale to pixels, then convert to corner (x1, y1, x2, y2) format
    boxes_xyxy = box_convert(
        boxes * torch.tensor([W, H, W, H]), in_fmt="cxcywh", out_fmt="xyxy"
    )

    detections = []
    for box, logit, phrase in zip(boxes_xyxy, logits, phrases):
        x1, y1, x2, y2 = box.tolist()
        detections.append({
            "label": phrase,
            "confidence": float(logit),
            "bbox": {
                "x1": int(x1),
                "y1": int(y1),
                "x2": int(x2),
                "y2": int(y2)
            }
        })

    with open(output_json, "w") as f:
        json.dump(detections, f, indent=2)

    return detections

# Detect common street objects
results = detect_and_export(
    "street.jpg",
    "car . person . bicycle . traffic light",
    "detections.json"
)
```

## Gradio Interface

```python
import gradio as gr
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate
import tempfile
import numpy as np

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_objects(image, text_prompt, box_threshold, text_threshold):
    # Save temp image
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        cv2.imwrite(f.name, cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR))
        image_source, img = load_image(f.name)

    boxes, logits, phrases = predict(
        model=model,
        image=img,
        caption=text_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    annotated_rgb = cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB)

    return annotated_rgb, f"Found {len(boxes)} objects: {', '.join(phrases)}"

demo = gr.Interface(
    fn=detect_objects,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Textbox(label="Objects to Detect", value="person . car . dog", placeholder="object1 . object2 . object3"),
        gr.Slider(0.1, 0.9, value=0.35, label="Box Threshold"),
        gr.Slider(0.1, 0.9, value=0.25, label="Text Threshold")
    ],
    outputs=[
        gr.Image(label="Detection Result"),
        gr.Textbox(label="Summary")
    ],
    title="GroundingDINO - Open-Set Object Detection",
    description="Detect any object by describing it in text. Running on CLORE.AI GPU servers."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Task              | Resolution | GPU      | Speed |
| ----------------- | ---------- | -------- | ----- |
| Single image      | 800x600    | RTX 3090 | 120ms |
| Single image      | 800x600    | RTX 4090 | 80ms  |
| Single image      | 1920x1080  | RTX 4090 | 150ms |
| Batch (10 images) | 800x600    | RTX 4090 | 600ms |

## Common Problems & Solutions

### Low Detection Accuracy

**Problem:** Objects not being detected

**Solutions:**

* Lower `box_threshold` to 0.2-0.3
* Lower `text_threshold` to 0.15-0.2
* Use more specific object descriptions
* Separate objects with " . " not commas

```python
# Good prompt format
TEXT_PROMPT = "red car . person wearing hat . wooden chair"

# Bad prompt format
TEXT_PROMPT = "red car, person wearing hat, wooden chair"
```
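
When class names come from a list or a config file, a small helper can keep the separator consistent. The sketch below is one possible way to do that; `build_prompt` is a hypothetical name, and dropping stray periods and duplicate labels is a choice made here, not something the library requires.

```python
def build_prompt(labels):
    """Join labels into a ' . '-separated prompt string.

    Each label is lowercased, stripped of periods and extra whitespace,
    and duplicates are dropped while preserving the original order.
    """
    cleaned = []
    for label in labels:
        text = " ".join(label.lower().replace(".", " ").split())
        if text and text not in cleaned:
            cleaned.append(text)
    return " . ".join(cleaned)


prompt = build_prompt(["Red car", "person wearing hat ", "red car", "wooden chair."])
print(prompt)
```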

### Out of Memory

**Problem:** CUDA OOM on large images

**Solutions:**

```python
# Resize large images before detection
from PIL import Image

def resize_if_needed(image_path, max_size=1280):
    """Downscale the image in place if its longest side exceeds max_size."""
    img = Image.open(image_path)
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
        img.save(image_path)  # overwrites the original file
    return image_path
```

### Slow Inference

**Problem:** Detection takes too long

**Solutions:**

* Use smaller input images
* Batch process multiple images
* Use FP16 inference
* Rent faster GPU (RTX 4090, A100)

### False Positives

**Problem:** Detecting wrong objects

**Solutions:**

* Increase `box_threshold` to 0.4-0.5
* Be more specific in prompts
* Filter out unwanted labels and low-confidence boxes after detection

```python
# Filter low-confidence detections
filtered = [(b, l, p) for b, l, p in zip(boxes, logits, phrases) if l > 0.5]
```
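
Post-detection filtering can also reject specific labels. The sketch below combines a score cutoff with a label blocklist; `filter_detections` and its `reject` parameter are hypothetical names, and the inputs are plain lists so the snippet runs on its own.

```python
def filter_detections(boxes, scores, phrases, min_score=0.5, reject=()):
    """Keep detections at or above min_score whose label is not in reject."""
    rejected = {r.strip().lower() for r in reject}
    kept = []
    for box, score, phrase in zip(boxes, scores, phrases):
        if float(score) < min_score:
            continue  # too uncertain
        if phrase.strip().lower() in rejected:
            continue  # explicitly unwanted label
        kept.append((box, float(score), phrase))
    return kept


# Made-up detections: one weak "car" and one confident but unwanted "poster"
kept = filter_detections(
    boxes=[[0.5, 0.5, 0.2, 0.2], [0.3, 0.3, 0.1, 0.1], [0.7, 0.7, 0.1, 0.1]],
    scores=[0.8, 0.4, 0.9],
    phrases=["car", "car", "poster"],
    min_score=0.5,
    reject=["poster"],
)
print(kept)
```

With real output, `logits` from `predict` can be passed as `scores`, since `float()` accepts single-element tensors.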

## Troubleshooting

### Objects not detected

* Use more specific text descriptions
* Try different phrasings
* Lower confidence threshold

### Bounding boxes wrong

* Be more specific in text prompt
* Use "." to separate multiple objects
* Check image quality

### Out of memory

* Reduce image resolution
* Process images one at a time
* Use the Swin-T checkpoint rather than the larger Swin-B one

### Slow inference

* Use TensorRT for speedup
* Batch similar-sized images
* Enable FP16 inference

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
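
To turn the table into a budget, the arithmetic is hourly rate times hours. The sketch below also estimates images per dollar from a per-image latency. Both helpers are hypothetical, the inputs are the approximate figures quoted in this guide, and the throughput estimate ignores model loading, disk I/O, and idle time, so treat it as an upper bound.

```python
def session_cost(hourly_rate_usd, hours):
    """Rental cost in USD for a session of the given length."""
    return round(hourly_rate_usd * hours, 2)


def images_per_dollar(hourly_rate_usd, latency_ms):
    """Upper-bound estimate of images processed per USD at a fixed latency."""
    images_per_hour = 3_600_000 / latency_ms  # milliseconds in one hour
    return round(images_per_hour / hourly_rate_usd)


# RTX 4090 at roughly $0.10/hour and 80 ms per 800x600 image (figures from this guide)
cost = session_cost(0.10, 4)
throughput = images_per_dollar(0.10, 80)
print(f"4-hour session: ${cost:.2f}, about {throughput:,} images per dollar")
```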

## Next Steps

* [SAM2](https://docs.clore.ai/guides/vision-models/sam2-video) - Segment detected objects
* [Florence-2](https://docs.clore.ai/guides/vision-models/florence2) - More vision tasks
* [YOLO](https://docs.clore.ai/guides/computer-vision/yolov8-detection) - Faster detection for known classes
