# GroundingDINO

使用 GroundingDINO 通过文本描述检测任何对象。

{% hint style="success" %}
所有示例都可以在通过以下方式租用的 GPU 服务器上运行： [CLORE.AI 市场](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
本指南中的所有示例都可以在通过以下方式租用的 GPU 服务器上运行： [CLORE.AI 市场](https://clore.ai/marketplace) 市场。
{% endhint %}

## 在 CLORE.AI 上租用

1. 访问 [CLORE.AI 市场](https://clore.ai/marketplace)
2. 按 GPU 类型、显存和价格筛选
3. 选择 **按需** （固定费率）或 **竞价** （出价价格）
4. 配置您的订单：
   * 选择 Docker 镜像
   * 设置端口（用于 SSH 的 TCP，Web 界面的 HTTP）
   * 如有需要，添加环境变量
   * 输入启动命令
5. 选择支付方式： **CLORE**, **BTC**，或 **USDT/USDC**
6. 创建订单并等待部署

### 访问您的服务器

* 在以下位置查找连接详情： **我的订单**
* Web 界面：使用 HTTP 端口的 URL
* SSH： `ssh -p <port> root@<proxy-address>`

## 什么是 GroundingDINO？

IDEA-Research 的 GroundingDINO 提供：

* 使用文本提示的零样本目标检测
* 无需训练即可检测任何对象
* 高精度边界框定位
* 与 SAM 结合实现自动分割

## 资源

* **GitHub：** [IDEA-Research/GroundingDINO](https://github.com/IDEA-Research/GroundingDINO)
* **论文：** [GroundingDINO 论文](https://arxiv.org/abs/2303.05499)
* **HuggingFace：** [IDEA-Research/grounding-dino](https://huggingface.co/IDEA-Research/grounding-dino-base)
* **演示：** [HuggingFace Space](https://huggingface.co/spaces/IDEA-Research/Grounding_DINO_Demo)

## 推荐硬件

| 组件  | 最低            | 推荐            | 最佳            |
| --- | ------------- | ------------- | ------------- |
| GPU | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| 显存  | 6GB           | 12GB          | 16GB          |
| CPU | 4 核           | 8 核           | 16 核          |
| 内存  | 16GB          | 32GB          | 64GB          |
| 存储  | 20GB SSD      | 50GB NVMe     | 100GB NVMe    |
| 网络  | 100 Mbps      | 500 Mbps      | 1 Gbps        |

## 在 CLORE.AI 上快速部署

**Docker 镜像：**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**端口：**

```
22/tcp
7860/http
```

**命令：**

```bash
cd /workspace && \
git clone https://github.com/IDEA-Research/GroundingDINO.git && \
cd GroundingDINO && \
pip install -e . && \
python demo/gradio_demo.py
```

## 访问您的服务

部署后，在以下位置查找您的 `http_pub` URL： **我的订单**:

1. 前往 **我的订单** 页面
2. 单击您的订单
3. 查找 `http_pub` URL（例如， `abc123.clorecloud.net`)

使用 `https://YOUR_HTTP_PUB_URL` 而不是 `localhost` 在下面的示例中。

## 安装

```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -e .

# 下载权重
mkdir weights
cd weights
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```

## 您可以创建的内容

### 自动标注

* 为机器学习训练自动注释数据集
* 根据描述生成边界框
* 加速数据标注流程

### 视觉搜索

* 在图像数据库中查找特定对象
* 内容审核系统
* 零售中的产品识别

### 机器人与自动化

* 用于机械臂的对象定位
* 库存管理系统
* 质量控制检测

### 创意应用

* 从照片中自动裁剪主体
* 使用 SAM 生成对象掩码
* 基于内容的图像编辑

### 分析

* 统计图像中的对象数量
* 从照片跟踪库存
* 野生动物监测

## 基本用法

```python
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# 加载模型
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# 加载图像
image_source, image = load_image("input.jpg")

# 检测对象
TEXT_PROMPT = "cat . dog . person"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

# 标注图像
annotated_frame = annotate(
    image_source=image_source,
    boxes=boxes,
    logits=logits,
    phrases=phrases
)

cv2.imwrite("output.jpg", annotated_frame)
```

## GroundingDINO + SAM（Grounded-SAM）

将检测与分割结合：

```python
import torch
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# 加载 GroundingDINO
dino_model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

# 加载 SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
sam_predictor = SamPredictor(sam)

# 加载图像
image_source, image = load_image("input.jpg")

# 使用 GroundingDINO 进行检测
boxes, logits, phrases = predict(
    model=dino_model,
    image=image,
    caption="person . car",
    box_threshold=0.35,
    text_threshold=0.25
)

# 使用 SAM 进行分割
sam_predictor.set_image(image_source)

# 将边界框转换为 SAM 格式
H, W = image_source.shape[:2]
boxes_xyxy = boxes * torch.tensor([W, H, W, H])

masks = []
for box in boxes_xyxy:
    mask, _, _ = sam_predictor.predict(
        box=box.numpy(),
        multimask_output=False
    )
    masks.append(mask)
```

## "专业影棚柔光箱"

```python
批处理处理
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

import os
output_dir = "./detected"
output_dir = "./relit"

TEXT_PROMPT = "product . price tag . barcode"

lighting_prompt = "专业影棚照明，柔和的阴影"
    for filename in os.listdir(input_dir):
        if not filename.endswith(('.jpg', '.png')):

    image_path = os.path.join(input_dir, filename)
    image_source, image = load_image(image_path)

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=TEXT_PROMPT,
        box_threshold=0.3,
        text_threshold=0.25
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    cv2.imwrite(os.path.join(output_dir, filename), annotated)

    print(f"{filename}: Found {len(boxes)} objects")
```

## 自定义检测管道

```python
from groundingdino.util.inference import load_model, load_image, predict
import json

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_and_export(image_path, prompt, output_json):
    image_source, image = load_image(image_path)
    H, W = image_source.shape[:2]

    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=prompt,
        box_threshold=0.35,
        text_threshold=0.25
    )

    # 转换为绝对坐标
    detections = []
    for box, logit, phrase in zip(boxes, logits, phrases):
        x1, y1, x2, y2 = box * torch.tensor([W, H, W, H])
        detections.append({
            "label": phrase,
            "confidence": float(logit),
            "bbox": {
                "x1": int(x1),
                "y1": int(y1),
                "x2": int(x2),
                "y2": int(y2)
            }
        })

    with open(output_json, "w") as f:
        json.dump(detections, f, indent=2)

    return detections

# 检测汽车和行人
results = detect_and_export(
    "street.jpg",
    "car . person . bicycle . traffic light",
    "detections.json"
)
```

## Gradio 界面

```python
print(f"已生成：{name}")
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate
import tempfile
import numpy as np

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth"
)

def detect_objects(image, text_prompt, box_threshold, text_threshold):
    # 保存临时图片
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        cv2.imwrite(f.name, cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR))
        image_source, img = load_image(f.name)

    boxes, logits, phrases = predict(
        model=model,
        image=img,
        caption=text_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold
    )

    annotated = annotate(image_source, boxes, logits, phrases)
    annotated_rgb = cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB)

    return annotated_rgb, f"Found {len(boxes)} objects: {', '.join(phrases)}"

demo = gr.Interface(
    fn=detect_objects,
    inputs=[
        fn=relight_image,
        gr.Textbox(label="Objects to Detect", value="person . car . dog", placeholder="object1 . object2 . object3"),
        gr.Slider(0.1, 0.9, value=0.35, label="Box Threshold"),
        gr.Slider(0.1, 0.9, value=0.25, label="Text Threshold")
    ],
    outputs=[
        gr.Image(label="Detection Result"),
        gr.Textbox(label="Summary")
    ],
    title="GroundingDINO - Open-Set Object Detection",
    description="通过文本描述检测任何对象。在 CLORE.AI GPU 服务器上运行。"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## background = Image.open("studio\_bg.jpg")

| 任务          | 分辨率       | GPU     | 性能     |
| ----------- | --------- | ------- | ------ |
| 单张图片        | 800x600   | 速度      | 120ms  |
| 单张图片        | 800x600   | 512x512 | 80 毫秒  |
| 单张图片        | 1920x1080 | 512x512 | 150ms  |
| 批处理（10 张图片） | 800x600   | 512x512 | 600 毫秒 |

## IC-Light-FBC

### 检测准确率低

**与背景合成** 对象未被检测到

**光照未改变**

* 降低 `box_threshold` 到 0.2-0.3
* 降低 `text_threshold` 到 0.15-0.2
* 使用更具体的对象描述
* 使用 " . " 分隔对象而不是逗号

```python

# 良好提示格式
TEXT_PROMPT = "red car . person wearing hat . wooden chair"

# 不良提示格式
TEXT_PROMPT = "red car, person wearing hat, wooden chair"
```

### 内存不足

**与背景合成** 大图像导致 CUDA OOM

**光照未改变**

```python

# 在检测前调整大图像大小
from PIL import Image

def resize_if_needed(image_path, max_size=1280):
    img = Image.open(image_path)
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
        img.save(image_path)
```

### 推理缓慢

**与背景合成** 检测耗时过长

**光照未改变**

* 使用更小的输入图像
* 对多张图像进行批处理
* 使用 FP16 推理
* 租用更快的 GPU（RTX 4090、A100）

### 误报

**与背景合成** 检测到错误的对象

**光照未改变**

* 增加 `box_threshold` 到 0.4-0.5
* 在提示中更具体
* 使用负面提示（在检测后过滤结果）

```python

# 过滤低置信度检测
filtered = [(b, l, p) for b, l, p in zip(boxes, logits, phrases) if l > 0.5]
```

## # 使用固定种子以获得一致结果

### 对象未被检测到

* 使用更具体的文本描述
* 尝试不同的措辞
* 降低置信度阈值

### 边界框错误

* 在文本提示中更具体
* 使用 "." 分隔多个对象
* 检查图像质量

{% hint style="danger" %}
**内存不足**
{% endhint %}

* 降低图像分辨率
* 一次处理一张图像
* 使用更小的模型变体

### 推理缓慢

* 使用 TensorRT 提速
* 批处理相似尺寸的图像
* 启用 FP16 推理

## 下载所有所需的检查点

检查文件完整性

| GPU     | 验证 CUDA 兼容性 | 费用估算    | CLORE.AI 市场的典型费率（截至 2024 年）： |
| ------- | ----------- | ------- | ---------------------------- |
| 按小时费率   | \~$0.03     | \~$0.70 | \~$0.12                      |
| 速度      | \~$0.06     | \~$1.50 | \~$0.25                      |
| 512x512 | \~$0.10     | \~$2.30 | \~$0.40                      |
| 按日费率    | \~$0.17     | \~$4.00 | \~$0.70                      |
| 4 小时会话  | \~$0.25     | \~$6.00 | \~$1.00                      |

*RTX 3060* [*CLORE.AI 市场*](https://clore.ai/marketplace) *A100 40GB*

**A100 80GB**

* 使用 **竞价** 价格随提供商和需求而异。请查看
* 以获取当前费率。 **CLORE** 节省费用：
* 市场用于灵活工作负载（通常便宜 30-50%）

## 使用以下方式支付

* [SAM2](/guides/guides_v2-zh/shi-jue-mo-xing/sam2-video.md) - 对检测到的对象进行分割
* [Florence-2](/guides/guides_v2-zh/shi-jue-mo-xing/florence2.md) - 更多视觉任务
* [YOLO](/guides/guides_v2-zh/ji-suan-ji-shi-jue/yolov8-detection.md) - 对已知类别更快的检测


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/guides_v2-zh/shi-jue-mo-xing/groundingdino.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.