# YOLOv9/v10 Detection

> **State-of-the-art real-time object detection — train and deploy the latest YOLO models on GPU**

YOLO (You Only Look Once) remains the gold standard for real-time object detection. YOLOv9 introduced Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN), while YOLOv10 achieves NMS-free detection through consistent dual label assignments during training. Both deliver top-tier accuracy/speed tradeoffs on NVIDIA GPUs.

* **YOLOv9 GitHub:** [WongKinYiu/yolov9](https://github.com/WongKinYiu/yolov9) — 8K+ ⭐
* **YOLOv10 GitHub:** [THU-MIG/yolov10](https://github.com/THU-MIG/yolov10) — 10K+ ⭐
* **Ultralytics (unified):** [ultralytics/ultralytics](https://github.com/ultralytics/ultralytics) — 32K+ ⭐

***

## YOLOv9 vs YOLOv10 vs YOLOv8 — Quick Comparison

| Model    | mAP50-95 | Speed (A100) | Parameters | NMS      |
| -------- | -------- | ------------ | ---------- | -------- |
| YOLOv8x  | 53.9     | 14.2ms       | 68.2M      | Required |
| YOLOv9e  | 55.6     | 16.8ms       | 57.3M      | Required |
| YOLOv10x | 54.4     | 10.7ms       | 29.5M      | **Free** |
| YOLOv10b | 53.0     | 8.8ms        | 19.1M      | **Free** |
| YOLOv10s | 46.8     | 4.2ms        | 7.2M       | **Free** |

{% hint style="success" %}
**YOLOv10 is NMS-free** — no post-processing Non-Maximum Suppression step. This enables end-to-end deployment and is particularly beneficial for edge/embedded scenarios and TensorRT deployment.
{% endhint %}
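
You can see this directly in the per-stage timings Ultralytics attaches to every result. A minimal check (weights auto-download; the comparison pair is just an illustration):

```python
from ultralytics import YOLO

# Compare post-processing overhead: YOLOv8 (needs NMS) vs YOLOv10 (NMS-free)
for weights in ("yolov8s.pt", "yolov10s.pt"):
    model = YOLO(weights)
    result = model("https://ultralytics.com/images/bus.jpg", verbose=False)[0]
    # result.speed holds preprocess / inference / postprocess times in milliseconds
    print(weights, {k: round(v, 2) for k, v in result.speed.items()})
```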

***

## Use Cases

* **Security & surveillance** — real-time person/vehicle/object detection
* **Autonomous vehicles** — pedestrian and obstacle detection
* **Manufacturing QC** — defect detection on production lines
* **Retail analytics** — customer flow and product detection
* **Medical imaging** — anomaly detection in X-rays and scans
* **Sports analytics** — player and ball tracking
* **Agriculture** — crop disease and pest detection

***

## Prerequisites

* Clore.ai account with GPU rental
* Training data (for custom model training) or use COCO pretrained weights
* Basic Python and command line knowledge

***

## Step 1 — Rent a GPU on Clore.ai

1. Go to [clore.ai](https://clore.ai) → **Marketplace**
2. Choose GPU based on your task:
   * **Inference only:** RTX 3080/3090 or RTX 4080 — excellent price/performance
   * **Training small models:** RTX 4090 24GB
   * **Training large models (YOLOv9e/YOLOv10x):** A100 40/80GB

{% hint style="info" %}
**For real-time inference** (video streams), RTX 3090 or RTX 4090 delivers 100–500 FPS depending on the model variant. Even the smallest YOLOv10n runs at 1000+ FPS on a 4090 with TensorRT.
{% endhint %}
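
To check throughput on the specific card you rented, a minimal timing sketch (model choice, warm-up, and iteration counts are arbitrary):

```python
import time

import numpy as np
from ultralytics import YOLO

model = YOLO("yolov10n.pt")
frame = np.zeros((640, 640, 3), dtype=np.uint8)  # dummy BGR frame

# Warm up so one-time CUDA initialization doesn't skew the measurement
for _ in range(20):
    model(frame, verbose=False)

n = 200
start = time.perf_counter()
for _ in range(n):
    model(frame, verbose=False)
print(f"~{n / (time.perf_counter() - start):.0f} FPS (batch 1, 640x640)")
```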

***

## Step 2 — Deploy the Ultralytics Container

The official Ultralytics Docker image supports YOLOv8, YOLOv9, and YOLOv10 through a unified API:

**Docker Image:**

```
ultralytics/ultralytics:latest
```

**Ports:**

```
22
8000
```

**Environment Variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
```

**Disk:** 20GB minimum (pretrained weights + your dataset)

***

## Step 3 — Connect and Verify

```bash
ssh root@<server-ip> -p <ssh-port>

# Check GPU
nvidia-smi

# Check Ultralytics installation
python3 -c "import ultralytics; ultralytics.checks()"

# Should print Ultralytics, Python, torch, and CUDA/GPU info
```

***

## Step 4 — Quick Inference with Pretrained Models

### YOLOv10 Inference (NMS-free)

```python
from ultralytics import YOLO
import cv2

# Load YOLOv10 model (auto-downloads if not present)
model = YOLO("yolov10x.pt")  # Options: n, s, m, b, l, x

# Run inference on an image
results = model("https://ultralytics.com/images/bus.jpg")

# Display results
for result in results:
    boxes = result.boxes
    print(f"Detected {len(boxes)} objects")
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        xyxy = box.xyxy[0].tolist()
        print(f"  {model.names[cls]}: {conf:.2f} at {[int(x) for x in xyxy]}")

# Save annotated image
results[0].save("output.jpg")
```

### YOLOv9 Inference

```python
from ultralytics import YOLO

# Load YOLOv9 model
model = YOLO("yolov9e.pt")  # Options: t, s, m, c, e

# Batch inference for maximum throughput
results = model(
    source=[
        "image1.jpg",
        "image2.jpg",
        "image3.jpg",
    ],
    batch=8,        # Process 8 images in parallel
    device="cuda",
    conf=0.25,      # Confidence threshold
    iou=0.45,       # NMS IoU threshold (not needed for v10)
    imgsz=640,
    half=True       # FP16 for 2x speedup
)
```

### Real-Time Video Stream Inference

```python
from ultralytics import YOLO
import cv2

model = YOLO("yolov10s.pt")

# For webcam (device=0) or video file
cap = cv2.VideoCapture("input_video.mp4")

# Get video properties
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Output writer
out = cv2.VideoWriter(
    "output_video.mp4",
    cv2.VideoWriter_fourcc(*"mp4v"),
    fps,
    (width, height)
)

frame_count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    
    results = model(frame, conf=0.25, verbose=False)
    annotated = results[0].plot()
    out.write(annotated)
    frame_count += 1
    
    if frame_count % 100 == 0:
        print(f"Processed {frame_count} frames")

cap.release()
out.release()
print("Done! Output saved to output_video.mp4")
```
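
For live sources (webcam, RTSP camera) it is often cleaner to let Ultralytics drive the capture loop with `stream=True`, which yields results one frame at a time instead of buffering them all. A sketch with a placeholder RTSP URL:

```python
from ultralytics import YOLO

model = YOLO("yolov10s.pt")

# stream=True returns a generator: one Results object per decoded frame
for result in model("rtsp://user:pass@camera-ip/stream", stream=True, conf=0.25, verbose=False):
    people = [b for b in result.boxes if model.names[int(b.cls[0])] == "person"]
    if people:
        print(f"{len(people)} person(s) in frame")
```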

***

## Step 5 — Train a Custom Model

### Prepare Your Dataset

YOLO uses a specific directory structure and label format:

```
dataset/
├── images/
│   ├── train/          # Training images (.jpg/.png)
│   ├── val/            # Validation images
│   └── test/           # Test images (optional)
└── labels/
    ├── train/          # Label files (.txt)
    ├── val/
    └── test/
```

Each label file (same base name as its image, `.txt` extension) contains one line per object:

```
# class_id center_x center_y width height (all normalized 0-1)
0 0.512 0.334 0.256 0.412
1 0.123 0.654 0.089 0.123
```
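
Out-of-range class ids or un-normalized (pixel) coordinates are a common silent cause of bad training, so a quick sanity check over the label files is worth running first. A sketch; the directory and class count below are assumptions you should adjust:

```python
from pathlib import Path

LABEL_DIR = Path("/workspace/dataset/labels/train")  # adjust per split
NUM_CLASSES = 3                                      # must match your dataset YAML

for label_file in sorted(LABEL_DIR.glob("*.txt")):
    for lineno, line in enumerate(label_file.read_text().splitlines(), start=1):
        parts = line.split()
        if len(parts) != 5:
            print(f"{label_file.name}:{lineno} expected 5 fields, got {len(parts)}")
            continue
        cls, *coords = parts
        if not 0 <= int(cls) < NUM_CLASSES:
            print(f"{label_file.name}:{lineno} class id {cls} out of range")
        if any(not 0.0 <= float(c) <= 1.0 for c in coords):
            print(f"{label_file.name}:{lineno} coordinates not normalized: {coords}")
```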

### Create Dataset Config

```bash
cat > /workspace/custom_dataset.yaml << 'EOF'
# Dataset configuration
path: /workspace/dataset
train: images/train
val: images/val
test: images/test

# Number of classes
nc: 3

# Class names
names:
  0: person
  1: car
  2: bicycle
EOF
```
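
If your images and labels still live in one flat folder, an 80/20 split into the structure above can be scripted. A sketch; the source directory is hypothetical:

```python
import random
import shutil
from pathlib import Path

SRC = Path("/workspace/raw_data")   # flat folder of images + matching .txt labels (assumed)
DST = Path("/workspace/dataset")
random.seed(0)

images = sorted(p for p in SRC.iterdir() if p.suffix.lower() in {".jpg", ".png"})
random.shuffle(images)
cut = int(0.8 * len(images))

for subset, items in (("train", images[:cut]), ("val", images[cut:])):
    (DST / "images" / subset).mkdir(parents=True, exist_ok=True)
    (DST / "labels" / subset).mkdir(parents=True, exist_ok=True)
    for img in items:
        shutil.copy(img, DST / "images" / subset / img.name)
        label = img.with_suffix(".txt")
        if label.exists():
            shutil.copy(label, DST / "labels" / subset / label.name)
```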

### Import from Roboflow (Recommended)

```python
# Install the Roboflow SDK first: pip install roboflow
from roboflow import Roboflow
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("your-project")
version = project.version(1)
dataset = version.download("yolov9")

# Dataset is now at ./your-project-1/
```

### Train YOLOv10

```python
from ultralytics import YOLO

# Load pretrained YOLOv10 model (transfer learning)
model = YOLO("yolov10m.pt")  # Medium variant — good balance

results = model.train(
    data="/workspace/custom_dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,               # Adjust for your GPU VRAM
    device="cuda",
    workers=8,
    project="/workspace/runs",
    name="yolov10_custom",
    patience=50,            # Early stopping
    save=True,
    save_period=10,         # Save checkpoint every 10 epochs
    plots=True,
    val=True,
    # Augmentation hyperparameters (train-time augmentation is applied by default)
    degrees=10.0,
    flipud=0.0,
    fliplr=0.5,
    mosaic=1.0,
    mixup=0.1,
    copy_paste=0.1,
    lr0=0.01,
    lrf=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3.0,
    amp=True                # Automatic Mixed Precision (FP16)
)

print(f"Training complete! Best mAP: {results.results_dict['metrics/mAP50-95(B)']:.3f}")
```
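
Rented instances can get interrupted mid-run, so it helps to know that training resumes cleanly from the last checkpoint written into the run directory configured above:

```python
from ultralytics import YOLO

# Continue an interrupted run from its most recent checkpoint
model = YOLO("/workspace/runs/yolov10_custom/weights/last.pt")
model.train(resume=True)  # reuses the original arguments and epoch counter
```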

### Train YOLOv9

```python
from ultralytics import YOLO

model = YOLO("yolov9e.pt")

results = model.train(
    data="/workspace/custom_dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=8,               # v9e is larger; use a smaller batch
    device="cuda",
    workers=8,
    project="/workspace/runs",
    name="yolov9_custom",
    amp=True,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005
)
```

{% hint style="info" %}
**Training Tips:**

* **Batch size:** Start with `batch=16` for RTX 4090, `batch=32` for A100 40GB
* **Image size:** `imgsz=640` is standard; use 1280 for high-resolution tasks
* **Epochs:** 100 epochs is typical for fine-tuning, 300+ for training from scratch
* **AMP (Mixed Precision):** Always enable `amp=True` for 1.5–2x speedup
{% endhint %}

***

## Step 6 — Export to TensorRT for Maximum Speed

```python
from ultralytics import YOLO

# Load trained model
model = YOLO("/workspace/runs/yolov10_custom/weights/best.pt")

# Export to TensorRT (FP16 for best speed/accuracy balance)
model.export(
    format="engine",        # TensorRT engine
    device="cuda",
    half=True,              # FP16
    dynamic=False,          # Static shapes for max TRT optimization
    batch=1,                # Optimize for batch size 1 (real-time)
    imgsz=640,
    workspace=4             # TRT workspace in GB
)
# Saved next to the weights as: /workspace/runs/yolov10_custom/weights/best.engine

# Load and run the TRT engine
trt_model = YOLO("/workspace/runs/yolov10_custom/weights/best.engine")
results = trt_model("image.jpg")
```
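
To confirm the TensorRT gain on your own card, a quick side-by-side of the PyTorch weights and the exported engine (paths follow the training run above; warm-up and iteration counts are arbitrary):

```python
import time

import numpy as np
from ultralytics import YOLO

frame = np.zeros((640, 640, 3), dtype=np.uint8)
weights_dir = "/workspace/runs/yolov10_custom/weights"

for name in ("best.pt", "best.engine"):
    model = YOLO(f"{weights_dir}/{name}")
    for _ in range(20):          # warm-up
        model(frame, verbose=False)
    n = 200
    start = time.perf_counter()
    for _ in range(n):
        model(frame, verbose=False)
    print(f"{name}: ~{n / (time.perf_counter() - start):.0f} FPS")
```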

### Export to ONNX

```python
# Export to ONNX for deployment flexibility
model.export(
    format="onnx",
    opset=17,
    half=True,              # FP16 weights
    dynamic=True,           # Dynamic batch size
    simplify=True
)
```
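
To confirm the ONNX file loads and to inspect its input/output signature, onnxruntime works well (installing `onnxruntime-gpu` is assumed; the path matches the export location next to the trained weights):

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "/workspace/runs/yolov10_custom/weights/best.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```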

***

## Step 7 — Serve as a REST API

```bash
pip install fastapi uvicorn python-multipart

cat > /workspace/yolo_api.py << 'EOF'
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse, FileResponse
from ultralytics import YOLO
from PIL import Image
import io
import uuid

app = FastAPI(title="YOLOv10 Detection API")
model = YOLO("yolov10x.pt")

@app.get("/health")
async def health():
    return {"status": "ok", "model": "yolov10x", "device": "cuda"}

@app.post("/detect")
async def detect(
    file: UploadFile = File(...),
    conf: float = 0.25,
    iou: float = 0.45,
    return_image: bool = False
):
    # Read uploaded image
    image_data = await file.read()
    img = Image.open(io.BytesIO(image_data)).convert("RGB")
    
    # Run detection
    results = model(img, conf=conf, iou=iou, verbose=False)
    result = results[0]
    
    # Build response
    detections = []
    for box in result.boxes:
        detections.append({
            "class": model.names[int(box.cls[0])],
            "confidence": round(float(box.conf[0]), 4),
            "bbox": [round(x, 2) for x in box.xyxy[0].tolist()],
            "class_id": int(box.cls[0])
        })
    
    response = {
        "count": len(detections),
        "detections": detections,
        "image_size": list(result.orig_shape)
    }
    
    if return_image:
        output_path = f"/tmp/{uuid.uuid4()}.jpg"
        result.save(filename=output_path)
        return FileResponse(output_path, media_type="image/jpeg")
    
    return JSONResponse(response)

@app.post("/detect/batch")
async def detect_batch(files: list[UploadFile] = File(...)):
    results = []
    for file in files:
        data = await file.read()
        img = Image.open(io.BytesIO(data)).convert("RGB")
        res = model(img, verbose=False)[0]
        results.append({
            "filename": file.filename,
            "count": len(res.boxes),
            "detections": [
                {"class": model.names[int(b.cls[0])], "conf": float(b.conf[0])}
                for b in res.boxes
            ]
        })
    return JSONResponse({"results": results})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
EOF

python3 /workspace/yolo_api.py &

# Test the API
curl -X POST "http://localhost:8000/detect" \
    -F "file=@test_image.jpg" | python3 -m json.tool
```
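
The same endpoint is easy to call from Python when you want to post-process the JSON rather than eyeball curl output (the image name is a placeholder; `requests` must be installed):

```python
import requests

with open("test_image.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/detect",
        files={"file": ("test_image.jpg", f, "image/jpeg")},
        params={"conf": 0.3},
    )
resp.raise_for_status()

data = resp.json()
print(f"{data['count']} objects detected")
for det in data["detections"]:
    print(f"  {det['class']}: {det['confidence']} at {det['bbox']}")
```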

***

## Step 8 — Validate and Benchmark Your Model

```python
from ultralytics import YOLO

model = YOLO("yolov10x.pt")

# Validate on COCO dataset
metrics = model.val(
    data="coco.yaml",
    imgsz=640,
    batch=32,
    device="cuda",
    half=True
)

print(f"mAP50:    {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
print(f"Precision: {metrics.box.mp:.3f}")
print(f"Recall:    {metrics.box.mr:.3f}")

# Benchmark speed
model.benchmark(
    format="engine",   # Compare multiple export formats
    imgsz=640,
    half=True,
    device="cuda"
)
```

***

## Download Results

```bash
# From your local machine:
scp -P <ssh-port> root@<server-ip>:/workspace/runs/yolov10_custom/weights/best.pt ./
scp -P <ssh-port> root@<server-ip>:/workspace/output_video.mp4 ./

# Download entire training run
rsync -avz -e "ssh -p <ssh-port>" \
    root@<server-ip>:/workspace/runs/ \
    ./yolo_training_runs/
```

***

## Troubleshooting

### CUDA Out of Memory During Training

```python
# Reduce batch size
model.train(data="data.yaml", batch=4, imgsz=640)

# Or lower the input resolution / switch to a smaller model variant
model.train(data="data.yaml", batch=8, imgsz=512)
```
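
If you would rather not tune the batch size by hand, Ultralytics can estimate it: `batch=-1` enables AutoBatch, which probes the GPU and picks a batch that fits comfortably in VRAM (the exact memory fraction it targets may vary by library version):

```python
from ultralytics import YOLO

model = YOLO("yolov10m.pt")
# batch=-1 asks AutoBatch to estimate the largest batch size that fits on the GPU
model.train(data="data.yaml", batch=-1, imgsz=640, epochs=100)
```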

### Slow Training Speed

```python
# Enable caching (loads dataset into RAM/GPU)
model.train(data="data.yaml", cache=True)  # Cache to RAM
model.train(data="data.yaml", cache="disk")  # Cache to disk

# Increase workers (careful: too many can slow down)
model.train(data="data.yaml", workers=8)
```

### Low mAP / Poor Detection

```bash
# Verify labels are correct (normalized, within 0-1)
python3 -c "
from ultralytics.data.utils import check_det_dataset
check_det_dataset('custom_dataset.yaml')
"

# Visualize training samples
python3 -c "
from ultralytics import YOLO
model = YOLO('yolov10m.pt')
model.train(data='data.yaml', epochs=1, batch=4, plots=True)
# Check train_batch*.jpg in the run directory (runs/detect/train/ by default)
"
```

***

## Performance Reference (Clore.ai GPUs)

| Model        | GPU      | Batch | FPS (inference) | mAP50-95 |
| ------------ | -------- | ----- | --------------- | -------- |
| YOLOv10n     | RTX 3090 | 1     | 1,200           | 38.5     |
| YOLOv10s     | RTX 3090 | 1     | 780             | 46.8     |
| YOLOv10m     | RTX 4090 | 1     | 950             | 51.3     |
| YOLOv10x     | RTX 4090 | 1     | 380             | 54.4     |
| YOLOv9e      | A100 40G | 1     | 720             | 55.6     |
| YOLOv10x TRT | RTX 4090 | 1     | 920             | 54.2     |

***

## Additional Resources

* [Ultralytics Documentation](https://docs.ultralytics.com/)
* [YOLOv9 Paper](https://arxiv.org/abs/2402.13616)
* [YOLOv10 Paper](https://arxiv.org/abs/2405.14458)
* [Roboflow Universe](https://universe.roboflow.com/) — 100K+ public datasets
* [Ultralytics HUB](https://hub.ultralytics.com/) — Cloud training platform
* [COCO Dataset](https://cocodataset.org/) — Standard benchmark dataset

***

*YOLOv9 and YOLOv10 on Clore.ai GPU rentals provide an affordable path to training custom object detection models and deploying real-time inference pipelines — without the overhead of AWS SageMaker or Google Vertex AI.*

***

## Clore.ai GPU Recommendations

| Use Case             | Recommended GPU | Est. Cost on Clore.ai |
| -------------------- | --------------- | --------------------- |
| Development/Testing  | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production Inference | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large-batch Training | A100 80GB       | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
