dataset-yolo-script/sam2-cpu/PLAN.md (2026-02-04)
Feature Plan: YOLO-Assisted Auto-Annotation + Mini Frigate RKNN

Overview

Create two integrated components:

  1. YOLO-Assisted Annotator - Use pretrained YOLOv9t to auto-annotate video frames
  2. Frigate-Mini-RKNN - Standalone mini fork of Frigate for RKNN inference with MP4 input

Goals

  • Auto-annotate videos using YOLOv9t pretrained model (replaces manual SAM2 prompts)
  • Minimal Frigate fork with multiple detector backends:
    • RKNN - Rockchip NPU acceleration (RK3588, RK3568, etc.)
    • ONNX - CPU-only inference (cross-platform, no special hardware)
    • YOLO - Ultralytics backend (CPU/CUDA)
  • MP4 file as camera feed source
  • Output: Clean snapshot + YOLO format label pairs
  • Simple text-based configuration
  • Debug mode with object list visualization

Detector Backends Comparison

| Backend | Hardware | Performance | Platform | Use Case |
|---------|----------|-------------|----------|----------|
| RKNN | Rockchip NPU | Fast (30+ FPS) | ARM (RK3588/3568) | Production on Rockchip SBC |
| ONNX | CPU | Medium (5-15 FPS) | Any (x86/ARM) | Development, testing, no GPU |
| YOLO | CPU/CUDA | Fast with GPU | Any | Development, CUDA systems |

Recommended usage:

  1. Development/Testing: Use ONNX backend on any CPU
  2. Production on Rockchip: Convert to RKNN, deploy on NPU
  3. Production on x86/CUDA: Use YOLO backend with GPU
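
The backend choice and fallback behavior above can be sketched as a small resolution step at startup. This is a hedged sketch: only the RKNN-to-ONNX fallback is specified later in this plan; the other chains and the `available` set are illustrative assumptions (real code would probe the rknnlite / onnxruntime / ultralytics imports instead).

```python
# Sketch of detector backend resolution. Only the rknn -> onnx fallback is
# specified by this plan; the other chains are illustrative assumptions.
FALLBACK_ORDER = {
    "rknn": ["rknn", "onnx"],   # per the plan: fall back to ONNX/CPU
    "yolo": ["yolo", "onnx"],   # assumption: degrade to CPU ONNX
    "onnx": ["onnx"],
}

def resolve_backend(requested: str, available: set) -> str:
    """Return the first usable backend from the configured fallback chain."""
    for candidate in FALLBACK_ORDER.get(requested, [requested]):
        if candidate in available:  # real code would probe imports instead
            return candidate
    raise RuntimeError(f"No usable detector backend for {requested!r}")
```

For example, a config requesting `rknn` on a machine with only ONNX Runtime installed resolves to the ONNX backend instead of failing outright.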

Project Structure

sam2-yolo-pipeline/
├── notebooks/                      # Existing Kaggle notebooks
├── utils/                          # Existing utilities
├── yolo_annotator/                 # NEW: YOLO-assisted annotation
│   ├── __init__.py
│   ├── annotator.py               # Core YOLOv9t annotator
│   ├── video_source.py            # MP4/RTSP video source handler
│   ├── export.py                  # Snapshot + label export
│   └── visualizer.py              # Debug visualization
├── frigate_mini/                   # NEW: Mini Frigate fork
│   ├── __init__.py
│   ├── app.py                     # Main application entry
│   ├── config/
│   │   ├── __init__.py
│   │   ├── schema.py              # Config validation
│   │   └── loader.py              # YAML config loader
│   ├── detector/
│   │   ├── __init__.py
│   │   ├── base.py                # Base detector interface
│   │   ├── rknn_detector.py       # RKNN backend
│   │   ├── onnx_detector.py       # ONNX fallback
│   │   └── yolo_detector.py       # Ultralytics YOLO fallback
│   ├── video/
│   │   ├── __init__.py
│   │   ├── mp4_source.py          # MP4 file source
│   │   └── frame_processor.py     # Frame processing pipeline
│   ├── output/
│   │   ├── __init__.py
│   │   ├── snapshot.py            # Snapshot capture
│   │   └── annotation.py          # YOLO label writer
│   └── debug/
│       ├── __init__.py
│       ├── object_list.py         # Detected objects display
│       └── visualizer.py          # Bounding box overlay
├── configs/                        # NEW: Configuration files
│   ├── annotator.yaml             # Annotator settings
│   └── frigate_mini.yaml          # Frigate-mini settings
├── models/                         # NEW: Model weights storage
│   └── .gitkeep
├── output/                         # NEW: Default output directory
│   ├── snapshots/
│   ├── labels/
│   └── debug/
├── scripts/                        # NEW: CLI scripts
│   ├── annotate.py                # Run annotation pipeline
│   ├── frigate_mini.py            # Run mini frigate
│   ├── convert_to_onnx.py         # Export .pt model to ONNX
│   └── convert_to_rknn.py         # Convert ONNX to RKNN
└── requirements.txt               # Updated dependencies

Component 1: YOLO-Assisted Annotator

Purpose

Replace SAM2 auto-annotation with faster YOLOv9t-based detection for creating training datasets.

Workflow

MP4 Video → Frame Extraction → YOLOv9t Detection → Filter/NMS → YOLO Labels + Snapshots

Features

  1. Model Loading

    • Load pretrained YOLOv9t (.pt file)
    • Support custom trained models
    • Configurable confidence threshold
    • Configurable NMS threshold
  2. Video Processing

    • MP4 file input
    • Configurable FPS sampling
    • Frame skip / time range selection
    • Resolution scaling
  3. Detection Filtering

    • Filter by class IDs
    • Filter by confidence score
    • Filter by bbox size (min/max area)
    • Filter by aspect ratio
  4. Output Generation

    • Clean snapshot images (no annotations drawn)
    • YOLO format label files (.txt)
    • Optional debug images with boxes drawn
    • JSON manifest of all detections
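
The filtering rules in item 3 can be sketched as a single predicate. This is a minimal sketch over plain dicts; the field names mirror the Detection dataclass defined later in this plan, and the default thresholds mirror annotator.yaml.

```python
# Sketch of the Detection Filtering rules: class, confidence, and bbox-area
# checks. `det` is a plain dict here; real code would use the Detection
# dataclass from this plan.
def keep_detection(det, classes=None, min_conf=0.3, min_area=100, max_area=None):
    x1, y1, x2, y2 = det["bbox"]
    area = (x2 - x1) * (y2 - y1)
    if classes is not None and det["class_id"] not in classes:
        return False
    if det["confidence"] < min_conf:
        return False
    if area < min_area:
        return False
    if max_area is not None and area > max_area:
        return False
    return True
```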

Configuration (annotator.yaml)

# YOLO-Assisted Annotator Configuration

model:
  path: "models/yolov9t.pt"          # Path to YOLO model
  device: "cuda"                      # cuda, cpu, or rknn
  conf_threshold: 0.25                # Confidence threshold
  iou_threshold: 0.45                 # NMS IoU threshold

video:
  source: "input/video.mp4"           # Video file path
  sample_fps: 2                       # Frames per second to extract
  max_frames: null                    # Max frames (null = all)
  start_time: 0                       # Start time in seconds
  end_time: null                      # End time (null = end of video)
  resize: null                        # [width, height] or null

detection:
  classes: null                       # Class IDs to keep (null = all)
  min_confidence: 0.3                 # Minimum confidence to save
  min_area: 100                       # Minimum bbox area in pixels
  max_area: null                      # Maximum bbox area (null = no limit)
  min_size: 0.01                      # Minimum bbox dimension (normalized)

output:
  directory: "output/annotations"     # Output directory
  save_snapshots: true                # Save clean images
  save_labels: true                   # Save YOLO labels
  save_debug: true                    # Save debug visualizations
  save_manifest: true                 # Save JSON manifest
  image_format: "jpg"                 # jpg or png
  image_quality: 95                   # JPEG quality (1-100)

classes:
  # Class name mapping (for display/filtering)
  0: "person"
  1: "bicycle"
  2: "car"
  # ... etc
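
The video: settings above imply the following frame-sampling arithmetic. A sketch only: real code would read the fps and frame count from the container, and `round()` vs. `floor()` for the step is an implementation choice.

```python
# Which frame indices to keep, given the video: settings above.
def sampled_frames(video_fps, total_frames, sample_fps=2,
                   start_time=0.0, end_time=None, max_frames=None):
    step = max(1, round(video_fps / sample_fps))
    start = int(start_time * video_fps)
    stop = total_frames if end_time is None else min(total_frames, int(end_time * video_fps))
    frames = list(range(start, stop, step))
    return frames if max_frames is None else frames[:max_frames]
```

For a 30 FPS clip sampled at 2 FPS, every 15th frame is kept.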

Component 2: Frigate-Mini-RKNN

Purpose

Minimal standalone Frigate-like system for RKNN inference on Rockchip devices, outputting annotation pairs.

Workflow

MP4 Feed → Frame Decode → RKNN Inference → Object Tracking → Snapshot + Label Export

Features

  1. Video Input

    • MP4 file as "camera" source
    • Loop playback option
    • Configurable FPS limit
    • Multiple video sources support
  2. RKNN Detector

    • Load RKNN model (.rknn file)
    • NPU acceleration on Rockchip SoCs
    • Fallback to ONNX/CPU if RKNN unavailable
    • Batch inference support
  3. Object Detection

    • YOLOv9t architecture support
    • Configurable input resolution
    • Post-processing (NMS, filtering)
    • Class filtering
  4. Snapshot System

    • Capture on detection trigger
    • Configurable cooldown period
    • Clean snapshots (no overlays)
    • Crop to detected object (optional)
  5. Annotation Export

    • YOLO format labels
    • Synchronized snapshot-label pairs
    • Auto-naming with timestamps
    • Dataset structure output
  6. Debug Mode

    • Real-time object list display
    • Bounding box visualization
    • FPS counter
    • Detection statistics
    • Save debug frames
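
The trigger-and-cooldown behavior in items 4-5 can be sketched as a small stateful class. Per-class cooldown is an assumption here (the plan says "per object"); a per-track cooldown would instead key on the tracker's IDs.

```python
# Sketch of the snapshot trigger: fire at most once per tracked class
# within the cooldown window, and only above the minimum score.
class SnapshotTrigger:
    def __init__(self, objects, min_score=0.5, cooldown=2.0):
        self.objects = set(objects)
        self.min_score = min_score
        self.cooldown = cooldown
        self._last = {}  # class name -> timestamp of last snapshot

    def should_fire(self, class_name, score, now):
        if class_name not in self.objects or score < self.min_score:
            return False
        if now - self._last.get(class_name, float("-inf")) < self.cooldown:
            return False
        self._last[class_name] = now
        return True
```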

Configuration (frigate_mini.yaml)

Option A: ONNX CPU (For development and CPU-only systems)

# Frigate-Mini Configuration - ONNX CPU Mode
# Works on any system without special hardware

debug: true
log_level: "info"

detector:
  type: "onnx"                        # Use ONNX Runtime
  model_path: "models/yolov9t.onnx"   # ONNX model file
  input_size: [640, 640]              # Model input resolution
  conf_threshold: 0.25                # Detection confidence
  nms_threshold: 0.45                 # NMS threshold
  
  # ONNX specific settings
  onnx:
    device: "cpu"                     # cpu or cuda
    num_threads: 4                    # CPU threads (0 = auto)
    optimization_level: "all"         # none, basic, extended, all

Option B: RKNN NPU (For Rockchip devices)

# Frigate-Mini Configuration - RKNN NPU Mode
# For Rockchip SBCs (RK3588, RK3568, etc.)

debug: true
log_level: "info"

detector:
  type: "rknn"                        # Use RKNN Runtime
  model_path: "models/yolov9t.rknn"   # RKNN model file
  input_size: [640, 640]              # Model input resolution
  conf_threshold: 0.25                # Detection confidence
  nms_threshold: 0.45                 # NMS threshold
  
  # RKNN specific
  rknn:
    target_platform: "rk3588"         # rk3588, rk3568, rk3566, etc.
    core_mask: 7                      # NPU core mask (7 = all 3 cores on RK3588)
  
  # Fallback to ONNX if RKNN fails
  fallback:
    enabled: true
    type: "onnx"
    device: "cpu"

Option C: Ultralytics YOLO (For CUDA systems)

# Frigate-Mini Configuration - Ultralytics YOLO Mode
# For systems with NVIDIA GPU

debug: true
log_level: "info"

detector:
  type: "yolo"                        # Use Ultralytics
  model_path: "models/yolov9t.pt"     # PyTorch model file
  conf_threshold: 0.25
  nms_threshold: 0.45
  
  # YOLO specific
  yolo:
    device: "cuda"                    # cpu, cuda, cuda:0, etc.
    half: true                        # FP16 inference (faster on GPU)

Full Configuration Example (with all options)

# Video sources (cameras)
cameras:
  front_door:
    enabled: true
    source: "input/front_door.mp4"  # MP4 file path
    fps: 5                          # Processing FPS limit
    loop: true                      # Loop video playback

    # Detection zones (optional)
    detect:
      enabled: true
      width: 1280                   # Detection resolution
      height: 720

    # Object filtering
    objects:
      track:
        - person
        - car
        - dog
      filters:
        person:
          min_area: 1000            # Minimum area in pixels
          max_area: 500000
          min_score: 0.4

  backyard:
    enabled: true
    source: "input/backyard.mp4"
    fps: 5
    loop: true

# Snapshot settings
snapshots:
  enabled: true
  output_dir: "output/snapshots"

  # Trigger settings
  trigger:
    objects:                        # Objects that trigger snapshot
      - person
      - car
    min_score: 0.5                  # Minimum score to trigger
    cooldown: 2.0                   # Seconds between snapshots per object

  # Output settings
  format: "jpg"                     # jpg or png
  quality: 95                       # JPEG quality
  clean: true                       # No annotations on snapshot
  crop: false                       # Crop to object bbox
  retain_days: 7                    # Days to keep snapshots

# Annotation export
annotations:
  enabled: true
  output_dir: "output/labels"
  format: "yolo"                    # YOLO format

  # Pairing
  pair_with_snapshots: true         # Create snapshot-label pairs

  # Filtering
  min_score: 0.3
  classes: null                     # null = all classes

# Debug settings
debug_output:
  enabled: true
  output_dir: "output/debug"

  # Object list display
  object_list:
    enabled: true
    show_confidence: true
    show_class: true
    show_bbox: true

  # Visualization
  visualization:
    enabled: true
    draw_boxes: true
    draw_labels: true
    draw_confidence: true
    box_thickness: 2
    font_scale: 0.5

  # Statistics
  stats:
    show_fps: true
    show_detection_count: true
    log_interval: 100               # Log stats every N frames

# Class definitions
class_names:
  0: person
  1: bicycle
  2: car
  3: motorcycle
  4: airplane
  5: bus
  6: train
  7: truck
  8: boat
  9: traffic light
  10: fire hydrant
  # ... COCO classes continue


---

## Module Specifications

### 1. yolo_annotator/annotator.py

```python
class YOLOAnnotator:
    """YOLO-based automatic video annotator."""
    
    def __init__(self, config_path: str):
        """Load configuration and initialize model."""
        
    def load_model(self, model_path: str, device: str) -> None:
        """Load YOLOv9t model."""
        
    def process_video(self, video_path: str) -> AnnotationResult:
        """Process entire video and generate annotations."""
        
    def process_frame(self, frame: np.ndarray) -> List[Detection]:
        """Process single frame and return detections."""
        
    def filter_detections(self, detections: List[Detection]) -> List[Detection]:
        """Apply filtering rules to detections."""
        
    def export_annotations(self, output_dir: str) -> None:
        """Export all annotations to YOLO format."""
```

### 2. frigate_mini/detector/rknn_detector.py

```python
class RKNNDetector:
    """RKNN-based YOLO detector for Rockchip NPU."""

    def __init__(self, model_path: str, target_platform: str):
        """Initialize RKNN runtime."""

    def load_model(self) -> bool:
        """Load RKNN model to NPU."""

    def preprocess(self, frame: np.ndarray) -> np.ndarray:
        """Preprocess frame for inference."""

    def inference(self, input_data: np.ndarray) -> np.ndarray:
        """Run inference on NPU."""

    def postprocess(self, outputs: np.ndarray) -> List[Detection]:
        """Parse YOLO outputs and apply NMS."""

    def detect(self, frame: np.ndarray) -> List[Detection]:
        """Full detection pipeline."""

    def release(self) -> None:
        """Release RKNN resources."""
```

### 3. frigate_mini/output/annotation.py

```python
class AnnotationWriter:
    """Write YOLO format annotation files."""

    def __init__(self, output_dir: str, class_names: Dict[int, str]):
        """Initialize annotation writer."""

    def write_label(self,
                    image_name: str,
                    detections: List[Detection],
                    image_size: Tuple[int, int]) -> str:
        """Write YOLO label file for image."""

    def detection_to_yolo(self,
                          detection: Detection,
                          image_width: int,
                          image_height: int) -> str:
        """Convert detection to YOLO format string."""

    def create_dataset_structure(self) -> None:
        """Create YOLO dataset directory structure."""

    def write_data_yaml(self, train_path: str, val_path: str) -> str:
        """Generate data.yaml for training."""
```

### 4. frigate_mini/debug/object_list.py

```python
class ObjectListDisplay:
    """Display detected objects in debug mode."""

    def __init__(self, config: Dict):
        """Initialize display settings."""

    def update(self, detections: List[Detection]) -> None:
        """Update object list with new detections."""

    def format_detection(self, detection: Detection) -> str:
        """Format single detection for display."""

    def print_list(self) -> None:
        """Print current object list to console."""

    def save_snapshot_with_labels(self,
                                   frame: np.ndarray,
                                   detections: List[Detection],
                                   output_path: str) -> None:
        """Save debug image with annotations."""
```

Data Structures

Detection

```python
@dataclass
class Detection:
    class_id: int           # Class index
    class_name: str         # Class name
    confidence: float       # Detection confidence (0-1)
    bbox: BBox              # Bounding box
    track_id: Optional[int] # Tracking ID (if tracked)
    timestamp: float        # Frame timestamp
    frame_id: int           # Frame number

@dataclass
class BBox:
    x1: float               # Top-left x (pixels)
    y1: float               # Top-left y (pixels)
    x2: float               # Bottom-right x (pixels)
    y2: float               # Bottom-right y (pixels)

    def to_yolo(self, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
        """Convert to YOLO format (x_center, y_center, width, height) normalized."""

    def area(self) -> float:
        """Calculate bbox area in pixels."""
```

AnnotationPair

```python
@dataclass
class AnnotationPair:
    image_path: str          # Path to snapshot image
    label_path: str          # Path to YOLO label file
    detections: List[Detection]
    timestamp: datetime
    camera_name: str
    frame_id: int
```

Output Format

Directory Structure

output/
├── snapshots/
│   ├── front_door/
│   │   ├── 20240115_143022_001.jpg
│   │   ├── 20240115_143025_002.jpg
│   │   └── ...
│   └── backyard/
│       └── ...
├── labels/
│   ├── front_door/
│   │   ├── 20240115_143022_001.txt
│   │   ├── 20240115_143025_002.txt
│   │   └── ...
│   └── backyard/
│       └── ...
├── debug/
│   ├── front_door/
│   │   ├── 20240115_143022_001_debug.jpg
│   │   └── ...
│   └── object_log.txt
└── manifest.json
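
The timestamped naming in the tree above pairs each snapshot with its label file by sharing a stem. A sketch, assuming the YYYYMMDD_HHMMSS plus zero-padded sequence pattern shown:

```python
# Build the matching snapshot/label paths for one capture:
# <root>/snapshots/<camera>/<YYYYMMDD_HHMMSS>_<seq>.jpg and the .txt twin.
from datetime import datetime

def pair_paths(camera, seq, when, root="output"):
    stem = f"{when:%Y%m%d_%H%M%S}_{seq:03d}"
    return (f"{root}/snapshots/{camera}/{stem}.jpg",
            f"{root}/labels/{camera}/{stem}.txt")
```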

YOLO Label Format

# {class_id} {x_center} {y_center} {width} {height}
0 0.456789 0.321456 0.123456 0.234567
2 0.789012 0.654321 0.098765 0.176543
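
Emitting one such line per detection is a one-liner; six decimal places matches the example above.

```python
# Format one YOLO label line: "class x_center y_center width height",
# with all but the class id normalized to [0, 1].
def yolo_line(class_id, xc, yc, w, h):
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```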

Manifest JSON

{
  "created": "2024-01-15T14:30:22",
  "model": "yolov9t.rknn",
  "total_frames": 1500,
  "total_detections": 3420,
  "pairs": [
    {
      "image": "snapshots/front_door/20240115_143022_001.jpg",
      "label": "labels/front_door/20240115_143022_001.txt",
      "camera": "front_door",
      "frame_id": 150,
      "timestamp": "2024-01-15T14:30:22.500",
      "detections": [
        {"class": "person", "confidence": 0.87},
        {"class": "car", "confidence": 0.92}
      ]
    }
  ]
}

Implementation Phases

Phase 1: Core YOLO Annotator (Week 1)

  • Create yolo_annotator/ module structure
  • Implement YOLOAnnotator class with Ultralytics backend
  • Implement video source handling
  • Implement YOLO label export
  • Create annotator.yaml config loader
  • Add CLI script scripts/annotate.py
  • Test with sample video

Phase 2: Frigate-Mini Base (Week 2)

  • Create frigate_mini/ module structure
  • Implement config schema and loader
  • Implement base detector interface
  • Implement ONNX detector (for testing)
  • Implement MP4 video source
  • Implement basic frame processing loop
  • Test basic detection pipeline

Phase 3: RKNN Integration (Week 3)

  • Implement RKNN detector backend
  • Create ONNX to RKNN conversion script
  • Test on Rockchip hardware (RK3588/RK3568)
  • Optimize for NPU performance
  • Add fallback mechanism

Phase 4: Snapshot & Annotation System (Week 4)

  • Implement snapshot capture system
  • Implement annotation writer
  • Implement snapshot-label pairing
  • Add trigger-based capture logic
  • Create manifest generator

Phase 5: Debug System (Week 5)

  • Implement object list display
  • Implement debug visualization
  • Add statistics tracking
  • Create debug frame saver
  • Add console and file logging

Phase 6: Integration & Testing (Week 6)

  • Integration testing
  • Performance optimization
  • Documentation
  • Example configs for common use cases
  • Package for distribution

Dependencies

New Requirements

# requirements.txt additions

# YOLO
ultralytics>=8.0.0

# RKNN (install separately based on platform)
# rknn-toolkit2  # For conversion (x86)
# rknn-toolkit-lite2  # For inference (ARM)

# Video processing
opencv-python>=4.8.0
av>=10.0.0                  # PyAV for efficient video decoding

# Configuration
pyyaml>=6.0
pydantic>=2.0               # Config validation

# Utilities
tqdm>=4.65.0
numpy>=1.24.0

RKNN Installation Notes

# On x86 host (for model conversion):
pip install rknn-toolkit2

# On Rockchip device (for inference):
pip install rknn-toolkit-lite2

# Or install from Rockchip GitHub releases

Usage Examples

1. ONNX Workflow (CPU)

# Step 1: Download pretrained YOLOv9t
wget https://github.com/ultralytics/assets/releases/download/v8.1.0/yolov9t.pt -O models/yolov9t.pt

# Step 2: Convert to ONNX
python scripts/convert_to_onnx.py --input models/yolov9t.pt --output models/yolov9t.onnx

# Step 3a: Auto-annotate video (CPU)
python scripts/annotate.py --config configs/annotator_cpu.yaml
# Or with CLI args:
python scripts/annotate.py \
    --model models/yolov9t.onnx \
    --video input/video.mp4 \
    --device cpu

# Step 3b: Run Frigate-Mini (CPU)
python scripts/frigate_mini.py --config configs/frigate_mini_cpu.yaml
# Or with CLI args:
python scripts/frigate_mini.py \
    --model models/yolov9t.onnx \
    --video input/video.mp4 \
    --output output/ \
    --debug

2. RKNN Workflow (Rockchip NPU)

# Step 1: Convert ONNX to RKNN (on x86 host)
python scripts/convert_to_rknn.py \
    --input models/yolov9t.onnx \
    --output models/yolov9t.rknn \
    --platform rk3588

# Step 2: Copy to Rockchip device and run
python scripts/frigate_mini.py --config configs/frigate_mini.yaml
# Or:
python scripts/frigate_mini.py \
    --model models/yolov9t.rknn \
    --video input/video.mp4 \
    --platform rk3588

3. GPU Workflow (CUDA)

# Using Ultralytics directly with GPU
python scripts/annotate.py \
    --model models/yolov9t.pt \
    --video input/video.mp4 \
    --device cuda

Quick Reference

| Task | CPU (ONNX) | RKNN (NPU) | GPU (CUDA) |
|------|------------|------------|------------|
| Model file | .onnx | .rknn | .pt |
| Config | *_cpu.yaml | frigate_mini.yaml | Use --device cuda |
| Speed | 5-15 FPS | 30+ FPS | 50+ FPS |
| Hardware | Any CPU | Rockchip SBC | NVIDIA GPU |

Future Enhancements

  1. RTSP Support - Add real camera stream input
  2. Object Tracking - Add ByteTrack/BoT-SORT for consistent IDs
  3. Web UI - Simple web interface for monitoring
  4. Multi-model - Support different models per camera
  5. Event System - Webhooks for detection events
  6. Auto-labeling Refinement - Use SAM2 to refine YOLO boxes
  7. Active Learning - Flag low-confidence detections for review

References