AutoAnnotation is a pipeline designed to significantly speed up annotating objects in video streams and images. It was originally created for video streams from recycling facilities, but it is built to be applicable to any use case. It combines state-of-the-art computer vision models to automatically generate initial bounding box and segmentation mask annotations, which can then be efficiently reviewed and corrected in CVAT.
The pipeline leverages:
- DINO-X (via DDS Cloud API) for robust object detection
- SAM2 (Segment Anything Model 2) for high-quality segmentation masks
- CVAT for intuitive manual correction and refinement
For detailed information about the core annotation mechanism and DINO-X integration, please refer to the Grounded-SAM-2 repository, particularly the DINO-X demo section.
- System Requirements
- Installation
- Project Structure
- Workflow Guide
- Configuration Guide
- Troubleshooting
- Best Practices
- Contributing
- NVIDIA GPU with CUDA support (recommended for SAM2)
- Minimum 16GB RAM
- Sufficient disk space for:
- Video files
- Extracted frames (can be large)
- Model checkpoints
- Annotation files
- Linux operating system (tested on Ubuntu 20.04 LTS)
- Python 3.10 or later
- CUDA 12.1 or later
- Git
1. **Clone the Repository**

   ```bash
   git clone https://github.com/yourusername/AutoAnnotation.git
   cd AutoAnnotation
   ```

2. **Set Up Python Environment**

   ```bash
   # Create and activate a virtual environment
   python -m venv venv
   source venv/bin/activate   # On Linux/Mac
   # or
   .\venv\Scripts\activate    # On Windows
   ```

3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Install Grounded-SAM-2**

   ```bash
   # Clone the Grounded-SAM-2 repository
   git clone https://github.com/IDEA-Research/Grounded-SAM-2.git
   cd Grounded-SAM-2

   # Install SAM2 (required for segmentation)
   pip install -e .

   # Download SAM2 checkpoints (required)
   cd checkpoints
   bash download_ckpts.sh
   cd ../gdino_checkpoints
   bash download_ckpts.sh
   cd ../..
   ```
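Once the checkpoints are downloaded, it can be worth confirming that SAM2 imports and builds before moving on. The snippet below is a minimal sanity check, not part of the repository; the checkpoint filename and config path are assumptions that depend on which SAM2 variant `download_ckpts.sh` fetched, so adjust them to match your `checkpoints/` directory.

```python
# Minimal SAM2 sanity check; checkpoint/config names are assumptions --
# adjust them to whatever download_ckpts.sh actually fetched on your machine.
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

CHECKPOINT = "checkpoints/sam2.1_hiera_large.pt"    # assumed filename
MODEL_CFG = "configs/sam2.1/sam2.1_hiera_l.yaml"    # assumed config path

device = "cuda" if torch.cuda.is_available() else "cpu"
predictor = SAM2ImagePredictor(build_sam2(MODEL_CFG, CHECKPOINT, device=device))
print(f"SAM2 predictor built successfully on {device}")
```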
5. **Set Up API Access**

   - Create a `.env` file in the project root:

     ```bash
     touch .env
     ```

   - Add your DDS Cloud API token (note: the variable name must be `API_TOKEN`):

     ```
     API_TOKEN=your_token_here
     ```

   - Get your API token from: https://deepdataspace.com/request_api
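The annotation script reads this token from the environment. A minimal sketch of how such a token is typically loaded is shown below; it assumes the `python-dotenv` package is available (check `requirements.txt`), and the actual script may load it differently.

```python
# Minimal sketch of loading the DDS Cloud token from .env.
# Assumes python-dotenv; the repository script may load it differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
api_token = os.environ.get("API_TOKEN")
if not api_token:
    raise RuntimeError("API_TOKEN not found -- check your .env file")
```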
```
AutoAnnotation/
├── scripts/                     # Python scripts for the pipeline
│   ├── extract_frames.py        # Frame extraction script
│   ├── nested_coco_sam_dinox_v2.py  # Main annotation script
│   └── merge_coco_json.py       # COCO JSON merging script
│
├── input_videos/                # Input video directory (create this)
│   ├── #1/                      # Camera/source 1
│   │   └── video1.MP4
│   └── #2/                      # Camera/source 2
│       └── videoA.MP4
│
├── frames/                      # Extracted frames (created automatically)
│   ├── #1/                      # Mirrors input_videos structure
│   │   └── video1_frame_0000.png
│   └── #2/
│       └── videoA_frame_0000.png
│
├── outputs/                     # Annotation outputs (created automatically)
│   ├── annotations/             # Per-frame COCO JSON annotations
│   │   ├── #1/
│   │   │   └── video1_frame_0000_coco_annotation.json
│   │   └── #2/
│   └── visualizations/          # Optional visualization images
│
├── annotations_per_video/       # Merged annotations (created automatically)
│   ├── #1_video1_coco_merged.json
│   └── #2_videoA_coco_merged.json
│
├── checkpoints/                 # Model checkpoints (created during setup)
│   ├── sam2/                    # SAM2 model weights
│   └── gdino_checkpoints/       # Grounding DINO model weights
│
├── sam2/                        # SAM2 model code (created during setup)
│   └── ...                      # Model implementation files
│
├── data/                        # Additional data files (if any)
│
├── slurm_logs/                  # Log files for SLURM jobs (if using cluster)
│
├── requirements.txt             # Python dependencies
├── setup.py                     # Package setup file
└── .env                         # API tokens (create this, not in git)
```
- **User-Created Directories/Files** (you need to create these):
  - `input_videos/`: Place your input video files here
  - `.env`: Create this file to store your API tokens

- **Automatically Created Directories** (created by the scripts):
  - `frames/`: Created by `extract_frames.py`
  - `outputs/`: Created by `nested_coco_sam_dinox_v2.py`
  - `annotations_per_video/`: Created by `merge_coco_json.py`

- **Setup-Created Directories** (created during installation):
  - `checkpoints/`: Created when downloading model weights
  - `sam2/`: Created when installing SAM2
  - `slurm_logs/`: Created if using SLURM for cluster computing

Note: All automatically created directories will be generated when you run the respective scripts. You only need to create the `input_videos/` directory and the `.env` file before starting the pipeline.
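A small pre-flight check along these lines can catch a missing directory or token before any processing starts. This is a hypothetical helper, not a script shipped with the repository:

```python
# preflight_check.py -- hypothetical helper, not part of the repository.
# Verifies the user-created inputs exist before the pipeline is started.
from pathlib import Path

def preflight_check() -> None:
    problems = []
    if not Path("input_videos").is_dir():
        problems.append("input_videos/ directory is missing")
    if not Path(".env").is_file():
        problems.append(".env file is missing (should contain API_TOKEN)")
    if problems:
        raise SystemExit("Pre-flight check failed:\n  - " + "\n  - ".join(problems))
    print("Pre-flight check passed.")

if __name__ == "__main__":
    preflight_check()
```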
1. **Video File Organization**
   - Place your input videos in the `input_videos` directory
   - Organize videos by camera/source in subdirectories
   - Supported format: MP4 (other formats may work but are untested)
   - Example structure:

     ```
     input_videos/
     ├── #1/
     │   ├── video1.MP4
     │   └── video2.MP4
     └── #2/
         ├── videoA.MP4
         └── videoB.MP4
     ```
2. **Configure Frame Extraction**
   - Open `scripts/extract_frames.py`
   - Set the following parameters:

     ```python
     INPUT_ROOT_DIR = "input_videos"   # Input video directory
     OUTPUT_ROOT_DIR = "frames"        # Output frames directory
     FRAME_INTERVAL_N = 30             # Extract every Nth frame
     IMAGE_FORMAT = "png"              # Output image format
     VIDEO_EXTENSION = ".MP4"          # Input video extension
     ```
3. **Run Frame Extraction**

   ```bash
   python scripts/extract_frames.py
   ```

   - This will create a mirrored directory structure in `frames/` (the core extraction logic is sketched after this step)
   - Example output: `frames/#1/video1_frame_0000.png`
4. **Configure Annotation Script**
   - Open `scripts/nested_coco_sam_dinox_v2.py`
   - Set key parameters:

     ```python
     # Core detection parameters
     BOX_THRESHOLD = 0.35          # Detection confidence threshold
     IOU_THRESHOLD = 0.5           # IoU threshold for NMS

     # Processing parameters
     FRAMES_DIR = "frames"         # Input frames directory
     FRAME_GLOB_PATTERN = "*.png"  # Frame file pattern
     WITH_SLICE_INFERENCE = True   # Enable for high-res images
     OUTPUT_DIR_BASE = "outputs"   # Output directory

     # Test mode parameters (recommended for parameter tuning)
     TEST_MODE = True              # Enable test mode
     NUM_TEST_FRAMES = 10          # Number of frames to process in test mode
     ```
5. **Parameter Tuning**
   - The annotation quality depends heavily on parameter configuration
   - We recommend using test mode (`TEST_MODE = True`) to experiment with different parameters
   - Key parameters to tune:
     - `BOX_THRESHOLD`: Controls detection confidence
       - Higher values (e.g., 0.5) reduce false positives but may miss objects
       - Lower values (e.g., 0.3) catch more objects but may include false positives
     - `IOU_THRESHOLD`: Controls overlap between detections
       - Lower values (e.g., 0.3) for crowded scenes with overlapping objects
       - Higher values (e.g., 0.7) for scenes with well-separated objects
     - `WITH_SLICE_INFERENCE`: Recommended for high-resolution images
       - Helps detect small objects in large images
       - Increases processing time but improves detection quality
   - Test different parameter combinations on a small subset of frames
   - Monitor API credit usage during testing
   - Once optimal parameters are found, disable test mode for full processing (see the sketch after this list for how the two thresholds interact)
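To make the two thresholds concrete, the sketch below shows the standard post-processing they typically control: discard boxes scoring below `BOX_THRESHOLD`, then suppress overlapping boxes with non-maximum suppression at `IOU_THRESHOLD`. This is an illustration of the concept using `torchvision.ops.nms`, not the code in `nested_coco_sam_dinox_v2.py`.

```python
# Illustration of BOX_THRESHOLD / IOU_THRESHOLD post-processing; not the
# repository script. boxes: (N, 4) xyxy tensor, scores: (N,) tensor.
import torch
from torchvision.ops import nms

def filter_detections(boxes: torch.Tensor,
                      scores: torch.Tensor,
                      box_threshold: float = 0.35,
                      iou_threshold: float = 0.5):
    keep = scores >= box_threshold            # drop low-confidence detections
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_threshold)  # suppress overlapping boxes
    return boxes[kept], scores[kept]

# Example: two heavily overlapping boxes and one low-confidence box.
boxes = torch.tensor([[0., 0., 100., 100.],
                      [5., 5., 105., 105.],
                      [200., 200., 250., 250.]])
scores = torch.tensor([0.9, 0.8, 0.2])
print(filter_detections(boxes, scores))  # keeps one of the overlapping pair
```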
6. **Run Automated Annotation**

   ```bash
   python scripts/nested_coco_sam_dinox_v2.py
   ```

   - This will generate COCO JSON annotations for each frame
   - Output structure: `outputs/annotations/#1/video1_frame_0000_coco_annotation.json`
   - In test mode, only `NUM_TEST_FRAMES` frames are processed to save API credits
7. **Configure Merging Script**
   - Open `scripts/merge_coco_json.py`
   - Set parameters:

     ```python
     NESTED_PER_FRAME_ANNOTATIONS_ROOT = "outputs/annotations"
     ALL_FRAMES_IMAGES_ROOT = "frames"
     OUTPUT_CVAT_READY_MERGED_DIR = "annotations_per_video"
     ```
8. **Run Merging Script**

   ```bash
   python scripts/merge_coco_json.py
   ```

   - This creates one COCO JSON file per video (the merge logic is sketched after this step)
   - Output: `annotations_per_video/#1_video1_coco_merged.json`
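Conceptually, merging amounts to concatenating the per-frame COCO files while re-indexing image and annotation IDs so they remain unique. The sketch below illustrates that idea; it is not the repository's `merge_coco_json.py` and assumes every per-frame file shares the same category list.

```python
# Conceptual sketch of merging per-frame COCO files into one; illustrative only,
# not the repository's merge_coco_json.py. Assumes a shared category list.
import json
from pathlib import Path

def merge_coco_files(per_frame_dir: str, output_path: str) -> None:
    merged = {"images": [], "annotations": [], "categories": None}
    next_image_id, next_ann_id = 1, 1
    for path in sorted(Path(per_frame_dir).glob("*_coco_annotation.json")):
        data = json.loads(path.read_text())
        if merged["categories"] is None:
            merged["categories"] = data["categories"]
        id_map = {}  # old image id -> new unique image id
        for image in data["images"]:
            id_map[image["id"]] = next_image_id
            image["id"] = next_image_id
            merged["images"].append(image)
            next_image_id += 1
        for ann in data["annotations"]:
            ann["id"] = next_ann_id
            ann["image_id"] = id_map[ann["image_id"]]
            merged["annotations"].append(ann)
            next_ann_id += 1
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged))

merge_coco_files("outputs/annotations/#1", "annotations_per_video/#1_video1_coco_merged.json")
```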
9. **Create CVAT Project**
   - Log in to your CVAT instance
   - Create a new project
   - Create a new task within the project
   - Set task type to "Instance Segmentation" (required for segmentation masks)
10. **Upload Data**
    - Upload all frames from one video as a task
    - Upload the corresponding merged COCO JSON file
    - Ensure frame filenames match those in the COCO JSON (a quick consistency check is sketched after this step)
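If CVAT rejects the upload or shows empty frames, mismatched filenames are the usual culprit. A quick consistency check along these lines, a hypothetical helper rather than part of the repository, compares the `file_name` entries in the merged COCO JSON against the frames on disk:

```python
# Hypothetical consistency check: verify that every file_name referenced in the
# merged COCO JSON exists among the extracted frames (and report extras).
import json
from pathlib import Path

def check_filenames(coco_json_path: str, frames_dir: str) -> None:
    coco = json.loads(Path(coco_json_path).read_text())
    # COCO file_name entries may include subdirectories; compare basenames.
    referenced = {Path(img["file_name"]).name for img in coco["images"]}
    on_disk = {p.name for p in Path(frames_dir).rglob("*.png")}
    print("Missing on disk:", sorted(referenced - on_disk)[:10])
    print("Frames without annotations:", sorted(on_disk - referenced)[:10])

check_filenames("annotations_per_video/#1_video1_coco_merged.json", "frames/#1")
```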
11. **Correction Workflow**
    - Review automatically generated annotations
    - Use CVAT tools to:
      - Adjust bounding boxes
      - Refine segmentation masks
      - Correct class labels
      - Add missing annotations
      - Remove false positives
12. **Export Corrected Annotations**
    - Export in COCO 1.0 format
    - Save to a new directory (e.g., `corrected_annotations_cvat_export`)
- `FRAME_INTERVAL_N`: Higher values reduce processing time but may miss fast-moving objects
- `IMAGE_FORMAT`: PNG recommended for quality, JPG for space efficiency
- `BOX_THRESHOLD`: Higher values (e.g., 0.5) reduce false positives but may miss objects
- `WITH_SLICE_INFERENCE`: Enable for high-resolution images with small objects
- `IOU_THRESHOLD`: Adjust based on object density (lower for crowded scenes)
- **ModuleNotFoundError**
  - Ensure virtual environment is activated
  - Verify all dependencies are installed
  - Check Grounded-SAM-2 installation

- **API Token Errors**
  - Verify `.env` file exists and contains valid token
  - Check token expiration
  - Ensure proper API access

- **CUDA/GPU Issues**
  - Verify CUDA installation
  - Check GPU memory usage
  - Consider reducing batch size or image resolution (a quick diagnostic is sketched after this list)
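A quick way to confirm that PyTorch actually sees the GPU is a generic diagnostic like the following (not a script shipped with the repository):

```python
# Generic CUDA diagnostic; not part of the repository.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"GPU memory free/total: {free / 1e9:.1f} / {total / 1e9:.1f} GB")
```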
- **File Not Found Errors**
  - Verify directory structure matches documentation
  - Check file permissions
  - Ensure consistent file extensions

- **CVAT Import Issues**
  - Verify frame filenames match COCO JSON
  - Check JSON format validity (a structural check is sketched after this list)
  - Ensure proper task type selection
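For the JSON validity point, a rough structural check like the one below, again a generic helper rather than part of the repository, catches the most common problems before an upload fails silently:

```python
# Generic structural check for a merged COCO JSON file; not part of the repository.
import json
from pathlib import Path

def check_coco_structure(path: str) -> None:
    data = json.loads(Path(path).read_text())
    for key in ("images", "annotations", "categories"):
        assert key in data, f"missing top-level key: {key}"
    image_ids = {img["id"] for img in data["images"]}
    category_ids = {cat["id"] for cat in data["categories"]}
    bad_refs = [a["id"] for a in data["annotations"]
                if a["image_id"] not in image_ids or a["category_id"] not in category_ids]
    print(f"{len(data['images'])} images, {len(data['annotations'])} annotations, "
          f"{len(bad_refs)} annotations with dangling references")

check_coco_structure("annotations_per_video/#1_video1_coco_merged.json")
```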
- **Processing Speed**
  - Use appropriate image format and resolution
  - Consider batch processing for large datasets

- **Detection Quality**
  - Adjust `BOX_THRESHOLD` and `IOU_THRESHOLD`
  - Enable `WITH_SLICE_INFERENCE` for high-res images
- **API Credit Management**
  - Start with test mode to experiment with parameters
  - Use a small subset of frames for initial testing
  - Monitor API credit usage through the DDS Cloud dashboard
  - Keep track of successful parameter combinations for different scenarios
- **Video Quality**
  - Use high-resolution videos when possible
  - Ensure good lighting conditions
  - Minimize motion blur
  - Consider video preprocessing if needed (e.g., stabilization, denoising)

- **Annotation Efficiency**
  - Start with conservative detection thresholds
  - Review a sample of frames before full processing
  - Use CVAT's AI-assisted tools when available
  - Document successful parameter combinations for different object types

- **Data Organization**
  - Maintain consistent naming conventions
  - Keep clear directory structure
  - Regular backups of annotations
  - Track which parameters were used for each annotation batch (one lightweight way to do this is sketched below)
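One lightweight way to track parameters is to write a small JSON sidecar next to each annotation batch. This is a suggestion, not something the pipeline does automatically:

```python
# Suggested (not built-in) way to record the parameters used for a batch.
import json
import time
from pathlib import Path

def save_run_metadata(output_dir: str, **params) -> None:
    meta = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), "parameters": params}
    path = Path(output_dir) / "run_metadata.json"
    path.write_text(json.dumps(meta, indent=2))

save_run_metadata(
    "outputs",
    BOX_THRESHOLD=0.35,
    IOU_THRESHOLD=0.5,
    WITH_SLICE_INFERENCE=True,
    FRAME_INTERVAL_N=30,
)
```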
We welcome contributions to improve AutoAnnotation! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For major changes, please open an issue first to discuss proposed improvements.
- Grounded-SAM-2 for the core detection and segmentation models
  - See their DINO-X demo section for detailed information about the annotation mechanism
- CVAT for the annotation interface
- DDS Cloud API for DINO-X access
  - Note: 20 yen free credits upon sign-up, paid usage thereafter
  - Visit their pricing page for current rates