AutoAnnotation is a pipeline designed to significantly speed up annotating objects in video streams and images. It was originally created for video streams from recycling facilities, but it is built to be applicable to any use case. It combines state-of-the-art computer vision models to automatically generate initial bounding box and segmentation mask annotations, which can then be efficiently reviewed and corrected in CVAT.
The pipeline leverages:
- DINO-X (via DDS Cloud API) for robust object detection
- SAM2 (Segment Anything Model 2) for high-quality segmentation masks
- CVAT for intuitive manual correction and refinement
For detailed information about the core annotation mechanism and DINO-X integration, please refer to the Grounded-SAM-2 repository, particularly the DINO-X demo section.
- System Requirements
- Installation
- Project Structure
- Workflow Guide
- Configuration Guide
- Troubleshooting
- Best Practices
- Contributing
- NVIDIA GPU with CUDA support (recommended for SAM2)
- Minimum 16GB RAM
- Sufficient disk space for:
- Video files
- Extracted frames (can be large)
- Model checkpoints
- Annotation files
- Linux operating system (tested on Ubuntu 20.04 LTS)
- Python 3.10 or later
- CUDA 12.1 or later
- Git
1. **Clone the Repository**

   ```bash
   git clone https://github.com/yourusername/AutoAnnotation.git
   cd AutoAnnotation
   ```

2. **Set Up Python Environment**

   ```bash
   # Create and activate a virtual environment
   python -m venv venv
   source venv/bin/activate   # On Linux/Mac
   # or
   .\venv\Scripts\activate    # On Windows
   ```

3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Install Grounded-SAM-2**

   ```bash
   # Clone the Grounded-SAM-2 repository
   git clone https://github.com/IDEA-Research/Grounded-SAM-2.git
   cd Grounded-SAM-2

   # Install SAM2 (required for segmentation)
   pip install -e .

   # Download SAM2 checkpoints (required)
   cd checkpoints
   bash download_ckpts.sh
   cd ../gdino_checkpoints
   bash download_ckpts.sh
   cd ../..
   ```
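Once the checkpoints are downloaded, it can be worth confirming that SAM2 imports and builds before moving on. The snippet below is a minimal sanity check, not part of the repository; the checkpoint filename and config path are assumptions that depend on which SAM2 variant `download_ckpts.sh` fetched, so adjust them to match your `checkpoints/` directory.

```python
# Minimal SAM2 sanity check; checkpoint/config names are assumptions --
# adjust them to whatever download_ckpts.sh actually fetched on your machine.
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

CHECKPOINT = "checkpoints/sam2.1_hiera_large.pt"    # assumed filename
MODEL_CFG = "configs/sam2.1/sam2.1_hiera_l.yaml"    # assumed config path

device = "cuda" if torch.cuda.is_available() else "cpu"
predictor = SAM2ImagePredictor(build_sam2(MODEL_CFG, CHECKPOINT, device=device))
print(f"SAM2 predictor built successfully on {device}")
```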
5. **Set Up API Access**

   - Create a `.env` file in the project root:

     ```bash
     touch .env
     ```

   - Add your DDS Cloud API token (note: the variable name must be `API_TOKEN`):

     ```
     API_TOKEN=your_token_here
     ```

   - Get your API token from: https://deepdataspace.com/request_api
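The annotation script reads this token from the environment. A minimal sketch of how such a token is typically loaded is shown below; it assumes the `python-dotenv` package is available (check `requirements.txt`), and the actual script may load it differently.

```python
# Minimal sketch of loading the DDS Cloud token from .env.
# Assumes python-dotenv; the repository script may load it differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
api_token = os.environ.get("API_TOKEN")
if not api_token:
    raise RuntimeError("API_TOKEN not found -- check your .env file")
```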
```
AutoAnnotation/
├── scripts/                     # Python scripts for the pipeline
│   ├── extract_frames.py        # Frame extraction script
│   ├── nested_coco_sam_dinox_v2.py  # Main annotation script
│   └── merge_coco_json.py       # COCO JSON merging script
│
├── input_videos/                # Input video directory (create this)
│   ├── #1/                      # Camera/source 1
│   │   └── video1.MP4
│   └── #2/                      # Camera/source 2
│       └── videoA.MP4
│
├── frames/                      # Extracted frames (created automatically)
│   ├── #1/                      # Mirrors input_videos structure
│   │   └── video1_frame_0000.png
│   └── #2/
│       └── videoA_frame_0000.png
│
├── outputs/                     # Annotation outputs (created automatically)
│   ├── annotations/             # Per-frame COCO JSON annotations
│   │   ├── #1/
│   │   │   └── video1_frame_0000_coco_annotation.json
│   │   └── #2/
│   └── visualizations/          # Optional visualization images
│
├── annotations_per_video/       # Merged annotations (created automatically)
│   ├── #1_video1_coco_merged.json
│   └── #2_videoA_coco_merged.json
│
├── checkpoints/                 # Model checkpoints (created during setup)
│   ├── sam2/                    # SAM2 model weights
│   └── gdino_checkpoints/       # Grounding DINO model weights
│
├── sam2/                        # SAM2 model code (created during setup)
│   └── ...                      # Model implementation files
│
├── data/                        # Additional data files (if any)
│
├── slurm_logs/                  # Log files for SLURM jobs (if using cluster)
│
├── requirements.txt             # Python dependencies
├── setup.py                     # Package setup file
└── .env                         # API tokens (create this, not in git)
```
- **User-Created Directories/Files** (you need to create these):
  - `input_videos/`: Place your input video files here
  - `.env`: Create this file to store your API tokens

- **Automatically Created Directories** (created by the scripts):
  - `frames/`: Created by `extract_frames.py`
  - `outputs/`: Created by `nested_coco_sam_dinox_v2.py`
  - `annotations_per_video/`: Created by `merge_coco_json.py`

- **Setup-Created Directories** (created during installation):
  - `checkpoints/`: Created when downloading model weights
  - `sam2/`: Created when installing SAM2
  - `slurm_logs/`: Created if using SLURM for cluster computing

Note: All automatically created directories will be generated when you run the respective scripts. You only need to create the `input_videos/` directory and the `.env` file before starting the pipeline.
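A small pre-flight check along these lines can catch a missing directory or token before any processing starts. This is a hypothetical helper, not a script shipped with the repository:

```python
# preflight_check.py -- hypothetical helper, not part of the repository.
# Verifies the user-created inputs exist before the pipeline is started.
from pathlib import Path

def preflight_check() -> None:
    problems = []
    if not Path("input_videos").is_dir():
        problems.append("input_videos/ directory is missing")
    if not Path(".env").is_file():
        problems.append(".env file is missing (should contain API_TOKEN)")
    if problems:
        raise SystemExit("Pre-flight check failed:\n  - " + "\n  - ".join(problems))
    print("Pre-flight check passed.")

if __name__ == "__main__":
    preflight_check()
```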
1. **Video File Organization**
   - Place your input videos in the `input_videos` directory
   - Organize videos by camera/source in subdirectories
   - Supported format: MP4 (other formats may work but are untested)
   - Example structure:

     ```
     input_videos/
     ├── #1/
     │   ├── video1.MP4
     │   └── video2.MP4
     └── #2/
         ├── videoA.MP4
         └── videoB.MP4
     ```
2. **Configure Frame Extraction**
   - Open `scripts/extract_frames.py`
   - Set the following parameters:

     ```python
     INPUT_ROOT_DIR = "input_videos"   # Input video directory
     OUTPUT_ROOT_DIR = "frames"        # Output frames directory
     FRAME_INTERVAL_N = 30             # Extract every Nth frame
     IMAGE_FORMAT = "png"              # Output image format
     VIDEO_EXTENSION = ".MP4"          # Input video extension
     ```
3. **Run Frame Extraction**

   ```bash
   python scripts/extract_frames.py
   ```

   - This will create a mirrored directory structure in `frames/` (the core extraction logic is sketched after this step)
   - Example output: `frames/#1/video1_frame_0000.png`
4. **Configure Annotation Script**
   - Open `scripts/nested_coco_sam_dinox_v2.py`
   - Set key parameters:

     ```python
     # Core detection parameters
     BOX_THRESHOLD = 0.35          # Detection confidence threshold
     IOU_THRESHOLD = 0.5           # IoU threshold for NMS

     # Processing parameters
     FRAMES_DIR = "frames"         # Input frames directory
     FRAME_GLOB_PATTERN = "*.png"  # Frame file pattern
     WITH_SLICE_INFERENCE = True   # Enable for high-res images
     OUTPUT_DIR_BASE = "outputs"   # Output directory

     # Test mode parameters (recommended for parameter tuning)
     TEST_MODE = True              # Enable test mode
     NUM_TEST_FRAMES = 10          # Number of frames to process in test mode
     ```
5. **Parameter Tuning**
   - The annotation quality depends heavily on parameter configuration
   - We recommend using test mode (`TEST_MODE = True`) to experiment with different parameters
   - Key parameters to tune:
     - `BOX_THRESHOLD`: Controls detection confidence
       - Higher values (e.g., 0.5) reduce false positives but may miss objects
       - Lower values (e.g., 0.3) catch more objects but may include false positives
     - `IOU_THRESHOLD`: Controls overlap between detections
       - Lower values (e.g., 0.3) for crowded scenes with overlapping objects
       - Higher values (e.g., 0.7) for scenes with well-separated objects
     - `WITH_SLICE_INFERENCE`: Recommended for high-resolution images
       - Helps detect small objects in large images
       - Increases processing time but improves detection quality
   - Test different parameter combinations on a small subset of frames
   - Monitor API credit usage during testing
   - Once optimal parameters are found, disable test mode for full processing (see the sketch after this list for how the two thresholds interact)
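To make the two thresholds concrete, the sketch below shows the standard post-processing they typically control: discard boxes scoring below `BOX_THRESHOLD`, then suppress overlapping boxes with non-maximum suppression at `IOU_THRESHOLD`. This is an illustration of the concept using `torchvision.ops.nms`, not the code in `nested_coco_sam_dinox_v2.py`.

```python
# Illustration of BOX_THRESHOLD / IOU_THRESHOLD post-processing; not the
# repository script. boxes: (N, 4) xyxy tensor, scores: (N,) tensor.
import torch
from torchvision.ops import nms

def filter_detections(boxes: torch.Tensor,
                      scores: torch.Tensor,
                      box_threshold: float = 0.35,
                      iou_threshold: float = 0.5):
    keep = scores >= box_threshold            # drop low-confidence detections
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_threshold)  # suppress overlapping boxes
    return boxes[kept], scores[kept]

# Example: two heavily overlapping boxes and one low-confidence box.
boxes = torch.tensor([[0., 0., 100., 100.],
                      [5., 5., 105., 105.],
                      [200., 200., 250., 250.]])
scores = torch.tensor([0.9, 0.8, 0.2])
print(filter_detections(boxes, scores))  # keeps one of the overlapping pair
```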
6. **Run Automated Annotation**

   ```bash
   python scripts/nested_coco_sam_dinox_v2.py
   ```

   - This will generate COCO JSON annotations for each frame
   - Output structure: `outputs/annotations/#1/video1_frame_0000_coco_annotation.json`
   - In test mode, only `NUM_TEST_FRAMES` frames are processed to save API credits
7. **Configure Merging Script**
   - Open `scripts/merge_coco_json.py`
   - Set parameters:

     ```python
     NESTED_PER_FRAME_ANNOTATIONS_ROOT = "outputs/annotations"
     ALL_FRAMES_IMAGES_ROOT = "frames"
     OUTPUT_CVAT_READY_MERGED_DIR = "annotations_per_video"
     ```
8. **Run Merging Script**

   ```bash
   python scripts/merge_coco_json.py
   ```

   - This creates one COCO JSON file per video (the merge logic is sketched after this step)
   - Output: `annotations_per_video/#1_video1_coco_merged.json`
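Conceptually, merging amounts to concatenating the per-frame COCO files while re-indexing image and annotation IDs so they remain unique. The sketch below illustrates that idea; it is not the repository's `merge_coco_json.py` and assumes every per-frame file shares the same category list.

```python
# Conceptual sketch of merging per-frame COCO files into one; illustrative only,
# not the repository's merge_coco_json.py. Assumes a shared category list.
import json
from pathlib import Path

def merge_coco_files(per_frame_dir: str, output_path: str) -> None:
    merged = {"images": [], "annotations": [], "categories": None}
    next_image_id, next_ann_id = 1, 1
    for path in sorted(Path(per_frame_dir).glob("*_coco_annotation.json")):
        data = json.loads(path.read_text())
        if merged["categories"] is None:
            merged["categories"] = data["categories"]
        id_map = {}  # old image id -> new unique image id
        for image in data["images"]:
            id_map[image["id"]] = next_image_id
            image["id"] = next_image_id
            merged["images"].append(image)
            next_image_id += 1
        for ann in data["annotations"]:
            ann["id"] = next_ann_id
            ann["image_id"] = id_map[ann["image_id"]]
            merged["annotations"].append(ann)
            next_ann_id += 1
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged))

merge_coco_files("outputs/annotations/#1", "annotations_per_video/#1_video1_coco_merged.json")
```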
9. **Create CVAT Project**
   - Log in to your CVAT instance
   - Create a new project
   - Create a new task within the project
   - Set task type to "Instance Segmentation" (required for segmentation masks)
10. **Upload Data**
    - Upload all frames from one video as a task
    - Upload the corresponding merged COCO JSON file
    - Ensure frame filenames match those in the COCO JSON (a quick consistency check is sketched after this step)
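If CVAT rejects the upload or shows empty frames, mismatched filenames are the usual culprit. A quick consistency check along these lines, a hypothetical helper rather than part of the repository, compares the `file_name` entries in the merged COCO JSON against the frames on disk:

```python
# Hypothetical consistency check: verify that every file_name referenced in the
# merged COCO JSON exists among the extracted frames (and report extras).
import json
from pathlib import Path

def check_filenames(coco_json_path: str, frames_dir: str) -> None:
    coco = json.loads(Path(coco_json_path).read_text())
    # COCO file_name entries may include subdirectories; compare basenames.
    referenced = {Path(img["file_name"]).name for img in coco["images"]}
    on_disk = {p.name for p in Path(frames_dir).rglob("*.png")}
    print("Missing on disk:", sorted(referenced - on_disk)[:10])
    print("Frames without annotations:", sorted(on_disk - referenced)[:10])

check_filenames("annotations_per_video/#1_video1_coco_merged.json", "frames/#1")
```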
11. **Correction Workflow**
    - Review automatically generated annotations
    - Use CVAT tools to:
      - Adjust bounding boxes
      - Refine segmentation masks
      - Correct class labels
      - Add missing annotations
      - Remove false positives
12. **Export Corrected Annotations**
    - Export in COCO 1.0 format
    - Save to a new directory (e.g., `corrected_annotations_cvat_export`)
- `FRAME_INTERVAL_N`: Higher values reduce processing time but may miss fast-moving objects
- `IMAGE_FORMAT`: PNG recommended for quality, JPG for space efficiency
- `BOX_THRESHOLD`: Higher values (e.g., 0.5) reduce false positives but may miss objects
- `WITH_SLICE_INFERENCE`: Enable for high-resolution images with small objects
- `IOU_THRESHOLD`: Adjust based on object density (lower for crowded scenes)
- **ModuleNotFoundError**
  - Ensure virtual environment is activated
  - Verify all dependencies are installed
  - Check Grounded-SAM-2 installation

- **API Token Errors**
  - Verify `.env` file exists and contains valid token
  - Check token expiration
  - Ensure proper API access

- **CUDA/GPU Issues**
  - Verify CUDA installation
  - Check GPU memory usage
  - Consider reducing batch size or image resolution (a quick diagnostic is sketched after this list)
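A quick way to confirm that PyTorch actually sees the GPU is a generic diagnostic like the following (not a script shipped with the repository):

```python
# Generic CUDA diagnostic; not part of the repository.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"GPU memory free/total: {free / 1e9:.1f} / {total / 1e9:.1f} GB")
```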
- **File Not Found Errors**
  - Verify directory structure matches documentation
  - Check file permissions
  - Ensure consistent file extensions

- **CVAT Import Issues**
  - Verify frame filenames match COCO JSON
  - Check JSON format validity (a structural check is sketched after this list)
  - Ensure proper task type selection
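For the JSON validity point, a rough structural check like the one below, again a generic helper rather than part of the repository, catches the most common problems before an upload fails silently:

```python
# Generic structural check for a merged COCO JSON file; not part of the repository.
import json
from pathlib import Path

def check_coco_structure(path: str) -> None:
    data = json.loads(Path(path).read_text())
    for key in ("images", "annotations", "categories"):
        assert key in data, f"missing top-level key: {key}"
    image_ids = {img["id"] for img in data["images"]}
    category_ids = {cat["id"] for cat in data["categories"]}
    bad_refs = [a["id"] for a in data["annotations"]
                if a["image_id"] not in image_ids or a["category_id"] not in category_ids]
    print(f"{len(data['images'])} images, {len(data['annotations'])} annotations, "
          f"{len(bad_refs)} annotations with dangling references")

check_coco_structure("annotations_per_video/#1_video1_coco_merged.json")
```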
- **Processing Speed**
  - Use appropriate image format and resolution
  - Consider batch processing for large datasets

- **Detection Quality**
  - Adjust `BOX_THRESHOLD` and `IOU_THRESHOLD`
  - Enable `WITH_SLICE_INFERENCE` for high-res images
- **API Credit Management**
  - Start with test mode to experiment with parameters
  - Use a small subset of frames for initial testing
  - Monitor API credit usage through the DDS Cloud dashboard
  - Keep track of successful parameter combinations for different scenarios
- **Video Quality**
  - Use high-resolution videos when possible
  - Ensure good lighting conditions
  - Minimize motion blur
  - Consider video preprocessing if needed (e.g., stabilization, denoising)

- **Annotation Efficiency**
  - Start with conservative detection thresholds
  - Review a sample of frames before full processing
  - Use CVAT's AI-assisted tools when available
  - Document successful parameter combinations for different object types

- **Data Organization**
  - Maintain consistent naming conventions
  - Keep clear directory structure
  - Regular backups of annotations
  - Track which parameters were used for each annotation batch (one lightweight way to do this is sketched below)
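One lightweight way to track parameters is to write a small JSON sidecar next to each annotation batch. This is a suggestion, not something the pipeline does automatically:

```python
# Suggested (not built-in) way to record the parameters used for a batch.
import json
import time
from pathlib import Path

def save_run_metadata(output_dir: str, **params) -> None:
    meta = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), "parameters": params}
    path = Path(output_dir) / "run_metadata.json"
    path.write_text(json.dumps(meta, indent=2))

save_run_metadata(
    "outputs",
    BOX_THRESHOLD=0.35,
    IOU_THRESHOLD=0.5,
    WITH_SLICE_INFERENCE=True,
    FRAME_INTERVAL_N=30,
)
```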
We welcome contributions to improve AutoAnnotation! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For major changes, please open an issue first to discuss proposed improvements.
- Grounded-SAM-2 for the core detection and segmentation models
  - See their DINO-X demo section for detailed information about the annotation mechanism
- CVAT for the annotation interface
- DDS Cloud API for DINO-X access
  - Note: 20 yen free credits upon sign-up, paid usage thereafter
  - Visit their pricing page for current rates