ASOD60K: Audio-visual salient object segmentation on 360° videos

Introduction


Figure 1: Annotation examples from the proposed ASOD60K dataset. (a) Illustration of head movement (HM). The subjects wear Head-Mounted Displays (HMDs) and observe 360° scenes by moving their heads to control a field-of-view (FoV) within the 360°×180° range. (b) Each subject (i.e., Subject 1 to Subject N) watches the video without restriction. (c) The HMD-embedded eye tracker records their eye fixations. (d) According to the fixations, we provide coarse-to-fine annotations for each FoV, including (e) super-/sub-classes, instance-level masks, and attributes (e.g., GD-Geometrical Distortion).
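As a rough illustration of the viewing geometry above, the sketch below maps a fixation given in spherical coordinates (longitude/latitude, in degrees) onto pixel coordinates of an equirectangular frame. The coordinate conventions and function name are assumptions for illustration, not the dataset's actual fixation format:

```python
def fixation_to_er_pixel(lon_deg, lat_deg, width, height):
    """Map a fixation in spherical coordinates (degrees) to pixel
    coordinates on a width x height equirectangular (ER) frame.

    Assumed conventions (illustrative only): longitude in [-180, 180),
    increasing rightward; latitude in [-90, 90], increasing upward.
    """
    x = (lon_deg + 180.0) / 360.0 * width   # horizontal axis spans 360 deg
    y = (90.0 - lat_deg) / 180.0 * height   # vertical axis spans 180 deg
    return int(x) % width, min(int(y), height - 1)

# Example: a fixation straight ahead on a 4K ER frame (3840x1920)
print(fixation_to_er_pixel(0.0, 0.0, 3840, 1920))  # -> (1920, 960)
```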

Exploring what humans pay attention to in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD), which mimics the human attention mechanism by segmenting salient objects with the guidance of audio-visual cues. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, distinguishing itself in richness, diversity, and quality. Specifically, each sequence is marked with its super- and sub-class, and objects of each sub-class are further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modeling, e.g., determining the major challenges for existing SOD models, and predicting scanpaths to study the long-term eye-fixation behavior of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study serves as a good starting point for advancing SOD research towards panoramic videos.
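To make the coarse-to-fine annotation hierarchy concrete, here is a minimal Python sketch of how one annotated frame could be represented. All field names are hypothetical; please consult the released ground-truth files for the actual layout:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FrameAnnotation:
    """Hypothetical per-frame annotation record (for illustration only)."""
    super_class: str                        # one of the dataset's super-categories
    sub_class: str                          # finer-grained sub-category
    fixations: List[Tuple[float, float]]    # eye fixations as (x, y) pixel coordinates
    boxes: List[Tuple[int, int, int, int]]  # per-instance bounding boxes (x, y, w, h)
    mask_paths: List[str]                   # instance-level mask image files
    attributes: List[str] = field(default_factory=list)  # e.g., ["GD"] for geometrical distortion
```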

🏃 🏃 🏃 This repository will keep updating.


Related Dataset Works


Figure 2: Summary of widely used salient object detection (SOD) datasets and the proposed panoramic video SOD (PV-SOD) dataset. #Img: The number of images/frames. #GT: The number of ground-truth masks. Pub. = Publication. Obj.-Level = Object-Level. Ins.-Level = Instance-Level. Fix.GT = Fixation-guided ground truths. † denotes equirectangular (ER) images.


Dataset Annotations and Attributes


Figure 3: Examples of challenging attributes on equirectangular (ER) images from our ASOD60K, with instance-level ground truth (GT) and fixations as annotation guidance. f_k, f_l, and f_m denote randomly selected frames of a given video.


Figure 4: More annotation examples, showing passed and rejected cases from our annotation quality control.


Figure 5: Attribute descriptions and statistics. (a)/(b) represent the correlation and frequency of ASOD60K's attributes, respectively.

Dataset Statistics


Figure 6: Statistics of the proposed ASOD60K. (a) Super-/sub-category information. (b) Instance density of each sub-class. (c) Main components of ASOD60K scenes.


Benchmark

Overall Quantitative Results


Figure 7: Performance comparison of 7/3 state-of-the-art conventional I-SOD/V-SOD methods and one PI-SOD method on ASOD60K. ↑/↓ denotes that a larger/smaller value is better. The best result in each column is bolded.

Attribute-Specific Quantitative Results


Figure 8: Performance comparison of 7/3/1 state-of-the-art I-SOD/V-SOD/PI-SOD methods on each attribute.

Reference

| No. | Year | Pub. | Title | Links |
|:---:|:----:|:-----|:------|:-----:|
| 01 | 2019 | IEEE CVPR | Cascaded Partial Decoder for Fast and Accurate Salient Object Detection | Paper/Project |
| 02 | 2019 | IEEE ICCV | Stacked Cross Refinement Network for Edge-Aware Salient Object Detection | Paper/Project |
| 03 | 2020 | AAAI | F3Net: Fusion, Feedback and Focus for Salient Object Detection | Paper/Project |
| 04 | 2020 | IEEE CVPR | Multi-scale Interactive Network for Salient Object Detection | Paper/Project |
| 05 | 2020 | IEEE CVPR | Label Decoupling Framework for Salient Object Detection | Paper/Project |
| 06 | 2020 | ECCV | Highly Efficient Salient Object Detection with 100K Parameters | Paper/Project |
| 07 | 2020 | ECCV | Suppress and Balance: A Simple Gated Network for Salient Object Detection | Paper/Project |
| 08 | 2019 | IEEE CVPR | See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks | Paper/Project |
| 09 | 2019 | IEEE ICCV | Semi-Supervised Video Salient Object Detection Using Pseudo-Labels | Paper/Project |
| 10 | 2020 | AAAI | Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection | Paper/Project |
| 11 | 2020 | IEEE SPL | FANet: Features Adaptation Network for 360° Omnidirectional Salient Object Detection | Paper/Project |

Evaluation Toolbox

All quantitative results were computed with the one-key Python toolbox eval-co-sod: https://github.com/zzhanghub/eval-co-sod.
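For a quick sanity check without the full toolbox, the mean absolute error (MAE, the ↓ metric reported in Figure 7) can be computed in a few lines. This is a generic reference implementation, not the toolbox's own code:

```python
import numpy as np
from PIL import Image

def mae(pred_path, gt_path):
    """Mean absolute error between a saliency map and its ground-truth
    mask, both rescaled to [0, 1]; lower is better."""
    pred = np.asarray(Image.open(pred_path).convert("L"), dtype=np.float64) / 255.0
    gt = np.asarray(Image.open(gt_path).convert("L"), dtype=np.float64) / 255.0
    assert pred.shape == gt.shape, "prediction and GT must share the same resolution"
    return float(np.abs(pred - gt).mean())
```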


Downloads

The complete object-/instance-level ground truth with the default split can be downloaded from Baidu Drive (fetch code: k3h8) or Google Drive.

The videos with default split can be downloaded from Google Drive or OneDrive.

The head movement and eye fixation data can be downloaded from Google Drive.

To generate video frames, please refer to video_to_frames.py.

To get access to raw videos on YouTube, please refer to video_seq_link.

To check basic information regarding the raw videos, please refer to video_information.txt (continuously updated).
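As an alternative to the provided video_to_frames.py, frames can also be extracted with a few lines of OpenCV. This is a generic sketch under the assumption that any frame naming scheme is acceptable, not a reproduction of the repository's script:

```python
import os
import cv2  # pip install opencv-python

def extract_frames(video_path, out_dir):
    """Dump every frame of a video as a zero-padded, numbered PNG file."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream (or read error)
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.png"), frame)
        idx += 1
    cap.release()
```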


Contact

Please feel free to drop an e-mail to yi.zhang1@insa-rennes.fr for questions or further discussion.

If you have any questions about the head movement and eye fixation data, please contact fang-yi.chao@tcd.ie.


Citation

@article{zhang2021asod60k,
  title={ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos},
  author={Zhang, Yi and Chao, Fang-Yi and Ji, Ge-Peng and Fan, Deng-Ping and Zhang, Lu and Shao, Ling},
  journal={arXiv preprint arXiv:2107.11629},
  year={2021}
}
