Through a re-labeling process, we obtain an attention-based driving event dataset (ADED) consisting of 1101 videos. The dataset provides semantic annotations for each driving video, covering both driving event categories and driving event time windows. There are six event categories: Driving Normally (DN), Avoiding Pedestrian Crossing (ACP), Waiting for Vehicle Ahead (WVA), Waiting for Red Light (SRL), Stop Sign Stopping (SSS), and Avoiding Lane Changing Vehicle (ALC).
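The six event classes above can be kept in a small label map for downstream use. A minimal sketch (the class names and abbreviations come from the list above; the integer IDs and the helper function are illustrative assumptions, not part of the released annotation format):

```python
# Hypothetical label map for the six ADED driving event classes.
# Abbreviations follow the paper; the integer IDs are an assumption.
ADED_CLASSES = {
    0: ("DN", "Driving Normally"),
    1: ("ACP", "Avoiding Pedestrian Crossing"),
    2: ("WVA", "Waiting for Vehicle Ahead"),
    3: ("SRL", "Waiting for Red Light"),
    4: ("SSS", "Stop Sign Stopping"),
    5: ("ALC", "Avoiding Lane Changing Vehicle"),
}

def abbrev_to_id(abbrev: str) -> int:
    """Look up the class ID for an event abbreviation, e.g. 'WVA'."""
    for class_id, (short, _name) in ADED_CLASSES.items():
        if short == abbrev:
            return class_id
    raise KeyError(f"Unknown event abbreviation: {abbrev}")
```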
Fig. 1. ADED dataset annotation process. On the left is the annotation process for the entire ADED dataset; the heatmaps are derived from the BDD-A dataset, captured with eye-tracking devices to represent the driver's attention. On the right is the annotation process for the time window of each driving event.
Fig. 2. ADED dataset statistics. (a) The number and proportion of each driving event class. (b) The distribution of the duration of driving events. (c) The distribution of the occurrence of driving events along the timeline.
TABLE I: Comparison of Traffic Scene Datasets in Terms of Weather Conditions, Annotations, and Videos.

TABLE II: Comparison of DADA-2000, PSAD, and Our Dataset in Terms of Statistical Properties and t-SNE Feature Visualization.
✨Model
Fig. 3. Perception-inspired Network (VP²Net). Our model takes driving video sequences as input, where the SIE branch extracts bottom-up driving scene information and the APE branch extracts top-down driver attention information (which undergoes attention perception, "where to focus"; attention enhancement, "when to focus"; and information encoding). The attention information then guides the fusion of driving scene features, which are further decoded to produce the output. F1 is the attention information encoder; F2 is the event information decoder.
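The top-down/bottom-up interaction described above can be sketched as an attention-guided fusion step. The shapes and the fusion rule below are illustrative assumptions for intuition only, not the paper's exact formulation:

```python
import numpy as np

def attention_guided_fusion(scene_feat, attention_map):
    """Sketch of attention-guided fusion: a top-down attention map
    (APE branch) re-weights bottom-up scene features (SIE branch).

    scene_feat:    (C, H, W) driving scene features
    attention_map: (H, W)    driver attention weights in [0, 1]
    """
    # Broadcast the attention map over channels and modulate the
    # features, keeping a residual path so unattended regions are
    # down-weighted rather than zeroed out.
    return scene_feat * (1.0 + attention_map[None, :, :])

# Toy example: 4 channels on an 8x8 grid, attention peaked at one cell.
scene = np.ones((4, 8, 8))
attn = np.zeros((8, 8))
attn[3, 5] = 1.0
out = attention_guided_fusion(scene, attn)
```

Here the attended cell's features are amplified while the rest pass through unchanged; the actual VP²Net fusion is learned rather than this fixed rule.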
🚀 Quantitative Analysis
TABLE III: Quantitative Results of Different Models on the ADED, DADA-2000, PSAD Datasets.
🚀Visualization of Intermediate Results
Fig. 4. Visualization of the intermediate features.
(a) the original image;
(b) the driving scene feature;
(c) the driving scene feature from Uniformer;
(d) the attention information;
(e) the perception-enhanced information;
(f) the attention-encoded information.
These cases demonstrate the network’s mechanism and enhancement strategy, rather than the average performance across the dataset.
💖Support the Project
Thanks to the open-source video action detection models (ViViT, VideoMAE) on Hugging Face🤗 for supporting this paper.
📄Cite
If you find this repository useful, please use the following BibTeX entry for citation and give us a star⭐.