Through a re-labeling process, we obtain an attention-based driving event dataset (ADED) consisting of 1101 videos. The dataset provides semantic annotations for each driving video, covering both driving event categories and driving event time windows. There are six event categories: Driving Normally (DN), Avoiding Pedestrian Crossing (ACP), Waiting for Vehicle Ahead (WVA), Waiting for Red Light (SRL), Stop Sign Stopping (SSS), and Avoiding Lane Changing Vehicle (ALC).
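The six event classes above can be kept in a small label map for downstream use. A minimal sketch (the class names and abbreviations come from the list above; the integer IDs and the helper function are illustrative assumptions, not part of the released annotation format):

```python
# Hypothetical label map for the six ADED driving event classes.
# Abbreviations follow the paper; the integer IDs are an assumption.
ADED_CLASSES = {
    0: ("DN", "Driving Normally"),
    1: ("ACP", "Avoiding Pedestrian Crossing"),
    2: ("WVA", "Waiting for Vehicle Ahead"),
    3: ("SRL", "Waiting for Red Light"),
    4: ("SSS", "Stop Sign Stopping"),
    5: ("ALC", "Avoiding Lane Changing Vehicle"),
}

def abbrev_to_id(abbrev: str) -> int:
    """Look up the class ID for an event abbreviation, e.g. 'WVA'."""
    for class_id, (short, _name) in ADED_CLASSES.items():
        if short == abbrev:
            return class_id
    raise KeyError(f"Unknown event abbreviation: {abbrev}")
```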
Fig. 1. ADED dataset annotation process. On the left is the annotation process for the entire ADED dataset; the heatmaps are derived from the BDD-A dataset, captured with eye-tracking devices to represent the driver's attention. On the right is the annotation process for the time window of each driving event.
Fig. 2. ADED dataset statistics. (a) The number and proportion of each driving event class. (b) The distribution of the duration of driving events. (c) The distribution of the occurrence of driving events along the timeline.
TABLE I: Comparison of Traffic Scene Datasets in Terms of Weather Conditions, Annotations, and Videos.

TABLE II: Comparison of DADA-2000, PSAD, and Our Dataset in Terms of Statistical Properties and t-SNE Feature Visualization.
✨Model
Fig. 3. Perception-inspired Network (VP²Net). Our model takes driving video sequences as input, where the SIE branch extracts bottom-up driving scene information and the APE branch extracts top-down driver attention information (which undergoes attention perception, "where to focus"; attention enhancement, "when to focus"; and information encoding). The attention information then guides the fusion of driving scene features, which are further decoded to produce the output. F1 is the attention information encoder; F2 is the event information decoder.
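The top-down/bottom-up interaction described above can be sketched as an attention-guided fusion step. The shapes and the fusion rule below are illustrative assumptions for intuition only, not the paper's exact formulation:

```python
import numpy as np

def attention_guided_fusion(scene_feat, attention_map):
    """Sketch of attention-guided fusion: a top-down attention map
    (APE branch) re-weights bottom-up scene features (SIE branch).

    scene_feat:    (C, H, W) driving scene features
    attention_map: (H, W)    driver attention weights in [0, 1]
    """
    # Broadcast the attention map over channels and modulate the
    # features, keeping a residual path so unattended regions are
    # down-weighted rather than zeroed out.
    return scene_feat * (1.0 + attention_map[None, :, :])

# Toy example: 4 channels on an 8x8 grid, attention peaked at one cell.
scene = np.ones((4, 8, 8))
attn = np.zeros((8, 8))
attn[3, 5] = 1.0
out = attention_guided_fusion(scene, attn)
```

Here the attended cell's features are amplified while the rest pass through unchanged; the actual VP²Net fusion is learned rather than this fixed rule.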
🚀 Quantitative Analysis
TABLE III: Quantitative Results of Different Models on the ADED, DADA-2000, PSAD Datasets.
🚀Visualization of Intermediate Results
Fig. 4. Visualization of the intermediate features.
(a) the original image;
(b) the driving scene feature;
(c) the driving scene feature from Uniformer;
(d) the attention information;
(e) the perception-enhanced information;
(f) the attention-encoded information.
These cases demonstrate the network’s mechanism and enhancement strategy, rather than the average performance across the dataset.
💖Support the Project
Thanks to the open-source video action detection models (ViViT, VideoMAE) on Hugging Face🤗 for supporting this paper.
📄Cite
If you find this repository useful, please use the following BibTeX entry for citation and give us a star⭐.