CaTFormer: Causal Temporal Transformer with Dynamic Contextual Fusion for Driving Intention Prediction
- [2026.01.16] 🚀 Code has been released on GitHub.
- [2026.01.09] The preprint version is available on arXiv.
- [2025.11.08] 🎉 Our paper has been accepted to AAAI 2026!
Accurate prediction of driving intention is key to enhancing the safety and interactive efficiency of human-machine co-driving systems. It serves as a cornerstone for achieving high-level autonomous driving. However, current approaches remain inadequate for accurately modeling the complex spatiotemporal interdependencies and the unpredictable variability of human driving behavior. To address these challenges, we propose CaTFormer, a causal Temporal Transformer that explicitly models causal interactions between driver behavior and environmental context for robust intention prediction. Specifically, CaTFormer introduces a novel Reciprocal Delayed Fusion (RDF) mechanism for precise temporal alignment of interior and exterior feature streams, a Counterfactual Residual Encoding (CRE) module that systematically eliminates spurious correlations to reveal authentic causal dependencies, and an innovative Feature Synthesis Network (FSN) that adaptively synthesizes these purified representations into coherent temporal representations. Experimental results demonstrate that CaTFormer attains state-of-the-art performance on the Brain4Cars dataset. It effectively captures complex causal temporal dependencies and enhances both the accuracy and transparency of driving intention prediction.
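For intuition only, the minimal PyTorch sketch below shows one way two temporally offset feature streams (interior and exterior) could be cross-attended and fused. The class name echoes the RDF idea, but the tensor shapes, layer choices, and the `delay` parameter are illustrative assumptions, not the released CaTFormer implementation.

```python
# Illustrative sketch only: fusing two temporally offset feature streams
# (interior / exterior) with cross-attention. Shapes, the `delay` parameter,
# and the layer choices are assumptions, NOT the released CaTFormer code.
import torch
import torch.nn as nn

class ReciprocalDelayedFusionSketch(nn.Module):
    def __init__(self, dim=256, heads=4, delay=2):
        super().__init__()
        self.delay = delay  # assumed frame offset between the two streams
        self.in2ex = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ex2in = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, interior, exterior):
        # interior, exterior: (batch, time, dim)
        # shift the exterior stream so it is attended to with a small temporal delay
        exterior_delayed = torch.roll(exterior, shifts=self.delay, dims=1)
        a, _ = self.in2ex(interior, exterior_delayed, exterior_delayed)  # interior queries exterior
        b, _ = self.ex2in(exterior_delayed, interior, interior)          # exterior queries interior
        return self.proj(torch.cat([a, b], dim=-1))                      # fused temporal features

# usage
fusion = ReciprocalDelayedFusionSketch()
inside = torch.randn(2, 16, 256)   # e.g. driver-facing features
outside = torch.randn(2, 16, 256)  # e.g. road-facing features
print(fusion(inside, outside).shape)  # torch.Size([2, 16, 256])
```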
- Python >= 3.7
- PyTorch >= 1.7
- CUDA (for GPU support)
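A quick, generic way to confirm that PyTorch sees the GPU (not a script from this repo):

```python
# Generic environment check; not part of the repository.
import torch
print(torch.__version__)           # expect >= 1.7
print(torch.cuda.is_available())   # True if the CUDA build and a GPU are visible
```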
- Clone the repository:
git clone https://github.com/srwang0506/CaTFormer.git
cd CaTFormer
- Set up the environment and install dependencies:
# create env
conda create -n catformer python=3.10 -y
conda activate catformer
# install PyTorch (CUDA 12.1 build)
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 \
--index-url https://download.pytorch.org/whl/cu121
# other deps
pip install -r requirements.txt
- Download the Brain4Cars dataset and extract all videos into JPG frame sequences (both interior and exterior cameras); a minimal frame-extraction sketch follows the directory layout below.
- Directory layout we expect after preprocessing:
CaTFormer/
├── brain4cars_data/
│   ├── face_camera/
│   │   ├── end_action/
│   │   ├── lchange/
│   │   ├── lturn/
│   │   ├── rchange/
│   │   └── rturn/
│   └── road_camera/
│       ├── end_action/
│       ├── lchange/
│       ├── lturn/
│       ├── rchange/
│       └── rturn/
├── datasets/
│   └── annotation/   # fold0.csv, fold1.csv, ...
└── ...
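As referenced above, frame extraction can be done with any tool; the sketch below uses OpenCV, and the paths and file-naming pattern are assumptions to adapt to the layout shown and to the annotation files you use.

```python
# Minimal frame-extraction sketch using OpenCV (ffmpeg works equally well).
# Input/output paths and the frame-naming pattern are illustrative assumptions.
import cv2
from pathlib import Path

def video_to_jpgs(video_path: Path, out_dir: Path) -> int:
    """Decode a video and write its frames as zero-padded JPGs; return the frame count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(out_dir / f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

# example (hypothetical paths): extract one interior video into its own frame folder
# video_to_jpgs(Path("raw/face_camera/lturn/video_0001.avi"),
#               Path("brain4cars_data/face_camera/lturn/video_0001"))
```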
- Clone the official RAFT repo (princeton-vl/RAFT) and install its dependencies as instructed in its README (including the pretrained weights), then run `demo_brain4cars.py` to compute exterior optical flow; set the output path (or move the results) so that the processed flow frames end up under `brain4cars_data/road_camera/flow`.
git clone https://github.com/princeton-vl/RAFT.git
# follow RAFT README to set up env + download pretrained weights
python demo_brain4cars.py
# output should be placed under:
# brain4cars_data/road_camera/flow
After optical flow processing, the dataset directory structure is as follows:
CaTFormer/
├── brain4cars_data/
│   ├── face_camera/
│   │   ├── end_action/
│   │   ├── lchange/
│   │   ├── lturn/
│   │   ├── rchange/
│   │   └── rturn/
│   └── road_camera/
│       └── flow/
│           ├── end_action/
│           ├── lchange/
│           ├── lturn/
│           ├── rchange/
│           └── rturn/
├── datasets/
│   └── annotation/   # fold0.csv, fold1.csv, ...
└── ...
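A small, generic sanity check (not a repo script) to confirm that each maneuver class ended up with flow frames in the expected location; adjust the file extension if RAFT's output is saved in another format:

```python
# Generic sanity check: count flow frames per maneuver class under
# brain4cars_data/road_camera/flow/.
from pathlib import Path

classes = ["end_action", "lchange", "lturn", "rchange", "rturn"]
flow_root = Path("brain4cars_data/road_camera/flow")
for name in classes:
    class_dir = flow_root / name
    n = len(list(class_dir.rglob("*.jpg"))) if class_dir.is_dir() else 0
    print(f"{name}: {n} flow frames")
```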
- Convert the interior `face_camera.mat` metadata to `car_state.txt` before training/testing:
python extract_mat.py
# This writes `car_state.txt` beside each video folder for later loading.
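For reference only, reading `.mat` metadata with SciPy and dumping it to a text file might look like the sketch below; the variable name `car_state` and the paths are hypothetical, so use `extract_mat.py` from this repo for the actual conversion.

```python
# Illustrative only: load a .mat file with SciPy and write one of its arrays
# to a plain-text file. Field names and paths here are hypothetical.
from pathlib import Path
import scipy.io as sio

def mat_to_txt(mat_path: Path, txt_path: Path, key: str = "car_state") -> None:
    data = sio.loadmat(str(mat_path))   # dict of arrays keyed by variable name
    rows = data[key]                     # hypothetical variable holding per-frame state
    with open(txt_path, "w") as f:
        for row in rows:
            f.write(" ".join(str(v) for v in row) + "\n")

# mat_to_txt(Path("brain4cars_data/face_camera/lturn/video_0001/face_camera.mat"),
#            Path("brain4cars_data/face_camera/lturn/video_0001/car_state.txt"))
```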
To train the model on a specific fold (e.g., fold 3), use the provided shell script:
bash train_fold.sh
To train on all folds for the 5-fold cross-validation:
bash train_total.sh
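Per-fold results can then be summarized; a generic way (not a repo script) to aggregate a metric reported once per fold:

```python
# Generic aggregation of 5-fold results: mean and standard deviation of a
# per-fold metric. Fill in the values reported for each fold.
from statistics import mean, stdev

fold_accuracies = [0.0, 0.0, 0.0, 0.0, 0.0]  # placeholder per-fold test accuracies
print(f"mean: {mean(fold_accuracies):.4f}  std: {stdev(fold_accuracies):.4f}")
```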
To evaluate the trained model, use the provided shell script:
bash test.sh

If you find this work helpful, please consider citing:
@misc{wang2026catformercausaltemporaltransformer,
title={CaTFormer: Causal Temporal Transformer with Dynamic Contextual Fusion for Driving Intention Prediction},
author={Sirui Wang and Zhou Guan and Bingxi Zhao and Tongjia Gu and Jie Liu},
year={2026},
eprint={2507.13425},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.13425},
}