This repository is the official implementation of "CA2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition" (IEEE TPAMI).
- CA2ST extends our previous work, CAST: Cross-Attention in Space and Time for Video Action Recognition (NeurIPS 2023), by introducing an additional audio expert and cross-attention across audio, space, and time.
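To give a sense of the mechanism, here is a minimal cross-attention sketch between two expert token streams. This is an illustration only, not the repository's exact module; the class name, shapes, and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Sketch: tokens from one frozen expert attend to another expert's tokens."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q_tokens, kv_tokens):
        # One stream (e.g., spatial tokens from CLIP) queries the other
        # (e.g., temporal tokens from VideoMAE or audio tokens from AST).
        q = self.norm_q(q_tokens)
        kv = self.norm_kv(kv_tokens)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return q_tokens + out  # residual: keep the frozen expert's features


# Only lightweight adapters like this are trained; the experts stay frozen.
spatial = torch.randn(2, 196, 768)    # e.g., CLIP patch tokens (B, N, D)
temporal = torch.randn(2, 1568, 768)  # e.g., VideoMAE tube tokens
fused = CrossAttentionAdapter()(spatial, temporal)  # -> (2, 196, 768)
```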
We recommend using uv for reproducible environment setup.
```bash
uv venv --python 3.12
source .venv/bin/activate
uv sync
```

Note: this repository uses PyTorch/DeepSpeed/Triton. If you run into CUDA- or compiler-related build issues, please check your CUDA driver/toolkit setup and install system dependencies accordingly.
We evaluate CA2ST on 10 datasets spanning spatio-temporal and audio-visual benchmarks, including:
- EPIC-KITCHENS-100 (EK100)
- Something-Something V2 (SSV2)
- Kinetics-400 (K400) (download from the official source or OpenDataLab)
- ActivityNet
- HD-EPIC
- EPIC-SOUNDS
- HD-EPIC-SOUNDS
CA2ST uses frozen expert encoders and learns lightweight modules on top.
- Spatial expert (CLIP): use the official CLIP ViT-B/16 weights.
- Temporal expert (VideoMAE): use pretrained VideoMAE weights (dataset-specific).
- Audio expert (AST): use the official AST pretrained weights.
Download the expert checkpoints and set their paths in your training/evaluation scripts (e.g., `CLIP_PATH`, `VMAE_PATH`, `AST_PATH`).
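For example, a minimal sketch of the checkpoint wiring; the file names and locations below are assumptions, not the repository's required layout:

```python
# Hypothetical paths -- point these at wherever you saved the expert weights.
CLIP_PATH = "./checkpoints/clip_vit_b16.pth"        # spatial expert: official CLIP ViT-B/16
VMAE_PATH = "./checkpoints/videomae_b16_ek100.pth"  # temporal expert: dataset-specific VideoMAE
AST_PATH = "./checkpoints/ast_pretrained.pth"       # audio expert: official AST
```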
We provide multiple training scripts and configurations under scripts/.
Please refer to `scripts/README.md` for fine-tuning commands and reproducible settings.
TODO (coming soon): We will release polished fine-tuning commands and reproducible configurations in `scripts/README.md`.
Evaluation commands for EK100:

```bash
python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval
```

Evaluation commands for the other datasets:

```bash
python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --eval
```

Expert composition
- CAST: Spatial + Temporal (CLIP + VideoMAE)
- CAVA: Spatial + Audio (CLIP + AST)
- CA2ST: Spatial + Temporal + Audio (CLIP + VideoMAE + AST)
Cross-dataset evaluation
- We also report zero-shot / transfer evaluation on HD-* benchmarks using a model fine-tuned on a source dataset (e.g., EK100 → HD-EPIC, EPIC-SOUNDS → HD-EPIC-SOUNDS).
- In the tables below, we add Transfer (HD-*) columns to clearly separate in-domain and cross-dataset results.
EPIC-KITCHENS-100 (EK100)

| Method | Spatial | Temporal | Audio | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 (EK100) | Transfer Top-1 (HD-EPIC) |
|---|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | - | 50 | 16x2x3 | log / checkpoint | 49.3 | 18.4 |
| CAST | ResNet-50 | VideoMAE-B/16 (pre-trained on EK100) | - | 50 | 16x2x3 | log (TODO) / checkpoint (TODO) | 43.8 | - |
Something-Something V2 (SSV2)

| Method | Spatial | Temporal | Audio | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on SSV2) | - | 50 | 16x2x3 | log / checkpoint | 71.6 |
Kinetics-400 (K400)

| Method | Spatial | Temporal | Audio | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 70 | 16x5x3 | log / checkpoint | 85.3 |
EPIC-SOUNDS

| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 (EPIC-SOUNDS) | Transfer Top-1 (HD-EPIC-SOUNDS) |
|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | - | 45M | log (TODO) / checkpoint (TODO) | 47.8 | 27.8 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 60.3 | 28.6 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | AST | 62M | log (TODO) / checkpoint (TODO) | 61.0 | 28.1 |
| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 54.7 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 68.2 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 68.3 |
| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 91.6 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 92.9 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 93.3 |
| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 96.9 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 96.1 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 97.2 |
TODO: Release checkpoints/logs for the audio-visual datasets.
This project is built upon VideoMAE, MAE, CLIP, and BEiT. Thanks to the contributors of these great codebases.
This project is under the CC BY-NC 4.0 license. See LICENSE for details.
```bibtex
@article{lee2025ca2st,
  title={CA$^2$ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition},
  author={Lee, Jongseo and Chang, Joohyun and Lee, Dongho and Choi, Jinwoo},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
}
```
