[IEEE TPAMI] Official implementation of the paper "CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition"

KHU-VLL/CA2ST

This repository is the official implementation of "CA2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition" (IEEE TPAMI).

[Figure: CA2ST framework]

Installation (uv)

We recommend using uv for reproducible environment setup.

1) Create a virtual environment

uv venv --python 3.12
source .venv/bin/activate

2) Install dependencies

uv sync

Note: This repository depends on PyTorch, DeepSpeed, and Triton. If you run into CUDA- or compiler-related build errors, check your CUDA driver and toolkit setup and install the required system dependencies.
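Before debugging a build failure, it can help to confirm which of these packages are visible in the active environment. The snippet below is a minimal sketch that uses only the standard library. The helper name `find_missing` is hypothetical and not part of this repository, and the import names `torch`, `deepspeed`, and `triton` are assumed to match the installed distributions.

```python
import importlib.util


def find_missing(packages):
    """Return the names in `packages` that cannot be located in this environment."""
    missing = []
    for name in packages:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names for PyTorch, DeepSpeed, and Triton.
    missing = find_missing(["torch", "deepspeed", "triton"])
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the packages can be located. It does not verify that they were built against a compatible CUDA toolkit.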

Data Preparation

We evaluate CA2ST on 10 datasets spanning spatio-temporal and audio-visual benchmarks.

Spatio-temporal (visual-only) benchmarks

Audio-visual benchmarks

Expert Model Preparation

CA2ST uses frozen expert encoders and learns lightweight modules on top.

  • Spatial expert (CLIP): use the official CLIP ViT-B/16 weights.
  • Temporal expert (VideoMAE): use pretrained VideoMAE weights (dataset-specific).
  • Audio expert (AST): use the official AST pretrained weights.

Download the expert checkpoints and set their paths in your training/evaluation scripts (e.g., CLIP_PATH, VMAE_PATH, AST_PATH).
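A quick pre-flight check can catch a wrong checkpoint path before a long job starts. The following is a sketch under stated assumptions: it treats `CLIP_PATH`, `VMAE_PATH`, and `AST_PATH` as environment variables, whereas the repository's scripts may define them differently, and the helper `missing_checkpoints` is hypothetical.

```python
import os
from pathlib import Path

# Variable names follow the README's example; reading them from the
# environment is an assumption made for this sketch.
EXPERT_VARS = ("CLIP_PATH", "VMAE_PATH", "AST_PATH")


def missing_checkpoints(env=None, names=EXPERT_VARS):
    """Return {variable: reason} for each expert path that is unset or absent on disk."""
    env = os.environ if env is None else env
    problems = {}
    for name in names:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).exists():
            problems[name] = f"nothing found at {value}"
    return problems


if __name__ == "__main__":
    for name, reason in missing_checkpoints().items():
        print(f"{name}: {reason}")
```

Pass a shorter `names` tuple when a variant does not use every expert, for example `("CLIP_PATH", "AST_PATH")` for a spatial + audio model.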

Fine-tuning

We provide multiple training scripts and configurations under scripts/.

Please refer to:

TODO (coming soon): We will release polished fine-tuning commands and reproducible configurations in scripts/README.md.

Evaluation

Evaluation command for EK100:

python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval

Evaluation command for all other datasets:

python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --eval
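The only difference between the two commands is the `--composition` flag, which is used for EK100. The sketch below assembles the command programmatically. The helper `build_eval_command` and the `"EK100"` dataset label are hypothetical names for illustration. The script name and flags are the ones shown above, and the real script may need further arguments (data paths, batch size) that are not listed here.

```python
import shlex


def build_eval_command(weights, dataset):
    """Build the evaluation command from this README as an argument list."""
    cmd = ["python", "./run_bidirection_compo.py", "--fine_tune", weights]
    # --composition is added only for EK100, following the README.
    # The "EK100" label is a hypothetical selector, not a script argument.
    if dataset.upper() == "EK100":
        cmd.append("--composition")
    cmd.append("--eval")
    return cmd


if __name__ == "__main__":
    # "ckpt.pth" is a placeholder for your fine-tuned weights.
    print(shlex.join(build_eval_command("ckpt.pth", "EK100")))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting issues.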

Model Zoo

Models

Expert composition

  • CAST: Spatial + Temporal (CLIP + VideoMAE)
  • CAVA: Spatial + Audio (CLIP + AST)
  • CA2ST: Spatial + Temporal + Audio (CLIP + VideoMAE + AST)
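The three variants differ only in which experts they combine, which can be captured in a small lookup table. This is an illustrative sketch, not code from the repository; the names `COMPOSITIONS`, `ROLES`, `modalities`, and `uses_audio` are hypothetical.

```python
# Expert composition per variant, as listed above.
COMPOSITIONS = {
    "CAST": ("CLIP", "VideoMAE"),
    "CAVA": ("CLIP", "AST"),
    "CA2ST": ("CLIP", "VideoMAE", "AST"),
}

# Role of each expert, as described under Expert Model Preparation.
ROLES = {"CLIP": "spatial", "VideoMAE": "temporal", "AST": "audio"}


def modalities(variant):
    """Return the roles covered by a variant, in the order its experts are listed."""
    return [ROLES[expert] for expert in COMPOSITIONS[variant]]


def uses_audio(variant):
    """True if the variant includes the audio expert (AST)."""
    return "AST" in COMPOSITIONS[variant]


if __name__ == "__main__":
    for name in COMPOSITIONS:
        print(name, modalities(name), "audio" if uses_audio(name) else "visual-only")
```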

Cross-dataset evaluation

  • We also report zero-shot / transfer evaluation on HD-* benchmarks using a model fine-tuned on a source dataset (e.g., EK100 → HD-EPIC, EPIC-SOUNDS → HD-EPIC-SOUNDS).
  • In the tables below, we add Transfer (HD-*) columns to clearly separate in-domain and cross-dataset results.

EPIC-KITCHENS-100 (EK100)

| Method | Spatial | Temporal | Audio | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 (EK100) | Transfer Top-1 (HD-EPIC) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | - | 50 | 16x2x3 | log / checkpoint | 49.3 | 18.4 |
| CAST | ResNet-50 | VideoMAE-B/16 (pre-trained on EK100) | - | 50 | 16x2x3 | log (TODO) / checkpoint (TODO) | 43.8 | - |
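The "#Frames x Clips x Crops" column can be unpacked into a per-video inference cost. The sketch below assumes the usual multi-view testing convention, in which each video is evaluated on clips x crops views of the given number of frames. The README does not spell this out, so treat that reading as an assumption. The helper `parse_views` is hypothetical.

```python
def parse_views(spec):
    """Split a 'frames x clips x crops' string such as '16x2x3' into counts."""
    frames, clips, crops = (int(part) for part in spec.lower().split("x"))
    views = clips * crops  # number of views (forward passes) per video
    return {
        "frames": frames,
        "clips": clips,
        "crops": crops,
        "views": views,
        "total_frames": frames * views,  # frames processed per video
    }


if __name__ == "__main__":
    for spec in ("16x2x3", "16x5x3"):
        print(spec, parse_views(spec))
```

Under this reading, the 16x2x3 setting corresponds to 6 views (96 frames) per video, and the 16x5x3 setting used for K400 corresponds to 15 views (240 frames).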

Something-Something V2 (SSV2)

| Method | Spatial | Temporal | Audio | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on SSV2) | - | 50 | 16x2x3 | log / checkpoint | 71.6 |

Kinetics-400 (K400)

| Method | Spatial | Temporal | Audio | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 70 | 16x5x3 | log / checkpoint | 85.3 |

EPIC-SOUNDS (Audio-Visual)

| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 (EPIC-SOUNDS) | Transfer Top-1 (HD-EPIC-SOUNDS) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | - | 45M | log (TODO) / checkpoint (TODO) | 47.8 | 27.8 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 60.3 | 28.6 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | AST | 62M | log (TODO) / checkpoint (TODO) | 61.0 | 28.1 |

VGG-Sound (Audio-Visual)

| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 54.7 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 68.2 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 68.3 |

KineticsSound (Audio-Visual)

| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 91.6 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 92.9 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 93.3 |

UCF-101 (Audio-Visual)

| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 96.9 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 96.1 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 97.2 |

TODO: Release checkpoints/logs for the audio-visual datasets.

Acknowledgements

This project is built upon VideoMAE, MAE, CLIP, and BEiT. Thanks to the contributors of these great codebases.

License

This project is released under the CC BY-NC 4.0 license. See LICENSE for details.

Citation

@article{lee2025ca2st,
  title={CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition},
  author={Lee, Jongseo and Chang, Joohyun and Lee, Dongho and Choi, Jinwoo},
  journal={TPAMI},
  year={2025},
}
