This repository is the official implementation of "CA2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition" (IEEE TPAMI).
- CA2ST extends our previous work, CAST: Cross-Attention in Space and Time for Video Action Recognition (NeurIPS 2023), by introducing an additional audio expert and cross-attention across audio, space, and time.
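To give a sense of the mechanism, here is a minimal cross-attention sketch between two expert token streams. This is an illustration only, not the repository's exact module; the class name, shapes, and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Sketch: tokens from one frozen expert attend to another expert's tokens."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q_tokens, kv_tokens):
        # One stream (e.g., spatial tokens from CLIP) queries the other
        # (e.g., temporal tokens from VideoMAE or audio tokens from AST).
        q = self.norm_q(q_tokens)
        kv = self.norm_kv(kv_tokens)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return q_tokens + out  # residual: keep the frozen expert's features


# Only lightweight adapters like this are trained; the experts stay frozen.
spatial = torch.randn(2, 196, 768)    # e.g., CLIP patch tokens (B, N, D)
temporal = torch.randn(2, 1568, 768)  # e.g., VideoMAE tube tokens
fused = CrossAttentionAdapter()(spatial, temporal)  # -> (2, 196, 768)
```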
We recommend using uv for reproducible environment setup.
```bash
uv venv --python 3.12
source .venv/bin/activate
uv sync
```

Note: this repository uses PyTorch/DeepSpeed/Triton. If you run into CUDA- or compiler-related build issues, please check your CUDA driver/toolkit setup and install system dependencies accordingly.
We evaluate CA2ST on 10 datasets spanning spatio-temporal and audio-visual benchmarks, including:
- EPIC-KITCHENS-100 (EK100)
- Something-Something V2 (SSV2)
- Kinetics-400 (K400) (download from the official source or OpenDataLab)
- ActivityNet
- HD-EPIC
- EPIC-SOUNDS
- HD-EPIC-SOUNDS
CA2ST uses frozen expert encoders and learns lightweight modules on top.
- Spatial expert (CLIP): use the official CLIP ViT-B/16 weights.
- Temporal expert (VideoMAE): use pretrained VideoMAE weights (dataset-specific).
- Audio expert (AST): use the official AST pretrained weights.
Download the expert checkpoints and set their paths in your training/evaluation scripts (e.g., `CLIP_PATH`, `VMAE_PATH`, `AST_PATH`).
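For example, a minimal sketch of the checkpoint wiring; the file names and locations below are assumptions, not the repository's required layout:

```python
# Hypothetical paths -- point these at wherever you saved the expert weights.
CLIP_PATH = "./checkpoints/clip_vit_b16.pth"        # spatial expert: official CLIP ViT-B/16
VMAE_PATH = "./checkpoints/videomae_b16_ek100.pth"  # temporal expert: dataset-specific VideoMAE
AST_PATH = "./checkpoints/ast_pretrained.pth"       # audio expert: official AST
```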
We provide multiple training scripts and configurations under scripts/.
Please refer to `scripts/README.md` for fine-tuning commands and reproducible settings.
TODO (coming soon): We will release polished fine-tuning commands and reproducible configurations in `scripts/README.md`.
Evaluation commands for EK100:

```bash
python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval
```

Evaluation commands for the other datasets:

```bash
python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --eval
```

Expert composition
- CAST: Spatial + Temporal (CLIP + VideoMAE)
- CAVA: Spatial + Audio (CLIP + AST)
- CA2ST: Spatial + Temporal + Audio (CLIP + VideoMAE + AST)
Cross-dataset evaluation
- We also report zero-shot / transfer evaluation on HD-* benchmarks using a model fine-tuned on a source dataset (e.g., EK100 → HD-EPIC, EPIC-SOUNDS → HD-EPIC-SOUNDS).
- In the tables below, we add Transfer (HD-*) columns to clearly separate in-domain and cross-dataset results.
EPIC-KITCHENS-100 (EK100)

| Method | Spatial | Temporal | Audio | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 (EK100) | Transfer Top-1 (HD-EPIC) |
|---|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | - | 50 | 16x2x3 | log / checkpoint | 49.3 | 18.4 |
| CAST | ResNet-50 | VideoMAE-B/16 (pre-trained on EK100) | - | 50 | 16x2x3 | log (TODO) / checkpoint (TODO) | 43.8 | - |
Something-Something V2 (SSV2)

| Method | Spatial | Temporal | Audio | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on SSV2) | - | 50 | 16x2x3 | log / checkpoint | 71.6 |
Kinetics-400 (K400)

| Method | Spatial | Temporal | Audio | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 70 | 16x5x3 | log / checkpoint | 85.3 |
EPIC-SOUNDS

| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 (EPIC-SOUNDS) | Transfer Top-1 (HD-EPIC-SOUNDS) |
|---|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | - | 45M | log (TODO) / checkpoint (TODO) | 47.8 | 27.8 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 60.3 | 28.6 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | AST | 62M | log (TODO) / checkpoint (TODO) | 61.0 | 28.1 |
| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 54.7 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 68.2 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 68.3 |
| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 91.6 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 92.9 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 93.3 |
| Method | Spatial | Temporal | Audio | Params | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | - | 45M | log (TODO) / checkpoint (TODO) | 96.9 |
| CAVA | CLIP-B/16 | - | AST | 44M | log (TODO) / checkpoint (TODO) | 96.1 |
| CA2ST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | AST | 62M | log (TODO) / checkpoint (TODO) | 97.2 |
TODO: Release checkpoints/logs for the audio-visual datasets.
This project is built upon VideoMAE, MAE, CLIP, and BEiT. Thanks to the contributors of these great codebases.
This project is under the CC BY-NC 4.0 license. See LICENSE for details.
```bibtex
@article{lee2025ca2st,
  title={CA$^2$ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition},
  author={Lee, Jongseo and Chang, Joohyun and Lee, Dongho and Choi, Jinwoo},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
}
```
