ManuBenavent/AAG

[WACV 2026] Action Anticipation at a Glimpse:
To What Extent Can Multimodal Cues Replace Video?


🔎 About

Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, humans can often predict upcoming actions by observing a single moment of a scene, given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, building on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG performs competitively with both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.
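
To make the idea of combining single-frame cues concrete, here is a minimal, purely illustrative sketch. It is not the AAG architecture: the feature sizes, the fusion by concatenation, the one-hot encoding of the previous action, and the linear scoring layer are all assumptions chosen to keep the example small and dependency-free.

```python
import random

def fuse(rgb_feat, depth_feat, context_feat):
    """Concatenate per-modality feature vectors (fusion choice is an assumption)."""
    return list(rgb_feat) + list(depth_feat) + list(context_feat)

def linear_scores(features, weights, bias):
    """Compute one score per candidate action: weights[k] . features + bias[k]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def anticipate(rgb_feat, depth_feat, context_feat, weights, bias):
    """Return the index of the highest-scoring next action."""
    fused = fuse(rgb_feat, depth_feat, context_feat)
    scores = linear_scores(fused, weights, bias)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy inputs standing in for features extracted from ONE frame.
rng = random.Random(0)
rgb = [rng.random() for _ in range(4)]      # hypothetical RGB features
depth = [rng.random() for _ in range(2)]    # hypothetical depth features
context = [1.0, 0.0, 0.0]                   # one-hot of the previously recognized action

num_actions = 3
fused_dim = len(rgb) + len(depth) + len(context)
weights = [[rng.uniform(-1, 1) for _ in range(fused_dim)] for _ in range(num_actions)]
bias = [0.0] * num_actions

predicted = anticipate(rgb, depth, context, weights, bias)
print("anticipated action index:", predicted)
```

In a real system the random weights would be learned, and the context vector could instead come from an embedded textual summary; the sketch only shows how prior-action context can sit alongside single-frame visual features.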

💻 Usage

🛠️ Installation

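  1. Clone the repo
    git clone https://github.com/ManuBenavent/AAG.git
    cd AAG
  2. Install the requirements
    1. Create the Docker image (Docker requires lowercase image names, so the tag is aag)
      cd docker
      docker build -t aag .
    2. Run the container (--gpus all requires the NVIDIA Container Toolkit on the host)
      docker run --gpus all -it --rm -v <path_to_repo>:/AAG aag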

🚀 Training/Inference

  1. Prepare the configuration file with your own paths and hyperparameters (see configs/aag_ikea.yaml for an example).
  2. Run training or inference:
    bash scripts/run_train.sh configs/aag_ikea.yaml
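
If you want to launch several configurations in sequence, a small wrapper around the command above can check that each config file exists before starting a long run. The sketch below is an assumption-laden convenience, not part of the repository: the helper names `build_train_command` and `run_all` are hypothetical, and only `scripts/run_train.sh` and `configs/aag_ikea.yaml` come from this README.

```python
import subprocess
from pathlib import Path

def build_train_command(config_path, script="scripts/run_train.sh"):
    """Return the argv for the documented launch command, failing early if the config is missing."""
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError(f"config not found: {config}")
    return ["bash", script, str(config)]

def run_all(config_paths, dry_run=True):
    """Validate every config first, then run them one after another unless dry_run is set."""
    commands = [build_train_command(p) for p in config_paths]
    if not dry_run:
        for cmd in commands:
            # check=True stops the sequence at the first failing run.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_all(["configs/aag_ikea.yaml"], dry_run=False)` from the repository root would then be equivalent to the single command above, with the existence check applied first.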

📄 Citation

If you find this work useful in your research, please consider citing:

@inproceedings{aag,
  title={Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?},
  author={Benavent-Lledo, Manuel and Bacharidis, Konstantinos and Manousaki, Victoria and Papoutsakis, Konstantinos and Argyros, Antonis and Garcia-Rodriguez, Jose},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026},
  year={2026}
}
