
DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

This repository contains the official implementation of:

"DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description", accepted at the CVPR Workshop on AI for Content Creation (AI4CC) 2025. [arXiv]

Overview

DANTE-AD is a Transformer-based video description model designed to improve contextual understanding for audio description (AD). The model explicitly integrates two complementary visual representations:

  • Frame-level features capturing fine-grained visual details

  • Scene-level features capturing long-term and high-level context

These are sequentially fused in a dual-vision attention architecture, enabling more coherent and context-aware descriptions across scene boundaries.
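The sequential fusion idea can be sketched minimally as follows. This is an illustrative single-head attention toy in NumPy with random features and no learned projections — not the actual DANTE-AD implementation. For simplicity both streams share a width of 768 (32 frame-level tokens and a single scene-level token; the token counts mirror the feature shapes given in the Dataset section below):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Nq, Nkv)
    return softmax(scores, axis=-1) @ keys_values   # (Nq, d)

frame_feats = np.random.randn(32, 768)  # fine-grained frame-level tokens
scene_feats = np.random.randn(1, 768)   # long-term scene-level context token

# Sequential fusion: frame tokens first attend among themselves,
# then attend to the scene-level context.
fused = attention(frame_feats, frame_feats)
fused = attention(fused, scene_feats)
print(fused.shape)  # (32, 768)
```

The real model uses learned query/key/value projections and multiple heads; this sketch only shows the two-stage (frame, then scene) attention ordering.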


Environment Setup

  1. Clone the repo and install required packages:
git clone https://github.com/AdrienneDeganutti/DANTE-AD.git
cd DANTE-AD/
  2. Create and activate the Conda environment:
conda env create -f environment.yml
conda activate dante
  3. Install PyTorch:
pip install torch torchvision --index-url https://download.pytorch.org
  4. Configure dataset paths:
  • Update the paths to your dataset in src/configs/datasets/cmd_ad.yaml
CMD-AD/
├── labels/                          # Ground-truth AD annotations
│   ├── train.tsv
│   └── eval.tsv
├── s4v_features/                    # Scene-level S4V embeddings
│   ├── 2011/
│   │   └── *.pt
│   ├── ...
│   └── 2019/
├── video_qformer_features/          # Frame-level Video Q-Former embeddings (offline loading)
│   ├── 2011/
│   │   └── *.pt
│   ├── ...
│   └── 2019/
├── videos/                          # Segmented CMD-AD videos (online feature extraction)
│   ├── 2011/
│   │   └── *.mkv
│   ├── ...
│   └── 2019/

Data Preparation

Step 1: Scene-level features (S4V)

The scene-level S4V features we provide are extracted with the action-recognition module of Side4Video, pre-trained on Kinetics-400.

Follow the instructions in data_preparation/README.md

We provide our pre-computed scene-level features in the Dataset section.

Step 2: Frame-level features

Frame-level features can be handled in two ways:

Option 1: Online feature extraction.

  • Set "load_frame_features": false in the training config file src/configs/training_config.json
  • Set video directory path videos_dir in the dataset config file src/configs/datasets/cmd_ad.yaml

Option 2: Offline feature loading.

  • Set "load_frame_features": true in the training config file src/configs/training_config.json
  • Set feature path video_qformer_ft in the dataset config file src/configs/datasets/cmd_ad.yaml
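The two options above toggle a single flag in the training config. A minimal illustrative excerpt is shown below — load_frame_features, do_train, and do_eval are the only flags named in this README, and the full src/configs/training_config.json contains other settings:

```json
{
  "load_frame_features": true,
  "do_train": true,
  "do_eval": false
}
```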

We provide our pre-computed frame-level features in the Dataset section.

Dataset

The dataset used in this paper is a reduced version of the CMD-AD dataset. Due to encoding issues with some of the raw videos, our version is reduced from approximately 101k to 96k AD segments, as shown in the table below.

                     CMD-AD    DANTE-AD
Total AD segments    101,268   96,873
Train AD segments    93,952    89,798
Eval AD segments     7,316     7,075

Pre-processed data

To improve computational efficiency, we pre-compute the frame-level (CLIP) and scene-level (S4V) visual embeddings offline. We provide the pre-processed visual embeddings and ground-truth annotations below.

For the frame-level CLIP features, we run the following modules offline: EVA-CLIP feature extraction, the Q-Former, positional embedding, and the Video Q-Former. The output of the Video Q-Former has shape [batch_size, 32, 768]. This can be reproduced by following the online frame-level feature extraction instructions in Data Preparation Step 2 above.

The S4V features we provide are the output of the Side4Video module after global average pooling over each frame in the video sequence. The output features have shape [batch_size, 1, 320].

Download here: Preprocessed CMD-AD

Training

  • Set the path to your checkpoint in src/configs/video_llama/model_config.yaml

  • Set do_train: true in src/configs/training_config.json

Run training:

python main.py --config src/configs/training_config.json

Model Checkpoints

Model                         Download Link
Base Movie-LLaMA2 weights     Movie-Llama2 weights
DANTE-AD trained checkpoint   DANTE-AD model checkpoint

Evaluation

  • Set the path to the checkpoint in src/configs/video_llama/model_config.yaml

  • Set do_train: false and do_eval: true in src/configs/training_config.json

Run evaluation:

python main.py --config src/configs/training_config.json

DANTE-AD output on the CMD-AD dataset: eval-results.tsv

Acknowledgment

This work builds upon the following projects:

  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
  • AutoAD: Movie Description in Context
  • GRIT: Faster and Better Image-Captioning Transformer
  • Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

Citation

If you find our project useful, please cite the paper with the following BibTeX:

@inproceedings{deganutti2025dante,
  title={DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description},
  author={Deganutti, Adrienne and Hadfield, Simon and Gilbert, Andrew},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition - Workshop on AI for Content Creation (AI4CC'25)},
  year={2025}
}
