This repository contains the official implementation of:
"DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description" Accepted at CVPR Workshop on AI for Content Creation (AI4CC) 2025 [arXiv]
DANTE-AD is a Transformer-based video description model designed to improve contextual understanding for audio description (AD). The model explicitly integrates two complementary visual representations:
- Frame-level features capturing fine-grained visual details
- Scene-level features capturing long-term and high-level context
These are sequentially fused in a dual-vision attention architecture, enabling more coherent and context-aware descriptions across scene boundaries.
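To make the fusion idea concrete, here is a minimal NumPy sketch of cross-attention between the two token streams. This is only an illustration of the general mechanism, not the actual DANTE-AD architecture; the batch size, token counts, and shared embedding dimension are made up:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv):
    # q: (B, Tq, d) queries; kv: (B, Tk, d) keys/values
    # (learned query/key/value projections omitted for brevity)
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

B, d = 2, 64                              # hypothetical batch size / shared dim
frame_tokens = np.random.randn(B, 32, d)  # fine-grained frame-level tokens
scene_tokens = np.random.randn(B, 4, d)   # long-term scene-level context tokens

# fusion sketch: frame-level tokens attend over the scene-level context
fused = cross_attend(frame_tokens, scene_tokens)
print(fused.shape)  # (2, 32, 64)
```

In the real model the two streams are fused sequentially inside the attention stack; this snippet only shows the single cross-attention step that underlies that design.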
- Clone the repo and install required packages:

```
git clone https://github.com/AdrienneDeganutti/DANTE-AD.git
cd DANTE-AD/
```

- Create and activate the Conda environment:

```
conda env create -f environment.yml
conda activate dante
```

- Install PyTorch:

```
pip install torch torchvision --index-url https://download.pytorch.org
```

- Configure dataset paths: update the paths to your dataset in `src/configs/datasets/cmd_ad.yaml`
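As a rough guide, the dataset config might look like the fragment below. Only `videos_dir` and `video_qformer_ft` are named elsewhere in this README; the other keys and all paths are illustrative assumptions, so check the shipped `cmd_ad.yaml` for the exact schema:

```yaml
# src/configs/datasets/cmd_ad.yaml -- illustrative fragment, not the shipped file
videos_dir: /data/CMD-AD/videos                        # segmented videos (online extraction)
video_qformer_ft: /data/CMD-AD/video_qformer_features  # pre-computed frame-level features
s4v_ft: /data/CMD-AD/s4v_features                      # hypothetical key: scene-level features
labels: /data/CMD-AD/labels                            # hypothetical key: ground-truth AD
```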
```
CMD-AD/
├── labels/                    # Ground-truth AD annotations
│   ├── train.tsv
│   └── eval.tsv
├── s4v_features/              # Scene-level S4V embeddings
│   ├── 2011/
│   │   └── *.pt
│   ├── ...
│   └── 2019/
├── video_qformer_features/    # Frame-level Video Q-Former embeddings (offline loading)
│   ├── 2011/
│   │   └── *.pt
│   ├── ...
│   └── 2019/
├── videos/                    # Segmented CMD-AD videos (online feature extraction)
│   ├── 2011/
│   │   └── *.mkv
│   ├── ...
│   └── 2019/
```
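A small helper can sanity-check that the top-level layout above is in place before training. The function and its name are ours, not part of the repository:

```python
from pathlib import Path

def missing_subdirs(root: Path) -> list[str]:
    """Return the expected top-level CMD-AD subdirectories that are absent."""
    expected = ["labels", "s4v_features", "video_qformer_features", "videos"]
    return [sub for sub in expected if not (root / sub).is_dir()]

# e.g. missing_subdirs(Path("/data/CMD-AD")) -> [] when the layout is complete
```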
The scene-level S4V features provided are processed from the action recognition module of Side4Video pre-trained on Kinetics-400.
Follow the instructions in `data_preparation/README.md`.
We provide our pre-computed scene-level features in the Dataset section.
Frame-level features can be handled in two ways:
Option 1: Online feature extraction.
- Set `"load_frame_features": false` in the training config file `src/configs/training_config.json`
- Set the video directory path `videos_dir` in the dataset config file `src/configs/datasets/cmd_ad.yaml`

Option 2: Offline feature loading.
- Set `"load_frame_features": true` in the training config file `src/configs/training_config.json`
- Set the feature path `video_qformer_ft` in the dataset config file `src/configs/datasets/cmd_ad.yaml`
We provide our pre-computed frame-level features in the Dataset section.
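For reference, the relevant entry in `src/configs/training_config.json` is a single boolean; set it to `false` for Option 1 (online extraction) or `true` for Option 2 (offline loading). The surrounding structure of the file is not shown in this README, so the fragment below is only the one key named above:

```json
{
  "load_frame_features": true
}
```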
The dataset used in this paper is a reduced version of the CMD-AD dataset. Due to encoding issues with some of the raw videos, our version is reduced from approximately 101k to 96k AD segments, as shown in the table below.
|  | CMD-AD | DANTE-AD |
|---|---|---|
| Total AD segments | 101,268 | 96,873 |
| Train AD segments | 93,952 | 89,798 |
| Eval AD segments | 7,316 | 7,075 |
To improve computational efficiency, we pre-compute the frame-level (CLIP) and scene-level (S4V) visual embeddings offline. We provide the pre-processed visual embeddings and ground-truth annotations below.
For the frame-level CLIP features, we process the following modules offline: EVA-CLIP feature extraction, Q-Former, positional embedding, and Video Q-Former. The output of the Video Q-Former has shape ([batch_size, 32, 768]). This can be reproduced by following the online frame-level feature extraction instructions in data preparation step 2 above.
The S4V features we provide are the output of the Side4Video module after Global Average Pooling over each frame within the video sequence. The output features are of shape ([batch_size, 1, 320]).
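The pooling step can be illustrated in a few lines of NumPy: global average pooling over the frame axis collapses per-frame 320-dimensional embeddings into a single scene-level token per segment. The frame count and batch size here are made up, and random values stand in for real Side4Video outputs:

```python
import numpy as np

# hypothetical: 16 per-frame Side4Video embeddings, 320-dim, batch of 4 segments
frame_embeddings = np.random.randn(4, 16, 320)

# global average pooling over the frame axis -> one scene-level token per segment
s4v_features = frame_embeddings.mean(axis=1, keepdims=True)

print(s4v_features.shape)  # (4, 1, 320), matching the ([batch_size, 1, 320]) above
```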
Download here: Preprocessed CMD-AD
- Set the path to your checkpoint in `src/configs/video_llama/model_config.yaml`
- Set `do_train: true` in `src/configs/training_config.json`
Run training:

```
python main.py --config src/configs/training_config.json
```

|  | Download Link |
|---|---|
| Base Movie-LLaMA2 weights | Movie-Llama2 weights |
| DANTE-AD trained checkpoint | DANTE-AD model checkpoint |
- Set the path to the checkpoint in `src/configs/video_llama/model_config.yaml`
- Set `do_train: false` and `do_eval: true` in `src/configs/training_config.json`
Run evaluation:

```
python main.py --config src/configs/training_config.json
```

DANTE-AD output on the CMD-AD dataset: eval-results.tsv
This work builds upon the following projects:
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- AutoAD: Movie Description in Context
- GRIT: Faster and Better Image-Captioning Transformer
- Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
If you find our project useful, please cite the paper with the following BibTeX:

```
@inproceedings{deganutti2025dante,
  title={DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description},
  author={Deganutti, Adrienne and Hadfield, Simon and Gilbert, Andrew},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition - Workshop on AI for Content Creation (AI4CC'25)},
  year={2025}
}
```