This repository contains the official implementation of:
"DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description" Accepted at CVPR Workshop on AI for Content Creation (AI4CC) 2025 [arXiv]
DANTE-AD is a Transformer-based video description model designed to improve contextual understanding for audio description (AD). The model explicitly integrates two complementary visual representations:
- Frame-level features capturing fine-grained visual details
- Scene-level features capturing long-term and high-level context
These are sequentially fused in a dual-vision attention architecture, enabling more coherent and context-aware descriptions across scene boundaries.
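To make the fusion idea concrete, here is a minimal NumPy sketch of cross-attention between the two token streams. This is only an illustration of the general mechanism, not the actual DANTE-AD architecture; the batch size, token counts, and shared embedding dimension are made up:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv):
    # q: (B, Tq, d) queries; kv: (B, Tk, d) keys/values
    # (learned query/key/value projections omitted for brevity)
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

B, d = 2, 64                              # hypothetical batch size / shared dim
frame_tokens = np.random.randn(B, 32, d)  # fine-grained frame-level tokens
scene_tokens = np.random.randn(B, 4, d)   # long-term scene-level context tokens

# fusion sketch: frame-level tokens attend over the scene-level context
fused = cross_attend(frame_tokens, scene_tokens)
print(fused.shape)  # (2, 32, 64)
```

In the real model the two streams are fused sequentially inside the attention stack; this snippet only shows the single cross-attention step that underlies that design.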
- Clone the repo and install required packages:

```
git clone https://github.com/AdrienneDeganutti/DANTE-AD.git
cd DANTE-AD/
```

- Create and activate the Conda environment:

```
conda env create -f environment.yml
conda activate dante
```

- Install PyTorch:

```
pip install torch torchvision --index-url https://download.pytorch.org
```

- Configure dataset paths: update the paths to your dataset in `src/configs/datasets/cmd_ad.yaml`
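As a rough guide, the dataset config might look like the fragment below. Only `videos_dir` and `video_qformer_ft` are named elsewhere in this README; the other keys and all paths are illustrative assumptions, so check the shipped `cmd_ad.yaml` for the exact schema:

```yaml
# src/configs/datasets/cmd_ad.yaml -- illustrative fragment, not the shipped file
videos_dir: /data/CMD-AD/videos                        # segmented videos (online extraction)
video_qformer_ft: /data/CMD-AD/video_qformer_features  # pre-computed frame-level features
s4v_ft: /data/CMD-AD/s4v_features                      # hypothetical key: scene-level features
labels: /data/CMD-AD/labels                            # hypothetical key: ground-truth AD
```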
```
CMD-AD/
├── labels/                    # Ground-truth AD annotations
│   ├── train.tsv
│   └── eval.tsv
├── s4v_features/              # Scene-level S4V embeddings
│   ├── 2011/
│   │   └── *.pt
│   ├── ...
│   └── 2019/
├── video_qformer_features/    # Frame-level Video Q-Former embeddings (offline loading)
│   ├── 2011/
│   │   └── *.pt
│   ├── ...
│   └── 2019/
├── videos/                    # Segmented CMD-AD videos (online feature extraction)
│   ├── 2011/
│   │   └── *.mkv
│   ├── ...
│   └── 2019/
```
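A small helper can sanity-check that the top-level layout above is in place before training. The function and its name are ours, not part of the repository:

```python
from pathlib import Path

def missing_subdirs(root: Path) -> list[str]:
    """Return the expected top-level CMD-AD subdirectories that are absent."""
    expected = ["labels", "s4v_features", "video_qformer_features", "videos"]
    return [sub for sub in expected if not (root / sub).is_dir()]

# e.g. missing_subdirs(Path("/data/CMD-AD")) -> [] when the layout is complete
```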
The scene-level S4V features provided are processed from the action recognition module of Side4Video pre-trained on Kinetics-400.
Follow the instructions in `data_preparation/README.md`.
We provide our pre-computed scene-level features in the Dataset section.
Frame-level features can be handled in two ways:
Option 1: Online feature extraction.
- Set `"load_frame_features": false` in the training config file `src/configs/training_config.json`
- Set the video directory path `videos_dir` in the dataset config file `src/configs/datasets/cmd_ad.yaml`

Option 2: Offline feature loading.
- Set `"load_frame_features": true` in the training config file `src/configs/training_config.json`
- Set the feature path `video_qformer_ft` in the dataset config file `src/configs/datasets/cmd_ad.yaml`
We provide our pre-computed frame-level features in the Dataset section.
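For reference, the relevant entry in `src/configs/training_config.json` is a single boolean; set it to `false` for Option 1 (online extraction) or `true` for Option 2 (offline loading). The surrounding structure of the file is not shown in this README, so the fragment below is only the one key named above:

```json
{
  "load_frame_features": true
}
```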
The dataset used in this paper is a reduced version of the CMD-AD dataset. Due to encoding issues with some of the raw videos, our version is reduced from approximately 101k to 96k AD segments, as shown in the table below.
|  | CMD-AD | DANTE-AD |
|---|---|---|
| Total AD segments | 101,268 | 96,873 |
| Train AD segments | 93,952 | 89,798 |
| Eval AD segments | 7,316 | 7,075 |
To improve computational efficiency, we pre-compute the frame-level (CLIP) and scene-level (S4V) visual embeddings offline. We provide the pre-processed visual embeddings and ground-truth annotations below.
For the frame-level CLIP features, we process the following modules offline: EVA-CLIP feature extraction, Q-Former, positional embedding, and Video Q-Former. The output of the Video Q-Former has shape ([batch_size, 32, 768]). This can be reproduced by following the online frame-level feature extraction instructions in data preparation step 2 above.
The S4V features we provide are the output of the Side4Video module after Global Average Pooling over each frame within the video sequence. The output features are of shape ([batch_size, 1, 320]).
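The pooling step can be illustrated in a few lines of NumPy: global average pooling over the frame axis collapses per-frame 320-dimensional embeddings into a single scene-level token per segment. The frame count and batch size here are made up, and random values stand in for real Side4Video outputs:

```python
import numpy as np

# hypothetical: 16 per-frame Side4Video embeddings, 320-dim, batch of 4 segments
frame_embeddings = np.random.randn(4, 16, 320)

# global average pooling over the frame axis -> one scene-level token per segment
s4v_features = frame_embeddings.mean(axis=1, keepdims=True)

print(s4v_features.shape)  # (4, 1, 320), matching the ([batch_size, 1, 320]) above
```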
Download here: Preprocessed CMD-AD
- Set the path to your checkpoint in `src/configs/video_llama/model_config.yaml`
- Set `do_train: true` in `src/configs/training_config.json`
Run training:

```
python main.py --config src/configs/training_config.json
```

|  | Download Link |
|---|---|
| Base Movie-LLaMA2 weights | Movie-Llama2 weights |
| DANTE-AD trained checkpoint | DANTE-AD model checkpoint |
- Set the path to the checkpoint in `src/configs/video_llama/model_config.yaml`
- Set `do_train: false` and `do_eval: true` in `src/configs/training_config.json`
Run evaluation:

```
python main.py --config src/configs/training_config.json
```

DANTE-AD output on the CMD-AD dataset: eval-results.tsv
This work builds upon the following projects:
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- AutoAD: Movie Description in Context
- GRIT: Faster and Better Image-Captioning Transformer
- Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
If you find our project useful, please cite the paper with the following BibTeX:

```
@inproceedings{deganutti2025dante,
  title={DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description},
  author={Deganutti, Adrienne and Hadfield, Simon and Gilbert, Andrew},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition - Workshop on AI for Content Creation (AI4CC'25)},
  year={2025}
}
```