python 3.8
librosa 0.7.2
numpy 1.19.0
torch 1.4.0
torchvision 0.5.0
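To confirm that a local environment matches the versions listed above, a small check along the following lines may help. This is only a sketch: the `REQUIRED` table and the `check_environment` helper are hypothetical names introduced here, not part of this repository, and the check assumes the packages were installed under the distribution names shown.

```python
import sys
from importlib import metadata  # standard library since Python 3.8

# Versions copied from the requirements list above.
# REQUIRED and check_environment are illustrative names, not part of this repo.
REQUIRED = {
    "librosa": "0.7.2",
    "numpy": "1.19.0",
    "torch": "1.4.0",
    "torchvision": "0.5.0",
}


def check_environment(required=REQUIRED):
    """Return {package: (required_version, installed_version_or_None)} for mismatches."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # torch wheels may carry a local suffix such as "1.4.0+cu100"; ignore it.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "(this README lists 3.8)")
    for name, (wanted, installed) in check_environment().items():
        print(f"{name}: required {wanted}, found {installed or 'not installed'}")
```

An empty result from `check_environment()` means every listed package is installed at the listed version.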
- Download the VoxCeleb and VGGFace datasets.
- Latest paper list: audio-visual matching
- VoxCeleb1: WAV audio data, 1,251 people in total, 39 GB after decompression.
  Baidu Cloud link: VoxCeleb1
- Decompression command:
  zip -s 0 split.zip --out unsplit.zip
  unzip unsplit.zip
- Vox1 official website: VoxCeleb1
- VoxCeleb2: MP4 video data (the files include audio), 5,994 people in total, 255 GB after decompression.
  Baidu Cloud link: VoxCeleb2
- Decompression command:
  zip -s 0 vox2_mp4_dev.zip --out unsplit.zip
  unzip unsplit.zip
- Vox2 official website: VoxCeleb2
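
The merge-then-extract steps listed for the VoxCeleb archives can also be scripted. The following is a minimal sketch, assuming the `zip` and `unzip` command-line tools are on the `PATH`; `build_unpack_commands` and `unpack` are hypothetical helper names, and the `-d` output-directory option is an addition not used in the commands above.

```python
import subprocess


def build_unpack_commands(split_zip, merged_zip="unsplit.zip", out_dir="."):
    """Return the two decompression commands as argument lists."""
    return [
        # Step 1: merge the split archive parts into a single zip file.
        ["zip", "-s", "0", str(split_zip), "--out", str(merged_zip)],
        # Step 2: extract the merged archive into out_dir.
        ["unzip", str(merged_zip), "-d", str(out_dir)],
    ]


def unpack(split_zip, merged_zip="unsplit.zip", out_dir="."):
    """Run both steps in order, raising CalledProcessError on the first failure."""
    for cmd in build_unpack_commands(split_zip, merged_zip, out_dir):
        subprocess.run(cmd, check=True)
```

For example, `unpack("vox2_mp4_dev.zip", out_dir="data/vox2")` would merge and extract the VoxCeleb2 archive; the `data/vox2` path is only an illustration. Note that merging needs enough free disk space for both the split parts and the merged file.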
If you think this toolkit or the results are helpful to you and your research, please cite us!
If you are interested in our mission, you can contact us for data sharing.
@article{wang2025adaptive,
title={Adaptive Interaction and Correction Attention Network for Audio-Visual Matching},
author={Wang, Jiaxiang and Zheng, Aihua and Liu, Lei and Li, Chenglong and He, Ran and Tang, Jin},
journal={IEEE Transactions on Information Forensics and Security},
volume={20},
number={},
pages={7558-7571},
year={2025},
publisher={IEEE}
}