Official PyTorch implementation for the following paper:

> **AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation**<br>
> Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung<br>
> ACM MM 2025<br>
> [Paper] [Project]
## Debug Data

To help verify your setup and ensure reproducibility, we provide the following debug data.
| Path | Dataset | Debug data |
|---|---|---|
| data/LibriSpeech_debug | LibriSpeech | download |
| data/LRS3_debug | LRS3 | download |
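After downloading, a quick way to confirm the debug data landed at the paths listed in the table above:

```python
from pathlib import Path

# Check that the downloaded debug data sits at the expected locations.
for d in ("data/LibriSpeech_debug", "data/LRS3_debug"):
    print(f"{d}: {'found' if Path(d).is_dir() else 'missing'}")
```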
## Pretrained Models

| Path | Train Dataset | Model |
|---|---|---|
| ckpts/AlignDiT_pretrain_hifigan_16k_LibriSpeech_notext/model_500000.pt | LibriSpeech | download |
| ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt | LRS3 | download |
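To sanity-check a downloaded checkpoint before running the full pipeline, here is a minimal sketch assuming standard `torch.save` serialization; the top-level layout of the file is an assumption, so print the keys and compare against the training scripts:

```python
import torch

# Inspect a downloaded checkpoint on CPU. Whether the file holds a raw
# state dict or a dict with keys like "model"/"ema" is an assumption.
ckpt = torch.load(
    "ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt",
    map_location="cpu",
)
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```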
## Generated Samples

We provide audio samples generated by AlignDiT. For the VTS task, we use a lip-reading model (Auto-AVSR) to transcribe text from the silent video before inference.
| Task | Test Dataset | WER ↓ | AVSync ↑ | spkSIM ↑ | Samples |
|---|---|---|---|---|---|
| ADR (automated dialogue replacement) | LRS3-cross | 1.401 | 0.751 | 0.515 | download |
| VTS (video-to-speech synthesis) | LRS3-cross | 19.513 | 0.688 | 0.508 | download |
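For reference, WER is the word-level edit distance between the ASR transcript of the generated speech and the ground-truth text. A minimal sketch of the metric itself using the `jiwer` package; which ASR model and text normalization the eval scripts actually use is not shown here:

```python
import jiwer

# Word error rate: (substitutions + deletions + insertions) / reference words.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {100 * jiwer.wer(reference, hypothesis):.3f}%")
```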
## Installation

```bash
conda create -y -n aligndit python=3.10 && conda activate aligndit
git clone https://github.com/kaistmm/AlignDiT.git && cd AlignDiT

# Install PyTorch built for CUDA 12.1
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

pip install -e .
pip install -e .[eval]  # For evaluation
```

## Data Preparation

We crop the mouth region from each video following Auto-AVSR and place the resulting videos in `data/LRS3_debug/autoavsr/video`. These cropped videos are used for both training and inference; a quick sanity check is sketched below.
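A minimal sketch for spot-checking a cropped clip with torchvision. The file name is hypothetical, and the typical Auto-AVSR crop format (96x96 grayscale at 25 fps) is an assumption, not a value verified against this repo:

```python
import torchvision

# Read a cropped mouth video and print its shape and metadata.
frames, _, info = torchvision.io.read_video(
    "data/LRS3_debug/autoavsr/video/example.mp4",  # hypothetical file name
    pts_unit="sec",
)
print(frames.shape, info)  # (T, H, W, C); frame rate in info["video_fps"]
```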
Prepare the datasets:

```bash
bash src/aligndit/run/misc/prepare_librispeech_notext.sh
bash src/aligndit/run/misc/prepare_lrs3.sh
```

## Feature Extraction

Extract mel-spectrograms and HuBERT features:

```bash
bash src/aligndit/run/misc/extract_mel.sh
bash src/aligndit/run/misc/extract_hubert.sh
```
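As a rough sketch of the kind of feature `extract_mel.sh` produces; the exact parameter values (`n_fft`, `hop_length`, `n_mels`) and the input path are assumptions, and the authoritative configuration is in the script and the code it calls:

```python
import torchaudio

# Sketch of mel-spectrogram extraction at 16 kHz; parameters are assumptions.
wav, sr = torchaudio.load("data/LibriSpeech_debug/example.flac")  # hypothetical path
wav = torchaudio.functional.resample(wav, sr, 16000)

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)
mel = mel_transform(wav)  # (channels, n_mels, frames)
print(mel.shape)
```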
Extracting AV-HuBERT features from the silent videos requires Fairseq and AV-HuBERT:

```bash
bash src/aligndit/run/misc/extract_avhubert_from_only_video.sh
```

## Training

```bash
# 1. Pre-train on LibriSpeech for 500k updates
bash src/aligndit/run/train/pretrain.sh
# 2. Fine-tune on LRS3 for 400k updates
bash src/aligndit/run/train/finetune.sh
```

## Inference

```bash
# ADR (automated dialogue replacement)
bash src/aligndit/run/eval/infer.sh
# VTS (video-to-speech synthesis)
bash src/aligndit/run/eval/infer_w_lipreader.sh
```

## Evaluation

We follow F5-TTS for evaluation. Further details are available here.
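For orientation, spkSIM in the results table measures speaker similarity, and the usual recipe is cosine similarity between speaker embeddings of generated and reference speech. A minimal sketch of that idea with stand-in random embeddings; which speaker encoder the eval scripts actually use is not specified here:

```python
import torch
import torch.nn.functional as F

# Sketch of the spkSIM idea: cosine similarity between speaker embeddings.
# The 256-dim random tensors below are placeholders for real embeddings.
def speaker_similarity(emb_gen: torch.Tensor, emb_ref: torch.Tensor) -> float:
    return F.cosine_similarity(emb_gen, emb_ref, dim=-1).item()

emb_gen, emb_ref = torch.randn(256), torch.randn(256)
print(speaker_similarity(emb_gen, emb_ref))
```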
```bash
# For the AVSync metric, run this script beforehand
bash src/aligndit/run/misc/extract_avhubert.sh
bash src/aligndit/run/eval/eval_lrs3_test.sh
```

## Acknowledgements

This repository is built on F5-TTS, AV-HuBERT, Fairseq, CosyVoice, HiFi-GAN, and V2SFlow. We appreciate the authors of these projects for making their code open source.
## Citation

If our work is useful for you, please cite the following paper:

```bibtex
@inproceedings{choi2025aligndit,
  title={AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Sung-Bin, Kim and Oh, Tae-Hyun and Chung, Joon Son},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}
```

## License

This project is released under the MIT License. Please note that the use of AV-HuBERT models is subject to their original license terms.
