# AlignDiT

Official PyTorch implementation for the following paper:

**AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation**
Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung
ACM MM 2025
[Paper] [Project]

## Datasets

To help verify your setup and ensure reproducibility, we provide the following debug data.

| Path | Dataset | Debug data |
| --- | --- | --- |
| `data/LibriSpeech_debug` | LibriSpeech | download |
| `data/LRS3_debug` | LRS3 | download |

## Model Checkpoints

| Path | Train Dataset | Model |
| --- | --- | --- |
| `ckpts/AlignDiT_pretrain_hifigan_16k_LibriSpeech_notext/model_500000.pt` | LibriSpeech | download |
| `ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt` | LRS3 | download |

## Test Samples

We provide audio samples generated by AlignDiT. For the VTS task, we use a lip reading model (Auto-AVSR) to transcribe text from the silent video before inference.

| Task | Test Dataset | WER ↓ | AVSync ↑ | spkSIM ↑ | Samples |
| --- | --- | --- | --- | --- | --- |
| ADR (automated dialogue replacement) | LRS3-cross | 1.401 | 0.751 | 0.515 | download |
| VTS (video-to-speech synthesis) | LRS3-cross | 19.513 | 0.688 | 0.508 | download |

## 1. Installation

```shell
conda create -y -n aligndit python=3.10 && conda activate aligndit

git clone https://github.com/kaistmm/AlignDiT.git && cd AlignDiT

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
pip install -e .[eval]  # For evaluation
```

## 2. Data Preparation

We crop the mouth region from each video following Auto-AVSR and place the resulting videos in `data/LRS3_debug/autoavsr/video`. For both training and inference, we use these cropped videos.
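The cropping step above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing: it assumes a 96×96 mouth ROI (a common choice in Auto-AVSR-style pipelines) and takes the mouth-center coordinate as given, whereas the real pipeline detects facial landmarks per frame.

```python
import numpy as np

def crop_mouth(frames: np.ndarray, center: tuple, size: int = 96) -> np.ndarray:
    """Crop a fixed-size square mouth ROI from each frame.

    frames: (T, H, W) grayscale video; center: (y, x) mouth center;
    size: side length of the crop (96 is a common Auto-AVSR-style choice).
    """
    half = size // 2
    t, h, w = frames.shape
    # Clamp the crop window so it stays fully inside the frame.
    y0 = min(max(center[0] - half, 0), h - size)
    x0 = min(max(center[1] - half, 0), w - size)
    return frames[:, y0 : y0 + size, x0 : x0 + size]

# Example: 25 frames of a 224x224 video, mouth centered near (150, 112).
video = np.zeros((25, 224, 224), dtype=np.uint8)
roi = crop_mouth(video, center=(150, 112))
print(roi.shape)  # (25, 96, 96)
```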

### Metadata

```shell
bash src/aligndit/run/misc/prepare_librispeech_notext.sh
bash src/aligndit/run/misc/prepare_lrs3.sh
```

### Mel spectrogram

```shell
bash src/aligndit/run/misc/extract_mel.sh
```
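For reference, log-mel extraction boils down to framing, windowing, an FFT, and a mel filterbank. The sketch below uses numpy with hypothetical parameters (16 kHz sample rate matching the HiFi-GAN 16k checkpoints, 1024-point FFT, 256 hop, 80 mel bands); the script's actual settings may differ.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=80):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel(wave, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # Frame and window the signal, take the power spectrum, apply the filterbank.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.maximum(mel, 1e-10))  # (n_frames, n_mels)

wave = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
spec = log_mel(wave)
print(spec.shape)  # (59, 80)
```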

### HuBERT feature

```shell
bash src/aligndit/run/misc/extract_hubert.sh
```
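One practical detail when pairing HuBERT features with mel targets is frame rates: HuBERT emits roughly 50 features per second, which generally differs from the mel frame rate. A hedged sketch of a simple length alignment (nearest-neighbor resampling; the repository may use a different scheme) is:

```python
import numpy as np

def align_length(feats: np.ndarray, target_len: int) -> np.ndarray:
    """Resample a (T, D) feature sequence to target_len frames by
    nearest-neighbor indexing, so features line up with mel frames."""
    t = feats.shape[0]
    idx = np.minimum((np.arange(target_len) * t / target_len).astype(int), t - 1)
    return feats[idx]

hubert = np.random.default_rng(0).standard_normal((50, 768))  # ~1 s at 50 Hz
aligned = align_length(hubert, target_len=62)  # e.g. match a 62-frame mel target
print(aligned.shape)  # (62, 768)
```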

### AV-HuBERT video feature

This step requires Fairseq and AV-HuBERT.

```shell
bash src/aligndit/run/misc/extract_avhubert_from_only_video.sh
```

## 3. Training

```shell
# 1. Pre-train on LibriSpeech for 500k updates
bash src/aligndit/run/train/pretrain.sh

# 2. Fine-tune on LRS3 for 400k updates
bash src/aligndit/run/train/finetune.sh
```

## 4. Inference

```shell
# ADR (automated dialogue replacement)
bash src/aligndit/run/eval/infer.sh

# VTS (video-to-speech synthesis)
bash src/aligndit/run/eval/infer_w_lipreader.sh
```

## 5. Evaluation

We follow F5-TTS for evaluation. Further details are available here.

```shell
# For the AVSync metric, run this script beforehand
bash src/aligndit/run/misc/extract_avhubert.sh

bash src/aligndit/run/eval/eval_lrs3_test.sh
```
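Of the reported metrics, WER is the most mechanical: the evaluation pipeline transcribes the generated speech with an ASR model and scores the transcript against the reference as word-level edit distance. A minimal, self-contained sketch of that final scoring step (not the repository's actual evaluation code) is:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over word sequences,
    (substitutions + deletions + insertions) / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```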

## Acknowledgement

This repository is built on F5-TTS, AV-HuBERT, Fairseq, CosyVoice, HiFi-GAN, and V2SFlow. We appreciate these projects for open-sourcing their code.

## Citation

If you find our work useful, please cite the following paper:

```bibtex
@inproceedings{choi2025aligndit,
  title={AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Sung-Bin, Kim and Oh, Tae-Hyun and Chung, Joon Son},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}
```

## License

This project is released under the MIT License. Please note that the use of AV-HuBERT models is subject to their original license terms.
