Official PyTorch implementation for the following paper:

> **AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation**<br>
> Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung<br>
> ACM MM 2025<br>
> [Paper] [Project]
## Debug Data

To help verify your setup and ensure reproducibility, we provide the following debug data.
| Path | Dataset | Debug data |
|---|---|---|
| data/LibriSpeech_debug | LibriSpeech | download |
| data/LRS3_debug | LRS3 | download |
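After downloading, a quick way to confirm the debug data landed at the paths listed in the table above:

```python
from pathlib import Path

# Check that the downloaded debug data sits at the expected locations.
for d in ("data/LibriSpeech_debug", "data/LRS3_debug"):
    print(f"{d}: {'found' if Path(d).is_dir() else 'missing'}")
```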
## Pretrained Models

| Path | Train Dataset | Model |
|---|---|---|
| ckpts/AlignDiT_pretrain_hifigan_16k_LibriSpeech_notext/model_500000.pt | LibriSpeech | download |
| ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt | LRS3 | download |
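To sanity-check a downloaded checkpoint before running the full pipeline, here is a minimal sketch assuming standard `torch.save` serialization; the top-level layout of the file is an assumption, so print the keys and compare against the training scripts:

```python
import torch

# Inspect a downloaded checkpoint on CPU. Whether the file holds a raw
# state dict or a dict with keys like "model"/"ema" is an assumption.
ckpt = torch.load(
    "ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt",
    map_location="cpu",
)
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```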
## Generated Samples

We provide audio samples generated by AlignDiT. For the VTS task, we use a lip-reading model (Auto-AVSR) to transcribe text from the silent video before inference.
| Task | Test Dataset | WER ↓ | AVSync ↑ | spkSIM ↑ | Samples |
|---|---|---|---|---|---|
| ADR (automated dialogue replacement) | LRS3-cross | 1.401 | 0.751 | 0.515 | download |
| VTS (video-to-speech synthesis) | LRS3-cross | 19.513 | 0.688 | 0.508 | download |
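For reference, WER is the word-level edit distance between the ASR transcript of the generated speech and the ground-truth text. A minimal sketch of the metric itself using the `jiwer` package; which ASR model and text normalization the eval scripts actually use is not shown here:

```python
import jiwer

# Word error rate: (substitutions + deletions + insertions) / reference words.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {100 * jiwer.wer(reference, hypothesis):.3f}%")
```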
## Installation

```bash
conda create -y -n aligndit python=3.10 && conda activate aligndit
git clone https://github.com/kaistmm/AlignDiT.git && cd AlignDiT

# Install PyTorch built for CUDA 12.1
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

pip install -e .
pip install -e .[eval]  # For evaluation
```

## Data Preparation

We crop the mouth region from each video following Auto-AVSR and place the resulting videos in `data/LRS3_debug/autoavsr/video`. These cropped videos are used for both training and inference; a quick sanity check is sketched below.
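A minimal sketch for spot-checking a cropped clip with torchvision. The file name is hypothetical, and the typical Auto-AVSR crop format (96x96 grayscale at 25 fps) is an assumption, not a value verified against this repo:

```python
import torchvision

# Read a cropped mouth video and print its shape and metadata.
frames, _, info = torchvision.io.read_video(
    "data/LRS3_debug/autoavsr/video/example.mp4",  # hypothetical file name
    pts_unit="sec",
)
print(frames.shape, info)  # (T, H, W, C); frame rate in info["video_fps"]
```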
Prepare the datasets:

```bash
bash src/aligndit/run/misc/prepare_librispeech_notext.sh
bash src/aligndit/run/misc/prepare_lrs3.sh
```

## Feature Extraction

Extract mel-spectrograms and HuBERT features:

```bash
bash src/aligndit/run/misc/extract_mel.sh
bash src/aligndit/run/misc/extract_hubert.sh
```
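As a rough sketch of the kind of feature `extract_mel.sh` produces; the exact parameter values (`n_fft`, `hop_length`, `n_mels`) and the input path are assumptions, and the authoritative configuration is in the script and the code it calls:

```python
import torchaudio

# Sketch of mel-spectrogram extraction at 16 kHz; parameters are assumptions.
wav, sr = torchaudio.load("data/LibriSpeech_debug/example.flac")  # hypothetical path
wav = torchaudio.functional.resample(wav, sr, 16000)

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)
mel = mel_transform(wav)  # (channels, n_mels, frames)
print(mel.shape)
```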
Extracting AV-HuBERT features from the silent videos requires Fairseq and AV-HuBERT:

```bash
bash src/aligndit/run/misc/extract_avhubert_from_only_video.sh
```

## Training

```bash
# 1. Pre-train on LibriSpeech for 500k updates
bash src/aligndit/run/train/pretrain.sh
# 2. Fine-tune on LRS3 for 400k updates
bash src/aligndit/run/train/finetune.sh
```

## Inference

```bash
# ADR (automated dialogue replacement)
bash src/aligndit/run/eval/infer.sh
# VTS (video-to-speech synthesis)
bash src/aligndit/run/eval/infer_w_lipreader.sh
```

## Evaluation

We follow F5-TTS for evaluation. Further details are available here.
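For orientation, spkSIM in the results table measures speaker similarity, and the usual recipe is cosine similarity between speaker embeddings of generated and reference speech. A minimal sketch of that idea with stand-in random embeddings; which speaker encoder the eval scripts actually use is not specified here:

```python
import torch
import torch.nn.functional as F

# Sketch of the spkSIM idea: cosine similarity between speaker embeddings.
# The 256-dim random tensors below are placeholders for real embeddings.
def speaker_similarity(emb_gen: torch.Tensor, emb_ref: torch.Tensor) -> float:
    return F.cosine_similarity(emb_gen, emb_ref, dim=-1).item()

emb_gen, emb_ref = torch.randn(256), torch.randn(256)
print(speaker_similarity(emb_gen, emb_ref))
```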
```bash
# For the AVSync metric, run this script beforehand
bash src/aligndit/run/misc/extract_avhubert.sh
bash src/aligndit/run/eval/eval_lrs3_test.sh
```

## Acknowledgements

This repository is built on F5-TTS, AV-HuBERT, Fairseq, CosyVoice, HiFi-GAN, and V2SFlow. We appreciate the authors of these projects for making their code open source.
## Citation

If our work is useful for you, please cite the following paper:

```bibtex
@inproceedings{choi2025aligndit,
  title={AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Sung-Bin, Kim and Oh, Tae-Hyun and Chung, Joon Son},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}
```

## License

This project is released under the MIT License. Please note that the use of AV-HuBERT models is subject to their original license terms.
