Yutong Wang1, Haiyu Zhang3,2, Tianfan Xue4,2, Yu Qiao2, Yaohui Wang2, Chang Xu1*, Xinyuan Chen2*
1USYD, 2Shanghai AI Laboratory, 3BUAA, 4CUHK
VDOT is an efficient, unified video creation model that produces high-quality results in just 4 denoising steps. By employing computational optimal transport (OT) within the distillation process, VDOT ensures training stability and improves both training and inference efficiency. VDOT unifies a wide range of capabilities, including Reference-to-Video (R2V), Video-to-Video (V2V), Masked Video Editing (MV2V), and arbitrary compositions of these tasks, matching the versatility of VACE at a significantly reduced inference cost.
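For intuition, the snippet below is a minimal sketch of batch-level OT pairing between noise and data samples, the general idea behind OT-based couplings used to stabilize few-step distillation. It is an illustration only, not VDOT's actual training code; the tensor shapes, the squared-Euclidean cost, and the `ot_pair` helper are assumptions.

```python
# Illustrative sketch: pair noise and data samples within a batch by solving
# a discrete optimal-transport (assignment) problem, so that each noise
# sample is matched to a nearby data sample. NOT the VDOT implementation.
import torch
from scipy.optimize import linear_sum_assignment


def ot_pair(noise: torch.Tensor, data: torch.Tensor):
    """Return the batch re-ordered so paired (noise, data) samples are close."""
    # Cost matrix: squared Euclidean distance between flattened samples.
    cost = torch.cdist(noise.flatten(1), data.flatten(1)) ** 2
    row, col = linear_sum_assignment(cost.cpu().numpy())
    return noise[row], data[col]


# Random tensors standing in for video latents (batch, channels, T, H, W).
noise = torch.randn(8, 16, 4, 32, 32)
data = torch.randn(8, 16, 4, 32, 32)
paired_noise, paired_data = ot_pair(noise, data)
```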
The codebase was tested with Python 3.10.13, CUDA version 12.4, and PyTorch >= 2.5.1.
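As a quick sanity check, the sketch below prints the installed versions against the tested configuration. It assumes PyTorch is already installed; it is not a substitute for the project's dependency list.

```python
# Verify the local environment against the tested configuration
# (Python 3.10.13, CUDA 12.4, PyTorch >= 2.5.1).
import sys
import torch

print(f"Python : {sys.version.split()[0]}")   # tested with 3.10.13
print(f"PyTorch: {torch.__version__}")        # tested with >= 2.5.1
print(f"CUDA   : {torch.version.cuda}")       # tested with 12.4
print(f"GPU OK : {torch.cuda.is_available()}")
```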
We are grateful to the following awesome projects: VACE, Wan, and Self-Forcing.