Yutong Wang 1, Haiyu Zhang 3,2, Tianfan Xue 4,2, Yu Qiao 2, Yaohui Wang 2, Chang Xu 1*, Xinyuan Chen 2*
1USYD, 2Shanghai AI Laboratory, 3BUAA, 4CUHK
VDOT is an efficient, unified video creation model that produces high-quality results in just 4 denoising steps. By incorporating computational optimal transport (OT) into the distillation process, VDOT improves training stability and boosts both training and inference efficiency. VDOT unifies a wide range of capabilities, including Reference-to-Video (R2V), Video-to-Video (V2V), Masked Video Editing (MV2V), and arbitrary composite tasks, matching the versatility of VACE at significantly lower inference cost.
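The following is a minimal, hypothetical sketch of what optimal-transport pairing between noise and data latents can look like in a distillation pipeline. The function name, shapes, and the use of the Hungarian solver from `scipy` are assumptions for illustration only, not VDOT's actual implementation.

```python
# Hypothetical OT pairing sketch: match each noise sample to its closest latent
# under squared Euclidean cost before distillation. Not the repository's code.
import torch
from scipy.optimize import linear_sum_assignment


def ot_pair_noise_to_latents(noise: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
    """Reorder `noise` (B, ...) so that noise[i] is OT-matched to latents[i]."""
    # Pairwise squared distances between flattened samples: a (B, B) cost matrix.
    cost = torch.cdist(noise.flatten(1), latents.flatten(1)) ** 2
    # Hungarian algorithm gives the exact optimal one-to-one assignment.
    _, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    # Permute noise so that the matched pairs are aligned index-by-index.
    perm = torch.from_numpy(col_idx.argsort()).to(noise.device)
    return noise[perm]
```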
The codebase was tested with Python 3.10.13, CUDA version 12.4, and PyTorch >= 2.5.1.
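As a quick sanity check that your environment matches the tested configuration, you can run a snippet like the one below (illustrative only, not part of the released codebase):

```python
# Verify local Python / PyTorch / CUDA versions against the tested setup
# (Python 3.10.13, CUDA 12.4, PyTorch >= 2.5.1).
import sys
import torch

print(f"Python : {sys.version.split()[0]}")   # tested with 3.10.13
print(f"PyTorch: {torch.__version__}")        # tested with >= 2.5.1
print(f"CUDA   : {torch.version.cuda}")       # tested with 12.4
print(f"GPU OK : {torch.cuda.is_available()}")
```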
We are grateful to the following awesome projects: VACE, Wan, and Self-Forcing.