SAIS-FUXI/HawkT2V

HawkT2V

HawkT2V is a diffusion transformer model for generating videos from textual inputs. To handle the intricacies of video data, we incorporate a 3D Variational Autoencoder (VAE) that compresses video across both spatial and temporal dimensions. We further enhance the diffusion transformer with a window-based attention mechanism tailored for video generation. Because training a text-to-video generative model from scratch demands significant computational resources and data, we adopt a multi-stage training pipeline that improves training efficiency and effectiveness. Through this progressive training methodology, HawkT2V can craft coherent, long-duration videos with prominent motion dynamics.
Currently, HawkT2V can generate 2-4s 512x512 videos!
We will update the code as soon as possible to support longer and higher-resolution generation.
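To give an intuition for the window-based attention mentioned above (this is an illustrative sketch, not the repository's actual implementation): each token attends only to the other tokens inside its own local window, so attention cost grows with the window size rather than the full sequence length.

```python
import numpy as np

def window_attention(x, window):
    """Illustrative window-based self-attention (not HawkT2V's actual code).

    x: (num_tokens, dim) token sequence; num_tokens must be divisible by `window`.
    Each token attends only to tokens inside its own non-overlapping window.
    """
    n, d = x.shape
    assert n % window == 0, "sequence length must be divisible by the window size"
    # Split the sequence into non-overlapping windows: (num_windows, window, dim)
    w = x.reshape(n // window, window, d)
    # For simplicity, use the tokens themselves as queries, keys, and values
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(d)   # (nw, window, window)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over each window
    out = attn @ w                                   # (nw, window, dim)
    return out.reshape(n, d)

# Example: 16 video tokens, window of 4 -- attention never crosses window borders
tokens = np.random.randn(16, 32)
out = window_attention(tokens, window=4)
print(out.shape)  # (16, 32)
```

Because windows are independent, changing tokens in one window leaves the outputs of all other windows untouched, which is what makes this cheaper than full attention.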

Preparation

Installation

  1. Install the dependencies with pip
pip install -r requirements.txt

To enable xformers, which can reduce GPU memory usage, you also need to install xformers and flash-attn first. For xformers installation, please refer to the specific instructions in the Xformers repository. If it is not needed, make sure the xformers setting in the config file is False, as shown below:

enable_xformers_memory_efficient_attention: False
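If you do want the memory-efficient attention, both packages can usually be installed from pip. Exact wheels depend on your CUDA and PyTorch versions, so treat this as a sketch rather than the project's pinned setup:

```shell
# Versions must match your CUDA / PyTorch build; check the xformers docs first.
pip install xformers
# flash-attn compiles CUDA kernels; --no-build-isolation is the flag its upstream docs recommend.
pip install flash-attn --no-build-isolation
```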

Quick Start

Inference

(1) Download the Model
The 3B model can be downloaded from Huggingface: Fudan-FUXI/HawkT2V_1.0_3B.
To run the provided inference example, create a sub-directory named 'pretrained_models' in the current working directory and move the downloaded model into it.
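The two steps above can be sketched as follows; the `huggingface-cli download` command assumes the huggingface_hub package is installed, and the target directory name inside 'pretrained_models' is our choice, not one mandated by the repo:

```shell
# Create the directory the inference example expects
mkdir -p pretrained_models
# Fetch the released 3B checkpoint from the Hugging Face Hub
huggingface-cli download Fudan-FUXI/HawkT2V_1.0_3B --local-dir pretrained_models/HawkT2V_1.0_3B
```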

(2) Run inference
We provide example prompts for video generation in 'samples/test_prompt.json'; generating videos is as simple as running the command below:

bash scripts/inference.sh

Currently the provided checkpoint generates 2s 512x512 videos; we will update the pretrained model as soon as possible to support generation of videos up to 8s long.

Finetune

To finetune on your own dataset, the command below is an example:

bash scripts/train.sh

This script finetunes the 3B model on a custom dataset; after sufficient training it will be able to generate 512x512 videos. To train the 3B model smoothly, a GPU with at least 80 GB of memory is required.

Demos

Here are some 512x512 video examples generated by our 3B model.

Acknowledgement

HawkT2V is built upon many wonderful previous works, including but not limited to Latte, Pixart, HD-VILA, and LAION.

License Agreement

The code in this repository is released under the Apache 2.0 License.
