SAIS-FUXI/HawkT2V

HawkT2V

HawkT2V is a diffusion transformer model for generating videos from textual inputs. To handle the intricacies of video data, we incorporate a 3D Variational Autoencoder (VAE) that compresses video across both spatial and temporal dimensions. We further enhance the diffusion transformer with a window-based attention mechanism tailored for video generation. Because training a text-to-video generative model from scratch demands significant computational resources and data, we adopt a multi-stage training pipeline that improves training efficiency and effectiveness. Through this progressive training methodology, HawkT2V can craft coherent, long-duration videos with prominent motion dynamics.
Currently, HawkT2V can generate 2-4s 512x512 videos!
We will update the code as soon as possible to support longer and higher-resolution generation.
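To give an intuition for the window-based attention mentioned above (this is an illustrative sketch, not the repository's actual implementation): each token attends only to the other tokens inside its own local window, so attention cost grows with the window size rather than the full sequence length.

```python
import numpy as np

def window_attention(x, window):
    """Illustrative window-based self-attention (not HawkT2V's actual code).

    x: (num_tokens, dim) token sequence; num_tokens must be divisible by `window`.
    Each token attends only to tokens inside its own non-overlapping window.
    """
    n, d = x.shape
    assert n % window == 0, "sequence length must be divisible by the window size"
    # Split the sequence into non-overlapping windows: (num_windows, window, dim)
    w = x.reshape(n // window, window, d)
    # For simplicity, use the tokens themselves as queries, keys, and values
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(d)   # (nw, window, window)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over each window
    out = attn @ w                                   # (nw, window, dim)
    return out.reshape(n, d)

# Example: 16 video tokens, window of 4 -- attention never crosses window borders
tokens = np.random.randn(16, 32)
out = window_attention(tokens, window=4)
print(out.shape)  # (16, 32)
```

Because windows are independent, changing tokens in one window leaves the outputs of all other windows untouched, which is what makes this cheaper than full attention.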

Preparation

Installation

  1. Install the dependencies with pip
pip install -r requirements.txt

To enable xformers, which can reduce GPU memory usage, you also need to install xformers and flash-attn first. For xformers installation, please refer to the specific instructions in the Xformers repository. If it is not needed, make sure the xformers setting in the config file is False, as shown below:

enable_xformers_memory_efficient_attention: False
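If you do want the memory-efficient attention, both packages can usually be installed from pip. Exact wheels depend on your CUDA and PyTorch versions, so treat this as a sketch rather than the project's pinned setup:

```shell
# Versions must match your CUDA / PyTorch build; check the xformers docs first.
pip install xformers
# flash-attn compiles CUDA kernels; --no-build-isolation is the flag its upstream docs recommend.
pip install flash-attn --no-build-isolation
```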

Quick Start

Inference

(1) Download the Model
The 3B model can be downloaded from Huggingface: Fudan-FUXI/HawkT2V_1.0_3B.
To run the provided inference example, create a sub-directory named 'pretrained_models' in the current working directory and move the downloaded model into it.
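The two steps above can be sketched as follows; the `huggingface-cli download` command assumes the huggingface_hub package is installed, and the target directory name inside 'pretrained_models' is our choice, not one mandated by the repo:

```shell
# Create the directory the inference example expects
mkdir -p pretrained_models
# Fetch the released 3B checkpoint from the Hugging Face Hub
huggingface-cli download Fudan-FUXI/HawkT2V_1.0_3B --local-dir pretrained_models/HawkT2V_1.0_3B
```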

(2) Run inference
We provide example prompts for video generation in 'samples/test_prompt.json'; generating videos is as simple as running the command below:

bash scripts/inference.sh

Currently the provided checkpoint generates 2s 512x512 videos; we will update the pretrained model as soon as possible to support generation of videos up to 8s long.

Finetune

To finetune on your own dataset, the command below is an example:

bash scripts/train.sh

This script finetunes the 3B model on a custom dataset; after sufficient training it will be able to generate 512x512 videos. To train the 3B model smoothly, a GPU with at least 80 GB of memory is required.

Demos

Here are some 512x512 video examples generated by our 3B model.

Acknowledgement

HawkT2V is built upon many wonderful previous works, including but not limited to Latte, Pixart, HD-VILA, and LAION.

License Agreement

The code in this repository is released under the Apache 2.0 License.
