currycurry915/currycurry915_old.github.io

AttentionFlow: Text-to-Video Editing Using Motion Map Injection Module



Abstract

Text-to-image diffusion models, trained on large text-image pair datasets, show remarkable performance in generating high-quality images. Recent research has extended diffusion models to text-guided video editing by using text-guided image diffusion models as a baseline. Existing video editing studies implicitly estimate frame-to-frame attention by adding cross-frame attention to the attention maps, which yields temporally consistent editing. However, because these methods use generative models trained on text-image pair data, they do not take into account one of the most important characteristics of video: motion. When a video is edited with prompts, the attention map of a prompt word that implies motion, such as "running" or "moving", is not clearly estimated, and accurate editing cannot be performed. In this paper, we propose the Motion Map Injection (MMI) module, which performs accurate video editing by considering motion information explicitly. The MMI module provides a simple but effective way to convey video motion information to T2V models in three steps: 1) extracting a motion map, 2) calculating the similarity between the motion map and the attention map of each prompt, and 3) injecting the motion map into the attention maps. Experimental results show that input videos can be edited accurately and effectively with the MMI module. To the best of our knowledge, our study is the first to utilize motion in video for text-to-video editing.
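The three MMI steps can be illustrated with a toy sketch. This is not the repository's implementation: the function names, the use of flow magnitude as the motion map, cosine similarity as the similarity measure, the rule of injecting into the single most similar token, and the blending `weight` are all assumptions chosen for illustration. Maps are flattened lists so the example runs with the standard library only.

```python
import math

def motion_map(flow):
    """Step 1 (assumed form): per-pixel motion magnitude, scaled to [0, 1].

    `flow` is a flat list of (dx, dy) displacements, one per pixel.
    """
    mags = [math.hypot(dx, dy) for dx, dy in flow]
    peak = max(mags)
    return [m / peak if peak > 0 else 0.0 for m in mags]

def cosine_similarity(a, b):
    """Step 2 (assumed measure): cosine similarity of two flat maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def inject_motion(attn_maps, motion, weight=0.5):
    """Step 3 (assumed rule): blend the motion map into the most similar map.

    `attn_maps` maps each prompt token to a flat attention map.
    Returns (new_maps, similarities); the input dict is left unchanged.
    """
    sims = {tok: cosine_similarity(a, motion) for tok, a in attn_maps.items()}
    target = max(sims, key=sims.get)
    new_maps = {tok: list(a) for tok, a in attn_maps.items()}
    new_maps[target] = [
        (1 - weight) * a + weight * m for a, m in zip(attn_maps[target], motion)
    ]
    return new_maps, sims

# Toy 2x2 frame, flattened to four pixels (hypothetical values).
flow = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (0.0, 2.5)]
attn = {
    "flowing": [0.1, 0.6, 0.1, 0.2],
    "skyscraper": [0.7, 0.1, 0.2, 0.0],
}
motion = motion_map(flow)
edited, sims = inject_motion(attn, motion)
```

In this toy setup the "flowing" map overlaps the moving pixels most, so it is the one that receives the motion map, while the "skyscraper" map is left as it was.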

Setup

pip install -r requirements.txt

The environment is very similar to that of Video-P2P.

Weights

We use the pre-trained Stable Diffusion model. You can download it here.

Quickstart

Our code is based on the Video-P2P code, so refer to the Video-P2P GitHub repository if you need more detail.

Replace pretrained_model_path with the path to your Stable Diffusion weights.

To download the pre-trained model, please refer to diffusers.

# Stage 1: Tuning for model initialization.
# You can reduce the number of tuning epochs to speed this up.
python run_tuning.py --config="configs/cloud-1-tune.yaml"

# Stage 2: Attention control.
python run_attention_flow.py --config="configs/cloud-1-p2p.yaml"

Find your results in Video-P2P/outputs/xxx/results.
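If you run both stages often, a small wrapper can chain them. The sketch below is not part of the repository: `build_commands` and `run_pipeline` are hypothetical helpers, and the `configs/<name>-tune.yaml` / `configs/<name>-p2p.yaml` naming is an assumption generalized from the single `cloud-1` example above.

```python
import subprocess
import sys

def build_commands(name):
    """Return the Stage 1 and Stage 2 commands for the config prefix `name`.

    Assumes configs follow the cloud-1 naming pattern shown in the Quickstart.
    """
    return [
        [sys.executable, "run_tuning.py", f"--config=configs/{name}-tune.yaml"],
        [sys.executable, "run_attention_flow.py", f"--config=configs/{name}-p2p.yaml"],
    ]

def run_pipeline(name):
    """Run both stages in order; a failing stage raises CalledProcessError."""
    for cmd in build_commands(name):
        subprocess.run(cmd, check=True)

# Usage, from the repository root:
#   run_pipeline("cloud-1")
```

Because `check=True` is set, Stage 2 does not start if Stage 1 exits with an error.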

Examples

| Input Video | Video-P2P | Ours |
| --- | --- | --- |
| "clouds flowing under a skyscraper" | "waves flowing under a skyscraper" | "waves flowing under a skyscraper" |
| "clouds flowing on the mountain" | "lava flowing on the mountain" | "lava flowing on the mountain" |
| "spinning wings of windmill are beside the river" | "yellow spinning wings of windmill are beside the river" | "yellow spinning wings of windmill are beside the river" |
