Object-AVEdit: An Object-level Audio-Visual Editing Model

📚 Object-AVEdit

Audio-visual editing is in high demand in video post-production and filmmaking. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to add, replace, and remove objects across both the audio and visual modalities while preserving the structural information of the source instances during editing. In this paper, we present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop an audio generation model with strong word-to-sounding-object alignment, bridging the gap in object controllability between audio and current video generation models. Meanwhile, to achieve better structure preservation and a better object-level editing effect, we propose a holistically optimized inversion-regeneration editing algorithm that ensures both information retention during inversion and better regeneration quality. Extensive experiments demonstrate that our editing model achieves advanced results in both audio and video object-level editing tasks with fine audio-visual semantic alignment. In addition, our audio generation model also achieves advanced performance.

Demo: more video samples are available on our project page!

🛠️ Setup

cd Object-AVEdit
# setup base environment
conda env create -f environment.yml
conda activate avedit
# install flash-attn3
git clone https://github.com/Dao-AILab/flash-attention 
cd flash-attention/hopper
python setup.py install
# only needed if you run training
pip install colossalai --no-deps

🚀 Prepare weights

Audio generation weight.

Download from here and place it in \audio_weight.

Video generation weight.

Download from here and place it in \video_weight.

Other weights needed.

T5-large text encoder from here.

AudioLDM Mel vocoder from here.

🚀 Usage

We provide three types of tasks:

Audio Generation Task.
Audio Editing Task.
Video Editing Task.

Audio Generation

First, change the path in /audio_generation_model/generation.py.

cd audio_generation_model

python generation.py --prompt "dog bark" --GPU_num=0
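To generate audio for several prompts in one go, the command above can be wrapped in a small script. The following is a minimal sketch, assuming it is run from inside audio_generation_model and that only the two flags shown above are needed. The helper names build_generation_command and run_batch, as well as the second prompt, are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_generation_command(prompt, gpu_num=0):
    """Build the argv list for one generation.py call, using the flags shown above."""
    return [sys.executable, "generation.py", "--prompt", prompt, f"--GPU_num={gpu_num}"]


def run_batch(prompts, gpu_num=0, dry_run=True):
    """Call generation.py once per prompt; with dry_run=True only print the commands."""
    commands = [build_generation_command(p, gpu_num) for p in prompts]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the batch as soon as one generation fails
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "cat meow" is only an example prompt
    run_batch(["dog bark", "cat meow"], gpu_num=0, dry_run=True)
```

Set dry_run=False to actually launch the generations once the printed commands look right.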

Audio Editing

First, change the path in /audio_edit_part/audio_edit_main.py.

cd /audio_edit_part

python audio_edit_main.py \
--input_audio "/demo_data/replace_data/21.wav" \
--source_prompt="Several brown cows are standing in a green alpine meadow under the tall, snowy mountains." \
--target_prompt="Several brown horses are standing in a green alpine meadow under the tall, snowy mountains." \
--edit_type="Replacement" \
--GPU_num=0 \
--word="cows" \
--seed=42 \
--CA=50 \
--SA=50 \
--threshold=0.1

--word specifies the target object that you want to edit. In replacement and removal tasks, set --word to the source object to be edited; in the addition task, do not set --word.

--threshold affects the masking process. A bigger threshold means less change.

--CA is the number of cross-attention control steps.

--SA is the number of self-attention control steps.

A bigger CA or SA means more control steps. In the removal task, SA is often set to 0.
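The flag rules above can be captured in a small helper that assembles the command line. This is a sketch only: build_audio_edit_command is a hypothetical name, and the edit-type strings "Removal" and "Addition" are assumptions (only "Replacement" appears in the example above), so check audio_edit_main.py for the values it actually accepts.

```python
import shlex


def build_audio_edit_command(input_audio, source_prompt, target_prompt, edit_type,
                             word=None, ca=50, sa=None, threshold=0.1,
                             gpu_num=0, seed=42):
    """Assemble the argv list for audio_edit_main.py following the flag rules above."""
    if edit_type == "Addition":  # assumed spelling of the addition task
        if word is not None:
            raise ValueError("--word should not be set for the addition task")
    elif word is None:
        raise ValueError("--word must name the source object to edit")
    if sa is None:
        # SA is often 0 for removal; 50 mirrors the replacement example.
        sa = 0 if edit_type == "Removal" else 50  # "Removal" is an assumed spelling
    cmd = [
        "python", "audio_edit_main.py",
        "--input_audio", input_audio,
        f"--source_prompt={source_prompt}",
        f"--target_prompt={target_prompt}",
        f"--edit_type={edit_type}",
        f"--GPU_num={gpu_num}",
        f"--seed={seed}",
        f"--CA={ca}",
        f"--SA={sa}",
        f"--threshold={threshold}",
    ]
    if word is not None:
        cmd.append(f"--word={word}")
    return cmd


cmd = build_audio_edit_command(
    "/demo_data/replace_data/21.wav",
    "Several brown cows are standing in a green alpine meadow under the tall, snowy mountains.",
    "Several brown horses are standing in a green alpine meadow under the tall, snowy mountains.",
    "Replacement",
    word="cows",
)
print(shlex.join(cmd))  # shell-quoted command, ready to paste into a terminal
```

Building the arguments as a list keeps the long prompts intact without manual quoting; the list can also be passed directly to subprocess.run.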

Video Editing

First, change the path in /video_edit_part/config.py.

cd /video_edit_part

python video_edit_main.py \
--input_video "/demo_data/replace/6.mp4" \
--source_prompt="Several brown cows are standing in a green alpine meadow under the tall, snowy mountains." \
--target_prompt="Several brown horses are standing in a green alpine meadow under the tall, snowy mountains." \
--edit_type="Replacement" \
--GPU_num=0 \
--word="cows" \
--seed=42 \
--CA=37 \
--SA=37 \
--threshold=0.05

--word specifies the target object that you want to edit. In replacement and removal tasks, set --word to the source object to be edited; in the addition task, do not set --word.

--threshold affects the masking process. A bigger threshold means less change.

--CA is the number of cross-attention control steps.

--SA is the number of self-attention control steps.

A bigger CA or SA means more control steps. In the removal task, SA is often set to 0.
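As a toy illustration of why a bigger threshold means less change, assume the mask is obtained by thresholding a per-position relevance score for the edited word. This is an assumption made for illustration, not a description of the code in video_edit_part; edit_mask, editable_fraction, and the score values are hypothetical.

```python
def edit_mask(scores, threshold):
    """Mark a position as editable when its relevance score exceeds the threshold."""
    return [s > threshold for s in scores]


def editable_fraction(scores, threshold):
    """Fraction of positions that the mask allows the edit to change."""
    mask = edit_mask(scores, threshold)
    return sum(mask) / len(mask)


# Hypothetical relevance scores of six positions with respect to the edited word.
scores = [0.02, 0.04, 0.08, 0.12, 0.30, 0.65]

for threshold in (0.05, 0.1):
    fraction = editable_fraction(scores, threshold)
    print(f"threshold={threshold}: {fraction:.2f} of positions editable")
```

Under this assumption, raising the threshold shrinks the editable region, so more of the source content is kept unchanged.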

☀️ Acknowledgements

Our project is partly based on Mochi-1 and AudioLDM. We would like to thank the authors for their excellent work! ❤️
