
[ICLR 2026] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks



Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

Ruibin Li1,2, Tao Yang1, Yangming Shi1, Weiguo Feng1, Shilei Wen1, Bingyue Peng1, Lei Zhang2

1ByteDance, 2The Hong Kong Polytechnic University

🔥 Latest News

  • Inference code and model weights have been released — have fun with MfM ⭐⭐.

📌 Progress Checklist

  • ✅ Inference Code
  • ✅ Model Weights
  • ⬜️ Optimization for Parallel Inference

🔥 Inference

1. Install the requirements

pip install -r requirements.txt

2. Download the pipeline from Hugging Face

from huggingface_hub import snapshot_download

# Download the 8B pipeline; replace local_dir with your target path
snapshot_download(repo_id="LetsThink/MfM-Pipieline-8B", local_dir="xxx")

# Alternatively, download the 2B pipeline:
# snapshot_download(repo_id="LetsThink/MfM-Pipieline-2B", local_dir="xxx")

3. Inference

You can refer to the inference script in scripts/inference.sh:

PIPELINE_PATH=xxx
OUTPUT_DIR=outputs
TASK=t2v

python infer_mfm_pipeline.py \
        --pipeline_path $PIPELINE_PATH \
        --output_dir $OUTPUT_DIR \
        --task $TASK \
        --crop_type keep_res \
        --num_inference_steps 30 \
        --guidance_scale 9 \
        --motion_score 5 \
        --num_samples 1 \
        --upscale 4 \
        --noise_aug_strength 0.0 \
        --t2v_inputs your_prompt.txt
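The inference script can also be launched programmatically, which is convenient for batching several prompts or tasks. A minimal stdlib-only sketch is below; the flag names and values mirror the shell script above, while the helper function `build_infer_cmd` itself is our own illustration, not part of the repository:

```python
import shlex
import subprocess

def build_infer_cmd(pipeline_path, output_dir, task, prompt_file,
                    steps=30, guidance=9, motion=5):
    """Assemble the infer_mfm_pipeline.py command line shown above."""
    return [
        "python", "infer_mfm_pipeline.py",
        "--pipeline_path", pipeline_path,
        "--output_dir", output_dir,
        "--task", task,
        "--crop_type", "keep_res",
        "--num_inference_steps", str(steps),
        "--guidance_scale", str(guidance),
        "--motion_score", str(motion),
        "--num_samples", "1",
        "--upscale", "4",
        "--noise_aug_strength", "0.0",
        "--t2v_inputs", prompt_file,
    ]

cmd = build_infer_cmd("LetsThink/MfM-Pipeline-8B", "outputs", "t2v",
                      "your_prompt.txt")
print(shlex.join(cmd))          # inspect the full command line
# subprocess.run(cmd, check=True)  # uncomment to actually run inference
```

Keeping the flags in a Python list avoids shell-quoting pitfalls when prompts or paths contain spaces.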

In this work, we introduce a unified framework, namely Many-for-Many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those tasks. Specifically, we design a lightweight adapter to unify the different conditions of the different tasks, and then employ a joint image-video learning strategy to progressively train the model from scratch. This joint learning yields a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help the model better perceive 3D space in visual generation. We train two versions of our model with different sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation compared with open-source and even commercial engines. 🚀✨
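The "lightweight adapter" idea — each task's raw condition (text embedding, image latent, depth tokens, ...) is projected into one shared space that a single backbone consumes — can be sketched schematically. Everything below is illustrative: the class name, dimensions, and per-task routing are our assumptions, not the released architecture:

```python
import random

class ConditionAdapter:
    """Schematic per-task adapter: projects a task-specific condition
    vector into a shared embedding space consumed by one backbone."""

    def __init__(self, in_dim, shared_dim):
        # Random weights stand in for learned parameters.
        self.w = [[random.gauss(0, 0.02) for _ in range(in_dim)]
                  for _ in range(shared_dim)]

    def __call__(self, cond):
        # Plain matrix-vector product: one output per shared dimension.
        return [sum(wi * ci for wi, ci in zip(row, cond)) for row in self.w]

SHARED_DIM = 8
# One small adapter per task; the shared backbone only ever sees
# SHARED_DIM-sized vectors, regardless of the task's native condition.
adapters = {
    "t2v":   ConditionAdapter(in_dim=4, shared_dim=SHARED_DIM),  # text
    "i2v":   ConditionAdapter(in_dim=6, shared_dim=SHARED_DIM),  # image
    "depth": ConditionAdapter(in_dim=5, shared_dim=SHARED_DIM),  # depth map
}

def unify(task, cond):
    """Route a raw condition through its task-specific adapter."""
    return adapters[task](cond)
```

The point of the sketch is the interface: adding a new task only requires a new small adapter, while the backbone and its training loop stay unchanged.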

📺 Demo Video

MfM_demo.mp4

📮 Architecture

✍️ Citation

If you find our code or model useful in your research, please cite:

@article{yang2025MfM,
  title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
  author={Li, Ruibin and Yang, Tao and Shi, Yangming and Feng, Weiguo and Wen, Shilei and Peng, Bingyue and Zhang, Lei},
  year={2025},
  journal={arXiv preprint arXiv:2506.01758},
}
