
[ICLR 2026] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks



Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

Ruibin Li1,2, Tao Yang1, Yangming Shi1, Weiguo Feng1, Shilei Wen1, Bingyue Peng1, Lei Zhang2

1ByteDance, 2The Hong Kong Polytechnic University

🔥 Latest News

  • Inference code and model weights have been released — have fun with MfM ⭐⭐.

📌 Progress Checklist

  • ✅ Inference Code
  • ✅ Model Weights
  • ⬜️ Optimization for Parallel Inference

🔥 Inference

1. Install the requirements

pip install -r requirements.txt

2. Download the pipeline from Hugging Face

from huggingface_hub import snapshot_download

# Download the 8B pipeline; replace local_dir with your target path
snapshot_download(repo_id="LetsThink/MfM-Pipieline-8B", local_dir="xxx")

# Alternatively, download the 2B pipeline:
# snapshot_download(repo_id="LetsThink/MfM-Pipieline-2B", local_dir="xxx")

3. Inference

You can refer to the inference script in scripts/inference.sh:

PIPELINE_PATH=xxx
OUTPUT_DIR=outputs
TASK=t2v

python infer_mfm_pipeline.py \
        --pipeline_path $PIPELINE_PATH \
        --output_dir $OUTPUT_DIR \
        --task $TASK \
        --crop_type keep_res \
        --num_inference_steps 30 \
        --guidance_scale 9 \
        --motion_score 5 \
        --num_samples 1 \
        --upscale 4 \
        --noise_aug_strength 0.0 \
        --t2v_inputs your_prompt.txt
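The inference script can also be launched programmatically, which is convenient for batching several prompts or tasks. A minimal stdlib-only sketch is below; the flag names and values mirror the shell script above, while the helper function `build_infer_cmd` itself is our own illustration, not part of the repository:

```python
import shlex
import subprocess

def build_infer_cmd(pipeline_path, output_dir, task, prompt_file,
                    steps=30, guidance=9, motion=5):
    """Assemble the infer_mfm_pipeline.py command line shown above."""
    return [
        "python", "infer_mfm_pipeline.py",
        "--pipeline_path", pipeline_path,
        "--output_dir", output_dir,
        "--task", task,
        "--crop_type", "keep_res",
        "--num_inference_steps", str(steps),
        "--guidance_scale", str(guidance),
        "--motion_score", str(motion),
        "--num_samples", "1",
        "--upscale", "4",
        "--noise_aug_strength", "0.0",
        "--t2v_inputs", prompt_file,
    ]

cmd = build_infer_cmd("LetsThink/MfM-Pipeline-8B", "outputs", "t2v",
                      "your_prompt.txt")
print(shlex.join(cmd))          # inspect the full command line
# subprocess.run(cmd, check=True)  # uncomment to actually run inference
```

Keeping the flags in a Python list avoids shell-quoting pitfalls when prompts or paths contain spaces.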

In this work, we introduce a unified framework, namely Many-for-Many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those tasks. Specifically, we design a lightweight adapter to unify the different conditions of the different tasks, and then employ a joint image-video learning strategy to progressively train the model from scratch. This joint learning yields a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help the model better perceive 3D space in visual generation. We train two versions of our model with different sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation compared with open-source and even commercial engines. 🚀✨
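The "lightweight adapter" idea — each task's raw condition (text embedding, image latent, depth tokens, ...) is projected into one shared space that a single backbone consumes — can be sketched schematically. Everything below is illustrative: the class name, dimensions, and per-task routing are our assumptions, not the released architecture:

```python
import random

class ConditionAdapter:
    """Schematic per-task adapter: projects a task-specific condition
    vector into a shared embedding space consumed by one backbone."""

    def __init__(self, in_dim, shared_dim):
        # Random weights stand in for learned parameters.
        self.w = [[random.gauss(0, 0.02) for _ in range(in_dim)]
                  for _ in range(shared_dim)]

    def __call__(self, cond):
        # Plain matrix-vector product: one output per shared dimension.
        return [sum(wi * ci for wi, ci in zip(row, cond)) for row in self.w]

SHARED_DIM = 8
# One small adapter per task; the shared backbone only ever sees
# SHARED_DIM-sized vectors, regardless of the task's native condition.
adapters = {
    "t2v":   ConditionAdapter(in_dim=4, shared_dim=SHARED_DIM),  # text
    "i2v":   ConditionAdapter(in_dim=6, shared_dim=SHARED_DIM),  # image
    "depth": ConditionAdapter(in_dim=5, shared_dim=SHARED_DIM),  # depth map
}

def unify(task, cond):
    """Route a raw condition through its task-specific adapter."""
    return adapters[task](cond)
```

The point of the sketch is the interface: adding a new task only requires a new small adapter, while the backbone and its training loop stay unchanged.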

📺 Demo Video

MfM_demo.mp4

📮 Architecture

✍️ Citation

If you find our code or model useful in your research, please cite:

@article{yang2025MfM,
  title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
  author={Li, Ruibin and Yang, Tao and Shi, Yangming and Feng, Weiguo and Wen, Shilei and Peng, Bingyue and Zhang, Lei},
  year={2025},
  journal={arXiv preprint arXiv:2506.01758},
}
