This repository contains the official code and checkpoints used in the paper "OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows"
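For background, the rectified-flow formulation in the title learns a velocity field along straight-line paths between noise and data, and generates samples by integrating that field as an ODE from t=0 to t=1. A toy 1-D sketch of the sampling step (not the paper's model; the constant-velocity oracle below is a hypothetical stand-in for a trained network):

```python
def euler_sample(x0, velocity, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

# Toy oracle: along a straight path from x0 to x1 the true velocity is the
# constant (x1 - x0), which is what rectified flow trains the network to match.
x0, x1 = 0.0, 3.0
result = euler_sample(x0, lambda x, t: x1 - x0)
```

With a constant velocity field the Euler integration is exact, so `result` lands on `x1`; a trained model only approximates this, which is why real samplers use more steps.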
```shell
conda create --name omniflow python=3.10  # "omniflow" is an example env name
conda activate omniflow
# install a CUDA-enabled PyTorch build (the pipeline below runs on device='cuda')
pip3 install torch torchvision torchaudio
pip install -r requirements.txt
pip install -e .
```
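After installing, a quick stdlib-only sanity check (an illustrative snippet, not part of the repo) confirms that the core dependencies resolve in the active environment:

```python
import importlib.util

# Report whether each core dependency is importable without actually importing it.
for pkg in ("torch", "torchvision", "torchaudio"):
    status = "ok" if importlib.util.find_spec(pkg) else "missing"
    print(f"{pkg}: {status}")
```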
- Checkpoint (v0.5) is available on Huggingface.
- Checkpoint (v0.9) is available on Huggingface. This checkpoint is trained on additional data focusing on audio-visual correspondence.
```python
from omniflow import OmniFlowPipeline

pipeline = OmniFlowPipeline.load_pretrained('ckpts/v0.5', device='cuda')
pipeline.cfg_mode = 'new'
imgs = pipeline(
    "portrait of a cyberpunk girl with neon tattoos and a visor, staring intensely. Standing on top of a building",
    height=512,
    width=512,
    add_token_embed=0,
    task='t2i',
)
```
For more examples of any-to-any generation, check out `scripts/Demo.ipynb`.
See `scripts/training.md`. We also release a filtered synthetic dataset of text-audio-image triplets on Huggingface.
If you find OmniFlow useful in your research, please consider citing:
```bibtex
@InProceedings{Li_2025_CVPR,
    author    = {Li, Shufan and Kallidromitis, Konstantinos and Gokul, Akash and Liao, Zichun and Kato, Yusuke and Kozuka, Kazuki and Grover, Aditya},
    title     = {OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {13178-13188}
}
```

