Welcome to Omni-R1! 👋 This repository provides implementation code for "Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning".
We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data.
Important
Faster Evaluation & RL Rollouts with vLLM. Our evaluation and RL rollout pipelines (based on verl) are accelerated by vLLM, which can significantly reduce the inference time of large-scale sampling and long rollouts.
🧭 Two-stage training pipeline
PeSFT introduces perception alignment loss during SFT, and PeRPO applies a perception reward during RL to enhance functional image generation.
🌌 Two training regimes under different supervision
Omni-R1 uses multimodal interleaved supervision, while Omni-R1-Zero bootstraps step-wise visualizations from text-only reasoning data.
🧩 Benchmark
Includes Omni-Bench data and a vLLM-based evaluation script that runs inference efficiently and saves predictions in JSONL format.
- [2026.01] Initial release of Omni-R1.
✅ Reproducibility essentials for Omni-R1 (core code, datasets, checkpoints)
✅ Paper link
✅ Omni-Bench (data + vLLM evaluation script)
⬜ Fully end-to-end PeRPO training framework
⬜ Implementation of step-wise visualization bootstrapping
```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
pip install ./src/transformers
```

For RL training with verl:

```bash
git clone https://github.com/volcengine/verl && cd verl
# Follow the official install docs:
# https://verl.readthedocs.io/en/latest/start/install.html
```

Tip
If you only want to run inference with our pretrained checkpoints, you can skip this training section and jump to 3. Inference.
If you only want to evaluate on our benchmark, you can go directly to 4. Omni-Bench.
- Omni-R1 supervision: Zebra-CoT (https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
- Omni-R1-Zero text-only CoT seeds: M3CoT (https://huggingface.co/datasets/LightChen2333/M3CoT)
Minimal DeepSpeed config:
```bash
DS_JSON='{
  "bf16": {"enabled": true},
  "zero_optimization": {"stage": 2},
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1
}'
```

Run
```bash
export BASE_OR_CKPT=/path/to/base_or_ckpt
export OUT=checkpoints/pesft_run
export JSON_DIR=data/zebra_cot_jsonl

deepspeed --num_gpus 8 src/PeSFT/pesft.py \
  --model_path "$BASE_OR_CKPT" \
  --output_path "$OUT" \
  --json_dir "$JSON_DIR" \
  --deepspeed_config_json "$DS_JSON" \
  --learning_rate 1e-5 \
  --gradient_accumulation_steps 1 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --mode templated  # for Omni-R1-Zero
  # --mode plain    # for Omni-R1
```

What PeSFT does
- Supervised fine-tuning with cross-entropy + perception alignment loss to stabilize functional image generation.
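As an illustration, the PeSFT objective can be sketched as a standard next-token cross-entropy plus an alignment term on image features. The cosine-distance form, the `lam` weight, and the function signature below are assumptions for exposition, not the exact formulation in `src/PeSFT/pesft.py`:

```python
import torch
import torch.nn.functional as F

def pesft_loss(logits, labels, pred_img_feats, ref_img_feats, lam=1.0):
    """Sketch of the PeSFT objective: token cross-entropy plus a
    perception alignment term (illustrative, not the repo's exact code)."""
    # Standard next-token cross-entropy; -100 marks padded label positions.
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    # Perception alignment: pull generated-image features toward reference
    # features, here as mean cosine distance over the batch.
    align = 1.0 - F.cosine_similarity(pred_img_feats, ref_img_feats, dim=-1).mean()
    return ce + lam * align
```

When the predicted features already match the references, the alignment term vanishes and the loss reduces to plain cross-entropy.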
Note
The end-to-end PeRPO training recipe is being organized and will be released in a more complete form soon.
Tip
PeRPO can be reproduced by following verl’s DAPO recipe.
In volcengine/verl, you can directly follow:
verl/recipe/dapo
Then, plug in and reuse our reward functions in src/PeRPO/rewards.py as the reward module for the DAPO training loop.
Reward implementation: src/PeRPO/rewards.py
What PeRPO does
- RL refinement with group-relative optimization using a perception-calibrated reward:
- Accuracy
- Format
- Perception
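A minimal sketch of how such a composite reward could feed group-relative optimization. The weighted sum, the weights, and the normalization are illustrative assumptions; the actual reward functions live in `src/PeRPO/rewards.py`:

```python
from statistics import mean, pstdev

def composite_reward(acc, fmt, perc, w_acc=1.0, w_fmt=0.5, w_perc=0.5):
    # Weighted sum of the Accuracy, Format, and Perception terms.
    # The weights here are illustrative, not the paper's values.
    return w_acc * acc + w_fmt * fmt + w_perc * perc

def group_relative_advantages(rewards, eps=1e-6):
    # Group-relative baseline (GRPO/DAPO-style): normalize each rollout's
    # reward against the mean and std of its sampling group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Advantages computed this way are centered within each group, so better-than-average rollouts get positive advantages and worse ones negative.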
You can skip training and use our pretrained models below:
- Omni-R1: https://huggingface.co/ModalityDance/Omni-R1
- Omni-R1-Zero: https://huggingface.co/ModalityDance/Omni-R1-Zero
```bash
export INPUT_JSONL=/path/to/data.jsonl
export OUTDIR=outputs/demo_run
export MODEL=/path/to/ckpt
export PROCESSOR=/path/to/processor_ckpt

python src/Inference/inference.py \
  --input "$INPUT_JSONL" \
  --output-dir "$OUTDIR" \
  --model-path "$MODEL" \
  --processor-path "$PROCESSOR" \
  --do-sample \
  --temperature 1.0 \
  --top-p 0.9
```

Key arguments

- `--input`: JSONL file (or a directory of JSONL files)
- `--output-dir`: where predictions are saved
- `--model-path`: model checkpoint path
- `--processor-path`: processor checkpoint path
- `--do-sample`, `--temperature`, `--top-p`: sampling settings
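The exact JSONL schema is task-dependent; as a purely hypothetical illustration (the field names `question` and `image` are assumptions, not the script's documented schema), an input file can be assembled like this:

```python
import json

# Hypothetical records -- field names are assumptions for illustration only.
records = [
    {"question": "How many chairs are in the image?", "image": "imgs/0001.png"},
    {"question": "What is the measure of angle ABC?", "image": "imgs/0002.png"},
]

# Write one JSON object per line, which is what a JSONL input expects.
with open("data.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```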
Download the dataset: https://huggingface.co/datasets/ModalityDance/Omni-Bench
Omni-Bench contains 800 samples spanning 4 Uni-Tasks:
- Natural-Scene Perception: V*
- Structured-Image: ArxivQA, ChartQA
- Diagrammatic Math: Geometry3k, MathVista
- Vision-Operational Scenes: ViC-Bench
```bash
python omni-bench/vllm_eval.py \
  --parquet_path omni-bench/omni-bench.parquet \
  --model_path /path/to/your_model \
  --outfile preds.jsonl \
  --mm_images_per_prompt 5
```

What this script does

- Loads the Omni-Bench parquet file
- Runs batched inference with vLLM
- Saves predictions in JSONL format (`preds.jsonl`)
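Once the script finishes, the predictions can be inspected with a few lines of Python. A minimal loader sketch; the per-record fields are not specified here, so treat any particular key access as an assumption:

```python
import json

def load_preds(path):
    # Read a JSONL predictions file (one JSON object per line), skipping
    # blank lines. Record contents depend on vllm_eval.py's output schema.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```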
Omni-R1 learns to generate interleaved multimodal reasoning trajectories through a two-stage SFT → RL pipeline.
- Omni-R1: trained on annotated interleaved multimodal trajectories.
- Omni-R1-Zero: when such annotations are unavailable, bootstraps interleaved trajectories from text-only CoT by visualizing each reasoning step, then trains with the same pipeline.
- PeSFT: performs supervised fine-tuning with cross-entropy plus a perception alignment loss to stabilize functional image generation.
- PeRPO: refines the policy with group-relative RL on unified tasks using a composite reward—Accuracy, Format, and Perception.
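The data flow of the two regimes can be sketched as follows; `pesft`, `perpo`, and `bootstrap_visualizations` are hypothetical stand-ins for the real training entry points, shown only to make the pipeline explicit:

```python
def pesft(model, interleaved_data):
    # Stage 1: supervised fine-tuning (cross-entropy + perception alignment).
    return model  # stub for illustration

def perpo(model, rl_tasks):
    # Stage 2: group-relative RL with the composite reward.
    return model  # stub for illustration

def bootstrap_visualizations(text_only_cot):
    # Pair each text reasoning step with a generated visualization
    # (placeholder file names stand in for the actual images).
    return [(step, f"viz_{i}.png") for i, step in enumerate(text_only_cot)]

def train_omni_r1(model, interleaved_data, rl_tasks):
    # Two-stage SFT -> RL pipeline on interleaved supervision.
    return perpo(pesft(model, interleaved_data), rl_tasks)

def train_omni_r1_zero(model, text_only_cot, rl_tasks):
    # Bootstrap interleaved trajectories, then reuse the same pipeline.
    return train_omni_r1(model, bootstrap_visualizations(text_only_cot), rl_tasks)
```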
A high-level overview is illustrated in the figure below.
```
.
├── omni-bench/
│   ├── omni-bench.parquet        # Benchmark dataset (available on HF)
│   └── vllm_eval.py              # vLLM inference / evaluation
│
└── src/
    ├── Inference/
    │   └── inference.py          # Inference
    │
    ├── PeRPO/
    │   └── rewards.py            # Perception reward utilities
    │
    ├── PeSFT/
    │   ├── perception.py         # Perception module
    │   ├── perception_module.ckpt  # Perception module checkpoint
    │   ├── pesft.py              # PeSFT training
    │   └── trainer.py            # Training utilities
    │
    └── transformers/             # Patched transformers (installed via pip)
```
We would like to thank the contributors, open-source projects, and research communities whose work made Omni-R1 possible.
This project is licensed under the MIT License. It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License. Please refer to the LICENSE file for more details.
If you use Omni-R1 in your research or applications, please consider citing:
```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```
