MM-GRPO

A fast and easy-to-use library to support RL training for multi-modal generative models, built on top of verl, vLLM-Omni, and diffusers.

Key Features

  • Easy integration of diverse RL training algorithms for multi-modal generative models, including Flow-GRPO and Flow-GRPO-Fast.
  • Scalable and efficient parallel training with an asynchronous streaming workflow.
  • Compatibility with diffusion models from diffusers (see the sketch below).
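
For instance, the SD-3.5-M model post-trained in the quick start below can be loaded with the standard diffusers API. This is a minimal, library-independent sketch (the model ID and prompt are illustrative), not part of this repo's training code:

import torch
from diffusers import StableDiffusion3Pipeline

# Load SD-3.5-M, the model post-trained in the quick start below
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

# Sample an image; an OCR reward would later score how well the rendered
# text matches the prompt
image = pipe('a street sign that reads "OPEN"', num_inference_steps=28).images[0]
image.save("sample.png")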

Supported Algorithms

Supported Async Strategies

Supported Models

Supported Rewards

Note: This repository is continuously updated. New models, rewards, and algorithms will be added soon.

Get Started

Installation

Install from Source

The latest version of gerl (the Python package behind MM-GRPO) can be installed as follows:

pip install git+https://github.com/leibniz-csi/mm_grpo.git

Install from a Local Clone

If you prefer to use the scripts under examples/ directly, please clone the repository and install the package locally:

git clone https://github.com/leibniz-csi/mm_grpo.git && cd mm_grpo
pip install -e . 

# with PaddleOCR reward support
# pip install -e ".[paddleocr]"

Quick Start

Flow-GRPO / Flow-GRPO-Fast

Below are examples of post-training SD-3.5-M on an OCR task with the OCR reward.

  1. Dataset

Download the OCR dataset from Flow-GRPO and place it in the dataset folder.
Before training, set the config entries data.train_files and data.val_files to the corresponding dataset paths.

  2. Start Training

We provide scripts for a quick start:

# SD3 + Flow-GRPO
bash examples/flowgrpo_trainer/run_sd3.sh

# SD3 + Flow-GRPO-Fast
bash examples/flowgrpo_trainer/run_sd3_fast.sh

Reward Instructions

This repo supports multiple rule-based and model-based rewards (see Supported Rewards).

Reward Usage

Typical steps to use a reward:

  1. Install the required dependencies and, if needed, launch a model server.
  • Model-based or rule-based reward (CPU only)

    For example, to use the PaddleOCR reward, install its dependencies:

    pip install paddlepaddle "paddleocr>=3.0" python-Levenshtein
  • Model-based reward with vllm online serving.

    For example, to use the QwenVL-OCR reward or the UnifiedReward image reward, install the vllm package and launch an online serving endpoint (bash example below):

    Environment variable names for vllm serving

    After launching the vllm server, set environment variables for the access URL and the model path so the reward scorer can reach it:

    Reward          URL to access               Model to use
    QwenVL-OCR      QWEN_VL_OCR_VLLM_URL        QWEN_VL_OCR_PATH
    UnifiedReward   UNIFIED_REWARD_VLLM_URL     UNIFIED_REWARD_PATH

    Note: See QwenVLOCRVLLMScorer and UnifiedRewardVLLMScorer in vllm.py for the exact environment variable names and usage; a standalone client sketch is shown after the usage steps below.

    # Install vllm; refer to the official installation guide for details
    uv pip install vllm
    
    # Launch vllm serving for Qwen2.5-VL-7B-Instruct
    CUDA_VISIBLE_DEVICES=0 vllm serve ${CHECKPOINT_HOME}/Qwen/Qwen2.5-VL-7B-Instruct --host localhost --port 9529
    # Set the access URL and model path
    export QWEN_VL_OCR_VLLM_URL=http://localhost:9529/v1
    export QWEN_VL_OCR_PATH=${CHECKPOINT_HOME}/Qwen/Qwen2.5-VL-7B-Instruct
    
    # Launch vllm serving for UnifiedReward-2.0-qwen3vl-32b
    CUDA_VISIBLE_DEVICES=1,2,3,4 vllm serve ${CHECKPOINT_HOME}/CodeGoat24/UnifiedReward-2.0-qwen3vl-32b \
      --host localhost \
      --served-model-name UnifiedReward \
      --gpu-memory-utilization 0.9 \
      --tensor-parallel-size 4 \
      --port 8090
    # Set the access URL and served model name
    export UNIFIED_REWARD_VLLM_URL=http://localhost:8090/v1
    export UNIFIED_REWARD_PATH=UnifiedReward
  2. Add the training/validation reward names to the training configuration:
  • Single reward:

    python3 -m gerl.trainer.main_flowgrpo \
        data.data_source=ocr \
        data.reward_fn='["paddle-ocr"]' \
        ...
  • Multiple rewards:

    # data.reward_fn: proxy reward used during training
    # data.val_reward_fn: gold rewards used on the validation set
    python3 -m gerl.trainer.main_flowgrpo \
        data.data_source=prompt \
        data.reward_fn='["qwenvl-ocr-vllm"]' \
        data.val_reward_fn='["unified-reward-vllm", "qwenvl-ocr-vllm"]' \
        ...

    Note: if the validation reward val_reward_fn is not set, it defaults to the training reward reward_fn.
    data.data_source=prompt is required when multiple rewards are used.
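
The environment variables above are read by the reward scorers, which query the vllm server through its OpenAI-compatible API. The standalone sketch below illustrates that interaction for the QwenVL-OCR setup; it is not the repo's scorer code (see QwenVLOCRVLLMScorer in vllm.py for the actual implementation), and the image file and prompt here are illustrative assumptions:

import base64
import os

from openai import OpenAI  # pip install openai

# Same environment variables as set for the QwenVL-OCR reward above
base_url = os.environ["QWEN_VL_OCR_VLLM_URL"]  # e.g. http://localhost:9529/v1
model = os.environ["QWEN_VL_OCR_PATH"]         # model passed to `vllm serve`

client = OpenAI(base_url=base_url, api_key="EMPTY")  # vllm ignores the key

# Encode a generated image and ask the VLM to transcribe the text in it
with open("sample.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read the text shown in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)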

Reward Customization

Please refer to Customize Reward Function for details; it describes how to customize reward scorers step by step.
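
As an illustration of the kind of logic a rule-based scorer implements, the sketch below scores OCR transcriptions of generated images against their target texts with a normalized Levenshtein similarity (using the python-Levenshtein package installed for the PaddleOCR reward). The function name and signature are hypothetical and do not follow the repo's scorer interface, which is documented in the guide above:

import Levenshtein  # pip install python-Levenshtein


def ocr_similarity_reward(ocr_texts: list[str], target_texts: list[str]) -> list[float]:
    """Hypothetical rule-based reward: 1.0 for an exact OCR match, 0.0 for a
    completely different string. The OCR texts would come from an engine such
    as PaddleOCR run on the generated images."""
    rewards = []
    for ocr_text, target in zip(ocr_texts, target_texts):
        # Levenshtein.ratio returns a similarity score in [0, 1]
        rewards.append(Levenshtein.ratio(ocr_text.strip().lower(), target.strip().lower()))
    return rewards


# A near-miss transcription receives partial credit
print(ocr_similarity_reward(["0PEN"], ["OPEN"]))  # -> [0.75]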

Acknowledgement

We appreciate the contributions of the works this library builds on, including verl, vLLM-Omni, diffusers, and Flow-GRPO.
