A fast and easy-to-use library to support RL training for multi-modal generative models, built on top of verl, vLLM-Omni, and diffusers.
- Easy integration of diverse RL training algorithms for multi-modal generative models, including Flow-GRPO and Flow-GRPO-Fast.
- Scalable and efficient parallel training with an asynchronous streaming workflow.
- Compatibility with diffusion models from diffusers.
- Flow-GRPO
- Flow-GRPO-Fast
- Mix-GRPO (coming soon)
- DiffusionNFT (coming soon)
- Async Reward Computation during Rollout
- One-Step-Off Async Policy
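As a conceptual illustration of the one-step-off async policy (a minimal sketch, not gerl's actual implementation), the idea is to launch the rollout for the next step before training on the current batch finishes, so the sampling policy lags the learner by at most one update:

```python
# Conceptual sketch of a one-step-off async loop; rollout() and train_on()
# are hypothetical stand-ins, not functions from gerl.
from concurrent.futures import ThreadPoolExecutor


def rollout(step: int) -> str:
    """Stand-in for sampling a batch of trajectories with the current policy."""
    return f"batch_{step}"


def train_on(batch: str) -> None:
    """Stand-in for one policy update on a rollout batch."""


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(rollout, 0)           # warm-up rollout
    for step in range(1, 10):
        batch = future.result()                # wait for the previous rollout
        future = pool.submit(rollout, step)    # start the next rollout early
        train_on(batch)                        # training overlaps with sampling
```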
Note: This repository is continuously updated. New models, rewards, and algorithms will be added soon.
The latest version of gerl can be installed as follows:
```bash
pip install git+https://github.com/leibniz-csi/mm_grpo.git
```

If you prefer to use the scripts under examples/ directly, please clone the repository and install the package locally:

```bash
git clone https://github.com/leibniz-csi/mm_grpo.git && cd mm_grpo
pip install -e .

# with paddleocr reward support
# pip install -e .[paddleocr]
```

Below are examples for post-training SD-3.5-M on an OCR task using the OCR reward.
- Dataset

  Download the OCR dataset from Flow-GRPO and place it in the `dataset` folder.
  Before training, set the dataset paths via the `data.train_files` and `data.val_files` config entries.
- Start Training

  We provide scripts for a quick start:

  ```bash
  # SD3 + Flow-GRPO
  bash examples/flowgrpo_trainer/run_sd3.sh

  # SD3 + Flow-GRPO-Fast
  bash examples/flowgrpo_trainer/run_sd3_fast.sh
  ```

This repo supports multiple rule-based and model-based rewards (see Supported Rewards).
Typical steps to use a reward:
- Install related dependencies and optionally launch any model server.
  - Model-based or rule-based reward (CPU only)

    For example, to use the PaddleOCR reward, install the related dependencies:

    ```bash
    pip install paddlepaddle "paddleocr>=3.0" python-Levenshtein
    ```
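
    As an illustration only (not the library's actual `paddle-ocr` scorer), a rule-based OCR reward of this kind typically compares the text recognized in the generated image against the target text using a normalized edit distance:

    ```python
    # Illustrative sketch of a Levenshtein-based OCR reward; the function name
    # and normalization are assumptions, not gerl's actual implementation.
    import Levenshtein  # provided by python-Levenshtein


    def ocr_reward(recognized_text: str, target_text: str) -> float:
        """Map edit distance to a reward in [0, 1]; 1.0 is an exact match."""
        if not target_text:
            return 0.0
        dist = Levenshtein.distance(recognized_text, target_text)
        # Normalize by target length so longer targets are not penalized more.
        return max(0.0, 1.0 - dist / len(target_text))


    print(ocr_reward("HELLO WORLD", "HELLO W0RLD"))  # ~0.91
    ```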
  - Model-based reward with vllm online serving

    For example, to use the QwenVL-OCR reward or the UnifiedReward image reward, install the `vllm` package and launch online serving (bash example below). To launch the vllm server, set the environment variables for the access URL and model path:

    | Reward | URL to access | Model to use |
    | --- | --- | --- |
    | QwenVL-OCR | `QWEN_VL_OCR_VLLM_URL` | `QWEN_VL_OCR_PATH` |
    | UnifiedReward | `UNIFIED_REWARD_VLLM_URL` | `UNIFIED_REWARD_PATH` |

    Note: See `QwenVLOCRVLLMScorer` and `UnifiedRewardVLLMScorer` in vllm.py for environment variable names and usage.

    ```bash
    # vllm installation; please refer to the official installation guide for details
    uv pip install vllm

    # Launch vllm serving for Qwen2.5-VL-7B-Instruct
    CUDA_VISIBLE_DEVICES=0 vllm serve ${CHECKPOINT_HOME}/Qwen/Qwen2.5-VL-7B-Instruct --host localhost --port 9529

    # Set access url and model name
    export QWEN_VL_OCR_VLLM_URL=http://localhost:9529/v1
    export QWEN_VL_OCR_PATH=${CHECKPOINT_HOME}/Qwen/Qwen2.5-VL-7B-Instruct

    # Launch vllm serving for UnifiedReward-2.0-qwen3vl-32b
    CUDA_VISIBLE_DEVICES=1,2,3,4 vllm serve ${CHECKPOINT_HOME}/CodeGoat24/UnifiedReward-2.0-qwen3vl-32b \
        --host localhost \
        --served-model-name UnifiedReward \
        --gpu-memory-utilization 0.9 \
        --tensor-parallel-size 4 \
        --port 8090

    # Set access url and model name
    export UNIFIED_REWARD_VLLM_URL=http://localhost:8090/v1
    export UNIFIED_REWARD_PATH=UnifiedReward
    ```
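
    Once the server is up, a scorer can query it through vllm's OpenAI-compatible API. Below is a minimal sketch of such a query; the file name, prompt text, and reply handling are illustrative assumptions (see `QwenVLOCRVLLMScorer` for the actual logic):

    ```python
    # Minimal sketch of querying the vllm OpenAI-compatible endpoint with an
    # image; the prompt and reply parsing are assumptions for illustration.
    import base64
    import os

    from openai import OpenAI

    client = OpenAI(base_url=os.environ["QWEN_VL_OCR_VLLM_URL"], api_key="EMPTY")

    with open("generated.png", "rb") as f:  # hypothetical rollout image
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model=os.environ["QWEN_VL_OCR_PATH"],
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Read the text in this image."},
            ],
        }],
    )
    print(response.choices[0].message.content)  # recognized text for reward scoring
    ```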
- Add training/validation reward names to the training configuration:

  - Single reward:

    ```bash
    python3 -m gerl.trainer.main_flowgrpo \
        data.data_source=ocr \
        data.reward_fn='["paddle-ocr"]' \
        ...
    ```
  - Multiple rewards:

    ```bash
    python3 -m gerl.trainer.main_flowgrpo \
        data.data_source=prompt \
        data.reward_fn='["qwenvl-ocr-vllm"]' \
        data.val_reward_fn='["unified-reward-vllm", "qwenvl-ocr-vllm"]' \
        ...
    ```

    Here `data.reward_fn` is the proxy reward used during training, and `data.val_reward_fn` lists the gold rewards used on the validation set.

    Note: if the validation reward `val_reward_fn` is not set, it defaults to the training reward `reward_fn`. `data.data_source=prompt` is required for multi-reward training.
Please refer to Customize Reward Function, which describes how to customize reward scorers step by step.
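To give a feel for what a custom scorer can look like, here is a hypothetical sketch; the actual base class, method signature, and registration mechanism are defined in the Customize Reward Function guide, so treat every name below as an assumption:

```python
# Hypothetical custom scorer sketch; the class name, method signature, and
# the (images, prompts) -> scores contract are assumptions, not gerl's API.
from typing import List

import numpy as np


class BrightnessScorer:
    """Toy scorer that rewards brighter images (illustration only)."""

    def score(self, images: List[np.ndarray], prompts: List[str]) -> List[float]:
        # Assume images arrive as HxWxC uint8 arrays; map mean brightness to [0, 1].
        return [float(np.asarray(img).mean() / 255.0) for img in images]
```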
We appreciate the contributions of the following works: