MM-GRPO

A fast and easy-to-use library to support RL training for multi-modal generative models, built on top of verl, vLLM-Omni, and diffusers.

Key Features

  • Easy integration of diverse RL training algorithms for multi-modal generative models, including Flow-GRPO and Flow-GRPO-Fast.
  • Scalable and efficient parallel training with an asynchronous streaming workflow.
  • Compatibility with diffusion models from diffusers (see the sketch below).
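
For instance, the SD-3.5-M model post-trained in the quick start below can be loaded with the standard diffusers API. This is a minimal, library-independent sketch (the model ID and prompt are illustrative), not part of this repo's training code:

import torch
from diffusers import StableDiffusion3Pipeline

# Load SD-3.5-M, the model post-trained in the quick start below
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

# Sample an image; an OCR reward would later score how well the rendered
# text matches the prompt
image = pipe('a street sign that reads "OPEN"', num_inference_steps=28).images[0]
image.save("sample.png")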

Supported Algorithms

Supported Async Strategies

Supported Models

Supported Rewards

Note: This repository is continuously updated. New models, rewards, and algorithms will be added soon.

Get Started

Installation

Install from Source

The latest version of gerl (the Python package behind MM-GRPO) can be installed as follows:

pip install git+https://github.com/leibniz-csi/mm_grpo.git

Install from a Local Clone

If you prefer to use the scripts under examples/ directly, please clone the repository and install the package locally:

git clone https://github.com/leibniz-csi/mm_grpo.git && cd mm_grpo
pip install -e . 

# with PaddleOCR reward support
# pip install -e ".[paddleocr]"

Quick Start

Flow-GRPO / Flow-GRPO-Fast

Below are examples of post-training SD-3.5-M on an OCR task with the OCR reward.

  1. Dataset

Download the OCR dataset from Flow-GRPO and place it in the dataset folder.
Before training, set the config entries data.train_files and data.val_files to the corresponding dataset paths.

  2. Start Training

We provide scripts for a quick start:

# SD3 + Flow-GRPO
bash examples/flowgrpo_trainer/run_sd3.sh

# SD3 + Flow-GRPO-Fast
bash examples/flowgrpo_trainer/run_sd3_fast.sh

Reward Instructions

This repo supports multiple rule-based and model-based rewards (see Supported Rewards).

Reward Usage

Typical steps to use a reward:

  1. Install the required dependencies and, if needed, launch a model server.
  • Model-based or rule-based reward (CPU only)

    For example, to use the PaddleOCR reward, install its dependencies:

    pip install paddlepaddle "paddleocr>=3.0" python-Levenshtein
  • Model-based reward with vllm online serving.

    For example, to use the QwenVL-OCR reward or the UnifiedReward image reward, install the vllm package and launch an online serving endpoint (bash example below):

    Environment variable names for vllm serving

    After launching the vllm server, set environment variables for the access URL and the model path so the reward scorer can reach it:

    Reward          URL to access               Model to use
    QwenVL-OCR      QWEN_VL_OCR_VLLM_URL        QWEN_VL_OCR_PATH
    UnifiedReward   UNIFIED_REWARD_VLLM_URL     UNIFIED_REWARD_PATH

    Note: See QwenVLOCRVLLMScorer and UnifiedRewardVLLMScorer in vllm.py for the exact environment variable names and usage; a standalone client sketch is shown after the usage steps below.

    # Install vllm; refer to the official installation guide for details
    uv pip install vllm
    
    # Launch vllm serving for Qwen2.5-VL-7B-Instruct
    CUDA_VISIBLE_DEVICES=0 vllm serve ${CHECKPOINT_HOME}/Qwen/Qwen2.5-VL-7B-Instruct --host localhost --port 9529
    # Set the access URL and model path
    export QWEN_VL_OCR_VLLM_URL=http://localhost:9529/v1
    export QWEN_VL_OCR_PATH=${CHECKPOINT_HOME}/Qwen/Qwen2.5-VL-7B-Instruct
    
    # Launch vllm serving for UnifiedReward-2.0-qwen3vl-32b
    CUDA_VISIBLE_DEVICES=1,2,3,4 vllm serve ${CHECKPOINT_HOME}/CodeGoat24/UnifiedReward-2.0-qwen3vl-32b \
      --host localhost \
      --served-model-name UnifiedReward \
      --gpu-memory-utilization 0.9 \
      --tensor-parallel-size 4 \
      --port 8090
    # Set the access URL and served model name
    export UNIFIED_REWARD_VLLM_URL=http://localhost:8090/v1
    export UNIFIED_REWARD_PATH=UnifiedReward
  2. Add the training/validation reward names to the training configuration:
  • Single reward:

    python3 -m gerl.trainer.main_flowgrpo \
        data.data_source=ocr \
        data.reward_fn='["paddle-ocr"]' \
        ...
  • Multiple rewards:

    # data.reward_fn: proxy reward used during training
    # data.val_reward_fn: gold rewards used on the validation set
    python3 -m gerl.trainer.main_flowgrpo \
        data.data_source=prompt \
        data.reward_fn='["qwenvl-ocr-vllm"]' \
        data.val_reward_fn='["unified-reward-vllm", "qwenvl-ocr-vllm"]' \
        ...

    Note: if the validation reward val_reward_fn is not set, it defaults to the training reward reward_fn.
    data.data_source=prompt is required when multiple rewards are used.
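
The environment variables above are read by the reward scorers, which query the vllm server through its OpenAI-compatible API. The standalone sketch below illustrates that interaction for the QwenVL-OCR setup; it is not the repo's scorer code (see QwenVLOCRVLLMScorer in vllm.py for the actual implementation), and the image file and prompt here are illustrative assumptions:

import base64
import os

from openai import OpenAI  # pip install openai

# Same environment variables as set for the QwenVL-OCR reward above
base_url = os.environ["QWEN_VL_OCR_VLLM_URL"]  # e.g. http://localhost:9529/v1
model = os.environ["QWEN_VL_OCR_PATH"]         # model passed to `vllm serve`

client = OpenAI(base_url=base_url, api_key="EMPTY")  # vllm ignores the key

# Encode a generated image and ask the VLM to transcribe the text in it
with open("sample.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read the text shown in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)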

Reward Customization

Please refer to Customize Reward Function for details; it describes how to customize reward scorers step by step.
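
As an illustration of the kind of logic a rule-based scorer implements, the sketch below scores OCR transcriptions of generated images against their target texts with a normalized Levenshtein similarity (using the python-Levenshtein package installed for the PaddleOCR reward). The function name and signature are hypothetical and do not follow the repo's scorer interface, which is documented in the guide above:

import Levenshtein  # pip install python-Levenshtein


def ocr_similarity_reward(ocr_texts: list[str], target_texts: list[str]) -> list[float]:
    """Hypothetical rule-based reward: 1.0 for an exact OCR match, 0.0 for a
    completely different string. The OCR texts would come from an engine such
    as PaddleOCR run on the generated images."""
    rewards = []
    for ocr_text, target in zip(ocr_texts, target_texts):
        # Levenshtein.ratio returns a similarity score in [0, 1]
        rewards.append(Levenshtein.ratio(ocr_text.strip().lower(), target.strip().lower()))
    return rewards


# A near-miss transcription receives partial credit
print(ocr_similarity_reward(["0PEN"], ["OPEN"]))  # -> [0.75]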

Acknowledgement

We appreciate the contributions of the works this library builds on, including verl, vLLM-Omni, diffusers, and Flow-GRPO.
