Generative multimodal artificial intelligence (AI) has achieved remarkable progress in recent years, driven by large-scale pre-training and the emergence of powerful foundation models. While these models have demonstrated strong capabilities in perception, reasoning, and content synthesis, their training is predominantly based on supervised objectives, which are often insufficient to capture task-specific goals and user intent. Reinforcement learning (RL) has therefore emerged as a critical training framework for improving generative multimodal models.
This repository collects research papers on reinforcement learning in generative multimodal AI. We primarily focus on three categories of models:
- Multimodal understanding models, which perceive and reason over visual inputs and produce corresponding natural language responses.
- Visual generation models, which synthesize visual content conditioned on textual prompts or inputs from other modalities.
- Unified models, which adopt a single framework to jointly support visual understanding and visual generation, accepting multimodal inputs and flexibly producing visual or textual outputs.
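Many of the papers listed below build on Group Relative Policy Optimization (GRPO), which scores a group of sampled responses to the same prompt relative to one another instead of training a value model. As a rough illustration only (the function name and normalization details are assumptions, not the exact formulation of any listed paper), the group-relative advantage can be sketched as:

```python
# Illustrative sketch of a GRPO-style group-relative advantage.
# Not taken from any specific paper in this list; normalization
# details (e.g. the epsilon term, population std) are assumptions.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """For one prompt, normalize each sampled response's reward
    against the group: a_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four responses to one prompt, scored by a reward model.
advs = group_relative_advantages([0.9, 0.1, 0.5, 0.5])
```

Responses scoring above the group mean get positive advantages and are reinforced; below-mean responses are penalized, with the advantages summing to (approximately) zero within each group.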
- [2601] [VAR-RL] VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
- [2512] [EMA-GRPO] OneThinker: All-in-one Reasoning Model for Image and Video
- [2510] [PIVOT] RL makes MLLMs see better than SFT
- [2510] [RewardMap] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
- [2509] [STAGE] STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
- [2509] [GCPO] Group Critical-token Policy Optimization for Autoregressive Image Generation
- [2508] [DGRPO] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
- [2508] [AR-GRPO] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
- [2507] [X-Omni] X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
- [2507] [3D-R1] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
- [2507] [Multi-Image] Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
- [2506] [VL-GenRM] VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training
- [2506] [Temporal-RLT] Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
- [2506] [SRPO] SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
- [2506] [RAPID] Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning
- [2506] [GThinker] GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
- [2506] [DeepVideo-R1] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
- [2506] [SVQA-R1] SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization
- [2506] [ViLaSR] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
- [2506] [GRPO-CARE] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
- [2506] [WeThink] WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
- [2506] [FocusDiff] FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL
- [2506] [ViCrit] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
- [2506] [MiMo-VL] MiMo-VL Technical Report
- [2506] [Scene-R1] Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
- [2506] [AV-Reasoner] AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
- [2506] [Q-Ponder] Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment
- [2506] [VQ-Insight] VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning
- [2506] [TimeMaster] TimeMaster: Training Time-Series Multimodal LLMs to Reason via Reinforcement Learning
- [2506] [EgoVLM] EgoVLM: Policy Optimization for Egocentric Video Understanding
- [2506] [RePIC] RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
- [2505] [R1-Reward] R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
- [2505] [TW-GRPO] Reinforcing Video Reasoning with Focused Thinking
- [2505] [ViGoRL] Grounded Reinforcement Learning for Visual Reasoning
- [2505] [GRIT] GRIT: Teaching MLLMs to Think with Images
- [2505] [Ground-R1] Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning
- [2505] [Pixel Reasoner] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
- [2505] [DeepEyes] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
- [2505] [CoF] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
- [2505] [OpenThinkIMG] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
- [2505] [Active-O3] Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
- [2505] [Qwen-LA] Qwen Look Again: Guiding Vision-Language Reasoning
- [2505] [VRAG-RL] VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
- [2505] [VAR-GRPO] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization
- [2505] [ReasonGen-R1] ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL
- [2505] [BiCoT-GRPO] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
- [2505] [SelfTok] Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
- [2505] [CoRL] Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
- [2505] [UniRL] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
- [2505] [Observe-R1] Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning
- [2505] [Visionary-R1] Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
- [2505] [SATORI-R1] SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards
- [2505] [DIP-R1] DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
- [2505] [V-Triune] One RL to See Them All: Visual Triple Unified Reinforcement Learning
- [2505] [Omni-R1] Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
- [2505] [Skywork-VL] Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
- [2505] [UnifiedReward-Think] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
- [2505] [GoT-R1] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
- [2505] [MoDoMoDo] MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
- [2505] [VisionReasoner] VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning
- [2505] [VisualQuality-R1] VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
- [2505] [EchoInk-R1] EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
- [2505] [VAU-R1] VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
- [2505] [Jigsaw-R1] Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
- [2505] [G1] G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
- [2505] [VisTA] VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
- [2505] [STAR-R1] STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
- [2505] [UniVG-R1] UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
- [2505] [ProxyThinker] ProxyThinker: Test-Time Guidance through Small Visual Reasoners
- [2505] [Visual Planning] Visual Planning: Let's Think Only with Images
- [2505] [RLRF] Rendering-Aware Reinforcement Learning for Vector Graphics Generation
- [2505] [Omni-R1] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
- [2504] [VARGPT-v1.1] VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
- [2504] [VLM-R1] VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
- [2504] [Skywork R1V2] Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
- [2504] [Perception-R1] Perception-R1: Pioneering Perception Policy with Reinforcement Learning
- [2504] [TinyLLaVA-Video-R1] TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
- [2504] [SimpleAR] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
- [2504] [SpaceR] SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
- [2504] [VideoChat-R1] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
- [2504] [vsGRPO] Improved Visual-Spatial Reasoning via R1-Zero-Like Training
- [2504] [Phys-AR] Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning
- [2503] [VisRL] VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
- [2503] [Reason-RFT] Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
- [2503] [MetaSpatial] MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
- [2503] [R1-Onevision] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
- [2503] [T-GRPO] Video-R1: Reinforcing Video Reasoning in MLLMs
- [2503] [Vision-R1] Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
- [2503] [Visual-RFT] Visual-RFT: Visual Reinforcement Fine-Tuning
- [2503] [Time-R1] Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
- [2503] [UnifiedReward] Unified Reward Model for Multimodal Understanding and Generation
- [2503] [SEED-Bench-R1] Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
- [2503] [MM-Eureka] MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
- [2503] [R1-Omni] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
- [2502] [HermesFlow] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
- [2501] [PARM] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- [2411] [MPO] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
- [2312] [RLHF-V] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- [2603] [PhyPrompt] PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation
- [2602] [UnifiedReward-Flex] Unified Personalized Reward Model for Vision Generation
- [2601] [PhysRVG] PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models
- [2601] [Talk2Move] Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
- [2601] [DenseGRPO] DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
- [2601] [TAGRPO] TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment
- [2512] [GARDO] GARDO: Reinforcing Diffusion Models without Reward Hacking
- [2512] [PhyGDPO] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
- [2510] [Identity-GRPO] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
- [2510] [IPRO] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
- [2510] [DGPO] Reinforcing Diffusion Models by Direct Group Preference Optimization
- [2510] [Smart-GRPO] Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models
- [2510] [G2RPO] Fine-Grained GRPO for Precise Preference Alignment in Flow Models
- [2509] [Dynamic-TreeRPO] Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
- [2509] [BruPA/FluPA] Follow-Your-Preference: Towards Preference-Aligned Image Inpainting
- [2509] [EditScore] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
- [2509] [DiffusionNFT] DiffusionNFT: Online Diffusion Reinforcement with Forward Process
- [2509] [AWM] Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
- [2509] [PCPO] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
- [2509] [CPS] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
- [2509] [BranchGRPO] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
- [2508] [Pref-GRPO] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
- [2508] [TempFlow-GRPO] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
- [2507] [MixGRPO] MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
- [2507] [LLaVA-Reward] Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
- [2506] [DreamCS] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision
- [2506] [RDPO] RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
- [2505] [Flow-GRPO] Flow-GRPO: Training Flow Matching Models via Online RL
- [2505] [SmPO-Diffusion] Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences
- [2505] [Diffusion-NPO] Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models
- [2505] [DanceGRPO] DanceGRPO: Unleashing GRPO on Visual Generation
- [2505] [InfLVG] InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO
- [2505] [D-Fusion] D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples
- [2505] [RePrompt] RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning
- [2505] [Self-Reward] Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation
- [2504] [GAPO] Aligning Anime Video Generation with Human Feedback
- [2503] [B2-DiffuRL] Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
- [2503] [LOOP] A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning
- [2503] [InPO] InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
- [2503] [GRADEO] GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning
- [2502] [Continuous Time RL] Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning
- [2502] [DreamDPO] DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization
- [2502] [CaPO] Calibrated Multi-Preference Optimization for Aligning Diffusion Models
- [2501] [Flow-DPO] Improving Video Generation with Human Feedback
- [2501] [PPD] Personalized Preference Fine-tuning of Diffusion Models
- [2412] [DiffPPO] DiffPPO: Reinforcement Learning Fine-Tuning of Diffusion Models for Text-to-Image Generation
- [2412] [VideoDPO] VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
- [2412] [MVReward] MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences
- [2412] [TPDM] Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation
- [2410] [SSPO] Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization
- [2410] [PrefPaint] PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference
- [2409] [VideoRM] Boosting Text-to-Video Generative Model with MLLMs Feedback
- [2407] [RPO] Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
- [2407] [DPG] Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
- [2406] [Diversity Reward] Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
- [2406] [VideoScore] VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
- [2406] [Diffusion-RPO] Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization
- [2406] [SPO] Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization
- [2405] [Curriculum DPO] Curriculum Direct Preference Optimization for Diffusion and Consistency Models
- [2405] [DNO] Inference-Time Alignment of Diffusion Models with Direct Noise Optimization
- [2404] [RLCM] RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
- [2404] [Diffusion-KTO] Aligning Diffusion Models by Optimizing Human Utility
- [2404] [PAE] Dynamic Prompt Optimizing for Text-to-Image Generation
- [2403] [DreamReward] DreamReward: Text-to-3D Generation with Human Preference
- [2402] [TDPO-R] Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases
- [2402] [DenseReward] A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
- [2401] [Parrot] Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
- [2312] [InstructVideo] InstructVideo: Instructing Video Diffusion Models with Human Feedback
- [2311] [D3PO] Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- [2311] [Diffusion-DPO] Diffusion Model Alignment Using Direct Preference Optimization
- [2311] [TextForce] Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
- [2309] [DRaFT] Directly Fine-Tuning Diffusion Models on Differentiable Rewards
- [2305] [DDPO] Training Diffusion Models with Reinforcement Learning
- [2305] [DPOK] DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
- [2304] [ImageReward] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
- [2304] [RAFT] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- [2302] [Reward-weighted Method] Aligning Text-to-Image Models using Human Feedback
- [2212] [Promptist] Optimizing Prompts for Text-to-Image Generation
