Reinforcing Action Policies by Prophesying

arXiv Website
Jiahui Zhang1 3 *, Ze Huang1 3 *, Chun Gu1 2 3, Zipei Ma1 2, Li Zhang1 2 3
1School of Data Science, Fudan University&emsp;2Shanghai Innovation Institute&emsp;3Logos Robotics

ProphRL uses a world model as a real-world-facing simulator to post-train VLA policies. Our world model, Prophet, extends a video generator with a history-aware mechanism and dual action conditioning, and is pretrained on large-scale robot trajectories to model action-to-video dynamics. The pretrained Prophet can "prophesy" precise, physically plausible long-horizon rollouts, and can be rapidly adapted via few-shot fine-tuning to new environments, objects, and trajectories. On top of Prophet, we introduce the FA-GRPO algorithm with FlowScale to improve policies more stably and efficiently. Together, our training paradigm turns diverse logged data and a single pretrained world model into a unified engine for scalable, data-efficient, and safely improvable VLA systems.
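The loop below is a minimal sketch of this paradigm under assumed interfaces (a Prophet-like `world_model.predict_next`, a VLA `policy.act`, and a task `reward_fn` are placeholders, not the released API): the policy is rolled out inside the learned world model, the prophesied rollouts are scored, and the policy is reinforced from those scores.

```python
# Hedged sketch of world-model-based post-training; all class/method names are
# placeholders and do not correspond to the released ProphRL code.
import torch

def prophesy_rollout(world_model, policy, instruction, init_obs, horizon=50):
    """Roll out `policy` for `horizon` steps, using the world model as an
    action-to-video simulator instead of a real robot."""
    obs, frames, actions = init_obs, [init_obs], []
    for _ in range(horizon):
        action = policy.act(obs, instruction)            # VLA predicts an action (chunk)
        obs = world_model.predict_next(frames, action)   # action-conditioned next frame(s)
        frames.append(obs)
        actions.append(action)
    return frames, actions

def post_train_step(world_model, policy, reward_fn, tasks, group_size=8):
    """One RL update: sample a group of prophesied rollouts per task, score them
    with a task reward, and reinforce the policy (e.g. FA-GRPO, sketched below)."""
    for instruction, init_obs in tasks:
        rollouts = [prophesy_rollout(world_model, policy, instruction, init_obs)
                    for _ in range(group_size)]
        rewards = torch.tensor([reward_fn(frames, instruction) for frames, _ in rollouts])
        policy.update(rollouts, rewards)                 # placeholder policy-gradient step
```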

šŸ“ Abstract

Vision–Language–Action (VLA) policies excel in aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer.

We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation model pretrained across large-scale, heterogeneous robot data to learn reusable action–outcome dynamics. Prophet can be few-shot adapted to new robots, objects, and environments, yielding a rollout-ready simulator.

Upon Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5–17% success gains on public benchmarks and 24–30% gains on real robots across different VLA variants.
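As a rough illustration of these two RL ingredients (not the paper's exact objective), the sketch below computes GRPO-style group-relative advantages and applies a clipped per-flow-step policy-gradient loss with a FlowScale-like per-step rescaling; `step_logprobs` and `step_scales` are assumed placeholder quantities exposed by the policy's flow head.

```python
# Hedged sketch of FA-GRPO + FlowScale-style reweighting; the actual ProphRL
# losses may differ in detail.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize each rollout's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def fa_grpo_loss(step_logprobs, old_step_logprobs, advantages, step_scales, clip_eps=0.2):
    """Clipped policy-gradient loss over the flow steps of the action head.
    step_logprobs:     [G, T] new per-step log-probs for G rollouts, T flow steps
    old_step_logprobs: [G, T] log-probs under the behavior policy
    advantages:        [G]    group-relative advantages
    step_scales:       [T]    FlowScale-like per-step gradient rescaling
    """
    ratio = torch.exp(step_logprobs - old_step_logprobs)           # [G, T]
    adv = advantages.unsqueeze(1)                                  # [G, 1]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_step = torch.minimum(unclipped, clipped) * step_scales     # rescale each flow step
    return -per_step.mean()
```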

📚 BibTeX

If you find this project or dataset helpful, please consider citing our paper:

@article{zhang2025prophrl,
    title={Reinforcing Action Policies by Prophesying},
    author={Zhang, Jiahui and Huang, Ze and Gu, Chun and Ma, Zipei and Zhang, Li},
    year={2025},
    journal={arXiv preprint arXiv:2511.20633},
}
