Update 2025/11/25: recipes have been moved to a new repository: verl-recipe.
verl is designed to be a modular, extensible framework for post-training (SFT and RL). A recipe is expected to import verl as a library, adding whatever extensions it needs to build its specific RL training pipeline. If verl can't meet a recipe's requirements, please open an issue or PR against verl.
A few incubation recipes are still kept here; they are expected to be officially supported in verl in the future.
- fully_async_policy: fully asynchronous off-policy training with decoupled trainer and rollout.
- transfer_queue: a high-performance asynchronous streaming data management system.
- vla: RL training for VLA models.
- FlowRL: matching reward distributions via flow balance for diverse exploration and generalizable reasoning.
- Logic-RL: a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.
- Seed-Coder: RL training of Seed-Coder boosts performance on competitive programming.
- all-hands/openhands-lm-32b-v0.1: a strong, open coding agent model, trained with multi-turn fine-tuning.
- s3: efficient yet effective search agent training via RL.
- Rec-R1: bridging generative large language models and recommendation systems via reinforcement learning.
- Explore RL Data Scaling: exploring data scaling trends and effects in reinforcement learning from human feedback.
- FIRE: flaming-hot initiation with regular execution sampling for large language models.
- DQO: enhancing multi-step reasoning abilities of language models through direct Q-function optimization.
- ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models.
- cognition-engineering: Test time scaling drives cognition engineering.
- Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning.
- AdaRFT: efficient reinforcement fine-tuning via adaptive curriculum learning.
- critic-rl: LLM critics for code generation.
- self-rewarding-reasoning-LLM: self-rewarding and correction with generative reward models.
- DeepEnlighten: reproduce R1 with social reasoning tasks and analyze key findings.
- MetaSpatial: reinforcing 3D spatial reasoning in VLMs for the Metaverse.
- PURE: credit assignment is the key to successful reinforcement fine-tuning using a process reward model.
- cognitive-behaviors: Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- deepscaler: iterative context scaling with GRPO.
- DAPO: the fully open-source SOTA RL algorithm that beats DeepSeek-R1-zero-32B.
- NoisyRollout: reinforcing visual reasoning with data augmentation.