Update 2025/11/25: recipes have been moved to a new repository: verl-recipe.
verl is designed to be a modular, extensible framework for post-training (SFT and RL). A recipe is expected to import verl as a library, adding whatever extensions it needs to build its specific RL training pipeline. If verl can't meet a recipe's requirements, please open an issue or PR against verl.
A few incubation recipes are still kept here; they are expected to be officially supported in verl in the future.
- fully_async_policy: fully asynchronous off-policy training with decoupled trainer and rollout.
- transfer_queue: a high-performance asynchronous streaming data management system.
- vla: RL training for VLA models.
- FlowRL: matching reward distributions via flow balance for diverse exploration and generalizable reasoning.
- Logic-RL: a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.
- Seed-Coder: RL training of Seed-Coder boosts performance on competitive programming.
- all-hands/openhands-lm-32b-v0.1: a strong, open coding agent model, trained with multi-turn fine-tuning.
- s3: efficient yet effective search agent training via RL.
- Rec-R1: bridging generative large language models and recommendation systems via reinforcement learning.
- Explore RL Data Scaling: exploring data scaling trends and effects in reinforcement learning from human feedback.
- FIRE: flaming-hot initiation with regular execution sampling for large language models.
- DQO: enhancing multi-step reasoning abilities of language models through direct Q-function optimization.
- ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models.
- cognition-engineering: Test time scaling drives cognition engineering.
- Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning.
- AdaRFT: efficient reinforcement fine-tuning via adaptive curriculum learning.
- critic-rl: LLM critics for code generation.
- self-rewarding-reasoning-LLM: self-rewarding and correction with generative reward models.
- DeepEnlighten: reproduce R1 with social reasoning tasks and analyze key findings.
- MetaSpatial: reinforcing 3D spatial reasoning in VLMs for the Metaverse.
- PURE: credit assignment is the key to successful reinforcement fine-tuning using a process reward model.
- cognitive-behaviors: Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- deepscaler: iterative context scaling with GRPO.
- DAPO: the fully open-source SOTA RL algorithm that beats DeepSeek-R1-zero-32B.
- NoisyRollout: reinforcing visual reasoning with data augmentation.