# Awesome RL for AI Agents

A curated list of recent progress and resources on reinforcement learning for AI agents.
Reinforcement learning (RL) is rapidly becoming a driving force for AI agents that can reason, act, and adapt in the real world. While large language models (LLMs) provide powerful priors for reasoning, they remain static without feedback. RL closes this gap by enabling agents to learn from interactions—through self-reflection, outcome-based rewards, and tool or human feedback.
This repository curates up-to-date resources on RL for AI agents, organized along three main axes:

- **Agentic workflows without training** – prompting strategies that enhance reasoning without fine-tuning.
- **Evaluation and benchmarks** – systematic tests of reasoning, tool use, and automation.
- **RL for single- and multi-agent systems** – advancing self-evolution, efficient tool use, and collaboration.
Tables provide quick overviews, while accompanying descriptions highlight deeper insights.
## Agentic Workflows without Training

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | ToT | NeurIPS | 2023 | Paper | Search over reasoning trees to explore alternatives before committing. |
| Reflexion: Language Agents with Verbal Reinforcement Learning | Reflexion | NeurIPS | 2023 | Paper | Self-critique and retry loops that emulate feedback without training. |
| Self-Refine: Iterative Refinement with Self-Feedback | Self-Refine | NeurIPS | 2023 | Paper | Iterative editing using self-generated feedback to improve outputs. |
| ReAct: Synergizing Reasoning and Acting in Language Models | ReAct | ICLR | 2023 | Paper | Interleaves chain-of-thought with tool calls for grounded reasoning. |
| SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks | SwiftSage | NeurIPS | 2023 | Paper | Splits fast vs. slow planning to balance cost and performance. |
| DynaSaur: Large Language Agents Beyond Predefined Actions | DynaSaur | arXiv | 2024 | Paper | Dynamically extends the agent’s action space beyond fixed tool sets. |
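ReAct’s core pattern, interleaving a reasoning step with a tool call and an observation, can be sketched as a minimal loop. Here `llm` and the `tools` registry are hypothetical stand-ins for a real model call and tool set, not any particular API:

```python
# Minimal ReAct-style loop (sketch). `llm` and `tools` are hypothetical
# stand-ins for a real model call and a tool registry.

def react_loop(llm, tools, question, max_steps=5):
    """Alternate Thought -> Action -> Observation until the model answers."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)          # model emits a Thought or an Action
        transcript += step + "\n"
        if step.startswith("Answer:"):  # model chose to stop and answer
            return step[len("Answer:"):].strip()
        if step.startswith("Action:"):  # e.g. "Action: search[capital of France]"
            name, _, arg = step[len("Action:"):].strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return None  # gave up within the step budget
```

The loop grounds each reasoning step in a fresh observation, which is what distinguishes ReAct from plain chain-of-thought.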
## Agent Evaluation and Benchmarks

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| GAIA: A Benchmark for General AI Assistants | GAIA | arXiv | 2023 | Paper | 466 real-world tasks spanning tool use and reasoning. |
| TaskBench: Benchmarking Large Language Models for Task Automation | TaskBench | EMNLP | 2023 | Paper | Evaluates multi-step automation and tool integration. |
| AgentBench: Evaluating LLMs as Agents | AgentBench | arXiv | 2023 | Paper | 51 scenarios to test agentic behaviors and robustness. |
| ACEBench: Who Wins the Match Point in Tool Usage? | ACEBench | arXiv | 2025 | Paper | Fine-grained tool-use evaluation with step sensitivity. |
| Agent Leaderboard (Galileo) | Galileo LB | HF | 2024 | Dataset | Community leaderboard built around GAIA-style tasks. |
| Agentic Predictor: Performance Prediction for Agentic Workflows | Agentic Predictor | arXiv | 2025 | Paper | Predicts workflow performance for better design-time choices. |
## Agent Training Frameworks

| Title | Short title | Year | 🌟 Stars | Materials | Description |
|---|---|---|---|---|---|
| Agent Lightning: Train ANY AI Agents with Reinforcement Learning | Agent Lightning | 2025 | – | Paper \| Code | Unified MDP formulation; decouples execution from training with scalable workers. |
| SkyRL-v0: Train Real-World Long-Horizon Agents via RL | SkyRL-v0 | 2025 | – | Blog \| Code | Online RL pipeline for long-horizon agent training. |
| OpenManus-RL: Live-Streamed RL Tuning Framework for LLM Agents | OpenManus-RL | 2025 | – | Code \| Dataset | Live-streamed tuning of LLM agents with dataset support. |
| MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems | MASLab | 2025 | – | Paper \| Code | Unified MAS codebase integrating 20+ multi-agent system methods. |
| VerlTool: Towards Holistic Agentic RL with Tool Use | VerlTool | 2025 | – | Paper \| Code | Modular agentic RL with tool use (ARLT); supports asynchronous rollouts. |
| L0: Reinforcement Learning to Become General Agents | L0 | 2025 | – | Paper \| Code | Scalable RL pipeline; NB-Agent scaffold; concurrent worker pool. |
| verl-agent: Extension of veRL for LLM Agents | verl-agent | 2025 | – | Code | Step-independent multi-turn rollouts; memory modules; GiGPO RL algorithm. |
| ART: Agent Reinforcement Trainer | ART | 2025 | – | Code | Python harness for GRPO-based RL; OpenAI API-compatible; notebook examples. |
| AReaL: Ant Reasoning RL for LLMs | AReaL | 2025 | – | Paper \| Code | Fully asynchronous RL system; scales from 1 to 1K GPUs; open and reproducible. |
| Agent-R1: End-to-End RL for Tool-using Agents | Agent-R1 | 2025 | – | – | Multi-tool coordination; process rewards; reward normalization (described in the paper; no public repo found). |
| siiRL: Scalable Infrastructure for Interactive RL Agents | siiRL | 2025 | – | Paper \| Code | Infrastructure and algorithms for large-scale RL training. |
| slime: Self-Improving LLM Agents | slime | 2025 | – | Blog \| Code | Continuous improvement framework for LLM agents. |
| ROLL: RL for Open-Ended LLM Agents | ROLL | 2025 | – | Paper \| Code | Alibaba’s RL framework for multi-task LLM agents. |
| MARTI: Multi-Agent RL with Tool Integration | MARTI | 2025 | – | Code | Tsinghua’s tool-augmented multi-agent RL framework. |
| RL2: Reinforcement Learning Reloaded | RL2 | 2025 | – | Code | Individual repo exploring advanced RL methods. |
| verifiers: Benchmarking LLM Verification | verifiers | 2025 | – | Code | Verification-focused RL experiments. |
| oat: Optimizing Agent Training | oat | 2024 | – | Paper \| Code | NUS / Sea AI’s agent optimization framework. |
| veRL: Volcengine RL Framework | veRL | 2024 | – | Paper \| Code | ByteDance’s general-purpose RL framework. |
| OpenRLHF: Open Reinforcement Learning from Human Feedback | OpenRLHF | 2023 | – | Paper \| Code | Open-source RLHF training platform. |
| TRL: Transformer Reinforcement Learning | TRL | 2019 | – | Code | Hugging Face’s RL library for transformers. |
## RL for Single-Agent Systems

Reinforcement learning methods that focus on individual agents (typically LLMs), enabling them to adapt, self-improve, and use tools effectively.

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| TTRL: Test-Time Reinforcement Learning | TTRL | ICLR | 2025 | Paper | Inference-time RL via majority-vote rewards. |
| ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models | ProRL | ICLR | 2025 | Paper | KL control with reference resets for longer reasoning. |
| RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning | RAGEN / StarPO | ICLR | 2025 | Paper | Multi-turn critic-based RL for evolving behaviors. |
| Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution | Alita | GAIA LB | 2025 | Paper | Modular framework for online self-evolution. |
| Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement | Gödel Agent | ACL / arXiv | 2024–2025 | Paper | Recursive self-modification with reasoning loops. |
| Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents | Darwin GM | arXiv | 2025 | Paper | Darwinian exploration for open-ended agent improvement. |
| SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning | SkyRL-v0 | arXiv / GitHub | 2025 | Blog \| Code | Long-horizon online RL training pipeline. |
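TTRL’s majority-vote reward can be sketched in a few lines: sample several answers for the same prompt, treat the plurality answer as pseudo-ground-truth, and reward each sample for agreeing with it. This is a sketch of the idea, not the paper’s implementation:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Reward each sampled answer 1.0 if it matches the plurality answer, else 0.0.

    Sketch of TTRL-style test-time rewards: the consensus answer serves as a
    pseudo-label, so no ground truth is needed at inference time.
    """
    consensus, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == consensus else 0.0 for a in answers]
```

For example, `majority_vote_rewards(["42", "42", "17"])` rewards the two agreeing samples and not the outlier.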
### RL for Tool Use & Agent Training

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| AGILE: RL Framework for LLM Agents | AGILE | arXiv | 2024 | Paper | Combines RL, memory, and tool use. |
| AgentOptimizer: Functions as Learnable Weights | AgentOptimizer | ICML | 2024 | Paper | Offline training with learnable tool weights. |
| FireAct: Fine-tuning LLM Agents | FireAct | arXiv | 2023 | Paper | Multi-task SFT baseline for RL comparison. |
| Tool-Integrated Reinforcement Learning | ToRL | arXiv | 2025 | Paper | Large-scale tool-integrated RL training. |
| ToolRL: Reward is All Tool Learning Needs | ToolRL | arXiv | 2025 | Paper | Studies reward shaping for tool use. |
| ARTIST: Unified Reasoning & Tools | ARTIST | arXiv | 2025 | Paper | Joint reasoning and tool integration. |
| ZeroTIR: Scaling Law for Tool RL | ZeroTIR | arXiv | 2025 | Paper | Scaling behavior of tool-augmented RL. |
| OTC: Acting Less is Reasoning More | OTC | arXiv | 2025 | Paper | Optimizes efficiency by reducing unnecessary tool calls. |
| WebAgent-R1: End-to-End Multi-Turn RL | WebAgent-R1 | arXiv | 2025 | Paper | Trains web agents in multi-turn environments. |
| GiGPO: Group-in-Group Policy Optimization | GiGPO | arXiv | 2025 | Paper | Hierarchical group-relative credit assignment for agent training. |
| Nemotron-Research-Tool-N1 | Nemotron-Tool-N1 | arXiv | 2025 | Paper | Pure RL setup for tool reasoning. |
| CATP-LLM: Cost-Aware Tool Planning | CATP-LLM | ICCV / arXiv | 2024–2025 | Paper \| Code | Optimizes tool usage under cost constraints. |
| Tool-Star: Multi-Tool RL via Hierarchical Rewards | Tool-Star | arXiv | 2025 | Paper | Reinforcement with structured multi-tool reasoning. |
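OTC’s “acting less is reasoning more” idea reduces, in its simplest form, to shaping the outcome reward with a per-tool-call penalty, so the policy is pushed toward solving tasks with fewer calls. A sketch with an illustrative penalty coefficient (not the paper’s exact reward):

```python
def tool_efficient_reward(task_solved, num_tool_calls, call_cost=0.05):
    """Outcome reward minus a small penalty per tool call.

    Sketch of an OTC-style efficiency-shaped reward; `call_cost` is an
    illustrative coefficient, not a value from the paper.
    """
    outcome = 1.0 if task_solved else 0.0
    return outcome - call_cost * num_tool_calls
```

Under this shaping, two trajectories that both solve the task are ranked by how few tool calls they used.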
### Memory & Knowledge Management

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Memory-R1: RL Memory Manager | Memory-R1 | arXiv | 2025 | Paper | RL-based memory controller for better retrieval. |
| A-MEM: Agentic Memory for LLM Agents | A-MEM | arXiv | 2025 | Paper | Zettelkasten-style dynamic memory management. |
| KnowAgent: Knowledge-Augmented Planning | KnowAgent | NAACL Findings | 2025 | Paper | Planning with structured knowledge bases. |
### Fine-Grained RL & Trajectory Calibration

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| StepTool: Multi-Step Tool Usage | StepTool | CIKM | 2025 | Paper | Step-grained rewards for tool usage. |
| RLTR: Process-Centric Rewards | RLTR | arXiv | 2025 | Paper | Rewards good reasoning trajectories, not just outcomes. |
| SPA-RL: Stepwise Progress Attribution | SPA-RL | arXiv | 2025 | Paper | Credits progress at intermediate steps. |
| STeCa: Step-Level Trajectory Calibration | STeCa | ACL Findings | 2025 | Paper | Calibrates suboptimal steps for better learning. |
| SWEET-RL: Multi-Turn Collaborative RL | SWEET-RL | arXiv | 2025 | Paper | Multi-turn reasoning with a collaborative critic. |
| ATLaS: Critical Step Selection | ATLaS | ACL | 2025 | Paper | Focuses learning on critical reasoning steps. |
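The common thread in this group is redistributing a trajectory-level reward over intermediate steps. A minimal sketch of stepwise progress attribution in the spirit of SPA-RL, where the per-step `progress_scores` are hypothetical estimates (in the paper they are learned):

```python
def attribute_progress(final_reward, progress_scores):
    """Split a trajectory-level reward across steps in proportion to per-step progress.

    Sketch only: `progress_scores` stand in for learned progress estimates;
    SPA-RL learns this attribution rather than taking it as given.
    """
    total = sum(progress_scores)
    if total == 0:                       # no measurable progress: spread evenly
        n = len(progress_scores)
        return [final_reward / n] * n
    return [final_reward * p / total for p in progress_scores]
```

Each step then carries a dense reward signal instead of only the sparse final outcome.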
## Algorithm Families (PPO, DPO, GRPO, etc.)

Summarizes key algorithm families, their objectives, and available implementations.

### PPO family

| Method | Year | Objective | Clip | KL Penalty | Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| PPO | 2017 | Policy gradient | Yes | No | Policy ratio clipping | Reward | Paper | – |
| VAPO | 2025 | Policy gradient | Yes | Adaptive | Adaptive KL penalty + variance control | Reward + variance | Paper | – |
| PF-PPO | 2024 | Policy gradient | Yes | Yes | Policy filtration | Noisy reward | Paper | Code |
| VinePPO | 2024 | Policy gradient | Yes | Yes | Unbiased value estimates | Reward | Paper | Code |
| PSGPO | 2024 | Policy gradient | Yes | Yes | Process supervision | Process reward | Paper | – |
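The “Clip” column for the PPO family refers to the clipped surrogate objective, L = E[min(r·A, clip(r, 1−ε, 1+ε)·A)] with probability ratio r = πθ(a|s)/πθ_old(a|s). A scalar sketch for a single action:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Negative PPO clipped surrogate for one action (a loss to minimize).

    Scalar sketch of the standard objective; real implementations batch this
    over tokens/actions and average.
    """
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    return -min(ratio * advantage, clipped * advantage)
```

The clip keeps the policy update from moving the ratio far outside [1−ε, 1+ε] in a single step, which is PPO’s trust-region surrogate.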
### DPO family

| Method | Year | Objective | Clip | KL Penalty | Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| DPO | 2023 | Preference optimization | No | Yes | Implicit reward | Human preference | Paper | – |
| β-DPO | 2024 | Preference optimization | No | Adaptive | Dynamic KL coefficient | Human preference | Paper | Code |
| SimPO | 2024 | Preference optimization | No | Scaled | Average log-prob as implicit reward | Human preference | Paper | Code |
| IPO | 2024 | Implicit preference | No | No | Preference classification | Rank | Paper | – |
| KTO | 2024 | Kahneman–Tversky optimization | No | Yes | Prospect-theoretic human-aware loss | Binary desirability | Paper | Code |
| ORPO | 2024 | Odds-ratio preference optimization | No | No | Reference-free odds-ratio penalty on the SFT loss | Human preference | Paper | Code |
| Step-DPO | 2024 | Step-wise preference | No | Yes | Step-level supervision | Step preference | Paper | Code |
| LCPO | 2025 | Length-conditioned preference optimization | No | Yes | Length preference | Reward | Paper | – |
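The “Implicit reward” mechanism in the DPO row is the loss −log σ(β[(log πθ(y_w) − log π_ref(y_w)) − (log πθ(y_l) − log π_ref(y_l))]), where y_w is the chosen and y_l the rejected completion. A scalar sketch over sequence log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs.

    Scalar sketch of the standard DPO objective; beta is the usual
    KL-strength hyperparameter.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid)
```

When the policy assigns no more relative probability to the chosen answer than the reference does (margin 0), the loss sits at log 2 and decreases as the preference margin grows.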
### GRPO family

| Method | Year | Objective | Clip | KL Penalty | Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| GRPO | 2024 | Policy gradient (group reward) | Yes | Yes | Group-based relative reward; no value estimates | Group reward | Paper | – |
| DAPO | 2025 | Surrogate of GRPO | Yes | No | Decoupled clip + dynamic sampling | Dynamic group reward | Paper | Code \| Model \| Website |
| GSPO | 2025 | Surrogate of GRPO | Yes | Yes | Sequence-level clipping & reward | Smooth group reward | Paper | – |
| GMPO | 2025 | Surrogate of GRPO | Yes | Yes | Geometric mean of token rewards | Margin-based reward | Paper | Code |
| ProRL | 2025 | Same as GRPO | Yes | Yes | Reference policy reset | Group reward | Paper | Model |
| Posterior-GRPO | 2025 | Same as GRPO | Yes | Yes | Rewards only successful processes | Process reward | Paper | – |
| Dr.GRPO | 2025 | Unbiased GRPO | Yes | Yes | Removes bias in optimization | Group reward | Paper | Code \| Model |
| Step-GRPO | 2025 | Same as GRPO | Yes | Yes | Rule-based reasoning reward | Step-wise reward | Paper | Code \| Model |
| SRPO | 2025 | Same as GRPO | Yes | Yes | Two-stage history resampling | Reward | Paper | Model |
| GRESO | 2025 | Same as GRPO | Yes | Yes | Pre-rollout filtering | Reward | Paper | Code \| Website |
| StarPO | 2025 | Same as GRPO | Yes | Yes | Reasoning-guided multi-turn optimization | Group reward | Paper | Code \| Website |
| GHPO | 2025 | Policy gradient | Yes | Yes | Adaptive prompt refinement | Reward | Paper | Code |
| Skywork R1V2 | 2025 | GRPO (hybrid signal) | Yes | Yes | Selective buffer; multimodal reward | Multimodal | Paper | Code \| Model |
| ASPO | 2025 | GRPO (shaped advantage) | Yes | Yes | Clipped advantage bias | Group reward | Paper | – |
| TreePO | 2025 | Surrogate of GRPO | Yes | Yes | Self-guided rollout | Group reward | Paper | Code \| Model \| Website |
| EDGE-GRPO | 2025 | Same as GRPO | Yes | Yes | Entropy-driven advantage + error correction | Group reward | Paper | Code \| Model |
| DARS | 2025 | Same as GRPO | Yes | No | Multi-stage hardest problems | Group reward | Paper | Code \| Model |
| CHORD | 2025 | Weighted GRPO + SFT | Yes | Yes | Auxiliary supervised loss | Group reward | Paper | Code |
| PAPO | 2025 | Surrogate of GRPO | Yes | Yes | Implicit perception loss | Group reward | Paper | Code \| Model \| Website |
| Pass@k Training | 2025 | Same as GRPO | Yes | Yes | Pass@k metric as reward | Group reward | Paper | Code |
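The defining trick shared by this family, per the “no value estimates” mechanism above, is replacing PPO’s learned value baseline with a group-relative advantage: sample several completions per prompt and z-score each completion’s reward within its group. A sketch:

```python
import math

def group_advantages(rewards):
    """GRPO-style advantages: z-score each reward within its sampled group.

    Sketch of the group-relative baseline; rewards come from G completions
    sampled for the same prompt.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0:                      # identical rewards carry no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

These advantages then plug into a PPO-style clipped surrogate, which is why most rows above keep “Yes” in the Clip column.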
## Cost-Aware Reasoning & Budget-Constrained RL

As agents scale, cost, latency, and efficiency become critical. These works tackle budget-aware reasoning, token efficiency, and cost-sensitive planning.

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Cost-Augmented Monte Carlo Tree Search for LLM-Assisted Planning | CATS | arXiv | 2025 | Paper | Incorporates cost into MCTS for planning under constraints. |
| Token-Budget-Aware LLM Reasoning | TALE | arXiv | 2024 | Paper | Allocates a token budget optimally across reasoning steps. |
| FrugalGPT: Using LLMs While Reducing Cost | FrugalGPT | arXiv | 2023 | Paper | Early exploration of cost minimization by routing queries. |
| Efficient Contextual LLM Cascades via Budget-Constrained Policy Learning | TREACLE | arXiv | 2024 | Paper | Learns cascades balancing budget and accuracy. |
| BudgetMLAgent: Cost-Effective Multi-Agent System for ML Automation | BudgetMLAgent | AIMLSystems | 2025 | – | Multi-agent framework designed for cost efficiency. |
| The Cost of Dynamic Reasoning: A Systems View | Systems Cost | arXiv | 2025 | Paper | Measures latency, energy, and financial cost of agent reasoning. |
| Budget-Aware Evaluation of LLM Reasoning Strategies | BudgetEval | EMNLP | 2024 | Paper | Proposes an evaluation framework accounting for budget limits. |
| LLM Cascades with Mixture of Thoughts for Cost-Efficient Reasoning | MoT Cascade | ICLR / arXiv | 2024 | Paper \| Code | Uses “mixture of thoughts” cascades for efficiency. |
| BudgetThinker: Budget-Aware LLM Reasoning with Control Tokens | BudgetThinker | arXiv | 2025 | Paper | Introduces control tokens to manage budget during inference. |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective DPO | MODPO | arXiv | 2023–2024 | Paper | Extends DPO with multi-objective alignment. |
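The cascade idea behind FrugalGPT and TREACLE is simple to state: query models from cheapest to most expensive and stop at the first answer whose confidence clears a threshold. A sketch in which the model callables and the threshold are illustrative, not any system’s real API:

```python
def cascade(models, prompt, threshold=0.8):
    """Try models cheapest-first; return the first sufficiently confident answer.

    Sketch of a FrugalGPT-style cascade. `models` is an ordered list of
    callables returning (answer, confidence); names and threshold are
    illustrative assumptions.
    """
    last_answer = None
    for ask in models:
        answer, confidence = ask(prompt)
        last_answer = answer
        if confidence >= threshold:
            return answer
    return last_answer  # fall back to the most capable model's answer
```

Most queries terminate at cheap models, so the expected cost per query drops while accuracy on hard queries is preserved by the escalation path.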
## RL for Multi-Agent Systems

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| OWL: Optimized Workforce Learning for Real-World Automation | OWL | arXiv | 2025 | Paper | Planner + workers |
| Profile-Aware Maneuvering for GAIA by AWorld | AWorld | NeurIPS | 2024 | Paper | Guard agents |
| Plan-over-Graph: Towards Parallelable Agent Schedule | Plan-over-Graph | arXiv | 2025 | Paper | Graph scheduling |
| LLM-Based Multi-Agent Reinforcement Learning: Directions | MARL Survey | arXiv | 2024 | Paper | Survey |
| Self-Resource Allocation in Multi-Agent LLM Systems | Self-ResAlloc | arXiv | 2025 | Paper | Planner vs. orchestrator |
| MASLab (duplicate listing) | MASLab | arXiv | 2025 | Paper | Unified MAS APIs |
| Dynamic Speculative Agent Planning | DSP | arXiv | 2025 | Paper | Lossless planning acceleration via dynamic speculation; trades off latency vs. cost; no pre-deployment setup. |
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| ACC-Collab: Actor-Critic for Multi-Agent Collaboration | ACC-Collab | ICLR | 2025 | Paper | Joint actor-critic |
| Chain of Agents: Collaborating on Long-Context Tasks | Chain of Agents | arXiv | 2024 | Paper | Long-context chains |
| Scaling LLM-Based Multi-Agent Collaboration | Scaling MAC | arXiv | 2024 | Paper | Scaling study |
| MMAC-Copilot: Multi-Modal Agent Collaboration | MMAC-Copilot | arXiv | 2024 | Paper | Multi-modal collaboration |
| CORY: Sequential Cooperative Multi-Agent Reinforcement Learning | CORY | NeurIPS | 2024 | Paper \| Code | Role-swapping PPO |
| OpenManus-RL (duplicate listing) | OpenManus-RL | GitHub | 2025 | Code | Live-streamed tuning |
| MAPoRL: Multi-Agent Post-Co-Training with Reinforcement Learning | MAPoRL | arXiv | 2025 | Paper | Co-refine + verifier |
## Embodied Agents & World Models

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| An Embodied Generalist Agent in 3D World | LEO | ICML | 2024 | Paper | 3D embodied agent |
| DreamerV3: Mastering Diverse Domains through World Models | DreamerV3 | arXiv | 2023 | Paper | World-model RL |
| World-Model-Augmented Web Agent | WMA Web Agent | ICLR / arXiv | 2025 | Paper | Simulative web agent |
| WorldCoder: Model-Based LLM Agent | WorldCoder | arXiv | 2024 | Paper | Code-based world model |
| WALL-E 2.0: World Alignment via Neuro-Symbolic Learning | WALL-E 2.0 | arXiv | 2025 | Paper | Neuro-symbolic alignment |
| WorldLLM: Curiosity-Driven World Modeling | WorldLLM | arXiv | 2025 | Paper | Curiosity + world model |
| SimuRA: Simulative Reasoning Architecture with World Model | SimuRA | arXiv | 2025 | Paper | Mental simulation |
## Task: Search & Research Agents

| Method | Category | Base LLM | Link | Resource |
|---|---|---|---|---|
| DeepRetrieval | External | Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct | Paper | Code |
| Search-R1 | External | Qwen2.5-3B/7B-Base/Instruct | Paper | Code |
| R1-Searcher | External | Qwen2.5-7B, Llama3.1-8B-Instruct | Paper | Code |
| WebThinker | External | QwQ-32B, DeepSeek-R1-Distilled-Qwen-7B/14B/32B | Paper | Code |
| WebSailor | External | Qwen2.5-3B/7B/32B/72B | Paper | Code |
| SSRL | Internal | Qwen2.5-1.5B/3B/7B/14B/32B/72B-Instruct, Llama-3.2-1B/8B-Instruct | Paper | Code |
| OpenAI Deep Research | External | OpenAI Models | Blog | Website |
| Perplexity DeepResearch | External | – | Blog | Website |
## Task: Code Agents

| Method | RL Reward Type | Base LLM | Link | Resource |
|---|---|---|---|---|
| AceCoder | Outcome | Qwen2.5-Coder-7B-Base/Instruct | Paper | Code |
| DeepCoder-14B | Outcome | DeepSeek-R1-Distilled-Qwen-14B | Blog | Code |
| CodeBoost | Process | Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct | Paper | Code |
| R1-Code-Interpreter | Outcome | Qwen2.5-7B/14B-Instruct-1M | Paper | Code |
| SWE-RL | Outcome | Llama-3.3-70B-Instruct | Paper | Code |
| Satori-SWE | Outcome | Qwen-2.5-Math-7B | Paper | Code |
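An “Outcome” reward in this setting typically means: run the generated program against unit tests and reward pass/fail. A toy version, assuming the model emits a `solve` function (real pipelines sandbox this execution, which the sketch does not):

```python
def outcome_reward(program_src, test_cases):
    """Return 1.0 if the generated `solve` function passes all tests, else 0.0.

    Toy illustration of an outcome (pass/fail) reward: it executes untrusted
    code directly, which real pipelines would sandbox. The `solve` entry
    point and (args, expected) test format are assumptions for the sketch.
    """
    namespace = {}
    try:
        exec(program_src, namespace)
        solve = namespace["solve"]
        for args, expected in test_cases:
            if solve(*args) != expected:
                return 0.0
    except Exception:
        return 0.0  # crashes and syntax errors earn no reward
    return 1.0
```

Process-type rewards (as in CodeBoost) instead score intermediate signals such as compilation or partial test progress rather than a single terminal bit.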
## Task: Mathematical Agents
## Surveys & Position Papers

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| The Landscape of Agentic Reinforcement Learning for LLMs: A Survey | ARL-Surv | arXiv | 2025 | Paper | Comprehensive agentic RL landscape. |
| Budget-Aware Evaluation of LLM Reasoning Strategies | BudgetEval | EMNLP | 2024 | Paper | Budget-aware reasoning evaluation. |
| Alignment & Preference Optimization in LLM Agents | Align-Pos | arXiv | 2023 | Paper | Alignment and multi-objective methods. |
| A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence | SE Survey | arXiv | 2025 | Paper | Taxonomy and methods for self-evolving agents. |
## Concluding Remarks

Reinforcement learning for AI agents is evolving rapidly, driving breakthroughs in reasoning, autonomy, and collaboration. As new methods and frameworks emerge, staying current is essential for both research and practical deployment. This curated list aims to help the community navigate this dynamic landscape and contribute to it.

💡 Pull requests are welcome to keep this list up to date!

Related lists:

- https://github.com/xhyumiracle/Awesome-AgenticLLM-RL-Papers
- https://github.com/0russwest0/Awesome-Agent-RL
- https://github.com/thinkwee/AgentsMeetRL

🌟 If you find this resource helpful, star the repo and share your favorite RL agent papers or frameworks! Let's build the future of intelligent agents together.