# Awesome RL for AI Agents

A curated list of recent progress and resources on reinforcement learning for AI agents.
Reinforcement learning (RL) is rapidly becoming a driving force for AI agents that can reason, act, and adapt in the real world. While large language models (LLMs) provide powerful priors for reasoning, they remain static without feedback. RL closes this gap by enabling agents to learn from interactions—through self-reflection, outcome-based rewards, and tool or human feedback.
This repository curates up-to-date resources on RL for AI agents, organized along three main axes:

- **Agentic workflows without training** – prompting strategies that enhance reasoning without fine-tuning.
- **Evaluation and benchmarks** – systematic tests of reasoning, tool use, and automation.
- **RL for single- and multi-agent systems** – advancing self-evolution, efficient tool use, and collaboration.
Tables provide quick overviews, while accompanying descriptions highlight deeper insights.
## Agentic Workflows without Training

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | ToT | NeurIPS | 2023 | Paper | Search over reasoning trees to explore alternatives before committing. |
| Reflexion: Language Agents with Verbal Reinforcement Learning | Reflexion | NeurIPS | 2023 | Paper | Self-critique and retry loops that emulate feedback without training. |
| Self-Refine: Iterative Refinement with Self-Feedback | Self-Refine | NeurIPS | 2023 | Paper | Iterative editing using self-generated feedback to improve outputs. |
| ReAct: Synergizing Reasoning and Acting in Language Models | ReAct | ICLR | 2023 | Paper | Interleaves chain-of-thought with tool calls for grounded reasoning. |
| SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks | SwiftSage | NeurIPS | 2023 | Paper | Splits fast vs. slow planning to balance cost and performance. |
| DynaSaur: Large Language Agents Beyond Predefined Actions | DynaSaur | arXiv | 2024 | Paper | Dynamically extends the agent’s action space beyond fixed tool sets. |
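ReAct’s core pattern, interleaving a reasoning step with a tool call and an observation, can be sketched as a minimal loop. Here `llm` and the `tools` registry are hypothetical stand-ins for a real model call and tool set, not any particular API:

```python
# Minimal ReAct-style loop (sketch). `llm` and `tools` are hypothetical
# stand-ins for a real model call and a tool registry.

def react_loop(llm, tools, question, max_steps=5):
    """Alternate Thought -> Action -> Observation until the model answers."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)          # model emits a Thought or an Action
        transcript += step + "\n"
        if step.startswith("Answer:"):  # model chose to stop and answer
            return step[len("Answer:"):].strip()
        if step.startswith("Action:"):  # e.g. "Action: search[capital of France]"
            name, _, arg = step[len("Action:"):].strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return None  # gave up within the step budget
```

The loop grounds each reasoning step in a fresh observation, which is what distinguishes ReAct from plain chain-of-thought.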
## Agent Evaluation and Benchmarks

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| GAIA: A Benchmark for General AI Assistants | GAIA | arXiv | 2023 | Paper | 466 real-world tasks spanning tool use and reasoning. |
| TaskBench: Benchmarking Large Language Models for Task Automation | TaskBench | EMNLP | 2023 | Paper | Evaluates multi-step automation and tool integration. |
| AgentBench: Evaluating LLMs as Agents | AgentBench | arXiv | 2023 | Paper | 51 scenarios to test agentic behaviors and robustness. |
| ACEBench: Who Wins the Match Point in Tool Usage? | ACEBench | arXiv | 2025 | Paper | Fine-grained tool-use evaluation with step sensitivity. |
| Agent Leaderboard (Galileo) | Galileo LB | HF | 2024 | Dataset | Community leaderboard built around GAIA-style tasks. |
| Agentic Predictor: Performance Prediction for Agentic Workflows | Agentic Predictor | arXiv | 2025 | Paper | Predicts workflow performance for better design-time choices. |
## Agent Training Frameworks

| Title | Short title | Year | 🌟 Stars | Materials | Description |
|---|---|---|---|---|---|
| Agent Lightning: Train ANY AI Agents with Reinforcement Learning | Agent Lightning | 2025 | – | Paper \| Code | Unified MDP formulation; decouples execution from training with scalable workers. |
| SkyRL-v0: Train Real-World Long-Horizon Agents via RL | SkyRL-v0 | 2025 | – | Blog \| Code | Online RL pipeline for long-horizon agent training. |
| OpenManus-RL: Live-Streamed RL Tuning Framework for LLM Agents | OpenManus-RL | 2025 | – | Code \| Dataset | Live-streamed tuning of LLM agents with dataset support. |
| MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems | MASLab | 2025 | – | Paper \| Code | Unified MAS codebase integrating 20+ multi-agent system methods. |
| VerlTool: Towards Holistic Agentic RL with Tool Use | VerlTool | 2025 | – | Paper \| Code | Modular agentic RL with tool use (ARLT); supports asynchronous rollouts. |
| L0: Reinforcement Learning to Become General Agents | L0 | 2025 | – | Paper \| Code | Scalable RL pipeline; NB-Agent scaffold; concurrent worker pool. |
| verl-agent: Extension of veRL for LLM Agents | verl-agent | 2025 | – | Code | Step-independent multi-turn rollouts; memory modules; GiGPO RL algorithm. |
| ART: Agent Reinforcement Trainer | ART | 2025 | – | Code | Python harness for GRPO-based RL; OpenAI API-compatible; notebook examples. |
| AReaL: Ant Reasoning RL for LLMs | AReaL | 2025 | – | Paper \| Code | Fully asynchronous RL system; scales from 1 to 1K GPUs; open and reproducible. |
| Agent-R1: End-to-End RL for Tool-using Agents | Agent-R1 | 2025 | – | – | Multi-tool coordination; process rewards; reward normalization (described in the paper; no public repo found). |
| siiRL: Scalable Infrastructure for Interactive RL Agents | siiRL | 2025 | – | Paper \| Code | Infrastructure and algorithms for large-scale RL training. |
| slime: Self-Improving LLM Agents | slime | 2025 | – | Blog \| Code | Continuous improvement framework for LLM agents. |
| ROLL: RL for Open-Ended LLM Agents | ROLL | 2025 | – | Paper \| Code | Alibaba’s RL framework for multi-task LLM agents. |
| MARTI: Multi-Agent RL with Tool Integration | MARTI | 2025 | – | Code | Tsinghua’s tool-augmented multi-agent RL framework. |
| RL2: Reinforcement Learning Reloaded | RL2 | 2025 | – | Code | Individual repo exploring advanced RL methods. |
| verifiers: Benchmarking LLM Verification | verifiers | 2025 | – | Code | Verification-focused RL experiments. |
| oat: Optimizing Agent Training | oat | 2024 | – | Paper \| Code | NUS / Sea AI’s agent optimization framework. |
| veRL: Volcengine RL Framework | veRL | 2024 | – | Paper \| Code | ByteDance’s general-purpose RL framework. |
| OpenRLHF: Open Reinforcement Learning from Human Feedback | OpenRLHF | 2023 | – | Paper \| Code | Open-source RLHF training platform. |
| TRL: Transformer Reinforcement Learning | TRL | 2019 | – | Code | Hugging Face’s RL library for transformers. |
## RL for Single-Agent Systems

Reinforcement learning methods that focus on individual agents (typically LLMs), enabling them to adapt, self-improve, and use tools effectively.

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| TTRL: Test-Time Reinforcement Learning | TTRL | ICLR | 2025 | Paper | Inference-time RL via majority-vote rewards. |
| ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models | ProRL | ICLR | 2025 | Paper | KL control with reference resets for longer reasoning. |
| RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning | RAGEN / StarPO | ICLR | 2025 | Paper | Multi-turn critic-based RL for evolving behaviors. |
| Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution | Alita | GAIA LB | 2025 | Paper | Modular framework for online self-evolution. |
| Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement | Gödel Agent | ACL / arXiv | 2024–2025 | Paper | Recursive self-modification with reasoning loops. |
| Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents | Darwin GM | arXiv | 2025 | Paper | Darwinian exploration for open-ended agent improvement. |
| SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning | SkyRL-v0 | arXiv / GitHub | 2025 | Blog \| Code | Long-horizon online RL training pipeline. |
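TTRL’s majority-vote reward can be sketched in a few lines: sample several answers for the same prompt, treat the plurality answer as pseudo-ground-truth, and reward each sample for agreeing with it. This is a sketch of the idea, not the paper’s implementation:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Reward each sampled answer 1.0 if it matches the plurality answer, else 0.0.

    Sketch of TTRL-style test-time rewards: the consensus answer serves as a
    pseudo-label, so no ground truth is needed at inference time.
    """
    consensus, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == consensus else 0.0 for a in answers]
```

For example, `majority_vote_rewards(["42", "42", "17"])` rewards the two agreeing samples and not the outlier.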
### RL for Tool Use & Agent Training

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| AGILE: RL Framework for LLM Agents | AGILE | arXiv | 2024 | Paper | Combines RL, memory, and tool use. |
| AgentOptimizer: Functions as Learnable Weights | AgentOptimizer | ICML | 2024 | Paper | Offline training with learnable tool weights. |
| FireAct: Fine-tuning LLM Agents | FireAct | arXiv | 2023 | Paper | Multi-task SFT baseline for RL comparison. |
| Tool-Integrated Reinforcement Learning | ToRL | arXiv | 2025 | Paper | Large-scale tool-integrated RL training. |
| ToolRL: Reward is All Tool Learning Needs | ToolRL | arXiv | 2025 | Paper | Studies reward shaping for tool use. |
| ARTIST: Unified Reasoning & Tools | ARTIST | arXiv | 2025 | Paper | Joint reasoning and tool integration. |
| ZeroTIR: Scaling Law for Tool RL | ZeroTIR | arXiv | 2025 | Paper | Scaling behavior of tool-augmented RL. |
| OTC: Acting Less is Reasoning More | OTC | arXiv | 2025 | Paper | Optimizes efficiency by reducing unnecessary tool calls. |
| WebAgent-R1: End-to-End Multi-Turn RL | WebAgent-R1 | arXiv | 2025 | Paper | Trains web agents in multi-turn environments. |
| GiGPO: Group-in-Group Policy Optimization | GiGPO | arXiv | 2025 | Paper | Hierarchical group-relative credit assignment for agent training. |
| Nemotron-Research-Tool-N1 | Nemotron-Tool-N1 | arXiv | 2025 | Paper | Pure RL setup for tool reasoning. |
| CATP-LLM: Cost-Aware Tool Planning | CATP-LLM | ICCV / arXiv | 2024–2025 | Paper \| Code | Optimizes tool usage under cost constraints. |
| Tool-Star: Multi-Tool RL via Hierarchical Rewards | Tool-Star | arXiv | 2025 | Paper | Reinforcement with structured multi-tool reasoning. |
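OTC’s “acting less is reasoning more” idea reduces, in its simplest form, to shaping the outcome reward with a per-tool-call penalty, so the policy is pushed toward solving tasks with fewer calls. A sketch with an illustrative penalty coefficient (not the paper’s exact reward):

```python
def tool_efficient_reward(task_solved, num_tool_calls, call_cost=0.05):
    """Outcome reward minus a small penalty per tool call.

    Sketch of an OTC-style efficiency-shaped reward; `call_cost` is an
    illustrative coefficient, not a value from the paper.
    """
    outcome = 1.0 if task_solved else 0.0
    return outcome - call_cost * num_tool_calls
```

Under this shaping, two trajectories that both solve the task are ranked by how few tool calls they used.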
### Memory & Knowledge Management

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Memory-R1: RL Memory Manager | Memory-R1 | arXiv | 2025 | Paper | RL-based memory controller for better retrieval. |
| A-MEM: Agentic Memory for LLM Agents | A-MEM | arXiv | 2025 | Paper | Zettelkasten-style dynamic memory management. |
| KnowAgent: Knowledge-Augmented Planning | KnowAgent | NAACL Findings | 2025 | Paper | Planning with structured knowledge bases. |
### Fine-Grained RL & Trajectory Calibration

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| StepTool: Multi-Step Tool Usage | StepTool | CIKM | 2025 | Paper | Step-grained rewards for tool usage. |
| RLTR: Process-Centric Rewards | RLTR | arXiv | 2025 | Paper | Rewards good reasoning trajectories, not just outcomes. |
| SPA-RL: Stepwise Progress Attribution | SPA-RL | arXiv | 2025 | Paper | Credits progress at intermediate steps. |
| STeCa: Step-Level Trajectory Calibration | STeCa | ACL Findings | 2025 | Paper | Calibrates suboptimal steps for better learning. |
| SWEET-RL: Multi-Turn Collaborative RL | SWEET-RL | arXiv | 2025 | Paper | Multi-turn reasoning with a collaborative critic. |
| ATLaS: Critical Step Selection | ATLaS | ACL | 2025 | Paper | Focuses learning on critical reasoning steps. |
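The common thread in this group is redistributing a trajectory-level reward over intermediate steps. A minimal sketch of stepwise progress attribution in the spirit of SPA-RL, where the per-step `progress_scores` are hypothetical estimates (in the paper they are learned):

```python
def attribute_progress(final_reward, progress_scores):
    """Split a trajectory-level reward across steps in proportion to per-step progress.

    Sketch only: `progress_scores` stand in for learned progress estimates;
    SPA-RL learns this attribution rather than taking it as given.
    """
    total = sum(progress_scores)
    if total == 0:                       # no measurable progress: spread evenly
        n = len(progress_scores)
        return [final_reward / n] * n
    return [final_reward * p / total for p in progress_scores]
```

Each step then carries a dense reward signal instead of only the sparse final outcome.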
## Algorithm Families (PPO, DPO, GRPO, etc.)

Summarizes key algorithm families, their objectives, and available implementations.

### PPO family

| Method | Year | Objective | Clip | KL Penalty | Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| PPO | 2017 | Policy gradient | Yes | No | Policy ratio clipping | Reward | Paper | – |
| VAPO | 2025 | Policy gradient | Yes | Adaptive | Adaptive KL penalty + variance control | Reward + variance | Paper | – |
| PF-PPO | 2024 | Policy gradient | Yes | Yes | Policy filtration | Noisy reward | Paper | Code |
| VinePPO | 2024 | Policy gradient | Yes | Yes | Unbiased value estimates | Reward | Paper | Code |
| PSGPO | 2024 | Policy gradient | Yes | Yes | Process supervision | Process reward | Paper | – |
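The “Clip” column for the PPO family refers to the clipped surrogate objective, L = E[min(r·A, clip(r, 1−ε, 1+ε)·A)] with probability ratio r = πθ(a|s)/πθ_old(a|s). A scalar sketch for a single action:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Negative PPO clipped surrogate for one action (a loss to minimize).

    Scalar sketch of the standard objective; real implementations batch this
    over tokens/actions and average.
    """
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    return -min(ratio * advantage, clipped * advantage)
```

The clip keeps the policy update from moving the ratio far outside [1−ε, 1+ε] in a single step, which is PPO’s trust-region surrogate.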
### DPO family

| Method | Year | Objective | Clip | KL Penalty | Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| DPO | 2023 | Preference optimization | No | Yes | Implicit reward | Human preference | Paper | – |
| β-DPO | 2024 | Preference optimization | No | Adaptive | Dynamic KL coefficient | Human preference | Paper | Code |
| SimPO | 2024 | Preference optimization | No | Scaled | Average log-prob as implicit reward | Human preference | Paper | Code |
| IPO | 2024 | Implicit preference | No | No | Preference classification | Rank | Paper | – |
| KTO | 2024 | Kahneman–Tversky optimization | No | Yes | Prospect-theoretic human-aware loss | Binary desirability | Paper | Code |
| ORPO | 2024 | Odds-ratio preference optimization | No | No | Reference-free odds-ratio penalty on the SFT loss | Human preference | Paper | Code |
| Step-DPO | 2024 | Step-wise preference | No | Yes | Step-level supervision | Step preference | Paper | Code |
| LCPO | 2025 | Length-conditioned preference optimization | No | Yes | Length preference | Reward | Paper | – |
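The “Implicit reward” mechanism in the DPO row is the loss −log σ(β[(log πθ(y_w) − log π_ref(y_w)) − (log πθ(y_l) − log π_ref(y_l))]), where y_w is the chosen and y_l the rejected completion. A scalar sketch over sequence log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs.

    Scalar sketch of the standard DPO objective; beta is the usual
    KL-strength hyperparameter.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid)
```

When the policy assigns no more relative probability to the chosen answer than the reference does (margin 0), the loss sits at log 2 and decreases as the preference margin grows.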
### GRPO family

| Method | Year | Objective | Clip | KL Penalty | Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| GRPO | 2024 | Policy gradient (group reward) | Yes | Yes | Group-based relative reward; no value estimates | Group reward | Paper | – |
| DAPO | 2025 | Surrogate of GRPO | Yes | No | Decoupled clip + dynamic sampling | Dynamic group reward | Paper | Code \| Model \| Website |
| GSPO | 2025 | Surrogate of GRPO | Yes | Yes | Sequence-level clipping & reward | Smooth group reward | Paper | – |
| GMPO | 2025 | Surrogate of GRPO | Yes | Yes | Geometric mean of token rewards | Margin-based reward | Paper | Code |
| ProRL | 2025 | Same as GRPO | Yes | Yes | Reference policy reset | Group reward | Paper | Model |
| Posterior-GRPO | 2025 | Same as GRPO | Yes | Yes | Rewards only successful processes | Process reward | Paper | – |
| Dr.GRPO | 2025 | Unbiased GRPO | Yes | Yes | Removes bias in optimization | Group reward | Paper | Code \| Model |
| Step-GRPO | 2025 | Same as GRPO | Yes | Yes | Rule-based reasoning reward | Step-wise reward | Paper | Code \| Model |
| SRPO | 2025 | Same as GRPO | Yes | Yes | Two-stage history resampling | Reward | Paper | Model |
| GRESO | 2025 | Same as GRPO | Yes | Yes | Pre-rollout filtering | Reward | Paper | Code \| Website |
| StarPO | 2025 | Same as GRPO | Yes | Yes | Reasoning-guided multi-turn optimization | Group reward | Paper | Code \| Website |
| GHPO | 2025 | Policy gradient | Yes | Yes | Adaptive prompt refinement | Reward | Paper | Code |
| Skywork R1V2 | 2025 | GRPO (hybrid signal) | Yes | Yes | Selective buffer; multimodal reward | Multimodal | Paper | Code \| Model |
| ASPO | 2025 | GRPO (shaped advantage) | Yes | Yes | Clipped advantage bias | Group reward | Paper | – |
| TreePO | 2025 | Surrogate of GRPO | Yes | Yes | Self-guided rollout | Group reward | Paper | Code \| Model \| Website |
| EDGE-GRPO | 2025 | Same as GRPO | Yes | Yes | Entropy-driven advantage + error correction | Group reward | Paper | Code \| Model |
| DARS | 2025 | Same as GRPO | Yes | No | Multi-stage hardest problems | Group reward | Paper | Code \| Model |
| CHORD | 2025 | Weighted GRPO + SFT | Yes | Yes | Auxiliary supervised loss | Group reward | Paper | Code |
| PAPO | 2025 | Surrogate of GRPO | Yes | Yes | Implicit perception loss | Group reward | Paper | Code \| Model \| Website |
| Pass@k Training | 2025 | Same as GRPO | Yes | Yes | Pass@k metric as reward | Group reward | Paper | Code |
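The defining trick shared by this family, per the “no value estimates” mechanism above, is replacing PPO’s learned value baseline with a group-relative advantage: sample several completions per prompt and z-score each completion’s reward within its group. A sketch:

```python
import math

def group_advantages(rewards):
    """GRPO-style advantages: z-score each reward within its sampled group.

    Sketch of the group-relative baseline; rewards come from G completions
    sampled for the same prompt.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0:                      # identical rewards carry no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

These advantages then plug into a PPO-style clipped surrogate, which is why most rows above keep “Yes” in the Clip column.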
## Cost-Aware Reasoning & Budget-Constrained RL

As agents scale, cost, latency, and efficiency become critical. These works tackle budget-aware reasoning, token efficiency, and cost-sensitive planning.

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Cost-Augmented Monte Carlo Tree Search for LLM-Assisted Planning | CATS | arXiv | 2025 | Paper | Incorporates cost into MCTS for planning under constraints. |
| Token-Budget-Aware LLM Reasoning | TALE | arXiv | 2024 | Paper | Allocates a token budget optimally across reasoning steps. |
| FrugalGPT: Using LLMs While Reducing Cost | FrugalGPT | arXiv | 2023 | Paper | Early exploration of cost minimization by routing queries. |
| Efficient Contextual LLM Cascades via Budget-Constrained Policy Learning | TREACLE | arXiv | 2024 | Paper | Learns cascades balancing budget and accuracy. |
| BudgetMLAgent: Cost-Effective Multi-Agent System for ML Automation | BudgetMLAgent | AIMLSystems | 2025 | – | Multi-agent framework designed for cost efficiency. |
| The Cost of Dynamic Reasoning: A Systems View | Systems Cost | arXiv | 2025 | Paper | Measures latency, energy, and financial cost of agent reasoning. |
| Budget-Aware Evaluation of LLM Reasoning Strategies | BudgetEval | EMNLP | 2024 | Paper | Proposes an evaluation framework accounting for budget limits. |
| LLM Cascades with Mixture of Thoughts for Cost-Efficient Reasoning | MoT Cascade | ICLR / arXiv | 2024 | Paper \| Code | Uses “mixture of thoughts” cascades for efficiency. |
| BudgetThinker: Budget-Aware LLM Reasoning with Control Tokens | BudgetThinker | arXiv | 2025 | Paper | Introduces control tokens to manage budget during inference. |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective DPO | MODPO | arXiv | 2023–2024 | Paper | Extends DPO with multi-objective alignment. |
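The cascade idea behind FrugalGPT and TREACLE is simple to state: query models from cheapest to most expensive and stop at the first answer whose confidence clears a threshold. A sketch in which the model callables and the threshold are illustrative, not any system’s real API:

```python
def cascade(models, prompt, threshold=0.8):
    """Try models cheapest-first; return the first sufficiently confident answer.

    Sketch of a FrugalGPT-style cascade. `models` is an ordered list of
    callables returning (answer, confidence); names and threshold are
    illustrative assumptions.
    """
    last_answer = None
    for ask in models:
        answer, confidence = ask(prompt)
        last_answer = answer
        if confidence >= threshold:
            return answer
    return last_answer  # fall back to the most capable model's answer
```

Most queries terminate at cheap models, so the expected cost per query drops while accuracy on hard queries is preserved by the escalation path.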
## RL for Multi-Agent Systems

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| OWL: Optimized Workforce Learning for Real-World Automation | OWL | arXiv | 2025 | Paper | Planner + workers |
| Profile-Aware Maneuvering for GAIA by AWorld | AWorld | NeurIPS | 2024 | Paper | Guard agents |
| Plan-over-Graph: Towards Parallelable Agent Schedule | Plan-over-Graph | arXiv | 2025 | Paper | Graph scheduling |
| LLM-Based Multi-Agent Reinforcement Learning: Directions | MARL Survey | arXiv | 2024 | Paper | Survey |
| Self-Resource Allocation in Multi-Agent LLM Systems | Self-ResAlloc | arXiv | 2025 | Paper | Planner vs. orchestrator |
| MASLab (duplicate listing) | MASLab | arXiv | 2025 | Paper | Unified MAS APIs |
| Dynamic Speculative Agent Planning | DSP | arXiv | 2025 | Paper | Lossless planning acceleration via dynamic speculation; trades off latency vs. cost; no pre-deployment setup. |
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| ACC-Collab: Actor-Critic for Multi-Agent Collaboration | ACC-Collab | ICLR | 2025 | Paper | Joint actor-critic |
| Chain of Agents: Collaborating on Long-Context Tasks | Chain of Agents | arXiv | 2024 | Paper | Long-context chains |
| Scaling LLM-Based Multi-Agent Collaboration | Scaling MAC | arXiv | 2024 | Paper | Scaling study |
| MMAC-Copilot: Multi-Modal Agent Collaboration | MMAC-Copilot | arXiv | 2024 | Paper | Multi-modal collaboration |
| CORY: Sequential Cooperative Multi-Agent Reinforcement Learning | CORY | NeurIPS | 2024 | Paper \| Code | Role-swapping PPO |
| OpenManus-RL (duplicate listing) | OpenManus-RL | GitHub | 2025 | Code | Live-streamed tuning |
| MAPoRL: Multi-Agent Post-Co-Training with Reinforcement Learning | MAPoRL | arXiv | 2025 | Paper | Co-refine + verifier |
## Embodied Agents & World Models

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| An Embodied Generalist Agent in 3D World | LEO | ICML | 2024 | Paper | 3D embodied agent |
| DreamerV3: Mastering Diverse Domains through World Models | DreamerV3 | arXiv | 2023 | Paper | World-model RL |
| World-Model-Augmented Web Agent | WMA Web Agent | ICLR / arXiv | 2025 | Paper | Simulative web agent |
| WorldCoder: Model-Based LLM Agent | WorldCoder | arXiv | 2024 | Paper | Code-based world model |
| WALL-E 2.0: World Alignment via Neuro-Symbolic Learning | WALL-E 2.0 | arXiv | 2025 | Paper | Neuro-symbolic alignment |
| WorldLLM: Curiosity-Driven World Modeling | WorldLLM | arXiv | 2025 | Paper | Curiosity + world model |
| SimuRA: Simulative Reasoning Architecture with World Model | SimuRA | arXiv | 2025 | Paper | Mental simulation |
## Task: Search & Research Agents

| Method | Category | Base LLM | Link | Resource |
|---|---|---|---|---|
| DeepRetrieval | External | Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct | Paper | Code |
| Search-R1 | External | Qwen2.5-3B/7B-Base/Instruct | Paper | Code |
| R1-Searcher | External | Qwen2.5-7B, Llama3.1-8B-Instruct | Paper | Code |
| WebThinker | External | QwQ-32B, DeepSeek-R1-Distilled-Qwen-7B/14B/32B | Paper | Code |
| WebSailor | External | Qwen2.5-3B/7B/32B/72B | Paper | Code |
| SSRL | Internal | Qwen2.5-1.5B/3B/7B/14B/32B/72B-Instruct, Llama-3.2-1B/8B-Instruct | Paper | Code |
| OpenAI Deep Research | External | OpenAI Models | Blog | Website |
| Perplexity DeepResearch | External | – | Blog | Website |
## Task: Code Agents

| Method | RL Reward Type | Base LLM | Link | Resource |
|---|---|---|---|---|
| AceCoder | Outcome | Qwen2.5-Coder-7B-Base/Instruct | Paper | Code |
| DeepCoder-14B | Outcome | DeepSeek-R1-Distilled-Qwen-14B | Blog | Code |
| CodeBoost | Process | Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct | Paper | Code |
| R1-Code-Interpreter | Outcome | Qwen2.5-7B/14B-Instruct-1M | Paper | Code |
| SWE-RL | Outcome | Llama-3.3-70B-Instruct | Paper | Code |
| Satori-SWE | Outcome | Qwen-2.5-Math-7B | Paper | Code |
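An “Outcome” reward in this setting typically means: run the generated program against unit tests and reward pass/fail. A toy version, assuming the model emits a `solve` function (real pipelines sandbox this execution, which the sketch does not):

```python
def outcome_reward(program_src, test_cases):
    """Return 1.0 if the generated `solve` function passes all tests, else 0.0.

    Toy illustration of an outcome (pass/fail) reward: it executes untrusted
    code directly, which real pipelines would sandbox. The `solve` entry
    point and (args, expected) test format are assumptions for the sketch.
    """
    namespace = {}
    try:
        exec(program_src, namespace)
        solve = namespace["solve"]
        for args, expected in test_cases:
            if solve(*args) != expected:
                return 0.0
    except Exception:
        return 0.0  # crashes and syntax errors earn no reward
    return 1.0
```

Process-type rewards (as in CodeBoost) instead score intermediate signals such as compilation or partial test progress rather than a single terminal bit.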
## Task: Mathematical Agents
## Surveys & Position Papers

| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| The Landscape of Agentic Reinforcement Learning for LLMs: A Survey | ARL-Surv | arXiv | 2025 | Paper | Comprehensive agentic RL landscape. |
| Budget-Aware Evaluation of LLM Reasoning Strategies | BudgetEval | EMNLP | 2024 | Paper | Budget-aware reasoning evaluation. |
| Alignment & Preference Optimization in LLM Agents | Align-Pos | arXiv | 2023 | Paper | Alignment and multi-objective methods. |
| A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence | SE Survey | arXiv | 2025 | Paper | Taxonomy and methods for self-evolving agents. |
## Concluding Remarks

Reinforcement learning for AI agents is evolving rapidly, driving breakthroughs in reasoning, autonomy, and collaboration. As new methods and frameworks emerge, staying current is essential for both research and practical deployment. This curated list aims to help the community navigate this dynamic landscape and contribute to it.

💡 Pull requests are welcome to keep this list up to date!

Related lists:

- https://github.com/xhyumiracle/Awesome-AgenticLLM-RL-Papers
- https://github.com/0russwest0/Awesome-Agent-RL
- https://github.com/thinkwee/AgentsMeetRL

🌟 If you find this resource helpful, star the repo and share your favorite RL agent papers or frameworks! Let's build the future of intelligent agents together.