[📖paper] [🤗Reagent model] [🤗Agent-RRM]
[🤗Reagent-SFT-55.6K Dataset] [🤗Reagent-RL-709K Dataset]
[🤗Reagent-RRM-SFT-28K Dataset] [🤗Reagent-RRM-RL-90K Dataset]
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training, which fail to differentiate intermediate reasoning quality, leading to suboptimal training results.
Reagent introduces the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories (sketched after the list below), including:
- Explicit reasoning trace: Step-by-step reasoning analysis
- Focused critique: Refinement guidance highlighting reasoning flaws
- Overall score: Process performance evaluation
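As a rough illustration, these three components form one structured feedback record per trajectory. The field names and score range in the sketch below are illustrative assumptions, not the released model's actual output schema:

```python
# Illustrative sketch of Agent-RRM's structured feedback; field names and
# the score range are assumptions, not the model's actual output format.
from dataclasses import dataclass

@dataclass
class RRMFeedback:
    reasoning_trace: str  # step-by-step analysis of the agent's trajectory
    critique: str         # focused guidance pointing out reasoning flaws
    score: float          # overall process score, e.g. in [0, 1]

feedback = RRMFeedback(
    reasoning_trace="Step 1: the agent correctly decomposed the question into two sub-queries...",
    critique="The second search query dropped the date constraint stated in the task.",
    score=0.6,
)
```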
We systematically investigate three integration strategies:
- Reagent-C: Text-augmented refinement with critique-based improvements
- Reagent-R: Reward-augmented guidance for policy optimization (see the sketch after this list)
- Reagent-U: Unified feedback integration combining all signals
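For intuition, reward-augmented guidance can be pictured as blending the RRM's process score with the sparse outcome reward before policy optimization. The linear weighting below is a hypothetical illustration, not the exact formulation used by Reagent-R:

```python
def blended_reward(outcome_reward: float, rrm_score: float, alpha: float = 0.5) -> float:
    """Mix the sparse outcome reward with the RRM process score.

    `alpha` is a hypothetical mixing coefficient chosen for illustration;
    the actual Reagent-R integration may differ.
    """
    return (1.0 - alpha) * outcome_reward + alpha * rrm_score

# e.g. a failed task (outcome 0.0) with a mostly sound reasoning process (0.8)
# still yields a non-zero training signal:
print(blended_reward(0.0, 0.8))  # 0.4
```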
Extensive evaluations across 12 diverse benchmarks demonstrate substantial performance improvements:
- GAIA: 43.7% accuracy
- WebWalkerQA: 46.2% accuracy
- Superior results on multi-hop QA, web navigation, mathematical reasoning, and general agent tasks
We evaluate Reagent across diverse benchmarks covering Knowledge-Intensive Reasoning, Mathematical Reasoning, and General Agent and Search Tasks.
- Create and activate conda environment:

```bash
conda create -n reagent python=3.11
conda activate reagent
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install ffmpeg for audio processing:

```bash
conda install -c conda-forge ffmpeg
```

Step 1: Setup Agent Reasoning Reward Model
Start the reward model servers (requires multiple GPUs):
```bash
bash ./examples/reagent/reward_model.sh
```
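Once the servers are up, you can sanity-check them with a quick query. The snippet below is a hypothetical sketch that assumes the script exposes an OpenAI-compatible endpoint (as vLLM servers typically do); the host, port, model name, and prompt format are placeholders, not the project's actual interface:

```python
# Hypothetical sanity check of a reward-model server; endpoint, model name,
# and prompt format are assumptions, not the project's actual interface.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

trajectory = (
    "Question: ...\n"
    "Thought: ...\nAction: search(...)\nObservation: ...\n"
    "Final Answer: ..."
)
resp = client.chat.completions.create(
    model="Agent-RRM",  # placeholder model name
    messages=[{"role": "user", "content": f"Evaluate this agent trajectory:\n{trajectory}"}],
)
# Expected output: a reasoning trace, a focused critique, and an overall score
print(resp.choices[0].message.content)
```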
Step 2: Prepare Training Data

Process and prepare the training datasets:

```bash
python ./examples/reagent/prepare_data.py
```

Step 3: Start Training
Launch the training process:
```bash
bash ./examples/reagent/train.sh
```

Note: Make sure to configure paths and hyperparameters in the respective shell scripts before training.
For SFT Training, please refer to LLaMA-Factory.
We provide a unified evaluation framework supporting 18 diverse benchmarks with automatic tool and prompt configuration.
Multimodal Tasks:
- `gaia` - GAIA benchmark with multimodal capabilities
- `gaia_text` - GAIA text-only subset
Math Reasoning:
- `math500` - MATH500 benchmark
- `aime24`, `aime25` - AIME competition problems
- `gsm8k` - GSM8K math reasoning
Multi-hop QA & Web Navigation:
- `2wiki` - 2WikiMultihopQA
- `hotpotqa` - HotpotQA
- `musique` - MuSiQue
- `bamboogle` - Bamboogle
- `webwalker` - WebWalker
- `hle` - HLE benchmark
- `xbench` - XBench-DeepSearch
```bash
cd Evaluation/unified_eval

# 1. Configure your settings in run.sh
#    Set MODEL_PATH, OUTPUT_PATH, and API keys

# 2. Start vLLM servers (if not already running)
bash start_server.sh

# 3. Run evaluation on any dataset
#    Just change the DATASET variable in run.sh
DATASET="gaia"      # For multimodal tasks
# DATASET="math500" # For math reasoning
# DATASET="xbench"  # For web navigation

bash run.sh
```

Alternatively, call the evaluation script directly:

```bash
python3 run_multi_react_py.py \
    --dataset gaia \
    --model /path/to/model \
    --output ./output \
    --max_workers 16 \
    --roll_out_count 3 \
    --temperature 0.6 \
    --auto_judge \
    --judge_engine deepseekchat
```

Key Features:
- Automatic Configuration: Tools and prompts are automatically configured per dataset
- Single Codebase: One set of scripts for all 18 benchmarks
- Easy Dataset Switching: Change evaluation target by modifying just one variable
- Parallel Processing: Multi-worker support for faster evaluation
- Auto Judgment: Built-in automatic answer evaluation
After evaluation, use get_score.py to calculate pass@1 and pass@k metrics:
```bash
cd Evaluation

# Calculate pass@1 for a single result file
python get_score.py --file output/model_name_unified/gaia/iter1.jsonl

# Calculate pass@k across multiple rollouts
python get_score.py --folder output/model_name_unified/gaia/
```

The script will:
- pass@1: Calculate accuracy for each individual rollout file
- pass@k: Calculate the percentage of questions where at least one rollout succeeded
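For clarity, the pass@k aggregation can be reproduced from the per-rollout result files along these lines; the JSONL field names (`question`, `correct`) are assumptions for illustration and may differ from get_score.py's actual schema:

```python
import json
from collections import defaultdict
from pathlib import Path

def pass_at_k(folder: str) -> float:
    """Fraction of questions answered correctly by at least one rollout.

    Assumes each rollout file is a JSONL with one record per question,
    carrying a question identifier and a boolean correctness flag; the
    field names below are hypothetical.
    """
    solved = defaultdict(bool)
    for rollout_file in Path(folder).glob("*.jsonl"):
        for line in rollout_file.read_text().splitlines():
            record = json.loads(line)
            solved[record["question"]] |= bool(record["correct"])
    return sum(solved.values()) / len(solved) if solved else 0.0
```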
If you find this work helpful, please consider citing:
```bibtex
@article{fan2026exploring,
  title={Exploring Reasoning Reward Model for Agents},
  author={Fan, Kaixuan and Feng, Kaituo and Zhang, Manyuan and Peng, Tianshuo and Li, Zhixun and Jiang, Yilei and Chen, Shuang and Pei, Peng and Cai, Xunliang and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2601.22154},
  year={2026}
}
```

The implementation of this project is built upon LLaMA-Factory, veRL, and rLLM. We deeply appreciate these teams for their contributions to the open-source research community.


