[📖paper] [🤗Reagent model] [🤗Agent-RRM]
[🤗Reagent-SFT-55.6K Dataset] [🤗Reagent-RL-709K Dataset]
[🤗Reagent-RRM-SFT-28K Dataset] [🤗Reagent-RRM-RL-90K Dataset]
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training, which fail to differentiate intermediate reasoning quality, leading to suboptimal training results.
Reagent introduces the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories (sketched after the list below), including:
- Explicit reasoning trace: Step-by-step reasoning analysis
- Focused critique: Refinement guidance highlighting reasoning flaws
- Overall score: Process performance evaluation
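As a rough illustration, these three components form one structured feedback record per trajectory. The field names and score range in the sketch below are illustrative assumptions, not the released model's actual output schema:

```python
# Illustrative sketch of Agent-RRM's structured feedback; field names and
# the score range are assumptions, not the model's actual output format.
from dataclasses import dataclass

@dataclass
class RRMFeedback:
    reasoning_trace: str  # step-by-step analysis of the agent's trajectory
    critique: str         # focused guidance pointing out reasoning flaws
    score: float          # overall process score, e.g. in [0, 1]

feedback = RRMFeedback(
    reasoning_trace="Step 1: the agent correctly decomposed the question into two sub-queries...",
    critique="The second search query dropped the date constraint stated in the task.",
    score=0.6,
)
```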
We systematically investigate three integration strategies:
- Reagent-C: Text-augmented refinement with critique-based improvements
- Reagent-R: Reward-augmented guidance for policy optimization (see the sketch after this list)
- Reagent-U: Unified feedback integration combining all signals
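For intuition, reward-augmented guidance can be pictured as blending the RRM's process score with the sparse outcome reward before policy optimization. The linear weighting below is a hypothetical illustration, not the exact formulation used by Reagent-R:

```python
def blended_reward(outcome_reward: float, rrm_score: float, alpha: float = 0.5) -> float:
    """Mix the sparse outcome reward with the RRM process score.

    `alpha` is a hypothetical mixing coefficient chosen for illustration;
    the actual Reagent-R integration may differ.
    """
    return (1.0 - alpha) * outcome_reward + alpha * rrm_score

# e.g. a failed task (outcome 0.0) with a mostly sound reasoning process (0.8)
# still yields a non-zero training signal:
print(blended_reward(0.0, 0.8))  # 0.4
```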
Extensive evaluations across 12 diverse benchmarks demonstrate substantial performance improvements:
- GAIA: 43.7% accuracy
- WebWalkerQA: 46.2% accuracy
- Superior results on multi-hop QA, web navigation, mathematical reasoning, and general agent tasks
We evaluate Reagent across diverse benchmarks covering Knowledge-Intensive Reasoning, Mathematical Reasoning, and General Agent and Search Tasks.
- Create and activate conda environment:

```bash
conda create -n reagent python=3.11
conda activate reagent
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install ffmpeg for audio processing:

```bash
conda install -c conda-forge ffmpeg
```

Step 1: Setup Agent Reasoning Reward Model
Start the reward model servers (requires multiple GPUs):
```bash
bash ./examples/reagent/reward_model.sh
```
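Once the servers are up, you can sanity-check them with a quick query. The snippet below is a hypothetical sketch that assumes the script exposes an OpenAI-compatible endpoint (as vLLM servers typically do); the host, port, model name, and prompt format are placeholders, not the project's actual interface:

```python
# Hypothetical sanity check of a reward-model server; endpoint, model name,
# and prompt format are assumptions, not the project's actual interface.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

trajectory = (
    "Question: ...\n"
    "Thought: ...\nAction: search(...)\nObservation: ...\n"
    "Final Answer: ..."
)
resp = client.chat.completions.create(
    model="Agent-RRM",  # placeholder model name
    messages=[{"role": "user", "content": f"Evaluate this agent trajectory:\n{trajectory}"}],
)
# Expected output: a reasoning trace, a focused critique, and an overall score
print(resp.choices[0].message.content)
```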
Step 2: Prepare Training Data

Process and prepare the training datasets:

```bash
python ./examples/reagent/prepare_data.py
```

Step 3: Start Training
Launch the training process:
```bash
bash ./examples/reagent/train.sh
```

Note: Make sure to configure paths and hyperparameters in the respective shell scripts before training.
For SFT Training, please refer to LLaMA-Factory.
We provide a unified evaluation framework supporting 18 diverse benchmarks with automatic tool and prompt configuration.
Multimodal Tasks:
- `gaia` - GAIA benchmark with multimodal capabilities
- `gaia_text` - GAIA text-only subset
Math Reasoning:
- `math500` - MATH500 benchmark
- `aime24`, `aime25` - AIME competition problems
- `gsm8k` - GSM8K math reasoning
Multi-hop QA & Web Navigation:
- `2wiki` - 2WikiMultihopQA
- `hotpotqa` - HotpotQA
- `musique` - MuSiQue
- `bamboogle` - Bamboogle
- `webwalker` - WebWalker
- `hle` - HLE benchmark
- `xbench` - XBench-DeepSearch
```bash
cd Evaluation/unified_eval

# 1. Configure your settings in run.sh
#    Set MODEL_PATH, OUTPUT_PATH, and API keys

# 2. Start vLLM servers (if not already running)
bash start_server.sh

# 3. Run evaluation on any dataset
#    Just change the DATASET variable in run.sh
DATASET="gaia"      # For multimodal tasks
# DATASET="math500" # For math reasoning
# DATASET="xbench"  # For web navigation

bash run.sh
```

Alternatively, call the evaluation script directly:

```bash
python3 run_multi_react_py.py \
    --dataset gaia \
    --model /path/to/model \
    --output ./output \
    --max_workers 16 \
    --roll_out_count 3 \
    --temperature 0.6 \
    --auto_judge \
    --judge_engine deepseekchat
```

Key Features:
- Automatic Configuration: Tools and prompts are automatically configured per dataset
- Single Codebase: One set of scripts for all 18 benchmarks
- Easy Dataset Switching: Change evaluation target by modifying just one variable
- Parallel Processing: Multi-worker support for faster evaluation
- Auto Judgment: Built-in automatic answer evaluation
After evaluation, use get_score.py to calculate pass@1 and pass@k metrics:
```bash
cd Evaluation

# Calculate pass@1 for a single result file
python get_score.py --file output/model_name_unified/gaia/iter1.jsonl

# Calculate pass@k across multiple rollouts
python get_score.py --folder output/model_name_unified/gaia/
```

The script will:
- pass@1: Calculate accuracy for each individual rollout file
- pass@k: Calculate the percentage of questions where at least one rollout succeeded
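For clarity, the pass@k aggregation can be reproduced from the per-rollout result files along these lines; the JSONL field names (`question`, `correct`) are assumptions for illustration and may differ from get_score.py's actual schema:

```python
import json
from collections import defaultdict
from pathlib import Path

def pass_at_k(folder: str) -> float:
    """Fraction of questions answered correctly by at least one rollout.

    Assumes each rollout file is a JSONL with one record per question,
    carrying a question identifier and a boolean correctness flag; the
    field names below are hypothetical.
    """
    solved = defaultdict(bool)
    for rollout_file in Path(folder).glob("*.jsonl"):
        for line in rollout_file.read_text().splitlines():
            record = json.loads(line)
            solved[record["question"]] |= bool(record["correct"])
    return sum(solved.values()) / len(solved) if solved else 0.0
```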
If you find this work helpful, please consider citing:
```bibtex
@article{fan2026exploring,
  title={Exploring Reasoning Reward Model for Agents},
  author={Fan, Kaixuan and Feng, Kaituo and Zhang, Manyuan and Peng, Tianshuo and Li, Zhixun and Jiang, Yilei and Chen, Shuang and Pei, Peng and Cai, Xunliang and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2601.22154},
  year={2026}
}
```

The implementation of this project is built upon LLaMA-Factory, veRL, and rLLM. We deeply appreciate these teams for their contributions to the open-source research community.


