[ICLR 2026] Official codebase for "Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models"

AttnRL

🔔 News

  • [2026-01-26] 🎉 Our work has been accepted to ICLR 2026!
  • [2025-10-21] 📢 Our work was covered by Synced (机器之心)!
  • [2025-10-10] ✨ Code is now available.
  • [2025-09-30] 📄 Our paper is released on arXiv.

🚀 Getting Started

Installation

Clone the repository:

git clone https://github.com/RyanLiu112/AttnRL.git
cd AttnRL

Create a new conda environment and install the dependencies:

conda create -n attnrl python=3.10
conda activate attnrl
bash scripts/install_vllm_sglang_mcore.sh

Data Preparation

The training dataset (DeepScaleR-Preview-Dataset) is at data/train/deepscaler_train.parquet and contains about 40.3k mathematical reasoning problems. The evaluation datasets are in data/eval/, where the suffix _${K} in a file name indicates how many times each question is duplicated.
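As a small illustration of the _${K} naming convention, the sketch below reads K back from a file name. The helper name suffix_k and the file name example_32.parquet are hypothetical, and the .parquet extension for the evaluation files is an assumption based on the training file; adjust both to match what is actually in data/eval/.

```shell
# Hypothetical helper: print the K suffix of a name such as "example_32.parquet".
# Assumes the evaluation files end in "_<K>.parquet".
suffix_k() {
  name=$(basename "$1" .parquet)   # strip the directory and the extension
  printf '%s\n' "${name##*_}"      # keep the text after the last underscore
}

# Hypothetical file name, used only to show the convention.
suffix_k data/eval/example_32.parquet
```

With the hypothetical name above, the helper prints 32, meaning each question would appear 32 times in that file.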

Training

To train AttnRL with the DeepSeek-R1-Distill-Qwen-1.5B backbone on 8 H100 GPUs, run:

bash recipe/attnrl/run_attnrl_r1_distill_1.5b_8k.sh

Evaluation

Evaluation uses the same scripts as training. Add +trainer.val_only=True to run evaluation only. We recommend setting data.max_prompt_length=2048 and data.max_response_length=32768.
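The settings above can be collected as follows. This is a dry-run sketch that only prints the resulting command. It assumes the recipe script forwards extra command-line arguments to the trainer, which this README does not confirm; if it does not, add the same three overrides to the trainer invocation inside the script instead. The variable name EVAL_OVERRIDES is illustrative.

```shell
# Overrides recommended for evaluation-only runs (values taken from this README).
EVAL_OVERRIDES="+trainer.val_only=True data.max_prompt_length=2048 data.max_response_length=32768"

# Dry run: print the command instead of executing it.
# Assumption: the script passes extra arguments through to the trainer.
echo "bash recipe/attnrl/run_attnrl_r1_distill_1.5b_8k.sh $EVAL_OVERRIDES"
```

Once you have confirmed how the script handles extra arguments, drop the echo and the quotes to run the command itself.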

📝 Citation

If you find this work helpful, please kindly cite our paper:

@inproceedings{AttnRL,
    title     = {Attention as a Compass: Efficient Exploration for Process-Supervised {RL} in Reasoning Models},
    author    = {Runze Liu and Jiakang Wang and Yuling Shi and Zhihui Xie and Chenxin An and Kaiyan Zhang and Jian Zhao and Xiaodong Gu and Lei Lin and Wenping Hu and Xiu Li and Fuzheng Zhang and Guorui Zhou and Kun Gai},
    booktitle = {The Fourteenth International Conference on Learning Representations},
    year      = {2026},
    url       = {https://openreview.net/forum?id=NCN8oUsiNf}
}

💡 Acknowledgements

Our code is based on verl (commit) and TreeRL. Our training dataset is DeepScaleR-Preview-Dataset, and our rule-based verifier is based on Skywork-OR1 and Archer.
