`codes/bandit.py` contains our complete, runnable code for the bandit setting. `codes/risk_sensitive_adv.py` contains the implementation based on VeRL.
Exploration Dilemma: Current RL methods for LLMs improve pass@1 but hurt pass@k performance. They sharpen the policy distribution around a few solutions, leading to a collapse in solution diversity. This prevents the discovery of novel reasoning strategies.
We argue this dilemma arises from a fundamental mismatch between the optimization landscape of LLMs and the dynamics of standard RL algorithms. LLMs begin with a highly specialized policy distribution that is already sharply peaked around certain solutions. If those initial peaks do not lie in the regions that yield optimal rewards, standard RL optimizers face a significant challenge: they struggle to escape the gravitational pull of the pretrained model's biases and tend to converge to a nearby but often suboptimal mode. This prevents the discovery of more diverse and powerful reasoning paths.
We introduce a Risk-Sensitive RL framework to enhance exploration. Our method, RS-GRPO, replaces the standard mean-reward objective with a risk-seeking one that interpolates smoothly between the mean and the maximum reward.
The Risk-Sensitive Objective is defined as:

$$J_\beta(\theta) \;=\; \frac{1}{\beta}\,\log \mathbb{E}_{y \sim \pi_\theta}\!\left[e^{\beta r(y)}\right],$$

where $\pi_\theta$ is the policy, $r(y)$ is the reward of response $y$, and $\beta > 0$ is the risk-sensitivity parameter.
- As $\beta \rightarrow 0$, the objective recovers the standard expected reward, $\mathbb{E}[r(y)]$.
- As $\beta \to +\infty$, the objective approaches the maximum reward, $\max_y r(y)$, encouraging exploration (illustrated in the snippet below).
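A quick numerical check (a minimal sketch, not part of the released code; the `risk_sensitive_value` helper and the example rewards are invented for illustration) shows the objective moving from the mean toward the maximum of a fixed reward sample as $\beta$ grows:

```python
import numpy as np

def risk_sensitive_value(rewards, beta):
    """(1/beta) * log E[exp(beta * r)] over an empirical reward sample."""
    rewards = np.asarray(rewards, dtype=np.float64)
    if beta == 0.0:
        return rewards.mean()  # the beta -> 0 limit is the expected reward
    # log-mean-exp computed stably: shift by the max before exponentiating
    m = (beta * rewards).max()
    return (m + np.log(np.exp(beta * rewards - m).mean())) / beta

sample = [0.0, 0.2, 0.3, 1.0]  # illustrative rewards: mean 0.375, max 1.0
for beta in [0.0, 1.0, 4.0, 16.0, 64.0]:
    print(f"beta = {beta:5.1f}   J_beta = {risk_sensitive_value(sample, beta):.3f}")
# The printed value climbs from 0.375 (the mean) toward 1.0 (the max) as beta increases.
```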
The corresponding Risk-Sensitive Policy Gradient is:

$$\nabla_\theta J_\beta(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta}\!\left[A^{\mathrm{RS}}_\beta(y)\,\nabla_\theta \log \pi_\theta(y)\right], \qquad A^{\mathrm{RS}}_\beta(y) = \frac{1}{\beta}\left(\frac{e^{\beta r(y)}}{\mathbb{E}_{y' \sim \pi_\theta}\!\left[e^{\beta r(y')}\right]} - 1\right),$$
where the Risk-Sensitive Advantage can be approximated from a group of $G$ responses sampled for the same prompt as:

$$\hat{A}^{\mathrm{RS}}_i \;=\; \frac{1}{\beta}\left(\frac{e^{\beta r_i}}{\frac{1}{G}\sum_{j=1}^{G} e^{\beta r_j}} - 1\right), \qquad i = 1, \dots, G,$$

where $r_i$ is the reward of the $i$-th response. As $\beta \to 0$, this reduces to the mean-centered reward $r_i - \frac{1}{G}\sum_j r_j$, recovering the risk-neutral case.
A key feature of this formulation is that it only alters the advantage computation while leaving the policy gradient structure intact. This allows our risk-sensitive advantage to serve as a drop-in replacement in existing GRPO-based RL algorithms, requiring only minimal code modifications.
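To illustrate the drop-in nature of the change, here is a minimal PyTorch sketch of a group-wise risk-sensitive advantage. It is not the released `codes/risk_sensitive_adv.py` and omits details such as GRPO's standard-deviation normalization; the function name and the example rewards are illustrative only.

```python
import torch

def risk_sensitive_advantage(rewards: torch.Tensor, beta: float) -> torch.Tensor:
    """Risk-sensitive advantages for one group of G responses to the same prompt.

    Computes A_i = (exp(beta * r_i) / mean_j exp(beta * r_j) - 1) / beta in a
    numerically stable way; beta = 0 falls back to mean-centered rewards.
    """
    if beta == 0.0:
        return rewards - rewards.mean()
    w = torch.exp(beta * (rewards - rewards.max()))  # stable exponential weights
    w = w / w.mean()                                 # normalize so mean(w) == 1
    return (w - 1.0) / beta                          # zero-mean advantages

# Example group with graded rewards: at beta = 0 every above-average response is
# reinforced, while at larger beta credit concentrates on the best response.
r = torch.tensor([0.1, 0.4, 0.6, 1.0])
print(risk_sensitive_advantage(r, beta=0.0))  # ~[-0.43, -0.13,  0.08,  0.48]
print(risk_sensitive_advantage(r, beta=8.0))  # ~[-0.12, -0.12, -0.11,  0.35]
```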
We study the exploration dynamics of both objectives in a simplified bandit setting.
We compare a single policy update for both the standard policy gradient and our risk-sensitive policy gradient.
Our analysis reveals a key weakness in the standard policy gradient: it can decrease the probability of the optimal action.
Lemma 1: The standard policy gradient update can decrease the probability of the optimal action.
In contrast, our risk-sensitive approach ensures improvement for the optimal action given a sufficiently large risk-sensitivity parameter $\beta$.
Lemma 2: For any policy, the risk-sensitive update increases the probability of the optimal action when $\beta$ is large enough.
These lemmas explain why increasing $\beta$ promotes exploration and helps the policy escape suboptimal modes.
More details can be found in the paper.
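To make the lemmas concrete, the toy example below performs one exact policy-gradient step on a three-armed softmax bandit. It is an independent sketch (not `codes/bandit.py`), and the arm rewards, initial probabilities, and step size are invented for illustration; with these numbers, the standard update lowers the optimal arm's probability while a large-$\beta$ risk-sensitive update raises it.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_step(logits, rewards, beta, lr):
    """One exact policy-gradient step on a softmax bandit.

    beta = 0 uses the standard advantage r - E[r]; beta > 0 uses the
    risk-sensitive advantage (exp(beta*r) / E[exp(beta*r)] - 1) / beta.
    """
    pi = softmax(logits)
    if beta == 0.0:
        adv = rewards - pi @ rewards
    else:
        w = np.exp(beta * (rewards - rewards.max()))
        adv = (w / (pi @ w) - 1.0) / beta
    grad = pi * adv  # exact gradient w.r.t. tabular softmax logits (adv is zero-mean under pi)
    return softmax(logits + lr * grad)

# Illustrative setup (not from the paper): arm 0 is optimal but rarely sampled.
rewards = np.array([1.0, 0.8, 0.0])
logits = np.log(np.array([0.01, 0.90, 0.09]))
print("initial   p(optimal arm) =", softmax(logits)[0])                               # 0.0100
print("standard  p(optimal arm) =", one_step(logits, rewards, beta=0.0, lr=5.0)[0])   # decreases (Lemma 1)
print("beta = 20 p(optimal arm) =", one_step(logits, rewards, beta=20.0, lr=5.0)[0])  # increases (Lemma 2)
```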
A bandit experiment demonstrating that risk-sensitive RL can escape a local optimum that traps its standard RL counterpart.
- Left: The reward landscape shows a global optimum and a distinct local optimum where the policy is initialized.
- Right: A standard risk-neutral policy ($\beta=0$) is trapped locally, while risk-sensitive policies ($\beta \geq 4$) converge to the global optimum.
The table provides a more comprehensive evaluation, covering five base models and three training datasets (math12k, deepmath103k, dapo17k). RS-GRPO consistently improves pass@k performance over the standard GRPO algorithm. While many pass@k-oriented methods fail to improve pass@1 over GRPO, RS-GRPO achieves at least comparable pass@1 performance and exceeds GRPO by an average of about 2% across three models (Qwen2.5-7B-Math, Qwen2.5-7B, Qwen3-4B).
We recommend setting the risk-sensitivity parameter $\beta$ as described in the paper.
If you find our work useful, please cite our paper:
```bibtex
@article{jiang2025riskrl,
  title={Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models},
  author={Yuhua Jiang and Jiawei Huang and Yufeng Yuan and Xin Mao and Yu Yue and Qianchuan Zhao and Lin Yan},
  year={2025},
  journal={arXiv preprint arXiv:2509.24261},
  url={https://arxiv.org/abs/2509.24261},
}
```



