Hi authors,
Thank you for releasing this codebase and for the great work on RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents (ICLR 2026 submission).
I am trying to reproduce the RLVER PPO experiments using this repo, but I ran into an inconsistency between the paper and the released code, and I would really appreciate some clarification.
1. Multi-turn rollout & custom_repeat_by_counts
In the RL training loop, there is a version of fit() (in ray_trainer_think.py) that uses multi-turn rollouts and operates on per-sample dialogue turn counts:
```python
turn_count_list = gen_batch_output.non_tensor_batch['dialogue_turns'].tolist()
print("turn_count_list", turn_count_list)

consolidated_turn_count = []
current_count = turn_count_list[0]
remaining = current_count
for count in turn_count_list:
    if count == current_count and remaining > 0:
        remaining -= 1
    else:
        consolidated_turn_count.append(current_count)
        current_count = count
        remaining = current_count - 1
if remaining == 0:
    consolidated_turn_count.append(current_count)

batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
batch = batch.custom_repeat_by_counts(repeat_counts=consolidated_turn_count, interleave=True)
batch = batch.union(gen_batch_output)
```
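If I read it correctly, the consolidation loop collapses the per-turn-expanded list of counts back to one count per dialogue, assuming a dialogue with k turns contributes k consecutive copies of k. A toy check of that reading (my own standalone code, purely for illustration, not from the repo):

```python
# My own standalone check of how I read the consolidation loop above;
# the input list is made up and not taken from an actual rollout.
def consolidate(turn_count_list):
    consolidated = []
    current_count = turn_count_list[0]
    remaining = current_count
    for count in turn_count_list:
        if count == current_count and remaining > 0:
            remaining -= 1
        else:
            consolidated.append(current_count)
            current_count = count
            remaining = current_count - 1
    if remaining == 0:
        consolidated.append(current_count)
    return consolidated

# a 3-turn, a 2-turn and a 4-turn dialogue, flattened per turn:
assert consolidate([3, 3, 3, 2, 2, 4, 4, 4, 4]) == [3, 2, 4]
```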
However, in the DataProto class shipped in this repo, there is no custom_repeat_by_counts method, so running this version of fit() raises:
```
AttributeError: 'DataProto' object has no attribute 'custom_repeat_by_counts'
```
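For reference, this is what I currently assume the missing helper is meant to do: repeat the i-th sample repeat_counts[i] times so the prompt batch lines up with the per-turn generation output. The sketch below is only my guess at the intended semantics, written against the upstream verl DataProto layout (batch as a TensorDict, non_tensor_batch as a dict of numpy arrays), e.g. as something one could monkey-patch onto DataProto just to get this version of fit() to run; please correct me if the real implementation differs:

```python
# Purely my guess at the intended semantics of the missing helper, assuming
# the upstream verl DataProto dataclass (batch: TensorDict, non_tensor_batch:
# dict of numpy arrays, meta_info: dict). Not the authors' implementation.
import numpy as np
import torch
from tensordict import TensorDict


def custom_repeat_by_counts(self, repeat_counts, interleave=True):
    """Repeat the i-th sample of the DataProto repeat_counts[i] times."""
    repeats = torch.as_tensor(repeat_counts, dtype=torch.long)
    new_size = int(repeats.sum())

    repeated_batch = None
    if self.batch is not None:
        assert self.batch.batch_size[0] == len(repeat_counts), "one count per sample expected"
        if interleave:
            # sample 0 repeated counts[0] times, then sample 1, and so on,
            # which is how I assume the multi-turn generation output is laid out
            repeated_batch = TensorDict(
                {k: torch.repeat_interleave(v, repeats, dim=0) for k, v in self.batch.items()},
                batch_size=[new_size],
            )
        else:
            raise NotImplementedError("only the interleave=True path is exercised in fit()")

    repeated_non_tensor = {
        k: np.repeat(v, repeat_counts, axis=0) for k, v in self.non_tensor_batch.items()
    }

    return type(self)(
        batch=repeated_batch,
        non_tensor_batch=repeated_non_tensor,
        meta_info=self.meta_info,
    )
```

If the intended semantics differ (for example, a different row ordering than consecutive per-sample blocks), knowing that would already help a lot.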
At the same time, there is another simpler version of fit() which only does:
```python
batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
batch = batch.union(gen_batch_output)
self._balance_batch(batch, metrics=metrics)
```
This version does not use dialogue_turns or per-sample repeat counts, and seems more like the standard VERL PPO loop (without the multi-turn alignment logic described in the RLVER paper).
- My questions
To faithfully reproduce the RLVER results on multi-turn conversations, I'm not sure which implementation I should follow or what the intended behavior is. Concretely, could you please clarify:
1. Where is DataProto.custom_repeat_by_counts defined?
   - Is there an internal / modified verl version that includes this method but hasn't been pushed to this repo yet?
   - If possible, could you share the implementation or the intended semantics of custom_repeat_by_counts (e.g., how it uses consolidated_turn_count to expand the batch)?
2. Which fit() implementation corresponds to the RLVER paper experiments?
   - Does the RLVER paper use the multi-turn code path that relies on dialogue_turns + custom_repeat_by_counts?
   - Or are the released results based on the simpler batch.repeat(...); _balance_batch(...) version?
3. Multi-turn reward shaping details: since RLVER uses multi-turn simulated users and turn-level emotion scores, it would be very helpful if you could confirm
   - how the dialogue is grouped into episodes when dialogue_turns varies per sample;
   - how the per-turn rewards are aggregated and aligned with the PPO rollout n (e.g., rollout.n=1 vs. n>1); see the small sketch after this list for the two readings I have in mind.
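To make point 3 concrete, these are the two readings I am hesitating between; the snippet is purely my own illustration with made-up variable names, not code from this repo:

```python
# Hypothetical illustration only (variable names are made up, not from this repo):
# two prompts, rollout.n = 1, with dialogues of 3 and 2 turns respectively.
dialogue_turns = [3, 2]
per_turn_scores = [[0.2, 0.5, 0.9], [0.4, 0.7]]  # simulated-user emotion score after each turn

# Reading A: every turn becomes its own PPO sample, so after the repeat-by-counts
# expansion the reward tensor has sum(dialogue_turns) rows, one per turn.
rewards_reading_a = [score for scores in per_turn_scores for score in scores]
assert len(rewards_reading_a) == sum(dialogue_turns)

# Reading B: each dialogue is a single episode and only an aggregate score
# (e.g. the final turn's emotion score) is assigned to the whole trajectory,
# giving one reward per dialogue regardless of its length.
rewards_reading_b = [scores[-1] for scores in per_turn_scores]
assert len(rewards_reading_b) == len(dialogue_turns)
```

With rollout.n > 1 I would additionally expect each prompt's n rollouts to carry their own turn counts, so the expansion would happen per (prompt, rollout) pair; please let me know if either reading matches the paper's setup.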
- Minimal reproduction
- Task: RLVER PPO training with the multi-turn environment (vllm_multi_turn_via_chat)
- Model: e.g., Qwen/Qwen2.5-3B-Instruct
- Code path: the fit() version that calls batch.custom_repeat_by_counts(...)
- Error:

```
AttributeError: 'DataProto' object has no attribute 'custom_repeat_by_counts'
```
If I instead switch to the simpler fit() version (without custom_repeat_by_counts), the code runs, but then I’m not confident that this matches the multi-turn rollout and reward handling described in the RLVER paper.
Thank you very much for your time and for any guidance you can provide. Having these details (or the missing helper function) would greatly help the community reproduce and build on RLVER.