Thanks for the great work! I noticed that the samples are reordered so they are distributed across the GPUs, which reduces the long-tail effect of load imbalance. As a result, compared to the original verl, there is a repeat without interleave here.
However, my question is this: the original `rollout.n` is passed to the vLLM rollout, and I can't find any place that sets it to 1. Wouldn't this extra repeat cause each prompt to be evaluated n * n times? The vLLM config is taken from the config here.
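
To make the concern concrete, here is a minimal sketch in plain Python. It is not the actual verl or vLLM code: `repeat` and `mock_rollout` are hypothetical stand-ins, and the sketch assumes the rollout engine still draws n samples for every prompt it receives.

```python
from collections import Counter


def repeat(batch, n, interleave):
    """Hypothetical helper: repeat each prompt n times.

    interleave=True  -> [p0, p0, ..., p1, p1, ...]
    interleave=False -> [p0, p1, ..., p0, p1, ...] (whole batch tiled n times)
    """
    if interleave:
        return [p for p in batch for _ in range(n)]
    return [p for _ in range(n) for p in batch]


def mock_rollout(batch, n):
    """Hypothetical stand-in for a rollout engine configured with n samples per prompt."""
    return [(prompt, sample_idx) for prompt in batch for sample_idx in range(n)]


prompts = ["p0", "p1", "p2"]
n = 4

interleaved = repeat(prompts, n, interleave=True)
tiled = repeat(prompts, n, interleave=False)

# If the already-repeated batch is passed to an engine that still samples n times,
# every original prompt ends up with n * n responses.
responses = mock_rollout(tiled, n)
counts = Counter(prompt for prompt, _ in responses)
print(counts)
```

If the engine's n were set to 1 somewhere before generation, each prompt would get exactly n responses instead, which is what I would expect the intended behavior to be.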