[algo] Adding CISPO policy loss #150
base: verl-latest
# else:
# is_correct = are_equal_under_sympy(ground_truth_elem, given_elem)
do we want to remove this?
expr = expr.replace("\\dfrac", "\\frac")
expr = expr.replace("\\frac", " \\frac")  # Play nice with mixed numbers.
expr = latex2text.LatexNodes2Text().latex_to_text(expr)
# expr = latex2text.LatexNodes2Text().latex_to_text(expr)
Please also add the following comment: # Added by Reasoning360
imo, it would be ideal to create another directory called cispo instead of adding modifications in dapo
What does this PR do?
This PR adds CISPO to core_algos.py to start integrations toward a full implementation of scaleRL.
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
[BREAKING] to the beginning of the title.
[BREAKING][fsdp, megatron] feat: dynamic batching
API and Usage Example
CISPO is a sampled policy gradient loss that borrows heavily from the REINFORCE family of algorithms. It differs from GRPO and other PPO derivatives in that the importance sampling (IS) ratio is clipped directly, rather than applying the full policy clipping used in PPO. This introduces two new hyperparameters, cispo_clip_ratio_high and cispo_clip_ratio_low, to control this clipping. Each defaults to 0.2.
We've also introduced a new policy loss function for CISPO, which is selected by setting loss_mode in the run configuration. Altogether this looks like: