Add GRPO callbacks for OLMo-core Trainer (GRPO olmo-core: PR 3 of 5) #1397
base: finbarr/grpo-train-module
Conversation
Summary of Changes (gemini-code-assist)

This pull request introduces foundational components for Group Relative Policy Optimization (GRPO) within the OLMo-core framework. It establishes new callback mechanisms for managing model weight synchronization with vLLM inference engines and for updating reference policies, alongside a dedicated training module that implements the GRPO algorithm. These additions lay the groundwork for reinforcement learning from human feedback (RLHF) training workflows, specifically for the OLMo-core actor.
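For orientation, here is a minimal sketch of the Polyak (exponential moving average) update at the heart of a reference-policy callback like the one described. The function name and the alpha default are illustrative, not the PR's actual API:

```python
import torch


@torch.no_grad()
def polyak_update(ref_model: torch.nn.Module, policy: torch.nn.Module, alpha: float = 0.05) -> None:
    """Nudge the reference policy toward the current policy:
    ref <- (1 - alpha) * ref + alpha * policy."""
    for ref_p, p in zip(ref_model.parameters(), policy.parameters()):
        ref_p.mul_(1.0 - alpha).add_(p, alpha=alpha)
```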
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a8a137290
Code Review
This pull request introduces GRPO-specific callbacks and a training module for OLMo-core, which are foundational components for the OLMo-core actor. The changes include VLLMWeightSyncCallback for synchronizing weights, RefPolicyUpdateCallback for Polyak updates, and GRPOTrainModule for the GRPO training algorithm. The pyproject.toml file has been updated to include these new files for type checking. Overall, the code is well-structured and follows existing patterns, but there are a few areas for improvement regarding type safety, code duplication, and error handling specificity.
…e: PR 2 of 5) Adds a GRPO training module that subclasses OLMo-core's TransformerTrainModule. This inherits optim_step, zero_grads, eval_batch, state_dict, etc. and only overrides train_batch with the GRPO loss computation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
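A minimal sketch of that subclassing shape, under the assumption that OLMo-core's TransformerTrainModule exposes the hooks named in the commit message (the exact import path and train_batch signature may differ by OLMo-core version):

```python
# Hypothetical sketch: the exact base-class import path and the
# train_batch signature depend on the OLMo-core version in use.
from olmo_core.train.train_module import TransformerTrainModule


class GRPOTrainModule(TransformerTrainModule):
    """Reuses optim_step, zero_grads, eval_batch, state_dict, etc. from
    the base class; only the per-batch loss computation changes."""

    def train_batch(self, batch, dry_run: bool = False):
        # 1. Forward pass to get log-probs for the sampled completions.
        # 2. Compute the GRPO loss (clipped ratio + KL penalty) in place
        #    of the default next-token cross-entropy.
        # 3. Backward as usual; the inherited optim_step applies updates.
        ...
```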
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Deduplicates the GRPO loss computation (DAPO/CISPO branching, clipping, KL penalty) from grpo_fast.py and olmo_core_train_modules.py into a shared function, following the DPOLossType enum pattern from dpo_utils.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
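A hedged sketch of what such a shared function can look like, with an enum selecting the loss variant in the spirit of the cited DPOLossType pattern. The names and exact branching are illustrative rather than the repository's actual code, and the KL penalty term is omitted for brevity:

```python
from enum import Enum

import torch


class GRPOLossType(Enum):
    GRPO = "grpo"    # standard two-sided clipping
    DAPO = "dapo"    # decoupled clip range (eps_low != eps_high)
    CISPO = "cispo"  # clip the importance weight, keep gradient on logprobs


def grpo_loss(
    logprobs: torch.Tensor,      # [B, T] current-policy log-probs
    old_logprobs: torch.Tensor,  # [B, T] rollout-time log-probs
    advantages: torch.Tensor,    # [B, T] group-normalized advantages
    loss_type: GRPOLossType,
    eps_low: float = 0.2,
    eps_high: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)
    if loss_type is GRPOLossType.CISPO:
        # CISPO: detach the clipped importance weight so gradients flow
        # through logprobs rather than through the ratio itself.
        weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
        loss = -(weight * advantages * logprobs)
    else:
        # GRPO/DAPO: PPO-style two-sided clipping; DAPO simply decouples
        # the lower and upper clip ranges.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
        loss = -torch.minimum(unclipped, clipped)
    return loss.mean()
```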
…variable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds callbacks for GRPO training with OLMo-core's Trainer:
- VLLMWeightSyncCallback: syncs weights to vLLM engines after each step
- RefPolicyUpdateCallback: Polyak averaging for reference policy updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
563461f to
477fe56
Compare
- Remove olmo_core_train_modules.py (belongs in PR #1412)
- Add name_mapper parameter for parameter name translation (e.g., OLMo-core to HF)
- Make deepspeed_stage optional, auto-detect FSDP models
- Use FSDP.summon_full_params for FSDP weight gathering
- Update _send_to_vllm to accept shape directly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
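A minimal sketch of the FSDP gathering-plus-renaming pattern described above. FSDP.summon_full_params is torch's real API; the send_to_vllm callable and the function name are illustrative stand-ins for the PR's _send_to_vllm plumbing:

```python
from typing import Callable, Optional

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def broadcast_fsdp_weights(
    model: FSDP,
    send_to_vllm: Callable[[str, torch.Size, torch.Tensor], None],
    name_mapper: Optional[Callable[[str], str]] = None,
) -> None:
    # Gather the full (unsharded) parameters, then stream each one out
    # under its (optionally remapped) name, e.g. translating OLMo-core
    # parameter names to HF-style names that vLLM expects.
    with FSDP.summon_full_params(model, writeback=False):
        for name, param in model.named_parameters():
            mapped = name_mapper(name) if name_mapper is not None else name
            send_to_vllm(mapped, param.shape, param.data)
```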
- Inline get_shape and map_name as ternary expressions
- Refactor broadcast_params to return refs instead of using nonlocal
- Add pre-commit hook to ban nonlocal keyword

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
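The nonlocal-to-return refactor in miniature (a generic illustration with a hypothetical do_send stub, not the actual broadcast_params code):

```python
def do_send(p):  # hypothetical transport stub
    return id(p)


# Before: a nested helper rebinds enclosing state via `nonlocal`.
def broadcast_before(params):
    refs = []

    def send(p):
        nonlocal refs
        refs = refs + [do_send(p)]

    for p in params:
        send(p)
    return refs


# After: the helper's result is returned and collected by the caller.
def broadcast_after(params):
    return [do_send(p) for p in params]
```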
Refactor to eliminate nested function and reduce code duplication:
- Add _broadcast_params_to_vllm with explicit parameters
- Simplify broadcast_weights_to_vllm to use the helper
- Add validation for FSDP with gather_whole_model=False

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _get_fsdp2_submodules() to find FSDP2-wrapped submodules
- Add _broadcast_fsdp2_block_params() for block-by-block unshard/reshard
- Support gather_whole_model=False for FSDP2 (reduces peak memory)
- Use unshard_and_reshard() context manager for FSDP2 with gather_whole_model=True
- Auto-detect DeepSpeed stage 3, remove deepspeed_stage parameter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
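A hedged sketch of the block-by-block FSDP2 pattern: unshard one FSDP2-wrapped block at a time, stream its parameters out, and reshard before moving on, so peak memory is one block's full parameters rather than the whole model's. The FSDPModule import path varies across torch releases, and send_to_vllm is a hypothetical transport:

```python
import torch
# FSDP2: fully_shard-wrapped modules expose unshard()/reshard().
# The import path below is for recent torch releases and may differ.
from torch.distributed._composable.fsdp import FSDPModule


def broadcast_fsdp2_blockwise(model: torch.nn.Module, send_to_vllm) -> None:
    # Only the FSDP2-wrapped *submodules*, mirroring _get_fsdp2_submodules().
    blocks = [m for m in model.modules() if isinstance(m, FSDPModule) and m is not model]
    for block in blocks:
        block.unshard()  # gather this block's full parameters
        try:
            for name, param in block.named_parameters():
                send_to_vllm(name, param.shape, param.data)  # hypothetical transport
        finally:
            block.reshard()  # free gathered memory before the next block
```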
hamishivi left a comment:
Happy to merge this, although we probably want to test these code chunks more thoroughly once more of the GRPO implementation is in.
Agreed! I did add some tests, but I agree completely.
Adds callbacks for GRPO training with OLMo-core's Trainer:
- VLLMWeightSyncCallback: syncs weights to vLLM engines after each step
- RefPolicyUpdateCallback: Polyak averaging for reference policy updates

Based on PR #1412 (GRPOTrainModule)
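Sketched as an OLMo-core-style callback, the post-step weight sync might look roughly like the following. The hook name and base-class import follow OLMo-core's callback convention as best I can tell, and broadcast_weights_to_vllm (the helper named in the commit history above) is stubbed here, so the wiring is illustrative rather than the PR's exact code:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: hook names follow OLMo-core's Callback convention,
# but the exact base class and signatures depend on the OLMo-core version.
from olmo_core.train.callbacks import Callback


def broadcast_weights_to_vllm(model, engines) -> None:
    """Stand-in for the PR's real helper (see the sketches above)."""
    ...


@dataclass
class VLLMWeightSyncCallback(Callback):
    vllm_engines: list = field(default_factory=list)

    def post_step(self) -> None:
        # After each optimizer step, push the updated policy weights to
        # every vLLM engine so generation uses the latest parameters.
        broadcast_weights_to_vllm(self.trainer.train_module.model, self.vllm_engines)
```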
GPU_TESTS=01KFKCWJQKNEB71EZSA93XRKWF