Add GRPO callbacks for OLMo-core Trainer (GRPO olmo-core: PR 3 of 5) #1397
base: finbarr/grpo-train-module
Conversation
Summary of Changes (gemini-code-assist)

This pull request introduces foundational components for Group Relative Policy Optimization (GRPO) within the OLMo-core framework. It establishes new callback mechanisms for managing model weight synchronization with vLLM inference engines and for updating reference policies, alongside a dedicated training module that implements the GRPO algorithm. These additions lay the groundwork for reinforcement learning from human feedback (RLHF) training workflows, specifically for the OLMo-core actor.
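For orientation, here is a minimal sketch of the Polyak (exponential moving average) update at the heart of a reference-policy callback like the one described. The function name and the alpha default are illustrative, not the PR's actual API:

```python
import torch


@torch.no_grad()
def polyak_update(ref_model: torch.nn.Module, policy: torch.nn.Module, alpha: float = 0.05) -> None:
    """Nudge the reference policy toward the current policy:
    ref <- (1 - alpha) * ref + alpha * policy."""
    for ref_p, p in zip(ref_model.parameters(), policy.parameters()):
        ref_p.mul_(1.0 - alpha).add_(p, alpha=alpha)
```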
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a8a137290
Code Review
This pull request introduces GRPO-specific callbacks and a training module for OLMo-core, which are foundational components for the OLMo-core actor. The changes include VLLMWeightSyncCallback for synchronizing weights, RefPolicyUpdateCallback for Polyak updates, and GRPOTrainModule for the GRPO training algorithm. The pyproject.toml file has been updated to include these new files for type checking. Overall, the code is well-structured and follows existing patterns, but there are a few areas for improvement regarding type safety, code duplication, and error handling specificity.
…e: PR 2 of 5) Adds a GRPO training module that subclasses OLMo-core's TransformerTrainModule. This inherits optim_step, zero_grads, eval_batch, state_dict, etc. and only overrides train_batch with the GRPO loss computation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
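A minimal sketch of that subclassing shape, under the assumption that OLMo-core's TransformerTrainModule exposes the hooks named in the commit message (the exact import path and train_batch signature may differ by OLMo-core version):

```python
# Hypothetical sketch: the exact base-class import path and the
# train_batch signature depend on the OLMo-core version in use.
from olmo_core.train.train_module import TransformerTrainModule


class GRPOTrainModule(TransformerTrainModule):
    """Reuses optim_step, zero_grads, eval_batch, state_dict, etc. from
    the base class; only the per-batch loss computation changes."""

    def train_batch(self, batch, dry_run: bool = False):
        # 1. Forward pass to get log-probs for the sampled completions.
        # 2. Compute the GRPO loss (clipped ratio + KL penalty) in place
        #    of the default next-token cross-entropy.
        # 3. Backward as usual; the inherited optim_step applies updates.
        ...
```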
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Deduplicates the GRPO loss computation (DAPO/CISPO branching, clipping, KL penalty) from grpo_fast.py and olmo_core_train_modules.py into a shared function, following the DPOLossType enum pattern from dpo_utils.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
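A hedged sketch of what such a shared function can look like, with an enum selecting the loss variant in the spirit of the cited DPOLossType pattern. The names and exact branching are illustrative rather than the repository's actual code, and the KL penalty term is omitted for brevity:

```python
from enum import Enum

import torch


class GRPOLossType(Enum):
    GRPO = "grpo"    # standard two-sided clipping
    DAPO = "dapo"    # decoupled clip range (eps_low != eps_high)
    CISPO = "cispo"  # clip the importance weight, keep gradient on logprobs


def grpo_loss(
    logprobs: torch.Tensor,      # [B, T] current-policy log-probs
    old_logprobs: torch.Tensor,  # [B, T] rollout-time log-probs
    advantages: torch.Tensor,    # [B, T] group-normalized advantages
    loss_type: GRPOLossType,
    eps_low: float = 0.2,
    eps_high: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)
    if loss_type is GRPOLossType.CISPO:
        # CISPO: detach the clipped importance weight so gradients flow
        # through logprobs rather than through the ratio itself.
        weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
        loss = -(weight * advantages * logprobs)
    else:
        # GRPO/DAPO: PPO-style two-sided clipping; DAPO simply decouples
        # the lower and upper clip ranges.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
        loss = -torch.minimum(unclipped, clipped)
    return loss.mean()
```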
…variable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds callbacks for GRPO training with OLMo-core's Trainer:
- VLLMWeightSyncCallback: syncs weights to vLLM engines after each step
- RefPolicyUpdateCallback: Polyak averaging for reference policy updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
563461f to
477fe56
Compare
- Remove olmo_core_train_modules.py (belongs in PR #1412)
- Add name_mapper parameter for parameter name translation (e.g., OLMo-core to HF)
- Make deepspeed_stage optional, auto-detect FSDP models
- Use FSDP.summon_full_params for FSDP weight gathering
- Update _send_to_vllm to accept shape directly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
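A minimal sketch of the FSDP gathering-plus-renaming pattern described above. FSDP.summon_full_params is torch's real API; the send_to_vllm callable and the function name are illustrative stand-ins for the PR's _send_to_vllm plumbing:

```python
from typing import Callable, Optional

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def broadcast_fsdp_weights(
    model: FSDP,
    send_to_vllm: Callable[[str, torch.Size, torch.Tensor], None],
    name_mapper: Optional[Callable[[str], str]] = None,
) -> None:
    # Gather the full (unsharded) parameters, then stream each one out
    # under its (optionally remapped) name, e.g. translating OLMo-core
    # parameter names to HF-style names that vLLM expects.
    with FSDP.summon_full_params(model, writeback=False):
        for name, param in model.named_parameters():
            mapped = name_mapper(name) if name_mapper is not None else name
            send_to_vllm(mapped, param.shape, param.data)
```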
- Inline get_shape and map_name as ternary expressions
- Refactor broadcast_params to return refs instead of using nonlocal
- Add pre-commit hook to ban nonlocal keyword

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
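The nonlocal-to-return refactor in miniature (a generic illustration with a hypothetical do_send stub, not the actual broadcast_params code):

```python
def do_send(p):  # hypothetical transport stub
    return id(p)


# Before: a nested helper rebinds enclosing state via `nonlocal`.
def broadcast_before(params):
    refs = []

    def send(p):
        nonlocal refs
        refs = refs + [do_send(p)]

    for p in params:
        send(p)
    return refs


# After: the helper's result is returned and collected by the caller.
def broadcast_after(params):
    return [do_send(p) for p in params]
```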
Refactor to eliminate nested function and reduce code duplication:
- Add _broadcast_params_to_vllm with explicit parameters
- Simplify broadcast_weights_to_vllm to use the helper
- Add validation for FSDP with gather_whole_model=False

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _get_fsdp2_submodules() to find FSDP2-wrapped submodules
- Add _broadcast_fsdp2_block_params() for block-by-block unshard/reshard
- Support gather_whole_model=False for FSDP2 (reduces peak memory)
- Use unshard_and_reshard() context manager for FSDP2 with gather_whole_model=True
- Auto-detect DeepSpeed stage 3, remove deepspeed_stage parameter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
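A hedged sketch of the block-by-block FSDP2 pattern: unshard one FSDP2-wrapped block at a time, stream its parameters out, and reshard before moving on, so peak memory is one block's full parameters rather than the whole model's. The FSDPModule import path varies across torch releases, and send_to_vllm is a hypothetical transport:

```python
import torch
# FSDP2: fully_shard-wrapped modules expose unshard()/reshard().
# The import path below is for recent torch releases and may differ.
from torch.distributed._composable.fsdp import FSDPModule


def broadcast_fsdp2_blockwise(model: torch.nn.Module, send_to_vllm) -> None:
    # Only the FSDP2-wrapped *submodules*, mirroring _get_fsdp2_submodules().
    blocks = [m for m in model.modules() if isinstance(m, FSDPModule) and m is not model]
    for block in blocks:
        block.unshard()  # gather this block's full parameters
        try:
            for name, param in block.named_parameters():
                send_to_vllm(name, param.shape, param.data)  # hypothetical transport
        finally:
            block.reshard()  # free gathered memory before the next block
```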
hamishivi left a comment:
Happy to merge this, although we probably want to test these code chunks more thoroughly once more of the GRPO implementation is in.
Agreed! I did add some tests, but I agree completely.
Adds callbacks for GRPO training with OLMo-core's Trainer:
- VLLMWeightSyncCallback: syncs weights to vLLM engines after each step
- RefPolicyUpdateCallback: Polyak averaging for reference policy updates

Based on PR #1412 (GRPOTrainModule)
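Sketched as an OLMo-core-style callback, the post-step weight sync might look roughly like the following. The hook name and base-class import follow OLMo-core's callback convention as best I can tell, and broadcast_weights_to_vllm (the helper named in the commit history above) is stubbed here, so the wiring is illustrative rather than the PR's exact code:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: hook names follow OLMo-core's Callback convention,
# but the exact base class and signatures depend on the OLMo-core version.
from olmo_core.train.callbacks import Callback


def broadcast_weights_to_vllm(model, engines) -> None:
    """Stand-in for the PR's real helper (see the sketches above)."""
    ...


@dataclass
class VLLMWeightSyncCallback(Callback):
    vllm_engines: list = field(default_factory=list)

    def post_step(self) -> None:
        # After each optimizer step, push the updated policy weights to
        # every vLLM engine so generation uses the latest parameters.
        broadcast_weights_to_vllm(self.trainer.train_module.model, self.vllm_engines)
```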
GPU_TESTS=01KFKCWJQKNEB71EZSA93XRKWF