Add OLMo-core Ray actor (GRPO olmo-core: PR 4 of 5) #1398
base: finbarr/grpo-callbacks-module
Conversation
…ion: PR 1 of 4)

This refactoring extracts the shared configuration class that both grpo_fast.py (existing DeepSpeed trainer) and the upcoming grpo.py (new OLMo-core trainer) need.

- Create grpo_utils.py with ExperimentConfig dataclass (moved from grpo_fast.py Args)
- Update grpo_fast.py to import from grpo_utils
- Update benchmark_generators.py to import from grpo_utils

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…: PR 2 of 4)

Add foundational components for the OLMo-core GRPO trainer:

- grpo_callbacks.py: VLLMWeightSyncCallback, RefPolicyUpdateCallback, olmo_core_to_hf_name()
- olmo_core_train_modules.py: GRPOTrainModule class for OLMo-core training
- pyproject.toml: Add both files to type checking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add the Ray actor that wraps OLMo-core training:

- grpo_olmo_core_actor.py: PolicyTrainerOLMoCoreProcess and OLMoCoreModelGroup classes
- pyproject.toml: Add file to type checking

The actor coordinates distributed training via torch.distributed for FSDP gradient synchronization.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request integrates OLMo-core's training capabilities into a distributed GRPO training system by introducing a dedicated Ray actor and its manager. This enables scalable training of large language models across multiple GPUs and nodes, leveraging Ray for distributed orchestration. The changes make OLMo-core models usable within a distributed reinforcement learning framework, supporting various training configurations and the callbacks needed for monitoring and weight synchronization.
Code Review
This pull request introduces a new Ray actor (grpo_olmo_core_actor.py) for distributed GRPO training using OLMo-core. The implementation wraps OLMo-core's training infrastructure, including model setup, data loading, and callback management for weight synchronization and reference policy updates. The pyproject.toml file has been updated to include the new file for type checking. Overall, the code is well-structured and follows the intended design for integrating OLMo-core with Ray for distributed training. However, there are several areas for improvement regarding robustness, consistency, and clarity, particularly in parameter handling, type consistency, and resource allocation.
```python
    for i, engine in enumerate(vllm_engines)
]

torch.cuda.set_device(0)
```
Hardcoding torch.cuda.set_device(0) is problematic in a distributed Ray actor setup where each actor is assigned a specific GPU. Instead, torch.cuda.set_device(self.local_rank) should be used to ensure each actor utilizes its assigned GPU, preventing resource contention and incorrect device usage.
```diff
- torch.cuda.set_device(0)
+ torch.cuda.set_device(self.local_rank)
```
```python
if self.single_gpu_mode:
    logger.info(f"[Rank {self.rank}] Converting model to bfloat16 for single_gpu_mode")
    self.model = self.model.to(dtype=torch.bfloat16)
```
The model is converted to bfloat16 only if single_gpu_mode is enabled. However, the dp_config (lines 195-201) specifies param_dtype=DType.bfloat16 for distributed mode. This inconsistency could lead to type mismatches or unexpected behavior if the model is not explicitly converted to bfloat16 in distributed mode, or if build does not handle it automatically. Ensure type consistency across all modes.
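A minimal sketch of one way to surface such a mismatch, assuming the attribute names from the snippet above and an expected parameter dtype of bfloat16 (illustrative, not part of the PR):

```python
# Illustrative consistency check: verify the parameters ended up in the dtype
# the configuration requested, regardless of single_gpu_mode.
expected_dtype = torch.bfloat16
actual_dtypes = {p.dtype for p in self.model.parameters()}
if actual_dtypes != {expected_dtype}:
    logger.warning(f"[Rank {self.rank}] parameter dtypes {actual_dtypes} != expected {expected_dtype}")
```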
```python
def __init__(
    self,
    world_size: int,
    rank: int,
    local_rank: int,
    master_addr: str | None,
    master_port: int | None,
    num_nodes: int,
    local_world_size: int,
    model_name_or_path: str,
    grpo_config: grpo_utils.ExperimentConfig,
    learning_rate: float,
    weight_decay: float,
    max_grad_norm: float,
    lr_scheduler_type: str,
    warm_up_steps: int,
    warmup_ratio: float,
    num_training_steps: int,
    num_epochs: int,
    num_mini_batches: int,
    per_device_train_batch_size: int,
    max_sequence_length: int,
    single_gpu_mode: bool,
    load_ref_policy: bool,
    beta: float,
    seed: int,
    output_dir: str,
    streaming_config: data_loader_lib.StreamingDataLoaderConfig,
    vllm_config: data_loader_lib.VLLMConfig,
    data_prep_actor_name: str,
    tokenizer: transformers.PreTrainedTokenizer,
):
```
The __init__ method has a very large number of parameters. While many are necessary, consider grouping related configuration parameters into dedicated dataclasses (e.g., ModelConfig, TrainingConfig) to improve readability, maintainability, and reduce the method signature's complexity. This makes it easier to understand and manage the configuration.
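As a rough sketch of that suggestion (the DistributedConfig, OptimizerConfig, and BatchConfig names and their groupings are hypothetical, not part of this PR), the signature could collapse into a few config objects:

```python
from dataclasses import dataclass


@dataclass
class DistributedConfig:
    world_size: int
    rank: int
    local_rank: int
    master_addr: str | None
    master_port: int | None
    num_nodes: int
    local_world_size: int


@dataclass
class OptimizerConfig:
    learning_rate: float
    weight_decay: float
    max_grad_norm: float
    lr_scheduler_type: str
    warm_up_steps: int
    warmup_ratio: float


@dataclass
class BatchConfig:
    num_training_steps: int
    num_epochs: int
    num_mini_batches: int
    per_device_train_batch_size: int
    max_sequence_length: int
```

`__init__` would then accept `dist: DistributedConfig, optim: OptimizerConfig, batches: BatchConfig` alongside the remaining model, data, and tokenizer arguments.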
```python
model_basename = self.model_name_or_path.split("/")[-1]
config_name = model_basename.replace("-", "_").replace(".", "_")
config_name = config_name[:-1].lower() + "B" if config_name.endswith("B") else config_name.lower()
```
The logic for deriving config_name from model_basename is specific and might be fragile. It assumes a particular naming convention (e.g., ending with 'B' for billion) and converts to lowercase. This could lead to issues if model names deviate from this pattern or if TransformerConfig expects specific casing. Consider making this more robust or adding comments about expected model naming conventions.
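For illustration only (the checkpoint name below is hypothetical, and whether the resulting classmethod exists depends on the installed OLMo-core version), the derivation behaves like this:

```python
# Hypothetical walk-through of the naming logic above.
model_name_or_path = "allenai/OLMo-2-1124-7B"                      # illustrative checkpoint name
model_basename = model_name_or_path.split("/")[-1]                 # "OLMo-2-1124-7B"
config_name = model_basename.replace("-", "_").replace(".", "_")   # "OLMo_2_1124_7B"
config_name = config_name[:-1].lower() + "B" if config_name.endswith("B") else config_name.lower()
# -> "olmo_2_1124_7B"; if TransformerConfig exposes no classmethod with exactly
#    this name, the hasattr check below raises ValueError.
```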
```python
if not hasattr(TransformerConfig, config_name):
    available = [
        m for m in dir(TransformerConfig) if not m.startswith("_") and callable(getattr(TransformerConfig, m))
    ]
    raise ValueError(f"No TransformerConfig.{config_name}() found. Available: {available}")
```
```python
if hasattr(self, "vllm_engines") and self.vllm_engines:
    trainer_callbacks["vllm_sync"] = VLLMWeightSyncCallback(
        vllm_engines=self.vllm_engines,
        model_update_group=getattr(self, "model_update_group", None),
        actor_manager=getattr(self, "actor_manager", None),
        name_mapper=olmo_core_to_hf_name,
```
Using getattr(self, "vllm_engines", None) and getattr(self, "actor_manager", None) with a default of None can mask potential issues. If vllm_engines or actor_manager are expected to be initialized by setup_model_update_group and setup_callbacks respectively, it's better to ensure they are always set or raise an explicit error if they are not, rather than silently proceeding with None.
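A fail-fast variant of the snippet above, sketched under the assumption that setup_model_update_group() is the step that sets these attributes (the error message is illustrative):

```python
# Hypothetical fail-fast alternative to the getattr defaults.
if not getattr(self, "vllm_engines", None):
    raise RuntimeError(
        "vllm_engines is not set; call setup_model_update_group() before setup_callbacks()"
    )
trainer_callbacks["vllm_sync"] = VLLMWeightSyncCallback(
    vllm_engines=self.vllm_engines,
    model_update_group=self.model_update_group,  # AttributeError here if setup was skipped
    actor_manager=self.actor_manager,
    name_mapper=olmo_core_to_hf_name,
)
```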
```python
if is_beaker_job() and hasattr(self, "json_config"):
    trainer_callbacks["beaker"] = BeakerCallbackV2(config=self.json_config)
```
```python
):
    self.pg = pg
    self.num_gpus_per_node = num_gpus_per_node
    self.num_gpus_per_actor = 0.48 if single_gpu_mode else 1
```
The num_gpus_per_actor is set to 0.48 in single_gpu_mode. While Ray supports fractional GPU requests, 0.48 is an unusual value. If the intention is to allow two actors on a single GPU, 0.5 would be more conventional and clearer. Consider if this specific value is intentional and justified, or if 0.5 would be more appropriate.
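For reference, a minimal Ray sketch of the two-actors-per-GPU packing the comment alludes to (the actor class is illustrative):

```python
import ray

ray.init()


# Illustrative only: with num_gpus=0.5, Ray can co-locate two of these actors on
# one physical GPU; 0.48 packs the same way but reads as an arbitrary constant.
@ray.remote(num_gpus=0.5)
class ToyTrainerActor:
    def ping(self) -> str:
        return "ok"


actors = [ToyTrainerActor.remote() for _ in range(2)]  # both fit on a single GPU
print(ray.get([a.ping.remote() for a in actors]))
```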
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 386a649e76
```python
state_dict = {}
model = self.train_module.model
for name, param in model.named_parameters():
    hf_name = olmo_core_to_hf_name(name)
    state_dict[hf_name] = param.data.cpu()
```
Gather full FSDP weights before saving checkpoint
When world_size > 1 and single_gpu_mode is false, setup_model enables HSDP/FSDP, so self.train_module.model is sharded across ranks. This save_model implementation saves only model.named_parameters() from rank 0 without FSDP.summon_full_params()/state_dict() gathering, so the checkpoint will contain only local shards (and no full-state buffers) and will be incomplete or unloadable in distributed runs. Consider using the train module’s state dict or FSDP full-parameter gathering before writing pytorch_model.bin so multi-GPU training produces a valid checkpoint.
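A sketch of one way to follow that suggestion using PyTorch's full-state-dict API, assuming the wrapped model is a torch FSDP module and that olmo_core_to_hf_name (from the callbacks module added in the earlier PR) is in scope; OLMo-core's own train-module checkpointing utilities would be a reasonable alternative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullStateDictConfig, StateDictType
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def save_full_checkpoint(model: torch.nn.Module, path: str, rank: int) -> None:
    # Gather the full (unsharded) parameters, offloaded to CPU and materialized
    # only on rank 0, instead of saving each rank's local shard.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        full_state = model.state_dict()
    if rank == 0:
        # olmo_core_to_hf_name is assumed importable from the GRPO callbacks module.
        hf_state = {olmo_core_to_hf_name(k): v.cpu() for k, v in full_state.items()}
        torch.save(hf_state, path)  # e.g. <output_dir>/pytorch_model.bin
    if dist.is_initialized():
        dist.barrier()
```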
Force-pushed 8c2e075 to 0baad0b
Force-pushed 563461f to 477fe56
Summary
- `grpo_olmo_core_actor.py` with `PolicyTrainerOLMoCoreProcess` and `OLMoCoreModelGroup` classes
- `pyproject.toml`

This Ray actor wraps OLMo-core's training infrastructure for distributed GRPO training across multiple GPUs and nodes.
Depends on: #1397
Test plan
- `make style && make quality` passes

🤖 Generated with Claude Code