Conversation

@finbarrtimbers (Collaborator)

Summary

  • Add grpo.py main training orchestration script using OLMo-core's Trainer
  • Add test scripts:
    • scripts/train/debug/single_gpu_grpo.sh - Single GPU test
    • scripts/train/debug/multi_node_grpo.sh - Multi-node test
    • scripts/train/debug/tool_grpo.sh - Tool use test
  • Add grpo.py to type checking in pyproject.toml
  • Update CHANGELOG.md to document the new trainer

This is the final integration that brings together all the OLMo-core GRPO components.

Depends on: #1398

Test plan

  • make style && make quality passes
  • Run scripts/train/debug/single_gpu_grpo.sh successfully
  • Run scripts/train/debug/tool_grpo.sh successfully

🤖 Generated with Claude Code

@gemini-code-assist (Contributor)

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers the foundational implementation for Group Relative Policy Optimization (GRPO) training, integrating it deeply with the OLMo-core framework. The changes enable robust distributed training capabilities, shifting from DeepSpeed to FSDP, and introduce comprehensive scripting for various training scenarios, including single-GPU, multi-node, and tool-augmented training. This is a crucial step in completing the OLMo-core GRPO component integration, providing a unified and scalable training solution.

Highlights

  • GRPO Core Implementation: Introduces grpo.py as the main entry point for Group Relative Policy Optimization (GRPO) training, leveraging OLMo-core's Trainer and replacing DeepSpeed with FSDP for distributed training.
  • New Debug Scripts: Adds three new debug scripts (single_gpu_grpo.sh, multi_node_grpo.sh, tool_grpo.sh) to facilitate testing of single-GPU, multi-node, and tool-use GRPO training configurations.
  • Type Checking Integration: Integrates the new grpo.py module into the project's type checking configuration in pyproject.toml to maintain code quality.
  • Changelog Update: Updates CHANGELOG.md to document the addition of the new OLMo-core based GRPO trainer.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces the main entry point and associated scripts for GRPO training using OLMo-core's Trainer. It also updates the CHANGELOG.md and pyproject.toml for type checking. The changes integrate various components for distributed training with Ray and vLLM, including experiment tracking, tool initialization, and model saving. The new scripts provide debug configurations for single-GPU, multi-node, and tool-use scenarios.


```python
wait_for_gpus(sum(args.num_learners_per_node))

bundles = [{"GPU": n, "CPU": n} for n in args.num_learners_per_node]
```
Severity: medium

The Ray placement group bundles allocate an equal number of CPUs and GPUs ("CPU": n for each "GPU": n). While this might be sufficient for GPU-heavy tasks, Ray actors often require more CPU resources for orchestration, data handling, or other background processes. Consider allocating more CPUs per bundle (e.g., "CPU": n * 2 or a fixed higher number) to prevent potential CPU bottlenecks that could impact overall training efficiency.
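
A minimal sketch of that over-provisioning (the 2× factor is illustrative, not a measured requirement; `args.num_learners_per_node` comes from the snippet above):

```python
# Illustrative only: reserve CPU headroom beyond one core per GPU so Ray's
# orchestration and data-handling work is not starved. The factor of 2 is an
# assumption, not a tuned value.
bundles = [{"GPU": n, "CPU": n * 2} for n in args.num_learners_per_node]
```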

Comment on lines +332 to +357
```python
vllm_engines = vllm_utils.create_vllm_engines(
    vllm_config.vllm_num_engines,
    vllm_config.vllm_tensor_parallel_size,
    vllm_config.vllm_enforce_eager,
    tc.tokenizer_name_or_path,
    model_config.model_name_or_path,
    model_config.model_revision,
    args.seed,
    vllm_config.vllm_enable_prefix_caching,
    streaming_config.max_prompt_token_length + streaming_config.response_length,
    vllm_config.vllm_gpu_memory_utilization,
    args.single_gpu_mode,
    pg=pg if args.single_gpu_mode else None,
    tool_actors=tool_actors,
    tool_parser_type=tools_config.tool_parser_type,
    max_tool_calls=tools_config.max_tool_calls,
    mask_tool_use=streaming_config.mask_tool_use,
    prompt_queue=prompt_Q,
    results_queue=inference_results_Q,
    eval_results_queue=evaluation_inference_results_Q,
    actor_manager=actor_manager,
    inflight_updates=streaming_config.inflight_updates,
    reward_config=reward_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
```
Severity: medium

The call to vllm_utils.create_vllm_engines is very long, spanning multiple lines with many parameters. This reduces readability and makes it harder to understand the configuration at a glance. Consider refactoring this by creating a dedicated configuration object (e.g., a dataclass) for vLLM engine parameters and passing that object instead, or by grouping related parameters into smaller, more manageable variables.
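
One possible shape for that refactor, sketched as a suggestion only: `VllmEngineParams` and its fields are hypothetical names mirroring the keyword arguments above, the config objects (`vllm_config`, `streaming_config`) are assumed to be in scope as in the snippet, and `create_vllm_engines` would need to accept or unpack such an object.

```python
from dataclasses import dataclass


# Hypothetical grouping of the scalar engine-shape arguments; names mirror
# the existing call. Queue/actor/dataset wiring stays explicit at the call site.
@dataclass(frozen=True)
class VllmEngineParams:
    num_engines: int
    tensor_parallel_size: int
    enforce_eager: bool
    enable_prefix_caching: bool
    gpu_memory_utilization: float
    max_model_len: int


engine_params = VllmEngineParams(
    num_engines=vllm_config.vllm_num_engines,
    tensor_parallel_size=vllm_config.vllm_tensor_parallel_size,
    enforce_eager=vllm_config.vllm_enforce_eager,
    enable_prefix_caching=vllm_config.vllm_enable_prefix_caching,
    gpu_memory_utilization=vllm_config.vllm_gpu_memory_utilization,
    max_model_len=streaming_config.max_prompt_token_length + streaming_config.response_length,
)
```

Grouping the scalar settings this way keeps the remaining call focused on the queues, actors, and datasets that genuinely vary per run.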

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 811ba9adde


```python
wait_for_gpus(sum(args.num_learners_per_node))

bundles = [{"GPU": n, "CPU": n} for n in args.num_learners_per_node]
```


P1: Reserve enough CPUs in the Ray placement group

The placement-group bundles reserve CPU equal to the GPU count ({"GPU": n, "CPU": n}), but each PolicyTrainerOLMoCoreProcess actor requests 4 CPUs (num_cpus_per_actor = 4 in open_instruct/grpo_olmo_core_actor.py:391-409). On a 1‑GPU run this makes the bundle provide only 1 CPU while the actor needs 4, so the actors cannot be scheduled and the training will hang at pg.ready() or actor creation. The bundle CPU should scale to at least 4 * n (or whatever the actor CPU requirement is) to make the placement group feasible.
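
A sketch of that fix, assuming the 4-CPU requirement is lifted into a shared constant (`NUM_CPUS_PER_ACTOR` is a hypothetical name; the value mirrors the `num_cpus_per_actor = 4` cited above):

```python
# Hypothetical constant mirroring num_cpus_per_actor = 4 in
# open_instruct/grpo_olmo_core_actor.py: each learner actor asks Ray for
# 4 CPUs, so every bundle must reserve at least that many per actor.
NUM_CPUS_PER_ACTOR = 4

bundles = [
    {"GPU": n, "CPU": n * NUM_CPUS_PER_ACTOR} for n in args.num_learners_per_node
]
```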


Comment on lines +138 to +140
```python
logger.info(f"Only {available_gpus} GPUs available, waiting for {expected_gpus}...")
time.sleep(poll_interval)
logger.error(f"Timeout waiting for GPUs. Only {available_gpus} available, needed {expected_gpus}")
```


P2: Fail fast when GPUs never appear in the cluster

When the Ray cluster never reaches the expected GPU count, wait_for_gpus only logs an error and then returns, so the code proceeds to create a placement group that will block indefinitely. This means a misconfigured or undersized cluster will hang the job instead of terminating with a clear failure. Consider raising an exception (or exiting) after the timeout so the run fails fast in that scenario.
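
A sketch of the fail-fast variant; the function signature, timeout default, and logger setup are assumptions reconstructed from the snippet above:

```python
import logging
import time

import ray

logger = logging.getLogger(__name__)


def wait_for_gpus(expected_gpus: int, timeout: float = 600.0, poll_interval: float = 5.0) -> None:
    """Block until the Ray cluster reports enough GPUs, raising instead of returning on timeout."""
    deadline = time.monotonic() + timeout
    available_gpus = 0
    while time.monotonic() < deadline:
        available_gpus = int(ray.cluster_resources().get("GPU", 0))
        if available_gpus >= expected_gpus:
            return
        logger.info(f"Only {available_gpus} GPUs available, waiting for {expected_gpus}...")
        time.sleep(poll_interval)
    # Raising here keeps a misconfigured or undersized cluster from hanging later at pg.ready().
    raise TimeoutError(f"Timed out waiting for GPUs: {available_gpus} available, needed {expected_gpus}")
```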


@finbarrtimbers changed the title from "Add GRPO main entry point and scripts (GRPO olmo-core implementation: PR 4 of 4)" to "Add GRPO main entry point and scripts (GRPO olmo-core: PR 5 of 5)" on Jan 22, 2026.