Add validation reward tracking for GRPO training #1369
base: main
Conversation
Summary of Changes

Hello @natolambert, I'm Gemini Code Assist! Here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances training by introducing a validation holdout mechanism for GRPO, which is useful for monitoring overfitting during reinforcement learning. It also expands hardware compatibility with experimental support for NVIDIA DGX Spark, including the necessary dependency updates and configuration guidance. Together, these changes provide more robust training diagnostics and broader platform accessibility for large language model development.
Code Review
This pull request introduces a valuable validation reward tracking feature for GRPO training, allowing for better monitoring of overfitting. The implementation, including the new `validation_holdout_ratio` parameter and dataset splitting logic, is well-executed and thoroughly documented. Additionally, the PR adds experimental support for DGX Spark, updating dependencies and configurations appropriately. The changes are logical and well-structured. I've included a suggestion to refactor the dataset loading logic to avoid unnecessarily loading the evaluation dataset when a validation holdout is used, which will improve efficiency.
```python
if validation_dataset is not None:
    if eval_dataset is not None:
        logger.warning(
            "⚠️ Both validation_holdout_ratio and dataset_mixer_eval_list are specified. "
            "Using validation holdout for 'eval/' metrics (to track overfitting). "
            "The separate eval_dataset (test set) will not be used."
        )
    # Use validation holdout for evaluation metrics
    eval_dataset = validation_dataset
    logger.info(
        f"🎯 Using validation holdout ({len(eval_dataset)} samples) for evaluation metrics. "
        "This tracks accuracy on held-out training data to detect overfitting."
    )
```
Following the suggested change in `setup_datasets` to avoid loading the standard eval set when a validation holdout is used, this check should be updated. Instead of checking `if eval_dataset is not None`, we should check the configuration directly to see whether a standard eval set was specified (`len(streaming_config.dataset_mixer_eval_list) > 0`). This ensures the warning is still logged correctly when both are configured, even though the standard eval set is no longer loaded.
Suggested change:

```diff
 if validation_dataset is not None:
-    if eval_dataset is not None:
+    if len(streaming_config.dataset_mixer_eval_list) > 0:
         logger.warning(
             "⚠️ Both validation_holdout_ratio and dataset_mixer_eval_list are specified. "
             "Using validation holdout for 'eval/' metrics (to track overfitting). "
             "The separate eval_dataset (test set) will not be used."
         )
     # Use validation holdout for evaluation metrics
     eval_dataset = validation_dataset
     logger.info(
         f"🎯 Using validation holdout ({len(eval_dataset)} samples) for evaluation metrics. "
         "This tracks accuracy on held-out training data to detect overfitting."
     )
```
This feature allows holding out a portion of training data for validation to detect overfitting during RL training.

Changes:
- Add validation_holdout_ratio parameter to StreamingDataLoaderConfig
- Implement dataset splitting in setup_datasets() with proper index reset
- Use validation holdout for eval/ metrics when enabled
- Fix create_tools() to handle the tools=None case
- Add documentation and example script for DGX Spark

Usage: --validation_holdout_ratio 0.1  # Hold out 10% for validation

The held-out data appears in eval/ metrics, allowing tracking of accuracy on unseen training data to detect overfitting.

Known limitation: Eval metrics may time out during training due to the short timeout (0.01s). The full eval appears at the end of training.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from 6e1f2a5 to 203c5c6.
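The commit message above describes dataset splitting in setup_datasets() with proper index reset. Below is a minimal sketch of how such a split could be done with Hugging Face `datasets`; the function name `split_validation_holdout` and the default seed are assumptions, not the PR's actual code:

```python
from typing import Optional, Tuple

from datasets import Dataset


def split_validation_holdout(
    train_dataset: Dataset, holdout_ratio: float, seed: int = 42
) -> Tuple[Dataset, Optional[Dataset]]:
    """Hold out a fraction of the training data for validation-reward tracking."""
    if holdout_ratio <= 0.0:
        return train_dataset, None
    split = train_dataset.train_test_split(test_size=holdout_ratio, seed=seed)
    # flatten_indices() rewrites each split with contiguous indices, so downstream
    # code sees plain, re-indexed datasets ("proper index reset").
    return split["train"].flatten_indices(), split["test"].flatten_indices()
```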
Summary
- Add `validation_holdout_ratio` parameter to hold out a portion of training data for validation
- Held-out data appears under the `eval/` prefix in wandb (sketched below)
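As a rough illustration of the `eval/` prefix mentioned above, held-out rewards might be logged to wandb along these lines. The metric key `eval/reward_mean` and the aggregation are assumptions, not the PR's exact names:

```python
import wandb


def log_validation_rewards(rewards: list[float], step: int) -> None:
    """Log the mean held-out reward so overfitting shows up as a train/eval gap."""
    # Assumed metric key; the PR only states that held-out metrics use the eval/ prefix.
    if rewards:
        wandb.log({"eval/reward_mean": sum(rewards) / len(rewards)}, step=step)
```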
Changes
- Add `validation_holdout_ratio` parameter to `StreamingDataLoaderConfig` (0.0-0.5); see the config sketch below
- Implement dataset splitting in `setup_datasets()` with proper index reset
- Fix `create_tools()` to handle the `tools=None` case
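A minimal sketch of the new config field and the documented 0.0-0.5 bound; the trimmed-down class and the `__post_init__` check below are assumptions rather than the actual `StreamingDataLoaderConfig`:

```python
from dataclasses import dataclass


@dataclass
class StreamingDataLoaderConfig:
    """Reduced to the field relevant here; the real config has many more options."""

    validation_holdout_ratio: float = 0.0
    """Fraction of training data held out for eval/ metrics (0.0 disables the holdout)."""

    def __post_init__(self):
        if not 0.0 <= self.validation_holdout_ratio <= 0.5:
            raise ValueError(
                f"validation_holdout_ratio must be in [0.0, 0.5], got {self.validation_holdout_ratio}"
            )
```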
Usage

```bash
# Hold out 10% of the training data for validation
uv run python open_instruct/grpo_fast.py \
    --validation_holdout_ratio 0.1 \
    ...
```

Test Results
Known Limitations
- Eval metrics may time out during training due to the short timeout (0.01s); the full eval appears at the end of training.
Documentation

See `docs/VALIDATION_REWARD_TRACKING.md` for full documentation.

🤖 Generated with Claude Code