
Conversation

@natolambert
Collaborator

Summary

  • Add validation_holdout_ratio parameter to hold out a portion of training data for validation
  • Track accuracy on held-out training data to detect overfitting during RL training
  • Validation metrics appear under the eval/ prefix in wandb (see the sketch after this list)
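A minimal sketch of how holdout metrics land under the eval/ namespace in wandb, using the standard wandb API; the project name and metric keys are illustrative, not the PR's exact names.

    import wandb

    # Illustrative keys: any metric logged with an "eval/" prefix is grouped
    # under the eval section in the wandb UI. The real keys come from the trainer.
    wandb.init(project="grpo-validation-demo")  # project name is a placeholder
    wandb.log({"eval/holdout_accuracy": 0.73, "eval/holdout_reward": 1.2}, step=100)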

Changes

  1. Add validation_holdout_ratio parameter to StreamingDataLoaderConfig (0.0-0.5)
  2. Implement dataset splitting in setup_datasets() with proper index reset (sketched after this list)
  3. Use validation holdout for eval metrics when enabled
  4. Fix create_tools() to handle tools=None case
  5. Add documentation and DGX Spark example script
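A minimal sketch of the holdout split in item 2, assuming a Hugging Face datasets.Dataset; the helper name split_validation_holdout is hypothetical, and the PR's actual setup_datasets() wiring differs.

    from datasets import Dataset

    def split_validation_holdout(train_dataset: Dataset, holdout_ratio: float, seed: int = 42):
        # Hypothetical helper; the PR performs this split inside setup_datasets().
        if not 0.0 <= holdout_ratio <= 0.5:
            raise ValueError("validation_holdout_ratio must be in [0.0, 0.5]")
        if holdout_ratio == 0.0:
            return train_dataset, None
        split = train_dataset.train_test_split(test_size=holdout_ratio, seed=seed)
        # flatten_indices() materializes each selection so both splits are
        # re-indexed from 0, matching the "proper index reset" noted above.
        return split["train"].flatten_indices(), split["test"].flatten_indices()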

Usage

# Hold out 10% of the training data for validation
uv run python open_instruct/grpo_fast.py \
    --validation_holdout_ratio 0.1 \
    ...

Test Results

Known Limitations

  • Eval metrics may time out during training due to the short poll timeout (0.01s); this is a pre-existing limitation of the eval system (see the sketch after this list)
  • When validation holdout is enabled, it replaces the standard eval_dataset
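For context on the first limitation, a hedged sketch of how a short poll window drops results; the queue here is an illustrative stand-in, not the actual eval plumbing in open_instruct.

    import queue

    eval_results: queue.Queue = queue.Queue()  # hypothetical eval-results channel

    def poll_eval_metrics():
        # A 0.01s window is usually shorter than an eval rollout, so mid-training
        # polls often come back empty; a final blocking fetch still reports.
        try:
            return eval_results.get(timeout=0.01)
        except queue.Empty:
            return None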

Documentation

See docs/VALIDATION_REWARD_TRACKING.md for full documentation.


🤖 Generated with Claude Code

@gemini-code-assist
Contributor

Summary of Changes

Hello @natolambert, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training capabilities by introducing a validation holdout mechanism for GRPO, which is crucial for monitoring overfitting during reinforcement learning. Concurrently, it expands hardware compatibility by adding experimental support for NVIDIA DGX Spark, including necessary dependency updates and specific configuration guidance. These changes aim to provide more robust training diagnostics and broader platform accessibility for large language model development.

Highlights

  • Validation Reward Tracking for GRPO: Introduced a new validation_holdout_ratio parameter (0.0-0.5) on StreamingDataLoaderConfig that allows holding out a portion of the training data for validation. This enables tracking accuracy on held-out training data during RL training to detect overfitting, with metrics appearing under the eval/ prefix in wandb.
  • DGX Spark (Blackwell) Support: Added experimental support for NVIDIA DGX Spark (GB10 Blackwell, CUDA 13, aarch64) including updated Dockerfile, pyproject.toml dependencies (vLLM, PyTorch, conditional flash-attn), and new documentation (docs/DGX_SPARK.md) with specific flags and troubleshooting for SFT, DPO, and GRPO training.
  • Flexible Attention Implementation: The attn_implementation parameter in ModelConfig and grpo_fast.py now supports both flash_attention_2 (default) and sdpa (PyTorch's native scaled_dot_product_attention), allowing for better compatibility and performance on different hardware, particularly DGX Spark, where SDPA is preferred (see the sketch after this list).
  • Gradient Checkpointing for LoRA: Enabled gradient checkpointing for non-QLoRA LoRA training in finetune.py to help manage memory usage during fine-tuning.
  • Improved Tool Handling: The create_tools() function in grpo_fast.py now gracefully handles cases where tools=None, returning an empty list instead of potentially raising an error.
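A hedged sketch covering the attention-backend and gradient-checkpointing bullets above, using the standard transformers API; the model name is a placeholder, and this is not the PR's exact call site.

    from transformers import AutoModelForCausalLM

    # attn_implementation accepts "flash_attention_2" or "sdpa"; SDPA is the
    # PyTorch-native backend and, per this PR, the preferred choice on DGX Spark.
    model = AutoModelForCausalLM.from_pretrained(
        "allenai/OLMo-2-1124-7B",  # placeholder model name
        attn_implementation="sdpa",
    )
    model.gradient_checkpointing_enable()  # trades recompute for activation memory

And a schematic sketch of the create_tools() fix in the last bullet; the body shown here is illustrative, not the actual function.

    def create_tools(tool_configs):
        if tool_configs is None:
            return []  # previously this case could raise; now it yields no tools
        return [build_tool(cfg) for cfg in tool_configs]  # build_tool is hypothetical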



@github-actions
Contributor

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2026-01-15 16:56:40.856229410 +0000
+++ site-pr/sitemap.xml	2026-01-15 16:55:41.213508084 +0000
@@ -5,6 +5,14 @@
          <lastmod>2026-01-15</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/DGX_SPARK/</loc>
+         <lastmod>2026-01-15</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).

Contributor

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a valuable validation reward tracking feature for GRPO training, allowing for better monitoring of overfitting. The implementation, including the new validation_holdout_ratio parameter and dataset splitting logic, is well-executed and thoroughly documented. Additionally, the PR adds experimental support for DGX Spark, updating dependencies and configurations appropriately. The changes are logical and well-structured. I've included a suggestion to refactor the dataset loading logic to avoid unnecessarily loading the evaluation dataset when a validation holdout is used, which will improve efficiency.

Comment on lines +2340 to +2356
if validation_dataset is not None:
    if eval_dataset is not None:
        logger.warning(
            "⚠️ Both validation_holdout_ratio and dataset_mixer_eval_list are specified. "
            "Using validation holdout for 'eval/' metrics (to track overfitting). "
            "The separate eval_dataset (test set) will not be used."
        )
    # Use validation holdout for evaluation metrics
    eval_dataset = validation_dataset
    logger.info(
        f"🎯 Using validation holdout ({len(eval_dataset)} samples) for evaluation metrics. "
        "This tracks accuracy on held-out training data to detect overfitting."
    )

Severity: medium

Following the suggested change in setup_datasets to avoid loading the standard eval set when a validation holdout is used, this check should be updated. Instead of checking if eval_dataset is not None, we should check the configuration directly to see if a standard eval set was specified (len(streaming_config.dataset_mixer_eval_list) > 0). This ensures the warning is still logged correctly when both are configured, even though the standard eval set is no longer loaded.

Suggested change
-if validation_dataset is not None:
-    if eval_dataset is not None:
-        logger.warning(
-            "⚠️ Both validation_holdout_ratio and dataset_mixer_eval_list are specified. "
-            "Using validation holdout for 'eval/' metrics (to track overfitting). "
-            "The separate eval_dataset (test set) will not be used."
-        )
-    # Use validation holdout for evaluation metrics
-    eval_dataset = validation_dataset
-    logger.info(
-        f"🎯 Using validation holdout ({len(eval_dataset)} samples) for evaluation metrics. "
-        "This tracks accuracy on held-out training data to detect overfitting."
-    )
+if validation_dataset is not None:
+    if len(streaming_config.dataset_mixer_eval_list) > 0:
+        logger.warning(
+            "⚠️ Both validation_holdout_ratio and dataset_mixer_eval_list are specified. "
+            "Using validation holdout for 'eval/' metrics (to track overfitting). "
+            "The separate eval_dataset (test set) will not be used."
+        )
+    # Use validation holdout for evaluation metrics
+    eval_dataset = validation_dataset
+    logger.info(
+        f"🎯 Using validation holdout ({len(eval_dataset)} samples) for evaluation metrics. "
+        "This tracks accuracy on held-out training data to detect overfitting."
+    )

This feature allows holding out a portion of training data for validation
to detect overfitting during RL training.

Changes:
- Add validation_holdout_ratio parameter to StreamingDataLoaderConfig
- Implement dataset splitting in setup_datasets() with proper index reset
- Use validation holdout for eval/ metrics when enabled
- Fix create_tools() to handle tools=None case
- Add documentation and example script for DGX Spark

Usage:
  --validation_holdout_ratio 0.1  # Hold out 10% for validation

The held-out data appears in eval/ metrics, allowing tracking of
accuracy on unseen training data to detect overfitting.

Known limitation: Eval metrics may timeout during training due to short
timeout (0.01s). Full eval appears at end of training.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@natolambert force-pushed the validation-reward-tracking branch from 6e1f2a5 to 203c5c6 on January 15, 2026 at 21:51
@github-actions
Contributor

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2026-01-15 21:52:30.136504774 +0000
+++ site-pr/sitemap.xml	2026-01-15 21:52:27.608522868 +0000
@@ -9,6 +9,10 @@
          <lastmod>2026-01-15</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/VALIDATION_REWARD_TRACKING/</loc>
+         <lastmod>2026-01-15</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).
