Add validation reward tracking for GRPO training #1369
base: main
Conversation
Summary of Changes

Hello @natolambert, I'm Gemini Code Assist! Here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances training by introducing a validation holdout mechanism for GRPO, which is useful for monitoring overfitting during reinforcement learning. It also expands hardware compatibility with experimental support for NVIDIA DGX Spark, including the necessary dependency updates and configuration guidance. Together, these changes provide more robust training diagnostics and broader platform accessibility for large language model development.
Code Review
This pull request introduces a valuable validation reward tracking feature for GRPO training, allowing for better monitoring of overfitting. The implementation, including the new `validation_holdout_ratio` parameter and dataset splitting logic, is well-executed and thoroughly documented. Additionally, the PR adds experimental support for DGX Spark, updating dependencies and configurations appropriately. The changes are logical and well-structured. I've included a suggestion to refactor the dataset loading logic to avoid unnecessarily loading the evaluation dataset when a validation holdout is used, which will improve efficiency.
```python
if validation_dataset is not None:
    if eval_dataset is not None:
        logger.warning(
            "⚠️ Both validation_holdout_ratio and dataset_mixer_eval_list are specified. "
            "Using validation holdout for 'eval/' metrics (to track overfitting). "
            "The separate eval_dataset (test set) will not be used."
        )
    # Use validation holdout for evaluation metrics
    eval_dataset = validation_dataset
    logger.info(
        f"🎯 Using validation holdout ({len(eval_dataset)} samples) for evaluation metrics. "
        "This tracks accuracy on held-out training data to detect overfitting."
    )
```
Following the suggested change in `setup_datasets` to avoid loading the standard eval set when a validation holdout is used, this check should be updated. Instead of checking `if eval_dataset is not None`, we should check the configuration directly to see whether a standard eval set was specified (`len(streaming_config.dataset_mixer_eval_list) > 0`). This ensures the warning is still logged correctly when both are configured, even though the standard eval set is no longer loaded.
Suggested change:

```diff
 if validation_dataset is not None:
-    if eval_dataset is not None:
+    if len(streaming_config.dataset_mixer_eval_list) > 0:
         logger.warning(
             "⚠️ Both validation_holdout_ratio and dataset_mixer_eval_list are specified. "
             "Using validation holdout for 'eval/' metrics (to track overfitting). "
             "The separate eval_dataset (test set) will not be used."
         )
     # Use validation holdout for evaluation metrics
     eval_dataset = validation_dataset
     logger.info(
         f"🎯 Using validation holdout ({len(eval_dataset)} samples) for evaluation metrics. "
         "This tracks accuracy on held-out training data to detect overfitting."
     )
```
This feature allows holding out a portion of training data for validation to detect overfitting during RL training.

Changes:
- Add validation_holdout_ratio parameter to StreamingDataLoaderConfig
- Implement dataset splitting in setup_datasets() with proper index reset
- Use validation holdout for eval/ metrics when enabled
- Fix create_tools() to handle the tools=None case
- Add documentation and example script for DGX Spark

Usage: --validation_holdout_ratio 0.1  # Hold out 10% for validation

The held-out data appears in eval/ metrics, allowing tracking of accuracy on unseen training data to detect overfitting.

Known limitation: Eval metrics may time out during training due to the short timeout (0.01s). The full eval appears at the end of training.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from 6e1f2a5 to 203c5c6.
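The commit message above describes dataset splitting in setup_datasets() with proper index reset. Below is a minimal sketch of how such a split could be done with Hugging Face `datasets`; the function name `split_validation_holdout` and the default seed are assumptions, not the PR's actual code:

```python
from typing import Optional, Tuple

from datasets import Dataset


def split_validation_holdout(
    train_dataset: Dataset, holdout_ratio: float, seed: int = 42
) -> Tuple[Dataset, Optional[Dataset]]:
    """Hold out a fraction of the training data for validation-reward tracking."""
    if holdout_ratio <= 0.0:
        return train_dataset, None
    split = train_dataset.train_test_split(test_size=holdout_ratio, seed=seed)
    # flatten_indices() rewrites each split with contiguous indices, so downstream
    # code sees plain, re-indexed datasets ("proper index reset").
    return split["train"].flatten_indices(), split["test"].flatten_indices()
```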
Summary
- Add `validation_holdout_ratio` parameter to hold out a portion of training data for validation
- Held-out data appears under the `eval/` prefix in wandb (sketched below)
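As a rough illustration of the `eval/` prefix mentioned above, held-out rewards might be logged to wandb along these lines. The metric key `eval/reward_mean` and the aggregation are assumptions, not the PR's exact names:

```python
import wandb


def log_validation_rewards(rewards: list[float], step: int) -> None:
    """Log the mean held-out reward so overfitting shows up as a train/eval gap."""
    # Assumed metric key; the PR only states that held-out metrics use the eval/ prefix.
    if rewards:
        wandb.log({"eval/reward_mean": sum(rewards) / len(rewards)}, step=step)
```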
Changes
- Add `validation_holdout_ratio` parameter to `StreamingDataLoaderConfig` (0.0-0.5); see the config sketch below
- Implement dataset splitting in `setup_datasets()` with proper index reset
- Fix `create_tools()` to handle the `tools=None` case
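A minimal sketch of the new config field and the documented 0.0-0.5 bound; the trimmed-down class and the `__post_init__` check below are assumptions rather than the actual `StreamingDataLoaderConfig`:

```python
from dataclasses import dataclass


@dataclass
class StreamingDataLoaderConfig:
    """Reduced to the field relevant here; the real config has many more options."""

    validation_holdout_ratio: float = 0.0
    """Fraction of training data held out for eval/ metrics (0.0 disables the holdout)."""

    def __post_init__(self):
        if not 0.0 <= self.validation_holdout_ratio <= 0.5:
            raise ValueError(
                f"validation_holdout_ratio must be in [0.0, 0.5], got {self.validation_holdout_ratio}"
            )
```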
Usage

```bash
# Hold out 10% of the training data for validation
uv run python open_instruct/grpo_fast.py \
    --validation_holdout_ratio 0.1 \
    ...
```

Test Results
Known Limitations
- Eval metrics may time out during training due to the short timeout (0.01s); the full eval appears at the end of training.
Documentation

See `docs/VALIDATION_REWARD_TRACKING.md` for full documentation.

🤖 Generated with Claude Code