Skip to content

fix: Improve robustness and correctness in data preprocessing#2628

Draft
iuyo5678 wants to merge 6 commits intoNVIDIA:mainfrom
iuyo5678:fix/preprocess-data-robustness
Draft

fix: Improve robustness and correctness in data preprocessing#2628
iuyo5678 wants to merge 6 commits intoNVIDIA:mainfrom
iuyo5678:fix/preprocess-data-robustness

Conversation

@iuyo5678
Copy link
Contributor

What does this PR do ?

  1. Fail Early: Moved the assertion check 'args.workers % args.partitions == 0' to the beginning of the partition logic. This prevents the script from performing expensive file I/O operations only to fail later due to invalid configuration.

  2. Multi-key Fix: Fixed a bug in 'Partition.process_json_file' where only the last key in '--json-keys' was being finalized. Now, all builders are correctly finalized.

  3. Robustness: Improved 'Encoder.encode' to safely handle cases where a specified JSON key is missing. Instead of raising a KeyError, it now skips the missing key.

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

This commit addresses three main issues in tools/preprocess_data.py:

1. Fail Early: Moved the assertion check 'args.workers % args.partitions == 0' to the beginning of the partition logic. This prevents the script from performing expensive file I/O operations only to fail later due to invalid configuration.

2. Multi-key Fix: Fixed a bug in 'Partition.process_json_file' where only the last key in '--json-keys' was being finalized. Now, all builders are correctly finalized.

3. Robustness: Improved 'Encoder.encode' to safely handle cases where a specified JSON key is missing. Instead of raising a KeyError, it now skips the missing key.
@copy-pr-bot
Copy link

copy-pr-bot bot commented Dec 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 11, 2026
@Phlip79 Phlip79 requested a review from jon-barker January 30, 2026 19:02
@Phlip79
Copy link
Member

Phlip79 commented Jan 30, 2026

/ok to test dc2ecc8

@github-actions
Copy link
Contributor

Thank you for your contribution!

NVIDIA Megatron-LM is currently transitioning to development on Github. We will aim to review your PR after we complete our transition and stabilize our Github development process.

Thank you for your understanding.

@Phlip79
Copy link
Member

Phlip79 commented Mar 4, 2026

/claude review

lens = {}
for key in self.args.json_keys:
text = data[key]
text = data.get(key)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data.get(key) + continue approach silently skips any record where the key is absent (including typos in --json-keys). This can cause silent data loss with no indication to the user that anything went wrong.

Consider at minimum logging a warning so the user knows records are being dropped:

Suggested change
text = data.get(key)
text = data.get(key)
if text is None:
import logging
logging.warning(f"Key '{key}' missing or null in record; skipping.")
continue

Or, if silent skipping is intentional, a counter that is reported at the end (similar to the existing self.print_processing_stats) would make the behavior observable.

@Phlip79
Copy link
Member

Phlip79 commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go in effect starting once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants