fix: Improve robustness and correctness in data preprocessing #2628
iuyo5678 wants to merge 6 commits into NVIDIA:main
This commit addresses three main issues in tools/preprocess_data.py:
1. Fail Early: Moved the assertion check 'args.workers % args.partitions == 0' to the beginning of the partition logic. This prevents the script from performing expensive file I/O operations only to fail later due to invalid configuration.
2. Multi-key Fix: Fixed a bug in 'Partition.process_json_file' where only the last key in '--json-keys' was being finalized. Now, all builders are correctly finalized.
3. Robustness: Improved 'Encoder.encode' to safely handle cases where a specified JSON key is missing. Instead of raising a KeyError, it now skips the missing key.
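As an illustration of the fail-early idea in item 1, here is a minimal sketch of validating the worker/partition configuration before any files are touched. The helper name `validate_partition_args` and the use of `ValueError` instead of a bare `assert` are assumptions made for this example, not the code in this PR.

```python
import argparse


def validate_partition_args(workers: int, partitions: int) -> int:
    """Return workers per partition, raising before any file I/O is attempted.

    Hypothetical helper: the PR itself moves an existing assert earlier.
    """
    if partitions < 1:
        raise ValueError(f"partitions must be >= 1, got {partitions}")
    if workers % partitions != 0:
        raise ValueError(
            f"workers ({workers}) must be divisible by partitions ({partitions})"
        )
    return workers // partitions


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--partitions", type=int, default=1)
    args = parser.parse_args(argv)

    # Validate first; splitting input files into partitions would come after.
    return validate_partition_args(args.workers, args.partitions)
```

Raising an explicit exception rather than asserting keeps the check active even when Python runs with `-O`, which strips `assert` statements.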
/ok to test dc2ecc8
Thank you for your contribution! NVIDIA Megatron-LM is currently transitioning to development on GitHub. We will aim to review your PR after we complete our transition and stabilize our GitHub development process. Thank you for your understanding.
/claude review
      lens = {}
      for key in self.args.json_keys:
    -     text = data[key]
    +     text = data.get(key)
The data.get(key) + continue approach silently skips any record where the key is absent (including typos in --json-keys). This can cause silent data loss with no indication to the user that anything went wrong.
Consider at minimum logging a warning so the user knows records are being dropped:
    text = data.get(key)
    if text is None:
        import logging
        logging.warning(f"Key '{key}' missing or null in record; skipping.")
        continue
Or, if silent skipping is intentional, a counter that is reported at the end (similar to the existing self.print_processing_stats) would make the behavior observable.
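The counter-based alternative could look roughly like the sketch below. It is standalone and does not use the real `Encoder` class; `collect_texts` and `report_missing` are hypothetical names chosen for illustration.

```python
import json
from collections import Counter


def collect_texts(json_lines, json_keys):
    """Return (key, text) pairs plus a per-key count of missing or null values."""
    missing = Counter()
    texts = []
    for line in json_lines:
        data = json.loads(line)
        for key in json_keys:
            text = data.get(key)
            if text is None:
                # Count the skip instead of dropping the record silently.
                missing[key] += 1
                continue
            texts.append((key, text))
    return texts, missing


def report_missing(missing, total_records):
    """Format one summary line per key that was skipped at least once."""
    return [
        f"key '{key}': missing or null in {count} of {total_records} records"
        for key, count in sorted(missing.items())
    ]
```

Printing the result of `report_missing` once at the end of processing keeps the per-record path quiet while still surfacing a typo in `--json-keys`, which would show up as every record being skipped for that key.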
We are changing our review process and marking all open, unlabeled PRs as draft. This change will take effect once #3659 is merged. Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.
What does this PR do?
Fail Early: Moved the assertion check 'args.workers % args.partitions == 0' to the beginning of the partition logic. This prevents the script from performing expensive file I/O operations only to fail later due to invalid configuration.
Multi-key Fix: Fixed a bug in 'Partition.process_json_file' where only the last key in '--json-keys' was being finalized. Now, all builders are correctly finalized.
Robustness: Improved 'Encoder.encode' to safely handle cases where a specified JSON key is missing. Instead of raising a KeyError, it now skips the missing key.
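To make the multi-key fix concrete, the sketch below shows one way such a bug can arise, where a loop variable is reused after the loop ends, alongside the corrected pattern. `RecordingBuilder` is a hypothetical stand-in for the real dataset builder, and the function names are illustrative only.

```python
class RecordingBuilder:
    """Hypothetical stand-in for an indexed-dataset builder."""

    def __init__(self):
        self.finalized_to = None

    def finalize(self, idx_path):
        # Record where the index would have been written.
        self.finalized_to = idx_path


def finalize_last_only(builders, idx_paths):
    # Buggy pattern: `key` still holds the last loop value after the loop,
    # so only the final key's builder is finalized.
    for key in builders:
        pass  # per-document work would happen here
    builders[key].finalize(idx_paths[key])


def finalize_all(builders, idx_paths):
    # Corrected pattern: finalize every builder explicitly.
    for key, builder in builders.items():
        builder.finalize(idx_paths[key])
```

With keys `text` and `title`, the first function leaves the `text` builder without a finalized index, while the second finalizes both.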
Pre-checks
Core 0.8)