[data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder#2612
[data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder#2612shanecmoran wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
Conversation
📝 WalkthroughWalkthroughThis PR updates the return type annotation of the ProcessExampleFn callable from a structured type to a generic dict, simplifies the output serialization logic in the dataset preprocessing function, and adds comprehensive unit tests validating chat-format and input-output format preprocessing workflows. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit_tests/data/test_hf_dataset_chat_format.py (1)
52-110: Consider addingpytest.mark.unitdecorator.Per coding guidelines, tests should use
pytest.markto categorize tests. Since this is intests/unit_tests/, adding the marker helps with test filtering and organization.Proposed fix
+import pytest + + +@pytest.mark.unit class TestPreprocessChatFormat:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/data/test_hf_dataset_chat_format.py` around lines 52 - 110, Add the pytest unit marker to these tests so they are categorized as unit tests: decorate either the class TestPreprocessChatFormat or each test method (test_chat_format_writes_messages_and_tools and test_input_output_format_still_works) with `@pytest.mark.unit` and ensure pytest is imported (import pytest) at the top of the test file; keep the existing behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unit_tests/data/test_hf_dataset_chat_format.py`:
- Around line 15-20: Remove the unused import of Path from the test module:
delete the "from pathlib import Path" import line in
tests/unit_tests/data/test_hf_dataset_chat_format.py (the import of Path is not
referenced anywhere; keep the other imports including json, Dataset,
DatasetDict, and preprocess_and_split_data intact). Run the test/linter to
verify no unused-import warnings remain.
---
Nitpick comments:
In `@tests/unit_tests/data/test_hf_dataset_chat_format.py`:
- Around line 52-110: Add the pytest unit marker to these tests so they are
categorized as unit tests: decorate either the class TestPreprocessChatFormat or
each test method (test_chat_format_writes_messages_and_tools and
test_input_output_format_still_works) with `@pytest.mark.unit` and ensure pytest
is imported (import pytest) at the top of the test file; keep the existing
behavior unchanged.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/megatron/bridge/data/builders/hf_dataset.pytests/unit_tests/data/test_hf_dataset_chat_format.py
3811da9 to
16b9fcc
Compare
|
/ok to test 16b9fcc |
26e7bb2 to
05194ac
Compare
|
@yaoyu-33 I'm a bit confused, it says you approved but also says its awaiting your review? |
…tBuilder preprocess_and_split_data previously hardcoded input/output/original_answers keys when writing JSONL, which prevented chat-format process functions from producing the messages/tools keys that GPTSFTChatDataset expects. Write the full dict returned by process_example_fn instead. Existing input/output processors are unaffected since their dicts already contain those keys. Fixes NVIDIA-NeMo#2611 Signed-off-by: Shane Moran <shane.moran@shopify.com>
235696c to
9ce3e17
Compare
|
assigned @cuichenx to take another look |
|
/ok to test 9ce3e17 |
What does this PR do ?
Write the full dict returned by
process_example_fnto JSONL instead of hardcodinginput/output/original_answerskeys. This allows chat-format process functions (returning{"messages": [...], "tools": [...]}) to work withHFDatasetConfigandGPTSFTChatDataset.Changelog
ProcessExampleFnreturn type fromProcessExampleOutputtodict[str, Any]process_example_fnoutput directly as JSONL instead of cherry-pickinginput/outputkeysGitHub Actions CI
N/A — external contributor, CI needs manual approval.
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
Documentation
Updates
Tests