Skip to content

[data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder#2612

Open
shanecmoran wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
shanecmoran:fix/hf-dataset-chat-format
Open

[data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder#2612
shanecmoran wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
shanecmoran:fix/hf-dataset-chat-format

Conversation

@shanecmoran
Copy link

@shanecmoran shanecmoran commented Mar 2, 2026

What does this PR do ?

Write the full dict returned by process_example_fn to JSONL instead of hardcoding input/output/original_answers keys. This allows chat-format process functions (returning {"messages": [...], "tools": [...]}) to work with HFDatasetConfig and GPTSFTChatDataset.

Changelog

  • Widen ProcessExampleFn return type from ProcessExampleOutput to dict[str, Any]
  • Write process_example_fn output directly as JSONL instead of cherry-picking input/output keys
  • Add unit test for chat-format and backward-compatible input/output format

GitHub Actions CI

N/A — external contributor, CI needs manual approval.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

Additional Information

Summary by CodeRabbit

Release Notes

  • Documentation

    • Enhanced documentation for data processing functions with expanded docstrings.
  • Updates

    • Method signature updated to support more flexible data format handling in the data processing pipeline.
  • Tests

    • Added comprehensive tests for chat-format data preprocessing, including backward compatibility validation with legacy input/output formats.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

Walkthrough

This PR updates the return type annotation of the ProcessExampleFn callable from a structured type to a generic dict, simplifies the output serialization logic in the dataset preprocessing function, and adds comprehensive unit tests validating chat-format and input-output format preprocessing workflows.

Changes

Cohort / File(s) Summary
Core Data Builder Updates
src/megatron/bridge/data/builders/hf_dataset.py
Updated ProcessExampleFn.call return type annotation to dict[str, Any], expanded docstrings for clarity, and simplified output writing to directly serialize the processed example dict as JSONL without intermediate construction.
Chat Format Test Suite
tests/unit_tests/data/test_hf_dataset_chat_format.py
New unit test module validating preprocessing logic for chat-format (messages/tools keys) and input-output format (backward compatibility) workflows using in-memory datasets and preprocess_and_split_data function.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Test Results For Major Changes ✅ Passed PR introduces new chat-format functionality with comprehensive test coverage validating both new behavior and backward compatibility, meeting requirements for testing major changes.
Title check ✅ Passed The title accurately describes the main change: updating HFDatasetBuilder to write the full process_example_fn output to JSONL, which is the primary focus of the PR and clearly reflected in the code changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unit_tests/data/test_hf_dataset_chat_format.py (1)

52-110: Consider adding pytest.mark.unit decorator.

Per coding guidelines, tests should use pytest.mark to categorize tests. Since this is in tests/unit_tests/, adding the marker helps with test filtering and organization.

Proposed fix
+import pytest
+
+
+@pytest.mark.unit
 class TestPreprocessChatFormat:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/data/test_hf_dataset_chat_format.py` around lines 52 - 110,
Add the pytest unit marker to these tests so they are categorized as unit tests:
decorate either the class TestPreprocessChatFormat or each test method
(test_chat_format_writes_messages_and_tools and
test_input_output_format_still_works) with `@pytest.mark.unit` and ensure pytest
is imported (import pytest) at the top of the test file; keep the existing
behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unit_tests/data/test_hf_dataset_chat_format.py`:
- Around line 15-20: Remove the unused import of Path from the test module:
delete the "from pathlib import Path" import line in
tests/unit_tests/data/test_hf_dataset_chat_format.py (the import of Path is not
referenced anywhere; keep the other imports including json, Dataset,
DatasetDict, and preprocess_and_split_data intact). Run the test/linter to
verify no unused-import warnings remain.

---

Nitpick comments:
In `@tests/unit_tests/data/test_hf_dataset_chat_format.py`:
- Around line 52-110: Add the pytest unit marker to these tests so they are
categorized as unit tests: decorate either the class TestPreprocessChatFormat or
each test method (test_chat_format_writes_messages_and_tools and
test_input_output_format_still_works) with `@pytest.mark.unit` and ensure pytest
is imported (import pytest) at the top of the test file; keep the existing
behavior unchanged.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 383b610 and 455b077.

📒 Files selected for processing (2)
  • src/megatron/bridge/data/builders/hf_dataset.py
  • tests/unit_tests/data/test_hf_dataset_chat_format.py

@shanecmoran shanecmoran force-pushed the fix/hf-dataset-chat-format branch from 3811da9 to 16b9fcc Compare March 2, 2026 17:32
@shanecmoran shanecmoran changed the title [data] Write full process_example_fn output to JSONL in HFDatasetBuilder [data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder Mar 2, 2026
@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from copy-pr-bot bot Mar 2, 2026
@adityavavreNVDA
Copy link
Contributor

/ok to test 16b9fcc

@yaoyu-33 yaoyu-33 self-requested a review March 2, 2026 22:00
@shanecmoran shanecmoran force-pushed the fix/hf-dataset-chat-format branch from 26e7bb2 to 05194ac Compare March 4, 2026 19:00
@shanecmoran
Copy link
Author

@yaoyu-33 I'm a bit confused, it says you approved but also says its awaiting your review?

…tBuilder

preprocess_and_split_data previously hardcoded input/output/original_answers
keys when writing JSONL, which prevented chat-format process functions from
producing the messages/tools keys that GPTSFTChatDataset expects.

Write the full dict returned by process_example_fn instead. Existing
input/output processors are unaffected since their dicts already contain
those keys.

Fixes NVIDIA-NeMo#2611

Signed-off-by: Shane Moran <shane.moran@shopify.com>
@shanecmoran shanecmoran force-pushed the fix/hf-dataset-chat-format branch from 235696c to 9ce3e17 Compare March 5, 2026 11:57
@yaoyu-33 yaoyu-33 self-requested a review March 5, 2026 19:28
@yaoyu-33
Copy link
Contributor

yaoyu-33 commented Mar 5, 2026

assigned @cuichenx to take another look

@cuichenx
Copy link
Contributor

cuichenx commented Mar 5, 2026

/ok to test 9ce3e17

@yaoyu-33 yaoyu-33 enabled auto-merge (squash) March 5, 2026 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants