[data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder by shanecmoran · Pull Request #2612 · NVIDIA-NeMo/Megatron-Bridge

shanecmoran · 2026-03-02T17:23:53Z

What does this PR do ?

Write the full dict returned by process_example_fn to JSONL instead of hardcoding input/output/original_answers keys. This allows chat-format process functions (returning {"messages": [...], "tools": [...]}) to work with HFDatasetConfig and GPTSFTChatDataset.

Changelog

Widen ProcessExampleFn return type from ProcessExampleOutput to dict[str, Any]
Write process_example_fn output directly as JSONL instead of cherry-picking input/output keys
Add unit test for chat-format and backward-compatible input/output format

GitHub Actions CI

N/A — external contributor, CI needs manual approval.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

Additional Information

Related to HFDatasetBuilder hardcodes input/output JSONL keys, breaking chat-format datasets #2611

Summary by CodeRabbit

Release Notes

Documentation
- Enhanced documentation for data processing functions with expanded docstrings.
Updates
- Method signature updated to support more flexible data format handling in the data processing pipeline.
Tests
- Added comprehensive tests for chat-format data preprocessing, including backward compatibility validation with legacy input/output formats.

copy-pr-bot · 2026-03-02T17:23:56Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-03-02T17:26:59Z

📝 Walkthrough

Walkthrough

This PR updates the return type annotation of the ProcessExampleFn callable from a structured type to a generic dict, simplifies the output serialization logic in the dataset preprocessing function, and adds comprehensive unit tests validating chat-format and input-output format preprocessing workflows.

Changes

Cohort / File(s)	Summary
Core Data Builder Updates `src/megatron/bridge/data/builders/hf_dataset.py`	Updated ProcessExampleFn.call return type annotation to `dict[str, Any]`, expanded docstrings for clarity, and simplified output writing to directly serialize the processed example dict as JSONL without intermediate construction.
Chat Format Test Suite `tests/unit_tests/data/test_hf_dataset_chat_format.py`	New unit test module validating preprocessing logic for chat-format (messages/tools keys) and input-output format (backward compatibility) workflows using in-memory datasets and preprocess_and_split_data function.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Test Results For Major Changes	✅ Passed	PR introduces new chat-format functionality with comprehensive test coverage validating both new behavior and backward compatibility, meeting requirements for testing major changes.
Title check	✅ Passed	The title accurately describes the main change: updating HFDatasetBuilder to write the full process_example_fn output to JSONL, which is the primary focus of the PR and clearly reflected in the code changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/unit_tests/data/test_hf_dataset_chat_format.py (1)
52-110: Consider adding pytest.mark.unit decorator.

Per coding guidelines, tests should use pytest.mark to categorize tests. Since this is in tests/unit_tests/, adding the marker helps with test filtering and organization.
Proposed fix
+import pytest
+
+
+@pytest.mark.unit
 class TestPreprocessChatFormat:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/data/test_hf_dataset_chat_format.py` around lines 52 - 110,
Add the pytest unit marker to these tests so they are categorized as unit tests:
decorate either the class TestPreprocessChatFormat or each test method
(test_chat_format_writes_messages_and_tools and
test_input_output_format_still_works) with `@pytest.mark.unit` and ensure pytest
is imported (import pytest) at the top of the test file; keep the existing
behavior unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unit_tests/data/test_hf_dataset_chat_format.py`:
- Around line 15-20: Remove the unused import of Path from the test module:
delete the "from pathlib import Path" import line in
tests/unit_tests/data/test_hf_dataset_chat_format.py (the import of Path is not
referenced anywhere; keep the other imports including json, Dataset,
DatasetDict, and preprocess_and_split_data intact). Run the test/linter to
verify no unused-import warnings remain.

---

Nitpick comments:
In `@tests/unit_tests/data/test_hf_dataset_chat_format.py`:
- Around line 52-110: Add the pytest unit marker to these tests so they are
categorized as unit tests: decorate either the class TestPreprocessChatFormat or
each test method (test_chat_format_writes_messages_and_tools and
test_input_output_format_still_works) with `@pytest.mark.unit` and ensure pytest
is imported (import pytest) at the top of the test file; keep the existing
behavior unchanged.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 383b610 and 455b077.

📒 Files selected for processing (2)

src/megatron/bridge/data/builders/hf_dataset.py
tests/unit_tests/data/test_hf_dataset_chat_format.py

tests/unit_tests/data/test_hf_dataset_chat_format.py

adityavavreNVDA · 2026-03-02T21:38:35Z

/ok to test 16b9fcc

shanecmoran · 2026-03-04T20:29:20Z

@yaoyu-33 I'm a bit confused, it says you approved but also says its awaiting your review?

…tBuilder preprocess_and_split_data previously hardcoded input/output/original_answers keys when writing JSONL, which prevented chat-format process functions from producing the messages/tools keys that GPTSFTChatDataset expects. Write the full dict returned by process_example_fn instead. Existing input/output processors are unaffected since their dicts already contain those keys. Fixes NVIDIA-NeMo#2611 Signed-off-by: Shane Moran <shane.moran@shopify.com>

yaoyu-33 · 2026-03-05T19:28:28Z

assigned @cuichenx to take another look

cuichenx · 2026-03-05T19:30:46Z

/ok to test 9ce3e17

github-actions bot added the community-request label Mar 2, 2026

coderabbitai bot reviewed Mar 2, 2026

View reviewed changes

tests/unit_tests/data/test_hf_dataset_chat_format.py Show resolved Hide resolved

shanecmoran force-pushed the fix/hf-dataset-chat-format branch from 3811da9 to 16b9fcc Compare March 2, 2026 17:32

shanecmoran changed the title ~~[data] Write full process_example_fn output to JSONL in HFDatasetBuilder~~ [data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder Mar 2, 2026

NVIDIA-NeMo deleted a comment from copy-pr-bot bot Mar 2, 2026

adityavavreNVDA mentioned this pull request Mar 2, 2026

HFDatasetBuilder hardcodes input/output JSONL keys, breaking chat-format datasets #2611

Open

copy-pr-bot bot temporarily deployed to test March 2, 2026 21:39 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 2, 2026 21:50 Inactive

yaoyu-33 approved these changes Mar 2, 2026

View reviewed changes

yaoyu-33 self-requested a review March 2, 2026 22:00

copy-pr-bot bot had a problem deploying to nemo-ci March 2, 2026 23:07 Failure

shanecmoran force-pushed the fix/hf-dataset-chat-format branch from 26e7bb2 to 05194ac Compare March 4, 2026 19:00

shanecmoran force-pushed the fix/hf-dataset-chat-format branch from 235696c to 9ce3e17 Compare March 5, 2026 11:57

yaoyu-33 approved these changes Mar 5, 2026

View reviewed changes

yaoyu-33 self-requested a review March 5, 2026 19:28

cuichenx approved these changes Mar 5, 2026

View reviewed changes

copy-pr-bot bot deployed to test March 5, 2026 19:31 Active

yaoyu-33 enabled auto-merge (squash) March 5, 2026 19:49

copy-pr-bot bot requested a deployment to nemo-ci March 5, 2026 20:36 Queued

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder#2612

[data] fix: Write full process_example_fn output to JSONL in HFDatasetBuilder#2612
shanecmoran wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
shanecmoran:fix/hf-dataset-chat-format

shanecmoran commented Mar 2, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

copy-pr-bot bot commented Mar 2, 2026

Uh oh!

coderabbitai bot commented Mar 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

adityavavreNVDA commented Mar 2, 2026

Uh oh!

shanecmoran commented Mar 4, 2026

Uh oh!

yaoyu-33 commented Mar 5, 2026

Uh oh!

cuichenx commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

shanecmoran commented Mar 2, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Changelog

GitHub Actions CI

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot bot commented Mar 2, 2026

Uh oh!

coderabbitai bot commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adityavavreNVDA commented Mar 2, 2026

Uh oh!

shanecmoran commented Mar 4, 2026

Uh oh!

yaoyu-33 commented Mar 5, 2026

Uh oh!

cuichenx commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

shanecmoran commented Mar 2, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 2, 2026 •

edited

Loading