
Add E2E support for THD format #3386

Draft
xiaoyao0115 wants to merge 3 commits into NVIDIA:main from xiaoyao0115:thd_e2e_main

Conversation


@xiaoyao0115 xiaoyao0115 commented Feb 12, 2026

Description

This PR adds Sequence Packing (THD format) E2E support to MCore. Dev branch PR: #2924

The core functionality missing for THD in MCore:

  • The data iterator cannot handle THD metadata such as cu_seqlens and max_seqlens.
  • num_microbatches is fixed.
  • Packing parameters (packed_seq_params) are not passed between PP ranks.
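For context, in the THD layout variable-length sequences are concatenated along the token dimension, and cu_seqlens records where each packed sequence starts and ends. A minimal sketch of how that metadata is derived (the helper name is illustrative, not from the PR):

```python
from itertools import accumulate

def build_cu_seqlens(seq_lens):
    """Cumulative sequence lengths: the boundary offsets of each
    packed sequence inside the concatenated token dimension."""
    return [0] + list(accumulate(seq_lens))

# Three sequences of lengths 3, 5, 2 packed into 10 tokens total.
cu_seqlens = build_cu_seqlens([3, 5, 2])   # [0, 3, 8, 10]
max_seqlen = max([3, 5, 2])                # longest individual sequence: 5
```

Attention kernels then use cu_seqlens/max_seqlen to treat each packed segment as an independent sequence, which is exactly the metadata the stock data iterator cannot produce.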

Key Changes

1. Add a data_iterator wrapper (megatron/core/datasets/data_schedule.py::wrap_dataloader)

A wrapper function that intercepts the data iterator to perform scheduling and packing:

  • Schedule & Pack: Extracts data from the data iterator, schedules sequences across DP×CP ranks, and packs them into microbatches with cu_seqlens metadata.
  • Returns packing results: Returns the packed num_microbatches along with two parameters for FLOPs calculation: num_total_tokens_this_global_batch and sequence_square_sum_this_global_batch.
  • TP broadcast: Broadcasts num_microbatches and FLOPs parameters across TP ranks since only TP rank 0 has access to the data iterator.
  • PP broadcast: When using PP, middle PP stages (not first or last) require metadata (cu_seqlens, cu_seqlens_padded, max_seqlen, etc.) to be broadcast from PP rank 0 for correct computation.
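As a rough illustration of the schedule-and-pack step, the sketch below groups sequences into token-budgeted microbatches and computes the two FLOPs parameters the wrapper returns alongside num_microbatches. The function names and the greedy first-fit heuristic are assumptions for illustration; the PR's scheduler additionally balances work across DP×CP ranks and broadcasts results over TP/PP, which is omitted here.

```python
def schedule_into_microbatches(seq_lens, token_budget):
    """First-fit-decreasing sketch: place each sequence into the first
    microbatch whose total token count stays within the budget."""
    microbatches = []
    for length in sorted(seq_lens, reverse=True):
        for mb in microbatches:
            if sum(mb) + length <= token_budget:
                mb.append(length)
                break
        else:
            microbatches.append([length])
    return microbatches

def packing_stats(microbatches):
    """The quantities reported for dynamic FLOPs accounting:
    microbatch count, total tokens, and sum of squared seq lengths."""
    flat = [length for mb in microbatches for length in mb]
    num_total_tokens = sum(flat)
    sequence_square_sum = sum(length * length for length in flat)
    return len(microbatches), num_total_tokens, sequence_square_sum

mbs = schedule_into_microbatches([4096, 1024, 2048, 512], token_budget=4096)
# mbs == [[4096], [2048, 1024, 512]]  -> num_microbatches varies per batch
```

Because the microbatch count now depends on the actual sequence lengths drawn, num_microbatches can no longer be a fixed training-time constant, which is why it must be computed here and broadcast.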

2. Mock SFT Dataset Support

Supports mock datasets for testing and benchmarking with configurable sequence-length distributions. The mock SFT dataset has two modes:

  • File mode: load sequence lengths from an external CSV file. Example JSON:
    {"mode": "file", "path": "/path/to/seqlens.csv"}
  • Distribution mode: generate sequence lengths from a distribution (currently lognormal). Example JSON:
    {"mode": "distribution", "type": "lognormal", "min_seq_len": 1024, "max_seq_len": 8192, "mean_seq_len": 4096, "lognormal_sigma": 1.1}
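A sketch of what distribution mode might do with the config above, assuming mean_seq_len is the pre-clipping lognormal mean; the PR's exact parameterization and RNG may differ:

```python
import math
import random

def sample_seq_lens(n, min_seq_len=1024, max_seq_len=8192,
                    mean_seq_len=4096, sigma=1.1, seed=0):
    """Draw lognormal sequence lengths whose (pre-clipping) mean is
    roughly mean_seq_len, then clip into [min_seq_len, max_seq_len]."""
    rng = random.Random(seed)
    # For X ~ LogNormal(mu, sigma), E[X] = exp(mu + sigma^2 / 2),
    # so solve for mu given the desired mean.
    mu = math.log(mean_seq_len) - sigma ** 2 / 2
    return [min(max_seq_len, max(min_seq_len, int(rng.lognormvariate(mu, sigma))))
            for _ in range(n)]
```

Clipping skews the realized mean toward the interior of the range, so a benchmark that needs an exact empirical mean would use file mode with a pre-generated CSV instead.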

Architecture

Before vs After

graph LR
    subgraph Before
        A1[DataIterator] --> B1[get_batch]
        B1 --> C1[forward_backward]
        C1 --> D1[Fixed seq_len FLOPs]
    end
    subgraph After
        A2[DataIterator] --> W[wrap_dataloader]
        W -->|schedule + pack| B2[PackedDataIterator]
        W -->|broadcast| M[num_microbatches + flops_params]
        B2 --> C2[get_batch_for_sequence_packing]
        C2 --> D2[forward_backward]
        D2 --> E2[Dynamic FLOPs]
        M   --> E2
    end

Execution Flow

sequenceDiagram
    participant Train as training.py
    participant Schedule as schedules.py
    participant Wrap as wrap_iterator_helper
    participant DataSched as data_schedule.py
    participant GetBatch as get_batch_for_seq_packing

    Train->>Schedule: forward_backward_*(data_iterator)
    Schedule->>Wrap: wrap_iterator_helper(config, data_iterator)
    Wrap->>DataSched: wrap_dataloader(data_iterator, scheduler_type)
    
    Note over DataSched: 1. Gather global seqlens across DP
    Note over DataSched: 2. Scheduler assigns sequences to microbatches
    Note over DataSched: 3. All-to-all redistribute samples
    Note over DataSched: 4. Pack into microbatches
    Note over DataSched: 5. Broadcast to TP/PP ranks
    
    DataSched-->>Schedule: (packed_iter, num_mbs, total_tokens, seq_sq_sum)
    
    loop for each microbatch
        Schedule->>GetBatch: get_batch_on_this_rank_for_sequence_packing
        Note over GetBatch: Broadcast tokens/labels to TP group
        Note over GetBatch: Partition for CP if needed
        GetBatch-->>Schedule: (tokens, labels, loss_mask, pos_ids, packed_seq_params)
    end
    
    Schedule-->>Train: forward_data_store + [total_tokens, seq_sq_sum]
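The reason the flow carries both total_tokens and seq_sq_sum: with packed variable-length sequences, linear-layer FLOPs scale with the total token count, while attention score/context matmuls scale with the sum of squared sequence lengths. A simplified per-layer forward estimate (illustrative only; Megatron's real accounting also covers the vocab projection, GQA, and the backward multiplier):

```python
def transformer_flops_per_layer(num_total_tokens, sequence_square_sum,
                                hidden_size, ffn_hidden_size):
    """Rough forward FLOPs for one transformer layer over a packed batch.
    - QKV/output projections and the MLP scale with num_total_tokens.
    - Attention matmuls scale with sum(seq_len^2), which is why the
      wrapper reports sequence_square_sum rather than a fixed seq_len."""
    linear = 2 * num_total_tokens * (4 * hidden_size * hidden_size
                                     + 2 * hidden_size * ffn_hidden_size)
    attention = 2 * 2 * sequence_square_sum * hidden_size
    return linear + attention
```

With a fixed seq_len the old code could precompute FLOPs once; with packing, both quantities change every global batch, hence the "Dynamic FLOPs" node in the diagram above.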

New Arguments

| Argument | Type | Description |
| --- | --- | --- |
| --sequence-packing | flag | Enable sequence packing (THD format) for training |
| --sequence-packing-scheduler | str | Scheduler type: default or empty |
| --sft-mock-dataset-config-json | str | JSON config for mock dataset |
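The flags above would be registered roughly as follows (a sketch with assumed defaults and help strings; the real definitions live in megatron/training/arguments.py):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group("sequence packing")
group.add_argument("--sequence-packing", action="store_true",
                   help="Enable sequence packing (THD format) for training.")
group.add_argument("--sequence-packing-scheduler", type=str,
                   default="default", choices=["default", "empty"],
                   help="Scheduler used to assign sequences to microbatches.")
group.add_argument("--sft-mock-dataset-config-json", type=str, default=None,
                   help="JSON config for the mock SFT dataset.")

args = parser.parse_args(["--sequence-packing",
                          "--sequence-packing-scheduler", "default"])
```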

Changes

| File | Description |
| --- | --- |
| megatron/core/datasets/data_schedule.py | Core scheduling and packing logic |
| megatron/core/pipeline_parallel/schedules.py | Integration with forward/backward schedules |
| megatron/training/training.py | Updated FLOPs calculation for variable-length sequences |
| megatron/training/datasets/sft_dataset.py | Mock dataset support |
| megatron/training/arguments.py | New CLI arguments |
| megatron/core/model_parallel_config.py | Configuration options |
| tests/unit_tests/test_sequence_packing.py | Unit tests |

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message @mcore-oncall or tag them in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

xiaoyao0115 and others added 3 commits February 12, 2026 06:13
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Signed-off-by: tailaim <tailaim@nvidia.com>
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
@xiaoyao0115 xiaoyao0115 requested review from a team as code owners February 12, 2026 15:22
copy-pr-bot bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Phlip79 commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:36
