
Modify mfsdp default data-parallel-sharding-strategy#3691

Draft
wplf wants to merge 2 commits into NVIDIA:main from wplf:mfsdp-fix

Conversation

@wplf
Copy link
Member

@wplf wplf commented Mar 4, 2026

PR for dev branch #3692

What does this PR do ?

The default value "no_shard" is not compatible with many APIs. We can change this to "optim_grads_params".
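The change amounts to flipping the default of the sharding-strategy knob. A minimal sketch of the idea (the dataclass and field below are simplified stand-ins, not Megatron-FSDP's actual config class):

```python
from dataclasses import dataclass


@dataclass
class FSDPConfigSketch:
    # "no_shard" keeps full params/grads/optimizer state on every rank, but
    # leaves model_weight_buffer unset, which several code paths assume exists.
    # "optim_grads_params" shards all three (the ZeRO-3-style behavior), which
    # is why this PR proposes it as the default.
    data_parallel_sharding_strategy: str = "optim_grads_params"


cfg = FSDPConfigSketch()
print(cfg.data_parallel_sharding_strategy)  # optim_grads_params
```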

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 4, 2026 08:52
@Phlip79
Copy link
Member

Phlip79 commented Mar 4, 2026

Can you explain why you are making this change?

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:45
@wplf
Copy link
Member Author

wplf commented Mar 5, 2026

Can you explain why you are making this change?

Hi, I've found that when using no_shard, `group.model_weight_buffer` will not be created in

```python
if data_parallel_sharding_strategy != "no_shard":
    group.model_weight_buffer = DataParallelBuffer(
        self.ddp_config,
        group.params,
        is_data_distributed=is_model_weight_buffer_distributed
        and model_wbuf_dp_group.size() > 1,
        dtype=param_dtype,
        device=self.device,
        # Note: This will be DP-Outer + DP-Shard when sharding
        # the optimizer state in HFSDP, else just DP-Shard when
        # using basic HSDP or FSDP.
        data_parallel_group=model_wbuf_dp_group,
        is_transpose_buffer=False,
        temporary_bucket_allocator=self.weight_alloc,
        bucket_id=group_id,
        chunk_size_factor=group.chunk_size_factor,
        mem_alloc_context=self.mem_alloc_context,
        **main_buf_extra_kwargs,
    )
```
but some code paths do require `group.model_weight_buffer` (mbuf), for example:

  1. swilgu's checkpoint saving uses `fsdp_slice` in

     ```python
     fsdp_slice = dist_param.megatron_fsdp_slice
     ```

     which is defined at

     ```python
     mbuf = pg.model_weight_buffer
     if mbuf:
         _start, _end = mbuf._get_item_slice_in_shard(item_id)
         setattr(dist_param, "megatron_fsdp_slice", slice(_start, _end))
     ```

     But `mbuf` is None, so `megatron_fsdp_slice` is never set. This is the bug.

  2. When using no_shard without overlap_param_gather, the code steps into the else branch here:

     ```python
     if not force_sync and self.ddp_config.overlap_param_gather:
         # All-gather the first bucket before the forward pass.
         if self.ddp_config.fsdp_all_gather_in_start_param_sync:
             first_param = list(self.module.parameters())[0]
             self.all_gather_and_wait_parameters_ready(
                 params=[first_param], prefetch=True, wait_bucket_ready=False
             )
     else:
         self.synchronize_param_gather()
         for bucket_id in range(self.all_gather_pipeline.num_buckets):
             self.all_gather_pipeline.async_bucket_gather(bucket_id=bucket_id, bwd=False)
             group = self.param_and_grad_buffer.parameter_groups[bucket_id]
             if group.model_weight_buffer is None:
                 continue
             if group.model_weight_buffer.is_data_distributed:
                 # If model weight is sharded, we wait for the all-gather to complete and
                 # then release the bucket immediately to save memory usage.
                 self.all_gather_pipeline.wait_bucket_ready(bucket_id, False)
         for bucket_id in range(self.all_gather_pipeline.num_buckets):
             self.all_gather_pipeline.wait_bucket_ready(bucket_id, False)
     ```

     `param.mbuf` being None then raises an error in

     ```python
     wbuf = self.get_fsdp_buffer(bucket_id, bwd)
     # Lazy release the unused buckets.
     self.recycle_unused_buckets()
     # Allocate an empty bucket to store the module weights.
     bucket = wbuf.fetch_bucket(set_param_data=True)
     ```
cc @shjwudp @cspades

@wplf
Copy link
Member Author

wplf commented Mar 5, 2026

For bug 2, I think that when mfsdp_strategy is "no_shard", we can skip param_gather entirely, because the optimizer is not sharded and all params on the local rank are already valid. So we no longer need to gather from other DDP ranks.

[screenshot of the proposed change]

The same logic already exists here:

```python
if self.data_parallel_sharding_strategy == "no_shard":
    return
```

Does this work for the condition?
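The proposed skip could look like the sketch below. The class, method, and counter are assumptions modeled on the snippets quoted above, not the exact Megatron-FSDP code; the point is only that the early return makes the gather path unreachable under "no_shard":

```python
class DdpSketch:
    def __init__(self, strategy):
        self.data_parallel_sharding_strategy = strategy
        self.gathers = 0  # counts invocations of the (stand-in) gather path

    def start_param_sync(self, force_sync=False):
        # Proposed guard: with "no_shard", every rank already holds full,
        # valid parameters, so there is nothing to all-gather.
        if self.data_parallel_sharding_strategy == "no_shard":
            return
        self._all_gather_params()

    def _all_gather_params(self):
        self.gathers += 1  # stand-in for the real all-gather pipeline


m = DdpSketch("no_shard")
m.start_param_sync()
print(m.gathers)  # 0 -- gather skipped entirely

sharded = DdpSketch("optim_grads_params")
sharded.start_param_sync()
print(sharded.gathers)  # 1 -- sharded strategies still gather
```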
cc @Phlip79 @cspades @shjwudp
