
Modify mfsdp default data-parallel-sharding-strategy#3691

Draft
wplf wants to merge 2 commits into NVIDIA:main from wplf:mfsdp-fix

Conversation

@wplf
Copy link
Member

@wplf wplf commented Mar 4, 2026

PR for dev branch #3692

What does this PR do ?

The default value "no_shard" is not compatible with many APIs. We can change this to "optim_grads_params".
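The change amounts to flipping the default of the sharding-strategy knob. A minimal sketch of the idea (the dataclass and field below are simplified stand-ins, not Megatron-FSDP's actual config class):

```python
from dataclasses import dataclass


@dataclass
class FSDPConfigSketch:
    # "no_shard" keeps full params/grads/optimizer state on every rank, but
    # leaves model_weight_buffer unset, which several code paths assume exists.
    # "optim_grads_params" shards all three (the ZeRO-3-style behavior), which
    # is why this PR proposes it as the default.
    data_parallel_sharding_strategy: str = "optim_grads_params"


cfg = FSDPConfigSketch()
print(cfg.data_parallel_sharding_strategy)  # optim_grads_params
```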

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 4, 2026 08:52
@Phlip79
Copy link
Member

Phlip79 commented Mar 4, 2026

Can you explain why you are making this change?

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:45
@wplf
Copy link
Member Author

wplf commented Mar 5, 2026

Can you explain why you are making this change?

Hi, I've found that when using no_shard, `group.model_weight_buffer` will not be created in

```python
if data_parallel_sharding_strategy != "no_shard":
    group.model_weight_buffer = DataParallelBuffer(
        self.ddp_config,
        group.params,
        is_data_distributed=is_model_weight_buffer_distributed
        and model_wbuf_dp_group.size() > 1,
        dtype=param_dtype,
        device=self.device,
        # Note: This will be DP-Outer + DP-Shard when sharding
        # the optimizer state in HFSDP, else just DP-Shard when
        # using basic HSDP or FSDP.
        data_parallel_group=model_wbuf_dp_group,
        is_transpose_buffer=False,
        temporary_bucket_allocator=self.weight_alloc,
        bucket_id=group_id,
        chunk_size_factor=group.chunk_size_factor,
        mem_alloc_context=self.mem_alloc_context,
        **main_buf_extra_kwargs,
    )
```
but some code paths do require `group.model_weight_buffer` (mbuf), for example:

  1. swilgu's checkpoint saving uses `fsdp_slice` in

     ```python
     fsdp_slice = dist_param.megatron_fsdp_slice
     ```

     which is defined at

     ```python
     mbuf = pg.model_weight_buffer
     if mbuf:
         _start, _end = mbuf._get_item_slice_in_shard(item_id)
         setattr(dist_param, "megatron_fsdp_slice", slice(_start, _end))
     ```

     But `mbuf` is None, so `megatron_fsdp_slice` is never set. This is the bug.

  2. When using no_shard without overlap_param_gather, the code steps into the else branch here:

     ```python
     if not force_sync and self.ddp_config.overlap_param_gather:
         # All-gather the first bucket before the forward pass.
         if self.ddp_config.fsdp_all_gather_in_start_param_sync:
             first_param = list(self.module.parameters())[0]
             self.all_gather_and_wait_parameters_ready(
                 params=[first_param], prefetch=True, wait_bucket_ready=False
             )
     else:
         self.synchronize_param_gather()
         for bucket_id in range(self.all_gather_pipeline.num_buckets):
             self.all_gather_pipeline.async_bucket_gather(bucket_id=bucket_id, bwd=False)
             group = self.param_and_grad_buffer.parameter_groups[bucket_id]
             if group.model_weight_buffer is None:
                 continue
             if group.model_weight_buffer.is_data_distributed:
                 # If model weight is sharded, we wait for the all-gather to complete and
                 # then release the bucket immediately to save memory usage.
                 self.all_gather_pipeline.wait_bucket_ready(bucket_id, False)
         for bucket_id in range(self.all_gather_pipeline.num_buckets):
             self.all_gather_pipeline.wait_bucket_ready(bucket_id, False)
     ```

     `param.mbuf` being None then raises an error in

     ```python
     wbuf = self.get_fsdp_buffer(bucket_id, bwd)
     # Lazy release the unused buckets.
     self.recycle_unused_buckets()
     # Allocate an empty bucket to store the module weights.
     bucket = wbuf.fetch_bucket(set_param_data=True)
     ```
cc @shjwudp @cspades

@wplf
Copy link
Member Author

wplf commented Mar 5, 2026

For bug 2, I think that when mfsdp_strategy is "no_shard", we can skip param_gather entirely, because the optimizer is not sharded and all params on the local rank are already valid. So we no longer need to gather from other DDP ranks.

[screenshot of the proposed change]

The same logic already exists here:

```python
if self.data_parallel_sharding_strategy == "no_shard":
    return
```

Does this work for the condition?
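The proposed skip could look like the sketch below. The class, method, and counter are assumptions modeled on the snippets quoted above, not the exact Megatron-FSDP code; the point is only that the early return makes the gather path unreachable under "no_shard":

```python
class DdpSketch:
    def __init__(self, strategy):
        self.data_parallel_sharding_strategy = strategy
        self.gathers = 0  # counts invocations of the (stand-in) gather path

    def start_param_sync(self, force_sync=False):
        # Proposed guard: with "no_shard", every rank already holds full,
        # valid parameters, so there is nothing to all-gather.
        if self.data_parallel_sharding_strategy == "no_shard":
            return
        self._all_gather_params()

    def _all_gather_params(self):
        self.gathers += 1  # stand-in for the real all-gather pipeline


m = DdpSketch("no_shard")
m.start_param_sync()
print(m.gathers)  # 0 -- gather skipped entirely

sharded = DdpSketch("optim_grads_params")
sharded.start_param_sync()
print(sharded.gathers)  # 1 -- sharded strategies still gather
```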
cc @Phlip79 @cspades @shjwudp
