
[misc] fix: Improve compare.py robustness for multi-GPU and vocab-padded models#2647

Merged
ko3n1g merged 1 commit into r0.3.0 from yuya/cherry-pick-compare-fix-r0.3.0 on Mar 4, 2026

Conversation

yaoyu-33 (Contributor) commented Mar 4, 2026

Summary

Cherry-pick of #2646 to r0.3.0.

  • Fix NCCL broadcast dtype mismatch by converting HF logits to float32 before the fallback tensor creation path, ensuring all ranks use the same dtype
  • Handle Megatron vocab-size padding by truncating logits to HF vocab size before comparison, eliminating the shape-mismatch code path
  • Simplify the vlm_forward_step return, remove the unused gc import, and drop the unnecessary EP rank guard
  • Add torch.distributed.barrier() after HF results broadcast for synchronization safety
  • Disable mtp_num_layers for inference to prevent NCCL hangs
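
The first two fixes above (cast to float32, then truncate to the HF vocab size before comparing) can be sketched as a minimal standalone snippet. The `compare_logits` helper below is hypothetical — it is not compare.py's actual API, just an illustration of the truncate-then-compare logic under the assumption that Megatron pads its vocabulary dimension:

```python
import torch
import torch.nn.functional as F

def compare_logits(hf_logits, megatron_logits):
    """Hypothetical helper sketching the fix: cast both sides to float32,
    truncate Megatron's padded vocab to the HF vocab size, then compare."""
    hf = hf_logits.float()
    hf_vocab = hf.shape[-1]
    # Megatron may pad vocab_size for GPU kernel efficiency; drop the padding
    meg = megatron_logits.float()[..., :hf_vocab]
    token_match = torch.argmax(hf, dim=-1).equal(torch.argmax(meg, dim=-1))
    cos = F.cosine_similarity(hf.flatten(), meg.flatten(), dim=0).item()
    return token_match, cos

# Simulate a Megatron vocab padded from 1000 to 1024 entries
hf = torch.randn(1000)
meg = torch.cat([hf, torch.zeros(24)])
match, cos = compare_logits(hf, meg)
```

Because the padding entries are dropped before `argmax`, a padded index can never win the next-token comparison, which is what eliminated the old shape-mismatch branch.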

Test plan

  • Verified on remote cluster with Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%
  • CI passes

Made with Cursor

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved CUDA memory management and distributed synchronization for reliable model comparisons
    • Fixed vocabulary size mismatches between models
  • New Features

    • Enhanced diagnostics with cosine similarity and logits difference metrics for comparison workflows
    • Improved status logging for comparison steps

…ded models

- Fix NCCL broadcast dtype mismatch by converting logits to float32 before
  fallback tensor creation
- Handle Megatron vocab-size padding by truncating logits to HF vocab size
  before comparison, removing shape-mismatch branch
- Simplify vlm_forward_step return (let caller handle tuple unpacking)
- Remove unused gc import and grad_scale_func workaround
- Add barrier after HF broadcast for synchronization safety
- Simplify rank-0 guard to only check TP rank (EP rank check unnecessary)

Verified on Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 (Contributor, Author) commented Mar 4, 2026

/ok to test 35ca667

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Modifications to the HuggingFace and Megatron model comparison script include: initializing per-component Megatron config, managing CUDA cache explicitly, improving broadcast synchronization across ranks with explicit tensor initialization, truncating Megatron logits to match HuggingFace vocabulary size, and updating comparison logic to evaluate truncated logits with cosine similarity metrics.

Changes

Cohort: Comparison Script Enhancement
File(s): examples/conversion/compare_hf_and_megatron/compare.py
Summary: Set mtp_num_layers = None for each Megatron model component; add explicit CUDA cache clearing after HF inference; ensure HF logits are cast to float before broadcast synchronization; truncate Megatron logits to the HF vocabulary size for direct comparison; compute logits-difference and cosine-similarity metrics between HF and truncated Megatron logits; add synchronization barriers and status logging.
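
The "explicit tensor initialization" part of the broadcast fix can be sketched as a small helper (the name `prepare_hf_logits` is hypothetical, not the script's API). NCCL broadcast requires every participating rank to supply a tensor of identical dtype and element count, so rank 0 casts its bfloat16 model output to float32 while the other ranks allocate a float32 placeholder of the agreed size:

```python
import torch

def prepare_hf_logits(hf_logits, fallback_vocab_size, device="cpu"):
    """Hypothetical sketch of the dtype fix: before torch.distributed.broadcast,
    every rank must hold a tensor with the same dtype and number of elements."""
    if hf_logits is not None:
        # Rank 0: cast the HF model's bfloat16 logits to float32
        return hf_logits.float()
    # Non-zero ranks: allocate a float32 placeholder to receive the broadcast
    return torch.zeros(fallback_vocab_size, device=device, dtype=torch.float32)
```

Without the cast, rank 0 would broadcast bfloat16 data into float32 buffers on the other ranks, corrupting the received logits or hanging the collective.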

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Run CICD, cherry-pick, r0.3.0

Suggested reviewers

  • cuichenx
🚥 Pre-merge checks | ✅ 4 passed
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately reflects the main changes: fixing robustness issues in compare.py for multi-GPU scenarios and vocab-padded models, which directly align with the changeset's core modifications.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes ✅ Passed — The PR includes test results on the Qwen/Qwen3-0.6B model showing token match and 99.99% cosine similarity, demonstrating no regression while fixing multi-GPU and vocab-padding issues.



coderabbitai bot left a review comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/conversion/compare_hf_and_megatron/compare.py (1)

728-745: ⚠️ Potential issue | 🟠 Major

Synchronize logits length before broadcast to avoid rank-size mismatches.

hf_logits on rank 0 is model-output-sized, but fallback allocation on other ranks is tokenizer-sized. If those lengths differ, broadcast can fail or hang.

🔧 Proposed fix
     if torch.distributed.is_initialized():
         # Ensure consistent dtype across ranks: rank 0 has bfloat16 logits from the HF model,
         # so all ranks must use the same dtype for NCCL broadcast to work correctly.
         if hf_logits is not None:
             hf_logits = hf_logits.float()

+        # Broadcast logits length first so every rank allocates identically sized tensors.
+        hf_vocab_size_tensor = torch.zeros(1, device=input_ids.device, dtype=torch.long)
+        if hf_logits is not None:
+            hf_vocab_size_tensor[0] = hf_logits.numel()
+        dist.broadcast(hf_vocab_size_tensor, src=0)
+        hf_vocab_size = int(hf_vocab_size_tensor.item())
+
         if hf_next_token is None:
-            hf_next_token = torch.zeros(1, device=input_ids.device, dtype=torch.long)
-        if hf_logits is None:
-            vocab_size = getattr(
-                tokenizer, "vocab_size", len(tokenizer.vocab) if hasattr(tokenizer, "vocab") else 32000
-            )
-            hf_logits = torch.zeros(vocab_size, device=input_ids.device, dtype=torch.float32)
+            hf_next_token = torch.zeros((), device=input_ids.device, dtype=torch.long)
+        if hf_logits is None or hf_logits.numel() != hf_vocab_size:
+            hf_logits = torch.zeros(hf_vocab_size, device=input_ids.device, dtype=torch.float32)

         # Broadcast from rank 0 to all ranks
         torch.distributed.broadcast(hf_next_token, 0)
         torch.distributed.broadcast(hf_logits, 0)
         torch.distributed.barrier()

Run this read-only check to confirm the current shape source mismatch path:

#!/bin/bash
set -euo pipefail

FILE="$(fd -a "compare.py" | rg "examples/conversion/compare_hf_and_megatron/compare.py$" | head -n1)"

echo "== Rank-0 logits source =="
nl -ba "$FILE" | sed -n '537,543p'

echo "== Fallback allocation + broadcast block =="
nl -ba "$FILE" | sed -n '733,745p'

echo "== Search for explicit logits-length sync before logits broadcast =="
rg -n 'hf_logits\.shape|hf_logits\.numel|vocab_size|broadcast\(' "$FILE"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 728 -
745, hf_logits can have different lengths on rank 0 (model output) vs other
ranks (tokenizer fallback), which breaks torch.distributed.broadcast; ensure all
ranks have the same hf_logits shape before broadcasting by resizing or
reallocating hf_logits on non-zero ranks to match hf_logits.shape on rank 0.
Specifically, use a small two-step sync: first broadcast an integer length
(e.g., logits_len tensor) from rank 0, then on non-zero ranks reallocate
hf_logits = torch.zeros(logits_len, device=input_ids.device,
dtype=torch.float32) (or resize the existing tensor) before calling
torch.distributed.broadcast(hf_logits, 0); do the same length sync for
hf_next_token if its size can vary. Reference symbols: hf_logits, hf_next_token,
tokenizer, broadcast.
🧹 Nitpick comments (1)
examples/conversion/compare_hf_and_megatron/compare.py (1)

802-821: Align printed Megatron diagnostics with truncated logits.

Comparison uses megatron_logits_cmp, but printed next-token/top-5 stats are still from untruncated logits. Reporting from the same truncated tensor will make debug output consistent with pass/fail logic.

♻️ Suggested refactor
-            megatron_logits = megatron_output[0, -1, :]
-            megatron_next_token = torch.argmax(megatron_logits, dim=-1)
+            megatron_logits = megatron_output[0, -1, :]

             if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
                 print(f"Megatron output shape: {megatron_output.shape}")
                 print(f"Megatron logits stats - mean: {megatron_logits.mean():.4f}, std: {megatron_logits.std():.4f}")
-                print(
-                    f"Megatron next token: {megatron_next_token.item()} ('{tokenizer.decode([megatron_next_token.item()])}')"
-                )
-
-                # Show top 5 tokens
-                top5_vals, top5_ids = torch.topk(megatron_logits, 5)
-                top5_tokens = [tokenizer.decode([idx]) for idx in top5_ids]
-                print(f"Megatron Top 5: {list(zip(top5_tokens, top5_vals.tolist()))}")
 
                 # Megatron may pad vocab_size for GPU kernel efficiency — truncate
                 # to the HF vocab size so logits are directly comparable.
                 hf_vocab_size = hf_logits.shape[0]
                 megatron_logits_cmp = megatron_logits[:hf_vocab_size]
                 megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)
+
+                print(
+                    "Megatron next token (truncated): "
+                    f"{megatron_next_token_cmp.item()} ('{tokenizer.decode([megatron_next_token_cmp.item()])}')"
+                )
+                top5_vals, top5_ids = torch.topk(megatron_logits_cmp, 5)
+                top5_tokens = [tokenizer.decode([idx]) for idx in top5_ids]
+                print(f"Megatron Top 5 (truncated): {list(zip(top5_tokens, top5_vals.tolist()))}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 802 -
821, The printed Megatron diagnostics are still using the untruncated logits
while the comparison logic uses megatron_logits_cmp; update all diagnostic
calculations and prints to use megatron_logits_cmp (and its derived
megatron_next_token_cmp and any top-k computations) so the reported next-token,
top-5, and similarity stats are computed from the same truncated tensor used for
pass/fail checks (variables to update: megatron_logits -> megatron_logits_cmp,
megatron_next_token -> megatron_next_token_cmp, and any topk/indexing derived
from megatron_logits).
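
The intent of that refactor can be sketched with a hypothetical helper that derives every printed diagnostic from the already-truncated tensor, so the reported next-token and top-k can never disagree with the pass/fail comparison (the helper name and signature are illustrative, not code from compare.py):

```python
import torch

def truncated_diagnostics(megatron_logits, hf_vocab_size, k=5):
    """Hypothetical helper: compute next-token and top-k from the SAME
    truncated tensor used for the pass/fail comparison, so printed
    diagnostics stay consistent with the comparison result."""
    logits = megatron_logits[:hf_vocab_size]  # drop Megatron vocab padding
    next_token = int(torch.argmax(logits))
    topk_vals, topk_ids = torch.topk(logits, k)
    return next_token, list(zip(topk_ids.tolist(), topk_vals.tolist()))
```

If a padding slot happened to hold a large value, diagnostics computed from the untruncated tensor would report a next-token that the comparison logic never sees; truncating first removes that inconsistency.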

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2f198d60-86a7-4f8c-848b-7bd2a8b43bc5

📥 Commits

Reviewing files that changed from the base of the PR and between aea0cfd and 35ca667.

📒 Files selected for processing (1)
  • examples/conversion/compare_hf_and_megatron/compare.py

@yaoyu-33 yaoyu-33 added the docs-only label ("With great power comes great responsibility.") on Mar 4, 2026
@ko3n1g ko3n1g merged commit c10d427 into r0.3.0 Mar 4, 2026
55 checks passed
@ko3n1g ko3n1g deleted the yuya/cherry-pick-compare-fix-r0.3.0 branch March 4, 2026 22:07