[misc] fix: Improve compare.py robustness for multi-GPU and vocab-padded models #2646
Conversation
- Fix NCCL broadcast dtype mismatch by converting logits to float32 before fallback tensor creation
- Handle Megatron vocab-size padding by truncating logits to HF vocab size before comparison, removing the shape-mismatch branch
- Simplify vlm_forward_step return (let caller handle tuple unpacking)
- Remove unused gc import and grad_scale_func workaround
- Add barrier after HF broadcast for synchronization safety
- Simplify rank-0 guard to only check TP rank (EP rank check unnecessary)

Verified on Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
/ok to test 2be9ee7
📝 Walkthrough

This pull request refactors the model comparison logic in examples/conversion/compare_hf_and_megatron/compare.py.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Line 321: vlm_forward_step currently returns model(**forward_args) which may
be a tuple (e.g., (output_tensor, loss_mask)) and breaks downstream tensor ops;
update vlm_forward_step to detect if model(**forward_args) is a tuple or list
and extract the primary output tensor (e.g., first element) before returning, so
return (output_tensor, loss_func) instead of the raw tuple; reference the call
site model(**forward_args) and the returned loss_func when making this change.
- Around line 799-818: Add an explicit vocab-size compatibility guard before
truncating megatron_logits: compute hf_vocab_size = hf_logits.shape[0] then
check megatron_logits.size(0) >= hf_vocab_size (using megatron_logits.size(0) or
.shape[0]) and if not raise a descriptive ValueError (e.g., "Megatron logits
vocab smaller than HF vocab: megatron_vocab=..., hf_vocab=...") to fail fast;
keep the existing truncation into megatron_logits_cmp and subsequent comparisons
(hf_next_token, megatron_next_token_cmp, diff, cosine_similarity,
SIMILARITY_THRESHOLD) unchanged when the check passes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f4f9fd0d-5961-41af-bb43-445481a458c2
📒 Files selected for processing (1)
examples/conversion/compare_hf_and_megatron/compare.py
```diff
-    output_tensor = model_output
-    return output_tensor, loss_func
+    return model(**forward_args), loss_func
```
Handle tuple model outputs before returning from vlm_forward_step.
At Line 321, returning raw model(**forward_args) can propagate a tuple output (e.g., (output_tensor, loss_mask)), while downstream code assumes a tensor and will fail on tensor ops/indexing.
Proposed fix

```diff
-    return model(**forward_args), loss_func
+    model_output = model(**forward_args)
+    if isinstance(model_output, tuple):
+        output_tensor, _ = model_output
+    else:
+        output_tensor = model_output
+    return output_tensor, loss_func
```
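The normalization pattern the review asks for can be exercised in isolation. This is a minimal sketch; `normalize_model_output` is a hypothetical helper name, and the real fix would inline this logic inside `vlm_forward_step` rather than add a function.

```python
def normalize_model_output(model_output):
    """Return the primary output from a forward pass whose result may be
    either a bare tensor or an (output, loss_mask)-style tuple/list.

    Hypothetical helper illustrating the reviewer's suggestion.
    """
    if isinstance(model_output, (tuple, list)):
        # Downstream code expects a single tensor, so keep only the first
        # element and drop auxiliary outputs such as a loss mask.
        return model_output[0]
    return model_output
```

The caller then behaves identically whether the model returns a tensor or a tuple, which is exactly the property the downstream tensor ops rely on.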
```diff
+    # Megatron may pad vocab_size for GPU kernel efficiency — truncate
+    # to the HF vocab size so logits are directly comparable.
+    hf_vocab_size = hf_logits.shape[0]
+    megatron_logits_cmp = megatron_logits[:hf_vocab_size]
+    megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)

     # Compare outputs
     print("=== COMPARISON ===")
-    token_match = hf_next_token.item() == megatron_next_token.item()
+    token_match = hf_next_token.item() == megatron_next_token_cmp.item()
     token_status_emoji = "✅" if token_match else "❌"
     print(f"Token match: {token_match} {token_status_emoji}")

-    # Compare logits if shapes match
-    if hf_logits.shape == megatron_logits.shape:
-        diff = (hf_logits - megatron_logits).abs()
-        print(f"Logits diff - max: {diff.max():.6f}, mean: {diff.mean():.6f}")
-        cosine_sim = torch.cosine_similarity(hf_logits.unsqueeze(0), megatron_logits.unsqueeze(0))
-        cos_val = cosine_sim.item()
-        percent = cos_val * 100.0
-        status_emoji = "✅" if cos_val >= SIMILARITY_THRESHOLD else "❌"
-        tolerance_text = "within ±2%" if cos_val >= SIMILARITY_THRESHOLD else "outside ±2%"
-        print(
-            f"Cosine similarity: {cos_val:.6f} ({percent:.2f}%) {status_emoji} ({tolerance_text} tolerance)"
-        )
-    else:
-        print(f"Shape mismatch: HF {hf_logits.shape} vs Megatron {megatron_logits.shape}")
-        print("Cannot compare logits directly due to shape mismatch")
+    diff = (hf_logits - megatron_logits_cmp).abs()
+    print(f"Logits diff - max: {diff.max():.6f}, mean: {diff.mean():.6f}")
+    cosine_sim = torch.cosine_similarity(hf_logits.unsqueeze(0), megatron_logits_cmp.unsqueeze(0))
+    cos_val = cosine_sim.item()
+    percent = cos_val * 100.0
+    status_emoji = "✅" if cos_val >= SIMILARITY_THRESHOLD else "❌"
+    tolerance_text = "within ±2%" if cos_val >= SIMILARITY_THRESHOLD else "outside ±2%"
+    print(f"Cosine similarity: {cos_val:.6f} ({percent:.2f}%) {status_emoji} ({tolerance_text} tolerance)")
```
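The similarity check in the comparison code can be reproduced without torch. A plain-Python sketch over two logit vectors; the 0.98 threshold is an assumption inferred from the "within ±2%" wording, since SIMILARITY_THRESHOLD's actual value is not shown in this diff.

```python
import math

SIMILARITY_THRESHOLD = 0.98  # assumed value matching the "within ±2%" wording

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, mirroring the
    torch.cosine_similarity call in the comparison code."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def logits_close(hf_logits, megatron_logits):
    """True when the two logit vectors pass the similarity threshold."""
    return cosine_similarity(hf_logits, megatron_logits) >= SIMILARITY_THRESHOLD
```

Note that cosine similarity is scale-invariant, which is why it tolerates the small per-element differences that remain after HF/Megatron conversion.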
Add an explicit vocab-size compatibility guard before truncation.
At Line 802, truncation assumes Megatron logits are at least HF vocab length. If Megatron vocab is smaller, Line 811 still fails later with a shape mismatch; fail fast with a clear error.
Proposed fix

```diff
     hf_vocab_size = hf_logits.shape[0]
+    if megatron_logits.shape[0] < hf_vocab_size:
+        raise ValueError(
+            "Incompatible vocab sizes: "
+            f"Megatron logits ({megatron_logits.shape[0]}) < HF logits ({hf_vocab_size}). "
+            "Ensure both models use the same tokenizer/vocab."
+        )
     megatron_logits_cmp = megatron_logits[:hf_vocab_size]
     megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)
```

Based on learnings: when a path is unsupported, raise an explicit, descriptive error instead of failing later with an implicit runtime mismatch.
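The guard-then-truncate behaviour proposed above can be exercised standalone. Lists stand in for 1-D logit tensors, and `truncate_to_hf_vocab` is an illustrative name, not the shape the actual fix would take (which inlines the check).

```python
def truncate_to_hf_vocab(megatron_logits, hf_vocab_size):
    """Fail fast when the Megatron vocab is smaller than the HF vocab,
    otherwise truncate the padded tail so the logits are comparable."""
    if len(megatron_logits) < hf_vocab_size:
        raise ValueError(
            f"Incompatible vocab sizes: Megatron logits ({len(megatron_logits)}) "
            f"< HF logits ({hf_vocab_size}). "
            "Ensure both models use the same tokenizer/vocab."
        )
    return megatron_logits[:hf_vocab_size]
```

Padding only ever makes the Megatron vocab larger, so a smaller Megatron vocab signals a tokenizer mismatch rather than kernel-efficiency padding, and raising early gives a clearer error than a later shape mismatch.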
```diff
-    if (
-        parallel_state.get_tensor_model_parallel_rank() == 0
-        and parallel_state.get_expert_model_parallel_rank() == 0
-    ):
+    if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
```
```diff
-if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
+if not torch.distributed.is_initialized() or (parallel_state.get_tensor_model_parallel_rank() == 0 and parallel_state.get_expert_model_parallel_rank() == 0):
```
a small suggestion for EP>1 runs so that output is not garbled
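The effect of this suggestion can be sketched with plain arguments standing in for `torch.distributed.is_initialized()` and Megatron's `parallel_state` rank queries; the function name and signature here are illustrative, not from compare.py.

```python
def should_emit_output(dist_initialized, tp_rank, ep_rank):
    """Decide whether this process should print comparison output.

    Mirrors the reviewer's suggestion: single-process runs always print;
    distributed runs print only on the rank that is first in BOTH the
    tensor-parallel and expert-parallel groups, so EP>1 runs do not
    produce interleaved duplicate output.
    """
    if not dist_initialized:
        return True
    return tp_rank == 0 and ep_rank == 0
```

With the TP-only guard from the PR, every EP rank with `tp_rank == 0` would print, which is exactly the garbling the reviewer is flagging for EP>1 runs.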
Summary
- Convert logits to `float32` before the fallback tensor creation path, ensuring all ranks use the same dtype
- Simplify the `vlm_forward_step` return; remove the unused `gc` import, the `grad_scale_func` workaround, and the unnecessary EP rank guard
- Add `torch.distributed.barrier()` after the HF results broadcast for synchronization safety

Test plan
Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Made with Cursor