
[misc] fix: Improve compare.py robustness for multi-GPU and vocab-padded models #2646

Open

yaoyu-33 wants to merge 1 commit into main from yuya/fix-compare-script-robustness
Conversation

yaoyu-33 (Contributor) commented Mar 4, 2026

Summary

  • Fix an NCCL broadcast dtype mismatch by converting HF logits to float32 before the fallback tensor creation path, ensuring all ranks use the same dtype
  • Handle Megatron vocab-size padding by truncating logits to the HF vocab size before comparison, eliminating the shape-mismatch code path
  • Simplify the vlm_forward_step return; remove the unused gc import, the grad_scale_func workaround, and the unnecessary EP rank guard
  • Add torch.distributed.barrier() after the HF results broadcast for synchronization safety
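The dtype and barrier fixes described above can be sketched roughly as follows. This is an illustrative sketch, not the actual compare.py code: the function name `broadcast_hf_logits` and the rank-0/fallback layout are assumptions for demonstration.

```python
import torch
import torch.distributed as dist

def broadcast_hf_logits(hf_logits, shape, src=0):
    """Broadcast HF logits from the src rank to all ranks (illustrative sketch)."""
    if hf_logits is not None:
        # The src rank normalizes its HF logits (possibly bf16/fp16) to float32.
        buf = hf_logits.float()
    else:
        # Non-src ranks allocate a fallback tensor with the SAME dtype the src
        # rank sends; a dtype mismatch here is what breaks the NCCL broadcast.
        buf = torch.empty(shape, dtype=torch.float32)
    if dist.is_initialized():
        dist.broadcast(buf, src=src)
        dist.barrier()  # keep all ranks in lockstep after the broadcast
    return buf

# Single-process demonstration of the dtype normalization (no NCCL needed):
out = broadcast_hf_logits(torch.randn(4, dtype=torch.bfloat16), (4,))
```

In a real multi-rank run, rank 0 would pass the HF logits and every other rank would pass None with the agreed shape.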

Test plan

  • Verified on remote cluster with Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%
  • CI passes

Made with Cursor

Summary by CodeRabbit

  • Refactor
    • Enhanced model comparison utility to improve accuracy and consistency across distributed training scenarios. Simplified forward-output handling and synchronized results across compute ranks. Updated comparison metrics calculation for more reliable model alignment validation.

[misc] fix: Improve compare.py robustness for multi-GPU and vocab-padded models

- Fix NCCL broadcast dtype mismatch by converting logits to float32 before
  fallback tensor creation
- Handle Megatron vocab-size padding by truncating logits to HF vocab size
  before comparison, removing shape-mismatch branch
- Simplify vlm_forward_step return (let caller handle tuple unpacking)
- Remove unused gc import and grad_scale_func workaround
- Add barrier after HF broadcast for synchronization safety
- Simplify rank-0 guard to only check TP rank (EP rank check unnecessary)

Verified on Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 (Contributor, Author) commented Mar 4, 2026

/ok to test 2be9ee7

coderabbitai bot (Contributor) commented Mar 4, 2026

📝 Walkthrough

Walkthrough

This pull request refactors model comparison logic in compare.py to simplify output handling, normalize dtypes for NCCL broadcasts, truncate Megatron logits to HF vocabulary size, and broaden broadcast conditions. Changes ensure dtype consistency across distributed ranks and improve metric computation for direct model comparability.

Changes

Cohort / File(s) Summary
Model Comparison Logic
examples/conversion/compare_hf_and_megatron/compare.py
Simplifies vlm_forward_step output handling; removes grad_scale_func workaround from Megatron initialization; adds dtype normalization and synchronization barriers before/after HF-to-Megatron broadcast; truncates Megatron logits to HF vocabulary size; relaxes distributed logging condition to tensor-model-parallel rank 0; reworks comparison metrics to use truncated logits for cosine similarity and difference calculations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

Run CICD

Suggested reviewers

  • cuichenx
🚥 Pre-merge checks: ✅ 4 passed
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title is partially related to the changeset. It references improving robustness for multi-GPU and vocab-padded models, which are real aspects of the changes, but does not highlight the primary technical improvements such as dtype normalization, logits truncation, and simplified return handling.
  • Docstring Coverage ✅ Passed: docstring coverage is 80.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes ✅ Passed: the PR makes robustness fixes to a utility/comparison script with documented test results (Qwen/Qwen3-0.6B: token match confirmed, cosine similarity 99.99%) demonstrating no numeric regressions.


coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Line 321: vlm_forward_step currently returns model(**forward_args) which may
be a tuple (e.g., (output_tensor, loss_mask)) and breaks downstream tensor ops;
update vlm_forward_step to detect if model(**forward_args) is a tuple or list
and extract the primary output tensor (e.g., first element) before returning, so
return (output_tensor, loss_func) instead of the raw tuple; reference the call
site model(**forward_args) and the returned loss_func when making this change.
- Around line 799-818: Add an explicit vocab-size compatibility guard before
truncating megatron_logits: compute hf_vocab_size = hf_logits.shape[0] then
check megatron_logits.size(0) >= hf_vocab_size (using megatron_logits.size(0) or
.shape[0]) and if not raise a descriptive ValueError (e.g., "Megatron logits
vocab smaller than HF vocab: megatron_vocab=..., hf_vocab=...") to fail fast;
keep the existing truncation into megatron_logits_cmp and subsequent comparisons
(hf_next_token, megatron_next_token_cmp, diff, cosine_similarity,
SIMILARITY_THRESHOLD) unchanged when the check passes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f4f9fd0d-5961-41af-bb43-445481a458c2

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb1af6 and 2be9ee7.

📒 Files selected for processing (1)
  • examples/conversion/compare_hf_and_megatron/compare.py

Code context (compare.py, line 321):

    return model(**forward_args), loss_func
⚠️ Potential issue | 🟠 Major

Handle tuple model outputs before returning from vlm_forward_step.

At Line 321, returning raw model(**forward_args) can propagate a tuple output (e.g., (output_tensor, loss_mask)), while downstream code assumes a tensor and will fail on tensor ops/indexing.

Proposed fix
-    return model(**forward_args), loss_func
+    model_output = model(**forward_args)
+    if isinstance(model_output, tuple):
+        output_tensor, _ = model_output
+    else:
+        output_tensor = model_output
+    return output_tensor, loss_func
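The unwrap pattern proposed above can be exercised in isolation. A minimal sketch, with `unwrap_model_output` as a hypothetical helper name (the real fix lives inside vlm_forward_step):

```python
import torch

def unwrap_model_output(model_output):
    # Some VLM forward passes return (output_tensor, loss_mask); downstream
    # code expects a plain tensor, so keep only the primary output.
    if isinstance(model_output, (tuple, list)):
        return model_output[0]
    return model_output

logits = torch.ones(2, 3)
# Tuple output and bare-tensor output both yield the same tensor.
assert unwrap_model_output((logits, None)) is logits
assert unwrap_model_output(logits) is logits
```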

Comment on lines +799 to +818
+                # Megatron may pad vocab_size for GPU kernel efficiency — truncate
+                # to the HF vocab size so logits are directly comparable.
+                hf_vocab_size = hf_logits.shape[0]
+                megatron_logits_cmp = megatron_logits[:hf_vocab_size]
+                megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)

                 # Compare outputs
                 print("=== COMPARISON ===")
-                token_match = hf_next_token.item() == megatron_next_token.item()
+                token_match = hf_next_token.item() == megatron_next_token_cmp.item()
                 token_status_emoji = "✅" if token_match else "❌"
                 print(f"Token match: {token_match} {token_status_emoji}")

-                # Compare logits if shapes match
-                if hf_logits.shape == megatron_logits.shape:
-                    diff = (hf_logits - megatron_logits).abs()
-                    print(f"Logits diff - max: {diff.max():.6f}, mean: {diff.mean():.6f}")
-                    cosine_sim = torch.cosine_similarity(hf_logits.unsqueeze(0), megatron_logits.unsqueeze(0))
-                    cos_val = cosine_sim.item()
-                    percent = cos_val * 100.0
-                    status_emoji = "✅" if cos_val >= SIMILARITY_THRESHOLD else "❌"
-                    tolerance_text = "within ±2%" if cos_val >= SIMILARITY_THRESHOLD else "outside ±2%"
-                    print(
-                        f"Cosine similarity: {cos_val:.6f} ({percent:.2f}%) {status_emoji} ({tolerance_text} tolerance)"
-                    )
-                else:
-                    print(f"Shape mismatch: HF {hf_logits.shape} vs Megatron {megatron_logits.shape}")
-                    print("Cannot compare logits directly due to shape mismatch")
+                diff = (hf_logits - megatron_logits_cmp).abs()
+                print(f"Logits diff - max: {diff.max():.6f}, mean: {diff.mean():.6f}")
+                cosine_sim = torch.cosine_similarity(hf_logits.unsqueeze(0), megatron_logits_cmp.unsqueeze(0))
+                cos_val = cosine_sim.item()
+                percent = cos_val * 100.0
+                status_emoji = "✅" if cos_val >= SIMILARITY_THRESHOLD else "❌"
+                tolerance_text = "within ±2%" if cos_val >= SIMILARITY_THRESHOLD else "outside ±2%"
+                print(f"Cosine similarity: {cos_val:.6f} ({percent:.2f}%) {status_emoji} ({tolerance_text} tolerance)")
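The truncate-then-compare logic in this hunk can be demonstrated standalone on toy-sized vectors. A minimal sketch, assuming small vocab sizes for illustration (`compare_padded_logits` is a hypothetical helper name, not a function in compare.py):

```python
import torch

def compare_padded_logits(hf_logits, megatron_logits):
    # Megatron may pad the vocab dimension for kernel efficiency; truncate to
    # the HF vocab size so the two logit vectors are directly comparable.
    hf_vocab_size = hf_logits.shape[0]
    megatron_cmp = megatron_logits[:hf_vocab_size].float()
    hf = hf_logits.float()
    token_match = torch.argmax(hf).item() == torch.argmax(megatron_cmp).item()
    cos_val = torch.cosine_similarity(hf.unsqueeze(0), megatron_cmp.unsqueeze(0)).item()
    return token_match, cos_val

# Identical logits, with the "Megatron" copy padded by 6 filler entries,
# mimicking vocab padding (real Qwen3 vocabs are ~150k, padded to a multiple).
hf = torch.randn(10)
megatron = torch.cat([hf, torch.full((6,), -1e4)])
match, cos_val = compare_padded_logits(hf, megatron)
# match is True and cos_val is ~1.0, since the unpadded slices are identical.
```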
⚠️ Potential issue | 🟡 Minor

Add an explicit vocab-size compatibility guard before truncation.

At Line 802, truncation assumes Megatron logits are at least HF vocab length. If Megatron vocab is smaller, Line 811 still fails later with a shape mismatch; fail fast with a clear error.

Proposed fix
                 hf_vocab_size = hf_logits.shape[0]
+                if megatron_logits.shape[0] < hf_vocab_size:
+                    raise ValueError(
+                        "Incompatible vocab sizes: "
+                        f"Megatron logits ({megatron_logits.shape[0]}) < HF logits ({hf_vocab_size}). "
+                        "Ensure both models use the same tokenizer/vocab."
+                    )
                 megatron_logits_cmp = megatron_logits[:hf_vocab_size]
                 megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)

Based on learnings: when a path is unsupported, raise an explicit, descriptive error instead of failing later with an implicit runtime mismatch.
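The fail-fast guard suggested above can be sketched as a standalone helper. `truncate_to_hf_vocab` is an illustrative name; the real fix would be inlined at the truncation site in compare.py:

```python
import torch

def truncate_to_hf_vocab(megatron_logits, hf_logits):
    # Fail fast when the Megatron vocab is smaller than the HF vocab, rather
    # than hitting an implicit shape mismatch later in the comparison.
    hf_vocab_size = hf_logits.shape[0]
    if megatron_logits.shape[0] < hf_vocab_size:
        raise ValueError(
            f"Megatron logits vocab smaller than HF vocab: "
            f"megatron_vocab={megatron_logits.shape[0]}, hf_vocab={hf_vocab_size}"
        )
    return megatron_logits[:hf_vocab_size]

# Padded Megatron vocab (12) truncates cleanly to the HF vocab (8);
# a smaller Megatron vocab (4) raises immediately with a clear message.
ok = truncate_to_hf_vocab(torch.zeros(12), torch.zeros(8))
```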


cuichenx (Contributor) left a comment
LGTM!

Code context (compare.py):

-        parallel_state.get_tensor_model_parallel_rank() == 0
-        and parallel_state.get_expert_model_parallel_rank() == 0
-    ):
+    if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
Suggested change
-if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
+if not torch.distributed.is_initialized() or (parallel_state.get_tensor_model_parallel_rank() == 0 and parallel_state.get_expert_model_parallel_rank() == 0):

A small suggestion for EP>1 runs, so that the output is not garbled by duplicate prints from expert-parallel ranks.

2 participants