
[misc] fix: Improve compare.py robustness for multi-GPU and vocab-padded models#2647

Merged
ko3n1g merged 1 commit into r0.3.0 from yuya/cherry-pick-compare-fix-r0.3.0 on Mar 4, 2026

Conversation

yaoyu-33 (Contributor) commented Mar 4, 2026

Summary

Cherry-pick of #2646 to r0.3.0.

  • Fix NCCL broadcast dtype mismatch by converting HF logits to float32 before the fallback tensor creation path, ensuring all ranks use the same dtype
  • Handle Megatron vocab-size padding by truncating logits to HF vocab size before comparison, eliminating the shape-mismatch code path
  • Simplify the vlm_forward_step return, remove the unused gc import, and drop the unnecessary EP rank guard
  • Add torch.distributed.barrier() after HF results broadcast for synchronization safety
  • Disable mtp_num_layers for inference to prevent NCCL hangs
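
The first two fixes above (cast to float32, then truncate to the HF vocab size before comparing) can be sketched as a minimal standalone snippet. The `compare_logits` helper below is hypothetical — it is not compare.py's actual API, just an illustration of the truncate-then-compare logic under the assumption that Megatron pads its vocabulary dimension:

```python
import torch
import torch.nn.functional as F

def compare_logits(hf_logits, megatron_logits):
    """Hypothetical helper sketching the fix: cast both sides to float32,
    truncate Megatron's padded vocab to the HF vocab size, then compare."""
    hf = hf_logits.float()
    hf_vocab = hf.shape[-1]
    # Megatron may pad vocab_size for GPU kernel efficiency; drop the padding
    meg = megatron_logits.float()[..., :hf_vocab]
    token_match = torch.argmax(hf, dim=-1).equal(torch.argmax(meg, dim=-1))
    cos = F.cosine_similarity(hf.flatten(), meg.flatten(), dim=0).item()
    return token_match, cos

# Simulate a Megatron vocab padded from 1000 to 1024 entries
hf = torch.randn(1000)
meg = torch.cat([hf, torch.zeros(24)])
match, cos = compare_logits(hf, meg)
```

Because the padding entries are dropped before `argmax`, a padded index can never win the next-token comparison, which is what eliminated the old shape-mismatch branch.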

Test plan

  • Verified on remote cluster with Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%
  • CI passes

Made with Cursor

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved CUDA memory management and distributed synchronization for reliable model comparisons
    • Fixed vocabulary size mismatches between models
  • New Features

    • Enhanced diagnostics with cosine similarity and logits difference metrics for comparison workflows
    • Improved status logging for comparison steps

…ded models

- Fix NCCL broadcast dtype mismatch by converting logits to float32 before
  fallback tensor creation
- Handle Megatron vocab-size padding by truncating logits to HF vocab size
  before comparison, removing shape-mismatch branch
- Simplify vlm_forward_step return (let caller handle tuple unpacking)
- Remove unused gc import and grad_scale_func workaround
- Add barrier after HF broadcast for synchronization safety
- Simplify rank-0 guard to only check TP rank (EP rank check unnecessary)

Verified on Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 (Contributor, Author) commented Mar 4, 2026

/ok to test 35ca667

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Modifications to the HuggingFace and Megatron model comparison script include: initializing per-component Megatron config, managing CUDA cache explicitly, improving broadcast synchronization across ranks with explicit tensor initialization, truncating Megatron logits to match HuggingFace vocabulary size, and updating comparison logic to evaluate truncated logits with cosine similarity metrics.

Changes

Cohort: Comparison Script Enhancement
File(s): examples/conversion/compare_hf_and_megatron/compare.py
Summary: Set mtp_num_layers = None for each Megatron model component; add explicit CUDA cache clearing after HF inference; ensure HF logits are cast to float before broadcast synchronization; truncate Megatron logits to the HF vocabulary size for direct comparison; compute logits-difference and cosine-similarity metrics between HF and truncated Megatron logits; add synchronization barriers and status logging.
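
The "explicit tensor initialization" part of the broadcast fix can be sketched as a small helper (the name `prepare_hf_logits` is hypothetical, not the script's API). NCCL broadcast requires every participating rank to supply a tensor of identical dtype and element count, so rank 0 casts its bfloat16 model output to float32 while the other ranks allocate a float32 placeholder of the agreed size:

```python
import torch

def prepare_hf_logits(hf_logits, fallback_vocab_size, device="cpu"):
    """Hypothetical sketch of the dtype fix: before torch.distributed.broadcast,
    every rank must hold a tensor with the same dtype and number of elements."""
    if hf_logits is not None:
        # Rank 0: cast the HF model's bfloat16 logits to float32
        return hf_logits.float()
    # Non-zero ranks: allocate a float32 placeholder to receive the broadcast
    return torch.zeros(fallback_vocab_size, device=device, dtype=torch.float32)
```

Without the cast, rank 0 would broadcast bfloat16 data into float32 buffers on the other ranks, corrupting the received logits or hanging the collective.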

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Run CICD, cherry-pick, r0.3.0

Suggested reviewers

  • cuichenx
🚥 Pre-merge checks | ✅ 4 passed
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately reflects the main changes: fixing robustness issues in compare.py for multi-GPU scenarios and vocab-padded models, which directly align with the changeset's core modifications.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes ✅ Passed — The PR includes test results on the Qwen/Qwen3-0.6B model showing token match and 99.99% cosine similarity, demonstrating no regression while fixing multi-GPU and vocab-padding issues.



coderabbitai bot left a review comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/conversion/compare_hf_and_megatron/compare.py (1)

728-745: ⚠️ Potential issue | 🟠 Major

Synchronize logits length before broadcast to avoid rank-size mismatches.

hf_logits on rank 0 is model-output-sized, but fallback allocation on other ranks is tokenizer-sized. If those lengths differ, broadcast can fail or hang.

🔧 Proposed fix
     if torch.distributed.is_initialized():
         # Ensure consistent dtype across ranks: rank 0 has bfloat16 logits from the HF model,
         # so all ranks must use the same dtype for NCCL broadcast to work correctly.
         if hf_logits is not None:
             hf_logits = hf_logits.float()

+        # Broadcast logits length first so every rank allocates identically sized tensors.
+        hf_vocab_size_tensor = torch.zeros(1, device=input_ids.device, dtype=torch.long)
+        if hf_logits is not None:
+            hf_vocab_size_tensor[0] = hf_logits.numel()
+        dist.broadcast(hf_vocab_size_tensor, src=0)
+        hf_vocab_size = int(hf_vocab_size_tensor.item())
+
         if hf_next_token is None:
-            hf_next_token = torch.zeros(1, device=input_ids.device, dtype=torch.long)
-        if hf_logits is None:
-            vocab_size = getattr(
-                tokenizer, "vocab_size", len(tokenizer.vocab) if hasattr(tokenizer, "vocab") else 32000
-            )
-            hf_logits = torch.zeros(vocab_size, device=input_ids.device, dtype=torch.float32)
+            hf_next_token = torch.zeros((), device=input_ids.device, dtype=torch.long)
+        if hf_logits is None or hf_logits.numel() != hf_vocab_size:
+            hf_logits = torch.zeros(hf_vocab_size, device=input_ids.device, dtype=torch.float32)

         # Broadcast from rank 0 to all ranks
         torch.distributed.broadcast(hf_next_token, 0)
         torch.distributed.broadcast(hf_logits, 0)
         torch.distributed.barrier()

Run this read-only check to confirm the current shape source mismatch path:

#!/bin/bash
set -euo pipefail

FILE="$(fd -a "compare.py" | rg "examples/conversion/compare_hf_and_megatron/compare.py$" | head -n1)"

echo "== Rank-0 logits source =="
nl -ba "$FILE" | sed -n '537,543p'

echo "== Fallback allocation + broadcast block =="
nl -ba "$FILE" | sed -n '733,745p'

echo "== Search for explicit logits-length sync before logits broadcast =="
rg -n 'hf_logits\.shape|hf_logits\.numel|vocab_size|broadcast\(' "$FILE"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 728 -
745, hf_logits can have different lengths on rank 0 (model output) vs other
ranks (tokenizer fallback), which breaks torch.distributed.broadcast; ensure all
ranks have the same hf_logits shape before broadcasting by resizing or
reallocating hf_logits on non-zero ranks to match hf_logits.shape on rank 0.
Specifically, use a small two-step sync: first broadcast an integer length
(e.g., logits_len tensor) from rank 0, then on non-zero ranks reallocate
hf_logits = torch.zeros(logits_len, device=input_ids.device,
dtype=torch.float32) (or resize the existing tensor) before calling
torch.distributed.broadcast(hf_logits, 0); do the same length sync for
hf_next_token if its size can vary. Reference symbols: hf_logits, hf_next_token,
tokenizer, broadcast.
🧹 Nitpick comments (1)
examples/conversion/compare_hf_and_megatron/compare.py (1)

802-821: Align printed Megatron diagnostics with truncated logits.

Comparison uses megatron_logits_cmp, but printed next-token/top-5 stats are still from untruncated logits. Reporting from the same truncated tensor will make debug output consistent with pass/fail logic.

♻️ Suggested refactor
-            megatron_logits = megatron_output[0, -1, :]
-            megatron_next_token = torch.argmax(megatron_logits, dim=-1)
+            megatron_logits = megatron_output[0, -1, :]

             if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
                 print(f"Megatron output shape: {megatron_output.shape}")
                 print(f"Megatron logits stats - mean: {megatron_logits.mean():.4f}, std: {megatron_logits.std():.4f}")
-                print(
-                    f"Megatron next token: {megatron_next_token.item()} ('{tokenizer.decode([megatron_next_token.item()])}')"
-                )
-
-                # Show top 5 tokens
-                top5_vals, top5_ids = torch.topk(megatron_logits, 5)
-                top5_tokens = [tokenizer.decode([idx]) for idx in top5_ids]
-                print(f"Megatron Top 5: {list(zip(top5_tokens, top5_vals.tolist()))}")
 
                 # Megatron may pad vocab_size for GPU kernel efficiency — truncate
                 # to the HF vocab size so logits are directly comparable.
                 hf_vocab_size = hf_logits.shape[0]
                 megatron_logits_cmp = megatron_logits[:hf_vocab_size]
                 megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)
+
+                print(
+                    "Megatron next token (truncated): "
+                    f"{megatron_next_token_cmp.item()} ('{tokenizer.decode([megatron_next_token_cmp.item()])}')"
+                )
+                top5_vals, top5_ids = torch.topk(megatron_logits_cmp, 5)
+                top5_tokens = [tokenizer.decode([idx]) for idx in top5_ids]
+                print(f"Megatron Top 5 (truncated): {list(zip(top5_tokens, top5_vals.tolist()))}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 802 -
821, The printed Megatron diagnostics are still using the untruncated logits
while the comparison logic uses megatron_logits_cmp; update all diagnostic
calculations and prints to use megatron_logits_cmp (and its derived
megatron_next_token_cmp and any top-k computations) so the reported next-token,
top-5, and similarity stats are computed from the same truncated tensor used for
pass/fail checks (variables to update: megatron_logits -> megatron_logits_cmp,
megatron_next_token -> megatron_next_token_cmp, and any topk/indexing derived
from megatron_logits).
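
The intent of that refactor can be sketched with a hypothetical helper that derives every printed diagnostic from the already-truncated tensor, so the reported next-token and top-k can never disagree with the pass/fail comparison (the helper name and signature are illustrative, not code from compare.py):

```python
import torch

def truncated_diagnostics(megatron_logits, hf_vocab_size, k=5):
    """Hypothetical helper: compute next-token and top-k from the SAME
    truncated tensor used for the pass/fail comparison, so printed
    diagnostics stay consistent with the comparison result."""
    logits = megatron_logits[:hf_vocab_size]  # drop Megatron vocab padding
    next_token = int(torch.argmax(logits))
    topk_vals, topk_ids = torch.topk(logits, k)
    return next_token, list(zip(topk_ids.tolist(), topk_vals.tolist()))
```

If a padding slot happened to hold a large value, diagnostics computed from the untruncated tensor would report a next-token that the comparison logic never sees; truncating first removes that inconsistency.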

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2f198d60-86a7-4f8c-848b-7bd2a8b43bc5

📥 Commits

Reviewing files that changed from the base of the PR and between aea0cfd and 35ca667.

📒 Files selected for processing (1)
  • examples/conversion/compare_hf_and_megatron/compare.py

@yaoyu-33 yaoyu-33 added the docs-only label ("With great power comes great responsibility.") on Mar 4, 2026
@ko3n1g ko3n1g merged commit c10d427 into r0.3.0 Mar 4, 2026
55 checks passed
@ko3n1g ko3n1g deleted the yuya/cherry-pick-compare-fix-r0.3.0 branch March 4, 2026 22:07