Conversation
…ded models

- Fix NCCL broadcast dtype mismatch by converting logits to float32 before fallback tensor creation
- Handle Megatron vocab-size padding by truncating logits to HF vocab size before comparison, removing the shape-mismatch branch
- Simplify vlm_forward_step return (let caller handle tuple unpacking)
- Remove unused gc import and grad_scale_func workaround
- Add barrier after HF broadcast for synchronization safety
- Simplify rank-0 guard to only check TP rank (EP rank check unnecessary)

Verified on Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
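The vocab-truncation-then-compare step listed above can be sketched with a toy, framework-free example (the numbers are invented; the real script operates on torch tensors):

```python
import math

def cosine_similarity(a, b):
    # plain-Python cosine similarity over two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Megatron may pad the vocab dimension for kernel efficiency; the padded
# tail must be dropped before comparing against HF logits. Toy values:
hf_logits = [0.1, 2.5, -0.3, 0.9]                   # HF vocab size = 4
megatron_logits = [0.1, 2.5, -0.3, 0.9, 0.0, 0.0]   # padded to 6
megatron_cmp = megatron_logits[: len(hf_logits)]

similarity = cosine_similarity(hf_logits, megatron_cmp)
hf_token = max(range(len(hf_logits)), key=hf_logits.__getitem__)
mg_token = max(range(len(megatron_cmp)), key=megatron_cmp.__getitem__)
print(round(similarity, 6), hf_token == mg_token)  # → 1.0 True
```

The real script reports the same two signals: whether the argmax tokens match, and the cosine similarity of the (truncated) logit vectors.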
/ok to test 35ca667
📝 Walkthrough

Modifications to the HuggingFace and Megatron model comparison script include: initializing per-component Megatron config, managing CUDA cache explicitly, improving broadcast synchronization across ranks with explicit tensor initialization, truncating Megatron logits to match HuggingFace vocabulary size, and updating comparison logic to evaluate truncated logits with cosine similarity metrics.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/conversion/compare_hf_and_megatron/compare.py (1)
728-745: ⚠️ Potential issue | 🟠 Major

Synchronize logits length before broadcast to avoid rank-size mismatches.
`hf_logits` on rank 0 is model-output-sized, but the fallback allocation on other ranks is tokenizer-sized. If those lengths differ, `broadcast` can fail or hang.

🔧 Proposed fix
```diff
 if torch.distributed.is_initialized():
     # Ensure consistent dtype across ranks: rank 0 has bfloat16 logits from the HF model,
     # so all ranks must use the same dtype for NCCL broadcast to work correctly.
     if hf_logits is not None:
         hf_logits = hf_logits.float()
+    # Broadcast logits length first so every rank allocates identically sized tensors.
+    hf_vocab_size_tensor = torch.zeros(1, device=input_ids.device, dtype=torch.long)
+    if hf_logits is not None:
+        hf_vocab_size_tensor[0] = hf_logits.numel()
+    dist.broadcast(hf_vocab_size_tensor, src=0)
+    hf_vocab_size = int(hf_vocab_size_tensor.item())
+
     if hf_next_token is None:
-        hf_next_token = torch.zeros(1, device=input_ids.device, dtype=torch.long)
-    if hf_logits is None:
-        vocab_size = getattr(
-            tokenizer, "vocab_size", len(tokenizer.vocab) if hasattr(tokenizer, "vocab") else 32000
-        )
-        hf_logits = torch.zeros(vocab_size, device=input_ids.device, dtype=torch.float32)
+        hf_next_token = torch.zeros((), device=input_ids.device, dtype=torch.long)
+    if hf_logits is None or hf_logits.numel() != hf_vocab_size:
+        hf_logits = torch.zeros(hf_vocab_size, device=input_ids.device, dtype=torch.float32)
     # Broadcast from rank 0 to all ranks
     torch.distributed.broadcast(hf_next_token, 0)
     torch.distributed.broadcast(hf_logits, 0)
     torch.distributed.barrier()
```

Run this read-only check to confirm the current shape source mismatch path:
```bash
#!/bin/bash
set -euo pipefail

FILE="$(fd -a "compare.py" | rg "examples/conversion/compare_hf_and_megatron/compare.py$" | head -n1)"

echo "== Rank-0 logits source =="
nl -ba "$FILE" | sed -n '537,543p'

echo "== Fallback allocation + broadcast block =="
nl -ba "$FILE" | sed -n '733,745p'

echo "== Search for explicit logits-length sync before logits broadcast =="
rg -n 'hf_logits\.shape|hf_logits\.numel|vocab_size|broadcast\(' "$FILE"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 728 - 745, hf_logits can have different lengths on rank 0 (model output) vs other ranks (tokenizer fallback), which breaks torch.distributed.broadcast; ensure all ranks have the same hf_logits shape before broadcasting by resizing or reallocating hf_logits on non-zero ranks to match hf_logits.shape on rank 0. Specifically, use a small two-step sync: first broadcast an integer length (e.g., logits_len tensor) from rank 0, then on non-zero ranks reallocate hf_logits = torch.zeros(logits_len, device=input_ids.device, dtype=torch.float32) (or resize the existing tensor) before calling torch.distributed.broadcast(hf_logits, 0); do the same length sync for hf_next_token if its size can vary. Reference symbols: hf_logits, hf_next_token, tokenizer, broadcast.
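The failure mode described above (rank 0 sized by the model output while other ranks fall back to the tokenizer size) and the two-step length sync can be illustrated with a single-process stand-in; the `broadcast_from_rank0` helper and both sizes are hypothetical, and the real fix uses `torch.distributed.broadcast`:

```python
def broadcast_from_rank0(rank0_value):
    # stand-in for torch.distributed.broadcast: every rank ends up with
    # rank 0's value (trivially here, since we simulate one process)
    return rank0_value

# Hypothetical sizes: rank 0's buffer comes from the model output, while the
# old fallback on other ranks came from the tokenizer, and they can differ.
model_logits_len = 151936      # padded model head (hypothetical)
tokenizer_vocab_size = 151669  # tokenizer count (hypothetical)
assert model_logits_len != tokenizer_vocab_size  # the mismatch that breaks broadcast

# Step 1: sync the length itself from rank 0.
synced_len = broadcast_from_rank0(model_logits_len)

# Step 2: non-zero ranks allocate the synced size, not the tokenizer size.
fallback_logits = [0.0] * synced_len
assert len(fallback_logits) == model_logits_len  # shapes now agree across ranks
```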
🧹 Nitpick comments (1)
examples/conversion/compare_hf_and_megatron/compare.py (1)
802-821: Align printed Megatron diagnostics with truncated logits.

Comparison uses `megatron_logits_cmp`, but the printed next-token/top-5 stats are still from the untruncated logits. Reporting from the same truncated tensor will make debug output consistent with the pass/fail logic.

♻️ Suggested refactor
```diff
-    megatron_logits = megatron_output[0, -1, :]
-    megatron_next_token = torch.argmax(megatron_logits, dim=-1)
+    megatron_logits = megatron_output[0, -1, :]

     if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
         print(f"Megatron output shape: {megatron_output.shape}")
         print(f"Megatron logits stats - mean: {megatron_logits.mean():.4f}, std: {megatron_logits.std():.4f}")
-        print(
-            f"Megatron next token: {megatron_next_token.item()} ('{tokenizer.decode([megatron_next_token.item()])}')"
-        )
-
-        # Show top 5 tokens
-        top5_vals, top5_ids = torch.topk(megatron_logits, 5)
-        top5_tokens = [tokenizer.decode([idx]) for idx in top5_ids]
-        print(f"Megatron Top 5: {list(zip(top5_tokens, top5_vals.tolist()))}")

         # Megatron may pad vocab_size for GPU kernel efficiency — truncate
         # to the HF vocab size so logits are directly comparable.
         hf_vocab_size = hf_logits.shape[0]
         megatron_logits_cmp = megatron_logits[:hf_vocab_size]
         megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)
+
+        print(
+            "Megatron next token (truncated): "
+            f"{megatron_next_token_cmp.item()} ('{tokenizer.decode([megatron_next_token_cmp.item()])}')"
+        )
+        top5_vals, top5_ids = torch.topk(megatron_logits_cmp, 5)
+        top5_tokens = [tokenizer.decode([idx]) for idx in top5_ids]
+        print(f"Megatron Top 5 (truncated): {list(zip(top5_tokens, top5_vals.tolist()))}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 802 - 821, The printed Megatron diagnostics are still using the untruncated logits while the comparison logic uses megatron_logits_cmp; update all diagnostic calculations and prints to use megatron_logits_cmp (and its derived megatron_next_token_cmp and any top-k computations) so the reported next-token, top-5, and similarity stats are computed from the same truncated tensor used for pass/fail checks (variables to update: megatron_logits -> megatron_logits_cmp, megatron_next_token -> megatron_next_token_cmp, and any topk/indexing derived from megatron_logits).
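A toy illustration of why this nitpick matters: arbitrary values in the padded tail can flip the printed argmax away from the one the pass/fail comparison actually uses (values invented):

```python
hf_vocab_size = 4
# Padding slots beyond the HF vocab may hold arbitrary values.
megatron_logits = [0.1, 2.5, -0.3, 0.9, 9.7, 0.0]

# Diagnostics computed on the full tensor pick up the padding index...
full_argmax = max(range(len(megatron_logits)), key=megatron_logits.__getitem__)

# ...while the comparison uses the truncated tensor.
megatron_logits_cmp = megatron_logits[:hf_vocab_size]
cmp_argmax = max(range(len(megatron_logits_cmp)), key=megatron_logits_cmp.__getitem__)

print(full_argmax, cmp_argmax)  # → 4 1  (printed token ≠ compared token)
```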
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2f198d60-86a7-4f8c-848b-7bd2a8b43bc5
📒 Files selected for processing (1)
examples/conversion/compare_hf_and_megatron/compare.py
Summary
Cherry-pick of #2646 to r0.3.0.

- Convert logits to `float32` before the fallback tensor creation path, ensuring all ranks use the same dtype
- Simplify `vlm_forward_step` return, remove unused `gc` import and unnecessary EP rank guard
- Add `torch.distributed.barrier()` after HF results broadcast for synchronization safety
- Disable `mtp_num_layers` for inference to prevent NCCL hangs

Test plan

Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Made with Cursor