[misc] fix: Improve compare.py robustness for multi-GPU and vocab-padded models #2646
Conversation
- Fix NCCL broadcast dtype mismatch by converting logits to float32 before fallback tensor creation
- Handle Megatron vocab-size padding by truncating logits to HF vocab size before comparison, removing the shape-mismatch branch
- Simplify vlm_forward_step return (let caller handle tuple unpacking)
- Remove unused gc import and grad_scale_func workaround
- Add barrier after HF broadcast for synchronization safety
- Simplify rank-0 guard to only check TP rank (EP rank check unnecessary)

Verified on Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
/ok to test 2be9ee7
📝 Walkthrough

This pull request refactors the model comparison logic in examples/conversion/compare_hf_and_megatron/compare.py.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Line 321: vlm_forward_step currently returns model(**forward_args) which may
be a tuple (e.g., (output_tensor, loss_mask)) and breaks downstream tensor ops;
update vlm_forward_step to detect if model(**forward_args) is a tuple or list
and extract the primary output tensor (e.g., first element) before returning, so
return (output_tensor, loss_func) instead of the raw tuple; reference the call
site model(**forward_args) and the returned loss_func when making this change.
- Around line 799-818: Add an explicit vocab-size compatibility guard before
truncating megatron_logits: compute hf_vocab_size = hf_logits.shape[0] then
check megatron_logits.size(0) >= hf_vocab_size (using megatron_logits.size(0) or
.shape[0]) and if not raise a descriptive ValueError (e.g., "Megatron logits
vocab smaller than HF vocab: megatron_vocab=..., hf_vocab=...") to fail fast;
keep the existing truncation into megatron_logits_cmp and subsequent comparisons
(hf_next_token, megatron_next_token_cmp, diff, cosine_similarity,
SIMILARITY_THRESHOLD) unchanged when the check passes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f4f9fd0d-5961-41af-bb43-445481a458c2
📒 Files selected for processing (1)
examples/conversion/compare_hf_and_megatron/compare.py
```diff
-    output_tensor = model_output
-    return output_tensor, loss_func
+    return model(**forward_args), loss_func
```
Handle tuple model outputs before returning from vlm_forward_step.
At Line 321, returning raw model(**forward_args) can propagate a tuple output (e.g., (output_tensor, loss_mask)), while downstream code assumes a tensor and will fail on tensor ops/indexing.
Proposed fix

```diff
-    return model(**forward_args), loss_func
+    model_output = model(**forward_args)
+    if isinstance(model_output, tuple):
+        output_tensor, _ = model_output
+    else:
+        output_tensor = model_output
+    return output_tensor, loss_func
```
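The normalization pattern the review asks for can be exercised in isolation. This is a minimal sketch; `normalize_model_output` is a hypothetical helper name, and the real fix would inline this logic inside `vlm_forward_step` rather than add a function.

```python
def normalize_model_output(model_output):
    """Return the primary output from a forward pass whose result may be
    either a bare tensor or an (output, loss_mask)-style tuple/list.

    Hypothetical helper illustrating the reviewer's suggestion.
    """
    if isinstance(model_output, (tuple, list)):
        # Downstream code expects a single tensor, so keep only the first
        # element and drop auxiliary outputs such as a loss mask.
        return model_output[0]
    return model_output
```

The caller then behaves identically whether the model returns a tensor or a tuple, which is exactly the property the downstream tensor ops rely on.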
```diff
+    # Megatron may pad vocab_size for GPU kernel efficiency — truncate
+    # to the HF vocab size so logits are directly comparable.
+    hf_vocab_size = hf_logits.shape[0]
+    megatron_logits_cmp = megatron_logits[:hf_vocab_size]
+    megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)

     # Compare outputs
     print("=== COMPARISON ===")
-    token_match = hf_next_token.item() == megatron_next_token.item()
+    token_match = hf_next_token.item() == megatron_next_token_cmp.item()
     token_status_emoji = "✅" if token_match else "❌"
     print(f"Token match: {token_match} {token_status_emoji}")

-    # Compare logits if shapes match
-    if hf_logits.shape == megatron_logits.shape:
-        diff = (hf_logits - megatron_logits).abs()
-        print(f"Logits diff - max: {diff.max():.6f}, mean: {diff.mean():.6f}")
-        cosine_sim = torch.cosine_similarity(hf_logits.unsqueeze(0), megatron_logits.unsqueeze(0))
-        cos_val = cosine_sim.item()
-        percent = cos_val * 100.0
-        status_emoji = "✅" if cos_val >= SIMILARITY_THRESHOLD else "❌"
-        tolerance_text = "within ±2%" if cos_val >= SIMILARITY_THRESHOLD else "outside ±2%"
-        print(
-            f"Cosine similarity: {cos_val:.6f} ({percent:.2f}%) {status_emoji} ({tolerance_text} tolerance)"
-        )
-    else:
-        print(f"Shape mismatch: HF {hf_logits.shape} vs Megatron {megatron_logits.shape}")
-        print("Cannot compare logits directly due to shape mismatch")
+    diff = (hf_logits - megatron_logits_cmp).abs()
+    print(f"Logits diff - max: {diff.max():.6f}, mean: {diff.mean():.6f}")
+    cosine_sim = torch.cosine_similarity(hf_logits.unsqueeze(0), megatron_logits_cmp.unsqueeze(0))
+    cos_val = cosine_sim.item()
+    percent = cos_val * 100.0
+    status_emoji = "✅" if cos_val >= SIMILARITY_THRESHOLD else "❌"
+    tolerance_text = "within ±2%" if cos_val >= SIMILARITY_THRESHOLD else "outside ±2%"
+    print(f"Cosine similarity: {cos_val:.6f} ({percent:.2f}%) {status_emoji} ({tolerance_text} tolerance)")
```
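The similarity check in the comparison code can be reproduced without torch. A plain-Python sketch over two logit vectors; the 0.98 threshold is an assumption inferred from the "within ±2%" wording, since SIMILARITY_THRESHOLD's actual value is not shown in this diff.

```python
import math

SIMILARITY_THRESHOLD = 0.98  # assumed value matching the "within ±2%" wording

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, mirroring the
    torch.cosine_similarity call in the comparison code."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def logits_close(hf_logits, megatron_logits):
    """True when the two logit vectors pass the similarity threshold."""
    return cosine_similarity(hf_logits, megatron_logits) >= SIMILARITY_THRESHOLD
```

Note that cosine similarity is scale-invariant, which is why it tolerates the small per-element differences that remain after HF/Megatron conversion.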
Add an explicit vocab-size compatibility guard before truncation.
At Line 802, truncation assumes Megatron logits are at least HF vocab length. If Megatron vocab is smaller, Line 811 still fails later with a shape mismatch; fail fast with a clear error.
Proposed fix

```diff
     hf_vocab_size = hf_logits.shape[0]
+    if megatron_logits.shape[0] < hf_vocab_size:
+        raise ValueError(
+            "Incompatible vocab sizes: "
+            f"Megatron logits ({megatron_logits.shape[0]}) < HF logits ({hf_vocab_size}). "
+            "Ensure both models use the same tokenizer/vocab."
+        )
     megatron_logits_cmp = megatron_logits[:hf_vocab_size]
     megatron_next_token_cmp = torch.argmax(megatron_logits_cmp, dim=-1)
```

Based on learnings: when a path is unsupported, raise an explicit, descriptive error instead of failing later with an implicit runtime mismatch.
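The guard-then-truncate behaviour proposed above can be exercised standalone. Lists stand in for 1-D logit tensors, and `truncate_to_hf_vocab` is an illustrative name, not the shape the actual fix would take (which inlines the check).

```python
def truncate_to_hf_vocab(megatron_logits, hf_vocab_size):
    """Fail fast when the Megatron vocab is smaller than the HF vocab,
    otherwise truncate the padded tail so the logits are comparable."""
    if len(megatron_logits) < hf_vocab_size:
        raise ValueError(
            f"Incompatible vocab sizes: Megatron logits ({len(megatron_logits)}) "
            f"< HF logits ({hf_vocab_size}). "
            "Ensure both models use the same tokenizer/vocab."
        )
    return megatron_logits[:hf_vocab_size]
```

Padding only ever makes the Megatron vocab larger, so a smaller Megatron vocab signals a tokenizer mismatch rather than kernel-efficiency padding, and raising early gives a clearer error than a later shape mismatch.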
```diff
-    if (
-        parallel_state.get_tensor_model_parallel_rank() == 0
-        and parallel_state.get_expert_model_parallel_rank() == 0
-    ):
+    if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
```
```diff
-if not torch.distributed.is_initialized() or parallel_state.get_tensor_model_parallel_rank() == 0:
+if not torch.distributed.is_initialized() or (parallel_state.get_tensor_model_parallel_rank() == 0 and parallel_state.get_expert_model_parallel_rank() == 0):
```
a small suggestion for EP>1 runs so that output is not garbled
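The effect of this suggestion can be sketched with plain arguments standing in for `torch.distributed.is_initialized()` and Megatron's `parallel_state` rank queries; the function name and signature here are illustrative, not from compare.py.

```python
def should_emit_output(dist_initialized, tp_rank, ep_rank):
    """Decide whether this process should print comparison output.

    Mirrors the reviewer's suggestion: single-process runs always print;
    distributed runs print only on the rank that is first in BOTH the
    tensor-parallel and expert-parallel groups, so EP>1 runs do not
    produce interleaved duplicate output.
    """
    if not dist_initialized:
        return True
    return tp_rank == 0 and ep_rank == 0
```

With the TP-only guard from the PR, every EP rank with `tp_rank == 0` would print, which is exactly the garbling the reviewer is flagging for EP>1 runs.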
Summary
- Convert logits to `float32` before the fallback tensor creation path, ensuring all ranks use the same dtype
- Simplify the `vlm_forward_step` return; remove the unused `gc` import, the `grad_scale_func` workaround, and the unnecessary EP rank guard
- Add `torch.distributed.barrier()` after the HF results broadcast for synchronization safety

Test plan
Qwen/Qwen3-0.6B: token match ✅, cosine similarity 99.99%

Made with Cursor