[bridge] Fix off-by-one in sliding window size for Gemma2, Gemma3, and GPT-OSS #2656
Conversation
HuggingFace `sliding_window` is inclusive (tokens within the window are attended to), while Megatron/FlashAttention `window_size` is exclusive. Subtract 1 to align the semantics. Also make GPT-OSS read `sliding_window` from the HF config instead of hardcoding 128. Made-with: Cursor
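The semantics gap described above can be sketched as a small conversion helper. This is an illustrative sketch, not the bridge's actual code: the function name and the `(left, right)` tuple shape follow FlashAttention's `window_size` convention, where the current token is attended to implicitly, so an inclusive window of N tokens needs only N-1 lookback positions.

```python
def hf_to_flash_window(sliding_window: int) -> tuple[int, int]:
    """Convert an HF inclusive sliding_window to a FlashAttention-style
    (left, right) window_size for causal attention.

    HF counts all tokens inside the window (including the current one);
    FlashAttention counts extra positions attended beyond the current
    token, hence the subtract-1.
    """
    if sliding_window < 1:
        raise ValueError("sliding_window must be >= 1")
    # Causal attention: no lookahead, so right is 0.
    return (sliding_window - 1, 0)
```

For example, GPT-OSS's previously hardcoded window of 128 becomes `(127, 0)` under the corrected semantics.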
📝 Walkthrough

The changes adjust sliding window size calculations in three model bridge implementations. In Gemma2Bridge, Gemma3TEDotProductAttention, and the GPT-OSS bridge, `window_size` is modified to subtract 1 from the computed or configured window dimensions, affecting local attention behavior.
Summary
- HuggingFace `sliding_window` is inclusive (tokens within the window are attended to), while Megatron/FlashAttention `window_size` is exclusive; subtract 1 to align semantics
- Read GPT-OSS `sliding_window` from the HF config instead of hardcoding 128

Test plan
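A minimal check of the two changes above might look like the following. The `window_from_config` helper is hypothetical (the bridge's real attribute plumbing differs); `SimpleNamespace` stands in for the HF config object.

```python
import types

def window_from_config(hf_config) -> tuple[int, int]:
    # Read sliding_window from the HF config (GPT-OSS previously
    # hardcoded 128) and apply the inclusive-to-exclusive subtract-1.
    return (hf_config.sliding_window - 1, 0)

# GPT-OSS default per its HF config
cfg = types.SimpleNamespace(sliding_window=128)
assert window_from_config(cfg) == (127, 0)
```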
Made with Cursor
Thanks to @returnL for catching this in NVIDIA/Megatron-LM#2771 (review)