Enable support for 16 or 32 heads in the SM90 Sparse Attention #11
+110
−38
Related to vllm-project/vllm#31473
Purpose:
The SM90 kernel now automatically selects the block size B_H based on h_q (the number of query heads), choosing the largest of 64, 32, or 16 that divides h_q:
h_q = 64, 128, 192, ... → Use B_H=64
h_q = 32, 96, 160, ... → Use B_H=32
h_q = 16, 48, 80, ... → Use B_H=16
This allows for more flexible support of different head configurations while maintaining optimal performance (prioritizing larger block sizes).
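The selection rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this PR, not the actual kernel dispatch code; the function name `select_block_h` is hypothetical, and the error raised for unsupported head counts is an assumption about how such inputs would be handled.

```python
# Hypothetical sketch of the B_H selection rule described above.
# The real dispatch lives in the SM90 kernel code; names here are illustrative.
SUPPORTED_BLOCK_SIZES = (64, 32, 16)  # ordered largest first


def select_block_h(h_q: int) -> int:
    """Return the largest supported B_H that evenly divides h_q."""
    if h_q <= 0:
        raise ValueError(f"h_q must be positive, got {h_q}")
    for b_h in SUPPORTED_BLOCK_SIZES:
        if h_q % b_h == 0:
            return b_h
    # Assumption: head counts that are not a multiple of 16 stay unsupported.
    raise ValueError(f"h_q={h_q} is not a multiple of 16")
```

Trying the candidates from largest to smallest is what gives the "prioritizing larger block sizes" behavior: h_q = 128 matches 64 first, while h_q = 96 falls through to 32 and h_q = 48 falls through to 16.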
Test Plan:
@LucasWilkinson This code was generated by Gemini 3.0 Pro. I don't know much about kernels. Is this code modification reasonable? How should I test it? 🥹