
Conversation

@princepride

Related to vllm-project/vllm#31473

Proposal:

The SM90 kernel now automatically selects the appropriate block size based on h_q:

- h_q = 64, 128, 192, … → use B_H = 64
- h_q = 32, 96, 160, … → use B_H = 32
- h_q = 16, 48, 80, … → use B_H = 16

In other words, the kernel picks the largest supported block size that evenly divides h_q (sketched below). This allows more flexible support for different head configurations while maintaining performance, since larger block sizes are preferred.
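For illustration, a minimal sketch of that selection rule (the function name and structure are hypothetical, not the actual FlashMLA dispatch code):

```python
def select_block_size(h_q: int) -> int:
    """Pick the largest supported block size B_H that evenly divides h_q.

    Hypothetical sketch of the rule described above, not the actual
    SM90 kernel dispatch code.
    """
    for b_h in (64, 32, 16):
        if h_q % b_h == 0:
            return b_h
    raise ValueError(f"unsupported h_q: {h_q}")
```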

Test Plan:

@LucasWilkinson This code was generated by Gemini 3.0 Pro. I don't know much about kernels. Is this code modification reasonable? How should I test it? 🥹

…ion kernel.

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@LucasWilkinson (Collaborator)

You can test it by setting `export FLASH_MLA_SRC_DIR=<path-to-modified-flashmla>` and then rebuilding vLLM.

NOTE: I'd recommend installing ccache and using:

`VLLM_DISABLE_SCCACHE=1 CCACHE_NOHASHDIR="true" uv pip install --no-build-isolation -e . -v`

for faster rebuilds (the first build will still be very slow). You can then even run:

`VLLM_DISABLE_SCCACHE=1 python setup.py build_ext --inplace`

to rebuild just the kernels.
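For a quick smoke test after rebuilding, something like the following should exercise the MLA decode path (the model choice is just an example; any DeepSeek-style MLA checkpoint that fits on your GPU works):

```python
# Illustrative smoke test, not an official procedure: after rebuilding
# vLLM against the modified FlashMLA source, check that generation
# still produces coherent output.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", trust_remote_code=True)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)  # should be coherent text, not garbage
```

Beyond a smoke test, you'd also want to compare accuracy and throughput against the unmodified kernel, since block-size changes can regress performance even when the output is correct.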

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride (Author)

@LucasWilkinson It seems much more complicated than I imagined 😂. After trying this kernel and several more revised versions, the model's performance actually decreased.

