
[ROCm][FP4 BMM] integrate FP4 BMM #816

Open
zejunchen-zejun wants to merge 1 commit into dev/perf from zejun/add_fp4_bmm_for_dev_perf

Conversation


@zejunchen-zejun zejunchen-zejun commented Nov 20, 2025

Integrate the FP4 BMM and unify the env flag VLLM_ROCM_USE_AITER_BMM.
When VLLM_ROCM_USE_AITER_BMM=1 (the default), the logic is:

- When the attention weight dtype is BF16, the FP8 BMM is called.
- When the attention weight dtype is U8, the FP4 BMM is called.

When VLLM_ROCM_USE_AITER_BMM=0, the torch BMM is used.
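The dispatch described above can be sketched as follows. This is a hypothetical illustration with made-up names (`select_bmm_backend`, string dtype tags); the actual vLLM/AITER code paths differ.

```python
# Hypothetical sketch of the BMM dispatch described in this PR.
# Names and dtype tags are illustrative, not the real vLLM API.
def select_bmm_backend(use_aiter_bmm: bool, weight_dtype: str) -> str:
    """Pick the BMM kernel for the attention weight.

    use_aiter_bmm mirrors VLLM_ROCM_USE_AITER_BMM (default 1).
    weight_dtype is the attention weight dtype: "bf16", or "u8"
    (U8 holds packed MXFP4 values, as in DeepSeek-R1-MXFP4-Preview).
    """
    if not use_aiter_bmm:
        return "torch_bmm"          # VLLM_ROCM_USE_AITER_BMM=0
    if weight_dtype == "bf16":
        return "aiter_fp8_bmm"      # BF16 weight -> FP8 BMM
    if weight_dtype == "u8":
        return "aiter_fp4_bmm"      # U8 (packed MXFP4) weight -> FP4 BMM
    return "torch_bmm"              # fall back for any other dtype
```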

For the DeepSeek-R1-MXFP4-Preview model, whose kv_b_proj weight is U8, the FP4 BMM is used. The associated performance and accuracy results are below.

The FP4 accuracy is:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9515 | ± 0.0059 |
|       |         | strict-match     | 5      | exact_match | 0.9507 | ± 0.0060 |

The FP4 BMM performance is:
Request throughput (req/s): 1.22
Mean TTFT (ms): 6474.98
Mean TPOT (ms): 44.95

The baseline (FP8 BMM) performance is:
Request throughput (req/s): 1.21
Mean TTFT (ms): 6732.17
Mean TPOT (ms): 45.16

@zejunchen-zejun zejunchen-zejun marked this pull request as draft November 20, 2025 04:15
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/add_fp4_bmm_for_dev_perf branch 6 times, most recently from 9e200a0 to b4152da Compare November 26, 2025 01:50
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/add_fp4_bmm_for_dev_perf branch from b4152da to 8eb3c34 Compare November 26, 2025 03:41
@zejunchen-zejun zejunchen-zejun marked this pull request as ready for review November 26, 2025 03:59

zejunchen-zejun commented Nov 26, 2025

Hi @ZhiweiYan-96,
Here is the PR for the FP4 BMM.
The dequant method in the post-processing is not very efficient.
You can modify it to use bit shifting and sub-byte storage when post-processing the U8 weight.

Finally, you can upstream this PR to the vLLM community.

return False


def quant_to_mxfp4(x):


There should be some utils in vLLM for quant & dequant.
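As context for `quant_to_mxfp4`, here is a minimal round-trip sketch of MXFP4 quantization (OCP MX format: 32-element blocks, a shared power-of-two scale, FP4 e2m1 elements). This is a hypothetical NumPy helper for illustration, not the PR's implementation or a vLLM utility.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 e2m1.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_dequant_mxfp4(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Quantize x to MXFP4 and dequantize back (round-trip sketch).

    Each block of `block` elements shares one power-of-two scale chosen
    so the block's largest magnitude lands within the e2m1 range [0, 6].
    """
    x = x.astype(np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        blk = x.flat[start:start + block]
        amax = np.abs(blk).max()
        # e2m1's largest magnitude is 6 = 1.5 * 2**2, hence the -2 bias.
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
        scaled = blk / scale
        # Round each element to the nearest representable e2m1 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_VALUES[None, :]).argmin(axis=1)
        out.flat[start:start + block] = np.sign(scaled) * E2M1_VALUES[idx] * scale
    return out
```

A real kernel would store the 4-bit codes two-per-U8 plus one shared scale byte per block; this sketch only models the numerics of the round trip.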

@github-actions

github-actions bot commented Mar 4, 2026

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Mar 4, 2026