**Author:** Hi @ZhiweiYan-96, you can finally upstream this PR to the vLLM community.
**ZhiweiYan-96** reviewed Dec 3, 2025, commenting on the new `def quant_to_mxfp4(x):` helper:

> There should be some utils in vLLM for quant & dequant.
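As context for that suggestion, here is a minimal sketch of the numerics such a utility would implement, assuming the standard MXFP4 layout: E2M1 4-bit elements sharing one power-of-two (E8M0-style) scale per 32-element block, packed two values per uint8. Only the name `quant_to_mxfp4` comes from the diff; the packing layout, the scale encoding, and the `dequant_from_mxfp4` helper are illustrative assumptions, not the vLLM/AITER API.

```python
import torch

# E2M1 representable magnitudes; the grid index of a value equals its
# 3-bit exponent/mantissa pattern, so argmin over this grid yields the bits.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_to_mxfp4(x: torch.Tensor, block: int = 32):
    """Quantize a (..., K) float tensor; K must be divisible by `block`."""
    shape = x.shape
    xb = x.float().reshape(-1, block)
    # Shared per-block power-of-two scale: map the block max near fp4 max (6.0).
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    exp = torch.floor(torch.log2(amax / 6.0)).clamp(-127, 127)
    scaled = xb / torch.exp2(exp)
    sign = (scaled < 0).to(torch.uint8)
    # Round each scaled element to the nearest E2M1 magnitude.
    idx = (scaled.abs().clamp(max=6.0).unsqueeze(-1)
           - FP4_GRID.to(x.device)).abs().argmin(dim=-1).to(torch.uint8)
    codes = (sign << 3) | idx                        # 1 sign bit + 3 value bits
    packed = codes[:, 0::2] | (codes[:, 1::2] << 4)  # two nibbles per byte
    return packed.reshape(*shape[:-1], -1), exp.squeeze(-1).to(torch.int8)

def dequant_from_mxfp4(packed: torch.Tensor, exp: torch.Tensor, block: int = 32):
    """Inverse of quant_to_mxfp4; returns float32."""
    shape = packed.shape
    p = packed.reshape(-1, block // 2)
    codes = torch.empty(p.shape[0], block, dtype=torch.uint8, device=p.device)
    codes[:, 0::2], codes[:, 1::2] = p & 0xF, p >> 4   # unpack the two nibbles
    mag = FP4_GRID.to(p.device)[(codes & 0x7).long()]  # look up E2M1 magnitude
    sign = torch.where((codes >> 3) == 1, -1.0, 1.0)
    out = sign * mag * torch.exp2(exp.float()).reshape(-1, 1)
    return out.reshape(*shape[:-1], -1)
```

A real vLLM utility would additionally have to match the exact scale encoding and packing order the AITER FP4 kernels expect (e.g., biased uint8 E8M0 scales), so this sketch is a reference for the numerics only.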
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Integrate the FP4 BMM and unify the env flag `VLLM_ROCM_USE_AITER_BMM`. When `VLLM_ROCM_USE_AITER_BMM=1` (the default), the logic is:

- When the weight in the attention part is BF16 dtype, the FP8 BMM is called.
- When the weight in the attention part is U8 dtype, the FP4 BMM is called.

When `VLLM_ROCM_USE_AITER_BMM=0`, the torch BMM is used. A minimal sketch of this dispatch is shown below.
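The sketch below illustrates the dispatch only; `mla_bmm` and the `aiter_*` stand-ins are hypothetical names for this example, not vLLM's actual call sites or the AITER API.

```python
import os
import torch

# Placeholders standing in for the AITER kernels; names and signatures are
# illustrative only.
def aiter_fp8_bmm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    raise NotImplementedError("placeholder for the AITER FP8 BMM kernel")

def aiter_fp4_bmm(x: torch.Tensor, w_u8: torch.Tensor) -> torch.Tensor:
    raise NotImplementedError("placeholder for the AITER FP4 (MXFP4) BMM kernel")

# Flag defaults to on, matching the behavior described above.
USE_AITER_BMM = os.environ.get("VLLM_ROCM_USE_AITER_BMM", "1") == "1"

def mla_bmm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Dispatch the attention BMM on the env flag and the weight dtype."""
    if not USE_AITER_BMM:
        return torch.bmm(x, weight)      # flag off: plain torch BMM
    if weight.dtype == torch.bfloat16:
        return aiter_fp8_bmm(x, weight)  # BF16 weight -> FP8 BMM path
    if weight.dtype == torch.uint8:
        return aiter_fp4_bmm(x, weight)  # U8 (packed MXFP4) weight -> FP4 BMM
    return torch.bmm(x, weight)          # anything else: fall back to torch
```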
For the model DeepSeek-R1-MXFP4-Preview, whose kv_b_proj weight is U8, the FP4 BMM is used. Here are the associated performance and accuracy results.
The FP4 accuracy is:
The FP4 BMM performance versus the FP8 BMM baseline:

| Metric | FP4 BMM | FP8 BMM (baseline) |
| --- | --- | --- |
| Request throughput (req/s) | 1.22 | 1.21 |
| Mean TTFT (time to first token, ms) | 6474.98 | 6732.17 |
| Mean TPOT (time per output token, ms) | 44.95 | 45.16 |