Update fp8 paged attention#592

Draft
amd-xiaoyu12 wants to merge 890 commits into ROCm:main from amd-xiaoyu12:fp8-paged-attention

Conversation


@amd-xiaoyu12 amd-xiaoyu12 commented Jul 9, 2025

Please direct your PRs to the upstream vllm (https://github.com/vllm-project/vllm.git)

Accepting PRs into the ROCm fork (https://github.com/ROCm/vllm) requires a clear, previously communicated exception.

Summary:
Support full fp8 MFMA with warp-level dynamic query quantization to improve fp8 performance on MI308; this can also benefit other MI300X accelerators and newer hardware.

  • Performance: benchmark results (image)
  • Unit test - attention output (image)
  • lm-eval-harness perplexity (ppl) test (image)
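To make the summary concrete, here is a minimal Python sketch of per-row dynamic fp8 quantization of the query, the idea behind the warp-level scheme (on the GPU, each warp would compute the scale for its own row). It assumes the OCP e4m3 format (largest finite value 448); the helper name and structure are illustrative, not the kernel's actual API:

```python
FP8_E4M3_MAX = 448.0  # largest finite value in OCP fp8 e4m3


def quantize_row_fp8(row):
    """Dynamically quantize one query row.

    The scale is derived from the row's own max magnitude at runtime,
    so no offline calibration pass is needed.
    """
    amax = max((abs(v) for v in row), default=0.0)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    # Divide by the scale, round to the nearest representable integer step,
    # and clamp into the fp8 range.
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(v / scale))) for v in row]
    return q, scale  # dequantize later as q[i] * scale


q, s = quantize_row_fp8([0.5, -1.0, 2.0, 0.448])
```

Because the scale adapts to each row, large activations do not force the whole query tensor onto one coarse grid, which is the main accuracy benefit of dynamic over static quantization.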

gshtras and others added 30 commits February 17, 2025 15:42
* Enabling ROCm CI on MI250 machines:
- correct build target
- correct queue

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>

---------

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
* Optimization for quantized gemm skinny sizes

* lint fix

* Add support for bf16/fp16

* code cleanup

* code cleanup

* lint fix2

* cleanup

* Moved the logic into tuned gemm to preserve API compatibility
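The "moved into tuned gemm" step above can be pictured as a shape-based dispatch that keeps the public entry point unchanged. The threshold and names below are made up for illustration and are not the actual ROCm/vLLM implementation:

```python
def pick_gemm_path(m, n, k, skinny_m_threshold=16):
    """Route skinny GEMMs (very small M, e.g. low-batch decode) to a
    specialized kernel; everything else takes the default path."""
    if m <= skinny_m_threshold:
        return "skinny_gemm"   # kernel tuned for tall-and-thin shapes
    return "default_gemm"      # library/BLAS fallback


def mm(a_shape, b_shape):
    """Public API stays the same; only the internal routing changes."""
    m, k = a_shape
    k2, n = b_shape
    assert k == k2, "inner dimensions must match"
    return pick_gemm_path(m, n, k)
```

Hiding the routing inside the tuned-gemm layer is what preserves API compatibility: callers never see which kernel was chosen.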

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* Removing gfx940 and gfx941 targets. These have been deprecated in favor of gfx942 for MI300X

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* Remove from custom kernels as well

---------

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
* Advance torch commit to be past pytorch/pytorch#144942 to fix tunable ops

* Make sure to use the submodule commit compatible with the main aiter commit
Signed-off-by: Sage Moore <sage@neuralmagic.com>
* Using aiter branch that can be built into a whl with PREBUILD_KERNELS=1

* Using fail fast on aiter build to see compilation errors in the log since it fails silently

* Check for build success without installing whl
* Using proposed fix from ROCm/aiter#115

* Build fix
* tuning adjustment for quantized skinny gemm.

* lint fix
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
@amd-xiaoyu12 amd-xiaoyu12 changed the title Update fp8 paged attention Update fp8 paged attention for MI308 Jul 9, 2025
@amd-xiaoyu12 amd-xiaoyu12 changed the title Update fp8 paged attention for MI308 Update fp8 paged attention Aug 4, 2025
@gshtras gshtras force-pushed the main branch 2 times, most recently from 1d2c43d to eb9d4de Compare September 9, 2025 16:43