
[ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp#29287

Merged
tjtanaa merged 11 commits into vllm-project:main from ROCm:ganyi/optimized_dsv3.2
Jan 21, 2026

Conversation


@ganyi1996ppo ganyi1996ppo commented Nov 24, 2025

Purpose

This PR optimizes DeepSeek V3.2 performance on AMD devices and separates SparseAttnIndexer out as a CustomOp, since it contains many heavy kernels such as fp8_mqa_logits and fp8_paged_mqa_logits, and the optimal implementation of this indexer op can vary across platforms. The main changes include:

  • Separate SparseAttnIndexer out as a CustomOp.
  • Integrate mla_decode_fwd into AiterMLASparseBackend to accelerate sparse MLA.
  • Add a fetch_ragged_layout Triton kernel to handle the dynamic-shape issue and enable full CUDA graph capture in the decode phase.
  • Add _indexer_k_quant_and_cache_kernel and _cp_gather_indexer_quant_cache_kernel Triton kernels for preshuffle layout support.
  • Integrate the Gluon implementation of fp8_paged_mqa_logits with preshuffle layout support.
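The first bullet is the core refactor: vLLM's CustomOp pattern binds forward() to a platform-specific method once, at construction time, so per-call dispatch is free. The following is a minimal standalone sketch of that dispatch pattern, not vLLM's actual CustomOp class; the class and method names (SparseAttnIndexerSketch, the kernel-name strings) are illustrative assumptions.

```python
class CustomOpSketch:
    """Minimal sketch of CustomOp-style dispatch: pick the platform
    implementation once in __init__ instead of branching per call."""

    def __init__(self, platform: str):
        # Real vLLM resolves this from current_platform; we take a string.
        if platform == "rocm":
            self._forward = self.forward_hip
        elif platform == "cuda":
            self._forward = self.forward_cuda
        else:
            self._forward = self.forward_native

    def forward(self, *args, **kwargs):
        return self._forward(*args, **kwargs)

    def forward_native(self, x):
        raise NotImplementedError

    # Platform variants default to the native path unless overridden.
    forward_cuda = forward_native
    forward_hip = forward_native


class SparseAttnIndexerSketch(CustomOpSketch):
    """Hypothetical indexer op: each platform selects its own heavy
    kernel (stand-ins for fp8_mqa_logits / fp8_paged_mqa_logits)."""

    def forward_cuda(self, x):
        return ("deepgemm_fp8_mqa_logits", x)

    def forward_hip(self, x):
        return ("aiter_fp8_mqa_logits", x)
```

With this shape, adding an AMD-optimized indexer only means overriding forward_hip, which is the structure this PR gives SparseAttnIndexer.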

Test Plan

accuracy: gsm8k
performance: vllm bench

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added deepseek Related to DeepSeek models rocm Related to AMD ROCm v1 labels Nov 24, 2025
@ganyi1996ppo ganyi1996ppo marked this pull request as ready for review November 24, 2025 13:43
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +186 to +190
self.paged_kv_last_page_len = torch.ones(
max_num_seqs, dtype=torch.int32, device=device
)

# These two needs to be calculated in runtime,

P1: Compute paged_kv_last_page_len for non-unit block sizes

ROCm sparse MLA now runs with block_size 64 (DeepseekV32IndexerBackend.supported_kernel_block_sizes was switched to [64]), but the ROCM metadata builder still initializes paged_kv_last_page_len to all ones and never derives the actual last-page lengths before passing it into mla_decode_fwd. With block sizes larger than 1, this tells the kernel every last cache page has only a single valid token, so any decode where a sequence spans more than one cache page will mask out all but the first position of its final page, truncating attention for longer contexts.
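The derivation Codex asks for is simple page arithmetic. Below is an illustrative sketch, not the PR's actual metadata-builder code; seq_lens and block_size are assumed inputs.

```python
def last_page_lens(seq_lens, block_size):
    """Number of valid tokens in the final cache page of each sequence.

    A sequence of length L occupies ceil(L / block_size) pages; its last
    page holds ((L - 1) % block_size) + 1 tokens (exactly block_size when
    L is a multiple of block_size). Length-0 sequences have no valid tokens.
    """
    return [((L - 1) % block_size) + 1 if L > 0 else 0 for L in seq_lens]
```

For block_size 64, `last_page_lens([1, 64, 65, 130], 64)` gives `[1, 64, 1, 2]`; initializing the tensor to all ones is only correct for the first of those cases, which is the truncation the comment describes.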



tjtanaa commented Nov 24, 2025

@ganyi1996ppo Can you also include the AITER commit or requirements, and also information if the AITER commit in the Dockerfile.rocm_base supports the feature? It will be easier for us to keep track and plan the merge.

@ganyi1996ppo ganyi1996ppo changed the title [ROCm][Deepseekv3.2][Perf] Performance optimize of deepseek v3.2 on AMD device [ROCm][Deepseekv3.2][Perf] Performance optimization of deepseek v3.2 on AMD device Nov 24, 2025

ganyi1996ppo commented Nov 25, 2025

@ganyi1996ppo Can you also include the AITER commit or requirements, and also information if the AITER commit in the Dockerfile.rocm_base supports the feature? It will be easier for us to keep track and plan the merge.

It should work fine after this AITER commit: e2a1a6f7c8628e14b28c09844ee25ef0b6f9b19d

@heheda12345
Collaborator

@ganyi1996ppo please use 20-shot gsm-8k to verify correctness. https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html#accuracy-benchmarking

This model selects 2048 tokens, so correctness evaluation should use requests longer than 2048 tokens.

num_decode_tokens = attn_metadata.num_decode_tokens

ops.indexer_k_quant_and_cache(
indexer_k_quant_cache_and_cache_func = ops.indexer_k_quant_and_cache
Collaborator

can you use CustomOp to select the kernel implementation for different platforms?

Contributor Author

hi @heheda12345, thanks for the suggestion. We're actually planning to replace many of the kernels in sparse_attn_indexer, so perhaps it's better to wrap sparse_attn_indexer itself as a CustomOp. What do you think?

Collaborator

Yes, that works for me.

@ganyi1996ppo
Contributor Author

@ganyi1996ppo please use 20-shot gsm-8k to verify correctness. https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html#accuracy-benchmarking
This model selects 2048 tokens so correctness evaluation should use requests longer than 2048 tokens

Sure, thanks for the suggestion!

@ganyi1996ppo
Contributor Author

Update 20-shot gsm8k result:

# 20-shot
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9507|±  |0.0060|
|     |       |strict-match    |    20|exact_match|↑  |0.9515|±  |0.0059|

@ganyi1996ppo
Contributor Author

hi @heheda12345 @tjtanaa, I just updated the code as we discussed; please take a look.

@ganyi1996ppo
Contributor Author

And @tjtanaa, we might need to wait for Triton to be updated to 3.5.0 before merging this PR, or the Gluon version of fp8_paged_mqa_logits might simply fail to compile. Do you have any idea about the Triton update plan on the ROCm platform?


tjtanaa commented Dec 11, 2025

And @tjtanaa , we might need to wait triton update to 3.5.0 before merge this PR. Or the gluon version of fp8_paged_mqa_logits might just not be able to compile. Do you have any clue on the triton update plan on rocm platform?

@gshtras do you know if there are any plans for triton updates and also aiter updates?

Signed-off-by: ganyi <ygan@amd.com>
@ganyi1996ppo ganyi1996ppo force-pushed the ganyi/optimized_dsv3.2 branch from 5111929 to ee0a366 on January 21, 2026 02:25
@tjtanaa tjtanaa left a comment

I added some comments to expedite the review. I will add more in the next couple of hours.

"NHD",
block_tile_size,
head_tile_size,
IS_FNUZ=current_platform.fp8_dtype() == torch.float8_e4m3fnuz,
Collaborator

There is a helper function, current_platform.is_fp8_fnuz(); can you also help add a cache decorator to is_fp8_fnuz?

@ganyi1996ppo ganyi1996ppo Jan 21, 2026

I found that a cache decorator cannot be captured by torch.compile; maybe we can leave this one as is? The host overhead should be minor for big models like dsv3.2.

Collaborator

OK, sure.

return importlib.util.find_spec("aiter.ops.triton.fp8_mqa_logits") is not None

if rocm_aiter_ops.is_enabled() and has_mqa_logits_module():
from aiter.ops.triton.attention.fp8_mqa_logits import fp8_mqa_logits
Collaborator

In the AITER version used in Dockerfile.rocm_base, the kernel already exists; however, its path differs from the one used in the latest main.

The AITER code in Dockerfile.rocm_base stores the op at the path from aiter.ops.triton.fp8_mqa_logits import fp8_mqa_logits

Contributor Author

Please take a look again; I have wrapped them up to make the code compatible with both AITER versions.
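A compatibility wrapper like the one described can be a simple try-the-new-path, fall-back-to-the-old import helper. The sketch below is generic and not the PR's actual code; since AITER is not importable here, the usage example falls back to a stdlib module to show the mechanism.

```python
import importlib


def import_from_first(attr: str, *module_paths: str):
    """Return `attr` from the first module path that imports and defines it.

    Mirrors the AITER version shim: newer AITER moved kernels (e.g.
    aiter.ops.triton.fp8_mqa_logits to aiter.ops.triton.attention.*), so
    code tries the new path first and falls back to the old one.
    """
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Missing package or submodule: try the next candidate path.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of {module_paths}")


# Usage: the first path does not exist in this environment, so the helper
# falls back to the stdlib `math` module purely to demonstrate the pattern.
sqrt = import_from_first("sqrt", "aiter.ops.triton.attention.fp8_mqa_logits", "math")
```

The same helper resolves fp8_mqa_logits regardless of which AITER layout is installed, which is why both the Dockerfile.rocm_base version and latest main can work.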


if rocm_aiter_ops.is_enabled():
from aiter.ops.triton.pa_mqa_logits import deepgemm_fp8_paged_mqa_logits_stage1
from aiter.ops.triton.attention.pa_mqa_logits import (
Collaborator

The AITER used in the Dockerfile.rocm_base does not have this path yet. It should be

        from aiter.ops.triton.pa_mqa_logits import (
            deepgemm_fp8_paged_mqa_logits_stage1,
        )

Contributor Author

This one is compatible with both versions as well.

Signed-off-by: ganyi <ygan@amd.com>
@tjtanaa tjtanaa left a comment

LGTM.

@tjtanaa tjtanaa changed the title [ROCm][Deepseekv3.2][Perf] Performance optimization of deepseek v3.2 on AMD device [ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp Jan 21, 2026
@tjtanaa tjtanaa merged commit 6c20e89 into vllm-project:main Jan 21, 2026
65 checks passed
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: 陈建华 <1647430658@qq.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1


5 participants