Skip to content

[ROCm][Deepseekv3.2][Perf] dsv3.2 further optimization on vllm#32649

Draft
ganyi1996ppo wants to merge 9 commits intovllm-project:mainfrom
ROCm:ganyi/dsv3.2_further_opt
Draft

[ROCm][Deepseekv3.2][Perf] dsv3.2 further optimization on vllm#32649
ganyi1996ppo wants to merge 9 commits intovllm-project:mainfrom
ROCm:ganyi/dsv3.2_further_opt

Conversation

@ganyi1996ppo
Copy link
Contributor

@ganyi1996ppo ganyi1996ppo commented Jan 20, 2026

Purpose

This PR move some of the original feature from #29287 to here. Includes some triton 3.5.0 depending kernel. And add more optimization on ROCMAiterMLASparseBackend. This PR depends on #29287 to merge

Test Plan

gsm8k with 20 shot

Test Result

Tasks Version Filter n-shot Metric Value Stderr
gsm8k 3 flexible-extract 20 exact_match 0.9484 ± 0.0061
strict-match 20 exact_match 0.9484 ± 0.0061

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added deepseek Related to DeepSeek models rocm Related to AMD ROCm v1 labels Jan 20, 2026
@mergify
Copy link

mergify bot commented Jan 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ganyi1996ppo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 20, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces performance optimizations for Deepseek v3.2 on ROCm by adding new Triton kernels and a specialized backend. It also refactors the sparse_attn_indexer logic into a dedicated file, which is a good architectural improvement. However, I've identified a critical bug in the refactored CUDA path that could lead to an AttributeError, and a significant limitation in the new ROCm kernels due to a hardcoded value that restricts flexibility. Addressing these issues will improve the robustness and applicability of these optimizations.

):
return torch.ops.vllm.sparse_attn_indexer(
hidden_states,
self.k_cache.layer_prefix,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The k_cache object is of type DeepseekV32IndexerCache, which has a prefix attribute but not a layer_prefix attribute. Using self.k_cache.layer_prefix will result in an AttributeError. The HIP path correctly uses self.k_cache.prefix. This should be consistent.

Suggested change
self.k_cache.layer_prefix,
self.k_cache.prefix,

chunk.cu_seqlen_ke,
)
num_rows = logits.shape[0]
assert topk_tokens == 2048, "top_k_per_row assumes size 2048"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The code asserts that topk_tokens must be 2048. This hardcoded value limits the flexibility of the sparse attention indexer. If this is a temporary limitation of the underlying custom C++ op, it should be noted with a TODO. For broader applicability, this should be made more flexible or at least provide a more informative error message if the value is unsupported.

)

num_rows = logits.shape[0]
assert topk_tokens == 2048, "top_k_per_row assumes size 2048"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the prefill path, the decode path also asserts that topk_tokens must be 2048. This hardcoded value is restrictive and should be generalized if possible to support other values.

@ganyi1996ppo ganyi1996ppo force-pushed the ganyi/dsv3.2_further_opt branch from 40265e8 to 9707e58 Compare January 20, 2026 09:17
@mergify mergify bot removed the needs-rebase label Jan 20, 2026
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
@ganyi1996ppo ganyi1996ppo force-pushed the ganyi/dsv3.2_further_opt branch from 9707e58 to e569fa2 Compare January 20, 2026 15:32
@mergify
Copy link

mergify bot commented Jan 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ganyi1996ppo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 20, 2026
Signed-off-by: ganyi <ygan@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deepseek Related to DeepSeek models needs-rebase rocm Related to AMD ROCm v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant