[ROCm][Deepseekv3.2][Perf] dsv3.2 further optimization on vllm #32649
ganyi1996ppo wants to merge 9 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces performance optimizations for Deepseek v3.2 on ROCm by adding new Triton kernels and a specialized backend. It also refactors the sparse_attn_indexer logic into a dedicated file, which is a good architectural improvement. However, I've identified a critical bug in the refactored CUDA path that could lead to an AttributeError, and a significant limitation in the new ROCm kernels due to a hardcoded value that restricts flexibility. Addressing these issues will improve the robustness and applicability of these optimizations.
```python
    ):
        return torch.ops.vllm.sparse_attn_indexer(
            hidden_states,
            self.k_cache.layer_prefix,
```
The k_cache object is of type DeepseekV32IndexerCache, which has a prefix attribute but not a layer_prefix attribute. Using self.k_cache.layer_prefix will result in an AttributeError. The HIP path correctly uses self.k_cache.prefix. This should be consistent.
```diff
-            self.k_cache.layer_prefix,
+            self.k_cache.prefix,
```
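As a minimal sketch of why this is a bug (the class below is a toy stand-in for `DeepseekV32IndexerCache`, not the real implementation): Python attribute lookup fails at call time, so the broken CUDA path would only surface the error once that code path actually runs.

```python
class IndexerCache:
    """Toy stand-in for DeepseekV32IndexerCache: only `prefix` is defined."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix


cache = IndexerCache("model.layers.0.self_attn.indexer")

# The HIP path reads the attribute that actually exists.
assert cache.prefix == "model.layers.0.self_attn.indexer"

# The CUDA path as written does the equivalent of this lookup,
# which raises AttributeError because `layer_prefix` was never set.
try:
    _ = cache.layer_prefix
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```

Because nothing fails at import time, a test that exercises the CUDA indexer path is the only way to catch this class of typo early.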
```python
            chunk.cu_seqlen_ke,
        )
        num_rows = logits.shape[0]
        assert topk_tokens == 2048, "top_k_per_row assumes size 2048"
```
The code asserts that topk_tokens must be 2048. This hardcoded value limits the flexibility of the sparse attention indexer. If this is a temporary limitation of the underlying custom C++ op, it should be noted with a TODO. For broader applicability, this should be made more flexible or at least provide a more informative error message if the value is unsupported.
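One way to make the limitation self-documenting, as the review suggests: replace the bare assert with a guard that raises a descriptive error and carries a TODO. This is a sketch, not the PR's code; the 2048 constraint and the fallback suggestion are taken from the comment above, and the helper name is illustrative.

```python
SUPPORTED_TOPK = 2048  # current limit of the custom top_k_per_row op


def check_topk_tokens(topk_tokens: int) -> None:
    """Validate topk_tokens with an informative error instead of a bare assert.

    TODO: lift this restriction once the underlying C++ op supports
    other top-k sizes.
    """
    if topk_tokens != SUPPORTED_TOPK:
        raise NotImplementedError(
            f"top_k_per_row currently only supports topk_tokens="
            f"{SUPPORTED_TOPK}, got {topk_tokens}; fall back to the "
            "generic sparse-attn indexer path for other sizes."
        )


check_topk_tokens(2048)  # passes silently
```

Unlike `assert`, this check also survives `python -O`, which strips assertions.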
Force-pushed from 40265e8 to 9707e58 (compare)
Signed-off-by: ganyi <ygan@amd.com>
Force-pushed from 9707e58 to e569fa2 (compare)
Purpose
This PR moves some of the original features from #29287 here, including several kernels that depend on Triton 3.5.0, and adds further optimizations to ROCMAiterMLASparseBackend. This PR depends on #29287 being merged first.

Test Plan
gsm8k with 20-shot
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.