[feat][plugin] make ATOM MLA attention work for vLLM #265
XiaobingSuper wants to merge 13 commits into ROCm:main
Conversation
Force-pushed from 0a9f742 to 77ccd4d
Pull request overview
Adds vLLM plugin-mode support for ATOM’s MLA attention path (non-sparse), including backend selection, metadata plumbing, and DeepSeek V3 model registration/loading so MLA can run end-to-end under vLLM.
Changes:
- Route vLLM's `use_mla` attention selection to an ATOM MLA backend and add MLA-specific plugin-mode metadata builders.
- Implement plugin-mode MLA forward/prefill/decode logic (including positions capture for graph mode).
- Register DeepSeek V3 as a supported vLLM plugin model and add a plugin-mode `load_weights` implementation.
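The backend-routing change can be sketched roughly as below. The hook name, the config attribute access, and the backend class paths are illustrative assumptions, not ATOM's actual identifiers:

```python
from types import SimpleNamespace

# assumed backend paths, for illustration only
ATOM_MLA_BACKEND = "atom.plugin.attention_mla.Backend"
ATOM_ATTENTION_BACKEND = "atom.plugin.attention.Backend"

def get_attn_backend_cls(attn_selector_config) -> str:
    """Route vLLM's attention selection to the matching ATOM backend."""
    if getattr(attn_selector_config, "use_mla", False):
        return ATOM_MLA_BACKEND
    return ATOM_ATTENTION_BACKEND

mla_cfg = SimpleNamespace(use_mla=True)
mha_cfg = SimpleNamespace(use_mla=False)
```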
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/utils/backends.py | Extends compilation-cache hashing to ignore <frozen os> traced “files”. |
| atom/plugin/vllm/register.py | Patches vLLM process_weights_after_loading for Attention/MLAAttention. |
| atom/plugin/vllm/platform.py | Selects ATOM MLA backend when attn_selector_config.use_mla is true. |
| atom/plugin/vllm/model_wrapper.py | Copies positions into a static buffer for graph-mode MLA correctness. |
| atom/plugin/attention_mla.py | New: plugin-mode MLAAttention implementation helpers (prefill/decode/DCP). |
| atom/plugin/attention.py | Adds MLA plugin-mode metadata builders + backend wiring; renames plugin metadata class. |
| atom/models/deepseek_v2.py | Adds DeepSeek V3 support + plugin-mode load_weights. |
| atom/model_ops/utils.py | Removes duplicate per_tensor_dequantize implementation (keeps the canonical one). |
| atom/model_ops/paged_attention.py | Integrates vLLM MLAAttention usage and allocates a shared positions buffer. |
| atom/model_ops/linear.py | Ensures activation tensor is contiguous before quantizer .view() calls. |
| atom/model_ops/base_attention.py | Adjusts MLA unified-attn path to apply o_proj outside MLA impl. |
| atom/model_ops/attentions/aiter_mla.py | Decorates MLA backend/builder for plugin mode; builder init adjustments. |
| atom/model_ops/attentions/aiter_attention.py | Removes unused import. |
| atom/model_ops/attention_mla.py | Adds plugin-mode hooks/decorator and splits v_up and o_proj responsibilities. |
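On the model_wrapper.py change: captured graphs replay kernels against fixed memory addresses, so per-step positions must be copied into a pre-allocated static buffer rather than rebound each step. A minimal sketch of the idea, with made-up buffer size and names:

```python
import torch

MAX_TOKENS = 8  # hypothetical capture-time upper bound
positions_buf = torch.zeros(MAX_TOKENS, dtype=torch.long)  # static buffer

def bind_positions(positions: torch.Tensor) -> torch.Tensor:
    # copy in place so captured graphs keep seeing the same storage
    n = positions.numel()
    positions_buf[:n].copy_(positions)
    return positions_buf[:n]

first = bind_positions(torch.tensor([0, 1, 2]))
second = bind_positions(torch.tensor([5, 6, 7]))
```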
atom/model_ops/linear.py
Outdated
```diff
    # quant_func will call view, so we need to call contiguous to avoid view error
    x, x_scale = quant_func(
-       x,
+       x.contiguous(),
```
This is required for the DeepSeek-R1 model, where x is a sliced tensor that cannot be viewed.

we will do something else to avoid the contiguous call, which introduces a memory copy here; all our quant kernels should already support non-contiguous tensors... did we hit any issue here?

Yes, I hit this issue on the chunked prefill path; see https://github.com/XiaobingSuper/ATOM/blob/xiaobing/oot_kimi/atom/plugin/attention_mla.py#L345.

then maybe do the contiguous at that call site; I don't want to lose any perf

Yes, I updated the code to do the contiguous on the plugin side.
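The underlying failure mode is easy to reproduce: `Tensor.view` requires contiguous memory, so a column slice cannot be viewed until it is materialized with `.contiguous()` (which is the extra copy being discussed). A standalone illustration:

```python
import torch

x_full = torch.randn(4, 8)
x = x_full[:, :4]               # column slice -> non-contiguous storage
assert not x.is_contiguous()

try:
    x.view(-1)                  # view needs contiguous memory
    view_ok = True
except RuntimeError:
    view_ok = False

flat = x.contiguous().view(-1)  # contiguous() copies, then view succeeds
```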
DeepSeek-R1-0528 with TP=8 has also been tested:
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
ChuanLi1101 left a comment:
Left my comment FYI.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
ChuanLi1101 left a comment:
LGTM, thanks for the quick turnaround.
Force-pushed from f5260c0 to e046d06
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Force-pushed from 34a1e24 to 5c460b9
```diff
    # dummy run: skip real attention and return
    output_shape = list(q.shape)
-   output_shape[-1] = 7168
+   output_shape[-1] = self.num_heads * self.v_head_dim
```
`self.num_heads * self.v_head_dim` doesn't look equal to 7168 for DeepSeek.
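For reference, with DeepSeek-V3-like dimensions (num_heads=128, v_head_dim=128, hidden_size=7168; taken here as assumptions), the pre-o_proj width is num_heads * v_head_dim = 16384, not the hidden size 7168, hence the fix:

```python
import torch

# assumed DeepSeek-V3-like dims, for illustration only
num_heads, v_head_dim, hidden_size = 128, 128, 7168

q = torch.randn(3, num_heads * 192)        # 192: placeholder qk head dim
output_shape = list(q.shape)
output_shape[-1] = num_heads * v_head_dim  # pre-o_proj width, not hidden_size
```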
The plugin path is also in our repo... then why do we have to move o_proj out of attn?

This change is for the fallback path, i.e., plugin mode but using the vLLM attn backend. Because we use vLLM's MLAAttention class (self.attn here), the forward path doesn't include o_proj; see https://github.com/vllm-project/vllm/blob/v0.15.1/vllm/attention/layer.py#L640. It only does the attention compute.

@valarLip Plugin mode with the vLLM attention backend (setting ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=1) will look like this: https://github.com/zejunchen-zejun/ATOM/blob/zejun/plugin_for_atom_1223/recipes/vLLM-ATOM-OOT-Plugin-Backend.md#launching-server-of-vllm-with-atom-oot-plugin-platform
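The split being described, where the attention impl returns a `(tokens, num_heads * v_head_dim)` tensor and the caller applies o_proj afterwards, can be sketched as follows (all dims are toy placeholders):

```python
import torch

num_heads, v_head_dim, hidden_size = 4, 8, 16  # toy dims
tokens = 2

# the attn impl (vLLM's MLAAttention in this fallback) stops at the attention output
attn_out = torch.randn(tokens, num_heads * v_head_dim)

# o_proj is applied by the caller, outside the attention impl
o_proj = torch.nn.Linear(num_heads * v_head_dim, hidden_size, bias=False)
hidden = o_proj(attn_out)
```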
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
```python
        **kwargs,
    )

    impl_args["head_size" if self.use_mla else "head_dim"] = head_dim
```
does this also come from vLLM? I would like us to always use head_dim

Yes, vLLM uses head_size; see https://github.com/vllm-project/vllm/blob/v0.15.1/vllm/attention/layer.py#L579. Before this PR (6c40248), it also used head_size.
```diff
    self.layer_num = layer_num

-   def process_weights_after_loading(self):
+   def process_weights_after_loading(self, act_dtype: Optional[torch.dtype] = None):
```
On the vLLM side, the call path is like https://github.com/vllm-project/vllm/blob/1892993bc18e243e2c05841314c5e9c06a80c70d/vllm/attention/layer.py#L675, so it needs such an arg.
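A minimal sketch of the signature change: the method accepts the optional act_dtype that vLLM's loader passes, while existing callers that pass nothing keep working (class and attribute names are illustrative):

```python
from typing import Optional

class PluginAttention:
    def __init__(self) -> None:
        self.act_dtype: Optional[object] = None

    def process_weights_after_loading(self, act_dtype: Optional[object] = None) -> None:
        # record the activation dtype for later weight post-processing
        self.act_dtype = act_dtype

layer = PluginAttention()
layer.process_weights_after_loading()        # legacy ATOM-style call
layer.process_weights_after_loading("bf16")  # vLLM-style call with act_dtype
```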
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
```diff
@@ -153,7 +204,8 @@ def __init__(
        k_norm=k_norm,
        **kwargs,
    )
+   impl_args["head_size" if self.use_mla else "head_dim"] = head_dim
    self.impl = impl_cls(**impl_args)
```
When use_mla is True, impl_cls is atom.model_ops.attention_mla.MLAAttention which now expects head_size (not head_dim). This code always includes head_dim in impl_args and then also adds head_size, so MLAAttention will receive an unexpected head_dim kwarg and raise at construction time. Build impl_args conditionally (only pass head_dim for MHA, and only pass head_size for MLA), or remove the unconditional head_dim entry before instantiating the MLA impl.
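One way to address this comment is to set only the kwarg the chosen impl expects (function and argument names here are illustrative, not the PR's actual code):

```python
def build_impl_args(use_mla: bool, head_dim: int, **common) -> dict:
    """Pass head_size for the MLA impl, head_dim for the MHA impl, never both."""
    args = dict(common)
    if use_mla:
        args["head_size"] = head_dim  # MLA impl (vLLM naming) expects head_size
    else:
        args["head_dim"] = head_dim   # MHA impl keeps ATOM's head_dim
    return args

mla_args = build_impl_args(True, 128, num_heads=16)
mha_args = build_impl_args(False, 128, num_heads=16)
```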
Motivation
Following #126, this PR makes ATOM MLA attention work in vLLM plugin mode. Note: sparse MLA is not supported yet and will be implemented in the next step.
Technical Details
The design details can be seen in #126.
Test Plan
This PR was tested with the Kimi-K2-Thinking-MXFP4 model with TP=4 on MI355:
Test Result
GSM8K result:
Submission Checklist