
feat: support mla model in vllm #261

Open
ztang2370 wants to merge 3 commits into ovg-project:main from ztang2370:feat/support-mla-model-in-vllm

Conversation

@ztang2370
Contributor

ztang2370 commented Feb 28, 2026

Tested the MLA model deepseek-ai/DeepSeek-V2-Lite on vllm 0.14.0–0.16.0.

@cui36
Collaborator

cui36 commented Feb 28, 2026

Thanks @ztang2370!

@ivanium
Collaborator

ivanium commented Mar 8, 2026

Nice contribution! At a high level, I feel alloc_mla_kv_caches should be merged into alloc_kv_caches, since their logic is largely similar and we already have an attention_type field for this. wdyt?
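[Editor's note] The merge proposed above could look roughly like the sketch below: one allocation entry point that branches on the attention type instead of a separate MLA path. All names here (AttentionType, the alloc_kv_caches signature, the buffer layout) are illustrative assumptions, not the actual kvcached API; MLA is modeled as a single combined latent KV buffer per layer, versus separate K and V buffers for standard attention, as described in the review below.

```python
# Hypothetical sketch: merging alloc_mla_kv_caches into alloc_kv_caches by
# branching on an attention_type field. Not the real kvcached interface.
from enum import Enum
from typing import Dict, List


class AttentionType(Enum):
    MHA = "mha"  # standard attention: separate K and V buffers per layer
    MLA = "mla"  # multi-head latent attention: one combined latent KV buffer


def alloc_kv_caches(num_layers: int, num_blocks: int, block_size: int,
                    head_dim: int,
                    attention_type: AttentionType) -> List[Dict[str, list]]:
    """Allocate per-layer KV cache buffers, branching on attention type."""
    elems = num_blocks * block_size * head_dim
    caches: List[Dict[str, list]] = []
    for layer in range(num_layers):
        if attention_type is AttentionType.MLA:
            # MLA stores one compressed latent buffer holding both K and V.
            caches.append({"kv": [0.0] * elems})
        else:
            # MHA keeps distinct key and value buffers.
            caches.append({"k": [0.0] * elems, "v": [0.0] * elems})
    return caches
```

With this shape, callers pass the model's attention type once and the MLA-specific allocation path disappears as a separate function, which is the deduplication the comment is asking for.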

@ztang2370
Contributor Author

Nice contribution! At a high level, I feel alloc_mla_kv_caches should be merged into alloc_kv_caches, since their logic is largely similar and we already have an attention_type field for this. wdyt?

Agreed. Updated with alloc_mla_kv_cache merged into alloc_kv_cache and tested. I think the same refactor should be applied on the sglang side, cc @cui36. I'll do the sglang part later in another PR.

@ztang2370
Contributor Author

Hi @cui36, I ran into problems setting up sglang with kvcached from source on the DGX Spark device, so I don't have a device to test the refactor on the sglang side for now (the smallest MLA model, DeepSeek-V2-Lite, takes more than 32 GB). I'll leave it to you if you don't mind.
On DGX Spark we can currently set up sglang only with Docker.

@jiarong0907
Collaborator

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for MLA models in vllm. The changes are mostly in kvcached/integration/vllm/interfaces.py and kvcached/integration/vllm/patches.py to handle the specifics of MLA, such as the combined KV buffer. The implementation is sound, but I've identified a couple of areas with code duplication that could be refactored to improve maintainability. My review includes suggestions for these refactorings.

ztang2370 force-pushed the feat/support-mla-model-in-vllm branch from 3554ffc to 5e2bd4c on March 17, 2026 16:18
