skip target model mm emb computation when draft is text-only by kkt-cohere · Pull Request #1 · kkt-cohere/vllm

kkt-cohere · 2026-01-29T19:00:03Z

Purpose

This PR makes the multimodal (mm) embedding gather in the draft proposal step conditional on the draft model's multimodal support as well as the target model's. The mm embeddings are not needed when the draft model does not support multimodal inputs. Specifically, when self.drafter.supports_mm_inputs is False, the mm_embed_inputs value that is assigned and then passed to the drafter is never used, because the block that consumes it is skipped.
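The condition described above can be illustrated with a minimal, self-contained sketch. It is not the vLLM code: `StubDrafter`, `maybe_gather_mm_embeddings`, `target_supports_mm`, and `gather_fn` are hypothetical names introduced for illustration. Only the `supports_mm_inputs` flag comes from the PR text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StubDrafter:
    # Hypothetical stand-in for the drafter; only the flag named in the PR is modeled.
    supports_mm_inputs: bool


def maybe_gather_mm_embeddings(
    target_supports_mm: bool,
    drafter: StubDrafter,
    gather_fn: Callable[[], List[str]],
) -> Optional[List[str]]:
    """Run the gather only when both target and draft can use mm inputs.

    gather_fn is an assumed placeholder for the real embedding gather.
    """
    if target_supports_mm and drafter.supports_mm_inputs:
        return gather_fn()
    # Text-only drafter (or text-only target): skip the gather entirely.
    return None


# Usage: count how often the (placeholder) gather actually runs.
gather_calls: List[int] = []


def fake_gather() -> List[str]:
    gather_calls.append(1)
    return ["mm-embedding"]


skipped = maybe_gather_mm_embeddings(True, StubDrafter(False), fake_gather)
gathered = maybe_gather_mm_embeddings(True, StubDrafter(True), fake_gather)
```

With a text-only drafter the gather function is never invoked, so under this sketch's assumptions no embedding lookup happens on that path.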

Test Plan

We rely on the existing CI tests. In addition, we ran the following offline speculative decoding multimodal benchmark to sanity-check backward compatibility:

VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method eagle --model-dir meta-llama/Llama-4-Scout-17B-16E-Instruct --eagle-dir morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct --num_spec_tokens 3 --tp 4 --num-prompts 12 --custom-mm-prompts

Without this change, we saw MM cache misses that crashed the model under production traffic.

Test Result

After this change, production traffic is stable, and nothing that previously worked has broken.

Signed-off-by: kkt-cohere <komal@cohere.com>


kkt-cohere and others added 30 commits January 29, 2026 14:25

skip target model mm emb computation when draft is text-only

49c81f2

Signed-off-by: kkt-cohere <komal@cohere.com>

[Bugfix] Fix broken GLM-OCR initialization (vllm-project#33350)

5e73e49

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

[ez] Delete torch25_custom_graph_pass (vllm-project#33287)

5a66c9c

Signed-off-by: angelayi <yiangela7@gmail.com>

[Chore] Move MediaConnector to vllm.multimodal.media (vllm-projec…

831453f

…t#33324) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[BUGFIX][XPU] fix memory check after XPU reuse GPU_worker (vllm-proje…

8c8ebeb

…ct#33358) Signed-off-by: Chendi Xue <chendi.xue@intel.com>

[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe (vllm-project…

0493d89

…#32954) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

[Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel (vll…

23591e6

…m-project#33326) Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>

Add Triton fused MoE config for B200 (Nemotron Nano) (vllm-project#32804

8e2a469

)

[release] Minor fixes to release annotation and wheel upload (vllm-pr…

2284461

…oject#33129) Signed-off-by: khluu <khluu000@gmail.com>

[Bugfix] Enable Triton MoE for FP8 per-tensor dynamic (vllm-project#3…

bfb9bda

…3300) Signed-off-by: mgoin <mgoin64@gmail.com>

[Docs] Adding links and intro to Speculators and LLM Compressor (vllm…

1337657

…-project#32849) Signed-off-by: Aidan Reilly <aireilly@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[Model][Multimodal] Add explicit MusicFlamingo adapter (vllm-project#…

c46b0cd

…32696) Signed-off-by: WangHaoyuuu <mailwhaoyu@gmail.com>

Fix tie_word_embeddings for multimodal models in Transformers v5 (v…

80b918f

…llm-project#33359) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[BugFix] Disable async scheduling for Mamba prefix caching (vllm-proj…

ec51831

…ect#33352) Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>

[Models] Refactor Kimi-K2.5 weight loading (vllm-project#33346)

8bfc8d5

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

[CI][AMD] Skip 4 GPUs testgroup ray tests (vllm-project#33305)

070c811

Signed-off-by: Ryan Rock <ryan.rock@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>

Move decode context parallel validationn to ParallelConfig (vllm-pr…

d334dd2

…oject#33239) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[CI] Enable mypy import following for vllm/spec_decode (vllm-projec…

726d897

…t#33282) Signed-off-by: Lucas Kabela <lucaskabela@meta.com>

Explicitly set return_dict for apply_chat_template (vllm-project#…

9432ed8

…33372) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[model] Add support for openPangu7B-VL (vllm-project#32449)

ba45bed

Signed-off-by: hujiaxin <524446785@qq.com> Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com> Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com>

fix: allow LFM2 MoE prefix caching (align) (vllm-project#33376)

f45870b

Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>

[Refactor] Move MM item count validation outside of processor (vllm-p…

c87eac1

…roject#33396) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Misc] Replace Optional[X] with X | None syntax (vllm-project#33332)

1a7894d

Signed-off-by: carlory <baofa.fan@daocloud.io> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

[Realtime API] Adds minimal realtime API based on websockets (vllm-pr…

10152d2

…oject#33187) Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>

[BUGFIX] Pixtral cannot be loaded with --limit-mm-per-prompt 0 (vllm-…

8e2ad97

…project#33406) Signed-off-by: juliendenize <julien.denize@mistral.ai>

[Doc] [ROCm] Update Documentation to reflect v0.15.0 release (vllm-pr…

174f167

…oject#33388) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

Remove deprecated reasoning_content message field (vllm-project#33402)

c5113f6

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal (vllm-…

cf896ae

…project#33323) Signed-off-by: carlory <baofa.fan@daocloud.io> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

[Doc] Enhance documentation around CPU container images (vllm-project…

58cb55e

…#32286) Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

Fix test_moe.py for Transformers v5 (vllm-project#33413)

a11bc12

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

cmunley1 and others added 30 commits January 31, 2026 06:04

support return prompt token ids in responses (vllm-project#33378)

6f5e7cd

[Doc] Update plugin deprecation notices (vllm-project#33476)

793af53

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[fix][torch.compile] Fix cold-start compilation time increase by addi…

15f40b2

…ng kv cache update to splitting ops (vllm-project#33441) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: Richard Zou <zou3519@gmail.com>

[BugFix][Router Replay] Capture Logical Experts with EPLB (vllm-proje…

13b842f

…ct#33013) Signed-off-by: Hollow Man <hollowman@opensuse.org>

Support clear mm and encoder cache (vllm-project#33452)

22d9a05

Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io>

[Bugfix] Early-reject requests with MM data longer than encode cache …

27cb2f6

…capacity (vllm-project#33110) Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Roger Wang <hey@rogerw.io>

[Deprecation] Remove deprecated items related to pooling (vllm-projec…

92924b2

…t#33477) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Refactor] Move MM data parsing outside processor (vllm-project#33408)

88c3e11

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Update huggingface-hub pin for the last time before Transformers v5 (…

ce0afe2

…vllm-project#33473) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Merge branch 'main' into spec-mm-1

5aae98e

[Misc] support collect_env for endpoint /server_info (vllm-project#33246

2238a12

) Signed-off-by: yang.xiao <yang.xiao@daocloud.io>

[Bugfix]: Fix display errors in TORCH_CHECK messages (vllm-project#32942

fedf643

) Signed-off-by: linhaifeng <1371675203@qq.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

Fix grammar (vllm-project#33121)

1e86c80

Signed-off-by: smashyalts <smashyalts@gmail.com>

Merge branch 'main' into spec-mm-1

2d2e7b0

[Misc] Fix flashinfer related tests (vllm-project#33462)

63c0889

Signed-off-by: esmeetu <jasonailu87@gmail.com>

fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 M…

0797811

…oE kernels (vllm-project#33417) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>

[Refactor] Make Renderer an abstract class (vllm-project#33479)

a358e4d

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[ROCm][CI] Update huggingface-hub pin (vllm-project#33492)

0fb3157

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

pin LMCache to v0.3.9 or greater with vLLM v0.15.0 (vllm-project#33440)

d6416fd

Signed-off-by: greg pereira <grpereir@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>

[Bugfix] Fix inconsistent handling of cache reset (vllm-project#33481)

79b6ec6

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Critical] Revert vllm-project#33110 (vllm-project#33500)

b6bb284

[Models]: lfm2_siglip2 return intermediate encoder layers (vllm-proje…

302ecf6

…ct#33370) Signed-off-by: Eduardo Salinas <edus@microsoft.com>

Merge branch 'main' into spec-mm-1

72713ca

fix: only include Authorization header when OPENAI_API_KEY is set (vl…

754a8ca

…lm-project#33488) Signed-off-by: zack041 <zackyu041@gmail.com>

Change defaults for vllm bench startup (vllm-project#33489)

6720238

[Redo] vllm-project#33110 with threading limit (vllm-project#33502)

21997f4

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: YunzhuLu <lucia.yunzhu@gmail.com>

[W8A8 Block Linear Refactor][1/N] Keep all quantization types into `Q…

b5f8c30

…uantFP8` class. (vllm-project#33047) Signed-off-by: maral <maralbahari.98@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

[BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profil…

cd86fff

…ing allow inference Omni on ROCM (vllm-project#33077) Signed-off-by: JartX <sagformas@epdcenter.es>

Add MoE config for Super B200 TP2 (vllm-project#33510)

8869cd8

Merge branch 'main' into spec-mm-1

9c3e15b

kkt-cohere wants to merge 102 commits into main from spec-mm-1

20 participants
