
[Bug] Fix TRTLLM Block FP8 MoE Monolithic #36296

Merged
robertgshaw2-redhat merged 3 commits into vllm-project:main from wzhao18:wzhao/fix-fs-r1-trtllm-moe
Mar 11, 2026

Conversation

@wzhao18 (Contributor) commented Mar 7, 2026

Purpose

Fix #36295

Test Plan

```shell
vllm serve deepseek-ai/DeepSeek-R1 -tp 8
lm_eval --model local-completions --model_args "base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
```

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.953|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.953|±  |0.0058|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@mergify mergify bot added nvidia bug Something isn't working labels Mar 7, 2026
@gemini-code-assist (bot) left a comment

Code Review

This PR fixes a dtype mismatch bug for e_score_correction_bias in the TRT-LLM FP8 MoE monolithic path. The change in _apply_per_block is correct. I've identified a similar issue in _apply_per_tensor that is not covered by this PR and have left a comment with a suggested fix. Addressing this will ensure consistency and prevent potential runtime errors.
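The fix the bot describes can be sketched with stand-in types (a minimal illustration under assumed names — `FakeTensor` and the function body are not vLLM code; only `router_logits` and `e_score_correction_bias` come from the PR discussion):

```python
# Illustrative sketch of the dtype-alignment fix: cast the routing bias to
# match router_logits (fp32), not hidden_states (which may already be fp8).
from dataclasses import dataclass

@dataclass
class FakeTensor:
    """Toy stand-in for a tensor: tracks only a dtype string."""
    dtype: str

    def to(self, dtype: str) -> "FakeTensor":
        return FakeTensor(dtype)

def route(hidden_states, router_logits, e_score_correction_bias):
    # Routing math runs in fp32 regardless of the activation dtype.
    router_logits = router_logits.to("float32")
    if e_score_correction_bias is not None:
        # The buggy path aligned the bias with hidden_states.dtype, which
        # after the MoE refactor is already fp8; aligning it with the fp32
        # router logits keeps the bias exact.
        e_score_correction_bias = e_score_correction_bias.to(router_logits.dtype)
    return router_logits, e_score_correction_bias

x = FakeTensor("float8_e4m3fn")  # quantized activations entering the MoE
logits = FakeTensor("bfloat16")
bias = FakeTensor("float32")
_, fixed_bias = route(x, logits, bias)
print(fixed_bias.dtype)          # float32, not float8
```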

@ProExpertProg (Collaborator)

Can you try `vllm serve deepseek-ai/DeepSeek-V3 -tp=8` as well?

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 10, 2026
@wzhao18 (Contributor, Author) commented Mar 10, 2026

@ProExpertProg I am hitting an OOM issue on top-of-tree. I can roll back to an older commit to test, but it would be best if we could verify on top-of-tree.

@ProExpertProg (Collaborator)

Tested your commit on top of my branch in #36551, that's good enough for me:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   | 0.97|±  |0.0171|
|     |       |strict-match    |     5|exact_match|   | 0.97|±  |0.0171|

@robertgshaw2-redhat (Collaborator)

we need to backport this to 0.17.1

@pavanimajety (Collaborator) left a comment

LGTM. Can we add a note in the PR, or a comment in the code, explaining what determines the e_score_correction_bias dtype? It is typically determined by the checkpoint and the kernel, so hardcoding it to match hidden_states or router_logits could work for some models but not for others.
For example, DeepSeek stores it as:

model.layers.3.mlp.gate.e_score_correction_bias | [256] | F32
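The point above can be sketched as a lookup that trusts the checkpoint metadata rather than an activation dtype (illustrative only; the metadata dict and the `bias_dtype` helper are assumptions, not vLLM code):

```python
# Hypothetical sketch: decide the routing-bias dtype from checkpoint
# metadata instead of hardcoding it to hidden_states/router_logits dtype.
checkpoint_dtypes = {
    # e.g. DeepSeek stores the correction bias in fp32 ("F32"), per the
    # parameter listing quoted in the review comment:
    "model.layers.3.mlp.gate.e_score_correction_bias": "F32",
}

def bias_dtype(param_name: str, default: str = "BF16") -> str:
    """Return the dtype recorded in the checkpoint, falling back to a default."""
    return checkpoint_dtypes.get(param_name, default)

print(bias_dtype("model.layers.3.mlp.gate.e_score_correction_bias"))  # F32
```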

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 10, 2026
@wzhao18 (Contributor, Author) commented Mar 10, 2026

Talked to Pavani offline. I think we can merge this patch for now, but it may be worth checking whether we should instead change the kernel side to support the dtype of e_score_correction_bias as stored in the model checkpoint.

@robertgshaw2-redhat robertgshaw2-redhat changed the title [Bug] Fix TRTLLM FP8 MoE Monolithic [Bug] Fix TRTLLM Block FP8 MoE Monolithic Mar 10, 2026
@ProExpertProg ProExpertProg enabled auto-merge (squash) March 10, 2026 17:58
@ProExpertProg (Collaborator) commented Mar 10, 2026

This PR on B200:

local-completions (pretrained=deepseek-ai/DeepSeek-V3,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   | 0.95|±  | 0.006|
|     |       |strict-match    |     5|exact_match|   | 0.95|±  | 0.006|

@robertgshaw2-redhat (Collaborator)

do we know which PR introduced this issue?

the monolithic refactor did not change the behavior here AFAICT

@robertgshaw2-redhat (Collaborator) commented Mar 10, 2026

> do we know which PR introduced this issue?
> the monolithic refactor did not change the behavior here AFAICT

ah I see, it's because the hidden states are now quantized. I don't understand why this was not caught by the tests; we run Qwen3 FP8 Block on B200.

@robertgshaw2-redhat (Collaborator)

oh, it's because that model does not have e_score_correction_bias

```python
router_logits = router_logits.to(torch.float32)

if e_score_correction_bias is not None:
    e_score_correction_bias = e_score_correction_bias.to(router_logits.dtype)
```
A Collaborator left a review comment on this snippet:

this works, but it is a hack. Do we know what dtype the kernel requires here? Is it just bf16?

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
auto-merge was automatically disabled March 10, 2026 20:58

Head branch was pushed to by a user without write access

@wzhao18 wzhao18 requested review from bnellnm and pavanimajety March 10, 2026 20:58
@wzhao18 (Contributor, Author) commented Mar 10, 2026

The reason for this regression, per @robertgshaw2-redhat's investigation, is that the MoE refactor changed where input quantization happens:

```python
# prior: the bias is cast before the input is quantized,
# so x.dtype is still the high-precision activation dtype
bias = bias.to(x.dtype)
x_q, x_scale = quantize(x)
kernel(...)

# after the refactor: x arrives at this function already quantized,
# so the same cast now demotes the bias to fp8
bias = bias.to(x.dtype)  # oops, cast to fp8
kernel(...)
```

Based on the discussion, we removed the cast of the routing bias.
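The precision loss described above can be reproduced with a toy low-precision cast (a pure-Python sketch; `to_low_precision` and the bias value are made up, standing in for the real fp8 cast):

```python
# Illustrative sketch (not vLLM code): why the order of the cast matters.
# We simulate an fp8-like dtype with a coarse rounding function.
from math import frexp, ldexp

def to_low_precision(v: float) -> float:
    """Crude stand-in for an fp8 cast: keep roughly 2 bits of mantissa."""
    m, e = frexp(v)                    # v = m * 2**e, with 0.5 <= |m| < 1
    return ldexp(round(m * 4) / 4, e)  # round the mantissa to quarters

bias = 0.3271484375  # an fp32 routing-bias value (made up for the demo)

# Prior layout: the bias is cast to the dtype of the *unquantized* input
# (high precision), so it survives intact.
bias_before_refactor = bias

# After the refactor: x is already quantized, so the same line of code
# casts the bias to the fp8-like dtype and destroys its precision.
bias_after_refactor = to_low_precision(bias)

print(bias_before_refactor)  # 0.3271484375
print(bias_after_refactor)   # 0.375
```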

@robertgshaw2-redhat (Collaborator)

we discovered this cast is no longer needed because the TRT-LLM kernel now supports fp32 inputs. We originally added the cast in ef28354, but it is not needed anymore.

@ProExpertProg (Collaborator)

local-completions (pretrained=deepseek-ai/DeepSeek-V3,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   | Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.9462|±  |0.0062|
|     |       |strict-match    |     5|exact_match|   |0.9462|±  |0.0062|

@robertgshaw2-redhat (Collaborator)

LGTM, thanks for the fix

@robertgshaw2-redhat (Collaborator)

verified same accuracy with FP32 and BF16

@robertgshaw2-redhat robertgshaw2-redhat merged commit 84e436e into vllm-project:main Mar 11, 2026
54 of 60 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 11, 2026
khluu pushed a commit that referenced this pull request Mar 11, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
(cherry picked from commit 84e436e)

Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Deepseek-v3 fails on 8xB200 in v0.17.0 (including eager)
[Bug]: Deepseek R1 TRTLLM FP8 MoE produces garbage output

5 participants