
Fix ultra-long context prefilling in Qwen3 MoE GGUF models #1624

Closed
guoqingbao wants to merge 0 commits into EricLBuehler:master from guoqingbao:master

Conversation

guoqingbao (Contributor) commented Aug 9, 2025

This PR fixes a long-context prefill issue in GGUF models (including Qwen3-MoE) by updating the underlying Candle library.

A full explanation of why this change is necessary can be found here:

EricLBuehler/candle-vllm#256
EricLBuehler/candle-vllm#255

The corresponding Candle updates are available here:

https://github.com/EricLBuehler/candle/pull/94/files

github-actions bot commented Aug 9, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           63           54            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           39           22            8            9
 HTML                    1           78           64            5            9
 JavaScript              7         1397         1068          180          149
 JSON                   22          410          407            0            3
 Makefile                1            6            5            0            1
 Python                102         5660         4631          298          731
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   23          877          809           11           57
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               74         6981            0         5227         1754
 |- BASH                19          299          260           24           15
 |- JSON                11          523          523            0            0
 |- Python              14          521          434           35           52
 |- Rust                32         1320         1108           36          176
 |- TOML                 2           75           63            0           12
 (Total)                           9719         2388         5322         2009
-------------------------------------------------------------------------------
 Rust                  422       156830       138527         3993        14310
 |- Markdown           200         4348          285         3498          565
 (Total)                         161178       138812         7491        14875
===============================================================================
 Total                 666       176621       146040        12169        18412
===============================================================================

EricLBuehler (Owner) commented:
@guoqingbao thanks, I just merged EricLBuehler/candle#94.

guoqingbao (Contributor, Author) commented:
> @guoqingbao thanks, I just merged EricLBuehler/candle#94.

The bug in candle::quantize_q8_1 occurs only when the number of input rows exceeds 65,535, the gridDim.y limit for CUDA kernel launches. For Qwen3-MoE the effective token limit is 65,535 / 8, because each token is routed to 8 experts, so the kernel sees 8 rows per token.
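To illustrate the failure mode, here is a minimal CUDA sketch (not the actual Candle kernel; quantize_rows_q8_1 and its parameters are hypothetical) in which each input row is mapped to one gridDim.y slot, so row counts above 65,535 make the launch fail before any work is done:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical q8_1-style quantization kernel: one gridDim.y slot per input row.
__global__ void quantize_rows_q8_1(const float* x, int8_t* out, int row_len) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < row_len) {
        // A real q8_1 kernel also stores per-block scales and sums; omitted here.
        out[(size_t)row * row_len + col] = (int8_t)x[(size_t)row * row_len + col];
    }
}

int main() {
    int num_rows = 70000;  // e.g. a 70k-token prefill, above the 65,535 gridDim.y limit
    int row_len  = 4096;
    float* x;  int8_t* out;
    cudaMalloc(&x,   (size_t)num_rows * row_len * sizeof(float));
    cudaMalloc(&out, (size_t)num_rows * row_len);

    dim3 block(256);
    dim3 grid((row_len + block.x - 1) / block.x, num_rows);  // grid.y = 70,000 > 65,535
    quantize_rows_q8_1<<<grid, block>>>(x, out, row_len);

    // Reports "invalid configuration argument" because grid.y exceeds the limit.
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```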

guoqingbao (Contributor, Author) commented:
> @guoqingbao thanks, I just merged EricLBuehler/candle#94.

This PR should fix long-context inference (more than 65,535 tokens, or 65,535 / top_k for MoE models) for all GGUF models, since they all quantize their inputs to q8_1 via the quantize_q8_1 function.
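For reference, one common workaround is sketched below (a sketch under assumptions, not necessarily the exact fix adopted in EricLBuehler/candle#94): split the row dimension into chunks of at most 65,535 rows and launch the kernel once per chunk.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

constexpr int MAX_GRID_Y = 65535;  // hardware limit on gridDim.y

// Same hypothetical kernel as in the sketch above.
__global__ void quantize_rows_q8_1(const float* x, int8_t* out, int row_len);

// Launch the kernel once per chunk of at most MAX_GRID_Y rows, offsetting the
// pointers so blockIdx.y == 0 corresponds to the first row of each chunk.
void quantize_q8_1_chunked(const float* x, int8_t* out,
                           int num_rows, int row_len, cudaStream_t stream) {
    dim3 block(256);
    int grid_x = (row_len + block.x - 1) / block.x;
    for (int row0 = 0; row0 < num_rows; row0 += MAX_GRID_Y) {
        int rows = std::min(MAX_GRID_Y, num_rows - row0);
        dim3 grid(grid_x, rows);
        quantize_rows_q8_1<<<grid, block, 0, stream>>>(
            x + (size_t)row0 * row_len, out + (size_t)row0 * row_len, row_len);
    }
}
```

An alternative is a single launch whose kernel loops over rows in strides of gridDim.y; either way, the row count seen by any one launch dimension must stay within the 65,535 limit.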
