
Fix ultra-long context prefilling in Qwen3 MoE GGUF models #1624

Closed
guoqingbao wants to merge 0 commits into EricLBuehler:master from guoqingbao:master

Conversation

guoqingbao (Contributor) commented Aug 9, 2025

This PR fixes a long-context prefill issue in GGUF models (including Qwen3-MoE) by updating the underlying Candle library.

A full explanation of why this change is necessary can be found here:

EricLBuehler/candle-vllm#256
EricLBuehler/candle-vllm#255

The corresponding Candle updates are available here:

https://github.com/EricLBuehler/candle/pull/94/files

github-actions bot commented Aug 9, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           63           54            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           39           22            8            9
 HTML                    1           78           64            5            9
 JavaScript              7         1397         1068          180          149
 JSON                   22          410          407            0            3
 Makefile                1            6            5            0            1
 Python                102         5660         4631          298          731
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   23          877          809           11           57
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               74         6981            0         5227         1754
 |- BASH                19          299          260           24           15
 |- JSON                11          523          523            0            0
 |- Python              14          521          434           35           52
 |- Rust                32         1320         1108           36          176
 |- TOML                 2           75           63            0           12
 (Total)                           9719         2388         5322         2009
-------------------------------------------------------------------------------
 Rust                  422       156830       138527         3993        14310
 |- Markdown           200         4348          285         3498          565
 (Total)                         161178       138812         7491        14875
===============================================================================
 Total                 666       176621       146040        12169        18412
===============================================================================

EricLBuehler (Owner) commented:
@guoqingbao thanks, I just merged EricLBuehler/candle#94.

guoqingbao (Contributor, Author) commented:
> @guoqingbao thanks, I just merged EricLBuehler/candle#94.

The bug in candle::quantize_q8_1 occurs only when the number of input rows exceeds 65,535, the gridDim.y limit for CUDA kernel launches. For Qwen3-MoE the effective token limit is 65,535 / 8, because each token is routed to 8 experts, so the kernel sees 8 rows per token.
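To illustrate the failure mode, here is a minimal CUDA sketch (not the actual Candle kernel; quantize_rows_q8_1 and its parameters are hypothetical) in which each input row is mapped to one gridDim.y slot, so row counts above 65,535 make the launch fail before any work is done:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical q8_1-style quantization kernel: one gridDim.y slot per input row.
__global__ void quantize_rows_q8_1(const float* x, int8_t* out, int row_len) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < row_len) {
        // A real q8_1 kernel also stores per-block scales and sums; omitted here.
        out[(size_t)row * row_len + col] = (int8_t)x[(size_t)row * row_len + col];
    }
}

int main() {
    int num_rows = 70000;  // e.g. a 70k-token prefill, above the 65,535 gridDim.y limit
    int row_len  = 4096;
    float* x;  int8_t* out;
    cudaMalloc(&x,   (size_t)num_rows * row_len * sizeof(float));
    cudaMalloc(&out, (size_t)num_rows * row_len);

    dim3 block(256);
    dim3 grid((row_len + block.x - 1) / block.x, num_rows);  // grid.y = 70,000 > 65,535
    quantize_rows_q8_1<<<grid, block>>>(x, out, row_len);

    // Reports "invalid configuration argument" because grid.y exceeds the limit.
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```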

guoqingbao (Contributor, Author) commented:
> @guoqingbao thanks, I just merged EricLBuehler/candle#94.

This PR should fix long-context inference (more than 65,535 tokens, or 65,535 / top_k for MoE models) for all GGUF models, since they all quantize their inputs to q8_1 via the quantize_q8_1 function.
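For reference, one common workaround is sketched below (a sketch under assumptions, not necessarily the exact fix adopted in EricLBuehler/candle#94): split the row dimension into chunks of at most 65,535 rows and launch the kernel once per chunk.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

constexpr int MAX_GRID_Y = 65535;  // hardware limit on gridDim.y

// Same hypothetical kernel as in the sketch above.
__global__ void quantize_rows_q8_1(const float* x, int8_t* out, int row_len);

// Launch the kernel once per chunk of at most MAX_GRID_Y rows, offsetting the
// pointers so blockIdx.y == 0 corresponds to the first row of each chunk.
void quantize_q8_1_chunked(const float* x, int8_t* out,
                           int num_rows, int row_len, cudaStream_t stream) {
    dim3 block(256);
    int grid_x = (row_len + block.x - 1) / block.x;
    for (int row0 = 0; row0 < num_rows; row0 += MAX_GRID_Y) {
        int rows = std::min(MAX_GRID_Y, num_rows - row0);
        dim3 grid(grid_x, rows);
        quantize_rows_q8_1<<<grid, block, 0, stream>>>(
            x + (size_t)row0 * row_len, out + (size_t)row0 * row_len, row_len);
    }
}
```

An alternative is a single launch whose kernel loops over rows in strides of gridDim.y; either way, the row count seen by any one launch dimension must stay within the 65,535 limit.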
