Fix vLLM inference test failures for reasoning models #140
robballantyne wants to merge 1 commit into main
Conversation
Reasoning models (DeepSeek-R1, Qwen3, etc.) often consume all tokens on internal chain-of-thought, leaving content empty. This caused false test failures even though the model was serving correctly.

- Handle finish_reason=length as a warning instead of a failure
- Check the thinking_content field in addition to reasoning_content
- Only fail inference when ALL prompts produce no output at all

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Updates the vLLM serving pipeline inference test to avoid false failures for “reasoning” models that may return empty content while consuming tokens for internal reasoning.
Changes:
- Treat finish_reason=length with empty output as a warning (not a hard failure).
- Consider thinking_content as an alternative to reasoning_content when determining whether the model produced output.
- Only fail the inference check when no prompts produce any output signal (no passes and no warnings).
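The per-prompt and overall decisions described above can be sketched in bash roughly as follows. This is an illustrative sketch, not the actual test script: the function names, argument layout, and pass/warn/fail labels are assumptions.

```shell
# classify_result CONTENT REASONING_CONTENT THINKING_CONTENT FINISH_REASON
# Echoes pass/warn/fail for a single prompt's response fields.
classify_result() {
  local content="$1" reasoning="$2" thinking="$3" finish_reason="$4"
  if [[ -n "$content" || -n "$reasoning" || -n "$thinking" ]]; then
    echo "pass"   # any output signal counts as success
  elif [[ "$finish_reason" == "length" ]]; then
    echo "warn"   # token budget exhausted on chain-of-thought: warn, don't fail
  else
    echo "fail"   # truly empty response with no length cutoff
  fi
}

# overall_result RESULT... -> fail only when no prompt passed or warned.
overall_result() {
  local pass=0 warn=0 r
  for r in "$@"; do
    case "$r" in
      pass) pass=$((pass + 1)) ;;
      warn) warn=$((warn + 1)) ;;
    esac
  done
  if (( pass == 0 && warn == 0 )); then
    echo "fail"
  else
    echo "ok"
  fi
}
```

Under this scheme a model that returns only reasoning_content or thinking_content still passes, and one that hits the token limit with no output only warns; the check fails solely when every prompt yields nothing.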
```shell
[[ "$is_reasoning" == "true" ]] && echo " (reasoning_content)"
echo " response: ${display}"
```
The log label ("reasoning_content") can be inaccurate now that thinking_content is also treated as reasoning. When thinking_content is present (and reasoning_content is not), is_reasoning becomes true but the output still claims reasoning_content, which can mislead debugging. Consider tracking which field was used (reasoning_content vs thinking_content) and printing the correct label (or a generic reasoning/thinking_content).
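One way to implement that suggestion is to record which field actually supplied the text and print its name. A hypothetical sketch (the helper name and variables are illustrative, not the script's actual code):

```shell
# reasoning_label REASONING_CONTENT THINKING_CONTENT
# Echoes the name of whichever reasoning field carried text, preferring
# reasoning_content; echoes nothing if neither field is non-empty.
reasoning_label() {
  local reasoning_content="$1" thinking_content="$2"
  if [[ -n "$reasoning_content" ]]; then
    echo "reasoning_content"
  elif [[ -n "$thinking_content" ]]; then
    echo "thinking_content"
  fi
}

# Possible use at the log site (display is illustrative):
#   label=$(reasoning_label "$reasoning_content" "$thinking_content")
#   [[ -n "$label" ]] && echo " (${label})"
#   echo " response: ${display}"
```

This keeps the log label accurate whichever field the model populated, instead of hard-coding "reasoning_content".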