Fix vLLM inference test failures for reasoning models#140

Open
robballantyne wants to merge 1 commit into main from vllm-tests

Conversation

@robballantyne
Collaborator

Reasoning models (DeepSeek-R1, Qwen3, etc.) often consume their entire token budget on internal chain-of-thought, leaving the content field empty. This caused false test failures even though the model was serving correctly.

  • Handle finish_reason=length as a warning instead of failure
  • Check thinking_content field in addition to reasoning_content
  • Only fail inference when ALL prompts produce no output at all
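The per-prompt logic above can be sketched as a small bash helper. This is an illustrative sketch, not the repo's actual test code: the function name `classify_prompt` and its argument list are hypothetical, while the field semantics (content, reasoning_content, thinking_content, finish_reason) follow the OpenAI-style chat completion responses the PR describes.

```shell
# Hypothetical sketch of the per-prompt check (names are illustrative).
classify_prompt() {
    local content="$1" reasoning_content="$2" thinking_content="$3" finish_reason="$4"

    if [[ -n "$content" || -n "$reasoning_content" || -n "$thinking_content" ]]; then
        echo "pass"   # some output signal, even if only chain-of-thought
    elif [[ "$finish_reason" == "length" ]]; then
        echo "warn"   # token budget exhausted before any visible output
    else
        echo "fail"   # truly empty response with no length excuse
    fi
}
```

A response whose only output is thinking_content still counts as a pass, which is what keeps correctly-serving reasoning models from failing the test.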


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Updates the vLLM serving pipeline inference test to avoid false failures for “reasoning” models that may return empty content while consuming tokens for internal reasoning.

Changes:

  • Treat finish_reason=length with empty output as a warning (not a hard failure).
  • Consider thinking_content as an alternative to reasoning_content when determining whether the model produced output.
  • Only fail the inference check when no prompts produce any output signal (no passes and no warnings).
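The aggregate gate in the last bullet can be sketched as follows. This is an assumed shape, not the actual script: the function name `inference_gate` and the pass/warn counters are hypothetical.

```shell
# Hypothetical sketch of the aggregate decision: hard-fail only when no
# prompt produced any output signal (no passes and no warnings).
inference_gate() {
    local passes="$1" warns="$2"

    if (( passes == 0 && warns == 0 )); then
        echo "fail"   # nothing came back from any prompt
        return 1
    elif (( passes == 0 )); then
        echo "warn"   # served, but every prompt hit the token limit
    else
        echo "pass"
    fi
}
```

Under this scheme a run where every prompt exhausts its tokens on chain-of-thought degrades to a warning rather than a failure, which matches the PR's intent.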


Comment on lines 376 to 377
[[ "$is_reasoning" == "true" ]] && echo " (reasoning_content)"
echo " response: ${display}"

Copilot AI Mar 27, 2026


The log label ("reasoning_content") can be inaccurate now that thinking_content is also treated as reasoning. When thinking_content is present (and reasoning_content is not), is_reasoning becomes true but the output still claims reasoning_content, which can mislead debugging. Consider tracking which field was used (reasoning_content vs thinking_content) and printing the correct label (or a generic reasoning/thinking_content).
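One way to implement the reviewer's suggestion is to record which field actually supplied the reasoning text and print that field's name. A minimal sketch, with the helper name `pick_reasoning_label` being illustrative:

```shell
# Hypothetical helper: return the name of the field that carried the
# reasoning text, so the log label matches the actual source field.
pick_reasoning_label() {
    local reasoning_content="$1" thinking_content="$2"

    if [[ -n "$reasoning_content" ]]; then
        echo "reasoning_content"
    elif [[ -n "$thinking_content" ]]; then
        echo "thinking_content"
    else
        echo ""   # no reasoning field was used
    fi
}

# Usage mirroring the quoted snippet:
#   label=$(pick_reasoning_label "$reasoning_content" "$thinking_content")
#   [[ -n "$label" ]] && echo " (${label})"
#   echo " response: ${display}"
```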

