Fix vLLM inference test failures for reasoning models #140
robballantyne wants to merge 1 commit into main
Conversation
Reasoning models (DeepSeek-R1, Qwen3, etc.) often consume all tokens on internal chain-of-thought, leaving content empty. This caused false test failures even though the model was serving correctly.

- Handle finish_reason=length as a warning instead of a failure
- Check the thinking_content field in addition to reasoning_content
- Only fail inference when ALL prompts produce no output at all

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Updates the vLLM serving pipeline inference test to avoid false failures for “reasoning” models that may return empty content while consuming tokens for internal reasoning.
Changes:
- Treat finish_reason=length with empty output as a warning (not a hard failure).
- Consider thinking_content as an alternative to reasoning_content when determining whether the model produced output.
- Only fail the inference check when no prompts produce any output signal (no passes and no warnings).
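The per-prompt and overall decisions described above can be sketched in bash roughly as follows. This is an illustrative sketch, not the actual test script: the function names, argument layout, and pass/warn/fail labels are assumptions.

```shell
# classify_result CONTENT REASONING_CONTENT THINKING_CONTENT FINISH_REASON
# Echoes pass/warn/fail for a single prompt's response fields.
classify_result() {
  local content="$1" reasoning="$2" thinking="$3" finish_reason="$4"
  if [[ -n "$content" || -n "$reasoning" || -n "$thinking" ]]; then
    echo "pass"   # any output signal counts as success
  elif [[ "$finish_reason" == "length" ]]; then
    echo "warn"   # token budget exhausted on chain-of-thought: warn, don't fail
  else
    echo "fail"   # truly empty response with no length cutoff
  fi
}

# overall_result RESULT... -> fail only when no prompt passed or warned.
overall_result() {
  local pass=0 warn=0 r
  for r in "$@"; do
    case "$r" in
      pass) pass=$((pass + 1)) ;;
      warn) warn=$((warn + 1)) ;;
    esac
  done
  if (( pass == 0 && warn == 0 )); then
    echo "fail"
  else
    echo "ok"
  fi
}
```

Under this scheme a model that returns only reasoning_content or thinking_content still passes, and one that hits the token limit with no output only warns; the check fails solely when every prompt yields nothing.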
```shell
[[ "$is_reasoning" == "true" ]] && echo " (reasoning_content)"
echo " response: ${display}"
```
The log label ("reasoning_content") can be inaccurate now that thinking_content is also treated as reasoning. When thinking_content is present (and reasoning_content is not), is_reasoning becomes true but the output still claims reasoning_content, which can mislead debugging. Consider tracking which field was used (reasoning_content vs thinking_content) and printing the correct label (or a generic reasoning/thinking_content).
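One way to implement that suggestion is to record which field actually supplied the text and print its name. A hypothetical sketch (the helper name and variables are illustrative, not the script's actual code):

```shell
# reasoning_label REASONING_CONTENT THINKING_CONTENT
# Echoes the name of whichever reasoning field carried text, preferring
# reasoning_content; echoes nothing if neither field is non-empty.
reasoning_label() {
  local reasoning_content="$1" thinking_content="$2"
  if [[ -n "$reasoning_content" ]]; then
    echo "reasoning_content"
  elif [[ -n "$thinking_content" ]]; then
    echo "thinking_content"
  fi
}

# Possible use at the log site (display is illustrative):
#   label=$(reasoning_label "$reasoning_content" "$thinking_content")
#   [[ -n "$label" ]] && echo " (${label})"
#   echo " response: ${display}"
```

This keeps the log label accurate whichever field the model populated, instead of hard-coding "reasoning_content".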