feat: reasoning output responses api#5206

Open
robinnarsinghranabhat wants to merge 18 commits into llamastack:main from robinnarsinghranabhat:feat/reasoning-output-responses-api

Conversation

@robinnarsinghranabhat
Contributor

@robinnarsinghranabhat robinnarsinghranabhat commented Mar 19, 2026

What does this PR do?

Closed and re-opened #5087

Adds end-to-end reasoning output support to LlamaStack's Responses API endpoint, enabling reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate their chain-of-thought reasoning through the LlamaStack Responses API pipeline.

Test Plan

1. BFCL Evals

1.1 GPT-OSS-120B :

vllm-chat-completions (1), llamastack-chat-completions (2.1) and llamastack-responses (3.2) are now equivalent.

| Rank | Config | Overall | Base | Miss Func | Miss Param | Long Context |
|------|--------|---------|------|-----------|------------|--------------|
| 1 | vLLM 0.17 Non-Streaming Chat Completions | 49.75% | 61.50% | 54.00% | 48.00% | 35.50% |
| 1.1 | Same as above (vLLM 0.18, March 27) | 52.12% | 65.00% | 53.00% | 52.00% | 38.50% |
| 2 | Llama Stack Chat Completions, before CC reasoning merged | 46.50% | 59.00% | 48.00% | 46.50% | 32.50% |
| 2.1 | Updated LS-CC with reasoning support (merged) | 50.36% | 61.50% | 53.50% | 50.00% | 36.50% |
| 3 | Llama Stack Responses with vLLM 0.17 | 47.75% | 62.00% | 51.00% | 46.50% | 31.50% |
| 3.1 | Same as above (vLLM 0.18, March 27 pull) | 45.63% | 54.50% | 47.50% | 45.50% | 35.00% |
| 3.2 | Llama Stack Responses with reasoning propagation (THIS PR) | 50.63% | 63.00% | 53.00% | 51.00% | 35.50% |
| 3.3 | PR refactor, March 27 run on vLLM 0.18 | 50.37% | 62.00% | 51.00% | 50.00% | 38.50% |
1.2 GPT-OSS-20B

Similarly, we see that Row 1.2 and Row 2.2 are equivalent, meaning llamastack-responses itself introduces no regression.

| Rank | Config | Overall | Base | Miss Func | Miss Param | Long Context |
|------|--------|---------|------|-----------|------------|--------------|
| 1 | vLLM CC | 28.00% | 33.50% | 33.00% | 28.00% | 17.50% |
| 1.1 | vLLM CC with invalid tool-name handled client-side | 41.25% | 54.00% | 40.50% | 40.00% | 30.50% |
| 1.2 | vLLM CC-Streaming with invalid tool-name handled client-side | 42.50% | 56.00% | 39.00% | 43.00% | 32.50% |
| 2 | LS Responses | 35.75% | 41.00% | 36.50% | 43.00% | 22.50% |
| 2.1 | LS Responses with reasoning propagation (THIS PR) | 29.50% | 38.00% | 26.50% | 33.50% | 20.00% |
| 2.2 | Same as 2.1, with vLLM's invalid tools handled client-side (THIS PR) | 43.75% | 57.50% | 41.50% | 43.50% | 32.50% |

Important: assuming vLLM's corrupted tool output issue is dealt with, this PR improves gpt-oss-20b, as shown in Row 2.2. The problem is minimal on the "vllm cc streaming" path when reasoning is skipped, which is why current ls-responses (Row 2) looks better than Row 2.1.

More Details of above table

2. Manual verification with Ollama and LlamaStack->Ollama on gpt-oss:20b

How:

  1. In a multi-call scenario with client-side tool call orchestration, verified that final response.output has reasoning objects in the right order (source of truth being what they look like in ollama and openai providers directly).
  2. Verified that, when these response outputs are propagated back with a new user message or tool output for the next conversation turn (next Responses API invocation), the internal conversion to chat-completions message array is correct. This is also reflected in the added tests.
  3. Verified 1 and 2 when using llamastack+ollama with a MCP server, where tool orchestration happens on the LlamaStack server side.
```python
## Summary script for client-side tool orchestration verification ##
from openai import OpenAI
import json

client = OpenAI(base_url="http://127.0.0.1:8321/v1", api_key="fake_api_key")

tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    },
]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather like in Tokyo?"},
]

# Turn 1: most likely returns [ReasoningItem, FunctionToolCall].
# Sometimes it might not, because the model itself decides not to use reasoning.
resp = client.responses.create(
    model="ollama/gpt-oss:20b",
    input=messages,
    tools=tools,
    tool_choice="auto",
    reasoning={"effort": "medium", "summary": "detailed"},
    stream=False,
)

print("Turn 1 output types:", [item.type for item in resp.output])
# Expected: ['reasoning', 'function_call'] or ['function_call']

# Turn 2: execute tool, pass result back
new_input = list(messages) + list(resp.output)
for item in resp.output:
    if item.type == "function_call":
        new_input.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps({"temperature": 27, "condition": "humid", "unit": "celsius", "location": "Tokyo"}),
        })

resp2 = client.responses.create(
    model="ollama/gpt-oss:20b",
    input=new_input,
    tools=tools,
    tool_choice="auto",
    reasoning={"effort": "medium", "summary": "detailed"},
    stream=False,
)

print("Turn 2 output types:", [item.type for item in resp2.output])
# Expected: ['reasoning', 'message'] or ['message']
```

3. Server-side MCP tool orchestration (OpenAI vs LlamaStack comparison)

Compared end-to-end outputs produced by OpenAI vs LlamaStack->OpenAI on gpt-5-mini.
Manually compared the response output structure when using function calls and MCP tool calls in a multi-turn scenario.

```python
## Summary script for MCP server-side tool orchestration verification ##
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8321/v1", api_key="fake_api_key")

mcp_tool = {
    "type": "mcp",
    "server_label": "fetch",
    "server_url": "http://127.0.0.1:8080/sse",
    "require_approval": "never",
}

# Turn 1: greeting
resp = client.responses.create(
    model="ollama/gpt-oss:20b",
    input=[{"role": "user", "content": "Hello! What tools do you have available?"}],
    tools=[mcp_tool],
    reasoning={"effort": "medium", "summary": "detailed"},
    stream=False,
)
print("T1 output types:", [item.type for item in resp.output])

# Turn 2: trigger MCP tool calls
input_t2 = [{"role": "user", "content": "Hello!"}] + list(resp.output)
input_t2.append({"role": "user", "content": "Can you fetch https://pypi.org/project/tiktoken/ and tell me the latest version?"})

resp2 = client.responses.create(
    model="ollama/gpt-oss:20b",
    input=input_t2,
    tools=[mcp_tool],
    reasoning={"effort": "medium", "summary": "detailed"},
    stream=False,
)
print("T2 output types:", [item.type for item in resp2.output])
```

Output:

```
T1 output types: ['mcp_list_tools', 'reasoning', 'message']
T2 output types: ['mcp_list_tools', 'reasoning', 'mcp_call', 'mcp_call', 'reasoning', 'message']
```

Output structure comparison -- OpenAI vs LlamaStack on gpt-5-mini:

| Turn | OpenAI Direct | LlamaStack + OpenAI | Status |
|------|---------------|---------------------|--------|
| T1 (greeting) | [McpListTools, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, OutputMessage] | Match |
| T2 (fetch URL) | [ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] | Minor mismatch* |

* Minor pre-existing difference: LlamaStack re-emits McpListTools on every request; OpenAI only emits it on T1. This is existing MCP behavior, not related to reasoning changes.

BFCL Evals Setup

Used this guide for setting up evaluation, then tested with vllm v0.17 and ollama 0.6.1.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 19, 2026
@mergify
Contributor

mergify bot commented Mar 19, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 19, 2026
@github-actions
Contributor

github-actions bot commented Mar 19, 2026

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat: reasoning output responses api

Edit this comment to update it. It will appear in the SDK's changelogs.

llama-stack-client-node studio · conflict

Your SDK build had at least one new note diagnostic, which is a regression from the base state.

New diagnostics (3 note)
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningItem` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningSummary` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningContent` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
llama-stack-client-go studio · conflict

Your SDK build had at least one new note diagnostic, which is a regression from the base state.

New diagnostics (24 note)
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningItem` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningSummary` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningContent` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Schema/EnumHasOneMember: This enum schema has just one member, so it could be defined using [`const`](https://json-schema.org/understanding-json-schema/reference/const).
💡 Schema/EnumHasOneMember: This enum schema has just one member, so it could be defined using [`const`](https://json-schema.org/understanding-json-schema/reference/const).
💡 Schema/EnumHasOneMember: This enum schema has just one member, so it could be defined using [`const`](https://json-schema.org/understanding-json-schema/reference/const).
💡 Schema/EnumHasOneMember: This enum schema has just one member, so it could be defined using [`const`](https://json-schema.org/understanding-json-schema/reference/const).
💡 Schema/EnumHasOneMember: This enum schema has just one member, so it could be defined using [`const`](https://json-schema.org/understanding-json-schema/reference/const).
💡 Schema/EnumHasOneMember: This enum schema has just one member, so it could be defined using [`const`](https://json-schema.org/understanding-json-schema/reference/const).
💡 Schema/EnumHasOneMember: This enum schema has just one member, so it could be defined using [`const`](https://json-schema.org/understanding-json-schema/reference/const).
llama-stack-client-python studio · code · diff

Your SDK build had at least one "warning" diagnostic, but this did not represent a regression.
generate ⚠️ build ✅ lint ✅ test ✅

pip install https://pkg.stainless.com/s/llama-stack-client-python/5208485475b01560bd12206361b630543f3db4d8/llama_stack_client-0.6.1a1-py3-none-any.whl
New diagnostics (3 note)
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningItem` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningSummary` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningContent` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
llama-stack-client-openapi studio · code · diff

Your SDK build had at least one "warning" diagnostic, but this did not represent a regression.
generate ⚠️

New diagnostics (3 note)
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningItem` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningSummary` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.
💡 Model/Recommended: `#/components/schemas/OpenAIResponseOutputMessageReasoningContent` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/responses`.

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-03-28 07:00:00 UTC

@robinnarsinghranabhat robinnarsinghranabhat force-pushed the feat/reasoning-output-responses-api branch from 776cb1f to 8ad6647 Compare March 19, 2026 03:20
@mergify mergify bot removed the needs-rebase label Mar 19, 2026
@github-actions
Contributor

github-actions bot commented Mar 19, 2026

Recordings committed successfully

Recordings from the integration tests have been committed to this PR.

View commit workflow

@robinnarsinghranabhat robinnarsinghranabhat force-pushed the feat/reasoning-output-responses-api branch from 8ad6647 to a3337d1 Compare March 19, 2026 03:30
Contributor

@jwm4 jwm4 left a comment

I read through all the code changes and they all look good to me. We still needed an actual maintainer review, of course.

Collaborator

@cdoern cdoern left a comment

please use the Ollama-reasoning suite to add integration tests for this see 87dc40b#diff-40d732a0defb244aec12e21fbd9cd387cbf212732f269549475db8de3877480c for more details on how I added some for the inference API previously.

@robinnarsinghranabhat
Contributor Author

please use the Ollama-reasoning suite to add integration tests for this see 87dc40b#diff-40d732a0defb244aec12e21fbd9cd387cbf212732f269549475db8de3877480c for more details on how I added some for the inference API previously.

I added tests to the ollama-reasoning suite with llamastack client, as you had done.

I had only tested with the openai client. It looks like the llamastack client deserializes response.output incorrectly: items come back as OutputOpenAIResponseMessageOutput instead of ResponseReasoningItem, which breaks assertions. I then switched to using openai_client in the tests, but it is still failing. Is it because cached outputs from the earlier llamastack client runs are being used in the tests?

Sorry, I don't understand llamastack's CI that well.

Collaborator

@mattf mattf left a comment

@robinnarsinghranabhat please directly include details of how you ran bfcl

Collaborator

@mattf mattf left a comment

given reasoning is not part of the chat completions api, but reasoning is part of the responses api, and we implement our responses api atop our chat completions api -

we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api.

  1. pick a name: magic_toc_tokens
  2. require chat providers to populate magic_toc_tokens when appropriate
  3. detect the magic_toc_tokens field in the responses impl and convert it to Reasoning output
  4. ensure we do not leak magic_toc_tokens to users

(1) is going to become an implementation gap between providers, e.g. how do you get cot tokens from the openai provider's chat api? we'll probably have to move to responses.
(1) is going to be hard work on provider adapters, e.g. vllm configured w/o a reasoning parser will return model specific cot tokens in the response or different versions of vllm will put the reasoning content in different response fields

this pr puts the adapter specific reasoning parsing into the responses adapter and declares only a partial implementation. if it were to complete the implementation it would have a web of provider specific code in the responses impl and will become unmaintainable.

as written, this pr puts us on an unmaintainable path.

some other ideas -

  • add stack_chat_completions_with_reasoning to the Inference contract, for internal use only by the responses implementation
  • add responses to the Inference contract for providers who can implement it. care will be needed here to ensure the provider responses loop does not execute any tools and no credentials are passed along.

Collaborator

@cdoern please confirm that this throws an error for novel outputs. For instance, is an error raised if the spec says fields x, y, z are to be returned and we return x, y, z & p?

```python
class OpenAIChatCompletionResponseMessage(BaseModel):
    """An assistant message returned in a chat completion response."""

    model_config = ConfigDict(extra="allow")
```
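For reference, `extra="allow"` does not raise on novel fields; Pydantic v2 accepts and stores them as extra attributes. A standalone demo (not llama-stack code):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class Message(BaseModel):
    model_config = ConfigDict(extra="allow")
    content: str

# Novel fields like `reasoning` or `p` are accepted silently, not rejected:
m = Message(content="hi", reasoning="chain of thought", p=1)
assert m.reasoning == "chain of thought"

class StrictMessage(BaseModel):
    model_config = ConfigDict(extra="forbid")
    content: str

# With extra="forbid", the same input raises a ValidationError:
try:
    StrictMessage(content="hi", p=1)
except ValidationError:
    print("forbid rejects novel fields")
```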
Collaborator

why do we need this?

Contributor Author

Here

To build the next_turn_messages for the next round of pinging chat-completions.

@robinnarsinghranabhat
Contributor Author

robinnarsinghranabhat commented Mar 21, 2026

@mattf Really appreciate this thorough review !

we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api.

I agree with this as a long-term plan. As long as we stick to chat-completions, I see a need to standardize message conversion between responses and chat-completions as well. Otherwise, llamastack responses would remain inferior to using openai's responses directly.


this pr puts the adapter specific reasoning parsing into the responses adapter and declares only a partial implementation. if it were to complete the implementation it would have a web of provider specific code in the responses impl and will become unmaintainable.
as written, this pr puts us on an unmaintainable path.

But I notice that the current responses adapter already expects a field called reasoning on chat completion streaming chunks, and accumulates it.

I inferred this as a contract where the provider-specific streaming-cc implementation is responsible for populating a field named reasoning in the chunk.


@robinnarsinghranabhat please directly include details of how you ran bfcl

Updated the description.

@mattf
Collaborator

mattf commented Mar 21, 2026

@mattf Really appreciate this thorough review !

we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api.

I agree with this as a long-term plan. As long as we stick to chat-completions, I see a need to standardize message conversion between responses and chat-completions as well. Otherwise, llamastack responses would remain inferior to using openai's responses directly.

this pr puts the adapter specific reasoning parsing into the responses adapter and declares only a partial implementation. if it were to complete the implementation it would have a web of provider specific code in the responses impl and will become unmaintainable.
as written, this pr puts us on an unmaintainable path.

But I notice that the current responses adapter already expects a field called reasoning on chat completion streaming chunks, and accumulates it.

I inferred this as a contract where the provider-specific streaming-cc implementation is responsible for populating a field named reasoning in the chunk.

good catch. i'd call that an oops that needs to be resolved. as implemented it means users will silently get different levels of service.

@robinnarsinghranabhat please directly include details of how you ran bfcl

Updated the description.

@robinnarsinghranabhat
Contributor Author

@mattf Maybe it was a mistake, but isn't the current implementation implicitly doing what you suggested, with the chosen name being reasoning (although not documented)?

we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api.

  1. pick a name: magic_toc_tokens
  2. require chat providers to populate magic_toc_tokens when appropriate
  3. detect the magic_toc_tokens field in the responses impl and convert it to Reasoning output
  4. ensure we do not leak magic_toc_tokens to users

I am not sure if this PR should be closed then. Any ideas on where we stand with prioritizing an internal standard to enable llama-stack responses to support reasoning? Given some guidance (I'm a newbie), I am happy to take it on :)

Collaborator

@cdoern cdoern left a comment

@robinnarsinghranabhat, take a look at the tests labeled Stainless SDK Builds / run-integration-tests / Integration Tests. These tests generate a NEW client based on your changes and run the entire suite. If the issue is the client, it is OK for some of the regular integration tests to fail as long as their Stainless equivalents pass.

Collaborator

@cdoern cdoern left a comment

I agree with @mattf; basically this impl is backwards. The API should not have specific handling per-provider; the API needs contracts that each provider implements differently.

specific issues:

  1. model_config = ConfigDict(extra="allow") on OpenAIAssistantMessageParam (src/llama_stack_api/inference/models.py:636) This opens up the assistant message model to accept any arbitrary field, which is a sledgehammer approach just to smuggle a reasoning field through. It bypasses Pydantic validation and could let malformed data through silently.

  2. Reasoning is stuffed into Chat Completions types via setattr/getattr hacks (streaming.py:693, utils.py:321-325)
    The code does things like msg.reasoning = reasoning and getattr(choice.message, "reasoning", None) on types that don't have a reasoning field. This only works because of the extra="allow" hack above. It's an untyped, informal contract — nothing enforces it, nothing documents it at the type level.

  3. Provider-specific reasoning parsing lives in the Responses layer (streaming.py:578-590) This means each new provider's quirks will need to be handled here, in the wrong layer.

  4. _get_preceding_reasoning is fragile (utils.py:424-433) It only looks at the single item immediately before the current one. If the input ordering ever changes, or if there are multiple reasoning items, this silently drops reasoning content.

  5. ChatCompletionResult.reasoning is a flat str | None (types.py:71)
    Reasoning content from providers can be structured (multiple segments, summaries, etc.), but this flattens it all into a single concatenated string, losing structure.

  6. Partial provider coverage: The PR only handles Ollama/vLLM-style reasoning. OpenAI's own reasoning, Gemini's tags, and other providers are explicitly not covered, making this a partial implementation that will need the same pattern repeated per-provider.

These all tie back to Matt's core point: reasoning extraction should be a provider-level concern with a well-typed internal contract, not ad-hoc field smuggling through the Responses layer.

@mattf
Collaborator

mattf commented Mar 23, 2026

reasoning can be enabled via -

  • POST /v1/responses: include=["reasoning.encrypted_content"]
  • POST /v1/responses: reasoning.effort / reasoning.summary (both optional)

and the reasoning comes back to the user as an output message w/ a required(!) summary field and optional content / encrypted_content fields

a reasonable and simple path forward -

  1. treat reasoning.encrypted_content as unsupported (400 response)
  2. let output summary be optional or maybe always ""
  3. when reasoning is requested have responses impl call a new openai_chat_completions_with_reasoning
  4. implement openai_chat_completions_with_reasoning for the providers you care about
  5. let the other providers get a default openai_chat_completions_with_reasoning that raises a not implemented / value error

someone will come along later and fill out the provider implementations (3). in the meantime, we give users confidence that we're doing what they request.
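Steps 3-5 could be sketched roughly as follows (class names and the return shape are hypothetical; only the method name openai_chat_completions_with_reasoning comes from the proposal above):

```python
import asyncio

class InferenceProvider:
    """Hypothetical base inference contract."""

    async def openai_chat_completions_with_reasoning(self, params: dict) -> dict:
        # Step 5: the default that unsupported providers inherit.
        raise NotImplementedError(
            f"{type(self).__name__} does not support reasoning output"
        )

class OllamaProvider(InferenceProvider):
    async def openai_chat_completions_with_reasoning(self, params: dict) -> dict:
        # Step 4: a real adapter would call the provider here and map its
        # reasoning field into the returned message.
        return {"content": "4", "reasoning_content": "2 + 2 = 4"}

async def main() -> None:
    # Step 3: the responses impl calls the new method when reasoning is requested.
    ok = await OllamaProvider().openai_chat_completions_with_reasoning({})
    assert ok["reasoning_content"]
    try:
        await InferenceProvider().openai_chat_completions_with_reasoning({})
    except NotImplementedError as e:
        print("unsupported:", e)

asyncio.run(main())
```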

@robinnarsinghranabhat
Contributor Author

robinnarsinghranabhat commented Mar 24, 2026

@mattf @cdoern Made some changes while trying to keep things minimal and not break anything. This is WIP, tested with Ollama for now.

Main Ideas

  • OpenAIChoiceDelta already defines a typed reasoning_content field, and the VertexAIInferenceAdapter provider populates it when sending OpenAIChatCompletionChunk to the Responses layer. Thus, to stay consistent for now, I treat reasoning_content as the standard field the Responses layer consumes.

  • openai_chat_completions_with_reasoning is called when the reasoning flag is set and not "none". Providers that support reasoning implement it. For unsupported providers, llamastack raises a clear error.

  • Added a summary field to OpenAIResponseReasoning (no-op for now)

    Example Flow (Ollama)

    1. User → POST /v1/responses with reasoning={effort:"medium"} and conversation history as input.
    2. Responses layer converts input to CC messages via convert_response_input_to_chat_messages. Any ReasoningItem from previous turns becomes reasoning_content on OpenAIAssistantMessageParam.
    3. Responses layer calls ollama.openai_chat_completions_with_reasoning instead of regular openai_chat_completion.
    4. Ollama adapter (sending outbound request): responsible for adjusting CC request params and messages to match what Ollama's CC endpoint expects. It renames reasoning_content → reasoning on assistant messages, and adjusts reasoning_effort via _prepare_reasoning_params (e.g. defaults to "none" when not set, to prevent Ollama's own default of "medium"). It then calls the regular openai_chat_completion with the modified params.
    5. Ollama server responds with streaming chunks containing reasoning='...'.
    6. Ollama adapter (handling inbound streaming chunks): responsible for mapping chunk.delta.reasoning to the standardized chunk.delta.reasoning_content, which it propagates to the Responses layer.
    7. Responses layer reads reasoning_content, builds ReasoningItem for the output.
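Steps 4 and 6 above (the two renames in the Ollama adapter) can be illustrated with a small sketch; plain dicts stand in for the real request and chunk types, and the field names follow the description above:

```python
def prepare_outbound_messages(messages: list[dict]) -> list[dict]:
    """Step 4: rename reasoning_content -> reasoning for Ollama's CC endpoint."""
    out = []
    for m in messages:
        m = dict(m)  # don't mutate the caller's message
        if m.get("role") == "assistant" and "reasoning_content" in m:
            m["reasoning"] = m.pop("reasoning_content")
        out.append(m)
    return out

def normalize_inbound_delta(delta: dict) -> dict:
    """Step 6: map Ollama's reasoning to the standardized reasoning_content."""
    delta = dict(delta)
    if "reasoning" in delta:
        delta["reasoning_content"] = delta.pop("reasoning")
    return delta

msgs = [{"role": "assistant", "content": "hi", "reasoning_content": "cot"}]
assert "reasoning" in prepare_outbound_messages(msgs)[0]
assert normalize_inbound_delta({"reasoning": "cot"})["reasoning_content"] == "cot"
```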

A Confusing Inconsistency:

OpenAIMixin.openai_chat_completion declares OpenAIChatCompletionChunk (LlamaStack's type) as its return type, but actually returns openai.types.chat.ChatCompletionChunk (the OpenAI SDK's type), which is what the Responses layer ends up consuming. Meanwhile, VertexAI's openai_chat_completion does return LlamaStack's type and populates OpenAIChoiceDelta.reasoning_content directly. As a future direction, should we ensure the Responses layer consistently receives LlamaStack types?

Collaborator

@cdoern cdoern left a comment

this is moving in the right direction! some questions:

@robinnarsinghranabhat robinnarsinghranabhat force-pushed the feat/reasoning-output-responses-api branch from 3f03d95 to 5195cba Compare March 25, 2026 02:42
@robinnarsinghranabhat robinnarsinghranabhat force-pushed the feat/reasoning-output-responses-api branch from 0b10222 to e166515 Compare March 27, 2026 04:08
@github-actions
Contributor

github-actions bot commented Mar 27, 2026

Recording workflow finished with status: failure

Providers: gpt

Recording attempt finished. Check the workflow run for details.

View workflow run

Fork PR: Recordings will be committed if you have "Allow edits from maintainers" enabled.

@robinnarsinghranabhat
Contributor Author

  • For the streaming chunks sent from the provider to the responses layer, the existing OpenAIChoiceDelta already has a typed reasoning_content field. It is utilized by VertexAIInferenceAdapter as well (as in my earlier comments).

I believe letting OpenAIChoiceDelta have that field was an oops as well. So, to keep this PR minimal, I would like to do a follow-up PR that sends something like an InternalOpenAIChatCompletionChunkWithReasoning on the reasoning path. The responses layer would then check the types of these instances to build the ReasoningItem.
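A minimal sketch of that follow-up idea, assuming isinstance-based dispatch (all types here are stand-ins, not llama-stack code; only the name InternalOpenAIChatCompletionChunkWithReasoning comes from the comment above):

```python
from dataclasses import dataclass

@dataclass
class ChatCompletionChunk:
    """Stand-in for the regular chunk type."""
    content: str = ""

@dataclass
class InternalOpenAIChatCompletionChunkWithReasoning(ChatCompletionChunk):
    """Internal-only chunk carrying reasoning as a declared field."""
    reasoning_content: str = ""

def accumulate(chunks):
    """Responses layer: typed access via isinstance, no getattr needed."""
    text, reasoning = "", ""
    for c in chunks:
        text += c.content
        if isinstance(c, InternalOpenAIChatCompletionChunkWithReasoning):
            reasoning += c.reasoning_content
    return text, reasoning

chunks = [
    InternalOpenAIChatCompletionChunkWithReasoning(reasoning_content="think"),
    ChatCompletionChunk(content="answer"),
]
assert accumulate(chunks) == ("answer", "think")
```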

Collaborator

@mattf mattf left a comment

thank you for the progress on this.

there are a lot of type mismatches here. it's not your fault. we're excluding this entire module from mypy, see https://github.com/llamastack/llama-stack/blob/main/pyproject.toml#L494

ptal #5342

Comment on lines +167 to +171
```python
else:
    # Non-streaming reasoning is not tested — the Responses
    # layer always uses stream=True (streaming.py:518).
    raise NotImplementedError("Non-streaming reasoning is not yet supported for vLLM")
```

Collaborator

why is this not implemented when stream=False or stream isn't present?

Contributor Author

The responses layer hardcodes "streaming mode" when it pings cc, probably essential for acting as a non-blocking async server.

So I didn't scope that within this PR.

Contributor Author

I will clarify these comments and logs. Currently it sounds like vllm responses doesn't support streaming.

```python
# Handle reasoning content if present.
# When openai_chat_completions_with_reasoning is used, the provider
# maps reasoning outputs it receives to the `reasoning_content` on the delta.
if getattr(chunk_choice.delta, "reasoning_content", None):
```
Collaborator

when the types are accurate you won't need to getattr

Contributor Author

I would like to clean this up in a follow-up as well; only trying to limit the scope at the moment.

In case I wasn't clear: only VertexAIInferenceAdapter correctly sends LlamaStack's OpenAIChatCompletionChunk, while the other providers still propagate the openai.types.chat.ChatCompletionChunk they received directly.

That's why I mentioned this:

As a future direction, should we ensure the Responses layer consistently receives LlamaStack-provided types?

@cdoern
Collaborator

cdoern commented Mar 27, 2026

those are bedrock recordings ^ the rest you need to do locally

@cdoern
Collaborator

cdoern commented Mar 27, 2026

GPT recordings fail due to legitimate issues still with this PR

@mergify
Contributor

mergify bot commented Mar 28, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 28, 2026
@robinnarsinghranabhat
Contributor Author

GPT recordings fail due to legitimate issues still with this PR

Thanks @cdoern. The integration tests were a struggle initially, but they surfaced crucial bugs I could patch.

Now, toward hardening the tests: @s-akhtar-baig I am puzzled how test_reasoning_basic_streaming passed the first time, because, the way the vLLM server is initialized in our CI, it should never emit response.reasoning_text.delta events.

Locally, I needed to pass an additional --reasoning-parser flag to the vLLM (v0.18) server for that. Or maybe CI's vLLM version behaves that way out of the box; if not, I guess we need to update the server-initialization YAML.

rranabha and others added 16 commits March 27, 2026 23:03
Enable reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate
reasoning content through the Responses API pipeline:

- Accumulate reasoning text from streaming chat completion chunks into
  ChatCompletionResult.reasoning field (types.py)
- Construct OpenAIResponseOutputMessageReasoningItem from accumulated
  reasoning, using the content field, independent of response type (streaming.py)
- Propagate reasoning on assistant messages in multi-turn server-side
  tool loops via _separate_tool_calls (streaming.py)
- Consume reasoning items from input via look-back: skip reasoning items
  during CC conversion, attach text to the next assistant message for
  FunctionToolCall, McpCall, and ResponseMessage (utils.py)
- Add ReasoningItem to OpenAIResponseOutput union (openai_responses.py)
- Add tests for reasoning look-back in input conversion
  - Add test_reasoning_non_streaming: verifies ReasoningItem present in the output
  - Add test_reasoning_multi_turn_passthrough: verifies reasoning survives round-trip
  - Wire both tests into ollama-reasoning suite
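The look-back consumption described in the bullets above can be sketched roughly as follows (dict-shaped items for brevity; attach_reasoning_lookback is an illustrative name, not the actual utils.py helper):

```python
def attach_reasoning_lookback(items):
    """Skip reasoning items during CC conversion; attach their text to
    the next assistant message as reasoning_content."""
    messages, pending = [], None
    for item in items:
        if item["type"] == "reasoning":
            pending = item["content"]  # remember it, emit nothing yet
            continue
        msg = {"role": item.get("role", "assistant"), "content": item["content"]}
        if pending is not None and msg["role"] == "assistant":
            msg["reasoning_content"] = pending
            pending = None
        messages.append(msg)
    return messages
```

The key invariant is that reasoning items never become chat-completion messages themselves; they only decorate the assistant turn that follows them.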
`response.output` from llama_stack_client is different (and incorrect) compared to
openai client (e.g. reasoning items become generic
OutputOpenAIResponseMessageOutput instead of typed ResponseReasoningItem),
causing assertion failures.
…th_reasoning

Add openai_chat_completions_with_reasoning to InferenceProvider contract.
Each provider that supports reasoning implements it and owns its own
mapping logic — both for request params and response chunks.

Provider implementations:
- vLLM/Ollama: map between LlamaStack's 'reasoning_content' and the
  provider's CC field name (which may vary across versions). Each
  provider adjusts CC request params via _prepare_reasoning_params
  (e.g. Ollama defaults reasoning_effort="none" when not requested).
- OpenAI: raises NotImplementedError (reasoning only via native Responses API)
- All others: router catches unsupported providers and raises clear error

Responses layer changes:
- Calls new method when reasoning.effort is set and not "none"
- Reads typed reasoning_content field instead of hasattr/getattr hacks
- Returns 400 for reasoning.encrypted_content in include
- Add reasoning_content as typed field on OpenAIChatCompletionResponseMessage
  and OpenAIAssistantMessageParam

Tests:
- Reject reasoning.encrypted_content in include
- reasoning effort="none" uses regular CC path
…ying public API types

Per maintainer feedback: reasoning_content should not be added to public
CC spec types (OpenAIAssistantMessageParam, OpenAIChatCompletionResponseMessage)
as it deviates from the OpenAI API spec.

Instead, create AssistantMessageWithReasoning (internal type in responses/types.py)
that extends OpenAIAssistantMessageParam with reasoning_content. Providers check
isinstance(msg, AssistantMessageWithReasoning) to detect and map reasoning.
Reasoning flows from ChatCompletionResult directly to _separate_tool_calls,
avoiding the need to modify public response message types.
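A minimal sketch of that internal-subtype pattern, with toy dataclasses standing in for the real llama-stack types (to_provider_message and its provider_field parameter are assumptions for illustration, since the commit notes each provider owns its own field-name mapping):

```python
from dataclasses import dataclass


@dataclass
class AssistantMessageParam:
    """Stand-in for the public OpenAIAssistantMessageParam."""
    content: str
    role: str = "assistant"


@dataclass
class AssistantMessageWithReasoning(AssistantMessageParam):
    """Internal type: public spec type plus reasoning_content."""
    reasoning_content: str = ""


def to_provider_message(msg, provider_field="reasoning_content"):
    """Provider-side mapping: only the internal subtype carries
    reasoning, and each provider picks its own wire field name."""
    out = {"role": msg.role, "content": msg.content}
    if isinstance(msg, AssistantMessageWithReasoning) and msg.reasoning_content:
        out[provider_field] = msg.reasoning_content
    return out
```

This keeps the public spec types untouched: any code that only knows AssistantMessageParam keeps working, while reasoning-aware providers opt in via isinstance.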
  - Move reasoning fallback from router to Responses layer so it's
    testable in unit tests. When provider raises NotImplementedError,
    log critical warning and fall back to regular CC instead of crashing.
  - Add openai_chat_completions_with_reasoning to Bedrock adapter
  - Add tests: supported provider uses reasoning path, unsupported
    provider falls back gracefully
  - Router now passes through directly — Responses layer owns the
    fallback logic
Current LlamaStack client deserializes response output as dicts, not
typed objects. Use _get_attr helper for dict-compatible assertions
so tests work with both current client (dicts) and OpenAI client
(typed objects). Remove stray pdb breakpoint.
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…breaking fallback

The router mutates params.model (strips provider prefix like openai/).
When reasoning fallback triggers, the mutated params can't be routed
again. Pass a copy to the reasoning method so the original stays intact.
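A toy reproduction of the mutation bug and the copy fix described above (Params, route, and chat_with_reasoning_fallback are all illustrative names, not the real router API):

```python
from copy import deepcopy
from dataclasses import dataclass


@dataclass
class Params:
    model: str


def route(params):
    # Simulates the router stripping the provider prefix in place,
    # e.g. "openai/gpt-4o" -> "gpt-4o"
    params.model = params.model.split("/", 1)[-1]
    return params.model


def chat_with_reasoning_fallback(params):
    try:
        route(deepcopy(params))  # reasoning path: routed on a copy
        raise NotImplementedError("provider lacks reasoning support")
    except NotImplementedError:
        # The original params still carry the provider prefix,
        # so routing them a second time works.
        return route(params)
```

Without the deepcopy, the first route() call would strip the prefix from the shared object and the fallback route() would receive an already-mutated model string.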
@robinnarsinghranabhat robinnarsinghranabhat force-pushed the feat/reasoning-output-responses-api branch from e085ec1 to 01bba9b on March 28, 2026 03:12
@mergify mergify bot removed the needs-rebase label Mar 28, 2026
When provider doesn't support reasoning and falls back to regular CC,
clear reasoning_effort from params — providers like OpenAI's gpt-4o
reject unrecognized reasoning_effort parameter with 400 error.
@robinnarsinghranabhat robinnarsinghranabhat force-pushed the feat/reasoning-output-responses-api branch from 92f4fd4 to ca82390 on March 28, 2026 06:02
Labels

CLA Signed This label is managed by the Meta Open Source bot.


6 participants