feat: reasoning output responses api #5206
robinnarsinghranabhat wants to merge 18 commits into llamastack:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
✱ Stainless preview builds
This PR will update the SDKs below. Edit this comment to update it. It will appear in the SDK's changelogs.
✅ llama-stack-client-node studio · conflict
✅ llama-stack-client-go studio · conflict
✅ llama-stack-client-python studio · code · diff
✅ llama-stack-client-openapi studio · code · diff
This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Force-pushed 776cb1f to 8ad6647
✅ Recordings committed successfully. Recordings from the integration tests have been committed to this PR.
Force-pushed 8ad6647 to a3337d1
cdoern left a comment
please use the Ollama-reasoning suite to add integration tests for this; see 87dc40b#diff-40d732a0defb244aec12e21fbd9cd387cbf212732f269549475db8de3877480c for more details on how I added some for the inference API previously.
I added tests to the ollama-reasoning suite. As I had only tested with … Sorry, I don't understand llamastack CI better.
mattf left a comment
@robinnarsinghranabhat please directly include details of how you ran bfcl
mattf left a comment
given reasoning is not part of the chat completions api, but reasoning is part of the responses api, and we implement our responses api atop our chat completions api -
we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api.
1. pick a name: magic_toc_tokens
2. require chat providers to populate magic_toc_tokens when appropriate
3. detect the magic_toc_tokens field in the responses impl and convert it to Reasoning output
4. ensure we do not leak magic_toc_tokens to users
(2) is going to become an implementation gap between providers, e.g. how do you get cot tokens from the openai provider's chat api? we'll probably have to move to responses.
(2) is also going to be hard work on provider adapters, e.g. vllm configured w/o a reasoning parser will return model specific cot tokens in the response, or different versions of vllm will put the reasoning content in different response fields.
this pr puts the adapter specific reasoning parsing into the responses adapter and declares only a partial implementation. if it were to complete the implementation it would have a web of provider specific code in the responses impl and will become unmaintainable.
as written, this pr puts us on an unmaintainable path.
some other ideas -
- add stack_chat_completions_with_reasoning to the Inference contract, for internal use only by the responses implementation
- add responses to the Inference contract for providers who can implement it. care will be needed here to ensure the provider responses loop does not execute any tools and no credentials are passed along.
docs/docs/api-openai/conformance.mdx (outdated)
@cdoern please confirm that this throws an error for novel outputs. for instance, is an error raised if the spec says fields x, y, z are to be returned and we return x, y, z & p?
```python
class OpenAIChatCompletionResponseMessage(BaseModel):
    """An assistant message returned in a chat completion response."""
    ...
    model_config = ConfigDict(extra="allow")
```
To build the next_turn_messages for the next round of pinging chat-completions.
@mattf Really appreciate this thorough review!
I agree with this as a long-term plan. As long as we stick to chat-completions, I see a need to standardize message conversion between responses and chat-completions as well.
But I notice the current responses adapter already expects a certain field; I inferred this as a contract where the provider-specific streaming-CC implementation is responsible for populating that field.
Updated the description.
good catch. i'd call that an oops that needs to be resolved. as implemented it means users will silently get different levels of service.
@mattf Maybe it was a mistake, but isn't the current implementation implicitly doing what you suggested, with a name of choice being …?
I am not sure if this PR should be closed then. Any ideas on where we are with prioritization on defining a …?
cdoern left a comment
@robinnarsinghranabhat, take a look at Stainless SDK Builds / run-integration-tests / Integration Tests. These tests generate a NEW client based on your changes and run the entire suite. It is ok if some of the regular integration tests fail, as long as their equivalent from Stainless passes, if the issue is the client.
cdoern left a comment
I agree with @mattf, basically this impl is backwards: the API should not have specific handling per-provider, the API needs to have contracts that each provider implements differently.
specific issues:
- model_config = ConfigDict(extra="allow") on OpenAIAssistantMessageParam (src/llama_stack_api/inference/models.py:636). This opens up the assistant message model to accept any arbitrary field, which is a sledgehammer approach just to smuggle a reasoning field through. It bypasses Pydantic validation and could let malformed data through silently.
- Reasoning is stuffed into Chat Completions types via setattr/getattr hacks (streaming.py:693, utils.py:321-325). The code does things like msg.reasoning = reasoning and getattr(choice.message, "reasoning", None) on types that don't have a reasoning field. This only works because of the extra="allow" hack above. It's an untyped, informal contract: nothing enforces it, nothing documents it at the type level.
- Provider-specific reasoning parsing lives in the Responses layer (streaming.py:578-590). This means each new provider's quirks will need to be handled here, in the wrong layer.
- _get_preceding_reasoning is fragile (utils.py:424-433). It only looks at the single item immediately before the current one. If the input ordering ever changes, or if there are multiple reasoning items, this silently drops reasoning content.
- ChatCompletionResult.reasoning is a flat str | None (types.py:71). Reasoning content from providers can be structured (multiple segments, summaries, etc.), but this flattens it all into a single concatenated string, losing structure.
- Partial provider coverage: the PR only handles Ollama/vLLM-style reasoning. OpenAI's own reasoning, Gemini's tags, and other providers are explicitly not covered, making this a partial implementation that will need the same pattern repeated per-provider.
These all tie back to Matt's core point: reasoning extraction should be a provider-level concern with a well-typed internal contract, not ad-hoc field smuggling through the Responses layer.
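On the _get_preceding_reasoning fragility point above: a sketch of a look-back that scans all contiguous reasoning items instead of only the single preceding one. This is a hypothetical helper over plain dicts, not the PR's actual code or types.

```python
def collect_preceding_reasoning(items: list[dict], idx: int) -> str:
    """Walk backwards from items[idx] and gather every contiguous
    reasoning item, so multiple reasoning segments are not silently
    dropped and a single non-reasoning item stops the scan."""
    texts: list[str] = []
    j = idx - 1
    while j >= 0 and items[j].get("type") == "reasoning":
        texts.append(items[j]["text"])
        j -= 1
    # reversed() restores original input order before joining
    return "".join(reversed(texts))
```

Anchoring the scan on item type rather than position also makes the helper robust to reorderings that insert non-reasoning items earlier in the input.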
reasoning can be enabled via …
and the reasoning comes back to the user as an output message w/ a required(!) summary field and optional content / encrypted_content fields.
a reasonable and simple path forward: …
someone will come along later and fill out the provider implementations (3). in the meantime, we give users confidence that we're doing what they request.
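The reasoning output item described above (required summary, optional content / encrypted_content) has roughly this shape. A hand-written illustration of the wire format; the id and text values are made up.

```python
# Illustrative reasoning output item: `summary` is required (may be empty),
# `content` and `encrypted_content` are optional.
reasoning_item = {
    "type": "reasoning",
    "id": "rs_example",  # hypothetical id
    "summary": [],       # required field, empty when no summary was produced
    "content": [
        {"type": "reasoning_text", "text": "model chain-of-thought"}
    ],
}
```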
@mattf @cdoern Made some changes while trying to keep things minimal and not break anything. This is WIP, tested with Ollama for now.
Main Ideas:
A Confusing Inconsistency:
cdoern left a comment
this is moving in the right direction! some questions:
Force-pushed 3f03d95 to 5195cba
Force-pushed 0b10222 to e166515
Recording workflow finished with status: failure
Providers: gpt
Recording attempt finished. Check the workflow run for details.
Fork PR: Recordings will be committed if you have "Allow edits from maintainers" enabled.
I believe this was an oops as well on letting …
mattf left a comment
thank you for the progress on this.
there's a lot of type mismatches here. it's not your fault. we're excluding this entire module from mypy, see https://github.com/llamastack/llama-stack/blob/main/pyproject.toml#L494
ptal #5342
```python
else:
    # Non-streaming reasoning is not tested — the Responses
    # layer always uses stream=True (streaming.py:518).
    raise NotImplementedError("Non-streaming reasoning is not yet supported for vLLM")
```
why not implemented when stream=False or isn't present?
The responses layer hardcodes "streaming mode" when it pings CC; probably essential to act as a non-blocking async server.
So, I didn't scope that within this PR.
I will clarify these comments and logs. Currently it sounds like vLLM responses doesn't support streaming.
```python
# Handle reasoning content if present.
# When openai_chat_completions_with_reasoning is used, the provider
# maps reasoning outputs it receives to the `reasoning_content` on the delta.
if getattr(chunk_choice.delta, "reasoning_content", None):
```
when the types are accurate you won't need to getattr
I would like to clean this up in a follow-up as well; only trying to limit the scope atm.
In case I wasn't clear: only VertexAIInferenceAdapter correctly sends llamastack's OpenAIChatCompletionChunk, while other providers still propagate the openai.types.chat.ChatCompletionChunk they received directly. That's why I mentioned this:
As a future direction, should we be ensuring the Responses layer consistently receives LlamaStack-provided types?
those are bedrock recordings ^ the rest you need to do locally
GPT recordings fail due to legitimate issues still with this PR
This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Thanks @cdoern. It was a struggle with integration tests initially, but they surfaced crucial bugs I could patch. Now working towards hardening the tests. @s-akhtar-baig I am puzzled how … For that, I needed to give additional …
Enable reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate reasoning content through the Responses API pipeline:
- Accumulate reasoning text from streaming chat completion chunks into the ChatCompletionResult.reasoning field (types.py)
- Construct OpenAIResponseOutputMessageReasoningItem from accumulated reasoning, using the content field, independent of response type (streaming.py)
- Propagate reasoning on assistant messages in multi-turn server-side tool loops via _separate_tool_calls (streaming.py)
- Consume reasoning items from input via look-back: skip reasoning items during CC conversion, attach text to the next assistant message for FunctionToolCall, McpCall, and ResponseMessage (utils.py)
- Add ReasoningItem to the OpenAIResponseOutput union (openai_responses.py)
- Add tests for reasoning look-back in input conversion
- Add test_reasoning_non_streaming: verifies ReasoningItem present in the output
- Add test_reasoning_multi_turn_passthrough: verifies reasoning survives round-trip
- Wire both tests into the ollama-reasoning suite
`response.output` from llama_stack_client is different (and incorrect) compared to openai client (e.g. reasoning items become generic OutputOpenAIResponseMessageOutput instead of typed ResponseReasoningItem), causing assertion failures.
…th_reasoning
Add openai_chat_completions_with_reasoning to the InferenceProvider contract. Each provider that supports reasoning implements it and owns its own mapping logic, both for request params and response chunks.
Provider implementations:
- vLLM/Ollama: map between LlamaStack's 'reasoning_content' and the provider's CC field name (which may vary across versions). Each provider adjusts CC request params via _prepare_reasoning_params (e.g. Ollama defaults reasoning_effort="none" when not requested).
- OpenAI: raises NotImplementedError (reasoning only via native Responses API)
- All others: router catches unsupported providers and raises a clear error
Responses layer changes:
- Calls the new method when reasoning.effort is set and not "none"
- Reads the typed reasoning_content field instead of hasattr/getattr hacks
- Returns 400 for reasoning.encrypted_content in include
- Add reasoning_content as a typed field on OpenAIChatCompletionResponseMessage and OpenAIAssistantMessageParam
Tests:
- Reject reasoning.encrypted_content in include
- reasoning effort="none" uses regular CC path
…ying public API types Per maintainer feedback: reasoning_content should not be added to public CC spec types (OpenAIAssistantMessageParam, OpenAIChatCompletionResponseMessage) as it deviates from the OpenAI API spec. Instead, create AssistantMessageWithReasoning (internal type in responses/types.py) that extends OpenAIAssistantMessageParam with reasoning_content. Providers check isinstance(msg, AssistantMessageWithReasoning) to detect and map reasoning. Reasoning flows from ChatCompletionResult directly to _separate_tool_calls, avoiding the need to modify public response message types.
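The internal-subtype approach described in this commit might look roughly like the sketch below. Plain classes stand in for the real Pydantic models, and `to_provider_wire` is a hypothetical adapter helper; the provider's wire field name varies in practice.

```python
class OpenAIAssistantMessageParam:
    """Stand-in for the public spec type, which stays unchanged."""
    def __init__(self, content: str):
        self.role = "assistant"
        self.content = content


class AssistantMessageWithReasoning(OpenAIAssistantMessageParam):
    """Internal-only subtype carrying reasoning_content; it never
    appears in the public chat-completions API surface."""
    def __init__(self, content: str, reasoning_content: str):
        super().__init__(content)
        self.reasoning_content = reasoning_content


def to_provider_wire(msg: OpenAIAssistantMessageParam) -> dict:
    """A provider adapter detects the internal subtype via isinstance
    and maps the reasoning onto its own wire field."""
    wire = {"role": msg.role, "content": msg.content}
    if isinstance(msg, AssistantMessageWithReasoning):
        wire["reasoning_content"] = msg.reasoning_content
    return wire
```

The isinstance check keeps the contract typed and explicit, in contrast to the earlier extra="allow" / getattr approach.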
- Move reasoning fallback from router to Responses layer so it's
testable in unit tests. When provider raises NotImplementedError,
log critical warning and fall back to regular CC instead of crashing.
- Add openai_chat_completions_with_reasoning to Bedrock adapter
- Add tests: supported provider uses reasoning path, unsupported
provider falls back gracefully
- Router now passes through directly — Responses layer owns the
fallback logic
Current LlamaStack client deserializes response output as dicts, not typed objects. Use _get_attr helper for dict-compatible assertions so tests work with both current client (dicts) and OpenAI client (typed objects). Remove stray pdb breakpoint.
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…breaking fallback
The router mutates params.model (strips the provider prefix like openai/). When reasoning fallback triggers, the mutated params can't be routed again. Pass a copy to the reasoning method so the original stays intact.
Force-pushed e085ec1 to 01bba9b
When a provider doesn't support reasoning and falls back to regular CC, clear reasoning_effort from params; providers like OpenAI's gpt-4o reject an unrecognized reasoning_effort parameter with a 400 error.
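The two fallback fixes from these commits (pass a copy so in-place router mutation doesn't break re-routing, and strip reasoning_effort before the regular CC path) can be sketched together. Function names and the dict-based params are hypothetical; the real router logic lives elsewhere.

```python
import copy


def route_with_reasoning_fallback(params: dict, call_reasoning, call_regular):
    """Try the reasoning path with a deep copy, since the router mutates
    params["model"] in place; on NotImplementedError, fall back to the
    regular CC path with reasoning_effort cleared."""
    try:
        return call_reasoning(copy.deepcopy(params))
    except NotImplementedError:
        fallback = copy.deepcopy(params)
        # some providers reject an unrecognized reasoning_effort with a 400
        fallback.pop("reasoning_effort", None)
        return call_regular(fallback)
```

Because only copies are handed to the callees, the caller's params stay routable even after a failed first attempt.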
Force-pushed 92f4fd4 to ca82390
What does this PR do?
Closed and re-opened #5087.
Adds end-to-end reasoning output support for LlamaStack's Responses API endpoint, enabling reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate their chain-of-thought reasoning through the LlamaStack Responses API pipeline.

Test Plan
1. BFCL Evals

1.1 GPT-OSS-120B: vllm-chat-completions (1), llamastack-chat-completions (2.1) and llamastack-responses (3.1) are now equivalent.

1.2 GPT-OSS-20B: similarly, we see that Row 1.2 and Row 2.2 are equivalent, meaning llamastack-responses itself brings no regression.

More details of the above table
2. Manual verification with Ollama and LlamaStack->Ollama on gpt-oss:20b

How:
- Checked that response.output has reasoning objects in the right order (source of truth being what they look like in the ollama and openai providers directly).
- Tested llamastack + ollama with an MCP server, where tool orchestration happens on the LlamaStack server side.
Compared end-to-end outputs produced by OpenAI vs LlamaStack->OpenAI on
gpt-5-mini.Manually compared the response output structure when using function calls and MCP tool calls in a multi-turn scenario.
Output: output structure comparison, OpenAI vs LlamaStack on gpt-5-mini:

| Turn | OpenAI | LlamaStack |
| --- | --- | --- |
| T1 | [McpListTools, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, OutputMessage] |
| T2 | [ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] |

* Minor pre-existing difference: LlamaStack re-emits McpListTools on every request; OpenAI only emits it on T1. This is existing MCP behavior, not related to reasoning changes.

BFCL Evals Setup
Used this guide for setting up evaluation, then tested with vllm v0.17 and ollama 0.6.1.