feat: reasoning output responses api #5206
robinnarsinghranabhat wants to merge 18 commits into llamastack:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
✱ Stainless preview builds
This PR will update the SDKs below. Edit this comment to update it. It will appear in the SDK's changelogs.
✅ llama-stack-client-node studio · conflict
✅ llama-stack-client-go studio · conflict
✅ llama-stack-client-python studio · code · diff
✅ llama-stack-client-openapi studio · code · diff
This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Force-pushed 776cb1f to 8ad6647
✅ Recordings committed successfully. Recordings from the integration tests have been committed to this PR.
Force-pushed 8ad6647 to a3337d1
cdoern left a comment
please use the Ollama-reasoning suite to add integration tests for this; see 87dc40b#diff-40d732a0defb244aec12e21fbd9cd387cbf212732f269549475db8de3877480c for more details on how I added some for the inference API previously.
I added tests to the ollama-reasoning suite. As I had only tested with … Sorry, I don't understand llamastack CI better.
mattf left a comment
@robinnarsinghranabhat please directly include details of how you ran bfcl
mattf left a comment
given reasoning is not part of the chat completions api, but reasoning is part of the responses api, and we implement our responses api atop our chat completions api -
we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api.
1. pick a name: magic_toc_tokens
2. require chat providers to populate magic_toc_tokens when appropriate
3. detect the magic_toc_tokens field in the responses impl and convert it to Reasoning output
4. ensure we do not leak magic_toc_tokens to users
(2) is going to become an implementation gap between providers, e.g. how do you get cot tokens from the openai provider's chat api? we'll probably have to move to responses.
(2) is also going to be hard work on provider adapters, e.g. vllm configured w/o a reasoning parser will return model specific cot tokens in the response, or different versions of vllm will put the reasoning content in different response fields.
this pr puts the adapter specific reasoning parsing into the responses adapter and declares only a partial implementation. if it were to complete the implementation it would have a web of provider specific code in the responses impl and will become unmaintainable.
as written, this pr puts us on an unmaintainable path.
some other ideas -
- add stack_chat_completions_with_reasoning to the Inference contract, for internal use only by the responses implementation
- add responses to the Inference contract for providers who can implement it. care will be needed here to ensure the provider responses loop does not execute any tools and no credentials are passed along.
docs/docs/api-openai/conformance.mdx (outdated)
@cdoern please confirm that this throws an error for novel outputs. for instance, is an error raised if the spec says fields x, y, z are to be returned and we return x, y, z & p?
```python
class OpenAIChatCompletionResponseMessage(BaseModel):
    """An assistant message returned in a chat completion response."""
    ...
    model_config = ConfigDict(extra="allow")
```
To build the next_turn_messages for the next round of pinging chat-completions.
@mattf Really appreciate this thorough review!
I agree with this as a long-term plan. As long as we stick to chat-completions, I see a need to standardize message conversion between responses and chat-completions as well.
But I notice the current responses adapter already expects a certain field; I inferred this as a contract where the provider-specific streaming-CC implementation is responsible for populating that field.
Updated the description.
good catch. i'd call that an oops that needs to be resolved. as implemented it means users will silently get different levels of service.
@mattf Maybe it was a mistake, but isn't the current implementation implicitly doing what you suggested, with a name of choice being …?
I am not sure if this PR should be closed then. Any ideas on where we are with prioritization on defining a …?
cdoern left a comment
@robinnarsinghranabhat, take a look at Stainless SDK Builds / run-integration-tests / Integration Tests. These tests generate a NEW client based on your changes and run the entire suite. It is ok if some of the regular integration tests fail, as long as their equivalent from Stainless passes, if the issue is the client.
cdoern left a comment
I agree with @mattf, basically this impl is backwards: the API should not have specific handling per-provider, the API needs to have contracts that each provider implements differently.
specific issues:
- model_config = ConfigDict(extra="allow") on OpenAIAssistantMessageParam (src/llama_stack_api/inference/models.py:636). This opens up the assistant message model to accept any arbitrary field, which is a sledgehammer approach just to smuggle a reasoning field through. It bypasses Pydantic validation and could let malformed data through silently.
- Reasoning is stuffed into Chat Completions types via setattr/getattr hacks (streaming.py:693, utils.py:321-325). The code does things like msg.reasoning = reasoning and getattr(choice.message, "reasoning", None) on types that don't have a reasoning field. This only works because of the extra="allow" hack above. It's an untyped, informal contract: nothing enforces it, nothing documents it at the type level.
- Provider-specific reasoning parsing lives in the Responses layer (streaming.py:578-590). This means each new provider's quirks will need to be handled here, in the wrong layer.
- _get_preceding_reasoning is fragile (utils.py:424-433). It only looks at the single item immediately before the current one. If the input ordering ever changes, or if there are multiple reasoning items, this silently drops reasoning content.
- ChatCompletionResult.reasoning is a flat str | None (types.py:71). Reasoning content from providers can be structured (multiple segments, summaries, etc.), but this flattens it all into a single concatenated string, losing structure.
- Partial provider coverage: the PR only handles Ollama/vLLM-style reasoning. OpenAI's own reasoning, Gemini's tags, and other providers are explicitly not covered, making this a partial implementation that will need the same pattern repeated per-provider.
These all tie back to Matt's core point: reasoning extraction should be a provider-level concern with a well-typed internal contract, not ad-hoc field smuggling through the Responses layer.
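On the _get_preceding_reasoning fragility point above: a sketch of a look-back that scans all contiguous reasoning items instead of only the single preceding one. This is a hypothetical helper over plain dicts, not the PR's actual code or types.

```python
def collect_preceding_reasoning(items: list[dict], idx: int) -> str:
    """Walk backwards from items[idx] and gather every contiguous
    reasoning item, so multiple reasoning segments are not silently
    dropped and a single non-reasoning item stops the scan."""
    texts: list[str] = []
    j = idx - 1
    while j >= 0 and items[j].get("type") == "reasoning":
        texts.append(items[j]["text"])
        j -= 1
    # reversed() restores original input order before joining
    return "".join(reversed(texts))
```

Anchoring the scan on item type rather than position also makes the helper robust to reorderings that insert non-reasoning items earlier in the input.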
reasoning can be enabled via …
and the reasoning comes back to the user as an output message w/ a required(!) summary field and optional content / encrypted_content fields.
a reasonable and simple path forward: …
someone will come along later and fill out the provider implementations (3). in the meantime, we give users confidence that we're doing what they request.
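The reasoning output item described above (required summary, optional content / encrypted_content) has roughly this shape. A hand-written illustration of the wire format; the id and text values are made up.

```python
# Illustrative reasoning output item: `summary` is required (may be empty),
# `content` and `encrypted_content` are optional.
reasoning_item = {
    "type": "reasoning",
    "id": "rs_example",  # hypothetical id
    "summary": [],       # required field, empty when no summary was produced
    "content": [
        {"type": "reasoning_text", "text": "model chain-of-thought"}
    ],
}
```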
@mattf @cdoern Made some changes while trying to keep things minimal and not break anything. This is WIP, tested with Ollama for now.
Main Ideas:
A Confusing Inconsistency:
cdoern left a comment
this is moving in the right direction! some questions:
Force-pushed 3f03d95 to 5195cba
Force-pushed 0b10222 to e166515
Recording workflow finished with status: failure
Providers: gpt
Recording attempt finished. Check the workflow run for details.
Fork PR: Recordings will be committed if you have "Allow edits from maintainers" enabled.
I believe this was an oops as well on letting …
mattf left a comment
thank you for the progress on this.
there's a lot of type mismatches here. it's not your fault. we're excluding this entire module from mypy, see https://github.com/llamastack/llama-stack/blob/main/pyproject.toml#L494
ptal #5342
```python
else:
    # Non-streaming reasoning is not tested — the Responses
    # layer always uses stream=True (streaming.py:518).
    raise NotImplementedError("Non-streaming reasoning is not yet supported for vLLM")
```
why not implemented when stream=False or isn't present?
The responses layer hardcodes "streaming mode" when it pings CC; probably essential to act as a non-blocking async server.
So, I didn't scope that within this PR.
I will clarify these comments and logs. Currently it sounds like vLLM responses doesn't support streaming.
```python
# Handle reasoning content if present.
# When openai_chat_completions_with_reasoning is used, the provider
# maps reasoning outputs it receives to the `reasoning_content` on the delta.
if getattr(chunk_choice.delta, "reasoning_content", None):
```
when the types are accurate you won't need to getattr
I would like to clean this up in a follow-up as well; only trying to limit the scope atm.
In case I wasn't clear: only VertexAIInferenceAdapter correctly sends llamastack's OpenAIChatCompletionChunk, while other providers still propagate the openai.types.chat.ChatCompletionChunk they received directly. That's why I mentioned this:
As a future direction, should we be ensuring the Responses layer consistently receives LlamaStack-provided types?
those are bedrock recordings ^ the rest you need to do locally
GPT recordings fail due to legitimate issues still with this PR
This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Thanks @cdoern. It was a struggle with integration tests initially, but they surfaced crucial bugs I could patch. Now working towards hardening the tests. @s-akhtar-baig I am puzzled how … For that, I needed to give additional …
Enable reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate reasoning content through the Responses API pipeline:
- Accumulate reasoning text from streaming chat completion chunks into the ChatCompletionResult.reasoning field (types.py)
- Construct OpenAIResponseOutputMessageReasoningItem from accumulated reasoning, using the content field, independent of response type (streaming.py)
- Propagate reasoning on assistant messages in multi-turn server-side tool loops via _separate_tool_calls (streaming.py)
- Consume reasoning items from input via look-back: skip reasoning items during CC conversion, attach text to the next assistant message for FunctionToolCall, McpCall, and ResponseMessage (utils.py)
- Add ReasoningItem to the OpenAIResponseOutput union (openai_responses.py)
- Add tests for reasoning look-back in input conversion
- Add test_reasoning_non_streaming: verifies ReasoningItem present in the output
- Add test_reasoning_multi_turn_passthrough: verifies reasoning survives round-trip
- Wire both tests into the ollama-reasoning suite
`response.output` from llama_stack_client is different (and incorrect) compared to openai client (e.g. reasoning items become generic OutputOpenAIResponseMessageOutput instead of typed ResponseReasoningItem), causing assertion failures.
…th_reasoning
Add openai_chat_completions_with_reasoning to the InferenceProvider contract. Each provider that supports reasoning implements it and owns its own mapping logic, both for request params and response chunks.
Provider implementations:
- vLLM/Ollama: map between LlamaStack's 'reasoning_content' and the provider's CC field name (which may vary across versions). Each provider adjusts CC request params via _prepare_reasoning_params (e.g. Ollama defaults reasoning_effort="none" when not requested).
- OpenAI: raises NotImplementedError (reasoning only via native Responses API)
- All others: router catches unsupported providers and raises a clear error
Responses layer changes:
- Calls the new method when reasoning.effort is set and not "none"
- Reads the typed reasoning_content field instead of hasattr/getattr hacks
- Returns 400 for reasoning.encrypted_content in include
- Add reasoning_content as a typed field on OpenAIChatCompletionResponseMessage and OpenAIAssistantMessageParam
Tests:
- Reject reasoning.encrypted_content in include
- reasoning effort="none" uses regular CC path
…ying public API types Per maintainer feedback: reasoning_content should not be added to public CC spec types (OpenAIAssistantMessageParam, OpenAIChatCompletionResponseMessage) as it deviates from the OpenAI API spec. Instead, create AssistantMessageWithReasoning (internal type in responses/types.py) that extends OpenAIAssistantMessageParam with reasoning_content. Providers check isinstance(msg, AssistantMessageWithReasoning) to detect and map reasoning. Reasoning flows from ChatCompletionResult directly to _separate_tool_calls, avoiding the need to modify public response message types.
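The internal-subtype approach described in this commit might look roughly like the sketch below. Plain classes stand in for the real Pydantic models, and `to_provider_wire` is a hypothetical adapter helper; the provider's wire field name varies in practice.

```python
class OpenAIAssistantMessageParam:
    """Stand-in for the public spec type, which stays unchanged."""
    def __init__(self, content: str):
        self.role = "assistant"
        self.content = content


class AssistantMessageWithReasoning(OpenAIAssistantMessageParam):
    """Internal-only subtype carrying reasoning_content; it never
    appears in the public chat-completions API surface."""
    def __init__(self, content: str, reasoning_content: str):
        super().__init__(content)
        self.reasoning_content = reasoning_content


def to_provider_wire(msg: OpenAIAssistantMessageParam) -> dict:
    """A provider adapter detects the internal subtype via isinstance
    and maps the reasoning onto its own wire field."""
    wire = {"role": msg.role, "content": msg.content}
    if isinstance(msg, AssistantMessageWithReasoning):
        wire["reasoning_content"] = msg.reasoning_content
    return wire
```

The isinstance check keeps the contract typed and explicit, in contrast to the earlier extra="allow" / getattr approach.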
- Move reasoning fallback from router to Responses layer so it's
testable in unit tests. When provider raises NotImplementedError,
log critical warning and fall back to regular CC instead of crashing.
- Add openai_chat_completions_with_reasoning to Bedrock adapter
- Add tests: supported provider uses reasoning path, unsupported
provider falls back gracefully
- Router now passes through directly — Responses layer owns the
fallback logic
Current LlamaStack client deserializes response output as dicts, not typed objects. Use _get_attr helper for dict-compatible assertions so tests work with both current client (dicts) and OpenAI client (typed objects). Remove stray pdb breakpoint.
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…breaking fallback
The router mutates params.model (strips the provider prefix like openai/). When reasoning fallback triggers, the mutated params can't be routed again. Pass a copy to the reasoning method so the original stays intact.
Force-pushed e085ec1 to 01bba9b
When a provider doesn't support reasoning and falls back to regular CC, clear reasoning_effort from params; providers like OpenAI's gpt-4o reject an unrecognized reasoning_effort parameter with a 400 error.
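The two fallback fixes from these commits (pass a copy so in-place router mutation doesn't break re-routing, and strip reasoning_effort before the regular CC path) can be sketched together. Function names and the dict-based params are hypothetical; the real router logic lives elsewhere.

```python
import copy


def route_with_reasoning_fallback(params: dict, call_reasoning, call_regular):
    """Try the reasoning path with a deep copy, since the router mutates
    params["model"] in place; on NotImplementedError, fall back to the
    regular CC path with reasoning_effort cleared."""
    try:
        return call_reasoning(copy.deepcopy(params))
    except NotImplementedError:
        fallback = copy.deepcopy(params)
        # some providers reject an unrecognized reasoning_effort with a 400
        fallback.pop("reasoning_effort", None)
        return call_regular(fallback)
```

Because only copies are handed to the callees, the caller's params stay routable even after a failed first attempt.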
Force-pushed 92f4fd4 to ca82390
What does this PR do?
Closed and re-opened #5087.
Adds end-to-end reasoning output support for LlamaStack's Responses API endpoint, enabling reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate their chain-of-thought reasoning through the LlamaStack Responses API pipeline.

Test Plan
1. BFCL Evals

1.1 GPT-OSS-120B: vllm-chat-completions (1), llamastack-chat-completions (2.1) and llamastack-responses (3.1) are now equivalent.

1.2 GPT-OSS-20B: similarly, we see that Row 1.2 and Row 2.2 are equivalent, meaning llamastack-responses itself brings no regression.

More details of the above table
2. Manual verification with Ollama and LlamaStack->Ollama on gpt-oss:20b

How:
- Checked that response.output has reasoning objects in the right order (source of truth being what they look like in the ollama and openai providers directly).
- Tested llamastack + ollama with an MCP server, where tool orchestration happens on the LlamaStack server side.
Compared end-to-end outputs produced by OpenAI vs LlamaStack->OpenAI on
gpt-5-mini.Manually compared the response output structure when using function calls and MCP tool calls in a multi-turn scenario.
Output: output structure comparison, OpenAI vs LlamaStack on gpt-5-mini:

| Turn | OpenAI | LlamaStack |
| --- | --- | --- |
| T1 | [McpListTools, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, OutputMessage] |
| T2 | [ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] |

* Minor pre-existing difference: LlamaStack re-emits McpListTools on every request; OpenAI only emits it on T1. This is existing MCP behavior, not related to reasoning changes.

BFCL Evals Setup
Used this guide for setting up evaluation, then tested with vllm v0.17 and ollama 0.6.1.