OpenClaw + codex-lb: visible intermediate assistant messages can stop before tool calls #285
## TL;DR
I initially thought this was a pure codex-lb transport bug. After spending a lot more time debugging OpenClaw itself, my current understanding is more nuanced:
- I did hit real OpenClaw-side integration bugs first (`No API key for provider: codex-lb`, then `Failed to extract accountId from token`)
- I locally patched those in OpenClaw and moved from the older HTTP `/v1` path to the real Codex websocket path (`openai-codex-responses` -> `/backend-api/codex`)
- that improved the transport/auth layer, but it did not fully fix the original symptom
- the remaining symptom now looks more like an OpenClaw continuation/orchestration problem than a pure `codex-lb` transport problem
I am posting this because I may still be misunderstanding something, and I would really appreciate feedback from people who know this stack better.
## Full story
About a month ago I installed codex-lb for the first time and started using it with OpenClaw.
Very quickly I noticed a strange behavior: the agent could send a visible intermediate message like:
"OK, I'll open the file and check it"
or
"First I'll send an intermediate update, then I'll run ping"
but after that, nothing happened. The next step, where the model was supposed to make a tool call, simply never came. The reasoning chain just stopped on a normal assistant text response.
An important detail: this was not a case where tool calls never worked at all. Tool calls did work in general. The problem was narrower: when the model first emitted a visible intermediate message and was then supposed to continue into a tool call, that chain often broke.
At first I assumed this was a codex-lb problem. I spent a long time digging into codex-lb itself, wondering whether it was dropping tool calls, mishandling websocket traffic, or doing something wrong in the proxy layer. At that time I never got to the real root cause, so I dropped the investigation.
Recently I came back to it, but this time I focused much more on OpenClaw itself.
## What I found first: a separate OpenClaw-side auth bug
The first confirmed issue I hit was:
`No API key for provider: codex-lb`
At first this looked like codex-lb itself was rejecting the key. But after tracing the code, it turned out the problem was deeper and not actually in codex-lb.
What I found was roughly this:
- OpenClaw resolved the custom provider key correctly in its model registry / auth path
- somewhere in the embedded runtime session path, that key was effectively lost before the call reached `pi-ai`
- so `pi-ai` ended up seeing provider `codex-lb` without an `apiKey`
- and it threw `No API key for provider: codex-lb` before any real network request even reached `codex-lb`
So there was at least one real OpenClaw-side bug before the request even got far enough for codex-lb to matter.
I worked around that locally by restoring runtime API key propagation for the embedded agent session.
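Conceptually, the workaround was tiny. As a sketch (the function name and the session/registry shapes here are my own illustration, not OpenClaw's actual API), it just re-resolves the key from the provider registry whenever the embedded session has lost it:

```python
def resolve_provider_key(session: dict, registry: dict) -> dict:
    """Return a session request dict that always carries the provider apiKey.

    `session` stands in for the embedded runtime session payload; `registry`
    maps provider ids to their configured credentials. Both shapes are
    illustrative -- the real OpenClaw structures differ.
    """
    provider = session.get("provider")
    if not session.get("apiKey"):
        # The bug: by the time the call reached pi-ai, apiKey was gone.
        # The workaround: re-resolve it from the provider registry here.
        entry = registry.get(provider, {})
        if entry.get("apiKey"):
            session = {**session, "apiKey": entry["apiKey"]}
    return session
```

The point is only where the fallback happens: late, in the embedded session path, rather than relying on the key surviving from the model-registry resolution earlier.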
## The next issue: Codex path expected an OpenAI-style token
Once the API key propagation issue was fixed, I hit another blocker:
`Failed to extract accountId from token`
This was a different problem.
At this point the key was reaching the transport layer, but the `openai-codex-responses` path expected something that looked like an OpenAI/Codex-style token and tried to extract `chatgpt_account_id` from it.
That makes sense for the official direct `chatgpt.com` `/backend-api/codex` path. But codex-lb uses its own client-side key format like `sk-clb-...`, and that key is not required to be an OpenAI JWT.
So the situation became:
- the `apiKey` was now reaching the transport layer
- but OpenClaw / `pi-ai` was still trying to interpret `sk-clb-...` as if it were an OpenAI-style token
- it could not extract an `accountId`
- and the Codex path still failed
At that point I started moving away from the older generic OpenAI-compatible HTTP path and tried to use the more native Codex route instead.
## Why I switched from HTTP `/v1` to the real Codex websocket path
Originally, OpenClaw was talking to codex-lb through the simple compatibility route:
- `baseUrl` = `/v1`
- `api` = `openai-completions`
That path did work, and this detail matters a lot: basic tool calls also worked there.
This is important because switching to websocket did not “turn tools on from zero”. Tools were already partially working before.
However, I started suspecting that the harder failure case — “visible intermediate message first, then expected tool call never comes” — might be related to the fact that OpenClaw was not using a true Codex-native transport, but only a generic HTTP compatibility path.
So I moved the integration toward:
- `openai-codex-responses`
- `codex-lb` `/backend-api/codex`
- websocket transport
The hope was that if I made OpenClaw use the same class of transport as the native Codex client, it would handle the pattern “intermediate visible message -> tool call -> continuation” correctly.
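At the provider-config level, the shape of the change was roughly this (field names and the placeholder host are illustrative of my local setup, not a documented OpenClaw schema):

```json
{
  "before": {
    "baseUrl": "https://codex-lb.example/v1",
    "api": "openai-completions"
  },
  "after": {
    "baseUrl": "wss://codex-lb.example/backend-api/codex",
    "api": "openai-codex-responses"
  }
}
```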
## What I changed locally in OpenClaw
I ended up making local OpenClaw source changes in a few places.
The important ideas were:
- restore runtime API key propagation for custom providers
- add a proxy-aware Codex transport path for non-OpenAI base URLs
- keep downstream auth as `Bearer sk-clb-...`
- provide a synthetic `chatgpt-account-id` separately instead of trying to parse `sk-clb-...` as an OpenAI JWT
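The last two bullets amount to building the Codex request headers like this (a minimal sketch under my assumptions; the header name spelling and the real header set OpenClaw sends may differ):

```python
def codex_proxy_headers(api_key: str, synthetic_account_id: str) -> dict:
    """Headers for the proxy Codex path: keep the opaque codex-lb key as the
    bearer token, and supply the account id explicitly instead of deriving
    it from the token."""
    return {
        "Authorization": f"Bearer {api_key}",
        # Supplied directly, since sk-clb-... carries no chatgpt_account_id.
        "chatgpt-account-id": synthetic_account_id,
    }
```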
I want to be careful here: I am not claiming these patches are the architecturally correct upstream fix. I am only saying this is the path I used to isolate the problem and get much farther than before.
After those changes, a few important things became true:
- the API key propagation problem was gone
- the `Failed to extract accountId from token` problem was gone for the proxy Codex path
- OpenClaw really did start talking to `codex-lb` over websocket, not just through `/v1`
- tool-call transport itself became operational on that path
So at that point it was no longer correct to say “the websocket transport is broken” or “codex-lb cannot accept Codex-style traffic from OpenClaw”.
## The crucial realization: basic tool calls already worked before websocket
This turned out to be really important.
Once I switched to websocket, I initially assumed that if the transport was now “correct”, then the original problem should disappear. But that turned out not to be true.
Basic tool calls had already worked before, on the older HTTP route `/v1` + `openai-completions`.
So:
- the websocket migration did not magically “enable tools”
- it fixed transport/auth mismatches
- it made the path more Codex-native
- but it did not by itself solve the main remaining symptom
That is why, from the outside, it could still look like “nothing changed”, even though some very real lower-layer issues had actually been fixed.
## What still remained broken
After all the transport/auth fixes, the original complaint was still there:
- the model sends a visible intermediate message
- it should then continue and call a tool
- but the run ends like an ordinary assistant response with `stop`
At first I still suspected websocket or codex-lb, but more testing made that explanation weaker.
Here is what I was able to confirm:
1. In the failing sessions there were no new runtime/tool errors
During the failing windows I was no longer seeing things like:
- `No API key for provider: codex-lb`
- `Failed to extract accountId from token`
- `Unexpected server response: 500`
So the lower transport/auth layer already looked clean.
2. A direct websocket probe to codex-lb could return message -> completed with no tool call at all
This was an important experiment.
I opened a direct websocket to codex-lb at /backend-api/codex/responses and used a prompt like “first send a short intermediate update, then call a tool”.
The model really could return:
- `message` (text)
- `completed`

with no `function_call`.
So websocket transport by itself does not guarantee “after visible text there will definitely be a tool call”.
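The probe boils down to collecting the event types of one response stream and checking what it ended with. Classifying the stream is trivial (the event names follow the Codex responses vocabulary as I observed it through the probe; the helper itself is my own):

```python
def classify_turn(events: list[str]) -> str:
    """Summarize one response stream from the probe.

    Returns 'tool_call' if the stream contained a function_call,
    'text_only' if it was message -> completed with no tool call,
    and 'other' for anything else.
    """
    if "function_call" in events:
        return "tool_call"
    if "message" in events and events and events[-1] == "completed":
        return "text_only"
    return "other"
```

In the failing runs, the stream classified as `text_only`: a perfectly well-formed response, just without the tool call I expected.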
3. The same transport could also produce tool calls if the orchestration pressure changed
So the problem is not that websocket “cannot do tools”. It clearly can.
4. Codex CLI / Codex App through the same codex-lb behaved differently
This may be the most important observation.
When I tested similar scenarios through Codex CLI / Codex App, it looked like, after an intermediate visible message, they could continue the workflow: not necessarily inside the exact same provider response, but by making additional websocket requests.
In other words, the native Codex client appears able to do something like:
- show an intermediate message to the user
- then make another model step
- then call a tool
- then continue again
OpenClaw, in the analogous situation, often behaves differently:
- it receives a normal assistant text
- it sees `stop`
- it assumes the turn is over
- and the loop does not continue
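In agent-loop terms, the difference can be sketched as a tiny driver loop. Everything here is my own illustration of the hypothesis (I have not verified OpenClaw's actual loop structure, and `continue_hint` is an invented stand-in for however intent-to-continue might be detected):

```python
def run_turn(model_step, max_steps: int = 8) -> list[str]:
    """Drive one user turn against a model-step callable.

    `model_step` returns dicts like {"type": "text"|"tool_call",
    "finish": "stop"|"tool", "continue_hint": bool}. The key design choice
    is what to do when a step ends with plain text + stop while the model
    has signalled it still intends to act.
    """
    transcript = []
    for _ in range(max_steps):
        step = model_step()
        transcript.append(step["type"])
        if step["type"] == "tool_call":
            continue  # run the tool, then loop for the follow-up step
        if step["finish"] == "stop" and not step.get("continue_hint"):
            break  # genuinely final answer: end the turn
        # Codex-App-like behavior: plain text + stop, but with intent to
        # continue -> issue another model call instead of ending the turn.
    return transcript
```

The symptom I describe corresponds to a loop that breaks on any `stop`, regardless of whether the text was an intermediate update or a final answer.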
## My current interpretation
Based on all of that, my current interpretation is:
- `codex-lb` is not simply "dropping" tool calls
- the websocket route in `codex-lb` is working
- OpenClaw's auth path can be made to work with it
- basic tools are possible
- but the remaining bug is higher-level, in agent-loop / continuation logic inside OpenClaw
So the problem now looks more like this:
if the model emits a visible intermediate assistant text without a tool call in the same response, OpenClaw too often treats that as the final end of the turn and does not continue, even though Codex App / CLI may continue with additional model calls.
That is a very important shift in focus.
Without making this explicit, it is very easy to go back to the wrong explanation again and say “then websocket must still be broken” or “then codex-lb must still be cutting something off”.
## One more nuance: I have also seen something similar with the normal OpenClaw OpenAI Codex provider
I want to be careful here.
Through codex-lb this problem appeared much more often for me, and that is why I started the whole investigation.
But later I noticed that similar behavior could sometimes happen even with the ordinary built-in OpenClaw OpenAI Codex provider.
So my cautious conclusion is:
- I do not think `codex-lb` is necessarily creating this behavior entirely from scratch
- but through `codex-lb` it showed up much more frequently and much more clearly
- so either the proxy path makes the model more likely to produce a plain `text + stop` turn
- or the native provider path is simply better aligned with OpenClaw's assumptions
- or the native stack more often returns the tool call inside the same turn
But based on what I have so far, I would not say “this is only a proxy problem”. It looks more like the proxy path exposed an existing weakness in OpenClaw’s orchestration logic.
## Why I am posting this
I want to be honest: I am not fully confident that my local OpenClaw patches were architecturally correct.
Part of this investigation and part of the code changes were done with the help of an agent, so while I think the symptom tracing and the resulting hypothesis are fairly strong, I absolutely accept that:
- I may still be misunderstanding something
- I may have patched the wrong layer in places
- there may be a much simpler and more correct solution
- this may already be a known OpenClaw behavior that I rediscovered in a messy way
That is why I am posting this not as “here is my final fix”, but as a detailed investigation path. I would really appreciate feedback from people who know this stack better.
I would especially appreciate comments from anyone who has seen this exact pattern:
- the model first emits a visible intermediate message
- then it is supposed to call a tool
- but instead the reasoning chain stops
- and the run ends as an ordinary assistant text response
## Final short summary
If I compress the whole story into a sequence, it looks like this:
- first I saw the symptom that tools did not fire after an intermediate visible message
- at first I thought this was a `codex-lb` problem
- I spent a long time unsuccessfully digging into `codex-lb` itself
- then I switched to investigating OpenClaw
- I found and locally worked around an OpenClaw issue where `apiKey` was lost for custom providers
- I found and locally worked around a Codex transport issue involving `chatgpt_account_id` extraction from an opaque bearer token
- I moved OpenClaw from the older HTTP `/v1` route to websocket / `openai-codex-responses` / `/backend-api/codex`
- I confirmed that the transport/auth layer really became functional
- but the original symptom still remained
- and my current conclusion is that the remaining bug is much more likely to be an OpenClaw continuation/orchestration issue than a pure `codex-lb` transport issue
If anyone has seen this before, or if I am misunderstanding some part of the intended integration, I would be very grateful for any feedback.