
fix: eliminate 60-second delay for OpenRouter Pipe requests (issue #378) #384

Open
cwawak wants to merge 15 commits into cogwheel0:main from cwawak:fix/direct-flag-pipe-model-delay

Conversation

@cwawak commented Mar 2, 2026

Summary

Fixes the ~60 second OpenRouter Pipe delay in Conduit and hardens SSE-only
streaming to restore usage/cost info and prevent duplicate output.

Root Cause

OpenWebUI routes /api/chat/completions through an async task queue when
session_id, chat_id, and id (message_id) are all present, adding ~60
seconds of latency. Conduit was sending all three; the web UI does not.

Fix

  1. Remove session_id, id, and chat_id from the request payload to bypass
    the async queue and force direct SSE streaming.
  2. Parse OpenWebUI SSE formats (event payloads + reasoning content).
  3. Consume SSE usage markers so the info/cost button is restored.
  4. Preserve whitespace chunks so tokens don’t concatenate and “Thinking…”
    content appears reliably.
  5. Skip channel socket subscriptions in SSE-only mode to avoid duplicate
    content from both HTTP and sockets.
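As a language-agnostic sketch of step 1 (illustrative Python, not the actual Dart in api_service.dart; the key names come from the root-cause description above):

```python
def strip_task_queue_ids(payload: dict) -> dict:
    """Remove the identifiers that make OpenWebUI route the request
    through its async task queue. Without all three present, the server
    streams the response directly over SSE instead."""
    return {k: v for k, v in payload.items()
            if k not in ("session_id", "chat_id", "id")}

# Example request as Conduit previously built it (field values made up):
request = {
    "model": "some-openrouter-pipe-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "session_id": "abc123",
    "chat_id": "chat-1",
    "id": "msg-1",
}
direct = strip_task_queue_ids(request)
assert all(k not in direct for k in ("session_id", "chat_id", "id"))
```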

Changes

  • lib/core/services/api_service.dart
    • Remove identifiers from payload; add SSE parsing (usage + reasoning).
  • lib/core/services/streaming_helper.dart
    • SSE usage marker handling, whitespace preservation, and skip channel
      socket subscriptions when httpStreamOnly.
  • lib/features/chat/providers/chat_providers.dart
    • Pass httpStreamOnly to streaming helper.
  • lib/features/chat/widgets/assistant_message_widget.dart
    • Usage/cost display logic for info button.
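The SSE handling described above (usage markers plus whitespace-preserving deltas) can be sketched roughly like this — illustrative Python only; the real parsing lives in api_service.dart, and exact payload shapes vary with the OpenWebUI version:

```python
import json

def parse_sse_line(line: str):
    """Classify one 'data: ...' SSE line as ('delta'|'usage'|'done', value).

    Sketch under two assumptions: usage arrives as a top-level 'usage'
    object (which feeds the info/cost button), and text arrives as
    OpenAI-style choice deltas."""
    if not line.startswith("data: "):
        return None                            # comments, event: lines, etc.
    data = line[len("data: "):]
    if data.strip() == "[DONE]":
        return ("done", None)
    obj = json.loads(data)
    if obj.get("usage"):                       # usage marker -> info/cost button
        return ("usage", obj["usage"])
    choices = obj.get("choices") or [{}]
    content = choices[0].get("delta", {}).get("content")
    if content is not None:                    # keep whitespace-only chunks too
        return ("delta", content)
    return None
```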

Testing

  • HTTP request completes in ~2–3 seconds instead of ~60 seconds.
  • SSE response streams properly with reasoning and usage/cost info.
  • No duplicate content in SSE-only mode.

@cwawak (Author) commented Mar 2, 2026

Root cause → Fix → Results (please note: I’m not an expert here, just doing my best to help)

Root cause

  • OpenWebUI routes /api/chat/completions through a Redis async task queue when all three identifiers are present: session_id, chat_id, and id (message_id). That queue adds ~60s latency. Conduit was sending all three; the web UI does not.

Fix

  • Stop sending session_id, chat_id, and id in the Conduit request to avoid the async queue and force direct SSE streaming.
  • Update SSE parsing to handle OpenWebUI event format + reasoning blocks.
  • Consume SSE usage markers so the info/cost button works.
  • Preserve whitespace chunks and skip channel socket subscriptions in SSE-only mode to avoid concatenation/duplication.

Results

  • ✅ 60-second delay eliminated (responses start in ~2–3s).
  • ✅ SSE streaming works with reasoning + usage/cost info.
  • ✅ No duplicated output in SSE-only mode (channel socket unsubscribed).
  • ✅ Spacing/“Thinking…” content no longer drops when chunks are whitespace-only.

cwawak added 4 commits March 2, 2026 16:43
- Remove session_id, id, and chat_id from request payload to bypass
  OpenWebUI's async task queue (issue cogwheel0#378)
- Add SSE parsing to handle streaming response directly via HTTP
- Add isHttpStreamOnly flag to prevent duplicate content from WebSocket
- When using SSE-only mode, skip WebSocket subscriptions entirely

The 60-second delay was caused by OpenWebUI routing requests through an
async task queue when session_id, chat_id, and message_id (id) were all
present. By removing these identifiers, requests go directly through SSE
streaming instead.
cwawak force-pushed the fix/direct-flag-pipe-model-delay branch from 9c69496 to f57d791 on March 2, 2026 21:46
@cwawak (Author) commented Mar 2, 2026

Quick update: I added SSE replay‑dedupe + whitespace handling for SSE‑only mode.

  • Skip replayed SSE chunks when the server re‑sends already‑seen content.
  • Always append whitespace‑only chunks so spaces between tokens aren’t dropped.

This resolved the duplicated responses and the “Thesky” first‑word spacing issue in my testing.
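The replay-dedupe plus whitespace rule above can be sketched as follows — an illustrative Python reduction, assuming replays arrive as re-sends of content already at the end of the accumulated text:

```python
def append_chunk(buffer: str, chunk: str) -> str:
    """Append an SSE chunk to the accumulated message text.

    Two rules from the fix: never drop whitespace-only chunks (so
    'The' + ' ' + 'sky' keeps its space instead of becoming 'Thesky'),
    and skip chunks the server has replayed verbatim."""
    if chunk.strip() == "":
        return buffer + chunk        # always keep pure whitespace
    if chunk and buffer.endswith(chunk):
        return buffer                # replayed content: skip it
    return buffer + chunk
```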

@cwawak (Author) commented Mar 2, 2026

Update: I rebased the PR branch onto the refactor version.

What changed in the PR branch:

  • fix/direct-flag-pipe-model-delay now points to the refactor/sse-normalize history.
  • SSE content is normalized in api_service.dart into deltas before the UI sees it.
  • UI stream handler (streaming_helper.dart) no longer needs replay/duplicate heuristics.
  • Mixed SSE formats are handled by skipping OpenAI deltas once SSE event content is seen.

Result: no duplicated responses and first‑word spacing issues resolved in my tests.
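Normalizing SSE content into deltas before the UI sees it might look like this — an illustrative Python sketch, assuming (as the rebase description implies) that some OpenWebUI SSE events carry the full accumulated message rather than an increment:

```python
def normalize_to_delta(seen: str, event_content: str) -> tuple[str, str]:
    """Turn a possibly-cumulative SSE event into (new_seen, delta).

    If the event content extends what we've already seen, emit only the
    unseen suffix; the UI layer then never needs replay/duplicate
    heuristics. Shorter or out-of-order replays yield an empty delta."""
    if event_content.startswith(seen):
        return event_content, event_content[len(seen):]
    return seen, ""
```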

@cogwheel0 (Owner) commented

Hey @cwawak, thank you for the PR! Streaming via SSE has its own problems and is not very reliable for mobile clients. I struggled to balance it and keep it as a path in the initial days of the app. Duplication is just one of the issues you may have come across. OWUI web also primarily relies on websockets.

Anyways, could you try this and let me know if it alleviates the issue without relying on SSE?: https://docs.openwebui.com/troubleshooting/connection-error#websocket-troubleshooting

@cwawak (Author) commented Mar 3, 2026

Hi @cogwheel0 - thanks for being so kind. I don't think my patch is usable, but maybe helpful for someone who is troubleshooting this!

I have verified that websockets are working fine. Using Chrome on my Mac, I do not have any issues with WebSockets. Using Conduit with Anthropic-hosted or NIM-hosted models works fine with the latest couple of versions of the app.

I only encounter the 30s+30s=60s total delay when using models deployed via the "Open WebUI OpenRouter Pipe". This pipe allows easy toggling of "OpenRouter search" (where OpenRouter, for a few pennies, inserts search results into the model response), some enforcement of Zero Data Retention flags per model, and a nice little display of tokens consumed and total cost.

In the Web UI, I have no problems with WS or delays, only via the Conduit app. I don't necessarily think this is something that should be fixed in Conduit, but I couldn't figure out what was wrong with the OpenRouter Pipe itself.

Anyway, I greatly appreciate your kind words, I think your application is positively delightful and very easy to work on for a novice coder!

@cogwheel0 (Owner) commented

Ah I see! If you are just looking to use the web search for OpenRouter models, might I suggest another solution? If you put :online after any model in the OWUI connections with an OpenRouter endpoint, it would work just as well. It even returns the sources along with the number of tokens consumed in the newer versions.

I haven't tried out the pipe you linked so I might be missing something as well.
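For illustration, the :online suggestion amounts to appending the suffix to the model ID in the request. A hedged sketch in Python — the host, API key, and model ID below are placeholders, and the endpoint path is OpenWebUI's /api/chat/completions:

```python
import json
import urllib.request

# Hypothetical server URL, API key, and model ID -- adjust for your setup.
payload = {
    "model": "openai/gpt-4o-mini:online",  # ':online' enables OpenRouter web search
    "messages": [{"role": "user", "content": "What happened in tech news today?"}],
    "stream": False,
}
req = urllib.request.Request(
    "https://owui.example.com/api/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)
# response = urllib.request.urlopen(req)  # uncomment against a real server
```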

@cwawak (Author) commented Mar 3, 2026

I'm too cheap to actually pay the $0.02 per request for web search, so it's not actually a feature I care about! :D

I think you're right that the best solution is to simply use the normal OpenAI model flow rather than a custom pipe, especially if there are strange issues that take days to figure out. I thought this would be a 15-minute fix.

The only thing I'd be missing is that I don't think the native OpenAI model API has a way to display the cost and token consumption like in this picture.

[Screenshot (2026-03-03): cost and token consumption display]

Thank you so much for your assistance!

@swever826 commented

The only thing I'd be missing is that I don't think the native OpenAI model API has a way to display the cost and token consumption like in this picture.

Hi @cwawak. Sorry, just wanted to share that to see the consumption in real time I am using LiteLLM, if you want to give it a try. It's open source and can be run locally.

It works like a proxy, so all the LLMs you use can be reached from one place, and it has a lot of cool functionality. One feature is that it shows the cost and token consumption of each LLM in real time.

@TeenBiscuits commented

I also use OpenRouter Pipe; it's very useful for integrating OpenRouter features into Open Web UI, but latency is a major issue. Is there a solution?
