Skip to content

fix: make upstream timeout configurable, default to 300s#799

Open
syedhashmi wants to merge 6 commits intomainfrom
syed/issue_787
Open

fix: make upstream timeout configurable, default to 300s#799
syedhashmi wants to merge 6 commits intomainfrom
syed/issue_787

Conversation

@syedhashmi
Copy link
Copy Markdown
Collaborator

@syedhashmi syedhashmi commented Mar 5, 2026

Placeholder PR for Adil to review the timeout related changes.

fixes #787

@syedhashmi syedhashmi requested a review from adilhafeez March 5, 2026 00:33
…787)

Hardcoded 30s timeouts in envoy config caused premature termination of
long-running LLM requests (tool-use, agentic workflows). Make timeouts
configurable via upstream_timeout_ms override and default to 300s.
Copy link
Copy Markdown
Contributor

@salmanap salmanap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on the PR. Also, I think we would need a test case here - else its hard to tell if the change in timeout is actually working. Lastly, I don't follow from the issues request how the user simulated a timeout. From the looks of that code, I see that he is adding a time.sleep on local code, which has no implication on the response from an upstream LLM

"upstream_tls_ca_path", "/etc/ssl/certs/ca-certificates.crt"
)

upstream_timeout_ms = overrides.get("upstream_timeout_ms")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why we we have an upstream_timeout_rs field, when the model_listener object already has a timeout field. Can you elaborate a bit more?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use existing per listener timeout.

type: boolean
use_agent_orchestrator:
type: boolean
upstream_timeout_ms:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. I don't think we need this field, especially if we already support a timeout field for model_listener objects. Please review more carefully

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned over the zoom call - we don't need any changes to the prompt_gateway side of things. The issue talked about how the llm_gateway was the one timing out and the developer may have had a tool call scenario that could have taken longer.

@adilhafeez
Copy link
Copy Markdown
Contributor

So there are at multiple timeouts we are talking here,

  • Connection timeout — Envoy cluster connect_timeout (5s default)
  • Route timeout — Envoy route timeout (inbound client request lifecycle)
  • WASM outbound call timeout — dispatch_http_call Duration + x-envoy-upstream-rq-timeout-ms
  • Brightstaff outbound call timeout — reqwest client timeout (currently none)

For default values we should use sensible defaults for connection and request timeouts. And a developer should be able to modify them using overrides section in config.yaml. And defaults should be defined centrally somewhere and let's discuss their values here.

For example here is what I think default should be,

  • connection_timeout: 1s
  • request_timeout: 120s

@salmanap
Copy link
Copy Markdown
Contributor

So there are at multiple timeouts we are talking here,

  • Connection timeout — Envoy cluster connect_timeout (5s default)
  • Route timeout — Envoy route timeout (inbound client request lifecycle)
  • WASM outbound call timeout — dispatch_http_call Duration + x-envoy-upstream-rq-timeout-ms
  • Brightstaff outbound call timeout — reqwest client timeout (currently none)

For default values we should use sensible defaults for connection and request timeouts. And a developer should be able to modify them using overrides section in config.yaml. And defaults should be defined centrally somewhere and let's discuss their values here.

For example here is what I think default should be,

  • connection_timeout: 1s
  • request_timeout: 120s

I think we should only expose a single timeout field right now to the developer via the config and set sensible defaults for the rest. The one timeout field is request_timeout, and the rest are internal timeouts with sensible defaults. Note for arcfc.katanemo.dev we can't set a connection_timeout of 1s especially for non-US access to our hosted models. It must be higher.

@salmanap
Copy link
Copy Markdown
Contributor

@syedhashmi when are we wrapping this up? We need to get this over the finish line please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WASM filter hardcoded 30s timeout breaks agentic/tool-use workloads

3 participants