bug: Safeoutputs MCP sessions expire during long-running agent tasks (~30 min idle timeout) #3078
Description
Summary
The MCP gateway drops safeoutputs sessions after approximately 30 minutes of inactivity. When an agent runs a long CPU-bound task (e.g., ML training, large builds) inside a shell tool call, no MCP requests are made during that period. The session expires server-side and becomes unrecoverable — all subsequent safeoutputs calls fail with session not found, and the agent has no way to deliver its results.
This is a correctness issue: safeoutputs is the only channel for the agent to report results, and it silently expires during the work the agent was asked to do.
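The failure mode can be illustrated with a small model of an idle-expiring session table. This is only a sketch: the gateway's actual implementation is not shown in this issue, so `SessionStore`, `FakeClock`, `simulate`, and the exact 30-minute constant are hypothetical names and values chosen to match the behavior described above.

```python
from typing import Callable, Dict
import time
import uuid

# Assumption: taken from the ~30 min idle timeout observed in the report.
IDLE_TIMEOUT_SECONDS = 30 * 60


class SessionNotFound(Exception):
    """Mirrors the 'session not found' error text from the report."""


class SessionStore:
    """Hypothetical model of a gateway session table (not the real gateway code)."""

    def __init__(self, idle_timeout: float = IDLE_TIMEOUT_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._idle_timeout = idle_timeout
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def create(self) -> str:
        session_id = uuid.uuid4().hex
        self._last_seen[session_id] = self._clock()
        return session_id

    def touch(self, session_id: str) -> None:
        """Handle one request: reject unknown or expired sessions, else refresh."""
        last = self._last_seen.get(session_id)
        if last is None or self._clock() - last > self._idle_timeout:
            self._last_seen.pop(session_id, None)
            raise SessionNotFound("session not found")
        self._last_seen[session_id] = self._clock()


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def simulate(idle_seconds: float) -> bool:
    """Return True if a call made after idle_seconds of silence still succeeds."""
    clock = FakeClock()
    store = SessionStore(clock=clock)
    session_id = store.create()
    store.touch(session_id)       # e.g. initialize + tools/list
    clock.advance(idle_seconds)   # agent is blocked in a long shell call
    try:
        store.touch(session_id)   # first safeoutputs call after the task
        return True
    except SessionNotFound:
        return False
```

In this model, `simulate(10 * 60)` succeeds while `simulate(31 * 60)` fails, and every later call with the same session ID fails as well because the entry has been removed.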
Upstream issue
github/gh-aw#23153 — two independent reports:
- dsyme/fv-squad: 45-minute job, session expired
- githubnext/autoresearch_local: 30-minute training run, session expired
Reproduction timeline (autoresearch_local)
| Time (UTC) | Event |
|---|---|
| 22:29:07 | safeoutputs MCP server started (v0.2.9), session established |
| 22:29:32 | First initialize + tools/list calls succeed |
| 22:50–23:00 | Agent runs ML training (~30 min, no MCP calls) |
| 23:00:34 | Training completes successfully |
| 23:02:21 | First safeoutputs failure: session not found |
| 23:02–23:13 | All subsequent calls fail — noop, create_pull_request, push_repo_memory, add_comment, missing_tool |
| 23:13:48 | Agent gives up: "safeoutputs MCP session is permanently expired" |
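As a quick check on the timestamps in the table, the gaps can be computed directly. This is a sketch using only the times listed above; the table may omit MCP calls made between 22:29:32 and the start of training, so the first value is an upper bound on the idle period rather than a measured one. The helper name `gap` is illustrative.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def gap(start: str, end: str) -> timedelta:
    """Elapsed time between two same-day HH:MM:SS timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Last successful call listed in the table to the first failure.
last_success_to_first_failure = gap("22:29:32", "23:02:21")

# First failure to the point where the agent gives up.
retry_window = gap("23:02:21", "23:13:48")

# Compare against the suspected 30-minute idle timeout.
exceeds_timeout = last_success_to_first_failure > timedelta(minutes=30)
```

The first gap is 32 min 49 s, which is past a 30-minute idle timeout, and the retry window is 11 min 27 s.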
Error
All calls return the same error:
✗ noop (MCP: safeoutputs)
└ MCP server 'safeoutputs': Error: Streamable HTTP error: Error POSTing to endpoint: session not found
✗ create_pull_request (MCP: safeoutputs)
└ MCP server 'safeoutputs': Error: Streamable HTTP error: Error POSTing to endpoint: session not found
Impact
- Zero safe outputs completed — no PR, no comments, no repo-memory
- Training succeeded (val_bpb 2.236 → 2.107) but results were lost
- Agent wasted ~11 minutes retrying with sleep waits before giving up
- Client-side workarounds (keepalive prompts) don't help because the agent can't send MCP calls while blocked on a long shell execution
Proposed fixes
Any of these would resolve the issue:
- Remove or significantly extend the session timeout for safeoutputs. These sessions should live for the duration of the workflow (up to 6 hours for autoloop); a 30-minute idle timeout is incompatible with long-running tasks.
- Automatic keepalive from the gateway side. The gateway could ping/refresh sessions internally rather than relying on client activity.
- Transparent session reconnect. Allow the client to re-establish a session when it receives session not found, without requiring manual intervention from the agent.
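The transparent reconnect option could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual client or gateway code: `Transport`, `ReconnectingClient`, and `FlakyTransport` are hypothetical names, and the sketch assumes a fresh session can be used for safeoutputs calls without any state carried over from the expired one.

```python
from typing import Any, Dict, Protocol, Set


class SessionNotFound(Exception):
    """Mirrors the 'session not found' error text from the report."""


class Transport(Protocol):
    """Hypothetical transport interface; not a real MCP SDK type."""

    def initialize(self) -> str: ...

    def call(self, session_id: str, tool: str, args: Dict[str, Any]) -> Any: ...


class ReconnectingClient:
    """Re-establishes the session once when the server reports it is gone."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport
        self._session_id = transport.initialize()

    def call_tool(self, tool: str, args: Dict[str, Any]) -> Any:
        try:
            return self._transport.call(self._session_id, tool, args)
        except SessionNotFound:
            # Session expired server-side: initialize again and retry once.
            # A second failure propagates to the caller unchanged.
            self._session_id = self._transport.initialize()
            return self._transport.call(self._session_id, tool, args)


class FlakyTransport:
    """In-memory stand-in for a server that can forget all of its sessions."""

    def __init__(self) -> None:
        self.valid: Set[str] = set()
        self.initialize_count = 0

    def initialize(self) -> str:
        self.initialize_count += 1
        session_id = f"session-{self.initialize_count}"
        self.valid.add(session_id)
        return session_id

    def expire_all(self) -> None:
        self.valid.clear()

    def call(self, session_id: str, tool: str, args: Dict[str, Any]) -> Any:
        if session_id not in self.valid:
            raise SessionNotFound("session not found")
        return {"tool": tool, "ok": True}


transport = FlakyTransport()
client = ReconnectingClient(transport)
transport.expire_all()                 # simulate the idle timeout firing
result = client.call_tool("noop", {})  # succeeds after one re-initialize
```

Retrying only once keeps a genuinely broken server from causing an endless loop; whether a new session is acceptable for safeoutputs depends on how much state the server ties to the session ID.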
References
- Upstream issue: MCP gateway drops out in long running jobs: Streamable HTTP error: Error POSTing to endpoint: session not found gh-aw#23153
- Comment with detailed analysis by @insop
- Client-side keepalive attempt (ineffective): https://github.com/githubnext/autoresearch_local/commit/b856479