bug: Safeoutputs MCP sessions expire during long-running agent tasks (~30 min idle timeout) #3078

@lpcox

Summary

The MCP gateway drops safeoutputs sessions after approximately 30 minutes of inactivity. When an agent runs a long CPU-bound task (e.g., ML training, large builds) inside a shell tool call, no MCP requests are made during that period. The session expires server-side and becomes unrecoverable — all subsequent safeoutputs calls fail with session not found, and the agent has no way to deliver its results.

This is a correctness issue: safeoutputs is the only channel for the agent to report results, and it silently expires during the work the agent was asked to do.
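The failure mode described above can be modeled with a small in-memory sketch. This is not the gateway's actual implementation: `IdleSessionStore`, `SessionNotFound`, and the 30-minute default are hypothetical names and values chosen to match the observed behavior, assuming the gateway uses a sliding idle timeout that only client requests can reset.

```python
from dataclasses import dataclass, field


class SessionNotFound(Exception):
    """Raised when a request names a session the store no longer holds."""


@dataclass
class IdleSessionStore:
    """Hypothetical in-memory store with a sliding idle timeout."""

    idle_timeout_s: float = 30 * 60  # assumed from the ~30 min observation
    _last_seen: dict = field(default_factory=dict)

    def open(self, session_id: str, now: float) -> None:
        self._last_seen[session_id] = now

    def request(self, session_id: str, now: float) -> str:
        last = self._last_seen.get(session_id)
        if last is None or now - last > self.idle_timeout_s:
            # An expired session is dropped and cannot be revived.
            self._last_seen.pop(session_id, None)
            raise SessionNotFound("session not found")
        self._last_seen[session_id] = now  # activity resets the idle clock
        return "ok"


store = IdleSessionStore()
store.open("s1", now=0)
store.request("s1", now=25)  # early initialize / tools/list traffic
try:
    store.request("s1", now=25 + 33 * 60)  # first call after a ~33 min gap
except SessionNotFound as exc:
    print(exc)  # session not found
```

In this model, only a client request resets the idle clock, so a client that is blocked on a long shell command cannot keep its own session alive.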

Upstream issue

github/gh-aw#23153 (two independent reports)

Reproduction timeline (autoresearch_local)

| Time (UTC) | Event |
| --- | --- |
| 22:29:07 | safeoutputs MCP server started (v0.2.9), session established |
| 22:29:32 | First initialize + tools/list calls succeed |
| 22:50–23:00 | Agent runs ML training (~30 min, no MCP calls) |
| 23:00:34 | Training completes successfully |
| 23:02:21 | First safeoutputs failure: session not found |
| 23:02–23:13 | All subsequent calls fail: noop, create_pull_request, push_repo_memory, add_comment, missing_tool |
| 23:13:48 | Agent gives up: "safeoutputs MCP session is permanently expired" |
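The gaps in the timeline can be checked with a few lines of arithmetic. The sketch below assumes the calls at 22:29:32 were the last MCP traffic before the first failure; the timeline does not state this explicitly, so treat the idle figure as an upper bound.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def gap(start: str, end: str):
    """Return the elapsed time between two same-day HH:MM:SS stamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


idle = gap("22:29:32", "23:02:21")      # last successful call -> first failure
retrying = gap("23:02:21", "23:13:48")  # first failure -> agent gives up

print(idle)      # 0:32:49
print(retrying)  # 0:11:27
```

Under that assumption the session sat idle for just under 33 minutes, which is consistent with a timeout of roughly 30 minutes, and the retry window matches the ~11 minutes reported under Impact.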

Error

All calls return the same error:

✗ noop (MCP: safeoutputs)
  └ MCP server 'safeoutputs': Error: Streamable HTTP error: Error POSTing to endpoint: session not found

✗ create_pull_request (MCP: safeoutputs)
  └ MCP server 'safeoutputs': Error: Streamable HTTP error: Error POSTing to endpoint: session not found

Impact

  • Zero safe outputs completed — no PR, no comments, no repo-memory
  • Training succeeded (val_bpb 2.236 → 2.107) but results were lost
  • Agent wasted ~11 minutes retrying with sleep waits before giving up
  • Client-side workarounds (keepalive prompts) don't help because the agent can't send MCP calls while blocked on a long shell execution

Proposed fixes

Any of these would resolve the issue:

  1. Remove or significantly extend session timeout for safeoutputs — these sessions should live for the duration of the workflow (up to 6 hours for autoloop). A 30-minute idle timeout is incompatible with long-running tasks.

  2. Automatic keepalive from the gateway side — the gateway could ping/refresh sessions internally rather than relying on client activity.

  3. Transparent session reconnect — allow the client to re-establish a session when it receives session not found, without requiring manual intervention from the agent.
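As an illustration of option 3, here is a minimal sketch of a client wrapper that re-establishes the session and replays the call once when it sees session not found. Every name here (`ReconnectingClient`, `SessionNotFound`, `FakeSession`, `connect`) is hypothetical and stands in for whatever the real client and transport expose; none of it is the actual safeoutputs or MCP client API.

```python
class SessionNotFound(Exception):
    """Stand-in for the 'session not found' transport error."""


class ReconnectingClient:
    """Retries a call once on a fresh session when the old one has expired.

    `connect` is a hypothetical factory returning an object with a
    `call(tool, args)` method.
    """

    def __init__(self, connect):
        self._connect = connect
        self._session = connect()

    def call(self, tool, args=None):
        try:
            return self._session.call(tool, args or {})
        except SessionNotFound:
            # Re-run the handshake, then replay the call exactly once.
            self._session = self._connect()
            return self._session.call(tool, args or {})


class FakeSession:
    """Test double: a session that the server side can expire."""

    def __init__(self):
        self.expired = False

    def call(self, tool, args):
        if self.expired:
            raise SessionNotFound("session not found")
        return f"{tool}: ok"


sessions = []


def connect():
    sessions.append(FakeSession())
    return sessions[-1]


client = ReconnectingClient(connect)
sessions[0].expired = True   # simulate server-side expiry during a long task
print(client.call("noop"))   # noop: ok
print(len(sessions))         # 2
```

Replaying is only safe here on the assumption that a request rejected with session not found was never executed. The wrapper retries once so that a server which keeps rejecting new sessions surfaces the error instead of looping.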
