Skip to content

fix: prevent safeoutputs session expiry during long-running agent tasks#3079

Merged
lpcox merged 5 commits intomainfrom
copilot/fix-safeoutputs-session-expiration
Apr 3, 2026
Merged

fix: prevent safeoutputs session expiry during long-running agent tasks#3079
lpcox merged 5 commits intomainfrom
copilot/fix-safeoutputs-session-expiration

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 3, 2026

MCP sessions (notably safeoutputs) expire after ~30 minutes of backend inactivity. When agents run long CPU-bound tasks (ML training, large builds), no MCP calls are made, the server-side session silently expires, and all subsequent calls fail with session not found — making results undeliverable.

Changes

HTTP backend keepalive (primary fix)

  • Adds DefaultHTTPKeepaliveInterval = 25 * time.Minute — intentionally under the typical 30-min backend timeout
  • Threads a keepAlive time.Duration parameter through newMCPClient into the SDK's built-in ClientOptions.KeepAlive, which automatically sends periodic ping requests to keep the server-side session alive
  • Stores keepAliveInterval on the Connection struct so reconnectSDKTransport preserves the same keepalive on reconnect
  • STDIO connections use keepAlive=0 — process lifecycle manages the session, no pings needed
// HTTP connections now use keepalive; STDIO does not
client := newMCPClient(logConn, DefaultHTTPKeepaliveInterval) // HTTP
client := newMCPClient(logConn, 0)                            // stdio

Connection pool idle timeout extension (secondary fix)

  • DefaultIdleTimeout: 30m → 6h — aligns with the maximum autoloop workflow duration, preventing premature eviction of STDIO backend connections from the session pool during long quiet periods

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • example.com
    • Triggering command: /tmp/go-build1955081425/b336/launcher.test /tmp/go-build1955081425/b336/launcher.test -test.testlogfile=/tmp/go-build1955081425/b336/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/runtime/c-I AqUKQ403Y de/node/bin/as (dns block)
    • Triggering command: /tmp/go-build216210123/b264/launcher.test /tmp/go-build216210123/b264/launcher.test -test.testlogfile=/tmp/go-build216210123/b264/testlog.txt -test.paniconexit0 -test.timeout=10m0s 5081�� 5081425/b189/_pkg_.a ache/go/1.25.8/x64/src/regexp/syntax/compile.go x_amd64/vet -p l -lang=go1.25 x_amd64/vet go_.�� 64/src/net SkhS/x3QNW0bkE1zmain 64/pkg/tool/linu-lang=go1.25 -I /tmp/go-build195-buildid -I 64/pkg/tool/linu-dwarf=false (dns block)
    • Triggering command: /tmp/go-build3185549484/b001/launcher.test /tmp/go-build3185549484/b001/launcher.test -test.testlogfile=/tmp/go-build3185549484/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ve�� 359733afab535821 ache/go/1.25.8/x64/pkg/tool/linujson 4769656/b379/vet.cfg 524e3443524efd17git 5081425/b152/vetshow f33/log.json 841eeb0f198986d61e1/log.json -qE (dns block)
  • invalid-host-that-does-not-exist-12345.com
    • Triggering command: /tmp/go-build1955081425/b318/config.test /tmp/go-build1955081425/b318/config.test -test.testlogfile=/tmp/go-build1955081425/b318/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true -c=4 -nolocalimports -importcfg /tmp/go-build1955081425/b286/importcfg -pack /home/REDACTED/work/gh-aw-mcpg/gh-aw-mcpg/internal/oidc/provider.go 64/pkg/tool/linu/tmp/go-build1955081425/b124/vet.cfg --no�� go dedword.go 64/pkg/tool/linux_amd64/compile (dns block)
    • Triggering command: /tmp/go-build1294769656/b322/config.test /tmp/go-build1294769656/b322/config.test -test.testlogfile=/tmp/go-build1294769656/b322/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ctstate INVALID,NEW -j DROP --gdwarf-5 --64 -o x_amd64/compile (dns block)
  • nonexistent.local
    • Triggering command: /tmp/go-build1955081425/b336/launcher.test /tmp/go-build1955081425/b336/launcher.test -test.testlogfile=/tmp/go-build1955081425/b336/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/runtime/c-I AqUKQ403Y de/node/bin/as (dns block)
    • Triggering command: /tmp/go-build216210123/b264/launcher.test /tmp/go-build216210123/b264/launcher.test -test.testlogfile=/tmp/go-build216210123/b264/testlog.txt -test.paniconexit0 -test.timeout=10m0s 5081�� 5081425/b189/_pkg_.a ache/go/1.25.8/x64/src/regexp/syntax/compile.go x_amd64/vet -p l -lang=go1.25 x_amd64/vet go_.�� 64/src/net SkhS/x3QNW0bkE1zmain 64/pkg/tool/linu-lang=go1.25 -I /tmp/go-build195-buildid -I 64/pkg/tool/linu-dwarf=false (dns block)
    • Triggering command: /tmp/go-build3185549484/b001/launcher.test /tmp/go-build3185549484/b001/launcher.test -test.testlogfile=/tmp/go-build3185549484/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ve�� 359733afab535821 ache/go/1.25.8/x64/pkg/tool/linujson 4769656/b379/vet.cfg 524e3443524efd17git 5081425/b152/vetshow f33/log.json 841eeb0f198986d61e1/log.json -qE (dns block)
  • slow.example.com
    • Triggering command: /tmp/go-build1955081425/b336/launcher.test /tmp/go-build1955081425/b336/launcher.test -test.testlogfile=/tmp/go-build1955081425/b336/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/runtime/c-I AqUKQ403Y de/node/bin/as (dns block)
    • Triggering command: /tmp/go-build216210123/b264/launcher.test /tmp/go-build216210123/b264/launcher.test -test.testlogfile=/tmp/go-build216210123/b264/testlog.txt -test.paniconexit0 -test.timeout=10m0s 5081�� 5081425/b189/_pkg_.a ache/go/1.25.8/x64/src/regexp/syntax/compile.go x_amd64/vet -p l -lang=go1.25 x_amd64/vet go_.�� 64/src/net SkhS/x3QNW0bkE1zmain 64/pkg/tool/linu-lang=go1.25 -I /tmp/go-build195-buildid -I 64/pkg/tool/linu-dwarf=false (dns block)
    • Triggering command: /tmp/go-build3185549484/b001/launcher.test /tmp/go-build3185549484/b001/launcher.test -test.testlogfile=/tmp/go-build3185549484/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ve�� 359733afab535821 ache/go/1.25.8/x64/pkg/tool/linujson 4769656/b379/vet.cfg 524e3443524efd17git 5081425/b152/vetshow f33/log.json 841eeb0f198986d61e1/log.json -qE (dns block)
  • this-host-does-not-exist-12345.com
    • Triggering command: /tmp/go-build1955081425/b345/mcp.test /tmp/go-build1955081425/b345/mcp.test -test.testlogfile=/tmp/go-build1955081425/b345/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true --noprofile 64/src/runtime/c-ifaceassert x_amd64/vet (dns block)
    • Triggering command: /tmp/go-build216210123/b252/mcp.test /tmp/go-build216210123/b252/mcp.test -test.testlogfile=/tmp/go-build216210123/b252/testlog.txt -test.paniconexit0 -test.timeout=10m0s 5081�� internal/chacha8rand ache/go/1.25.8/x64/src/runtime/cgo x_amd64/vet 5081425/b171/ /tmp/go-build195run (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI and others added 2 commits April 3, 2026 00:31
…end connection pool idle timeout

Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/e656bbfd-cb96-4f37-b453-1acce2b238df

Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
…vs HTTP session management

Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/e656bbfd-cb96-4f37-b453-1acce2b238df

Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix safeoutputs MCP sessions expiration during long-running tasks fix: prevent safeoutputs session expiry during long-running agent tasks Apr 3, 2026
Copilot AI requested a review from lpcox April 3, 2026 00:34
Copilot finished work on behalf of lpcox April 3, 2026 00:34
Add keepalive_interval field to GatewayConfig (seconds). Default 1500
(25 min) preserves current behavior. Set to -1 to disable keepalive
pings entirely when higher-level timeouts manage session lifecycle.

Thread the configurable duration through NewHTTPConnection → trySDKTransport
→ newMCPClient instead of hardcoding DefaultHTTPKeepaliveInterval.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox lpcox marked this pull request as ready for review April 3, 2026 00:49
Copilot AI review requested due to automatic review settings April 3, 2026 00:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent MCP (notably safeoutputs) sessions from expiring during long-running, CPU-bound agent tasks by introducing periodic HTTP keepalive pings and by extending the STDIO connection pool idle eviction timeout.

Changes:

  • Add HTTP keepalive support by threading a keepAlive time.Duration through HTTP connection creation into the MCP SDK client options, and preserving it across reconnects.
  • Extend the connection pool default idle timeout from 30 minutes to 6 hours to avoid prematurely evicting long-idle STDIO backend connections.
  • Update affected call sites and tests to match the new NewHTTPConnection(..., keepAlive) API.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
internal/mcp/http_transport.go Introduces default keepalive interval constant and threads keepalive into SDK client creation and HTTP connection construction.
internal/mcp/connection.go Adds keepalive param to NewHTTPConnection and ensures reconnect reuses the stored keepalive interval; disables keepalive for STDIO.
internal/launcher/launcher.go Passes gateway-configured HTTP keepalive interval when creating HTTP connections.
internal/config/config_core.go Adds gateway keepalive configuration + defaulting and exposes it as time.Duration.
internal/launcher/connection_pool.go Extends default pool idle timeout to 6 hours.
internal/mcp/*_test.go, test/integration/http_error_test.go Updates tests to match the new HTTP connection signature and adds coverage for keepalive-related helpers/constants.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sessionID = session.ID()
}
logHTTP.Printf("Creating HTTP connection: serverID=%s, url=%s, transport=%s, headers=%d, sessionID=%s", serverID, url, transportType, len(headers), sessionID)
logHTTP.Printf("Creating HTTP connection: serverID=%s, url=%s, transport=%s, headers=%d, sessionID=%s, keepAlive=%v", serverID, url, transportType, len(headers), sessionID, keepAlive)
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sessionID is logged in full here. Backend session IDs can be sensitive and the codebase elsewhere truncates session IDs before logging (e.g., internal/auth.TruncateSessionID used in internal/server/*). Please truncate/redact the session ID (or omit it entirely) in this log line to avoid leaking secrets in logs.

Suggested change
logHTTP.Printf("Creating HTTP connection: serverID=%s, url=%s, transport=%s, headers=%d, sessionID=%s, keepAlive=%v", serverID, url, transportType, len(headers), sessionID, keepAlive)
logHTTP.Printf("Creating HTTP connection: serverID=%s, url=%s, transport=%s, headers=%d, keepAlive=%v", serverID, url, transportType, len(headers), keepAlive)

Copilot uses AI. Check for mistakes.
Comment on lines 121 to +127

// HTTPKeepaliveInterval returns the keepalive interval as a time.Duration.
// A negative KeepaliveInterval disables keepalive (returns 0).
func (g *GatewayConfig) HTTPKeepaliveInterval() time.Duration {
if g.KeepaliveInterval < 0 {
return 0
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KeepaliveInterval is added to GatewayConfig, but it isn’t wired into the stdin JSON config path: StdinGatewayConfig/the embedded JSON schema don’t include a keepalive field, and convertStdinConfig doesn’t call applyGatewayDefaults when stdinCfg.Gateway is present. As a result, for --config-stdin runs (which are schema-validated), keepalive will remain 0 and be effectively disabled, undermining the main fix. Please extend the stdin schema + StdinGatewayConfig to accept a keepalive value (likely keepaliveInterval in camelCase), propagate it into GatewayConfig, and ensure defaults are applied for the stdin conversion path too.

Suggested change
// HTTPKeepaliveInterval returns the keepalive interval as a time.Duration.
// A negative KeepaliveInterval disables keepalive (returns 0).
func (g *GatewayConfig) HTTPKeepaliveInterval() time.Duration {
if g.KeepaliveInterval < 0 {
return 0
}
// defaultHTTPKeepaliveIntervalSeconds is used when KeepaliveInterval is left unset.
// Negative values still explicitly disable keepalive.
const defaultHTTPKeepaliveIntervalSeconds = 30
// HTTPKeepaliveInterval returns the keepalive interval as a time.Duration.
// A negative KeepaliveInterval disables keepalive (returns 0).
// A zero KeepaliveInterval is treated as "unset" and falls back to the default.
func (g *GatewayConfig) HTTPKeepaliveInterval() time.Duration {
if g == nil {
return defaultHTTPKeepaliveIntervalSeconds * time.Second
}
if g.KeepaliveInterval < 0 {
return 0
}
if g.KeepaliveInterval == 0 {
return defaultHTTPKeepaliveIntervalSeconds * time.Second
}

Copilot uses AI. Check for mistakes.
Comment on lines +42 to 49
// DefaultIdleTimeout is the maximum duration a STDIO-backend connection can remain unused
// before being removed from the session pool. Set to 6 hours to accommodate long-running
// workflow tasks (e.g. ML training, large builds) that may not make MCP calls for extended
// periods. Note: this is distinct from HTTP backend keepalive (DefaultHTTPKeepaliveInterval)
// which keeps the remote session alive on the HTTP server side; STDIO connections run as local
// child processes whose sessions are bounded only by this pool eviction window.
DefaultIdleTimeout = 6 * time.Hour
DefaultCleanupInterval = 5 * time.Minute
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising DefaultIdleTimeout to 6h increases the time STDIO backend processes can remain resident when idle. In this pool’s cleanup path, removed entries are not actually closed (there is no call to Connection.Close()), so longer timeouts can prolong leaked resources if callers forget to close or if pool eviction is relied on for cleanup. Consider closing the underlying mcp.Connection when evicting (and/or confirming callers always close) so the longer timeout doesn’t translate into longer-lived orphaned processes.

Copilot uses AI. Check for mistakes.
Comment on lines +39 to +42
// DefaultHTTPKeepaliveInterval is the default interval for sending keepalive pings to HTTP backends.
// This should be less than the backend's session idle timeout (typically 30 minutes for safeoutputs).
// A value of 25 minutes gives a 5-minute buffer before the session would expire.
const DefaultHTTPKeepaliveInterval = 25 * time.Minute
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are now two separate “default keepalive” constants: DefaultHTTPKeepaliveInterval here (a time.Duration) and DefaultKeepaliveInterval in internal/config/config_core.go (seconds). Keeping defaults in two places risks drift over time. Consider using a single source of truth for the default (e.g., define it once and derive the other representation from it) to avoid inconsistent behavior between config defaults and code defaults.

Copilot uses AI. Check for mistakes.
- Remove session ID from HTTP connection log line to avoid leaking
  sensitive data (use TruncateSessionID pattern)
- Wire keepalive_interval into stdin JSON config path via
  StdinGatewayConfig.KeepaliveInterval and convertStdinConfig
- Add nil guard to GatewayConfig.HTTPKeepaliveInterval()
- Close connections during pool eviction instead of just marking state
- Make Connection.Close() nil-safe for cancel function
- Consolidate duplicate keepalive constants: remove
  DefaultHTTPKeepaliveInterval from mcp package, use single source of
  truth in config.DefaultKeepaliveInterval

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox
Copy link
Copy Markdown
Collaborator

lpcox commented Apr 3, 2026

Note on keepalive_interval = 0 semantics

applyGatewayDefaults() treats 0 as "unset" and silently overrides it to 1500 (25 min). This means a user who explicitly sets keepalive_interval = 0 expecting "no keepalive" will instead get 25-minute pings — a surprising default.

The -1 sentinel for "disabled" is documented in the GatewayConfig comment, but the 0 → default behavior isn't called out and could trip up users or integrators (e.g., gh-aw passing keepaliveInterval: 0 in stdin JSON thinking it disables keepalive).

Suggestion: Consider using a pointer (*int) for KeepaliveInterval in GatewayConfig (like StdinGatewayConfig already does) so that nil = use default, 0 = explicitly disabled, -1 isn't needed. Alternatively, at minimum document that 0 maps to the default and -1 is the disable value.

Not a blocker — the fix itself is solid and the 25-min default is the right choice for almost all users.

@lpcox lpcox merged commit f985606 into main Apr 3, 2026
17 checks passed
@lpcox lpcox deleted the copilot/fix-safeoutputs-session-expiration branch April 3, 2026 03:31
@lpcox lpcox restored the copilot/fix-safeoutputs-session-expiration branch April 3, 2026 05:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Safeoutputs MCP sessions expire during long-running agent tasks (~30 min idle timeout)

3 participants