Summary
ConnectionMonitor creates an adaptive timeout signal for every heartbeat ping, but the signal is never cleaned up. This retains listeners/refs created by any-signal + AbortSignal.timeout() and causes sustained memory growth in long-running nodes.
Affected code
- packages/libp2p/src/connection-monitor.ts
- packages/utils/src/adaptive-timeout.ts
Current flow in ConnectionMonitor:
- every pingInterval (default 10s), for each active connection:
  - const signal = this.timeout.getTimeoutSignal(...)
  - use signal for newStream, write, read, close
- no call to this.timeout.cleanUp(signal) in success/failure paths
Current AdaptiveTimeout.cleanUp(signal) updates moving averages, but does not call signal.clear().
Why this leaks
getTimeoutSignal() composes signals via anySignal([options.signal, AbortSignal.timeout(timeout)]).
anySignal returns a ClearableSignal with signal.clear() to detach abort listeners and release references. If clear() is not called, listener/ref chains accumulate under sustained request volume.
In our case, ConnectionMonitor runs per-peer pings continuously, so this path is very hot.
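To make the accumulation concrete, here is a minimal sketch. It does not use the real any-signal package: composeSignals is a simplified, hypothetical stand-in that only mirrors the "attach abort listeners, detach them in clear()" shape, and the shutdown controller and per-ping controller are assumptions standing in for a long-lived options.signal and for AbortSignal.timeout(). The counter tracks how many abort listeners remain attached after a run of simulated pings.

```typescript
// Simplified stand-in for a clearable composite signal.
// This is NOT the any-signal implementation, only the same general shape.
interface ClearableSignal extends AbortSignal {
  clear: () => void
}

// Number of abort listeners currently attached to parent signals.
let liveListeners = 0

function composeSignals (signals: AbortSignal[]): ClearableSignal {
  const controller = new AbortController()
  let attached = true

  // Detach every listener this composite registered (idempotent).
  const clear = (): void => {
    if (!attached) return
    attached = false
    for (const s of signals) {
      s.removeEventListener('abort', onAbort)
      liveListeners--
    }
  }

  // Abort the composite as soon as any parent aborts, then detach.
  const onAbort = (): void => {
    controller.abort()
    clear()
  }

  for (const s of signals) {
    s.addEventListener('abort', onAbort)
    liveListeners++
  }

  return Object.assign(controller.signal, { clear })
}

// Simulate `count` heartbeat pings and report the listeners left behind.
function simulatePings (count: number, callClear: boolean): number {
  liveListeners = 0
  const shutdown = new AbortController() // hypothetical long-lived parent signal

  for (let i = 0; i < count; i++) {
    const perPing = new AbortController() // stand-in for AbortSignal.timeout(ms)
    const signal = composeSignals([shutdown.signal, perPing.signal])
    // ... the ping would use `signal` for newStream/write/read/close here ...
    if (callClear) signal.clear()
  }

  return liveListeners
}

console.log(simulatePings(1000, false)) // 2000 listeners still attached
console.log(simulatePings(1000, true)) // 0 listeners still attached
```

In this sketch each uncleared composite leaves one listener per parent signal, and each listener closure keeps its composite controller reachable for as long as the parent lives. That is the general shape of the retention described above, not a measurement of the real code path.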
Evidence from production + local A/B
Production symptom
In ChainSafe/lodestar (libp2p v3 migration), network worker old-space regressed by ~215 MB over 5 days on unstable while stable remained flat.
Local A/B soak (same host, same commit/config)
- patched node: ConnectionMonitor cleanup + AdaptiveTimeout.cleanUp -> signal.clear()
- unpatched node: upstream behavior
- runtime: 6.9h, 817 samples @ 30s
Smoothed growth rates:
- patched: 8.8 MB/h
- unpatched: 10.8 MB/h
- delta: ~2.0 MB/h faster on unpatched
Linear slope delta stays consistent across windows (full/4h/2h/1h): ~1.7–2.5 MB/h faster on unpatched.
That delta matches the observed long-run production leak scale: ~215 MB over 5 days (120 h) is ~1.8 MB/h.
Proposed fix
- In ConnectionMonitor, always call this.timeout.cleanUp(signal) in a finally block for each heartbeat ping attempt.
- In AdaptiveTimeout.cleanUp(signal), call signal.clear() before recording timing stats.
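
The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: SketchAdaptiveTimeout, TimedSignal, and pingOnce are hypothetical, simplified names, and the real classes carry more state and options (moving averages, metrics, stream handling). The sketch only shows the ordering inside cleanUp and the finally placement.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface TimedSignal extends AbortSignal {
  clear: () => void
  start: number
}

class SketchAdaptiveTimeout {
  public cleared = 0 // how many signals have been cleared
  public samples: number[] = [] // recorded durations in ms

  getTimeoutSignal (): TimedSignal {
    const controller = new AbortController()
    return Object.assign(controller.signal, {
      start: Date.now(),
      // The real clear() would detach abort listeners; here we only count calls.
      clear: () => { this.cleared++ }
    })
  }

  cleanUp (signal: TimedSignal): void {
    signal.clear() // release listeners/refs first
    this.samples.push(Date.now() - signal.start) // then record timing stats
  }
}

// One heartbeat ping attempt. `doPing` is a hypothetical callback standing in
// for the newStream/write/read/close sequence.
async function pingOnce (
  timeout: SketchAdaptiveTimeout,
  doPing: (signal: AbortSignal) => Promise<void>
): Promise<boolean> {
  const signal = timeout.getTimeoutSignal()
  try {
    await doPing(signal)
    return true
  } catch {
    return false // a failed ping must not skip cleanup
  } finally {
    timeout.cleanUp(signal) // runs on both success and failure paths
  }
}
```

Placing cleanUp in finally means a thrown error or an early return cannot bypass it, so every signal obtained from getTimeoutSignal is cleared exactly once per attempt.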
I opened a PR with both changes + tests.
Reference
Downstream investigation thread with heap evidence and elimination of other suspects: