WIP: Harden Connection Management to Prevent Zombie Connections and Indefinite Blocking #207
While diagnosing the flaky test holochain/holochain#4174, I experimented with these changes in the hope of avoiding or mitigating loss of connection in some cases.
I'm posting this PR for visibility - stopping work on this for now in favor of the iroh transport implementation.
Summary
This PR tries to fix critical connection management issues that caused threads to block indefinitely on failed connections and allowed zombie connections to remain in the peer map, leading to cascading timeout failures. It also fixes a channel capacity bug that caused "closed" errors under concurrent load.
Problem
The existing implementation had several critical issues:
- wait_for_ready() would block indefinitely on a semaphore when waiting for peer connections, causing threads to hang forever when connections failed
- Failed connections remained in peer_map and were reused by subsequent sends, causing repeated 20-second timeouts
Changes
1. Add Timeout Parameter to wait_for_ready() (715648d)
- Uses tokio::time::timeout to prevent indefinite blocking
- Applies to MaybeReady::wait_for_ready() and Peer::wait_for_ready()
- Timeout taken from config.timeout
2. Add Failed Connection Detection (2a92be6)
MaybeReady::is_failed()to checkMaybeReadyState::FailedPeer::is_failed()wrapper method3. Fix Channel Capacity Bug (2b2dc39)
- Critical fix: the endpoint event channel was hardcoded to a capacity of 32 despite the config specifying 1024
- Now uses config.internal_event_channel_size (32→1024, a 32x increase)
4. Add Aggressive Peer Cleanup (0c9d81f)
Prevents failed connections from remaining in peer_map and being reused, which would cause repeated timeouts. Cleanup happens:
- In connect_peer() - remove failed peers before reuse
- In task() when the connection fails during negotiation
- In send() when wait_for_ready() times out
Impact
Testing
While this changed the signature of flaky test failures, it didn't (yet?) result in 100% stable connections.
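To make changes 1, 2 and 4 concrete, here is a rough, std-only sketch of the idea, not the code in this PR. The PR wraps an async semaphore wait in tokio::time::timeout; the sketch swaps in a blocking Condvar with wait_timeout_while so it runs without an async runtime. The struct layout, the PeerMap alias, the WaitError enum, the set() helper and the function signatures are all assumptions made for illustration; only the names wait_for_ready, is_failed, MaybeReadyState::Failed, connect_peer, send and peer_map come from the description above.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical stand-in for the PR's MaybeReadyState.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MaybeReadyState {
    Pending,
    Ready,
    Failed,
}

/// Hypothetical error type for a bounded wait.
#[derive(Debug, PartialEq, Eq)]
enum WaitError {
    Failed,
    TimedOut,
}

/// Simplified readiness gate (assumed shape; the real type is async).
struct MaybeReady {
    state: Mutex<MaybeReadyState>,
    changed: Condvar,
}

impl MaybeReady {
    fn new(state: MaybeReadyState) -> Self {
        Self { state: Mutex::new(state), changed: Condvar::new() }
    }

    /// Updates the state and wakes every waiter.
    fn set(&self, state: MaybeReadyState) {
        *self.state.lock().unwrap() = state;
        self.changed.notify_all();
    }

    fn is_failed(&self) -> bool {
        *self.state.lock().unwrap() == MaybeReadyState::Failed
    }

    /// Waits until the state leaves Pending, but never longer than `timeout`.
    fn wait_for_ready(&self, timeout: Duration) -> Result<(), WaitError> {
        let guard = self.state.lock().unwrap();
        let (guard, _) = self
            .changed
            .wait_timeout_while(guard, timeout, |s| *s == MaybeReadyState::Pending)
            .unwrap();
        match *guard {
            MaybeReadyState::Ready => Ok(()),
            MaybeReadyState::Failed => Err(WaitError::Failed),
            MaybeReadyState::Pending => Err(WaitError::TimedOut),
        }
    }
}

/// Assumed shape of the peer map: peer URL to shared readiness gate.
type PeerMap = HashMap<String, Arc<MaybeReady>>;

/// Reuses a peer only if it has not failed; a failed entry is evicted
/// and replaced by a fresh pending one.
fn connect_peer(peer_map: &mut PeerMap, url: &str) -> Arc<MaybeReady> {
    if peer_map.get(url).map_or(false, |p| p.is_failed()) {
        peer_map.remove(url);
    }
    peer_map
        .entry(url.to_string())
        .or_insert_with(|| Arc::new(MaybeReady::new(MaybeReadyState::Pending)))
        .clone()
}

/// Waits for the peer with a bound and evicts it if the wait fails,
/// so the next send starts from a fresh connection attempt.
fn send(peer_map: &mut PeerMap, url: &str, timeout: Duration) -> Result<(), WaitError> {
    let peer = connect_peer(peer_map, url);
    let result = peer.wait_for_ready(timeout);
    if result.is_err() {
        peer_map.remove(url);
    }
    result
}
```

In this sketch a peer that never becomes ready costs one bounded wait and is then dropped from the map, rather than being handed back to every later send.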