@lucksus commented Dec 9, 2025

While diagnosing the flaky test holochain/holochain#4174, I experimented with these changes in the hope of avoiding and/or mitigating loss of connection in some cases.

I'm posting this PR for visibility; I'm stopping work on it for now in favor of the iroh transport implementation.

Summary

This PR tries to fix critical connection management issues that caused threads to block indefinitely on failed connections and allowed zombie connections to remain in the peer map, leading to cascading timeout failures. It also fixes a channel capacity bug that caused "closed" errors under concurrent load.

Problem

The existing implementation had several critical issues:

  1. Indefinite Blocking: wait_for_ready() would block forever on a semaphore while waiting for a peer connection, so threads hung indefinitely whenever the connection failed
  2. Zombie Connection Reuse: Failed connections remained in peer_map and were reused by subsequent sends, causing repeated 20-second timeouts
  3. Channel Saturation: The event channel was hardcoded to 32 slots despite the config specifying 1024, causing channel saturation and "closed" errors under concurrent load
  4. No Failed-State Detection: There was no mechanism to proactively detect and clean up failed peer connections before reuse attempts

Changes

1. Add Timeout Parameter to wait_for_ready() (715648d)

  • Wrap semaphore acquire in tokio::time::timeout to prevent indefinite blocking
  • Add timeout parameter to MaybeReady::wait_for_ready() and Peer::wait_for_ready()
  • Update all call sites to pass config.timeout
  • Prevents threads from blocking indefinitely on failed connections (see the sketch below)
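
A minimal sketch of the pattern, assuming a semaphore-backed readiness signal and a stand-in error type (`K2Result` is a hypothetical alias here; the crate's real types may differ):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

/// Hypothetical alias standing in for the crate's result type.
type K2Result<T> = Result<T, std::io::Error>;

/// A connection that may still be negotiating.
struct MaybeReady {
    // Assumed convention: no permits while pending, a permit added on
    // success, and the semaphore closed on failure.
    ready: Arc<Semaphore>,
}

impl MaybeReady {
    /// Wait until the connection is ready, but give up after `timeout`
    /// instead of parking the caller forever on a failed connection.
    async fn wait_for_ready(&self, timeout: Duration) -> K2Result<()> {
        match tokio::time::timeout(timeout, self.ready.acquire()).await {
            // Got a permit: the connection became ready.
            Ok(Ok(_permit)) => Ok(()),
            // Semaphore closed: the connection failed outright.
            Ok(Err(_closed)) => Err(std::io::Error::other("connection failed")),
            // Timer fired first: stop waiting instead of blocking forever.
            Err(_elapsed) => Err(std::io::Error::other("timed out waiting for ready")),
        }
    }
}
```

Call sites would then pass the configured value, e.g. `peer.wait_for_ready(config.timeout).await?`.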

2. Add Failed Connection Detection (2a92be6)

  • Add MaybeReady::is_failed() to check MaybeReadyState::Failed
  • Add Peer::is_failed() wrapper method
  • Enables proactive cleanup of zombie connections before reuse attempts (see the sketch below)
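
A sketch of the check; `MaybeReadyState` is named in the PR, while the mutex-guarded state field is an assumption:

```rust
use std::sync::Mutex;

#[allow(dead_code)] // Pending/Ready shown only for context.
enum MaybeReadyState {
    Pending,
    Ready,
    Failed,
}

struct MaybeReady {
    state: Mutex<MaybeReadyState>,
}

impl MaybeReady {
    /// True once the connection has permanently failed, meaning it should
    /// be evicted from the peer map rather than handed out again.
    fn is_failed(&self) -> bool {
        matches!(*self.state.lock().unwrap(), MaybeReadyState::Failed)
    }
}

struct Peer {
    ready: MaybeReady,
}

impl Peer {
    /// Thin wrapper so callers holding a `Peer` can check failure directly.
    fn is_failed(&self) -> bool {
        self.ready.is_failed()
    }
}
```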

3. Fix Channel Capacity Bug (2b2dc39)

Critical fix: the endpoint event channel was hardcoded to 32 slots despite the config specifying internal_event_channel_size=1024 (see the sketch after this list).

  • ep.rs: Use config.internal_event_channel_size (32→1024, 32x increase)
  • hub.rs: Increase hub command channel from 32→256 (8x increase)
  • framed.rs: Increase cmd and msg channels from 32→256 (8x increase)
  • Prevents channel saturation during concurrent connection establishment and high message throughput
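
The fix itself is small; a hedged sketch of the ep.rs change (the config field name comes from the PR, everything else is illustrative):

```rust
use tokio::sync::mpsc;

/// Illustrative placeholder for the endpoint's internal event type.
struct EndpointEvent;

/// The slice of the transport config relevant here.
struct Config {
    /// Per the PR description, this defaults to 1024.
    internal_event_channel_size: usize,
}

fn make_event_channel(
    config: &Config,
) -> (mpsc::Sender<EndpointEvent>, mpsc::Receiver<EndpointEvent>) {
    // Before: `mpsc::channel(32)` silently ignored the configured size;
    // under concurrent load the buffer saturated, which surfaced as the
    // "closed" errors described above. After: honor the configured size.
    mpsc::channel(config.internal_event_channel_size)
}
```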

4. Add Aggressive Peer Cleanup (0c9d81f)

Prevents failed connections from remaining in peer_map and being reused, which would cause repeated timeouts (see the sketch after this list).

  • Add stale peer check in connect_peer() - remove failed peers before reuse
  • Add early cleanup in peer task() when connection fails during negotiation
  • Add cleanup in send() when wait_for_ready() times out
  • Add debug logging for peer lifecycle events
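
A sketch of the stale-peer check in connect_peer(); the map key type, locking, and Peer internals are assumptions, and the debug line uses the tracing crate:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Minimal stand-in peer that only tracks failure.
struct Peer {
    failed: Mutex<bool>,
}

impl Peer {
    fn new() -> Self {
        Peer { failed: Mutex::new(false) }
    }

    fn is_failed(&self) -> bool {
        *self.failed.lock().unwrap()
    }
}

struct Hub {
    peer_map: Mutex<HashMap<String, Arc<Peer>>>,
}

impl Hub {
    /// Return the connection for `addr`, evicting a failed entry first so
    /// a zombie connection is never handed back to a sender.
    fn connect_peer(&self, addr: &str) -> Arc<Peer> {
        let mut map = self.peer_map.lock().unwrap();
        if let Some(existing) = map.get(addr) {
            if !existing.is_failed() {
                // Healthy or still-pending connection: reuse it.
                return existing.clone();
            }
            // Stale peer check: evict the failed connection rather than
            // handing the zombie back and timing out (20s) yet again.
            tracing::debug!(%addr, "removing failed peer before reconnect");
            map.remove(addr);
        }
        let peer = Arc::new(Peer::new());
        map.insert(addr.to_string(), peer.clone());
        peer
    }
}
```

The same eviction would apply at the other two call sites named above: in the peer task() when negotiation fails, and in send() after wait_for_ready() times out.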

Impact

  • Prevents indefinite thread blocking on failed connections
  • Eliminates zombie connection reuse that caused cascading timeout failures
  • Fixes channel saturation under concurrent load (32x increase for main event channel)
  • Enables proactive failed connection detection and cleanup
  • Improves system reliability during high-concurrency scenarios and network failures

Testing

While these changes altered the signature of the flaky test failures, they didn't (yet?) result in 100% stable connections.

@cocogitto-bot commented Dec 9, 2025

✔️ 715648d...0c9d81f - Conventional commits check succeeded.

@coderabbitai commented Dec 9, 2025

Review skipped: draft detected.
