@lucksus commented Dec 9, 2025

While diagnosing the flaky test holochain/holochain#4174, I experimented with these changes in the hope of avoiding and/or mitigating loss of connection in some cases.

I'm posting this PR for visibility; I'm stopping work on it for now in favor of the iroh transport implementation.

Summary

This PR tries to fix critical connection management issues that caused threads to block indefinitely on failed connections and allowed zombie connections to remain in the peer map, leading to cascading timeout failures. It also fixes a channel capacity bug that caused "closed" errors under concurrent load.

Problem

The existing implementation had several critical issues:

  1. Indefinite Blocking: wait_for_ready() would block forever on a semaphore while waiting for a peer connection, so threads hung indefinitely whenever the connection failed
  2. Zombie Connection Reuse: Failed connections remained in peer_map and were reused by subsequent sends, causing repeated 20-second timeouts
  3. Channel Saturation: The event channel was hardcoded to 32 slots despite the config specifying 1024, causing channel saturation and "closed" errors under concurrent load
  4. No Failed-State Detection: There was no mechanism to proactively detect and clean up failed peer connections before reuse attempts

Changes

1. Add Timeout Parameter to wait_for_ready() (715648d)

  • Wrap semaphore acquire in tokio::time::timeout to prevent indefinite blocking
  • Add timeout parameter to MaybeReady::wait_for_ready() and Peer::wait_for_ready()
  • Update all call sites to pass config.timeout
  • Prevents threads from blocking indefinitely on failed connections (see the sketch below)
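
A minimal sketch of the pattern, assuming a semaphore-backed readiness signal and a stand-in error type (`K2Result` is a hypothetical alias here; the crate's real types may differ):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

/// Hypothetical alias standing in for the crate's result type.
type K2Result<T> = Result<T, std::io::Error>;

/// A connection that may still be negotiating.
struct MaybeReady {
    // Assumed convention: no permits while pending, a permit added on
    // success, and the semaphore closed on failure.
    ready: Arc<Semaphore>,
}

impl MaybeReady {
    /// Wait until the connection is ready, but give up after `timeout`
    /// instead of parking the caller forever on a failed connection.
    async fn wait_for_ready(&self, timeout: Duration) -> K2Result<()> {
        match tokio::time::timeout(timeout, self.ready.acquire()).await {
            // Got a permit: the connection became ready.
            Ok(Ok(_permit)) => Ok(()),
            // Semaphore closed: the connection failed outright.
            Ok(Err(_closed)) => Err(std::io::Error::other("connection failed")),
            // Timer fired first: stop waiting instead of blocking forever.
            Err(_elapsed) => Err(std::io::Error::other("timed out waiting for ready")),
        }
    }
}
```

Call sites would then pass the configured value, e.g. `peer.wait_for_ready(config.timeout).await?`.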

2. Add Failed Connection Detection (2a92be6)

  • Add MaybeReady::is_failed() to check MaybeReadyState::Failed
  • Add Peer::is_failed() wrapper method
  • Enables proactive cleanup of zombie connections before reuse attempts (see the sketch below)
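
A sketch of the check; `MaybeReadyState` is named in the PR, while the mutex-guarded state field is an assumption:

```rust
use std::sync::Mutex;

#[allow(dead_code)] // Pending/Ready shown only for context.
enum MaybeReadyState {
    Pending,
    Ready,
    Failed,
}

struct MaybeReady {
    state: Mutex<MaybeReadyState>,
}

impl MaybeReady {
    /// True once the connection has permanently failed, meaning it should
    /// be evicted from the peer map rather than handed out again.
    fn is_failed(&self) -> bool {
        matches!(*self.state.lock().unwrap(), MaybeReadyState::Failed)
    }
}

struct Peer {
    ready: MaybeReady,
}

impl Peer {
    /// Thin wrapper so callers holding a `Peer` can check failure directly.
    fn is_failed(&self) -> bool {
        self.ready.is_failed()
    }
}
```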

3. Fix Channel Capacity Bug (2b2dc39)

Critical fix: the endpoint event channel was hardcoded to 32 slots despite the config specifying internal_event_channel_size=1024 (see the sketch after this list).

  • ep.rs: Use config.internal_event_channel_size (32→1024, 32x increase)
  • hub.rs: Increase hub command channel from 32→256 (8x increase)
  • framed.rs: Increase cmd and msg channels from 32→256 (8x increase)
  • Prevents channel saturation during concurrent connection establishment and high message throughput
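
The fix itself is small; a hedged sketch of the ep.rs change (the config field name comes from the PR, everything else is illustrative):

```rust
use tokio::sync::mpsc;

/// Illustrative placeholder for the endpoint's internal event type.
struct EndpointEvent;

/// The slice of the transport config relevant here.
struct Config {
    /// Per the PR description, this defaults to 1024.
    internal_event_channel_size: usize,
}

fn make_event_channel(
    config: &Config,
) -> (mpsc::Sender<EndpointEvent>, mpsc::Receiver<EndpointEvent>) {
    // Before: `mpsc::channel(32)` silently ignored the configured size;
    // under concurrent load the buffer saturated, which surfaced as the
    // "closed" errors described above. After: honor the configured size.
    mpsc::channel(config.internal_event_channel_size)
}
```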

4. Add Aggressive Peer Cleanup (0c9d81f)

Prevents failed connections from remaining in peer_map and being reused, which would cause repeated timeouts (see the sketch after this list).

  • Add stale peer check in connect_peer() - remove failed peers before reuse
  • Add early cleanup in peer task() when connection fails during negotiation
  • Add cleanup in send() when wait_for_ready() times out
  • Add debug logging for peer lifecycle events
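
A sketch of the stale-peer check in connect_peer(); the map key type, locking, and Peer internals are assumptions, and the debug line uses the tracing crate:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Minimal stand-in peer that only tracks failure.
struct Peer {
    failed: Mutex<bool>,
}

impl Peer {
    fn new() -> Self {
        Peer { failed: Mutex::new(false) }
    }

    fn is_failed(&self) -> bool {
        *self.failed.lock().unwrap()
    }
}

struct Hub {
    peer_map: Mutex<HashMap<String, Arc<Peer>>>,
}

impl Hub {
    /// Return the connection for `addr`, evicting a failed entry first so
    /// a zombie connection is never handed back to a sender.
    fn connect_peer(&self, addr: &str) -> Arc<Peer> {
        let mut map = self.peer_map.lock().unwrap();
        if let Some(existing) = map.get(addr) {
            if !existing.is_failed() {
                // Healthy or still-pending connection: reuse it.
                return existing.clone();
            }
            // Stale peer check: evict the failed connection rather than
            // handing the zombie back and timing out (20s) yet again.
            tracing::debug!(%addr, "removing failed peer before reconnect");
            map.remove(addr);
        }
        let peer = Arc::new(Peer::new());
        map.insert(addr.to_string(), peer.clone());
        peer
    }
}
```

The same eviction would apply at the other two call sites named above: in the peer task() when negotiation fails, and in send() after wait_for_ready() times out.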

Impact

  • Prevents indefinite thread blocking on failed connections
  • Eliminates zombie connection reuse that caused cascading timeout failures
  • Fixes channel saturation under concurrent load (32x increase for main event channel)
  • Enables proactive failed connection detection and cleanup
  • Improves system reliability during high-concurrency scenarios and network failures

Testing

While these changes altered the signature of the flaky test failures, they didn't (yet?) result in 100% stable connections.

@cocogitto-bot commented Dec 9, 2025

✔️ 715648d...0c9d81f - Conventional commits check succeeded.

@coderabbitai commented Dec 9, 2025

Review skipped: draft detected.
