Skip to content

Conversation

@mattyg
Copy link
Member

@mattyg mattyg commented Nov 11, 2025

This is 2 changes to reduce test flakes of sending_second_message_after_remote_disconnect_succeeds_after_disconnect_event.

This PR seems to eliminate the flakes completely. On main it consistently failed with 10 test runs. In this PR it does not arise within 500 test runs with either backend. The test run that reproduced it was sending_second_message_after_remote_disconnect_succeeds_after_disconnect_event.

The changes are explained in inline comments.

Summary by CodeRabbit

  • Bug Fixes
    • Peer disconnects now use a 10-second timeout and log failures/timeouts to improve reliability and diagnostics.
    • Cleanup of peer connections is now driven by an explicit guard to avoid premature or immediate disconnects, reducing race conditions during shutdown and tests.

@coderabbitai
Copy link

coderabbitai bot commented Nov 11, 2025

Walkthrough

Modified peer cleanup in crates/tx5/src/peer.rs: DropPeer's Drop now sends the disconnect notification with a 10-second timeout and logs failures/timeouts; the local drop guard was renamed and explicit end-of-loop disconnects were removed; comments explaining manual drop timing were added.

Changes

Cohort / File(s) Summary
DropPeer cleanup & task-local changes
crates/tx5/src/peer.rs
DropPeer::drop now sends the Disconnected notification under a 10-second timeout and logs on error/timeout; local drop guard renamed from _drop to drop_guard; removed immediate end-of-loop disconnect send and added comments explaining manual drop/cleanup timing and test-related behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Check correctness of the 10s timeout usage and that errors/timeouts are logged appropriately.
  • Verify manual drop_guard dropping path does not introduce races or change cleanup ordering.
  • Confirm the rename is applied consistently and that removed end-of-loop send doesn't regress behavior in tests or shutdown sequences.

Possibly related PRs

Suggested reviewers

  • ThetaSinner
  • matthme
  • neonphog
  • jost-s

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main fix: preventing Disconnect notification from being sent until after the peer is removed from peer_map, addressing a race condition causing test flakes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/hubmap-removal-race

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8133256 and fbc8b30.

📒 Files selected for processing (1)
  • crates/tx5/src/peer.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ThetaSinner
Repo: holochain/tx5 PR: 167
File: crates/tx5/tests/tests/flaky_sig.rs:73-84
Timestamp: 2025-08-08T08:41:38.069Z
Learning: Repo holochain/tx5: In test code (crates/tx5/tests/tests/flaky_sig.rs), maintainer (ThetaSinner) prefers not to refactor non-ideal async patterns; leaving block_on in Drop for FlakyRelay is acceptable. Treat similar test-only cleanup suggestions as non-blocking unless they cause flakes/panics.
📚 Learning: 2025-08-08T08:41:38.069Z
Learnt from: ThetaSinner
Repo: holochain/tx5 PR: 167
File: crates/tx5/tests/tests/flaky_sig.rs:73-84
Timestamp: 2025-08-08T08:41:38.069Z
Learning: Repo holochain/tx5: In test code (crates/tx5/tests/tests/flaky_sig.rs), maintainer (ThetaSinner) prefers not to refactor non-ideal async patterns; leaving block_on in Drop for FlakyRelay is acceptable. Treat similar test-only cleanup suggestions as non-blocking unless they cause flakes/panics.

Applied to files:

  • crates/tx5/src/peer.rs
🧬 Code graph analysis (1)
crates/tx5/src/peer.rs (2)
crates/tx5/src/sig.rs (3)
  • task (171-265)
  • drop (18-20)
  • drop (151-166)
crates/tx5-connection/src/conn.rs (2)
  • drop (49-60)
  • drop (418-420)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test (windows-latest)
  • GitHub Check: test (x86_64, 33, 26.0.10792818, datachannel)
  • GitHub Check: Test (ubuntu-latest)
  • GitHub Check: test (x86_64, 29, 26.0.10792818, datachannel)
  • GitHub Check: test (x86_64, 33, 26.0.10792818, go-pion)
  • GitHub Check: test (x86_64, 29, 26.0.10792818, go-pion)
🔇 Additional comments (3)
crates/tx5/src/peer.rs (3)

192-217: LGTM! Proper async cleanup in Drop with timeout protection.

The disconnect notification is now sent with a 10-second timeout and appropriate error logging. Spawning the async work ensures the Drop remains non-blocking, and the timeout prevents indefinite stalls. This properly sequences the disconnect notification after peer removal from the map, addressing the race condition.


243-243: Good rename for clarity.

Changing from _drop to drop_guard better reflects that the guard is now explicitly manipulated at line 373, rather than just relying on implicit scope-based cleanup.


363-373: Pragmatic solution with clear documentation.

The manual drop is unusual since Rust's RAII would handle cleanup automatically at scope end, but your empirical testing (500 successful runs vs. consistent failures without it) validates this approach. The extensive comments clearly document both the intent and the uncertainty, which is valuable for future maintainers.

Given the significant reduction in test flakiness and the maintainer consensus from past reviews, this pragmatic solution is appropriate.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mattyg mattyg force-pushed the fix/hubmap-removal-race branch from 2366e81 to 967e6e0 Compare November 11, 2025 12:29
@mattyg mattyg requested review from a team and removed request for a team November 12, 2025 00:20
@mattyg mattyg force-pushed the fix/hubmap-removal-race branch from 967e6e0 to f189357 Compare November 13, 2025 01:18
@mattyg mattyg force-pushed the fix/hubmap-removal-race branch 2 times, most recently from 89d8745 to 0429b26 Compare November 13, 2025 04:15
@mattyg mattyg changed the title fix: race condition when attempting to reestablish a recently removed connection fix: don't send Disconnect until peer has been removed from peer_map Nov 13, 2025
@mattyg mattyg force-pushed the fix/hubmap-removal-race branch from 0429b26 to 8f856ff Compare November 13, 2025 21:57
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
crates/tx5/src/peer.rs (1)

243-248: Drop guard wiring is sound; explicit drop(drop_guard) is redundant but fine

Binding the DropPeer as drop_guard at the top of task gives you consistent cleanup on all early returns and on task abort (via Peer::drop aborting the join handle). The explicit drop(drop_guard) after the recv loop is technically unnecessary—RAII would drop it on function return—but it does make the cleanup point more explicit in the “normal loop finished” path and doesn’t change semantics.

If you prefer to trim a line, you could rely on natural scope drop and remove the explicit drop(drop_guard), but it’s not required.

Also applies to: 363-368

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0429b26 and 8f856ff.

📒 Files selected for processing (2)
  • crates/tx5-connection/src/conn.rs (1 hunks)
  • crates/tx5/src/peer.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ThetaSinner
Repo: holochain/tx5 PR: 167
File: crates/tx5/tests/tests/flaky_sig.rs:73-84
Timestamp: 2025-08-08T08:41:38.069Z
Learning: Repo holochain/tx5: In test code (crates/tx5/tests/tests/flaky_sig.rs), maintainer (ThetaSinner) prefers not to refactor non-ideal async patterns; leaving block_on in Drop for FlakyRelay is acceptable. Treat similar test-only cleanup suggestions as non-blocking unless they cause flakes/panics.
📚 Learning: 2025-08-08T08:41:38.069Z
Learnt from: ThetaSinner
Repo: holochain/tx5 PR: 167
File: crates/tx5/tests/tests/flaky_sig.rs:73-84
Timestamp: 2025-08-08T08:41:38.069Z
Learning: Repo holochain/tx5: In test code (crates/tx5/tests/tests/flaky_sig.rs), maintainer (ThetaSinner) prefers not to refactor non-ideal async patterns; leaving block_on in Drop for FlakyRelay is acceptable. Treat similar test-only cleanup suggestions as non-blocking unless they cause flakes/panics.

Applied to files:

  • crates/tx5-connection/src/conn.rs
  • crates/tx5/src/peer.rs
🧬 Code graph analysis (2)
crates/tx5-connection/src/conn.rs (1)
crates/tx5/src/peer.rs (2)
  • drop (25-27)
  • drop (180-218)
crates/tx5/src/peer.rs (2)
crates/tx5/src/sig.rs (3)
  • task (171-265)
  • drop (18-20)
  • drop (151-166)
crates/tx5-connection/src/conn.rs (3)
  • drop (49-60)
  • drop (248-250)
  • drop (431-433)
🔇 Additional comments (2)
crates/tx5-connection/src/conn.rs (1)

241-251: RAII guard for ready looks correct and closes remaining hang paths

Using DropReady to call ready.close() when con_task exits (including early returns and aborts) is a solid way to ensure Conn::ready() futures don’t hang indefinitely if the handshake or webrtc negotiation fails. The extra close from the guard is consistent with the existing explicit ready.close() calls on success/fallback, and doesn’t introduce new blocking or ordering problems as it runs only on task teardown.

Also applies to: 258-261

crates/tx5/src/peer.rs (1)

192-217: Disconnect send via DropPeer with timeout matches desired cleanup ordering

Having DropPeer::drop (a) call drop_peer_url while holding the EpInner lock, (b) mark ready failed, and only then (c) spawn a timed attempt to send EndpointEvent::Disconnected ensures the endpoint is notified after the peer is removed from internal state, which is exactly what this PR is aiming for. The 10s timeout around evt_send.send in the spawned task prevents a slow or stuck event consumer from effectively “leaking” the cleanup task while keeping DropPeer::drop itself non-blocking.

@mattyg mattyg requested review from a team and neonphog November 14, 2025 00:51
Copy link
Member

@ThetaSinner ThetaSinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change to peer.rs makes sense, I don't think the change to conn.rs does. Does one work without the other? Dealing with the ready success/failure states is a bigger change

Copy link
Contributor

@neonphog neonphog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth spending a little time to attempt making the ready return more semantic results. If that turns out to be a rabbit hole, I'm pretty comfortable with these changes, maybe just adding a comment that closing the ready without somehow better indicating it is an error case is not ideal.

@mattyg mattyg force-pushed the fix/hubmap-removal-race branch 3 times, most recently from ccc7853 to 8133256 Compare November 17, 2025 19:16
@mattyg
Copy link
Member Author

mattyg commented Nov 17, 2025

For the sake of not putting too much time into tx5 edge case issues, I'm going to keep in the peer.rs change with an explanation comment and remove the conn.rs change rather than do additional refactoring.

This significantly reduces test flakes such that it shouldn't disrupt our productivity, so good enough to move forward.

@mattyg mattyg requested a review from ThetaSinner November 17, 2025 19:40
… Disconnect event is always sent after peer is removed from peer_map
@mattyg mattyg force-pushed the fix/hubmap-removal-race branch from 8133256 to fbc8b30 Compare November 18, 2025 00:34
@cocogitto-bot
Copy link

cocogitto-bot bot commented Nov 18, 2025

✔️ fbc8b30 - Conventional commits check succeeded.

@mattyg mattyg enabled auto-merge (squash) November 20, 2025 16:38
@mattyg mattyg merged commit 31e22c7 into main Nov 20, 2025
6 of 10 checks passed
@mattyg mattyg deleted the fix/hubmap-removal-race branch November 20, 2025 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants