fix(shared-log): comprehensive shutdown guards for all async operations #4
Open
Faolain wants to merge 1 commit into fix/pubsub-subscribe-race
Adds closed-state guards to all async methods that can fire after SharedLog._close() tears down internal indices:

- persistCoordinate: TOCTOU guard + index existence checks + try/catch
- handleSubscriptionChange: bail if closed
- removeReplicator: bail if closed
- rebalanceParticipation: bail if closed (top-level, before inner fn)

Validated: reduces unhandled errors from 6 to 2 in integration tests. The remaining 2 errors originate from @peerbit/program RPC layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR adds comprehensive shutdown guards to multiple SharedLog async operations to prevent unhandled rejections during node teardown. It targets all identified async code paths that can fire after `_close()` has torn down internal state.

This is the comprehensive approach (Root Cause C). Compare it with the minimal approach in the companion PR for Root Cause B, which guards only `persistCoordinate`.

Root Cause Analysis
During systematic investigation of SharedLog teardown races, three hypotheses were tested in parallel using isolated git worktrees:
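- Root Cause A: `rebalanceParticipation` + `__freqClosed` flag. The `@peerbit/time` debouncer holds a direct closure reference, bypassing prototype patches.
- Root Cause B: `persistCoordinate` only.
- Root Cause C (this PR): `persistCoordinate` plus `handleSubscriptionChange`, `removeReplicator`, `rebalanceParticipation`.

Changes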
1. `persistCoordinate`: TOCTOU Race Guard (~line 3662)

Four-layer protection against the time-of-check-to-time-of-use (TOCTOU) race in which `_close()` completes between the `!this.closed` check in `findLeaders` and the actual `persistCoordinate` execution.

2. `handleSubscriptionChange`: Subscription Event Guard (~line 3926)

Early bail when subscription change events arrive after close.
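The guards in items 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SketchLog`, `Coordinate`, `coordinateIndex` and `subscribers` are hypothetical stand-ins for SharedLog internals. The layering (entry check, index existence check, re-check after the await, try/catch) is modelled on the commit message, so it should be read as an assumption about what the four layers might be.

```typescript
// Hypothetical stand-in for SharedLog; field and type names are assumptions.
type Coordinate = { hash: string; value: number };

class SketchLog {
  closed = false;
  // close() sets this to undefined, mimicking an index torn down by _close().
  private coordinateIndex: Map<string, Coordinate> | undefined = new Map();
  private subscribers = new Set<string>();

  async close(): Promise<void> {
    this.closed = true;
    this.coordinateIndex = undefined;
  }

  async persistCoordinate(c: Coordinate): Promise<boolean> {
    if (this.closed) return false; // closed-state check on entry
    const index = this.coordinateIndex;
    if (!index) return false; // index existence check
    try {
      await Promise.resolve(); // stands in for real async I/O
      if (this.closed) return false; // re-check: close() may have run during the await
      index.set(c.hash, c);
      return true;
    } catch (error) {
      if (this.closed) return false; // errors caused by teardown are dropped
      throw error; // anything else still surfaces
    }
  }

  async handleSubscriptionChange(peer: string, subscribed: boolean): Promise<void> {
    if (this.closed) return; // late subscription events are ignored
    if (subscribed) this.subscribers.add(peer);
    else this.subscribers.delete(peer);
  }

  get subscriberCount(): number {
    return this.subscribers.size;
  }
}

async function demo(): Promise<{ writes: boolean[]; subscribers: number }> {
  const log = new SketchLog();
  const before = await log.persistCoordinate({ hash: "a", value: 1 });
  // This call passes the entry checks, then suspends at its await.
  const racing = log.persistCoordinate({ hash: "b", value: 2 });
  await log.close(); // teardown lands inside the TOCTOU window
  const during = await racing;
  const after = await log.persistCoordinate({ hash: "c", value: 3 });
  await log.handleSubscriptionChange("peer-1", true);
  return { writes: [before, during, after], subscribers: log.subscriberCount };
}
```

In this sketch `demo()` resolves with `writes` equal to `[true, false, false]` and `subscribers` equal to `0`: the write that was in flight when `close()` ran is stopped by the re-check after the await, and the write and subscription event that arrive after close are stopped at the entry check.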
3. `removeReplicator`: Replicator Cleanup Guard (~line 996)

Prevents replicator removal from accessing torn-down state.
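To illustrate what "accessing torn-down state" can look like, here is a small sketch contrasting an unguarded removal with a guarded one. `ReplicatorTable`, its `replicators` map and `removeUnguarded` are hypothetical, and the real `removeReplicator` signature is not shown in this description, so the shape below is an assumption.

```typescript
// Hypothetical sketch; the real removeReplicator signature is an assumption.
class ReplicatorTable {
  closed = false;
  private replicators: Map<string, number> | undefined = new Map([["peer-a", 1]]);

  close(): void {
    this.closed = true;
    this.replicators = undefined; // mimics indices torn down on close
  }

  // Unguarded variant: dereferences the torn-down map and throws a TypeError after close().
  removeUnguarded(peer: string): boolean {
    return this.replicators!.delete(peer);
  }

  // Guarded variant: bails out before touching torn-down state.
  removeReplicator(peer: string): boolean {
    if (this.closed || !this.replicators) return false;
    return this.replicators.delete(peer);
  }
}
```

After `close()`, `removeReplicator` returns `false` without touching the map, while `removeUnguarded` throws. If that throw happens inside an async call that nothing awaits, it surfaces as an unhandled rejection.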
4. `rebalanceParticipation`: Rebalance Guard (~line 4477)

Guards the debounced rebalance operation. Note: prototype patching alone is insufficient because the `@peerbit/time` debouncer captures a direct closure reference. This guard works because it is applied at the source method level, before the debouncer wraps it.

Test Results
Methodology: The full vitest test suite (178 tests) was run on each worktree. Integration tests (`library`, `playlist`, `identity`, `replication.boundaries`) were also run 3x separately to check for flakiness.

Full Suite Results
The 2 additional test failures compared to baseline are flaky tests (`WalletPanel.purchaseForm`) that vary between runs (0-12 failures) and are not caused by these changes.

Remaining 2 Errors
The remaining 2 unhandled errors originate from the RPC layer of `@peerbit/program` (`RPC.close → TypedEventEmitter.dispatchEvent`), not from SharedLog. These require a separate fix in the RPC/program teardown path.

Pre-existing Failures (unchanged, present on all branches)
- `chunkedAesGcmV1.test.ts` (2 failures)
- `ProfilePage.test.tsx` (2 failures)
- `WalletPanel.purchaseForm.test.tsx` (flaky, 0-12 failures)
- `RecoverySetupPanel.test.tsx` (1 failure)
- `generateSampleMp3.test.ts` (1 failure)

Relationship to PR dao-xyz#589
This fix is complementary to the pubsub subscribe race fix in PR dao-xyz#589, which fixes dropped subscription messages during topic initialization. This PR addresses separate race conditions exposed during teardown: multiple async operations executing after `_close()` has torn down internal state.

Comparison: This PR vs Root Cause B (Minimal Fix)
- Root Cause B (minimal fix) guards `persistCoordinate` only.
- This PR (Root Cause C) guards `persistCoordinate`, `handleSubscriptionChange`, `removeReplicator`, `rebalanceParticipation`.

Recommendation: If the goal is a minimal, low-risk fix, use Root Cause B. If the goal is defense-in-depth against future teardown races, use this PR (Root Cause C). Both achieve the same measured error reduction because all observed errors flow through `persistCoordinate`, but this PR provides an additional safety margin against unobserved edge cases.
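As background for change 4, the closure-capture behaviour can be reproduced in isolation. The sketch below uses a hypothetical `debounceSketch` helper with a manual `flush()`; it is not the `@peerbit/time` API, and `RebalanceSketch` and `guardInMethod` are names invented for the illustration. It assumes the debouncer is handed a bound method reference at construction time, which is one way a "direct closure reference" could arise.

```typescript
// Hypothetical debouncer with a manual flush; NOT the @peerbit/time API.
type Debounced = { call: () => void; flush: () => string | undefined };

function debounceSketch(fn: () => string): Debounced {
  let pending = false;
  return {
    call: () => {
      pending = true;
    },
    flush: () => {
      if (!pending) return undefined;
      pending = false;
      return fn(); // runs the reference captured when the wrapper was built
    },
  };
}

class RebalanceSketch {
  closed = false;
  readonly guardInMethod: boolean;
  readonly debounced: Debounced;

  constructor(guardInMethod: boolean) {
    this.guardInMethod = guardInMethod;
    // The method reference is captured here, at construction time.
    this.debounced = debounceSketch(this.rebalanceParticipation.bind(this));
  }

  rebalanceParticipation(): string {
    if (this.guardInMethod && this.closed) return "skipped"; // source-level guard
    return this.closed ? "ran after close" : "ran";
  }
}

function demoRebalance(): {
  patchedOnly: string | undefined;
  sourceGuard: string | undefined;
  direct: string;
} {
  const patchedOnly = new RebalanceSketch(false);
  const sourceGuard = new RebalanceSketch(true);

  // Patch the prototype after both wrappers already hold the original function.
  const original = RebalanceSketch.prototype.rebalanceParticipation;
  RebalanceSketch.prototype.rebalanceParticipation = function (this: RebalanceSketch): string {
    return this.closed ? "skipped" : original.call(this);
  };

  for (const log of [patchedOnly, sourceGuard]) {
    log.debounced.call();
    log.closed = true; // close lands before the debounced call fires
  }
  return {
    patchedOnly: patchedOnly.debounced.flush(),
    sourceGuard: sourceGuard.debounced.flush(),
    direct: patchedOnly.rebalanceParticipation(),
  };
}
```

In this sketch the prototype patch takes effect for direct calls (`direct` is `"skipped"`) but not for the debounced path (`patchedOnly` is `"ran after close"`), because the wrapper still holds the original function. The guard written inside the method body is reached either way (`sourceGuard` is `"skipped"`).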