Skip to content

Comments

fix(shared-log): comprehensive shutdown guards for all async operations#4

Open
Faolain wants to merge 1 commit intofix/pubsub-subscribe-racefrom
fix/rootcause-c-comprehensive-shutdown-guards
Open

fix(shared-log): comprehensive shutdown guards for all async operations#4
Faolain wants to merge 1 commit intofix/pubsub-subscribe-racefrom
fix/rootcause-c-comprehensive-shutdown-guards

Conversation

@Faolain
Copy link
Owner

@Faolain Faolain commented Feb 6, 2026

Summary

This PR adds comprehensive shutdown guards to multiple SharedLog async operations to prevent unhandled rejections during node teardown. It targets all identified async code paths that can fire after _close() has torn down internal state.

This is the comprehensive approach (Root Cause C) — compare with the minimal approach in the companion PR for Root Cause B which only guards persistCoordinate.

Root Cause Analysis

During systematic investigation of SharedLog teardown races, three hypotheses were tested in parallel using isolated git worktrees:

Approach Description Errors (baseline: 6) Result
A — Debounce guard Patch rebalanceParticipation + __freqClosed flag 7 errors WORSE — @peerbit/time debouncer holds a direct closure reference, bypassing prototype patches
B — persistCoordinate TOCTOU guard Multi-layer guard in persistCoordinate only 2 errors Minimal fix, 4 errors eliminated
C — Comprehensive shutdown guards (this PR) All of B + guards on handleSubscriptionChange, removeReplicator, rebalanceParticipation 2 errors Same reduction, but provides defense-in-depth

Changes

1. persistCoordinate — TOCTOU Race Guard (~line 3662)

Four-layer protection against the Time-of-Check-to-Time-of-Use race where _close() completes between the !this.closed check in findLeaders and actual persistCoordinate execution:

async persistCoordinate(...) {
  if (this.closed) return;                          // Guard 1: already closed
  if (!entry?.hash) return;                         // Guard 2: invalid entry
  if (!this.entryCoordinatesIndex) return;          // Guard 3: index torn down
  if (!this._replicationRangeIndex) return;         // Guard 3: index torn down
  try {
    return await originalPersistCoordinate(...);
  } catch (err) {
    if (this.closed) return;                        // Guard 4: closed mid-execution
    throw err;
  }
}

2. handleSubscriptionChange — Subscription Event Guard (~line 3926)

Early bail when subscription change events arrive after close:

async handleSubscriptionChange(...) {
  if (this.closed) return;
  return await originalHandleSubscriptionChange(...);
}

3. removeReplicator — Replicator Cleanup Guard (~line 996)

Prevents replicator removal from accessing torn-down state:

async removeReplicator(...) {
  if (this.closed) return;
  return await originalRemoveReplicator(...);
}

4. rebalanceParticipation — Rebalance Guard (~line 4477)

Guards the debounced rebalance operation. Note: prototype patching alone is insufficient because @peerbit/time's debouncer captures a direct closure reference. This guard works because it's applied at the source method level before the debouncer wraps it:

async rebalanceParticipation(...) {
  if (this.closed) return;
  return await originalRebalanceParticipation(...);
}

Test Results

Methodology: Full vitest test suite (178 tests) run on each worktree. Integration tests (library, playlist, identity, replication.boundaries) also run 3x separately to check for flakiness.

Full Suite Results

Metric Baseline (PR dao-xyz#589 only) This PR
Passed 171 169
Failed 7 (pre-existing) 9
Unhandled errors 6 2

The 2 additional test failures compared to baseline are flaky tests (WalletPanel.purchaseForm) that vary between runs (0-12 failures), not caused by these changes.

Remaining 2 Errors

The remaining 2 unhandled errors originate from @peerbit/program's RPC layer (RPC.close → TypedEventEmitter.dispatchEvent), not from SharedLog. These require a separate fix in the RPC/program teardown path.

Pre-existing Failures (unchanged, present on all branches)

  • chunkedAesGcmV1.test.ts (2 failures)
  • ProfilePage.test.tsx (2 failures)
  • WalletPanel.purchaseForm.test.tsx (flaky, 0-12 failures)
  • RecoverySetupPanel.test.tsx (1 failure)
  • generateSampleMp3.test.ts (1 failure)

Relationship to PR dao-xyz#589

This fix is complementary to the pubsub subscribe race fix in PR dao-xyz#589 (dao-xyz#589). PR dao-xyz#589 fixes dropped subscription messages during topic initialization. This PR addresses separate race conditions exposed during teardown — multiple async operations executing after _close() has torn down internal state.

Comparison: This PR vs Root Cause B (Minimal Fix)

Aspect Root Cause B (minimal) Root Cause C (this PR)
Error reduction 6 → 2 6 → 2
Methods guarded 1 (persistCoordinate) 4 (persistCoordinate, handleSubscriptionChange, removeReplicator, rebalanceParticipation)
Lines changed ~20 ~50
Defense-in-depth Single choke point Multiple barriers
Risk of side effects Lower Slightly higher

Recommendation: If the goal is minimal, low-risk fix — use Root Cause B. If the goal is defense-in-depth against future teardown races — use this PR (Root Cause C). Both achieve the same measured error reduction because all observed errors flow through persistCoordinate, but this PR provides additional safety margins against unobserved edge cases.

Adds closed-state guards to all async methods that can fire after
SharedLog._close() tears down internal indices:

- persistCoordinate: TOCTOU guard + index existence checks + try/catch
- handleSubscriptionChange: bail if closed
- removeReplicator: bail if closed
- rebalanceParticipation: bail if closed (top-level, before inner fn)

Validated: reduces unhandled errors from 6 to 2 in integration tests.
The remaining 2 errors originate from @peerbit/program RPC layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant