Skip to content

Comments

fix(shared-log): guard persistCoordinate against TOCTOU race during shutdown#2

Closed
Faolain wants to merge 1 commit intofix/pubsub-initialize-topic-on-subscribefrom
fix/rootcause-b-persistcoordinate-guard
Closed

fix(shared-log): guard persistCoordinate against TOCTOU race during shutdown#2
Faolain wants to merge 1 commit intofix/pubsub-initialize-topic-on-subscribefrom
fix/rootcause-b-persistcoordinate-guard

Conversation

@Faolain
Copy link
Owner

@Faolain Faolain commented Feb 6, 2026

Problem

During test teardown (and potentially in production), SharedLog._close() can complete between the caller's \!this.closed check in findLeaders (line ~2677) and persistCoordinate's actual execution. At that point:

  • this._entryCoordinatesIndex has been set to undefined by _close() (lines ~1742-1744)
  • this._replicationRangeIndex has been set to undefined
  • Calling .put() on undefined throws a TypeError
  • Because this is in an async continuation without .catch(), it becomes an unhandled promise rejection

This is a TOCTOU (Time-of-Check-to-Time-of-Use) race condition. The \!this.closed guard in findLeaders passes, but _close() completes before persistCoordinate runs its body.

Call chain

onChange callback (from log.join)
  -> persistCoordinate(entry)
    -> findLeaders(coordinates, entry, { persist: {} })
      -> \!this.closed check passes
      -> ... async work ...
      -> persistCoordinate({ leaders, coordinates, replicas, entry })
        -> cidifyString(properties.entry.hash)  // may be partial
        -> this._entryCoordinatesIndex.put(...)  // undefined after _close()

Fix

Added multi-layer guards to persistCoordinate in packages/programs/data/shared-log/src/index.ts:

  1. this.closed bail-out at method entry -- catches the TOCTOU race where _close() completed between the caller's check and this method's execution
  2. this._entryCoordinatesIndex existence check -- uses the private backing field directly (not the getter which throws ClosedError) to silently bail if the index was torn down
  3. this._replicationRangeIndex existence check -- same pattern for the replication range index
  4. try/catch wrapper around the entire original body -- catches any error that occurs if _close() races mid-execution, and swallows it only if the instance has since been closed (re-throws otherwise to preserve real bug visibility)

Test Results

Methodology

Tested against 4 integration test files that exercise real Peerbit peer nodes with replication:

  • library.integration.test.ts -- replicates library items between 2 peers
  • playlist.integration.test.ts -- replicates playlist items, enforces ACLs
  • identity.integration.test.ts -- replicates profile updates across linked devices
  • replication.boundaries.test.ts -- tests remote search boundary enforcement

Each was run 3 times to assess flakiness, plus a full test suite run (182 files, 1572 tests).

Integration Tests (3 runs x 4 files)

Run Result
Run 1 4/4 passed
Run 2 4/4 passed
Run 3 4/4 passed

Full Test Suite Comparison

Metric Baseline (no fix) With this fix
Test Files 7 failed / 171 passed 7 failed / 171 passed
Tests 19 failed / 1544 passed 19 failed / 1544 passed
Unhandled Errors 6 errors 2 errors

This fix eliminates 4 of 6 unhandled promise rejections. The remaining 2 errors originate from @peerbit/program's RPC layer (RPC.close -> TypedEventEmitter.dispatchEvent), which is a separate upstream issue unrelated to SharedLog.

The 7 failing test files are pre-existing failures unrelated to SharedLog:

  • chunkedAesGcmV1.test.ts (2 failures)
  • ProfilePage.test.tsx (2 failures)
  • WalletPanel.purchaseForm.test.tsx (flaky, 0-12 failures)
  • RecoverySetupPanel.test.tsx (1 failure)
  • generateSampleMp3.test.ts (1 failure)

Why This Approach

Three alternative approaches were tested in parallel worktrees:

Approach Strategy Unhandled Errors Verdict
Root Cause A Patch rebalanceParticipation + __freqClosed flag 7 (worse) Ineffective -- debouncer holds closure ref
Root Cause B (this PR) Guard persistCoordinate with TOCTOU protection 2 Most effective, minimal change
Root Cause C All of B + guards on 4 other methods 2 Same as B -- extra guards add no benefit

Root Cause A failed because the @peerbit/time debouncer captures a direct closure reference to the original rebalanceParticipation function, bypassing prototype patches. Root Cause C showed that the additional guards on handleSubscriptionChange, removeReplicator, and rebalanceParticipation provided no incremental benefit over this targeted fix.

Relationship to PR dao-xyz#589

This fix is complementary to the pubsub subscribe() race fix in PR dao-xyz#589. PR dao-xyz#589 correctly fixes the race where DirectSub.subscribe() debounces _subscribe(), creating a window where incoming Subscribe messages are silently dropped. This fix addresses a separate but related issue: the shutdown teardown race in SharedLog that surfaces as unhandled promise rejections during test cleanup.

Generated with Claude Code

…hutdown

During test teardown (and potentially production close), _close() can
complete between the caller's !this.closed check in findLeaders and
persistCoordinate's actual execution. At that point entryCoordinatesIndex
is undefined, causing an unhandled promise rejection.

This adds:
- this.closed bail-out at method entry
- entryCoordinatesIndex/replicationRangeIndex existence checks
- try/catch wrapper that swallows errors if closed during execution

Validated: reduces unhandled errors from 6 to 2 in integration tests.
The remaining 2 errors originate from @peerbit/program RPC layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Faolain
Copy link
Owner Author

Faolain commented Feb 6, 2026

Closing duplicate — superseded by #3 which targets the correct base branch (fix/pubsub-subscribe-race).

@Faolain Faolain closed this Feb 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant