Fix/shared log handshake flakes#593

Open
Faolain wants to merge 4 commits into dao-xyz:master from Faolain:fix/shared-log-handshake-flakes

Conversation


Faolain (Contributor) commented Feb 6, 2026

UPDATE: This PR is a subset of #596. That PR should include the changes from here plus further improvements to shared-log, with more tests that fail on master but pass on that branch. I initially worked on #593, and it remains separate for investigation/understanding, but perhaps it can be closed in favor of #596?

Why

Pubsub was flaky and had race conditions around topic subscribing. #589 solved that, but doing so surfaced another pre-existing race condition within shared-log itself, which this PR addresses. Read that PR for the full context of the problem (in the dApp it surfaced as an inability to make connections with a peer, a "flaky connection"). Also, please skip to the bottom for an additional, more targeted pubsub approach tried in Faolain#7, which I'm thinking of stacking on this PR if you agree.

Context

PR #589 (pubsub debounce hardening) improves subscription discovery timing. In CI it also surfaced a set of latent shared-log races/timeouts (and a couple of migration/role issues) that could previously hide behind slower peer discovery.

The failure signature that motivated this work:

  • events.spec.ts: duplicate replicator:join (same peer hash appears twice)
  • migration.spec.ts (migration-8-9): replication never completes within waitForResolved default timeout
  • replicate.spec.ts: timing sensitivity in waitForReplicator waits until maturity

This PR makes the pubsub subscription handshake resilient to debounce timing, and makes shared-log replication-info application deterministic (per-peer serialization + retry on startup) so those CI failures stop occurring.

Root Causes

1) Pubsub: subscribe debounce window could drop/omit state

  • DirectSub.subscribe() is debounced; the topic is only fully initialized when _subscribe() runs.
  • Incoming remote Subscribe handling drops the subscription if the topic hasn't been initialized locally yet.
  • requestSubscribers() overlap responses were based only on this.subscriptions, which can be empty during the debounce window.

Net effect: peers can connect during the debounce window and fail to discover each other's subscriptions promptly or at all.
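A minimal model of the race, with hypothetical names (not the real DirectSub internals): because `subscribe()` is debounced, per-topic state only exists once the debounce fires, so a remote Subscribe arriving inside that window finds nothing and is dropped.

```typescript
class DebouncedSub {
  topics = new Map<string, Set<string>>(); // topic -> known remote subscribers
  private pending = new Set<string>();

  subscribe(topic: string) {
    this.pending.add(topic); // the real code schedules _subscribe() via debounce
  }

  flushDebounce() {
    // Simulates the debounce timer firing and _subscribe() running.
    for (const topic of this.pending) this.topics.set(topic, new Set());
    this.pending.clear();
  }

  onRemoteSubscribe(topic: string, peer: string) {
    const subs = this.topics.get(topic);
    if (!subs) return; // topic not initialized yet: the update is lost
    subs.add(peer);
  }
}

const sub = new DebouncedSub();
sub.subscribe("chat");
sub.onRemoteSubscribe("chat", "peerA"); // lands inside the debounce window
sub.flushDebounce();
console.log(sub.topics.get("chat")!.size); // 0: peerA's subscription was dropped
```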

2) Shared-log: replication-info handling was racy / lossy

  • Replication-info messages could be processed concurrently for the same peer (fire-and-forget async work).
  • addReplicationRange() emitted replicator:join based on a TOCTOU check of index state. Two concurrent handlers could both decide a peer is “new” and emit duplicates.
  • During startup, replication-info application could fail with NotStarted/IndexNotStarted and the update would be dropped (no retry), leaving the handshake incomplete and migration replication stuck.
  • Startup relied on a snapshot of current pubsub state; it didn’t actively backfill subscriber info.
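The TOCTOU on `replicator:join` can be reduced to this sketch (hypothetical names): two handlers for the same peer both check "is this peer new?" before either records it, and both emit. The real code has an async gap between check and commit; here the interleaving is shown explicitly.

```typescript
const known = new Set<string>();
const emitted: string[] = [];

// Step 1 of the handler: check whether the peer looks new.
function checkIsNew(peer: string): boolean {
  return !known.has(peer);
}

// Step 2 (after an async gap in the real code): record state and maybe emit.
function commit(peer: string, wasNew: boolean) {
  known.add(peer);
  if (wasNew) emitted.push(`replicator:join ${peer}`);
}

// Two concurrent handlers interleave: both check before either commits.
const a = checkIsNew("peerA");
const b = checkIsNew("peerA");
commit("peerA", a);
commit("peerA", b);
console.log(emitted.length); // 2: duplicate replicator:join for the same peer
```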

3) Migration + role encoding issues (v8 compat)

  • v8 compat path could fail to respond with role info due to a missing await, a getRole() implementation that threw in multi-segment cases, and a role -> replication-range conversion that used normalized floats rather than u32 coordinate space.
  • RoleReplicationSegment offset encoding used the wrong numerator.

4) A genuine test flake under heavy load

While re-running the full suite, the u32-simple > sharding > "handles peer joining and leaving multiple times" test intermittently failed because replica convergence could exceed waitForResolved's default 10 s timeout on a loaded machine.

Fixes

Pubsub (packages/transport/pubsub/src/index.ts)

  • Eagerly initialize per-topic state in subscribe() so early remote Subscribe messages aren’t dropped.
  • Include topics pending in the debounce window when responding to requestSubscribers() overlap queries.
  • If a pending subscribe is cancelled via unsubscribe() before debounce fires, clean up the eagerly created topic state.
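The three pubsub fixes can be sketched together on the same simplified model as above (hypothetical names, not the actual DirectSub code): topic state is created eagerly in `subscribe()`, and torn down again if `unsubscribe()` cancels the pending subscribe before the debounce fires.

```typescript
class EagerSub {
  topics = new Map<string, Set<string>>(); // topic -> known remote subscribers
  private pending = new Set<string>();

  subscribe(topic: string) {
    // Eagerly initialize per-topic state so early remote Subscribes are kept,
    // even though the full _subscribe() work is still debounced.
    if (!this.topics.has(topic)) this.topics.set(topic, new Set());
    this.pending.add(topic);
  }

  unsubscribe(topic: string) {
    // Cancelled before the debounce fired: clean up the eager topic state.
    if (this.pending.delete(topic)) this.topics.delete(topic);
  }

  onRemoteSubscribe(topic: string, peer: string) {
    this.topics.get(topic)?.add(peer); // no longer dropped mid-debounce
  }
}

const eager = new EagerSub();
eager.subscribe("chat");
eager.onRemoteSubscribe("chat", "peerA"); // arrives during the debounce window
console.log(eager.topics.get("chat")!.size); // 1: subscription retained

eager.subscribe("temp");
eager.unsubscribe("temp"); // cancelled before debounce
console.log(eager.topics.has("temp")); // false: no stale topic entry
```

Responding to `requestSubscribers()` from the union of initialized and pending topics follows the same idea: the pending set is consulted instead of only the finalized subscription map.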

Shared-log (packages/programs/data/shared-log/src/index.ts)

  • Serialize replication-info application per peer (per-peer promise chain) so the same sender can’t concurrently run addReplicationRange().
  • Track latest replication-info timestamp per peer and ignore older updates.
  • If replication-info can’t be applied yet due to NotStarted/IndexNotStarted/Closed conditions, store the latest message per peer and flush after open instead of dropping it.
  • After open, call pubsub.requestSubscribers(this.topic) and backfill handleSubscriptionChange(...) from the current subscriber snapshot.
  • Make replicator:join idempotent: emit on transition “not known replicator -> has segments”, including restart/full-state announce cases.
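The first two items above can be sketched as a per-peer promise chain with a latest-timestamp filter (hypothetical helper names; the real shared-log code differs in detail):

```typescript
const chains = new Map<string, Promise<void>>();
const latestTs = new Map<string, number>();
const applied: number[] = [];

function onReplicationInfo(
  peer: string,
  ts: number,
  apply: () => Promise<void>,
): Promise<void> {
  // Chain this peer's handling onto its previous handling, so the same
  // sender can never run apply() (i.e. addReplicationRange) concurrently.
  const prev = chains.get(peer) ?? Promise.resolve();
  const next = prev.then(async () => {
    const latest = latestTs.get(peer);
    if (latest !== undefined && ts <= latest) return; // stale update: ignore
    latestTs.set(peer, ts);
    await apply();
  });
  chains.set(peer, next.catch(() => undefined)); // keep the chain alive on errors
  return next;
}

void (async () => {
  // Two concurrent messages from the same peer run one at a time; the one
  // with the newer timestamp wins regardless of arrival order.
  await Promise.all([
    onReplicationInfo("peerA", 2, async () => { applied.push(2); }),
    onReplicationInfo("peerA", 1, async () => { applied.push(1); }),
  ]);
  console.log(applied); // [2]
})();
```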

Migration + role fixes

  • getRole() is now best-effort when multiple local segments exist (selects the widest segment).
  • Ensure role responses await getRole().
  • Convert role -> replication segments by denormalizing factor/offset into u32 coordinate space.
  • Fix RoleReplicationSegment offset encoding.
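The role -> replication-segment conversion amounts to denormalizing floats in [0, 1] into u32 coordinates. A sketch of the idea (hypothetical function and field names, not the actual conversion code):

```typescript
const U32_MAX = 0xffffffff;

// A v8 role expresses its segment as normalized floats; replication ranges
// live in u32 coordinate space, so both fields must be scaled up.
function denormalize(role: { offset: number; factor: number }) {
  return {
    offset: Math.round(role.offset * U32_MAX) >>> 0,
    width: Math.round(role.factor * U32_MAX) >>> 0,
  };
}

console.log(denormalize({ offset: 0.5, factor: 0.25 }));
// { offset: 2147483648, width: 1073741824 }
```

Passing the normalized floats through unscaled (the old behavior) would produce near-zero-width ranges in u32 space, which is why migration replication stalled.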

Tests

  • Fix migration test’s v8 mock to reply using the opened instance (db1.log.rpc).
  • Make the "waitForReplicator waits until maturity" test assert on remaining maturity time derived from the observed segment timestamp.
  • Harden checkReplicas() to avoid false negatives under load by using a slightly longer timeout + lower polling rate.
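The checkReplicas() hardening follows the usual retry-until-resolved pattern. A self-contained sketch (signature assumed, modeled loosely on the waitForResolved helper the tests use):

```typescript
// Retry an assertion until it passes, with configurable timeout and poll rate.
async function waitForResolved(
  fn: () => void | Promise<void>,
  opts: { timeout?: number; delayInterval?: number } = {},
): Promise<void> {
  const { timeout = 10_000, delayInterval = 100 } = opts;
  const deadline = Date.now() + timeout;
  for (;;) {
    try {
      await fn();
      return;
    } catch (err) {
      if (Date.now() >= deadline) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayInterval));
    }
  }
}

void (async () => {
  // Convergence can exceed 10 s on a loaded machine, so the hardened check
  // bumps the timeout and polls less aggressively to reduce overhead.
  let replicaCount = 0;
  setTimeout(() => { replicaCount = 2; }, 50);
  await waitForResolved(
    () => { if (replicaCount !== 2) throw new Error("not converged"); },
    { timeout: 30_000, delayInterval: 200 },
  );
  console.log("converged");
})();
```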

Debug Docs

  • Keep shared-log-debug.md, shared-log-debug-plan.md, and scrap.md as an audit trail for future debugging; these can be removed later once we’re confident.

Verification

  • pnpm run build (PASS)
  • Targeted tests (PASS):
    • replicate:join not emitted on update
    • 8-9, replicates database of 1 entry
    • 9-8, replicates database of 1 entry
    • waitForReplicator waits until maturity
    • handles peer joining and leaving multiple times
  • Full CI Part 4 suite (PASS):
    • pnpm run test:ci:part-4
    • @peerbit/shared-log: 1743 passing
    • @peerbit/shared-log-proxy: 1 passing

All checks passed when running on my fork; results on this repo with no changes are TBD 🤞

Commits

  • 35963a7ac tests: checkpoint
  • a2ab6d31d shared-log: harden replication handshake + reduce flakes
  • bb7e7b031 docs: record shared-log debug commit hashes
  • 67125c13f docs: add shared-log debugging scrap notes

Additional Notes:

There is another PR at Faolain#7; the summary below compares it with this PR.

[Screenshot 2026-02-06 at 5:03:24 PM]

Context

PR dao-xyz#589 (pubsub debounce hardening) tightened peer subscription timing and surfaced latent shared-log races/timeouts in CI (duplicate replicator:join + migration-8-9 replication stalls).

What Was Implemented (see shared-log-debug.md for the full investigation log)

Pubsub (packages/transport/pubsub/src/index.ts)
- Eagerly initialize per-topic state in subscribe() so early remote Subscribe messages aren't dropped.
- Include topics pending in the debounce window when responding to requestSubscribers overlap queries.
- If subscribe() is cancelled via unsubscribe() before the debounce fires, clean up the eager topic state to avoid stale topic entries.

Shared-log (packages/programs/data/shared-log/src/index.ts)
- Serialize replication-info application per peer (promise chain) to eliminate TOCTOU around addReplicationRange() and replicator:join emission.
- Track latest replication-info timestamp per peer and ignore older messages.
- If replication-info can't be applied yet (NotStartedError / index-not-ready), keep the latest per-peer message and flush after open instead of dropping it.
- After open, requestSubscribers(topic) and backfill handleSubscriptionChange() from the current subscriber snapshot to avoid missed subscribe windows.
- Make replicator:join idempotent: emit on the transition "not known replicator -> has segments", including the restart/full-state announce case.
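The idempotent emit rule in the last bullet reduces to tracking a transition, not a message. A sketch with hypothetical names:

```typescript
const segments = new Map<string, number>(); // peer -> number of known segments
const joins: string[] = [];

// Emit replicator:join only on the "no segments -> has segments" transition,
// so repeated or full-state announcements for a known replicator are no-ops.
function applyReplicationInfo(peer: string, segmentCount: number) {
  const had = (segments.get(peer) ?? 0) > 0;
  segments.set(peer, segmentCount);
  const has = segmentCount > 0;
  if (!had && has) joins.push(peer);
}

applyReplicationInfo("peerA", 1); // join emitted
applyReplicationInfo("peerA", 2); // update only: no duplicate join
applyReplicationInfo("peerA", 0); // peer dropped all segments
applyReplicationInfo("peerA", 1); // restart / rejoin: join emitted again
console.log(joins); // ["peerA", "peerA"]
```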

Migration + Role fixes
- v8 compat: always respond with a role; getRole() is best-effort when multiple segments exist, and request handlers correctly await getRole().
- ResponseRoleMessage -> replication-info conversion denormalizes factor/offset into u32 coordinate space.
- Fix RoleReplicationSegment offset encoding bug (offset numerator incorrectly used factor).

Tests (packages/programs/data/shared-log/test/*)
- Fix migration v8 mock to respond using the actually opened instance (db1.log.rpc).
- Make the maturity timing assertion robust by using remaining maturity time based on the observed segment timestamp.

Additional flake hardening (this commit)
- checkReplicas() now uses a slightly longer waitForResolved timeout with lower polling rate to avoid false negatives under heavy load (u32-simple sharding convergence could exceed the default 10s).
  File: packages/programs/data/shared-log/test/utils.ts

Docs
- shared-log-debug-plan.md updated with current verification status, the observed sharding flake, and the test hardening applied.

Verification
- pnpm run build: PASS
- Targeted shared-log tests: PASS
  - replicate:join not emitted on update
  - 8-9, replicates database of 1 entry
  - 9-8, replicates database of 1 entry
  - waitForReplicator waits until maturity
  - handles peer joining and leaving multiple times
- Full CI Part 4 suite: PASS
  - node ./node_modules/aegir/src/index.js run test --roots ./packages/programs/data/shared-log ./packages/programs/data/shared-log/proxy -- -t node --no-build
  - @peerbit/shared-log: 1743 passing
  - @peerbit/shared-log-proxy: 1 passing

Contains the terminal transcript / scratch notes from the shared-log + pubsub investigation (with account/session details redacted).

Faolain (Contributor, Author) commented Feb 6, 2026

[Screenshot 2026-02-06 at 6:19:36 AM]

Re: the db fix
[Screenshot 2026-02-06 at 6:20:57 AM]


Faolain (Contributor, Author) commented Feb 6, 2026

@marcus-pousette please review 🙏

I can port my tests from #589 and remove the scrap .md files I used with AI tools to track down this problem (feel free to read the notes, of course). This PR seems crucial to ensuring #582 succeeds and that we don't have finicky race conditions going forward.

I know you made significant changes in #582 (funnily enough, I had no idea you had; maybe some of our fixes overlap, and hopefully mine here and in #589 are small enough that you can check whether the issue still exists in your implementation). Maybe these can go in ahead if the fanout work will take a while?

Update

The attached PR #596 should include everything in #593, plus better shared-log additions from Faolain#7, which perhaps means this PR should be closed.
