Fix/shared log handshake flakes#593

Open
Faolain wants to merge 4 commits into dao-xyz:master from Faolain:fix/shared-log-handshake-flakes

Conversation


Faolain (Contributor) commented Feb 6, 2026

UPDATE: This PR is a subset of #596. That PR should include the changes from here plus further improvements to shared-log, with more tests that fail on master but pass on that branch. I initially worked on #593, and it remains separate for investigation/understanding, but perhaps it can be closed in favor of #596?

Why

Pubsub was flaky and had race conditions around topic subscribing. #589 solved that, but doing so surfaced another pre-existing race condition within shared-log itself, which this PR addresses. Read that PR for the full context of the problem (in the dApp it surfaced as an inability to make connections with a peer, a "flaky connection"). Also, please skip to the bottom for an additional, more targeted pubsub approach tried in Faolain#7, which I'm thinking of stacking on this PR if you agree.

Context

PR #589 (pubsub debounce hardening) improves subscription discovery timing. In CI it also surfaced a set of latent shared-log races/timeouts (and a couple of migration/role issues) that could previously hide behind slower peer discovery.

The failure signature that motivated this work:

  • events.spec.ts: duplicate replicator:join (same peer hash appears twice)
  • migration.spec.ts (migration-8-9): replication never completes within waitForResolved default timeout
  • replicate.spec.ts: timing sensitivity in waitForReplicator waits until maturity

This PR makes the pubsub subscription handshake resilient to debounce timing, and makes shared-log replication-info application deterministic (per-peer serialization + retry on startup) so those CI failures stop occurring.

Root Causes

1) Pubsub: subscribe debounce window could drop/omit state

  • DirectSub.subscribe() is debounced; the topic is only fully initialized when _subscribe() runs.
  • Incoming remote Subscribe handling drops the subscription if the topic hasn't been initialized locally yet.
  • requestSubscribers() overlap responses were based only on this.subscriptions, which can be empty during the debounce window.

Net effect: peers can connect during the debounce window and fail to discover each other's subscriptions promptly or at all.
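A minimal model of the race, with hypothetical names (not the real DirectSub internals): because `subscribe()` is debounced, per-topic state only exists once the debounce fires, so a remote Subscribe arriving inside that window finds nothing and is dropped.

```typescript
class DebouncedSub {
  topics = new Map<string, Set<string>>(); // topic -> known remote subscribers
  private pending = new Set<string>();

  subscribe(topic: string) {
    this.pending.add(topic); // the real code schedules _subscribe() via debounce
  }

  flushDebounce() {
    // Simulates the debounce timer firing and _subscribe() running.
    for (const topic of this.pending) this.topics.set(topic, new Set());
    this.pending.clear();
  }

  onRemoteSubscribe(topic: string, peer: string) {
    const subs = this.topics.get(topic);
    if (!subs) return; // topic not initialized yet: the update is lost
    subs.add(peer);
  }
}

const sub = new DebouncedSub();
sub.subscribe("chat");
sub.onRemoteSubscribe("chat", "peerA"); // lands inside the debounce window
sub.flushDebounce();
console.log(sub.topics.get("chat")!.size); // 0: peerA's subscription was dropped
```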

2) Shared-log: replication-info handling was racy / lossy

  • Replication-info messages could be processed concurrently for the same peer (fire-and-forget async work).
  • addReplicationRange() emitted replicator:join based on a TOCTOU check of index state. Two concurrent handlers could both decide a peer is “new” and emit duplicates.
  • During startup, replication-info application could fail with NotStarted/IndexNotStarted and the update would be dropped (no retry), leaving the handshake incomplete and migration replication stuck.
  • Startup relied on a snapshot of current pubsub state; it didn’t actively backfill subscriber info.
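The TOCTOU on `replicator:join` can be reduced to this sketch (hypothetical names): two handlers for the same peer both check "is this peer new?" before either records it, and both emit. The real code has an async gap between check and commit; here the interleaving is shown explicitly.

```typescript
const known = new Set<string>();
const emitted: string[] = [];

// Step 1 of the handler: check whether the peer looks new.
function checkIsNew(peer: string): boolean {
  return !known.has(peer);
}

// Step 2 (after an async gap in the real code): record state and maybe emit.
function commit(peer: string, wasNew: boolean) {
  known.add(peer);
  if (wasNew) emitted.push(`replicator:join ${peer}`);
}

// Two concurrent handlers interleave: both check before either commits.
const a = checkIsNew("peerA");
const b = checkIsNew("peerA");
commit("peerA", a);
commit("peerA", b);
console.log(emitted.length); // 2: duplicate replicator:join for the same peer
```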

3) Migration + role encoding issues (v8 compat)

  • v8 compat path could fail to respond with role info due to a missing await, a getRole() implementation that threw in multi-segment cases, and a role -> replication-range conversion that used normalized floats rather than u32 coordinate space.
  • RoleReplicationSegment offset encoding used the wrong numerator.

4) A genuine test flake under heavy load

While re-running the full suite, the u32-simple > sharding > "handles peer joining and leaving multiple times" test intermittently failed because replica convergence could exceed waitForResolved's default 10 s timeout on a loaded machine.

Fixes

Pubsub (packages/transport/pubsub/src/index.ts)

  • Eagerly initialize per-topic state in subscribe() so early remote Subscribe messages aren’t dropped.
  • Include topics pending in the debounce window when responding to requestSubscribers() overlap queries.
  • If a pending subscribe is cancelled via unsubscribe() before debounce fires, clean up the eagerly created topic state.
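The three pubsub fixes can be sketched together on the same simplified model as above (hypothetical names, not the actual DirectSub code): topic state is created eagerly in `subscribe()`, and torn down again if `unsubscribe()` cancels the pending subscribe before the debounce fires.

```typescript
class EagerSub {
  topics = new Map<string, Set<string>>(); // topic -> known remote subscribers
  private pending = new Set<string>();

  subscribe(topic: string) {
    // Eagerly initialize per-topic state so early remote Subscribes are kept,
    // even though the full _subscribe() work is still debounced.
    if (!this.topics.has(topic)) this.topics.set(topic, new Set());
    this.pending.add(topic);
  }

  unsubscribe(topic: string) {
    // Cancelled before the debounce fired: clean up the eager topic state.
    if (this.pending.delete(topic)) this.topics.delete(topic);
  }

  onRemoteSubscribe(topic: string, peer: string) {
    this.topics.get(topic)?.add(peer); // no longer dropped mid-debounce
  }
}

const eager = new EagerSub();
eager.subscribe("chat");
eager.onRemoteSubscribe("chat", "peerA"); // arrives during the debounce window
console.log(eager.topics.get("chat")!.size); // 1: subscription retained

eager.subscribe("temp");
eager.unsubscribe("temp"); // cancelled before debounce
console.log(eager.topics.has("temp")); // false: no stale topic entry
```

Responding to `requestSubscribers()` from the union of initialized and pending topics follows the same idea: the pending set is consulted instead of only the finalized subscription map.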

Shared-log (packages/programs/data/shared-log/src/index.ts)

  • Serialize replication-info application per peer (per-peer promise chain) so the same sender can’t concurrently run addReplicationRange().
  • Track latest replication-info timestamp per peer and ignore older updates.
  • If replication-info can’t be applied yet due to NotStarted/IndexNotStarted/Closed conditions, store the latest message per peer and flush after open instead of dropping it.
  • After open, call pubsub.requestSubscribers(this.topic) and backfill handleSubscriptionChange(...) from the current subscriber snapshot.
  • Make replicator:join idempotent: emit on transition “not known replicator -> has segments”, including restart/full-state announce cases.
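The first two items above can be sketched as a per-peer promise chain with a latest-timestamp filter (hypothetical helper names; the real shared-log code differs in detail):

```typescript
const chains = new Map<string, Promise<void>>();
const latestTs = new Map<string, number>();
const applied: number[] = [];

function onReplicationInfo(
  peer: string,
  ts: number,
  apply: () => Promise<void>,
): Promise<void> {
  // Chain this peer's handling onto its previous handling, so the same
  // sender can never run apply() (i.e. addReplicationRange) concurrently.
  const prev = chains.get(peer) ?? Promise.resolve();
  const next = prev.then(async () => {
    const latest = latestTs.get(peer);
    if (latest !== undefined && ts <= latest) return; // stale update: ignore
    latestTs.set(peer, ts);
    await apply();
  });
  chains.set(peer, next.catch(() => undefined)); // keep the chain alive on errors
  return next;
}

void (async () => {
  // Two concurrent messages from the same peer run one at a time; the one
  // with the newer timestamp wins regardless of arrival order.
  await Promise.all([
    onReplicationInfo("peerA", 2, async () => { applied.push(2); }),
    onReplicationInfo("peerA", 1, async () => { applied.push(1); }),
  ]);
  console.log(applied); // [2]
})();
```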

Migration + role fixes

  • getRole() is now best-effort when multiple local segments exist (selects the widest segment).
  • Ensure role responses await getRole().
  • Convert role -> replication segments by denormalizing factor/offset into u32 coordinate space.
  • Fix RoleReplicationSegment offset encoding.
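The role -> replication-segment conversion amounts to denormalizing floats in [0, 1] into u32 coordinates. A sketch of the idea (hypothetical function and field names, not the actual conversion code):

```typescript
const U32_MAX = 0xffffffff;

// A v8 role expresses its segment as normalized floats; replication ranges
// live in u32 coordinate space, so both fields must be scaled up.
function denormalize(role: { offset: number; factor: number }) {
  return {
    offset: Math.round(role.offset * U32_MAX) >>> 0,
    width: Math.round(role.factor * U32_MAX) >>> 0,
  };
}

console.log(denormalize({ offset: 0.5, factor: 0.25 }));
// { offset: 2147483648, width: 1073741824 }
```

Passing the normalized floats through unscaled (the old behavior) would produce near-zero-width ranges in u32 space, which is why migration replication stalled.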

Tests

  • Fix migration test’s v8 mock to reply using the opened instance (db1.log.rpc).
  • Make the "waitForReplicator waits until maturity" test assert on remaining maturity time derived from the observed segment timestamp.
  • Harden checkReplicas() to avoid false negatives under load by using a slightly longer timeout + lower polling rate.
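The checkReplicas() hardening follows the usual retry-until-resolved pattern. A self-contained sketch (signature assumed, modeled loosely on the waitForResolved helper the tests use):

```typescript
// Retry an assertion until it passes, with configurable timeout and poll rate.
async function waitForResolved(
  fn: () => void | Promise<void>,
  opts: { timeout?: number; delayInterval?: number } = {},
): Promise<void> {
  const { timeout = 10_000, delayInterval = 100 } = opts;
  const deadline = Date.now() + timeout;
  for (;;) {
    try {
      await fn();
      return;
    } catch (err) {
      if (Date.now() >= deadline) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayInterval));
    }
  }
}

void (async () => {
  // Convergence can exceed 10 s on a loaded machine, so the hardened check
  // bumps the timeout and polls less aggressively to reduce overhead.
  let replicaCount = 0;
  setTimeout(() => { replicaCount = 2; }, 50);
  await waitForResolved(
    () => { if (replicaCount !== 2) throw new Error("not converged"); },
    { timeout: 30_000, delayInterval: 200 },
  );
  console.log("converged");
})();
```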

Debug Docs

  • Keep shared-log-debug.md, shared-log-debug-plan.md, and scrap.md as an audit trail for future debugging; these can be removed later once we’re confident.

Verification

  • pnpm run build (PASS)
  • Targeted tests (PASS):
    • replicate:join not emitted on update
    • 8-9, replicates database of 1 entry
    • 9-8, replicates database of 1 entry
    • waitForReplicator waits until maturity
    • handles peer joining and leaving multiple times
  • Full CI Part 4 suite (PASS):
    • pnpm run test:ci:part-4
    • @peerbit/shared-log: 1743 passing
    • @peerbit/shared-log-proxy: 1 passing

All checks passed when running on my fork; results on this repo with no changes are TBD 🤞

Commits

  • 35963a7ac tests: checkpoint
  • a2ab6d31d shared-log: harden replication handshake + reduce flakes
  • bb7e7b031 docs: record shared-log debug commit hashes
  • 67125c13f docs: add shared-log debugging scrap notes

Additional Notes:

There is another PR at Faolain#7; the summary below compares it with this PR.

[Screenshot 2026-02-06 at 5:03:24 PM]

Context

PR dao-xyz#589 (pubsub debounce hardening) tightened peer subscription timing and surfaced latent shared-log races/timeouts in CI (duplicate replicator:join + migration-8-9 replication stalls).

What Was Implemented (see shared-log-debug.md for the full investigation log)

Pubsub (packages/transport/pubsub/src/index.ts)
- Eagerly initialize per-topic state in subscribe() so early remote Subscribe messages aren't dropped.
- Include topics pending in the debounce window when responding to requestSubscribers overlap queries.
- If subscribe() is cancelled via unsubscribe() before the debounce fires, clean up the eager topic state to avoid stale topic entries.

Shared-log (packages/programs/data/shared-log/src/index.ts)
- Serialize replication-info application per peer (promise chain) to eliminate TOCTOU around addReplicationRange() and replicator:join emission.
- Track latest replication-info timestamp per peer and ignore older messages.
- If replication-info can't be applied yet (NotStartedError / index-not-ready), keep the latest per-peer message and flush after open instead of dropping it.
- After open, requestSubscribers(topic) and backfill handleSubscriptionChange() from the current subscriber snapshot to avoid missed subscribe windows.
- Make replicator:join idempotent: emit on the transition "not known replicator -> has segments", including the restart/full-state announce case.
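The idempotent emit rule in the last bullet reduces to tracking a transition, not a message. A sketch with hypothetical names:

```typescript
const segments = new Map<string, number>(); // peer -> number of known segments
const joins: string[] = [];

// Emit replicator:join only on the "no segments -> has segments" transition,
// so repeated or full-state announcements for a known replicator are no-ops.
function applyReplicationInfo(peer: string, segmentCount: number) {
  const had = (segments.get(peer) ?? 0) > 0;
  segments.set(peer, segmentCount);
  const has = segmentCount > 0;
  if (!had && has) joins.push(peer);
}

applyReplicationInfo("peerA", 1); // join emitted
applyReplicationInfo("peerA", 2); // update only: no duplicate join
applyReplicationInfo("peerA", 0); // peer dropped all segments
applyReplicationInfo("peerA", 1); // restart / rejoin: join emitted again
console.log(joins); // ["peerA", "peerA"]
```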

Migration + Role fixes
- v8 compat: always respond with a role; getRole() is best-effort when multiple segments exist, and request handlers correctly await getRole().
- ResponseRoleMessage -> replication-info conversion denormalizes factor/offset into u32 coordinate space.
- Fix RoleReplicationSegment offset encoding bug (offset numerator incorrectly used factor).

Tests (packages/programs/data/shared-log/test/*)
- Fix migration v8 mock to respond using the actually opened instance (db1.log.rpc).
- Make the maturity timing assertion robust by using remaining maturity time based on the observed segment timestamp.

Additional flake hardening (this commit)
- checkReplicas() now uses a slightly longer waitForResolved timeout with lower polling rate to avoid false negatives under heavy load (u32-simple sharding convergence could exceed the default 10s).
  File: packages/programs/data/shared-log/test/utils.ts

Docs
- shared-log-debug-plan.md updated with current verification status, the observed sharding flake, and the test hardening applied.

Verification
- pnpm run build: PASS
- Targeted shared-log tests: PASS
  - replicate:join not emitted on update
  - 8-9, replicates database of 1 entry
  - 9-8, replicates database of 1 entry
  - waitForReplicator waits until maturity
  - handles peer joining and leaving multiple times
- Full CI Part 4 suite: PASS
  - node ./node_modules/aegir/src/index.js run test --roots ./packages/programs/data/shared-log ./packages/programs/data/shared-log/proxy -- -t node --no-build
  - @peerbit/shared-log: 1743 passing
  - @peerbit/shared-log-proxy: 1 passing

Contains the terminal transcript / scratch notes from the shared-log + pubsub investigation (with account/session details redacted).

Faolain (Contributor, Author) commented Feb 6, 2026

[Screenshot 2026-02-06 at 6:19:36 AM]

Re: the db fix
[Screenshot 2026-02-06 at 6:20:57 AM]


Faolain (Contributor, Author) commented Feb 6, 2026

@marcus-pousette please review 🙏

I can port my tests from #589 and remove the scrap .md files I used with AI tools to track down this problem (feel free to read the notes, of course). This PR seems crucial to ensuring #582 succeeds and that we don't have finicky race conditions going forward.

I know you made significant changes in #582 (funnily enough, I had no idea you had; maybe some of our fixes overlap, and hopefully mine here and in #589 are small enough that you can check whether the issue still exists in your implementation). Maybe these can go in ahead if the fanout work will take a while?

Update

The attached PR #596 should include everything in #593, plus better shared-log additions from Faolain#7, which perhaps means this PR should be closed.
