PubSub pendingSubscriptions + SharedLog replication handshake hardening #596

Open

Faolain wants to merge 10 commits into dao-xyz:master from Faolain:codex/pr593+faolain-pending-subscriptions
Conversation

Faolain (Contributor) commented Feb 7, 2026

Summary

This PR is an integration of #593 (shared-log correctness) and PR Faolain#7 (pending-subscription semantics).

It addresses a class of races caused by debounced pubsub.subscribe(). During the debounce window, a node has expressed local intent to subscribe, but historically it would:

  • not fully track the topic yet (dropping inbound topic traffic), and/or
  • not advertise the pending subscription in subscribe-time handshakes, and/or
  • drop strict deliveries targeted at it (because it wasn't yet in subscriptions).

Those gaps can cascade into shared-log replication handshake flakes and incorrect/unstable replication state under suite load.

This PR combines the best parts of #593 (shared-log correctness) with the stronger PubSub model from PR Faolain#7 (pending-subscription semantics plus regression tests).


What This PR Fixes

1) PubSub: Pending Subscriptions Are First-Class

Problem: DirectSub.subscribe() is debounced. Between the call to subscribe(topic) and the eventual debounced _subscribe([topic]), the node has local interest but was treated as "not subscribed".

Fix: Introduce explicit pendingSubscriptions tracking and treat it as local interest.

Key changes (packages/transport/pubsub/src/index.ts):

  • Add pendingSubscriptions: Set<string>.
  • subscribe(topic):
    • add topic to pendingSubscriptions
    • eagerly initializeTopic(topic) (so we track the topic immediately and don't drop inbound subscription traffic)
  • _subscribe():
    • remove topic from pendingSubscriptions once the real subscription is applied
  • unsubscribe(topic):
    • remove topic from pendingSubscriptions
    • if cancelling a debounced subscribe, fully clean up eager topic state (avoid "ghost topics"), including:
      • peerToTopic
      • lastSubscriptionMessages
      • topics / topicsToPeers
  • Strict delivery: for PubSubData with strict: true, treat pending topics as local interest so strict deliveries are not dropped during the debounce window.
  • Subscribe-time handshake: when responding to Subscribe{requestSubscribers:true}, include pending topics in the overlap response.
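The pending-subscription bookkeeping above can be sketched as a minimal model. This is illustrative only (the names `PendingSubModel`, `flushDebounced`, and `hasLocalInterest` are assumptions, not the actual DirectSub API):

```typescript
// Minimal model of the pending-subscription bookkeeping
// (illustrative; not the actual DirectSub class).
class PendingSubModel {
  subscriptions = new Set<string>();
  pendingSubscriptions = new Set<string>();
  topics = new Set<string>(); // eagerly tracked per-topic state

  subscribe(topic: string) {
    this.pendingSubscriptions.add(topic);
    this.topics.add(topic); // eager initializeTopic(): track immediately
  }

  // stands in for the debounced _subscribe([topics]) firing
  flushDebounced() {
    for (const topic of this.pendingSubscriptions) {
      this.subscriptions.add(topic);
    }
    this.pendingSubscriptions.clear();
  }

  unsubscribe(topic: string) {
    const wasPending = this.pendingSubscriptions.delete(topic);
    // Cancelling a debounced subscribe: undo the eager topic init so no
    // ghost topic state remains (only when no real subscription exists).
    if (wasPending && !this.subscriptions.has(topic)) {
      this.topics.delete(topic);
    }
    this.subscriptions.delete(topic);
  }

  // strict deliveries are honored for real OR pending subscriptions
  hasLocalInterest(topic: string): boolean {
    return (
      this.subscriptions.has(topic) || this.pendingSubscriptions.has(topic)
    );
  }
}
```

The key property is that `hasLocalInterest` is true from the moment `subscribe()` is called, while `unsubscribe()` during the debounce window rolls back the eager state entirely.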

Why this matters:

  • It closes BUG4 (pending subscribe must receive strict deliveries).
  • It ensures subscribe-time backfill/handshake traffic doesn't miss a peer just because the peer is still inside the debounce window.

2) Shared-log: Replication Handshake More Deterministic Under Load

This PR keeps #593's shared-log correctness work (queueing/buffering replication-info, backfill, v8 conversion/role fixes, etc.) and adds a small but important ordering/hardening layer.

Key changes:

packages/programs/data/shared-log/src/index.ts

  • afterOpen():

    • still calls pubsub.requestSubscribers(this.topic) for backfill
    • processes pre-existing subscribers sequentially via for...of + await this.handleSubscriptionChange(...) (instead of fire-and-forget), then flushes buffered replication-info.
  • handleSubscriptionChange(...):

    • When a peer subscribes, we send:
      • AllReplicatingSegmentsMessage
      • (v8) ResponseRoleMessage
      • RequestReplicationInfoMessage
    • These sends are now best-effort ordered, but also bounded:
      • we await Promise.race([ rpc.send(...), delay(10s) ])
      • avoids hangs if delivery/acks stall
      • reduces ordering races that show up only in full-suite runs
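The bounded-await pattern can be sketched as follows. `boundedSend` and the `send` callback are illustrative stand-ins; the real code races `rpc.send(...)` against a delay:

```typescript
// Resolve after `ms` milliseconds.
const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

// Await a send, but never longer than `timeoutMs` (10s in the PR).
// Whichever settles first wins; a stalled delivery cannot hang the caller.
async function boundedSend(
  send: () => Promise<void>,
  timeoutMs = 10_000,
): Promise<"sent" | "timeout"> {
  return Promise.race([
    send().then(() => "sent" as const),
    delay(timeoutMs).then(() => "timeout" as const),
  ]);
}
```

Note that on timeout the underlying send is not cancelled, only no longer awaited, which matches the "best-effort ordered, but bounded" description above.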

This specifically stabilizes the existing shared-log test:

  • replication.spec.ts: "applies replication segments even if waitFor() fails"

3) Test Flake Mitigations (Targeted)

Two tests were observed to be stable in isolation but flaky in long-suite execution due to timing sensitivity. This PR increases only the tightest waits.

  • packages/programs/data/shared-log/test/leader.spec.ts:

    • aligns waitForResolved timeouts with the test's 120s timeout to stabilize the reachableOnly coverage checks
    • increases polling interval to reduce busy-looping under load
  • packages/programs/data/shared-log/test/sharding.spec.ts:

    • increases the final convergence waits for calculateTotalParticipation() to { timeout: 30_000, delayInterval: 200 }

4) Tooling: Run Tests In Git Worktrees

  • .aegir.js updated to handle git worktrees where .git is a file (not a directory).

This is not a runtime change, but it enables running the new regressions in secondary worktrees (which is how these comparisons were validated).


Why This Is Better Than #593 Alone Or PR Faolain#7 Alone

Compared to #593 alone

Compared to PR Faolain#7 alone


Tests Added / Updated

PubSub regression suite (added)

  • packages/transport/pubsub/test/bug1-initializeTopic-race.spec.ts
    • proves topics.has(topic) is true immediately after subscribe(topic) (before debounce fires)
  • packages/transport/pubsub/test/bug2-requestSubscribers-pendingSubscribe.spec.ts
    • includes a skipped "design probe" documenting a future enhancement
    • includes an active design guard ensuring we do not start tracking a topic solely due to receiving remote Subscribe traffic
  • packages/transport/pubsub/test/bug3-subscribe-then-unsubscribe-before-debounce.spec.ts
    • proves canceling a debounced subscribe does not leave ghost topic state and does not advertise subscriptions
  • packages/transport/pubsub/test/bug4-pending-subscribe-receives-strict-pubsubdata.spec.ts
    • proves strict SeekDelivery PubSubData is delivered during the pending-subscribe debounce window (BUG4)

Shared-log

  • No new shared-log tests added here, but the PR stabilizes existing suite behavior and adjusts two timing-sensitive waits.

Evidence / Repro (Master vs #593 vs This PR)

1) Master + regressions (proves BUG1 exists)

  • pnpm --filter @peerbit/pubsub test
  • Result: FAIL
  • Failure: BUG 1 ... topics.has(topic) is true immediately after subscribe() (deterministic)

2) #593 + regressions (proves BUG4 remains)

  • pnpm --filter @peerbit/pubsub test
  • Result: FAIL
  • Failure: BUG 4 ... pending subscribe receives strict PubSubData

3) This PR

  • pnpm --filter @peerbit/pubsub test
    • Result: PASS (46 passing, 1 pending)
  • pnpm run test:ci:part-4 (shared-log + shared-log-proxy)
    • Result: PASS, run twice back-to-back
    • @peerbit/shared-log: 1743 passing
    • @peerbit/shared-log-proxy: 1 passing

How To Run

  • PubSub:
    • pnpm --filter @peerbit/pubsub test
  • Shared-log CI slice:
    • pnpm run test:ci:part-4

Behavioral Changes / Risk Notes

  • Strict deliveries during pending subscribe:

    • New behavior: strict PubSubData addressed to a node is processed if the node has either a real subscription or a pending subscription.
    • This matches the intuitive semantics of calling subscribe() (local intent) and is covered by BUG4 regression.
  • Cancel path cleanup is deeper:

    • When unsubscribe() cancels a pending (debounced) subscribe, we now fully undo eager topic initialization to prevent ghost topic state.
    • This only triggers when no real subscription exists.
  • Shared-log RPC send awaits are bounded:

    • We wait up to 10s for key replication handshake messages to reduce ordering races.
    • The wait is bounded (won't hang indefinitely) and is aborted on close.

Claims-to-Tests Matrix

  • Claim: subscribe() tracks topic immediately (no debounce window race)
    Evidence: bug1-initializeTopic-race.spec.ts (master fails, this PR passes)
  • Claim: cancelled pending subscribe leaves no ghost topic state
    Evidence: bug3-subscribe-then-unsubscribe-before-debounce.spec.ts
  • Claim: strict deliveries are not dropped during pending subscribe
    Evidence: bug4-pending-subscribe-receives-strict-pubsubdata.spec.ts (#593 fails, this PR passes)
  • Claim: non-subscribers do not start tracking topics just by observing remote Subscribe traffic
    Evidence: bug2-requestSubscribers-pendingSubscribe.spec.ts (design guard)
  • Claim: shared-log applies replication segments even when waitFor() rejects
    Evidence: replication.spec.ts (existing) + pnpm run test:ci:part-4 pass
  • Claim: shared-log + proxy CI slice is stable under load
    Evidence: pnpm run test:ci:part-4 passes twice back-to-back

Context

PR dao-xyz#589 (pubsub debounce hardening) tightened peer subscription timing and surfaced latent shared-log races/timeouts in CI (duplicate replicator:join + migration-8-9 replication stalls).

What Was Implemented (see shared-log-debug.md for the full investigation log)

Pubsub (packages/transport/pubsub/src/index.ts)
- Eagerly initialize per-topic state in subscribe() so early remote Subscribe messages aren't dropped.
- Include topics pending in the debounce window when responding to requestSubscribers overlap queries.
- If subscribe() is cancelled via unsubscribe() before the debounce fires, clean up the eager topic state to avoid stale topic entries.

Shared-log (packages/programs/data/shared-log/src/index.ts)
- Serialize replication-info application per peer (promise chain) to eliminate TOCTOU around addReplicationRange() and replicator:join emission.
- Track latest replication-info timestamp per peer and ignore older messages.
- If replication-info can't be applied yet (NotStartedError / index-not-ready), keep the latest per-peer message and flush after open instead of dropping it.
- After open, requestSubscribers(topic) and backfill handleSubscriptionChange() from the current subscriber snapshot to avoid missed subscribe windows.
- Make replicator:join idempotent: emit on the transition "not known replicator -> has segments", including the restart/full-state announce case.
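The per-peer serialization can be sketched with a promise chain keyed by peer id (a sketch of the pattern described above; `enqueuePerPeer` and its shape are assumptions, not the shared-log API):

```typescript
// One in-flight chain per peer: tasks for the same peer apply strictly
// in arrival order, eliminating the TOCTOU between check and apply.
const chains = new Map<string, Promise<void>>();

function enqueuePerPeer(
  peer: string,
  task: () => Promise<void>,
): Promise<void> {
  const prev = chains.get(peer) ?? Promise.resolve();
  // Chain onto the previous task; swallow its error so one failed
  // application does not poison the rest of the chain.
  const next = prev.catch(() => {}).then(task);
  chains.set(peer, next);
  return next;
}
```

Tasks for different peers still run concurrently; only messages from the same peer are forced into sequence.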

Migration + Role fixes
- v8 compat: always respond with a role; getRole() is best-effort when multiple segments exist, and request handlers correctly await getRole().
- ResponseRoleMessage -> replication-info conversion denormalizes factor/offset into u32 coordinate space.
- Fix RoleReplicationSegment offset encoding bug (offset nominator incorrectly used factor).
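The denormalization step can be sketched as follows. This is a hypothetical helper: the field names and the clamping are assumptions; only the scaling of normalized factor/offset values into u32 coordinate space comes from the PR description:

```typescript
const U32_MAX = 0xffffffff;

// Map a normalized [0, 1] offset/factor pair into u32 coordinates.
function denormalizeSegment(offset: number, factor: number) {
  const toU32 = (x: number) =>
    Math.round(Math.min(1, Math.max(0, x)) * U32_MAX) >>> 0;
  return { offset: toU32(offset), factor: toU32(factor) };
}
```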

Tests (packages/programs/data/shared-log/test/*)
- Fix migration v8 mock to respond using the actually opened instance (db1.log.rpc).
- Make the maturity timing assertion robust by using remaining maturity time based on the observed segment timestamp.

Additional flake hardening (this commit)
- checkReplicas() now uses a slightly longer waitForResolved timeout with lower polling rate to avoid false negatives under heavy load (u32-simple sharding convergence could exceed the default 10s).
  File: packages/programs/data/shared-log/test/utils.ts

Docs
- shared-log-debug-plan.md updated with current verification status, the observed sharding flake, and the test hardening applied.

Verification
- pnpm run build: PASS
- Targeted shared-log tests: PASS
  - replicate:join not emitted on update
  - 8-9, replicates database of 1 entry
  - 9-8, replicates database of 1 entry
  - waitForReplicator waits until maturity
  - handles peer joining and leaving multiple times
- Full CI Part 4 suite: PASS
  - node ./node_modules/aegir/src/index.js run test --roots ./packages/programs/data/shared-log ./packages/programs/data/shared-log/proxy -- -t node --no-build
  - @peerbit/shared-log: 1743 passing
  - @peerbit/shared-log-proxy: 1 passing
An attached file contains the terminal transcript / scratch notes from the shared-log + pubsub investigation (with account/session details redacted).
Faolain (Author) commented Feb 7, 2026

Test Results for Master vs #596

[Screenshots: test results, master vs #596]

What these tests mean

[Screenshot: explanation of what these tests mean]

Faolain (Author) commented Feb 7, 2026

Visual Summary of Bug Fix:

clip.mp4 [video attachment]

Cascade:
[Screenshot]

Bug 3 test (doesn't apply to master, but applies to this branch because cleanup is required):
[Screenshot]

Bug 1 vs Bug 4:
[Screenshot]

Faolain (Author) commented Feb 7, 2026

Re: the initial CI failure (addressed in a subsequent commit)

[Screenshots: CI failure and subsequent fix]

The sequential replication-info processing and bounded sends introduced
in this PR can take longer under CI load. Increase the test-level
timeout to 300s and inner waitForResolved to 180s with 500ms polling
to avoid flaky convergence failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Faolain (Author) commented Feb 7, 2026

AI Review

[Screenshots: AI review output]

Bug 2 Spec description

It will be unskipped in this PR; it too passes on this branch and fails on master.

[Screenshots: Bug 2 spec description and result]
