PubSub pendingSubscriptions + SharedLog replication handshake hardening #596

Open

Faolain wants to merge 10 commits into dao-xyz:master from Faolain:codex/pr593+faolain-pending-subscriptions
Conversation

Faolain (Contributor) commented Feb 7, 2026

Summary

This PR is an integration of #593 (shared-log correctness) and PR Faolain#7 (pending-subscription semantics).

It addresses a class of races caused by debounced pubsub.subscribe(). During the debounce window, a node has expressed local intent to subscribe, but historically it would:

  • not fully track the topic yet (dropping inbound topic traffic), and/or
  • not advertise the pending subscription in subscribe-time handshakes, and/or
  • drop strict deliveries targeted at it (because it wasn't yet in subscriptions).

Those gaps can cascade into shared-log replication handshake flakes and incorrect/unstable replication state under suite load.

This PR combines the best parts of #593 (shared-log correctness) with the stronger PubSub model from PR Faolain#7 (pending-subscription semantics plus regression tests).


What This PR Fixes

1) PubSub: Pending Subscriptions Are First-Class

Problem: DirectSub.subscribe() is debounced. Between the call to subscribe(topic) and the eventual debounced _subscribe([topic]), the node has local interest but was treated as "not subscribed".

Fix: Introduce explicit pendingSubscriptions tracking and treat it as local interest.

Key changes (packages/transport/pubsub/src/index.ts):

  • Add pendingSubscriptions: Set<string>.
  • subscribe(topic):
    • add topic to pendingSubscriptions
    • eagerly initializeTopic(topic) (so we track the topic immediately and don't drop inbound subscription traffic)
  • _subscribe():
    • remove topic from pendingSubscriptions once the real subscription is applied
  • unsubscribe(topic):
    • remove topic from pendingSubscriptions
    • if cancelling a debounced subscribe, fully clean up eager topic state (avoid "ghost topics"), including:
      • peerToTopic
      • lastSubscriptionMessages
      • topics / topicsToPeers
  • Strict delivery: for PubSubData with strict: true, treat pending topics as local interest so strict deliveries are not dropped during the debounce window.
  • Subscribe-time handshake: when responding to Subscribe{requestSubscribers:true}, include pending topics in the overlap response.
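The pending-subscription bookkeeping above can be sketched as a minimal model. This is illustrative only (the names `PendingSubModel`, `flushDebounced`, and `hasLocalInterest` are assumptions, not the actual DirectSub API):

```typescript
// Minimal model of the pending-subscription bookkeeping
// (illustrative; not the actual DirectSub class).
class PendingSubModel {
  subscriptions = new Set<string>();
  pendingSubscriptions = new Set<string>();
  topics = new Set<string>(); // eagerly tracked per-topic state

  subscribe(topic: string) {
    this.pendingSubscriptions.add(topic);
    this.topics.add(topic); // eager initializeTopic(): track immediately
  }

  // stands in for the debounced _subscribe([topics]) firing
  flushDebounced() {
    for (const topic of this.pendingSubscriptions) {
      this.subscriptions.add(topic);
    }
    this.pendingSubscriptions.clear();
  }

  unsubscribe(topic: string) {
    const wasPending = this.pendingSubscriptions.delete(topic);
    // Cancelling a debounced subscribe: undo the eager topic init so no
    // ghost topic state remains (only when no real subscription exists).
    if (wasPending && !this.subscriptions.has(topic)) {
      this.topics.delete(topic);
    }
    this.subscriptions.delete(topic);
  }

  // strict deliveries are honored for real OR pending subscriptions
  hasLocalInterest(topic: string): boolean {
    return (
      this.subscriptions.has(topic) || this.pendingSubscriptions.has(topic)
    );
  }
}
```

The key property is that `hasLocalInterest` is true from the moment `subscribe()` is called, while `unsubscribe()` during the debounce window rolls back the eager state entirely.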

Why this matters:

  • It closes BUG4 (pending subscribe must receive strict deliveries).
  • It ensures subscribe-time backfill/handshake traffic doesn't miss a peer just because the peer is still inside the debounce window.

2) Shared-log: Replication Handshake More Deterministic Under Load

This PR keeps #593's shared-log correctness work (queueing/buffering replication-info, backfill, v8 conversion/role fixes, etc.) and adds a small but important ordering/hardening layer.

Key changes:

packages/programs/data/shared-log/src/index.ts

  • afterOpen():

    • still calls pubsub.requestSubscribers(this.topic) for backfill
    • processes pre-existing subscribers sequentially via for...of + await this.handleSubscriptionChange(...) (instead of fire-and-forget), then flushes buffered replication-info.
  • handleSubscriptionChange(...):

    • When a peer subscribes, we send:
      • AllReplicatingSegmentsMessage
      • (v8) ResponseRoleMessage
      • RequestReplicationInfoMessage
    • These sends are now best-effort ordered, but also bounded:
      • we await Promise.race([ rpc.send(...), delay(10s) ])
      • avoids hangs if delivery/acks stall
      • reduces ordering races that show up only in full-suite runs
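The bounded-await pattern can be sketched as follows. `boundedSend` and the `send` callback are illustrative stand-ins; the real code races `rpc.send(...)` against a delay:

```typescript
// Resolve after `ms` milliseconds.
const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

// Await a send, but never longer than `timeoutMs` (10s in the PR).
// Whichever settles first wins; a stalled delivery cannot hang the caller.
async function boundedSend(
  send: () => Promise<void>,
  timeoutMs = 10_000,
): Promise<"sent" | "timeout"> {
  return Promise.race([
    send().then(() => "sent" as const),
    delay(timeoutMs).then(() => "timeout" as const),
  ]);
}
```

Note that on timeout the underlying send is not cancelled, only no longer awaited, which matches the "best-effort ordered, but bounded" description above.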

This specifically stabilizes the existing shared-log test:

  • replication.spec.ts: "applies replication segments even if waitFor() fails"

3) Test Flake Mitigations (Targeted)

Two tests were observed to be stable in isolation but flaky in long-suite execution due to timing sensitivity. This PR increases only the tightest waits.

  • packages/programs/data/shared-log/test/leader.spec.ts:

    • aligns waitForResolved timeouts with the test's 120s timeout to stabilize the reachableOnly coverage checks
    • increases polling interval to reduce busy-looping under load
  • packages/programs/data/shared-log/test/sharding.spec.ts:

    • increases the final convergence waits for calculateTotalParticipation() to { timeout: 30_000, delayInterval: 200 }

4) Tooling: Run Tests In Git Worktrees

  • .aegir.js updated to handle git worktrees where .git is a file (not a directory).

This is not a runtime change, but it enables running the new regressions in secondary worktrees (which is how these comparisons were validated).


Why This Is Better Than #593 Alone Or PR Faolain#7 Alone

Compared to #593 alone

Compared to PR Faolain#7 alone


Tests Added / Updated

PubSub regression suite (added)

  • packages/transport/pubsub/test/bug1-initializeTopic-race.spec.ts
    • proves topics.has(topic) is true immediately after subscribe(topic) (before debounce fires)
  • packages/transport/pubsub/test/bug2-requestSubscribers-pendingSubscribe.spec.ts
    • includes a skipped "design probe" documenting a future enhancement
    • includes an active design guard ensuring we do not start tracking a topic solely due to receiving remote Subscribe traffic
  • packages/transport/pubsub/test/bug3-subscribe-then-unsubscribe-before-debounce.spec.ts
    • proves canceling a debounced subscribe does not leave ghost topic state and does not advertise subscriptions
  • packages/transport/pubsub/test/bug4-pending-subscribe-receives-strict-pubsubdata.spec.ts
    • proves strict SeekDelivery PubSubData is delivered during the pending-subscribe debounce window (BUG4)

Shared-log

  • No new shared-log tests added here, but the PR stabilizes existing suite behavior and adjusts two timing-sensitive waits.

Evidence / Repro (Master vs #593 vs This PR)

1) Master + regressions (proves BUG1 exists)

  • pnpm --filter @peerbit/pubsub test
  • Result: FAIL
  • Failure: BUG 1 ... topics.has(topic) is true immediately after subscribe() (deterministic)

2) #593 + regressions (proves BUG4 remains)

  • pnpm --filter @peerbit/pubsub test
  • Result: FAIL
  • Failure: BUG 4 ... pending subscribe receives strict PubSubData

3) This PR

  • pnpm --filter @peerbit/pubsub test
    • Result: PASS (46 passing, 1 pending)
  • pnpm run test:ci:part-4 (shared-log + shared-log-proxy)
    • Result: PASS, run twice back-to-back
    • @peerbit/shared-log: 1743 passing
    • @peerbit/shared-log-proxy: 1 passing

How To Run

  • PubSub:
    • pnpm --filter @peerbit/pubsub test
  • Shared-log CI slice:
    • pnpm run test:ci:part-4

Behavioral Changes / Risk Notes

  • Strict deliveries during pending subscribe:

    • New behavior: strict PubSubData addressed to a node is processed if the node has either a real subscription or a pending subscription.
    • This matches the intuitive semantics of calling subscribe() (local intent) and is covered by BUG4 regression.
  • Cancel path cleanup is deeper:

    • When unsubscribe() cancels a pending (debounced) subscribe, we now fully undo eager topic initialization to prevent ghost topic state.
    • This only triggers when no real subscription exists.
  • Shared-log RPC send awaits are bounded:

    • We wait up to 10s for key replication handshake messages to reduce ordering races.
    • The wait is bounded (won't hang indefinitely) and is aborted on close.

Claims-to-Tests Matrix

  • Claim: subscribe() tracks topic immediately (no debounce window race)
    Evidence: bug1-initializeTopic-race.spec.ts (master fails, this PR passes)
  • Claim: cancelled pending subscribe leaves no ghost topic state
    Evidence: bug3-subscribe-then-unsubscribe-before-debounce.spec.ts
  • Claim: strict deliveries are not dropped during pending subscribe
    Evidence: bug4-pending-subscribe-receives-strict-pubsubdata.spec.ts (#593 fails, this PR passes)
  • Claim: non-subscribers do not start tracking topics just by observing remote Subscribe traffic
    Evidence: bug2-requestSubscribers-pendingSubscribe.spec.ts (design guard)
  • Claim: shared-log applies replication segments even when waitFor() rejects
    Evidence: replication.spec.ts (existing) + pnpm run test:ci:part-4 pass
  • Claim: shared-log + proxy CI slice is stable under load
    Evidence: pnpm run test:ci:part-4 passes twice back-to-back

Context

PR dao-xyz#589 (pubsub debounce hardening) tightened peer subscription timing and surfaced latent shared-log races/timeouts in CI (duplicate replicator:join + migration-8-9 replication stalls).

What Was Implemented (see shared-log-debug.md for the full investigation log)

Pubsub (packages/transport/pubsub/src/index.ts)
- Eagerly initialize per-topic state in subscribe() so early remote Subscribe messages aren't dropped.
- Include topics pending in the debounce window when responding to requestSubscribers overlap queries.
- If subscribe() is cancelled via unsubscribe() before the debounce fires, clean up the eager topic state to avoid stale topic entries.

Shared-log (packages/programs/data/shared-log/src/index.ts)
- Serialize replication-info application per peer (promise chain) to eliminate TOCTOU around addReplicationRange() and replicator:join emission.
- Track latest replication-info timestamp per peer and ignore older messages.
- If replication-info can't be applied yet (NotStartedError / index-not-ready), keep the latest per-peer message and flush after open instead of dropping it.
- After open, requestSubscribers(topic) and backfill handleSubscriptionChange() from the current subscriber snapshot to avoid missed subscribe windows.
- Make replicator:join idempotent: emit on the transition "not known replicator -> has segments", including the restart/full-state announce case.
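The per-peer serialization can be sketched with a promise chain keyed by peer id (a sketch of the pattern described above; `enqueuePerPeer` and its shape are assumptions, not the shared-log API):

```typescript
// One in-flight chain per peer: tasks for the same peer apply strictly
// in arrival order, eliminating the TOCTOU between check and apply.
const chains = new Map<string, Promise<void>>();

function enqueuePerPeer(
  peer: string,
  task: () => Promise<void>,
): Promise<void> {
  const prev = chains.get(peer) ?? Promise.resolve();
  // Chain onto the previous task; swallow its error so one failed
  // application does not poison the rest of the chain.
  const next = prev.catch(() => {}).then(task);
  chains.set(peer, next);
  return next;
}
```

Tasks for different peers still run concurrently; only messages from the same peer are forced into sequence.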

Migration + Role fixes
- v8 compat: always respond with a role; getRole() is best-effort when multiple segments exist, and request handlers correctly await getRole().
- ResponseRoleMessage -> replication-info conversion denormalizes factor/offset into u32 coordinate space.
- Fix RoleReplicationSegment offset encoding bug (offset nominator incorrectly used factor).
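The denormalization step can be sketched as follows. This is a hypothetical helper: the field names and the clamping are assumptions; only the scaling of normalized factor/offset values into u32 coordinate space comes from the PR description:

```typescript
const U32_MAX = 0xffffffff;

// Map a normalized [0, 1] offset/factor pair into u32 coordinates.
function denormalizeSegment(offset: number, factor: number) {
  const toU32 = (x: number) =>
    Math.round(Math.min(1, Math.max(0, x)) * U32_MAX) >>> 0;
  return { offset: toU32(offset), factor: toU32(factor) };
}
```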

Tests (packages/programs/data/shared-log/test/*)
- Fix migration v8 mock to respond using the actually opened instance (db1.log.rpc).
- Make the maturity timing assertion robust by using remaining maturity time based on the observed segment timestamp.

Additional flake hardening (this commit)
- checkReplicas() now uses a slightly longer waitForResolved timeout with lower polling rate to avoid false negatives under heavy load (u32-simple sharding convergence could exceed the default 10s).
  File: packages/programs/data/shared-log/test/utils.ts

Docs
- shared-log-debug-plan.md updated with current verification status, the observed sharding flake, and the test hardening applied.

Verification
- pnpm run build: PASS
- Targeted shared-log tests: PASS
  - replicate:join not emitted on update
  - 8-9, replicates database of 1 entry
  - 9-8, replicates database of 1 entry
  - waitForReplicator waits until maturity
  - handles peer joining and leaving multiple times
- Full CI Part 4 suite: PASS
  - node ./node_modules/aegir/src/index.js run test --roots ./packages/programs/data/shared-log ./packages/programs/data/shared-log/proxy -- -t node --no-build
  - @peerbit/shared-log: 1743 passing
  - @peerbit/shared-log-proxy: 1 passing
An attached file contains the terminal transcript / scratch notes from the shared-log + pubsub investigation (with account/session details redacted).
Faolain (Author) commented Feb 7, 2026

Test Results for Master vs #596

[Screenshots: test results, master vs #596]

What these tests mean

[Screenshot: explanation of what these tests mean]

Faolain (Author) commented Feb 7, 2026

Visual Summary of Bug Fix:

clip.mp4 [video attachment]

Cascade:
[Screenshot]

Bug 3 test (doesn't apply to master, but applies to this branch because cleanup is required):
[Screenshot]

Bug 1 vs Bug 4:
[Screenshot]

Faolain (Author) commented Feb 7, 2026

Re: the initial CI failure (addressed in a subsequent commit)

[Screenshots: CI failure and subsequent fix]

The sequential replication-info processing and bounded sends introduced
in this PR can take longer under CI load. Increase the test-level
timeout to 300s and inner waitForResolved to 180s with 500ms polling
to avoid flaky convergence failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Faolain (Author) commented Feb 7, 2026

AI Review

[Screenshots: AI review output]

Bug 2 Spec description

It will be unskipped in this PR; it too passes on this branch and fails on master.

[Screenshots: Bug 2 spec description and result]
