Context

PR dao-xyz#589 (pubsub debounce hardening) tightened peer subscription timing and surfaced latent shared-log races/timeouts in CI (duplicate `replicator:join` + migration-8-9 replication stalls).

What Was Implemented (see shared-log-debug.md for the full investigation log)

Pubsub (`packages/transport/pubsub/src/index.ts`)
- Eagerly initialize per-topic state in `subscribe()` so early remote `Subscribe` messages aren't dropped.
- Include topics pending in the debounce window when responding to `requestSubscribers` overlap queries.
- If `subscribe()` is cancelled via `unsubscribe()` before the debounce fires, clean up the eager topic state to avoid stale topic entries.

Shared-log (`packages/programs/data/shared-log/src/index.ts`)
- Serialize replication-info application per peer (promise chain) to eliminate TOCTOU around `addReplicationRange()` and `replicator:join` emission.
- Track the latest replication-info timestamp per peer and ignore older messages.
- If replication-info can't be applied yet (`NotStartedError` / index-not-ready), keep the latest per-peer message and flush it after open instead of dropping it.
- After open, `requestSubscribers(topic)` and backfill `handleSubscriptionChange()` from the current subscriber snapshot to avoid missed subscribe windows.
- Make `replicator:join` idempotent: emit on the transition "not known replicator -> has segments", including the restart/full-state announce case.

Migration + Role fixes
- v8 compat: always respond with a role; `getRole()` is best-effort when multiple segments exist, and request handlers correctly await `getRole()`.
- `ResponseRoleMessage` -> replication-info conversion denormalizes factor/offset into u32 coordinate space.
- Fix `RoleReplicationSegment` offset encoding bug (the offset nominator incorrectly used factor).

Tests (`packages/programs/data/shared-log/test/*`)
- Fix the migration v8 mock to respond using the actually opened instance (`db1.log.rpc`).
- Make the maturity timing assertion robust by using remaining maturity time based on the observed segment timestamp.

Additional flake hardening (this commit)
- `checkReplicas()` now uses a slightly longer `waitForResolved` timeout with a lower polling rate to avoid false negatives under heavy load (u32-simple sharding convergence could exceed the default 10s). File: `packages/programs/data/shared-log/test/utils.ts`

Docs
- shared-log-debug-plan.md updated with current verification status, the observed sharding flake, and the test hardening applied.

Verification
- `pnpm run build`: PASS
- Targeted shared-log tests: PASS
  - replicate:join not emitted on update
  - 8-9, replicates database of 1 entry
  - 9-8, replicates database of 1 entry
  - waitForReplicator waits until maturity
  - handles peer joining and leaving multiple times
- Full CI Part 4 suite: PASS
  - `node ./node_modules/aegir/src/index.js run test --roots ./packages/programs/data/shared-log ./packages/programs/data/shared-log/proxy -- -t node --no-build`
  - @peerbit/shared-log: 1743 passing
  - @peerbit/shared-log-proxy: 1 passing
Contains terminal transcript / scratch notes from the shared-log + pubsub investigation (with account/session redacted).
@marcus-pousette please review 🙏 I can port my tests from #589 and remove my scrap md's, which I used with AIs to track down this problem (feel free to read the notes, of course). This PR seems crucial to ensuring #582 succeeds and that we don't keep finicky race conditions going forward. I know you made significant changes to 582 (funnily enough, I had no idea you did, heh; maybe some of our fixes overlap. Hopefully mine are small enough here and in 589 that you can check whether the issue still exists in your implementation). Maybe these can go in ahead if the fanout will take a while?

Update: The attached PR #596 should include everything in 593 in addition to better shared-log additions from Faolain#7, which maybe should close this PR?


UPDATE: This PR is a subset of #596; 596 should include the changes from here plus further improvements to shared-log, with more tests that fail on master but pass on that branch. I initially worked on 593, and it remains separate for investigation/understanding, but maybe it can be closed in favor of 596?
Why
Pubsub was flaky and had race conditions around topic subscription. #589 solved that, but it surfaced another pre-existing race condition within the shared-log itself, which this PR handles. Read that PR for the full context of the problem (in the dApp it surfaced as an inability to make connections with a peer, a "flaky connection"). Also, please skip to the bottom for a more targeted approach that was tried with PubSub (Faolain#7), which I'm thinking of stacking with this PR if you agree.
Context
PR #589 (pubsub debounce hardening) improves subscription discovery timing. In CI it also surfaced a set of latent shared-log races/timeouts (and a couple of migration/role issues) that could previously hide behind slower peer discovery.
The failure signature that motivated this work:
- events.spec.ts: duplicate `replicator:join` (the same peer hash appears twice)
- migration.spec.ts (migration-8-9): replication never completes within the `waitForResolved` default timeout
- replicate.spec.ts: timing sensitivity in `waitForReplicator waits until maturity`

This PR makes the pubsub subscription handshake resilient to debounce timing, and makes shared-log replication-info application deterministic (per-peer serialization + retry on startup) so those CI failures stop occurring.
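The per-peer serialization idea can be sketched as follows. This is an illustrative, simplified model (the class and method names are hypothetical, not the actual shared-log internals): replication-info messages for one peer are chained so they apply strictly one at a time, a per-peer timestamp discards stale messages, and `replicator:join` fires only on the "not known replicator -> has segments" transition.

```typescript
// Hypothetical sketch, not the real shared-log API.
type ReplicationInfo = { timestamp: number; segments: string[] };

class PeerReplicationState {
  private lastTimestamp = new Map<string, number>();
  private chains = new Map<string, Promise<void>>();
  private segments = new Map<string, string[]>();
  joins: string[] = []; // peers for which replicator:join was emitted

  /** Synchronous core: timestamp guard + idempotent join detection. */
  handle(peer: string, info: ReplicationInfo): void {
    const last = this.lastTimestamp.get(peer);
    if (last !== undefined && info.timestamp <= last) return; // stale, ignore
    this.lastTimestamp.set(peer, info.timestamp);
    const wasKnown = (this.segments.get(peer)?.length ?? 0) > 0;
    this.segments.set(peer, info.segments);
    // Emit only on the transition "not known replicator -> has segments".
    if (!wasKnown && info.segments.length > 0) this.joins.push(peer);
  }

  /** Serialize application per peer: each message waits for the previous one. */
  apply(peer: string, info: ReplicationInfo): Promise<void> {
    const next = (this.chains.get(peer) ?? Promise.resolve()).then(() =>
      this.handle(peer, info),
    );
    this.chains.set(peer, next);
    return next;
  }
}
```

With this shape, two racing handlers for the same announce can no longer both observe "peer unknown" and emit duplicate joins, because the second application runs only after the first has updated the state.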
Root Causes
1) Pubsub: subscribe debounce window could drop/omit state
- `DirectSub.subscribe()` is debounced; the topic is only fully initialized when `_subscribe()` runs.
- `Subscribe` handling drops the subscription if the topic hasn't been initialized locally yet.
- `requestSubscribers()` overlap responses were based only on `this.subscriptions`, which can be empty during the debounce window.

Net effect: peers can connect during the debounce window and fail to discover each other's subscriptions promptly, or at all.
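A minimal sketch of the eager-initialization fix. This is a hypothetical simplified class, not the real `DirectSub`: topic state is created immediately in `subscribe()` so remote `Subscribe` messages arriving during the debounce window are recorded, overlap queries report debounce-pending topics, and a cancelled subscribe cleans the eager state up again.

```typescript
// Hypothetical sketch, not the real DirectSub implementation.
type TopicState = { announced: boolean; remoteSubscribers: Set<string> };

class DebouncedSub {
  topics = new Map<string, TopicState>();
  private pending = new Map<string, ReturnType<typeof setTimeout>>();

  subscribe(topic: string, debounceMs = 50): void {
    // Eager init: remote Subscribe messages for this topic can be recorded
    // immediately, even though our own announce has not been sent yet.
    if (!this.topics.has(topic)) {
      this.topics.set(topic, { announced: false, remoteSubscribers: new Set() });
    }
    if (!this.pending.has(topic)) {
      this.pending.set(
        topic,
        setTimeout(() => {
          this.pending.delete(topic);
          const state = this.topics.get(topic);
          if (state) state.announced = true; // would broadcast Subscribe here
        }, debounceMs),
      );
    }
  }

  unsubscribe(topic: string): void {
    const timer = this.pending.get(topic);
    if (timer !== undefined) {
      clearTimeout(timer); // cancelled before the debounce fired...
      this.pending.delete(topic);
      this.topics.delete(topic); // ...so drop the eager state, no stale entry remains
    }
  }

  /** Remote Subscribe handler: no longer drops early messages. */
  onRemoteSubscribe(topic: string, peer: string): void {
    this.topics.get(topic)?.remoteSubscribers.add(peer);
  }

  /** Overlap responses include topics still pending in the debounce window. */
  subscribedOrPending(topic: string): boolean {
    return this.topics.has(topic);
  }
}
```

The key design point is that "do we care about this topic" (the eager map entry) is decoupled from "have we announced it" (the debounced broadcast), so neither remote messages nor overlap queries depend on debounce timing.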
2) Shared-log: replication-info handling was racy / lossy
- `addReplicationRange()` emitted `replicator:join` based on a TOCTOU check of index state. Two concurrent handlers could both decide a peer is "new" and emit duplicates.
- Replication-info messages that arrived before the log was fully open (`NotStartedError` / index-not-ready) were dropped rather than retried, losing the peer's announce.

3) Migration + role encoding issues (v8 compat)
- A missing `await` in the request handlers, a `getRole()` implementation that threw in multi-segment cases, and a role->replication-range conversion that used normalized floats rather than u32 coordinate space.
- `RoleReplicationSegment` offset encoding used the wrong nominator.

4) A genuine test flake under heavy load
While re-running the full suite, `u32-simple > sharding > handles peer joining and leaving multiple times` intermittently failed because replica convergence could exceed `waitForResolved`'s default 10s timeout on a loaded machine.

Fixes
Pubsub (packages/transport/pubsub/src/index.ts)
- Eagerly initialize topic state in `subscribe()` so early remote `Subscribe` messages aren't dropped.
- Include topics pending in the debounce window in `requestSubscribers()` overlap queries.
- If `subscribe()` is cancelled via `unsubscribe()` before the debounce fires, clean up the eagerly created topic state.

Shared-log (packages/programs/data/shared-log/src/index.ts)
- Serialize replication-info application per peer (promise chain) to remove the TOCTOU around `addReplicationRange()`.
- Track the latest replication-info timestamp per peer and ignore older messages; if a message can't be applied yet, keep the latest one and flush it after open.
- After open, call `pubsub.requestSubscribers(this.topic)` and backfill `handleSubscriptionChange(...)` from the current subscriber snapshot.
- Make `replicator:join` idempotent: emit on the transition "not known replicator -> has segments", including restart/full-state announce cases.

Migration + role fixes
- `getRole()` is now best-effort when multiple local segments exist (selects the widest segment).
- Request handlers now await `getRole()`.
- The role -> replication-info conversion denormalizes factor/offset into u32 coordinate space.
- Fixed the `RoleReplicationSegment` offset encoding.

Tests
- Fixed the migration v8 mock to respond using the actually opened instance (`db1.log.rpc`).
- `waitForReplicator waits until maturity`: assert on remaining maturity time derived from the observed segment timestamp.
- Hardened `checkReplicas()` to avoid false negatives under load by using a slightly longer timeout + a lower polling rate.

Debug Docs
Added `shared-log-debug.md`, `shared-log-debug-plan.md`, and `scrap.md` as an audit trail for future debugging; these can be removed later once we're confident.

Verification
- `pnpm run build` (PASS)
- Targeted shared-log tests (PASS):
  - `replicate:join not emitted on update`
  - `8-9, replicates database of 1 entry`
  - `9-8, replicates database of 1 entry`
  - `waitForReplicator waits until maturity`
  - `handles peer joining and leaving multiple times`
- `pnpm run test:ci:part-4` (PASS):
  - `@peerbit/shared-log`: 1743 passing
  - `@peerbit/shared-log-proxy`: 1 passing

All checks passed when running on my repo with no changes; results on this repo TBD 🤞
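For reference, the u32 denormalization fixed under "Migration + role fixes" boils down to this kind of conversion. The constants and function names below are illustrative, not the exact code from the PR: a v8 role carries factor/offset as normalized floats in [0, 1], while replication-info ranges live in u32 coordinate space.

```typescript
// Illustrative sketch of normalized-float -> u32 coordinate conversion;
// names and the exact rounding rule are assumptions, not the PR's code.
const MAX_U32 = 0xffffffff;

const denormalize = (fraction: number): number => {
  if (fraction < 0 || fraction > 1) throw new Error("expected a normalized value in [0, 1]");
  return Math.round(fraction * MAX_U32) >>> 0; // force unsigned 32-bit
};

const roleToRange = (role: { offset: number; factor: number }) => ({
  // The encoding bug fixed here was of exactly this shape: the offset
  // nominator must come from role.offset, not role.factor.
  offset: denormalize(role.offset),
  factor: denormalize(role.factor),
});
```

Denormalizing both fields independently keeps the conversion lossless enough for range comparison while staying in the integer coordinate space the replication index expects.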
Commits
- `35963a7ac` tests: checkpoint
- `a2ab6d31d` shared-log: harden replication handshake + reduce flakes
- `bb7e7b031` docs: record shared-log debug commit hashes
- `67125c13f` docs: add shared-log debugging scrap notes

Additional Notes:
There is another PR at Faolain#7, which the summary below compares against this PR.