Fix/search convergence under churn #595

Open
Faolain wants to merge 3 commits into dao-xyz:master from Faolain:fix/search-convergence-under-churn


@Faolain Faolain commented Feb 7, 2026

This started as research into the intermittent failures in ci:part2 seen across runs in this repo, as described in #594. That PR is an in-depth exploration of the origin of the bug, what was seen, how often it occurs, and what we should be testing (eventual consistency or strict completeness); I highly recommend reading it for context. The flake is not only visible in automated tests; it is a real issue that could also occur in production.

The initial PR, #594, intended to backport the change from the fanout branch that relaxed constraints in the tests, but that did not have a significant effect on the test success rate and, more importantly, does not test what happens in a real deployment.

Why

ci:part2 is "flaky"
The failing test, index > operations > search > redundancy > can search while keeping minimum amount of replicas in packages/programs/data/document/document/test/index.spec.ts, asserted immediate completeness (collected.length === count) while the system was still rebalancing/syncing. In CI, a distributed index.search(fetch=count) can transiently short-read due to timing (indexing lag and/or missed remote RPC responses), producing the familiar "Failed to collect all messages X < Y" signature. See #594 for more.
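
To illustrate the difference between asserting immediate completeness and asserting eventual completeness, here is a minimal sketch of a polling helper. The name collectEventually, its parameters, and the default timings are hypothetical and are not part of the Peerbit test utilities; searchOnce stands in for whatever performs a single index.search call.

```typescript
// Hypothetical helper for illustration only; not part of the Peerbit codebase.
// Re-runs a search until it returns the expected number of results or a
// deadline passes, instead of asserting on the first attempt.
async function collectEventually<T>(
  searchOnce: () => Promise<T[]>, // assumed: performs one distributed search
  expected: number,
  timeoutMs = 10_000,
  intervalMs = 100,
): Promise<T[]> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const collected = await searchOnce();
    if (collected.length === expected) {
      return collected;
    }
    if (Date.now() >= deadline) {
      throw new Error(
        `expected ${expected} results, got ${collected.length} before timeout`,
      );
    }
    // Give rebalancing/syncing some time before the next attempt.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A helper like this only hides the short-read from the test; it does not stop a single search call from returning fewer results than exist.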

What this does:

  • When distributed search encounters missing RPC shard responses, the iterator can prematurely conclude it is done, causing silent short-reads (the ci:part2 flake signature: "Failed to collect all messages X < Y. Log lengths: [...]").
  • This change surfaces missing shard info on MissingResponsesError and prevents the search iterator from marking itself done when missing responses occurred during a fetch.
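
The behaviour described above can be sketched as a small state machine. This is a simplified, synchronous model under stated assumptions, not the actual code in search.ts: SketchIterator, Batch, missingShards, and collectAll are hypothetical names, and the real iterator is asynchronous.

```typescript
// Hypothetical model of the iterator's "done" logic; not the real search.ts.
type Batch<T> = { items: T[]; missingShards: string[] };

class SketchIterator<T> {
  private isDone = false;
  private readonly fetchBatch: (n: number) => Batch<T>;

  constructor(fetchBatch: (n: number) => Batch<T>) {
    this.fetchBatch = fetchBatch;
  }

  done(): boolean {
    return this.isDone;
  }

  next(n: number): T[] {
    const { items, missingShards } = this.fetchBatch(n);
    if (missingShards.length > 0) {
      // Some shards did not respond, so a short batch proves nothing:
      // keep the iterator open so the caller can fetch again.
      this.isDone = false;
    } else if (items.length < n) {
      // Every shard responded and fewer items than requested came back:
      // the result set is exhausted.
      this.isDone = true;
    }
    return items;
  }
}

// Drains the iterator, bounded by maxRounds so a permanently missing
// shard cannot cause an infinite loop in this sketch.
function collectAll<T>(
  it: SketchIterator<T>,
  batchSize: number,
  maxRounds: number,
): T[] {
  const out: T[] = [];
  for (let round = 0; round < maxRounds && !it.done(); round++) {
    out.push(...it.next(batchSize));
  }
  return out;
}
```

Without the missingShards branch, the first short batch would mark the iterator done and the caller would silently receive a partial result.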

Changes:

  • packages/programs/data/document/document/src/search.ts
    • Align remote.timeout with remote.wait.timeout (if wait provided and remote.timeout not set)
    • Expose onMissingResponses hook internally and ensure missing responses keep the iterator open (unsetDone + hasMore=true)
  • packages/programs/rpc/src/utils.ts
    • MissingResponsesError now carries missingGroups metadata
  • packages/programs/rpc/test/index.spec.ts
    • Adds a narrow regression test for missingGroups
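
As a rough illustration of two of the changes listed above, the sketch below shows an error type that carries missing-group metadata and a helper that falls back to the wait timeout when no remote timeout is set. All names and shapes here (MissingResponsesErrorSketch, RemoteOptionsSketch, resolveRemoteTimeout, and missingGroups being an array of string arrays) are assumptions for illustration; the real definitions in packages/programs/rpc/src/utils.ts and search.ts may differ.

```typescript
// Hypothetical shapes for illustration; not the actual @peerbit/rpc types.
class MissingResponsesErrorSketch extends Error {
  // Assumed shape: each entry is one group of peers that never responded.
  readonly missingGroups: string[][];

  constructor(message: string, missingGroups: string[][]) {
    super(message);
    this.name = "MissingResponsesErrorSketch";
    this.missingGroups = missingGroups;
  }
}

type RemoteOptionsSketch = {
  timeout?: number; // per-request timeout in ms
  wait?: { timeout: number }; // how long to wait for peers, in ms
};

// Uses remote.timeout when set; otherwise aligns it with remote.wait.timeout
// if wait is provided; otherwise falls back to a caller-supplied default.
function resolveRemoteTimeout(
  remote: RemoteOptionsSketch,
  fallbackMs: number,
): number {
  if (remote.timeout !== undefined) {
    return remote.timeout;
  }
  if (remote.wait !== undefined) {
    return remote.wait.timeout;
  }
  return fallbackMs;
}
```

A caller that catches such an error can inspect missingGroups to decide whether a short result is final or should be retried.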

Local verification:

  • PASS: pnpm --filter @peerbit/rpc test -- --grep "reports missing groups on timeout"
  • PASS: (stress) for i in {1..25}; do PEERBIT_TEST_SESSION=mock pnpm --filter @peerbit/document test -- --grep "can search while keeping minimum amount of replicas" || break; done
    • 25/25 passed locally on this branch.

