
Fix standby resubscribe when subscription is ongoing. #434

Draft
MrGuin wants to merge 4 commits into main from fix_resub

Conversation

@MrGuin
Collaborator

@MrGuin MrGuin commented Mar 6, 2026

Fixes https://github.com/eloqdata/project_tracker/issues/224.

Summary by CodeRabbit

  • Documentation
    • Added design and implementation planning documentation for internal subscription handling improvements.

@MrGuin MrGuin changed the title from "add standby resubscribe design and impl docs" to "Fix standby resubscribe" on Mar 6, 2026

coderabbitai bot commented Mar 6, 2026

Walkthrough

Introduces comprehensive design and implementation documentation for a revised resubscribe mechanism using subscribe_round_id for round-based tracking and follow_req_id for local context preservation. Defines protocol changes, validation rules, state management, and a task breakdown (T1–T10) with dependency graph for phased execution toward deadlock removal and unified round semantics.

Changes

Cohort / File(s) Summary
Resubscribe v2 Design & Implementation
.cursor_knowledge/standby_resubscribe_design.md, .cursor_knowledge/standby_resubscribe_impl_tasks.md
Design documentation for a revised subscription mechanism introducing subscribe_round_id round tracking, follow_req_id local context preservation, de-duplication rules, and strict validation logic; accompanied by detailed task breakdown (T1–T10) with phase priorities, state/function changes, affected file inventory, and test coverage goals.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • liunyl

Poem

🐰 Round and round the subscriptions go,
With follow_req_id's gentle glow,
De-duped and true, no deadlock near,
A protocol perfected, crystal clear!
Standby wisdom in design aligned,
Tasks and timelines neatly lined! 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name: Description check
Status: ⚠️ Warning
Explanation: The PR description is incomplete; it only references an external issue link without providing details about changes, testing, documentation, or verification against the template checklist.
Resolution: Complete the PR description using the template: add details of what was changed, document any new design elements, confirm tests were added, reference the RFC if applicable, and verify the test suite passes.
✅ Passed checks (2 passed)
Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check (✅ Passed): The title 'Fix standby resubscribe when subscription is ongoing' directly aligns with the changeset, which introduces design and implementation specifications for handling resubscription while a previous subscription is in progress.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch fix_resub


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tx_service/cc_resubscribe_epoch_design_en.md`:
- Around line 36-38: The review warns that round ownership must be centralized,
so move the per-node state latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] out of CcShard and into the single Sharder owner
(where subscribe_counter_ and round allocation live); update CcShard to keep
only per-shard candidate/subscribed bookkeeping, update Sharder to store and
manage latest_subscribe_round and out_of_sync_ctrl (keyed by node_id), and
adjust CcNodeService::StandbyStartFollowing(...) and any handlers that
read/update these symbols to call into Sharder APIs for reading/updating
subscribe_round_id, latest_subscribe_round, and out_of_sync_ctrl to prevent
cross-shard drift and provide a single source of truth.

In `@tx_service/cc_resubscribe_v2_impl_tasks_en.md`:
- Line 5: Update the design filename reference in
tx_service/cc_resubscribe_v2_impl_tasks_en.md: replace the incorrect string
"cc_resubscribe_epoch_design.md" with the correct document name
"tx_service/cc_resubscribe_epoch_design_en.md" so the implementation checklist
points to the right spec; ensure any internal links or mentions in the same file
are updated consistently.
- Around line 169-186: Add the missing Sharder implementation referenced by the
out-of-sync flow: create data_substrate/tx_service/include/sharder.h and
data_substrate/tx_service/src/sharder.cpp that declare and define the
out_of_sync_ctrl_ state and its API (the methods and types referenced by
CcShard: MarkOutOfSyncInflight(node_id, round), ClearOutOfSyncInflight(node_id,
round), and any helpers/structs used to check last_progress_ts and last_send_ts
and to allocate rounds), and ensure Sharder exposes the same method signatures
used by CcShard::NotifyStandbyOutOfSync and the StandbyStartFollowing validation
so the calls compile and the inflight/round logic is backed by concrete storage
and thread-safe accessors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e66c240-3525-44b7-bc21-6da250ac3f0d

📥 Commits

Reviewing files that changed from the base of the PR and between 7caae7d and c42c6df.

📒 Files selected for processing (2)
  • tx_service/cc_resubscribe_epoch_design_en.md
  • tx_service/cc_resubscribe_v2_impl_tasks_en.md

Comment on lines +36 to +38
| Primary (`Sharder`) | `subscribe_counter_` (existing) | Round allocator that assigns `subscribe_round_id`. |
| Primary (`CcShard`) | `latest_subscribe_round_[node_id]` | Latest accepted round for this standby; used to reject stale requests. |
| Primary (`CcShard`) | `out_of_sync_ctrl_[node_id]` | Out-of-sync dedup control state (for example `{inflight_round, inflight, last_send_ts, last_progress_ts}`) to avoid repeated triggers continuously allocating new rounds. |

⚠️ Potential issue | 🟠 Major

Do not make round ownership shard-local.

latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] should not live in CcShard. CcNodeService::StandbyStartFollowing(...) currently walks shards independently (tx_service/src/remote/cc_node_service.cpp:1590-1665), so storing this state per shard would create multiple sources of truth and allow cross-shard round drift. This state needs a single owner per standby/node, e.g. Sharder, with CcShard keeping only per-shard candidate/subscribed bookkeeping.
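The centralization this comment asks for can be sketched roughly as follows. This is a minimal illustration assuming a single Sharder-like owner; the class name `RoundRegistry` and the methods `LatestRound`/`AllocateRound` are illustrative, not the project's actual API:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical single owner of per-standby round state, mirroring where
// subscribe_counter_ already lives. CcShard would keep only per-shard
// candidate/subscribed bookkeeping and call into this registry.
class RoundRegistry
{
public:
    // Returns the latest accepted round for a standby, or 0 if none.
    uint64_t LatestRound(uint32_t node_id)
    {
        std::lock_guard<std::mutex> lk(mux_);
        auto it = latest_subscribe_round_.find(node_id);
        return it == latest_subscribe_round_.end() ? 0 : it->second;
    }

    // Allocates a new monotonically increasing round and records it as the
    // latest for this standby (stands in for GetNextSubscribeId() semantics).
    uint64_t AllocateRound(uint32_t node_id)
    {
        std::lock_guard<std::mutex> lk(mux_);
        uint64_t round = ++subscribe_counter_;
        latest_subscribe_round_[node_id] = round;
        return round;
    }

private:
    std::mutex mux_;
    uint64_t subscribe_counter_{0};
    std::unordered_map<uint32_t, uint64_t> latest_subscribe_round_;
};
```

Because all shards read and allocate through one registry, concurrent callers can never observe divergent rounds for the same standby.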

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/cc_resubscribe_epoch_design_en.md` around lines 36 - 38, The
review warns that round ownership must be centralized, so move the per-node
state latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] out of
CcShard and into the single Sharder owner (where subscribe_counter_ and round
allocation live); update CcShard to keep only per-shard candidate/subscribed
bookkeeping, update Sharder to store and manage latest_subscribe_round and
out_of_sync_ctrl (keyed by node_id), and adjust
CcNodeService::StandbyStartFollowing(...) and any handlers that read/update
these symbols to call into Sharder APIs for reading/updating subscribe_round_id,
latest_subscribe_round, and out_of_sync_ctrl to prevent cross-shard drift and
provide a single source of truth.

Comment on lines +169 to +186
- Files Involved:
- `data_substrate/tx_service/src/cc/cc_shard.cpp`
- `data_substrate/tx_service/include/proto/cc_request.proto`
- `data_substrate/tx_service/include/cc/cc_request.h`
- State Changes:
- Before sending out-of-sync notification, primary checks/updates `out_of_sync_ctrl_`; repeated triggers should reuse inflight round by default.
- Function Changes:
- `CcShard::NotifyStandbyOutOfSync(uint32_t node_id)`
- If this standby already has inflight out-of-sync round:
- if `now-last_progress_ts < 20s` and `now-last_send_ts >= 2s`: resend same round;
- if `now-last_progress_ts < 20s` and `now-last_send_ts < 2s`: ignore;
- if `now-last_progress_ts >= 20s`: allocate newer round.
- Otherwise allocate/update round and set `req->set_subscribe_round_id(...)`.
- After sending, call `MarkOutOfSyncInflight(node_id, round)`.
- Standby out-of-sync handling calls
`OnStartFollowing(..., subscribe_round_id=<msg>, resubscribe_intent=false)`.
- Coordination with `T4`:
- After `StandbyStartFollowing` validation succeeds, call `ClearOutOfSyncInflight(node_id, round)`.

⚠️ Potential issue | 🟡 Minor

T6 omits the Sharder changes it depends on.

This task updates out_of_sync_ctrl_ and calls MarkOutOfSyncInflight(...) / ClearOutOfSyncInflight(...), but T3 defines that state/API on Sharder (Lines 89-104). Please add data_substrate/tx_service/include/sharder.h and data_substrate/tx_service/src/sharder.cpp here as well, otherwise the out-of-sync flow is only half specified.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/cc_resubscribe_v2_impl_tasks_en.md` around lines 169 - 186, Add
the missing Sharder implementation referenced by the out-of-sync flow: create
data_substrate/tx_service/include/sharder.h and
data_substrate/tx_service/src/sharder.cpp that declare and define the
out_of_sync_ctrl_ state and its API (the methods and types referenced by
CcShard: MarkOutOfSyncInflight(node_id, round), ClearOutOfSyncInflight(node_id,
round), and any helpers/structs used to check last_progress_ts and last_send_ts
and to allocate rounds), and ensure Sharder exposes the same method signatures
used by CcShard::NotifyStandbyOutOfSync and the StandbyStartFollowing validation
so the calls compile and the inflight/round logic is backed by concrete storage
and thread-safe accessors.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (3)
.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md (2)

5-5: ⚠️ Potential issue | 🟡 Minor

Fix the referenced design filename.

Line 5 points to cc_resubscribe_epoch_design.md, but based on the past review, the correct filename should be tx_service/cc_resubscribe_epoch_design_en.md. This mismatch will make it difficult for implementers to locate the correct design specification.

📝 Proposed fix
-This checklist translates `cc_resubscribe_epoch_design.md` (revised v2) into executable implementation tasks.
+This checklist translates `tx_service/cc_resubscribe_epoch_design_en.md` (revised v2) into executable implementation tasks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md at line 5, Update the
incorrect design filename reference: locate the string
"cc_resubscribe_epoch_design.md" in cc_resubscribe_v2_impl_tasks_en.md and
replace it with the correct path "tx_service/cc_resubscribe_epoch_design_en.md"
so implementers can find the proper design doc; ensure any other occurrences in
the same file are updated consistently.

169-172: ⚠️ Potential issue | 🟡 Minor

T6 omits the Sharder files it depends on.

Lines 182 and 186 reference MarkOutOfSyncInflight(...) and ClearOutOfSyncInflight(...), which T3 defines on Sharder (lines 102-103). The Files Involved list should include data_substrate/tx_service/include/sharder.h and data_substrate/tx_service/src/sharder.cpp.

📝 Proposed fix
 - Files Involved:
   - `data_substrate/tx_service/src/cc/cc_shard.cpp`
   - `data_substrate/tx_service/include/proto/cc_request.proto`
   - `data_substrate/tx_service/include/cc/cc_request.h`
+  - `data_substrate/tx_service/include/sharder.h`
+  - `data_substrate/tx_service/src/sharder.cpp`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 169 - 172,
The Files Involved list omitted the Sharder files required by T6; update the
list to include the Sharder header and implementation (add sharder.h and
sharder.cpp) because T6 references Sharder methods MarkOutOfSyncInflight(...)
and ClearOutOfSyncInflight(...), which are defined on the Sharder type—ensure
those symbols (Sharder::MarkOutOfSyncInflight and
Sharder::ClearOutOfSyncInflight) are covered by adding the Sharder files to the
Files Involved section.
.cursor_knowledge/cc_resubscribe_epoch_design_en.md (1)

36-38: ⚠️ Potential issue | 🔴 Critical

Do not make round ownership shard-local.

As flagged in previous reviews, latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] should not live in CcShard. Since CcNodeService::StandbyStartFollowing(...) walks shards independently, storing this state per shard creates multiple sources of truth and allows cross-shard round drift. This state needs a single owner per standby/node (e.g., Sharder, where subscribe_counter_ already lives), with CcShard keeping only per-shard candidate/subscribed bookkeeping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 36 - 38,
The per-standby state latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] must be moved out of CcShard into the single owner
(Sharder) alongside subscribe_counter_; update CcShard to only hold per-shard
candidate/subscribed bookkeeping, remove the per-shard arrays, and change all
reads/writes (including in CcNodeService::StandbyStartFollowing and any CcShard
methods) to consult the Sharder-owned structures (e.g., provide accessor APIs on
Sharder like GetLatestSubscribeRound(node_id)/UpdateOutOfSyncCtrl(node_id)) so
there is a single source of truth and no cross-shard drift. Ensure concurrent
access is protected by the same locking used for subscribe_counter_ and adjust
tests/initialization to populate the Sharder maps instead of per-shard fields.
🧹 Nitpick comments (2)
.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md (1)

174-186: Specify cleanup strategy for out_of_sync_ctrl_ entries.

The task describes updating out_of_sync_ctrl_ state when sending out-of-sync notifications (line 174, 182) and clearing inflight flags (line 186), but doesn't specify when entries should be removed from the map entirely. Without cleanup logic, the map will grow indefinitely.

🧹 Suggested addition

Add to the Function Changes section:

- Cleanup policy:
  - Clear `out_of_sync_ctrl_[node_id]` entry when standby successfully subscribes (after ResetStandbySequenceId completes).
  - Or: Clear entry when standby is removed from all subscribed_standby_nodes_ across all shards.
  - Or: Use time-based eviction for entries where `last_progress_ts` is very old (e.g., > 5 minutes).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 174 - 186,
The out_of_sync_ctrl_ map needs an explicit cleanup policy to avoid unbounded
growth: update CcShard::NotifyStandbyOutOfSync / MarkOutOfSyncInflight to ensure
entries for node_id are removed when a standby successfully subscribes (i.e.,
after ResetStandbySequenceId completes) by calling
ClearOutOfSyncInflight(node_id, round) and then erasing
out_of_sync_ctrl_[node_id]; additionally remove the entry when a standby is
removed from subscribed_standby_nodes_ across all shards, and add a time-based
eviction (e.g., if last_progress_ts > 5 minutes) to periodically purge stale
entries so out_of_sync_ctrl_ cannot grow indefinitely.
.cursor_knowledge/cc_resubscribe_epoch_design_en.md (1)

124-127: Clarify "reject" behavior for messages missing round field.

Line 127 states "If a forwarded message does not carry round, reject it directly and record a warning metric." Specify whether "reject" means:

  1. Silently drop the message (no response sent)
  2. Send an error/NACK response to the sender
  3. Close the connection/stream

This is important for debugging and ensuring the sender can detect and handle the protocol violation appropriately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 124 - 127,
Clarify the "reject" semantics for forwarded messages missing the round: update
the text so that when a standby receives a forwarded message without a round
field it sends an explicit error/NACK response back to the sender (not silently
drop or close the connection), logs the protocol violation, and increments the
warning metric; reference the check against active_subscribe_ctx_.round and the
"forwarded message" handling to locate where to document/implement this behavior
and ensure the NACK includes the reason "missing round" for easier debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 146-159: Add a new Test 12 to the test matrix that exercises
cross-shard concurrency for round state: simulate concurrent
StandbyStartFollowing and ResetStandbySequenceId operations targeting the same
node_id across multiple shards and assert that the shared state
latest_subscribe_round_[node_id] is identical on all shards, that reads from
different shards observe the same round value, and that concurrent round
allocations never produce conflicting or divergent rounds; ensure the test also
covers the case where round ownership is centralized (the code paths that
read or update latest_subscribe_round_[node_id] and any allocator logic
invoked by resubscribe_intent) and fails if any shard reports a different round
or multiple distinct rounds are allocated for the same node.
- Around line 79-91: Clarify and tighten the round-resolution rules in
StandbyStartFollowing: define "no known current round context" as
latest_subscribe_round_[node_id] == 0 and there is no active subscribe_id/state
for that node (explicitly check both latest_subscribe_round_ and the node's
current subscribe context), treat subscribe_round_id >
latest_subscribe_round_[node_id] as a protocol violation — add a logged
error/metric with diagnostics (remote node id, received subscribe_round_id,
expected latest_subscribe_round_[node_id], and request metadata) and reject the
request (do not allocate a new round), and make same-round idempotent handling
(subscribe_round_id == latest_subscribe_round_[node_id] when
resubscribe_intent==false) explicitly return the current subscribe_id and the
node's current state snapshot (not a silent no-op) so callers receive a state
refresh; update code paths referencing GetNextSubscribeId(),
latest_subscribe_round_[node_id], subscribe_round_id, and subscribe_id
accordingly.
- Around line 21-23: Clarify that standby_node_term keeps the same bit layout
and monotonic ordering by explicitly stating in the design that the upper 32
bits remain primary_term and the lower 32 bits are now called subscribe_round_id
but preserve the same monotonic/ordering semantics as the old subscribe_id;
mention that code which extracts the lower 32 bits (e.g.,
SubscribeIdFromStandbyTerm() used in snapshot_manager.cpp at the noted call
sites) can continue to compare those numeric round values directly because
subscribe_round_id is guaranteed to be monotonically increasing and compatible
with existing sentinel/ordering checks.
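The bit layout described above can be illustrated with a small sketch. `MakeStandbyTerm` and `PrimaryTermFromStandbyTerm` are hypothetical helper names; only `SubscribeIdFromStandbyTerm` is named in the review:

```cpp
#include <cstdint>

// Upper 32 bits of standby_node_term carry the primary term; the lower
// 32 bits carry the (monotonically increasing) subscribe_round_id.
inline uint64_t MakeStandbyTerm(uint32_t primary_term, uint32_t round)
{
    return (static_cast<uint64_t>(primary_term) << 32) | round;
}

inline uint32_t PrimaryTermFromStandbyTerm(uint64_t standby_term)
{
    return static_cast<uint32_t>(standby_term >> 32);
}

// Extracts the lower 32 bits; callers can keep comparing these values
// numerically because rounds preserve the old subscribe_id ordering.
inline uint32_t SubscribeIdFromStandbyTerm(uint64_t standby_term)
{
    return static_cast<uint32_t>(standby_term & 0xFFFFFFFFu);
}
```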

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Around line 148-159: ResetStandbySequenceId currently calls
AddHeartbeatTargetNode before validating the standby, which can pollute
heartbeat state with stale candidates; update
CcNodeService::ResetStandbySequenceId to perform all validations first (parse
(term, round) from standby_node_term, verify term == current leader term, verify
round == latest_subscribe_round_[node_id], verify candidate still valid and not
stale, and return idempotent success if already subscribed with same (term,
round)) and only call AddHeartbeatTargetNode after those checks succeed; also
add/use the CcShard helper APIs HasCandidateStandby(node_id, standby_term) and
HasSubscribedStandby(node_id, standby_term) (and optionally
PromoteCandidateToSubscribed) to encapsulate candidate/subscribed checks and
make the final heartbeat-addition point unambiguous.
- Around line 120-122: Document that the existing candidate_standby_nodes_
mapping is currently node_id -> start_seq_id (i.e., each entry stores only
start_seq_id), then show the upgraded mapping node_id -> {start_seq_id,
standby_term} and explicitly state that standby_term (round) must be populated
during migration; update any mentions of candidate_standby_nodes_ and related
migration steps to reflect the old schema, the new schema, and the required
transformation for each entry.
- Around line 89-93: Clarify and concretely define the out-of-sync control
state: introduce a typed alias/field out_of_sync_ctrl_ as
std::unordered_map<uint32_t, OutOfSyncControl> and define struct
OutOfSyncControl { uint64_t inflight_round; bool inflight;
std::chrono::steady_clock::time_point last_send_ts;
std::chrono::steady_clock::time_point last_progress_ts; /* add a per-entry mutex
if needed */ }; update references to out_of_sync_ctrl_, latest_subscribe_round_,
and subscribe_counter_ to use these types; for concurrency, prefer a hybrid
strategy: protect the map itself with a single std::shared_mutex for concurrent
reads/writes to map entries, and add a per-OutOfSyncControl std::mutex (or
std::atomic flags for simple fields) to avoid contention during write-heavy
round allocation (i.e., Sharder updates) while allowing concurrent readers to
inspect entries. Ensure all accesses follow this locking scheme (lock shared for
map lookups, upgrade/unique for insert/erase, and lock per-entry mutex for
modifying the OutOfSyncControl fields).
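A self-contained sketch of the typed state and hybrid locking proposed above (a `std::shared_mutex` on the map plus a per-entry `std::mutex`). All names here are proposals from the comment, not existing project API:

```cpp
#include <chrono>
#include <cstdint>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

struct OutOfSyncControl
{
    std::mutex mux;  // guards the fields below
    uint64_t inflight_round{0};
    bool inflight{false};
    std::chrono::steady_clock::time_point last_send_ts{};
    std::chrono::steady_clock::time_point last_progress_ts{};
};

class OutOfSyncTable
{
public:
    void MarkInflight(uint32_t node_id, uint64_t round)
    {
        OutOfSyncControl &ctrl = Entry(node_id);
        std::lock_guard<std::mutex> lk(ctrl.mux);
        ctrl.inflight = true;
        ctrl.inflight_round = round;
        ctrl.last_send_ts = std::chrono::steady_clock::now();
    }

    // Clears the inflight marker only if the acknowledged round matches.
    void ClearInflight(uint32_t node_id, uint64_t round)
    {
        OutOfSyncControl &ctrl = Entry(node_id);
        std::lock_guard<std::mutex> lk(ctrl.mux);
        if (ctrl.inflight && ctrl.inflight_round == round)
        {
            ctrl.inflight = false;
            ctrl.last_progress_ts = std::chrono::steady_clock::now();
        }
    }

    bool IsInflight(uint32_t node_id)
    {
        OutOfSyncControl &ctrl = Entry(node_id);
        std::lock_guard<std::mutex> lk(ctrl.mux);
        return ctrl.inflight;
    }

private:
    // Shared lock on the common lookup path; unique lock only on first insert.
    OutOfSyncControl &Entry(uint32_t node_id)
    {
        {
            std::shared_lock<std::shared_mutex> lk(map_mux_);
            auto it = ctrls_.find(node_id);
            if (it != ctrls_.end())
                return *it->second;
        }
        std::unique_lock<std::shared_mutex> lk(map_mux_);
        auto &slot = ctrls_[node_id];
        if (!slot)
            slot = std::make_unique<OutOfSyncControl>();
        return *slot;
    }

    std::shared_mutex map_mux_;
    std::unordered_map<uint32_t, std::unique_ptr<OutOfSyncControl>> ctrls_;
};
```

Entries are held by `unique_ptr` so references stay valid after the map lock is released, which is what makes the two-level locking safe.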

---

Duplicate comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 36-38: The per-standby state latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] must be moved out of CcShard into the single owner
(Sharder) alongside subscribe_counter_; update CcShard to only hold per-shard
candidate/subscribed bookkeeping, remove the per-shard arrays, and change all
reads/writes (including in CcNodeService::StandbyStartFollowing and any CcShard
methods) to consult the Sharder-owned structures (e.g., provide accessor APIs on
Sharder like GetLatestSubscribeRound(node_id)/UpdateOutOfSyncCtrl(node_id)) so
there is a single source of truth and no cross-shard drift. Ensure concurrent
access is protected by the same locking used for subscribe_counter_ and adjust
tests/initialization to populate the Sharder maps instead of per-shard fields.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Line 5: Update the incorrect design filename reference: locate the string
"cc_resubscribe_epoch_design.md" in cc_resubscribe_v2_impl_tasks_en.md and
replace it with the correct path "tx_service/cc_resubscribe_epoch_design_en.md"
so implementers can find the proper design doc; ensure any other occurrences in
the same file are updated consistently.
- Around line 169-172: The Files Involved list omitted the Sharder files
required by T6; update the list to include the Sharder header and implementation
(add sharder.h and sharder.cpp) because T6 references Sharder methods
MarkOutOfSyncInflight(...) and ClearOutOfSyncInflight(...), which are defined on
the Sharder type—ensure those symbols (Sharder::MarkOutOfSyncInflight and
Sharder::ClearOutOfSyncInflight) are covered by adding the Sharder files to the
Files Involved section.

---

Nitpick comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 124-127: Clarify the "reject" semantics for forwarded messages
missing the round: update the text so that when a standby receives a forwarded
message without a round field it sends an explicit error/NACK response back to
the sender (not silently drop or close the connection), logs the protocol
violation, and increments the warning metric; reference the check against
active_subscribe_ctx_.round and the "forwarded message" handling to locate where
to document/implement this behavior and ensure the NACK includes the reason
"missing round" for easier debugging.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Around line 174-186: The out_of_sync_ctrl_ map needs an explicit cleanup
policy to avoid unbounded growth: update CcShard::NotifyStandbyOutOfSync /
MarkOutOfSyncInflight to ensure entries for node_id are removed when a standby
successfully subscribes (i.e., after ResetStandbySequenceId completes) by
calling ClearOutOfSyncInflight(node_id, round) and then erasing
out_of_sync_ctrl_[node_id]; additionally remove the entry when a standby is
removed from subscribed_standby_nodes_ across all shards, and add a time-based
eviction (e.g., if last_progress_ts > 5 minutes) to periodically purge stale
entries so out_of_sync_ctrl_ cannot grow indefinitely.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c6f6789-4b9a-4f61-b7e3-94ffb9218569

📥 Commits

Reviewing files that changed from the base of the PR and between c42c6df and eaef85f.

📒 Files selected for processing (2)
  • .cursor_knowledge/cc_resubscribe_epoch_design_en.md
  • .cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md

Comment on lines +79 to +91
#### 6.3 Round Resolution in `StandbyStartFollowing`

- If `resubscribe_intent=true` (request a newer round):
- If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
- If `subscribe_round_id == latest_subscribe_round_[node_id]`, allocate a new round (reuse `GetNextSubscribeId()`) and update latest.
- If `subscribe_round_id == 0` and there is no known current round context, allocate a new round and update latest (initial/join fallback).
- If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- If `resubscribe_intent=false` (join/retry existing round):
- If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
- If `subscribe_round_id == latest_subscribe_round_[node_id]`, handle as idempotent success.
- If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- Response returns the final round (the current `subscribe_id` field).
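The branch table above can be condensed into one pure decision function, since both intents share the stale and out-of-order rejections. This is an illustrative sketch, assuming the caller has already looked up the latest round and pre-allocated a candidate next round; all names are hypothetical:

```cpp
#include <cstdint>

enum class RoundDecision
{
    RejectStale,
    RejectOutOfOrder,
    IdempotentSuccess,
    NewRound
};

struct RoundResult
{
    RoundDecision decision;
    uint64_t final_round;  // the round the response would carry
};

// Stand-in for the validation step of StandbyStartFollowing.
RoundResult ResolveRound(uint64_t subscribe_round_id,
                         uint64_t latest_round,  // latest_subscribe_round_[node_id]
                         bool resubscribe_intent,
                         uint64_t next_round)    // would come from GetNextSubscribeId()
{
    if (subscribe_round_id < latest_round)
        return {RoundDecision::RejectStale, latest_round};
    if (subscribe_round_id > latest_round)
        return {RoundDecision::RejectOutOfOrder, latest_round};
    // subscribe_round_id == latest_round from here on; this also covers the
    // round==0 initial/join fallback when latest_round is still 0.
    if (resubscribe_intent)
        return {RoundDecision::NewRound, next_round};
    return {RoundDecision::IdempotentSuccess, latest_round};
}
```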


🛠️ Refactor suggestion | 🟠 Major

Clarify edge cases and error handling in round resolution logic.

The round resolution logic contains several conditional branches that need clarification:

  1. Line 84: "subscribe_round_id == 0 and there is no known current round context" - what precisely defines "no known current round context"? Is this latest_subscribe_round_[node_id] == 0, or are there other conditions?

  2. Line 85: "subscribe_round_id > latest_subscribe_round_[node_id], reject as invalid/out-of-order" - under what scenarios would a standby legitimately send a round ID greater than the primary's latest? This appears to be a protocol violation. Should this case log an error, trigger an alert, or include additional diagnostics to detect bugs or malicious behavior?

  3. The same-round idempotent handling (line 88) should specify whether the response includes any state refresh or is purely a no-op acknowledgment.

🧰 Tools
🪛 LanguageTool

[style] ~84-~89: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE; flagged at lines 84, 85, 86, and 89)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 79 - 91,
Clarify and tighten the round-resolution rules in StandbyStartFollowing: define
"no known current round context" as latest_subscribe_round_[node_id] == 0 and
there is no active subscribe_id/state for that node (explicitly check both
latest_subscribe_round_ and the node's current subscribe context), treat
subscribe_round_id > latest_subscribe_round_[node_id] as a protocol violation —
add a logged error/metric with diagnostics (remote node id, received
subscribe_round_id, expected latest_subscribe_round_[node_id], and request
metadata) and reject the request (do not allocate a new round), and make
same-round idempotent handling (subscribe_round_id ==
latest_subscribe_round_[node_id] when resubscribe_intent==false) explicitly
return the current subscribe_id and the node's current state snapshot (not a
silent no-op) so callers receive a state refresh; update code paths referencing
GetNextSubscribeId(), latest_subscribe_round_[node_id], subscribe_round_id, and
subscribe_id accordingly.

Comment on lines +100 to +113
#### 6.3.2 Primary-side Dedup for Repeated Out-of-Sync Triggers (Simplified)

- Default parameters (recommended, configurable):
- `same_round_resend_interval_ms = 2000`: minimum resend interval for the same round.
- `no_progress_timeout_ms = 20000`: no-progress timeout; allow newer-round escalation after timeout.

- If the same standby already has an in-flight out-of-sync notification and `inflight_round == latest_subscribe_round_[node_id]`:
- if `now - last_progress_ts < no_progress_timeout_ms`:
- resend same-round notification only when `now - last_send_ts >= same_round_resend_interval_ms`;
- otherwise ignore repeated trigger;
- if `now - last_progress_ts >= no_progress_timeout_ms`: allocate and send a newer round (escalation).
- By default, repeated triggers reuse the same inflight round; escalation is timeout-based, not retry-count-based.
- After primary receives valid `StandbyStartFollowing` for that assigned round (`resubscribe_intent=false`), update `last_progress_ts` (and clear/reset inflight marker per implementation policy).
- Progress definition (recommended): a validated `StandbyStartFollowing(node_id, round=inflight_round, resubscribe_intent=false)`.
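The dedup rules above reduce to a small, timeout-driven decision. A sketch over plain millisecond timestamps, with the two recommended defaults as parameters; the enum and function names are illustrative only:

```cpp
#include <cstdint>

enum class OutOfSyncAction
{
    ResendSameRound,
    Ignore,
    EscalateNewRound
};

// Decision for a repeated out-of-sync trigger when an inflight round already
// exists for the standby and inflight_round == latest_subscribe_round_.
OutOfSyncAction DecideOutOfSync(uint64_t now_ms,
                                uint64_t last_send_ms,
                                uint64_t last_progress_ms,
                                uint64_t same_round_resend_interval_ms = 2000,
                                uint64_t no_progress_timeout_ms = 20000)
{
    if (now_ms - last_progress_ms >= no_progress_timeout_ms)
        return OutOfSyncAction::EscalateNewRound;  // timeout-based escalation
    if (now_ms - last_send_ms >= same_round_resend_interval_ms)
        return OutOfSyncAction::ResendSameRound;   // same round, throttled resend
    return OutOfSyncAction::Ignore;                // coalesce the repeated trigger
}
```

Escalation depends only on `last_progress_ts`, never on a retry counter, which matches the "timeout-based, not retry-count-based" rule above.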

🛠️ Refactor suggestion | 🟠 Major

Define "valid" StandbyStartFollowing criteria and coordinate with round resolution.

Line 112 states "After primary receives valid StandbyStartFollowing for that assigned round (resubscribe_intent=false), update last_progress_ts" but doesn't define what makes a request "valid." Should this reference the validation logic from section 6.3 (lines 79-91)?

Additionally, the escalation path (line 110: "allocate and send a newer round") needs explicit coordination with the round resolution logic. When allocating a newer round due to timeout, the primary must:

  1. Update latest_subscribe_round_[node_id] atomically with the allocation
  2. Ensure the out-of-sync notification carries the new round
  3. Handle race conditions if the standby concurrently sends its own resubscribe request
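The dedup and escalation rules in section 6.3.2 can be expressed as a pure decision function over the stored timestamps. A minimal sketch, assuming the recommended defaults and that a notification is already in flight for the latest round (the only case the section specifies); the function and parameter names are illustrative, not from the codebase:

```cpp
#include <cstdint>

enum class OutOfSyncAction { kIgnore, kResendSameRound, kEscalateNewRound };

// Decision for a repeated out-of-sync trigger while a notification is already
// in flight for latest_subscribe_round_[node_id].
OutOfSyncAction DecideRepeatedTrigger(
    uint64_t now_ms,
    uint64_t last_send_ts_ms,
    uint64_t last_progress_ts_ms,
    uint64_t same_round_resend_interval_ms = 2000,
    uint64_t no_progress_timeout_ms = 20000) {
    if (now_ms - last_progress_ts_ms >= no_progress_timeout_ms) {
        // No progress for too long: allow a newer round (escalation).
        return OutOfSyncAction::kEscalateNewRound;
    }
    if (now_ms - last_send_ts_ms >= same_round_resend_interval_ms) {
        // Progress window still open: at most one same-round resend per interval.
        return OutOfSyncAction::kResendSameRound;
    }
    // Coalesce the repeated trigger.
    return OutOfSyncAction::kIgnore;
}
```

Note that escalation here is purely timeout-based, matching the design's "timeout-based, not retry-count-based" rule.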

Comment on lines +146 to +159
### 9. Test Matrix (Must Be Covered)

1. Two consecutive resubscribe requests in the same term: the second must preempt successfully while the first is still in progress.
2. Concurrent out-of-sync notification + lag trigger: only the latest round becomes effective.
3. Late `StandbyStartFollowingResponse` (old round) is dropped.
4. Late `ResetStandbySequenceId` (old round) is rejected or handled idempotently as expected.
5. Snapshot request coverage and round monotonicity are consistent (no rollback).
6. Host manager `OnStartFollowing` (`current_round + resubscribe_intent=true`) does not conflict with concurrent resubscribe.
7. Forwarding messages missing round field are rejected, and warning metrics are correct.
8. Multiple standby-local triggers in the same term (lag/idle) must result in only one “request-new-round (`resubscribe_intent=true`)” request; all others are locally coalesced.
9. After standby gets round=`R`, all follow-up retries must carry `R` with `resubscribe_intent=false`.
10. Repeated primary out-of-sync triggers in a short interval must not allocate multiple new rounds for the same standby unless escalation policy allows it.
11. Out-of-sync dedup thresholds are enforced: allow same-round resend after 2s; allow newer-round escalation after 20s no progress.


🛠️ Refactor suggestion | 🟠 Major

Add test coverage for cross-shard consistency of round state.

The test matrix is comprehensive but missing explicit coverage for the cross-shard consistency issue raised in previous reviews. Add a test case:

Test 12: When StandbyStartFollowing or ResetStandbySequenceId operations execute concurrently across multiple shards for the same node, verify that:

  • latest_subscribe_round_[node_id] remains consistent across all shards (no drift)
  • Operations on different shards for the same node see the same round state
  • Concurrent round allocations don't create conflicting round values

This test becomes critical if the state management approach in lines 36-38 is revised to centralize round ownership.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 146 - 159,
Add a new Test 12 to the test matrix that exercises cross-shard concurrency for
round state: simulate concurrent StandbyStartFollowing and
ResetStandbySequenceId operations targeting the same node_id across multiple
shards and assert that the shared state latest_subscribe_round_[node_id] is
identical on all shards, that reads from different shards observe the same round
value, and that concurrent round allocations never produce conflicting or
divergent rounds; ensure the test also covers the case where round ownership is
centralized (the code paths that reference or update
latest_subscribe_round_[node_id] and any allocator logic invoked by
resubscribe_intent) and fails if any shard reports a different round or multiple
distinct rounds are allocated for the same node.

Comment on lines +89 to +93
- State Changes:
- Reuse: `subscribe_counter_`.
- Add: `latest_subscribe_round_[node_id]` (recommended in `Sharder` for centralized ownership).
- Add: `out_of_sync_ctrl_[node_id]` (stores `{inflight_round, inflight, last_send_ts, last_progress_ts}` or equivalent) for deduplicating repeated out-of-sync triggers.
- Add: concurrency protection for this map (`mutex`/`shared_mutex`).

🛠️ Refactor suggestion | 🟠 Major

Clarify data structures and concurrency strategy.

Lines 92-93 introduce out_of_sync_ctrl_[node_id] and mention "concurrency protection" but lack specificity:

  1. Data structure: Make explicit that out_of_sync_ctrl_ is likely std::unordered_map<uint32_t, OutOfSyncControl> where OutOfSyncControl is a struct containing {inflight_round, inflight, last_send_ts, last_progress_ts}.
  2. Concurrency: Specify whether to use std::shared_mutex or another mechanism. Given that round allocation (write-heavy) and multiple read operations will contend, the choice affects performance.
📋 Suggested clarification
- Add: `out_of_sync_ctrl_[node_id]` (stores `{inflight_round, inflight, last_send_ts, last_progress_ts}` or equivalent) for deduplicating repeated out-of-sync triggers.
- Add: concurrency protection for this map (`mutex`/`shared_mutex`).
+ Add: `std::unordered_map<uint32_t, OutOfSyncControl> out_of_sync_ctrl_` where `OutOfSyncControl` holds `{uint32_t inflight_round, bool inflight, uint64_t last_send_ts, uint64_t last_progress_ts}` for deduplicating repeated out-of-sync triggers.
+ Add: `std::shared_mutex out_of_sync_ctrl_mux_` for protecting concurrent access to the map (readers check inflight state, writers allocate rounds).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 89 - 93,
Clarify and concretely define the out-of-sync control state: introduce a typed
alias/field out_of_sync_ctrl_ as std::unordered_map<uint32_t, OutOfSyncControl>
and define struct OutOfSyncControl { uint64_t inflight_round; bool inflight;
std::chrono::steady_clock::time_point last_send_ts;
std::chrono::steady_clock::time_point last_progress_ts; /* add a per-entry mutex
if needed */ }; update references to out_of_sync_ctrl_, latest_subscribe_round_,
and subscribe_counter_ to use these types; for concurrency, prefer a hybrid
strategy: protect the map itself with a single std::shared_mutex for concurrent
reads/writes to map entries, and add a per-OutOfSyncControl std::mutex (or
std::atomic flags for simple fields) to avoid contention during write-heavy
round allocation (i.e., Sharder updates) while allowing concurrent readers to
inspect entries. Ensure all accesses follow this locking scheme (lock shared for
map lookups, upgrade/unique for insert/erase, and lock per-entry mutex for
modifying the OutOfSyncControl fields).
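The suggested state and locking scheme can be sketched as follows. This is a minimal illustration under the assumptions above (map-level `std::shared_mutex`, value-type entries); the class name, method names, and millisecond-timestamp fields are inventions for this sketch, not existing APIs:

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

struct OutOfSyncControl {
    uint64_t inflight_round{0};
    bool inflight{false};
    uint64_t last_send_ts_ms{0};
    uint64_t last_progress_ts_ms{0};
};

class OutOfSyncRegistry {
public:
    // Writer path: record that a notification for `round` is in flight.
    void MarkInflight(uint32_t node_id, uint64_t round, uint64_t now_ms) {
        std::unique_lock lk(mux_);
        OutOfSyncControl &c = ctrl_[node_id];
        c.inflight = true;
        c.inflight_round = round;
        c.last_send_ts_ms = now_ms;
    }

    // Progress path: a validated StandbyStartFollowing for the assigned round.
    void OnProgress(uint32_t node_id, uint64_t round, uint64_t now_ms) {
        std::unique_lock lk(mux_);
        auto it = ctrl_.find(node_id);
        if (it != ctrl_.end() && it->second.inflight_round == round) {
            it->second.last_progress_ts_ms = now_ms;
        }
    }

    // Reader path: snapshot an entry for dedup decisions (shared lock only).
    OutOfSyncControl Snapshot(uint32_t node_id) const {
        std::shared_lock lk(mux_);
        auto it = ctrl_.find(node_id);
        return it == ctrl_.end() ? OutOfSyncControl{} : it->second;
    }

private:
    mutable std::shared_mutex mux_;
    std::unordered_map<uint32_t, OutOfSyncControl> ctrl_;
};
```

Snapshotting by value under the shared lock keeps the reader path short; whether per-entry mutexes are additionally worthwhile depends on measured contention.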

Comment on lines +120 to +122
- Candidate state must include round/standby_term (cannot remain `start_seq_id` only).
- Recommended upgrade for `candidate_standby_nodes_`:
`node_id -> {start_seq_id, standby_term}`.

🛠️ Refactor suggestion | 🟠 Major

Specify the current structure of candidate_standby_nodes_ for clarity.

Lines 120-122 recommend upgrading candidate_standby_nodes_ to node_id -> {start_seq_id, standby_term} but don't document the current structure being replaced. This omission makes it difficult for implementers to understand the migration path.

📋 Suggested clarification
- Recommended upgrade for `candidate_standby_nodes_`:
-   `node_id -> {start_seq_id, standby_term}`.
+ Current structure: `candidate_standby_nodes_` is `std::unordered_map<uint32_t, uint64_t>` mapping `node_id -> start_seq_id`.
+ Recommended upgrade: Change to `std::unordered_map<uint32_t, CandidateStandbyInfo>` where `CandidateStandbyInfo` contains `{uint64_t start_seq_id, int64_t standby_term}` to support round-aware candidate tracking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 120 - 122,
Document that the existing candidate_standby_nodes_ mapping is currently node_id
-> start_seq_id (i.e., each entry stores only start_seq_id), then show the
upgraded mapping node_id -> {start_seq_id, standby_term} and explicitly state
that standby_term (round) must be populated during migration; update any
mentions of candidate_standby_nodes_ and related migration steps to reflect the
old schema, the new schema, and the required transformation for each entry.
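Assuming the current map is `node_id -> start_seq_id` as the comment reads it, the upgrade amounts to widening the mapped value and making candidate checks round-aware. A sketch with illustrative names (`CandidateStandbyInfo` and `CandidateMatches` are not existing identifiers):

```cpp
#include <cstdint>
#include <unordered_map>

// Old schema: std::unordered_map<uint32_t, uint64_t>  (node_id -> start_seq_id)
// New schema: round-aware candidate tracking.
struct CandidateStandbyInfo {
    uint64_t start_seq_id{0};
    int64_t standby_term{-1};  // packed (term, round) identity of the candidate
};

using CandidateMap = std::unordered_map<uint32_t, CandidateStandbyInfo>;

// A candidate is only valid for the (term, round) it was registered under.
bool CandidateMatches(const CandidateMap &m, uint32_t node_id,
                      int64_t standby_term) {
    auto it = m.find(node_id);
    return it != m.end() && it->second.standby_term == standby_term;
}
```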

Comment on lines +148 to +159
- Function Changes:
- `CcNodeService::ResetStandbySequenceId(...)`
- Parse `(term, round)` from `standby_node_term`.
- Validate `term == current leader term`.
- Validate `round == latest_subscribe_round_[node_id]`.
- Validate candidate is still valid; otherwise reject as stale.
- If already subscribed with same `(term, round)`, return idempotent success.
- Add helper APIs in `CcShard`:
- `HasCandidateStandby(node_id, standby_term)`
- `HasSubscribedStandby(node_id, standby_term)`
- `PromoteCandidateToSubscribed(...)` (optional)
- Definition of Done:

🛠️ Refactor suggestion | 🟠 Major

Explicitly document the AddHeartbeatTargetNode ordering fix.

The high-risk section (line 306) identifies that ResetStandbySequenceId currently adds the heartbeat target before validation, which can cause stale pollution. However, T5's function changes don't explicitly mention moving this call.

Context snippet 2 shows the current code adds the heartbeat target at the beginning (before any validation). The fix should move this call to after all triple checks pass.

📋 Suggested addition to Function Changes
  - `CcNodeService::ResetStandbySequenceId(...)`
    - Parse `(term, round)` from `standby_node_term`.
    - Validate `term == current leader term`.
    - Validate `round == latest_subscribe_round_[node_id]`.
    - Validate candidate is still valid; otherwise reject as stale.
    - If already subscribed with same `(term, round)`, return idempotent success.
+   - **Ordering fix**: Move `AddHeartbeatTargetNode(node_id, standby_node_term)` call to after all validation checks pass to prevent stale state pollution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 148 - 159,
ResetStandbySequenceId currently calls AddHeartbeatTargetNode before validating
the standby, which can pollute heartbeat state with stale candidates; update
CcNodeService::ResetStandbySequenceId to perform all validations first (parse
(term, round) from standby_node_term, verify term == current leader term, verify
round == latest_subscribe_round_[node_id], verify candidate still valid and not
stale, and return idempotent success if already subscribed with same (term,
round)) and only call AddHeartbeatTargetNode after those checks succeed; also
add/use the CcShard helper APIs HasCandidateStandby(node_id, standby_term) and
HasSubscribedStandby(node_id, standby_term) (and optionally
PromoteCandidateToSubscribed) to encapsulate candidate/subscribed checks and
make the final heartbeat-addition point unambiguous.
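The validation order T5 describes, with the heartbeat-target mutation deferred until every check passes, can be sketched as a pure function. All names besides `ResetStandbySequenceId` itself are assumptions; in particular, idempotent success is checked before candidate validity here, on the assumption that a promoted candidate is no longer in the candidate set:

```cpp
#include <cstdint>

enum class ResetResult {
    kOk, kIdempotentOk, kStaleTerm, kStaleRound, kNoCandidate
};

struct ResetContext {
    int64_t current_leader_term;
    uint64_t latest_round;    // latest_subscribe_round_[node_id]
    bool candidate_valid;     // HasCandidateStandby(node_id, standby_term)
    bool already_subscribed;  // HasSubscribedStandby(node_id, standby_term)
};

// (term, round) are parsed from standby_node_term before this runs.
ResetResult ValidateReset(int64_t term, uint64_t round,
                          const ResetContext &ctx) {
    if (term != ctx.current_leader_term) return ResetResult::kStaleTerm;
    if (round != ctx.latest_round) return ResetResult::kStaleRound;
    if (ctx.already_subscribed) return ResetResult::kIdempotentOk;
    if (!ctx.candidate_valid) return ResetResult::kNoCandidate;
    // Only a kOk result should trigger side effects such as
    // AddHeartbeatTargetNode(...); mutating earlier risks stale-state
    // pollution when the request is ultimately rejected.
    return ResetResult::kOk;
}
```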


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
.cursor_knowledge/cc_resubscribe_epoch_design_en.md (1)

36-38: ⚠️ Potential issue | 🟠 Major

Keep round ownership in Sharder, not CcShard.

CcNodeService::StandbyStartFollowing(...) walks shards independently, so storing latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] in CcShard would create per-shard round drift and multiple sources of truth for the same standby. These tables need a single owner in Sharder, with CcShard keeping only per-shard candidate/subscribed bookkeeping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 36 - 38,
The round ownership fields latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] are placed in CcShard which leads to per-shard drift
because CcNodeService::StandbyStartFollowing walks shards independently; move
these tables into the single owner Sharder and keep CcShard limited to per-shard
candidate/subscribed bookkeeping (e.g., retain subscribe_counter_ in Sharder
semantics but relocate latest_subscribe_round_ and out_of_sync_ctrl_ state and
any access/update logic from CcShard to Sharder), update any reads/writes in
CcNodeService::StandbyStartFollowing and related methods to call into Sharder
APIs for latest_subscribe_round_ and out_of_sync_ctrl_ so there is one source of
truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 74-85: Clarify and implement an explicit initial-join rule in
StandbyStartFollowing: when a request arrives with the default shape
(subscribe_round_id == 0 and resubscribe_intent == false) and there is no known
current round context for node_id (latest_subscribe_round_[node_id] is
empty/zero), treat it as an initial join and allocate a new round via
GetNextSubscribeId(), update latest_subscribe_round_[node_id], and return the
new subscribe_id; alternatively, if you prefer to require callers to signal
intent, add validation that rejects subscribe_round_id == 0 unless
resubscribe_intent == true and document that requirement. Ensure the behavior
references the existing branches comparing subscribe_round_id to
latest_subscribe_round_[node_id] and that response populates the current
subscribe_id field.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Around line 67-73: The host-manager path must explicitly handle initial joins
so a StandbyStartFollowingRequest with subscribe_round_id==0 and
resubscribe_intent==false does not get ignored; update the round-resolution
logic invoked by CcNodeService::OnStartFollowing to treat (subscribe_round_id ==
0 && resubscribe_intent == false) as a valid initial-join that allocates the
first round (same as the T4 path that allocates on resubscribe_intent==true), or
alternatively change the host-manager caller to send resubscribe_intent=true for
the very first join; modify the code that resolves
subscribe_round_id/resubscribe_intent to implement one of these two behaviors so
initial subscribes are accepted and a round is created.
- Around line 182-183: Do not call ClearOutOfSyncInflight from the
StandbyStartFollowing success path; StandbyStartFollowing only proves acceptance
of the round and subsequent reset/snapshot/subscribed-state work may still be in
flight, so clearing out_of_sync_ctrl_ here permits a second out-of-sync trigger
and breaks deduping. Instead, remove the ClearOutOfSyncInflight(node_id, round)
invocation from the StandbyStartFollowing success block and move it to the real
completion checkpoint (e.g., the handler that marks reset/promotion to
subscribed or any terminal "subscribed-complete" callback), ensuring
ClearOutOfSyncInflight only runs when the node reaches a terminal
subscribed/progress state.

---

Duplicate comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 36-38: The round ownership fields latest_subscribe_round_[node_id]
and out_of_sync_ctrl_[node_id] are placed in CcShard which leads to per-shard
drift because CcNodeService::StandbyStartFollowing walks shards independently;
move these tables into the single owner Sharder and keep CcShard limited to
per-shard candidate/subscribed bookkeeping (e.g., retain subscribe_counter_ in
Sharder semantics but relocate latest_subscribe_round_ and out_of_sync_ctrl_
state and any access/update logic from CcShard to Sharder), update any
reads/writes in CcNodeService::StandbyStartFollowing and related methods to call
into Sharder APIs for latest_subscribe_round_ and out_of_sync_ctrl_ so there is
one source of truth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a111cdad-49ed-4437-9bb0-4f1758271904

📥 Commits

Reviewing files that changed from the base of the PR and between eaef85f and d1ac798.

📒 Files selected for processing (2)
  • .cursor_knowledge/cc_resubscribe_epoch_design_en.md
  • .cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md

Comment on lines +74 to +85
#### 6.3 Round Resolution in `StandbyStartFollowing`

- If `resubscribe_intent=true` (request a newer round):
- If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
- If `subscribe_round_id == latest_subscribe_round_[node_id]`, allocate a new round (reuse `GetNextSubscribeId()`) and update latest.
- If `subscribe_round_id == 0` and there is no known current round context, allocate a new round and update latest (initial/join fallback).
- If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- If `resubscribe_intent=false` (join/retry existing round):
- If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
- If `subscribe_round_id == latest_subscribe_round_[node_id]`, handle as idempotent success.
- If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- Response returns the final round (the current `subscribe_id` field).

⚠️ Potential issue | 🟠 Major

Add an explicit initial-join rule for the default request shape.

Section 6.1 says the host-manager path does not carry round/intention, but this section never defines how the primary resolves a StandbyStartFollowingRequest that arrives with the new field defaults. Since first-join traffic will naturally look like (subscribe_round_id=0, resubscribe_intent=false) unless the caller overrides it, the current rules leave initial subscribe undefined. Please either require first join to use 0 + resubscribe_intent=true, or add a dedicated 0 + false + no current round branch that allocates the first round.
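The section-6.3 rules, extended with the dedicated 0 + false + no-current-round branch proposed here, can be sketched as a single resolution function. `AllocateRound` stands in for `GetNextSubscribeId()` and round 0 is assumed to mean "no round context yet"; everything else follows the branches quoted above:

```cpp
#include <cstdint>
#include <optional>

struct Resolution {
    uint64_t round;  // the final round returned in the subscribe_id field
};

// Returns nullopt when the request must be rejected (stale or out-of-order).
std::optional<Resolution> ResolveRound(uint64_t req_round,
                                       bool resubscribe_intent,
                                       uint64_t &latest_round,
                                       uint64_t (*AllocateRound)()) {
    if (req_round < latest_round) return std::nullopt;  // stale
    if (req_round > latest_round) return std::nullopt;  // invalid/out-of-order
    if (resubscribe_intent || latest_round == 0) {
        // Newer round requested, or initial join with the default request
        // shape (round 0, no current round context): allocate and update.
        latest_round = AllocateRound();
        return Resolution{latest_round};
    }
    // Join/retry of the current round: idempotent success.
    return Resolution{latest_round};
}
```

Folding the initial join into the allocation branch keeps the host-manager path on default field values without leaving the first subscribe undefined.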


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 74 - 85,
Clarify and implement an explicit initial-join rule in StandbyStartFollowing:
when a request arrives with the default shape (subscribe_round_id == 0 and
resubscribe_intent == false) and there is no known current round context for
node_id (latest_subscribe_round_[node_id] is empty/zero), treat it as an initial
join and allocate a new round via GetNextSubscribeId(), update
latest_subscribe_round_[node_id], and return the new subscribe_id;
alternatively, if you prefer to require callers to signal intent, add validation
that rejects subscribe_round_id == 0 unless resubscribe_intent == true and
document that requirement. Ensure the behavior references the existing branches
comparing subscribe_round_id to latest_subscribe_round_[node_id] and that
response populates the current subscribe_id field.

Comment on lines +67 to +73
- Keep `CcNodeService::OnStartFollowing(...)` protocol unchanged; it remains a trigger entry.
- Update trigger call sites: lag, stream idle timeout, out-of-sync path.
- Semantic constraints:
- lag / idle timeout: pass `current_round + resubscribe_intent=true`.
- out-of-sync (round pre-assigned by primary): pass `subscribe_round_id=<msg_round> + resubscribe_intent=false`.
- host manager path: no round/intention payload (default values).
- Definition of Done:

⚠️ Potential issue | 🟠 Major

Define how the host-manager path creates the first round.

This task leaves the host-manager entry on default values, but T4 only allocates a new round for resubscribe_intent=true. That makes an initial StandbyStartFollowingRequest with (subscribe_round_id=0, resubscribe_intent=false) undefined, so implementers can accidentally reject or no-op the first subscribe. Either require the first join to send 0 + resubscribe_intent=true, or add an explicit initial-join rule for 0 + false in the round-resolution task.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 67 - 73,
The host-manager path must explicitly handle initial joins so a
StandbyStartFollowingRequest with subscribe_round_id==0 and
resubscribe_intent==false does not get ignored; update the round-resolution
logic invoked by CcNodeService::OnStartFollowing to treat (subscribe_round_id ==
0 && resubscribe_intent == false) as a valid initial-join that allocates the
first round (same as the T4 path that allocates on resubscribe_intent==true), or
alternatively change the host-manager caller to send resubscribe_intent=true for
the very first join; modify the code that resolves
subscribe_round_id/resubscribe_intent to implement one of these two behaviors so
initial subscribes are accepted and a round is created.

Comment on lines +182 to +183
- Coordination with `T4`:
- After `StandbyStartFollowing` validation succeeds, call `ClearOutOfSyncInflight(node_id, round)`.

⚠️ Potential issue | 🟠 Major

Don't clear the out-of-sync inflight marker this early.

StandbyStartFollowing only proves the standby accepted the round; reset/snapshot/subscribed-state work can still be in flight afterwards. Clearing out_of_sync_ctrl_ here reopens the window for a second out-of-sync trigger to allocate a newer round during the same recovery, which defeats the dedup policy. Keep the marker until a real completion/progress checkpoint, such as successful reset/promotion to subscribed or an equivalent terminal state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 182 - 183,
Do not call ClearOutOfSyncInflight from the StandbyStartFollowing success path;
StandbyStartFollowing only proves acceptance of the round and subsequent
reset/snapshot/subscribed-state work may still be in flight, so clearing
out_of_sync_ctrl_ here permits a second out-of-sync trigger and breaks deduping.
Instead, remove the ClearOutOfSyncInflight(node_id, round) invocation from the
StandbyStartFollowing success block and move it to the real completion
checkpoint (e.g., the handler that marks reset/promotion to subscribed or any
terminal "subscribed-complete" callback), ensuring ClearOutOfSyncInflight only
runs when the node reaches a terminal subscribed/progress state.

@MrGuin MrGuin requested a review from liunyl March 9, 2026 06:30

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (8)
.cursor_knowledge/standby_resubscribe_design.md (3)

107-108: ⚠️ Potential issue | 🟠 Major

Do not clear out-of-sync inflight state on StandbyStartFollowing success.

A valid StandbyStartFollowing only proves the standby accepted the round; reset/snapshot/promotion can still be running afterwards. If the inflight marker is cleared here, a second out-of-sync trigger can allocate a newer round during the same recovery window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_design.md around lines 107 - 108, The
note incorrectly clears the out-of-sync inflight marker on processing a valid
StandbyStartFollowing; change the behavior so that when handling
StandbyStartFollowing(node_id, round=inflight_round, resubscribe_intent=false)
you only update last_progress_ts (or record the validated acceptance) but do NOT
clear/reset the inflight marker or treat this event as full progress completion
in the logic that manages inflight rounds; ensure any code paths using the
“progress” definition (StandbyStartFollowing with resubscribe_intent=false)
continue to treat reset/snapshot/promotion as separate events and only clear
inflight state when those specific completion signals occur.

65-65: ⚠️ Potential issue | 🟠 Major

Define the initial-join rule for the default request shape.

Section 6.1 says the host-manager path sends default values, but Section 6.3 still only defines allocation for subscribe_round_id == 0 under resubscribe_intent=true. A first StandbyStartFollowingRequest arriving as (0, false) is still underspecified, so please either allocate the first round in that branch or require callers to set resubscribe_intent=true for initial join.

Also applies to: 76-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_design.md at line 65, The behavior for
an initial join with default request shape is underspecified: update the host
manager's OnStartFollowing handling or the StandbyStartFollowingRequest contract
so a request with subscribe_round_id == 0 and resubscribe_intent == false is
deterministic; either (A) treat OnStartFollowing as a trigger-only path and
explicitly allocate/assign the first round when receiving
StandbyStartFollowingRequest((0, false)) inside the host-manager logic (e.g., in
the code that processes StandbyStartFollowingRequest) or (B) change the API/docs
to require callers set resubscribe_intent=true for initial joins; pick one
approach and implement it consistently so the allocation branch currently
guarded by resubscribe_intent=true also covers the initial (0, false) case or
the callers are validated to send resubscribe_intent=true.

36-38: ⚠️ Potential issue | 🟠 Major

Keep per-node round ownership in Sharder, not CcShard.

Documenting latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] as shard-local reintroduces multiple sources of truth for the same standby. This PR’s own task breakdown centralizes that state in Sharder; CcShard should keep only per-shard candidate/subscribed bookkeeping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_design.md around lines 36 - 38, The
docs and design incorrectly list latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] as CcShard-local; instead make Sharder the single
owner of per-node round state: remove those per-node fields from CcShard and
document them under Sharder (alongside subscribe_counter_), update any code
references to access latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] via Sharder APIs/fields rather than CcShard, and keep
CcShard limited to per-shard candidate/subscribed bookkeeping only.
.cursor_knowledge/standby_resubscribe_impl_tasks.md (5)

67-73: ⚠️ Potential issue | 🟠 Major

T4 still leaves the first join ambiguous.

T2 keeps the host-manager path on default values, but T4 only defines behavior once a latest round already exists. Please add an explicit (subscribe_round_id=0, resubscribe_intent=false, no current round) branch there, or change the host-manager path to request the first round explicitly.

Also applies to: 121-125

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 67 - 73,
The review points out T4 doesn't handle a first-join case; update the logic
around CcNodeService::OnStartFollowing and the host-manager trigger so the first
join is explicit: either add a branch where (subscribe_round_id=0,
resubscribe_intent=false, no current round) is passed by the host-manager path
before any latest round exists, or modify the host-manager path to request the
first round explicitly (i.e., set subscribe_round_id to the initial round and
resubscribe_intent appropriately); ensure all trigger call sites (lag, stream
idle timeout, out-of-sync, host-manager) follow the documented semantic
constraints and that CcNodeService::OnStartFollowing correctly distinguishes the
first-join branch from resubscribe flows.

167-170: ⚠️ Potential issue | 🟡 Minor

Include the Sharder files in T6’s implementation surface.

This task updates out_of_sync_ctrl_ and calls MarkOutOfSyncInflight(...) / ClearOutOfSyncInflight(...), but the listed files stop short of the Sharder state and APIs that back that logic. Add data_substrate/tx_service/include/sharder.h and data_substrate/tx_service/src/sharder.cpp here so T6 is implementable on its own.

Also applies to: 173-180

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 167 - 170,
The file list omitted the Sharder implementation needed by T6; update the task’s
file list to include the Sharder headers and source so the out_of_sync_ctrl_
usages and calls to MarkOutOfSyncInflight(...) and ClearOutOfSyncInflight(...)
are resolvable — specifically add data_substrate/tx_service/include/sharder.h
and data_substrate/tx_service/src/sharder.cpp to the listed files so the Sharder
state/APIs used by out_of_sync_ctrl_, MarkOutOfSyncInflight, and
ClearOutOfSyncInflight are present.

5-5: ⚠️ Potential issue | 🟡 Minor

Point this checklist at the current design doc.

The filename here does not match the design document added in this PR (.cursor_knowledge/standby_resubscribe_design.md), so this task list can send implementers to the wrong spec.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_impl_tasks.md at line 5, The checklist
file .cursor_knowledge/standby_resubscribe_impl_tasks.md currently references
cc_resubscribe_epoch_design.md; update the reference text and any links in this
file to point to the actual design doc added in the PR
(.cursor_knowledge/standby_resubscribe_design.md) so implementers are directed
to the correct spec, and remove or replace any mentions of
cc_resubscribe_epoch_design.md to avoid confusion.

182-183: ⚠️ Potential issue | 🟠 Major

Don’t clear out-of-sync inflight at StandbyStartFollowing success.

That checkpoint only confirms round acceptance; reset/snapshot/subscribed promotion can still be running. Clearing the marker here defeats the dedup window and allows a newer round to be allocated mid-recovery.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 182 - 183,
The current logic calls ClearOutOfSyncInflight(node_id, round) immediately when
StandbyStartFollowing validation succeeds, which is incorrect because recovery
(reset/snapshot/subscribed promotion) may still be in progress; remove that
ClearOutOfSyncInflight call from the StandbyStartFollowing success path and
instead invoke ClearOutOfSyncInflight only after the recovery/promotion has
fully completed (e.g., in the completion callback/handler that signals
reset/snapshot/subscribed-promotion finished), or at the explicit point where
the deduplication window is guaranteed safe; update any references in the
StandbyStartFollowing handler to rely on the recovery-complete event rather than
clearing the marker there.
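A small state-machine sketch of the dedup window the comment is protecting: the inflight marker set by `MarkOutOfSyncInflight` survives the `StandbyStartFollowing` acceptance (which only records progress) and is cleared only by the recovery-complete event. Class and method names are hypothetical stand-ins:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Illustrative sketch of the out-of-sync dedup window.
class OutOfSyncCtrl {
public:
    void MarkOutOfSyncInflight(uint32_t node, uint64_t round) {
        inflight_[node] = round;
    }

    bool HasInflight(uint32_t node) const { return inflight_.count(node) > 0; }

    // Round accepted by StandbyStartFollowing: record progress only.
    // The marker stays set, so no newer round can be allocated mid-recovery.
    void OnStartFollowingAccepted(uint32_t node, uint64_t round, uint64_t now) {
        auto it = inflight_.find(node);
        if (it != inflight_.end() && it->second == round) {
            last_progress_ts_[node] = now;
        }
    }

    // Reset/snapshot/subscribed promotion finished: only now may the
    // window close (this is where ClearOutOfSyncInflight belongs).
    void OnRecoveryComplete(uint32_t node, uint64_t round) {
        auto it = inflight_.find(node);
        if (it != inflight_.end() && it->second == round) {
            inflight_.erase(it);
        }
    }

private:
    std::unordered_map<uint32_t, uint64_t> inflight_;
    std::unordered_map<uint32_t, uint64_t> last_progress_ts_;
};
```

Clearing in `OnRecoveryComplete` rather than at acceptance is exactly the ordering the comment argues for: acceptance proves the round is live, not that recovery is done.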

146-156: ⚠️ Potential issue | 🟠 Major

Call out the heartbeat-ordering fix in T5.

The current CcNodeService::ResetStandbySequenceId(...) adds the heartbeat target before any validation in tx_service/src/remote/cc_node_service.cpp:1847-1888. T5 should explicitly move AddHeartbeatTargetNode(...) to after the term/round/candidate checks, otherwise the stale-state pollution bug is easy to preserve during implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 146 - 156,
The heartbeat-target is being added too early in
CcNodeService::ResetStandbySequenceId; move the call to
AddHeartbeatTargetNode(...) so it runs only after you parse and validate (term,
round) from standby_node_term and confirm term == current leader term, round ==
latest_subscribe_round_[node_id], and the candidate is still valid (reject if
stale); also preserve idempotent success when already subscribed with the same
(term, round). Update ResetStandbySequenceId in
tx_service/src/remote/cc_node_service.cpp to perform all term/round/candidate
checks first and only call AddHeartbeatTargetNode after those checks (and after
any Promoting/Subscribe state transitions via CcShard helper APIs like
HasCandidateStandby, HasSubscribedStandby, PromoteCandidateToSubscribed if
used).
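The ordering fix can be sketched as a pure function: every term/round/candidate check runs before the heartbeat target is registered, so a stale request leaves heartbeat state untouched. The struct and helper names below are hypothetical stand-ins for the real request parsing and `AddHeartbeatTargetNode` call:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative ordering sketch for ResetStandbySequenceId.
struct ResetRequest {
    int64_t term;    // parsed from standby_node_term
    uint64_t round;  // parsed from standby_node_term
};

bool HandleResetStandbySequenceId(const ResetRequest& req,
                                  int64_t leader_term,
                                  uint64_t latest_round,
                                  bool candidate_valid,
                                  bool& heartbeat_added) {
    // 1. Validate first: reject stale term, stale round, or missing candidate.
    if (req.term != leader_term || req.round != latest_round ||
        !candidate_valid) {
        return false;  // heartbeat state is never touched on rejection
    }
    // 2. Only a fully validated request registers the heartbeat target
    //    (stands in for AddHeartbeatTargetNode).
    heartbeat_added = true;
    return true;
}
```

The current code inverts steps 1 and 2, which is precisely the stale-state pollution the comment flags: the heartbeat target gets added even when validation later rejects the request.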
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.cursor_knowledge/standby_resubscribe_design.md:
- Around line 107-108: The note incorrectly clears the out-of-sync inflight
marker on processing a valid StandbyStartFollowing; change the behavior so that
when handling StandbyStartFollowing(node_id, round=inflight_round,
resubscribe_intent=false) you only update last_progress_ts (or record the
validated acceptance) but do NOT clear/reset the inflight marker or treat this
event as full progress completion in the logic that manages inflight rounds;
ensure any code paths using the “progress” definition (StandbyStartFollowing
with resubscribe_intent=false) continue to treat reset/snapshot/promotion as
separate events and only clear inflight state when those specific completion
signals occur.
- Line 65: The behavior for an initial join with default request shape is
underspecified: update the host manager's OnStartFollowing handling or the
StandbyStartFollowingRequest contract so a request with subscribe_round_id == 0
and resubscribe_intent == false is deterministic; either (A) treat
OnStartFollowing as a trigger-only path and explicitly allocate/assign the first
round when receiving StandbyStartFollowingRequest((0, false)) inside the
host-manager logic (e.g., in the code that processes
StandbyStartFollowingRequest) or (B) change the API/docs to require callers set
resubscribe_intent=true for initial joins; pick one approach and implement it
consistently so the allocation branch currently guarded by
resubscribe_intent=true also covers the initial (0, false) case or the callers
are validated to send resubscribe_intent=true.


📥 Commits

Reviewing files that changed from the base of the PR and between d1ac798 and 4fca194.

📒 Files selected for processing (2)
  • .cursor_knowledge/standby_resubscribe_design.md
  • .cursor_knowledge/standby_resubscribe_impl_tasks.md

@MrGuin MrGuin marked this pull request as draft March 9, 2026 07:36
@MrGuin MrGuin changed the title Fix standby resubscribe Fix standby resubscribe when subscription is ongoing. Mar 9, 2026