Walkthrough

Introduces comprehensive design and implementation documentation for a revised resubscribe mechanism using `subscribe_round_id` for round-based tracking and `follow_req_id` for local context preservation. Defines protocol changes, validation rules, state management, and a task breakdown (T1–T10) with a dependency graph for phased execution toward deadlock removal and unified round semantics.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tx_service/cc_resubscribe_epoch_design_en.md`:
- Around line 36-38: The review warns that round ownership must be centralized,
so move the per-node state latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] out of CcShard and into the single Sharder owner
(where subscribe_counter_ and round allocation live); update CcShard to keep
only per-shard candidate/subscribed bookkeeping, update Sharder to store and
manage latest_subscribe_round and out_of_sync_ctrl (keyed by node_id), and
adjust CcNodeService::StandbyStartFollowing(...) and any handlers that
read/update these symbols to call into Sharder APIs for reading/updating
subscribe_round_id, latest_subscribe_round, and out_of_sync_ctrl, preventing
cross-shard drift and keeping a single source of truth.
In `@tx_service/cc_resubscribe_v2_impl_tasks_en.md`:
- Line 5: Update the design filename reference in
tx_service/cc_resubscribe_v2_impl_tasks_en.md: replace the incorrect string
"cc_resubscribe_epoch_design.md" with the correct document name
"tx_service/cc_resubscribe_epoch_design_en.md" so the implementation checklist
points to the right spec; ensure any internal links or mentions in the same file
are updated consistently.
- Around line 169-186: Add the missing Sharder implementation referenced by the
out-of-sync flow: create data_substrate/tx_service/include/sharder.h and
data_substrate/tx_service/src/sharder.cpp that declare and define the
out_of_sync_ctrl_ state and its API (the methods and types referenced by
CcShard: MarkOutOfSyncInflight(node_id, round), ClearOutOfSyncInflight(node_id,
round), and any helpers/structs used to check last_progress_ts and last_send_ts
and to allocate rounds), and ensure Sharder exposes the same method signatures
used by CcShard::NotifyStandbyOutOfSync and the StandbyStartFollowing validation
so the calls compile and the inflight/round logic is backed by concrete storage
and thread-safe accessors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2e66c240-3525-44b7-bc21-6da250ac3f0d
📒 Files selected for processing (2)
- `tx_service/cc_resubscribe_epoch_design_en.md`
- `tx_service/cc_resubscribe_v2_impl_tasks_en.md`
| Primary (`Sharder`) | `subscribe_counter_` (existing) | Round allocator that assigns `subscribe_round_id`. |
| Primary (`CcShard`) | `latest_subscribe_round_[node_id]` | Latest accepted round for this standby; used to reject stale requests. |
| Primary (`CcShard`) | `out_of_sync_ctrl_[node_id]` | Out-of-sync dedup control state (for example `{inflight_round, inflight, last_send_ts, last_progress_ts}`) to avoid repeated triggers continuously allocating new rounds. |
Do not make round ownership shard-local.
latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] should not live in CcShard. CcNodeService::StandbyStartFollowing(...) currently walks shards independently (tx_service/src/remote/cc_node_service.cpp:1590-1665), so storing this state per shard would create multiple sources of truth and allow cross-shard round drift. This state needs a single owner per standby/node, e.g. Sharder, with CcShard keeping only per-shard candidate/subscribed bookkeeping.
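A minimal sketch of the single-owner layout this comment proposes, using hypothetical accessor names (`AllocateRound`, `GetLatestSubscribeRound`) that are assumptions for illustration, not the actual `Sharder` API:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Illustrative single-owner sketch: round allocation and the per-node latest
// round live together behind one lock, so every shard's validation path reads
// the same source of truth. Names are assumptions, not the real Sharder class.
class Sharder {
public:
    // Allocates a monotonically increasing round and records it as the
    // latest accepted round for this standby node.
    uint64_t AllocateRound(uint32_t node_id) {
        std::lock_guard<std::mutex> lk(mu_);
        uint64_t round = ++subscribe_counter_;
        latest_subscribe_round_[node_id] = round;
        return round;
    }

    // Consulted by every shard's StandbyStartFollowing validation; returns 0
    // when there is no known round context for the node.
    uint64_t GetLatestSubscribeRound(uint32_t node_id) {
        std::lock_guard<std::mutex> lk(mu_);
        auto it = latest_subscribe_round_.find(node_id);
        return it == latest_subscribe_round_.end() ? 0 : it->second;
    }

private:
    std::mutex mu_;
    uint64_t subscribe_counter_ = 0;
    std::unordered_map<uint32_t, uint64_t> latest_subscribe_round_;
};
```

Because the allocator and the per-node latest round share one lock, no shard can observe a round newer than the one the allocator handed out, which is the drift the comment is guarding against.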
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tx_service/cc_resubscribe_epoch_design_en.md` around lines 36 - 38, The
review warns that round ownership must be centralized, so move the per-node
state latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] out of
CcShard and into the single Sharder owner (where subscribe_counter_ and round
allocation live); update CcShard to keep only per-shard candidate/subscribed
bookkeeping, update Sharder to store and manage latest_subscribe_round and
out_of_sync_ctrl (keyed by node_id), and adjust
CcNodeService::StandbyStartFollowing(...) and any handlers that read/update
these symbols to call into Sharder APIs for reading/updating subscribe_round_id,
latest_subscribe_round, and out_of_sync_ctrl to prevent cross-shard drift and a
single source of truth.
- Files Involved:
  - `data_substrate/tx_service/src/cc/cc_shard.cpp`
  - `data_substrate/tx_service/include/proto/cc_request.proto`
  - `data_substrate/tx_service/include/cc/cc_request.h`
- State Changes:
  - Before sending an out-of-sync notification, the primary checks/updates `out_of_sync_ctrl_`; repeated triggers should reuse the inflight round by default.
- Function Changes:
  - `CcShard::NotifyStandbyOutOfSync(uint32_t node_id)`
    - If this standby already has an inflight out-of-sync round:
      - if `now - last_progress_ts < 20s` and `now - last_send_ts >= 2s`: resend the same round;
      - if `now - last_progress_ts < 20s` and `now - last_send_ts < 2s`: ignore;
      - if `now - last_progress_ts >= 20s`: allocate a newer round.
    - Otherwise allocate/update the round and set `req->set_subscribe_round_id(...)`.
    - After sending, call `MarkOutOfSyncInflight(node_id, round)`.
  - Standby out-of-sync handling calls `OnStartFollowing(..., subscribe_round_id=<msg>, resubscribe_intent=false)`.
- Coordination with `T4`:
  - After `StandbyStartFollowing` validation succeeds, call `ClearOutOfSyncInflight(node_id, round)`.
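The threshold logic quoted above reduces to a pure decision function. The sketch below uses illustrative names and millisecond timestamps; it is not the actual `CcShard` code:

```cpp
#include <cstdint>

// Outcomes of the dedup check before sending an out-of-sync notification.
enum class OutOfSyncAction { kResendSameRound, kIgnore, kAllocateNewerRound };

// Thresholds from the quoted task (2s resend gap, 20s no-progress escalation).
constexpr uint64_t kSameRoundResendIntervalMs = 2000;
constexpr uint64_t kNoProgressTimeoutMs = 20000;

OutOfSyncAction DecideOutOfSyncAction(uint64_t now_ms,
                                      uint64_t last_send_ts_ms,
                                      uint64_t last_progress_ts_ms) {
    if (now_ms - last_progress_ts_ms >= kNoProgressTimeoutMs) {
        // No progress for 20s: escalate by allocating a newer round.
        return OutOfSyncAction::kAllocateNewerRound;
    }
    if (now_ms - last_send_ts_ms >= kSameRoundResendIntervalMs) {
        // Progress is recent but the last send is stale: resend the same round.
        return OutOfSyncAction::kResendSameRound;
    }
    // Recent progress and a recent send: coalesce the repeated trigger.
    return OutOfSyncAction::kIgnore;
}
```

Keeping the decision side-effect-free makes the 2s/2s/20s branch table directly unit-testable, independent of the surrounding send path.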
T6 omits the Sharder changes it depends on.
This task updates `out_of_sync_ctrl_` and calls `MarkOutOfSyncInflight(...)` / `ClearOutOfSyncInflight(...)`, but T3 defines that state/API on `Sharder` (lines 89-104). Please add `data_substrate/tx_service/include/sharder.h` and `data_substrate/tx_service/src/sharder.cpp` here as well; otherwise the out-of-sync flow is only half specified.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tx_service/cc_resubscribe_v2_impl_tasks_en.md` around lines 169 - 186, Add
the missing Sharder implementation referenced by the out-of-sync flow: create
data_substrate/tx_service/include/sharder.h and
data_substrate/tx_service/src/sharder.cpp that declare and define the
out_of_sync_ctrl_ state and its API (the methods and types referenced by
CcShard: MarkOutOfSyncInflight(node_id, round), ClearOutOfSyncInflight(node_id,
round), and any helpers/structs used to check last_progress_ts and last_send_ts
and to allocate rounds), and ensure Sharder exposes the same method signatures
used by CcShard::NotifyStandbyOutOfSync and the StandbyStartFollowing validation
so the calls compile and the inflight/round logic is backed by concrete storage
and thread-safe accessors.
Actionable comments posted: 7
♻️ Duplicate comments (3)
.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md (2)
5-5: ⚠️ Potential issue | 🟡 Minor

Fix the referenced design filename.

Line 5 points to `cc_resubscribe_epoch_design.md`, but based on the past review, the correct filename should be `tx_service/cc_resubscribe_epoch_design_en.md`. This mismatch will make it difficult for implementers to locate the correct design specification.

📝 Proposed fix

```diff
-This checklist translates `cc_resubscribe_epoch_design.md` (revised v2) into executable implementation tasks.
+This checklist translates `tx_service/cc_resubscribe_epoch_design_en.md` (revised v2) into executable implementation tasks.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md at line 5, Update the incorrect design filename reference: locate the string "cc_resubscribe_epoch_design.md" in cc_resubscribe_v2_impl_tasks_en.md and replace it with the correct path "tx_service/cc_resubscribe_epoch_design_en.md" so implementers can find the proper design doc; ensure any other occurrences in the same file are updated consistently.
169-172: ⚠️ Potential issue | 🟡 Minor

T6 omits the Sharder files it depends on.

Lines 182 and 186 reference `MarkOutOfSyncInflight(...)` and `ClearOutOfSyncInflight(...)`, which T3 defines on `Sharder` (lines 102-103). The Files Involved list should include `data_substrate/tx_service/include/sharder.h` and `data_substrate/tx_service/src/sharder.cpp`.

📝 Proposed fix

```diff
 - Files Involved:
   - `data_substrate/tx_service/src/cc/cc_shard.cpp`
   - `data_substrate/tx_service/include/proto/cc_request.proto`
   - `data_substrate/tx_service/include/cc/cc_request.h`
+  - `data_substrate/tx_service/include/sharder.h`
+  - `data_substrate/tx_service/src/sharder.cpp`
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 169 - 172, The Files Involved list omitted the Sharder files required by T6; update the list to include the Sharder header and implementation (add sharder.h and sharder.cpp) because T6 references Sharder methods MarkOutOfSyncInflight(...) and ClearOutOfSyncInflight(...), which are defined on the Sharder type; ensure those symbols (Sharder::MarkOutOfSyncInflight and Sharder::ClearOutOfSyncInflight) are covered by adding the Sharder files to the Files Involved section.

.cursor_knowledge/cc_resubscribe_epoch_design_en.md (1)
36-38: ⚠️ Potential issue | 🔴 Critical

Do not make round ownership shard-local.

As flagged in previous reviews, `latest_subscribe_round_[node_id]` and `out_of_sync_ctrl_[node_id]` should not live in `CcShard`. Since `CcNodeService::StandbyStartFollowing(...)` walks shards independently, storing this state per shard creates multiple sources of truth and allows cross-shard round drift. This state needs a single owner per standby/node (e.g., `Sharder`, where `subscribe_counter_` already lives), with `CcShard` keeping only per-shard candidate/subscribed bookkeeping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 36 - 38, The per-standby state latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] must be moved out of CcShard into the single owner (Sharder) alongside subscribe_counter_; update CcShard to only hold per-shard candidate/subscribed bookkeeping, remove the per-shard arrays, and change all reads/writes (including in CcNodeService::StandbyStartFollowing and any CcShard methods) to consult the Sharder-owned structures (e.g., provide accessor APIs on Sharder like GetLatestSubscribeRound(node_id)/UpdateOutOfSyncCtrl(node_id)) so there is a single source of truth and no cross-shard drift. Ensure concurrent access is protected by the same locking used for subscribe_counter_ and adjust tests/initialization to populate the Sharder maps instead of per-shard fields.
🧹 Nitpick comments (2)
.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md (1)
174-186: Specify cleanup strategy for `out_of_sync_ctrl_` entries.

The task describes updating `out_of_sync_ctrl_` state when sending out-of-sync notifications (lines 174, 182) and clearing inflight flags (line 186), but doesn't specify when entries should be removed from the map entirely. Without cleanup logic, the map will grow indefinitely.

🧹 Suggested addition
Add to the Function Changes section:
- Cleanup policy:
  - Clear the `out_of_sync_ctrl_[node_id]` entry when the standby successfully subscribes (after `ResetStandbySequenceId` completes).
  - Or: clear the entry when the standby is removed from all `subscribed_standby_nodes_` across all shards.
  - Or: use time-based eviction for entries where `last_progress_ts` is very old (e.g., > 5 minutes).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 174 - 186, The out_of_sync_ctrl_ map needs an explicit cleanup policy to avoid unbounded growth: update CcShard::NotifyStandbyOutOfSync / MarkOutOfSyncInflight to ensure entries for node_id are removed when a standby successfully subscribes (i.e., after ResetStandbySequenceId completes) by calling ClearOutOfSyncInflight(node_id, round) and then erasing out_of_sync_ctrl_[node_id]; additionally remove the entry when a standby is removed from subscribed_standby_nodes_ across all shards, and add a time-based eviction (e.g., if last_progress_ts > 5 minutes) to periodically purge stale entries so out_of_sync_ctrl_ cannot grow indefinitely.

.cursor_knowledge/cc_resubscribe_epoch_design_en.md (1)
124-127: Clarify "reject" behavior for messages missing the round field.

Line 127 states "If a forwarded message does not carry round, reject it directly and record a warning metric." Specify whether "reject" means:
- Silently drop the message (no response sent)
- Send an error/NACK response to the sender
- Close the connection/stream
This is important for debugging and ensuring the sender can detect and handle the protocol violation appropriately.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 124 - 127, Clarify the "reject" semantics for forwarded messages missing the round: update the text so that when a standby receives a forwarded message without a round field it sends an explicit error/NACK response back to the sender (not silently drop or close the connection), logs the protocol violation, and increments the warning metric; reference the check against active_subscribe_ctx_.round and the "forwarded message" handling to locate where to document/implement this behavior and ensure the NACK includes the reason "missing round" for easier debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 146-159: Add a new Test 12 to the test matrix that exercises
cross-shard concurrency for round state: simulate concurrent
StandbyStartFollowing and ResetStandbySequenceId operations targeting the same
node_id across multiple shards and assert that the shared state
latest_subscribe_round_[node_id] is identical on all shards, that reads from
different shards observe the same round value, and that concurrent round
allocations never produce conflicting or divergent rounds; ensure the test also
covers the case where round ownership is centralized (the code paths that
reference or update latest_subscribe_round_[node_id] and any allocator logic
invoked by resubscribe_intent) and fails if any shard reports a different round
or multiple distinct rounds are allocated for the same node.
- Around line 79-91: Clarify and tighten the round-resolution rules in
StandbyStartFollowing: define "no known current round context" as
latest_subscribe_round_[node_id] == 0 and there is no active subscribe_id/state
for that node (explicitly check both latest_subscribe_round_ and the node's
current subscribe context), treat subscribe_round_id >
latest_subscribe_round_[node_id] as a protocol violation — add a logged
error/metric with diagnostics (remote node id, received subscribe_round_id,
expected latest_subscribe_round_[node_id], and request metadata) and reject the
request (do not allocate a new round), and make same-round idempotent handling
(subscribe_round_id == latest_subscribe_round_[node_id] when
resubscribe_intent==false) explicitly return the current subscribe_id and the
node's current state snapshot (not a silent no-op) so callers receive a state
refresh; update code paths referencing GetNextSubscribeId(),
latest_subscribe_round_[node_id], subscribe_round_id, and subscribe_id
accordingly.
- Around line 21-23: Clarify that standby_node_term keeps the same bit layout
and monotonic ordering by explicitly stating in the design that the upper 32
bits remain primary_term and the lower 32 bits are now called subscribe_round_id
but preserve the same monotonic/ordering semantics as the old subscribe_id;
mention that code which extracts the lower 32 bits (e.g.,
SubscribeIdFromStandbyTerm() used in snapshot_manager.cpp at the noted call
sites) can continue to compare those numeric round values directly because
subscribe_round_id is guaranteed to be monotonically increasing and compatible
with existing sentinel/ordering checks.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Around line 148-159: ResetStandbySequenceId currently calls
AddHeartbeatTargetNode before validating the standby, which can pollute
heartbeat state with stale candidates; update
CcNodeService::ResetStandbySequenceId to perform all validations first (parse
(term, round) from standby_node_term, verify term == current leader term, verify
round == latest_subscribe_round_[node_id], verify candidate still valid and not
stale, and return idempotent success if already subscribed with same (term,
round)) and only call AddHeartbeatTargetNode after those checks succeed; also
add/use the CcShard helper APIs HasCandidateStandby(node_id, standby_term) and
HasSubscribedStandby(node_id, standby_term) (and optionally
PromoteCandidateToSubscribed) to encapsulate candidate/subscribed checks and
make the final heartbeat-addition point unambiguous.
- Around line 120-122: Document that the existing candidate_standby_nodes_
mapping is currently node_id -> start_seq_id (i.e., each entry stores only
start_seq_id), then show the upgraded mapping node_id -> {start_seq_id,
standby_term} and explicitly state that standby_term (round) must be populated
during migration; update any mentions of candidate_standby_nodes_ and related
migration steps to reflect the old schema, the new schema, and the required
transformation for each entry.
- Around line 89-93: Clarify and concretely define the out-of-sync control
state: introduce a typed alias/field out_of_sync_ctrl_ as
std::unordered_map<uint32_t, OutOfSyncControl> and define struct
OutOfSyncControl { uint64_t inflight_round; bool inflight;
std::chrono::steady_clock::time_point last_send_ts;
std::chrono::steady_clock::time_point last_progress_ts; /* add a per-entry mutex
if needed */ }; update references to out_of_sync_ctrl_, latest_subscribe_round_,
and subscribe_counter_ to use these types; for concurrency, prefer a hybrid
strategy: protect the map itself with a single std::shared_mutex for concurrent
reads/writes to map entries, and add a per-OutOfSyncControl std::mutex (or
std::atomic flags for simple fields) to avoid contention during write-heavy
round allocation (i.e., Sharder updates) while allowing concurrent readers to
inspect entries. Ensure all accesses follow this locking scheme (lock shared for
map lookups, upgrade/unique for insert/erase, and lock per-entry mutex for
modifying the OutOfSyncControl fields).
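The hybrid locking scheme described in the last item could look roughly like the following. Type and method names here are assumptions for illustration, not the real `Sharder` members:

```cpp
#include <chrono>
#include <cstdint>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Per-node dedup state; each entry carries its own mutex so writes to one
// node's fields do not block readers of other nodes' entries.
struct OutOfSyncControl {
    std::mutex mu;  // per-entry lock guarding the fields below
    uint64_t inflight_round = 0;
    bool inflight = false;
    std::chrono::steady_clock::time_point last_send_ts{};
    std::chrono::steady_clock::time_point last_progress_ts{};
};

class OutOfSyncRegistry {
public:
    // Unique lock on the map mutex: insertion may change map topology.
    // Entries are heap-allocated, so returned references stay stable.
    OutOfSyncControl& GetOrCreate(uint32_t node_id) {
        std::unique_lock<std::shared_mutex> lk(map_mu_);
        auto& slot = ctrl_[node_id];
        if (!slot) slot = std::make_unique<OutOfSyncControl>();
        return *slot;
    }

    // Shared lock for the map lookup; per-entry lock for the field writes.
    void MarkInflight(uint32_t node_id, uint64_t round) {
        std::shared_lock<std::shared_mutex> lk(map_mu_);
        auto it = ctrl_.find(node_id);
        if (it == ctrl_.end()) return;
        std::lock_guard<std::mutex> entry_lk(it->second->mu);
        it->second->inflight = true;
        it->second->inflight_round = round;
    }

private:
    std::shared_mutex map_mu_;
    std::unordered_map<uint32_t, std::unique_ptr<OutOfSyncControl>> ctrl_;
};
```

Storing entries behind `std::unique_ptr` is what makes the per-entry locking safe: map rehashing moves the pointers, not the `OutOfSyncControl` objects, so a thread holding an entry lock is unaffected by concurrent inserts.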
---
Duplicate comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 36-38: The per-standby state latest_subscribe_round_[node_id] and
out_of_sync_ctrl_[node_id] must be moved out of CcShard into the single owner
(Sharder) alongside subscribe_counter_; update CcShard to only hold per-shard
candidate/subscribed bookkeeping, remove the per-shard arrays, and change all
reads/writes (including in CcNodeService::StandbyStartFollowing and any CcShard
methods) to consult the Sharder-owned structures (e.g., provide accessor APIs on
Sharder like GetLatestSubscribeRound(node_id)/UpdateOutOfSyncCtrl(node_id)) so
there is a single source of truth and no cross-shard drift. Ensure concurrent
access is protected by the same locking used for subscribe_counter_ and adjust
tests/initialization to populate the Sharder maps instead of per-shard fields.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Line 5: Update the incorrect design filename reference: locate the string
"cc_resubscribe_epoch_design.md" in cc_resubscribe_v2_impl_tasks_en.md and
replace it with the correct path "tx_service/cc_resubscribe_epoch_design_en.md"
so implementers can find the proper design doc; ensure any other occurrences in
the same file are updated consistently.
- Around line 169-172: The Files Involved list omitted the Sharder files
required by T6; update the list to include the Sharder header and implementation
(add sharder.h and sharder.cpp) because T6 references Sharder methods
MarkOutOfSyncInflight(...) and ClearOutOfSyncInflight(...), which are defined on
the Sharder type—ensure those symbols (Sharder::MarkOutOfSyncInflight and
Sharder::ClearOutOfSyncInflight) are covered by adding the Sharder files to the
Files Involved section.
---
Nitpick comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 124-127: Clarify the "reject" semantics for forwarded messages
missing the round: update the text so that when a standby receives a forwarded
message without a round field it sends an explicit error/NACK response back to
the sender (not silently drop or close the connection), logs the protocol
violation, and increments the warning metric; reference the check against
active_subscribe_ctx_.round and the "forwarded message" handling to locate where
to document/implement this behavior and ensure the NACK includes the reason
"missing round" for easier debugging.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Around line 174-186: The out_of_sync_ctrl_ map needs an explicit cleanup
policy to avoid unbounded growth: update CcShard::NotifyStandbyOutOfSync /
MarkOutOfSyncInflight to ensure entries for node_id are removed when a standby
successfully subscribes (i.e., after ResetStandbySequenceId completes) by
calling ClearOutOfSyncInflight(node_id, round) and then erasing
out_of_sync_ctrl_[node_id]; additionally remove the entry when a standby is
removed from subscribed_standby_nodes_ across all shards, and add a time-based
eviction (e.g., if last_progress_ts > 5 minutes) to periodically purge stale
entries so out_of_sync_ctrl_ cannot grow indefinitely.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1c6f6789-4b9a-4f61-b7e3-94ffb9218569
📒 Files selected for processing (2)
- `.cursor_knowledge/cc_resubscribe_epoch_design_en.md`
- `.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md`
#### 6.3 Round Resolution in `StandbyStartFollowing`

- If `resubscribe_intent=true` (request a newer round):
  - If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
  - If `subscribe_round_id == latest_subscribe_round_[node_id]`, allocate a new round (reuse `GetNextSubscribeId()`) and update latest.
  - If `subscribe_round_id == 0` and there is no known current round context, allocate a new round and update latest (initial/join fallback).
  - If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- If `resubscribe_intent=false` (join/retry existing round):
  - If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
  - If `subscribe_round_id == latest_subscribe_round_[node_id]`, handle as idempotent success.
  - If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- Response returns the final round (the current `subscribe_id` field).
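The branch table above reduces to a small pure function. The sketch below uses assumed names (`ResolveRound`, a `RoundDecision` enum) purely for illustration; it is not the actual `StandbyStartFollowing` signature:

```cpp
#include <cstdint>

// Outcomes of the round-resolution check on the primary.
enum class RoundDecision {
    kRejectStale,        // subscribe_round_id < latest: stale request
    kAllocateNewRound,   // resubscribe_intent=true on the current round
    kIdempotentSuccess,  // resubscribe_intent=false on the current round
    kRejectOutOfOrder    // subscribe_round_id > latest: protocol violation
};

RoundDecision ResolveRound(uint64_t subscribe_round_id,
                           uint64_t latest_round,
                           bool resubscribe_intent) {
    if (subscribe_round_id > latest_round) {
        // A standby should never know a newer round than the allocator.
        return RoundDecision::kRejectOutOfOrder;
    }
    if (subscribe_round_id < latest_round) {
        return RoundDecision::kRejectStale;
    }
    // subscribe_round_id == latest_round. Note the initial/join fallback
    // (round 0 with no known round context) lands here too, since an unknown
    // node's latest round is 0.
    return resubscribe_intent ? RoundDecision::kAllocateNewRound
                              : RoundDecision::kIdempotentSuccess;
}
```

Collapsing the `== 0` fallback into the equality branch works only if "no known current round context" is defined as `latest_subscribe_round_[node_id] == 0`, which is the clarification the comment below asks for.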
🛠️ Refactor suggestion | 🟠 Major
Clarify edge cases and error handling in round resolution logic.
The round resolution logic contains several conditional branches that need clarification:
- Line 84: "`subscribe_round_id == 0` and there is no known current round context": what precisely defines "no known current round context"? Is this `latest_subscribe_round_[node_id] == 0`, or are there other conditions?
- Line 85: "`subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order": under what scenarios would a standby legitimately send a round ID greater than the primary's latest? This appears to be a protocol violation. Should this case log an error, trigger an alert, or include additional diagnostics to detect bugs or malicious behavior?
- The same-round idempotent handling (line 88) should specify whether the response includes any state refresh or is purely a no-op acknowledgment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 79 - 91,
Clarify and tighten the round-resolution rules in StandbyStartFollowing: define
"no known current round context" as latest_subscribe_round_[node_id] == 0 and
there is no active subscribe_id/state for that node (explicitly check both
latest_subscribe_round_ and the node's current subscribe context), treat
subscribe_round_id > latest_subscribe_round_[node_id] as a protocol violation —
add a logged error/metric with diagnostics (remote node id, received
subscribe_round_id, expected latest_subscribe_round_[node_id], and request
metadata) and reject the request (do not allocate a new round), and make
same-round idempotent handling (subscribe_round_id ==
latest_subscribe_round_[node_id] when resubscribe_intent==false) explicitly
return the current subscribe_id and the node's current state snapshot (not a
silent no-op) so callers receive a state refresh; update code paths referencing
GetNextSubscribeId(), latest_subscribe_round_[node_id], subscribe_round_id, and
subscribe_id accordingly.
#### 6.3.2 Primary-side Dedup for Repeated Out-of-Sync Triggers (Simplified)

- Default parameters (recommended, configurable):
  - `same_round_resend_interval_ms = 2000`: minimum resend interval for the same round.
  - `no_progress_timeout_ms = 20000`: no-progress timeout; allow newer-round escalation after timeout.
- If the same standby already has an in-flight out-of-sync notification and `inflight_round == latest_subscribe_round_[node_id]`:
  - if `now - last_progress_ts < no_progress_timeout_ms`:
    - resend the same-round notification only when `now - last_send_ts >= same_round_resend_interval_ms`;
    - otherwise ignore the repeated trigger;
  - if `now - last_progress_ts >= no_progress_timeout_ms`: allocate and send a newer round (escalation).
- By default, repeated triggers reuse the same inflight round; escalation is timeout-based, not retry-count-based.
- After the primary receives a valid `StandbyStartFollowing` for that assigned round (`resubscribe_intent=false`), update `last_progress_ts` (and clear/reset the inflight marker per implementation policy).
- Progress definition (recommended): a validated `StandbyStartFollowing(node_id, round=inflight_round, resubscribe_intent=false)`.
🛠️ Refactor suggestion | 🟠 Major
Define "valid" StandbyStartFollowing criteria and coordinate with round resolution.
Line 112 states "After primary receives valid StandbyStartFollowing for that assigned round (resubscribe_intent=false), update last_progress_ts" but doesn't define what makes a request "valid." Should this reference the validation logic from section 6.3 (lines 79-91)?
Additionally, the escalation path (line 110: "allocate and send a newer round") needs explicit coordination with the round resolution logic. When allocating a newer round due to timeout, the primary must:
- Update `latest_subscribe_round_[node_id]` atomically with the allocation
- Ensure the out-of-sync notification carries the new round
- Handle race conditions if the standby concurrently sends its own resubscribe request
### 9. Test Matrix (Must Be Covered)

1. Two consecutive resubscribe requests in the same term: the second must preempt successfully while the first is still in progress.
2. Concurrent out-of-sync notification + lag trigger: only the latest round becomes effective.
3. Late `StandbyStartFollowingResponse` (old round) is dropped.
4. Late `ResetStandbySequenceId` (old round) is rejected or handled idempotently as expected.
5. Snapshot request coverage and round monotonicity are consistent (no rollback).
6. Host manager `OnStartFollowing` (`current_round + resubscribe_intent=true`) does not conflict with a concurrent resubscribe.
7. Forwarded messages missing the round field are rejected, and warning metrics are correct.
8. Multiple standby-local triggers in the same term (lag/idle) must result in only one "request-new-round (`resubscribe_intent=true`)" request; all others are locally coalesced.
9. After the standby gets round `R`, all follow-up retries must carry `R` with `resubscribe_intent=false`.
10. Repeated primary out-of-sync triggers in a short interval must not allocate multiple new rounds for the same standby unless the escalation policy allows it.
11. Out-of-sync dedup thresholds are enforced: allow same-round resend after 2s; allow newer-round escalation after 20s of no progress.
🛠️ Refactor suggestion | 🟠 Major
Add test coverage for cross-shard consistency of round state.
The test matrix is comprehensive but missing explicit coverage for the cross-shard consistency issue raised in previous reviews. Add a test case:
Test 12: When `StandbyStartFollowing` or `ResetStandbySequenceId` operations execute concurrently across multiple shards for the same node, verify that:
- `latest_subscribe_round_[node_id]` remains consistent across all shards (no drift)
- Operations on different shards for the same node see the same round state
- Concurrent round allocations don't create conflicting round values
This test becomes critical if the state management approach in lines 36-38 is revised to centralize round ownership.
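A toy version of such a test, against a stand-in centralized allocator (an assumed shape, not the real `Sharder`), might spawn one thread per "shard" and assert that concurrently allocated rounds never collide:

```cpp
#include <cstdint>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Stand-in for the centralized round allocator under test.
class RoundAllocator {
public:
    uint64_t AllocateRound() {
        std::lock_guard<std::mutex> lk(mu_);
        return ++counter_;
    }

private:
    std::mutex mu_;
    uint64_t counter_ = 0;
};

// Runs shards * rounds_per_shard concurrent allocations for one node and
// returns true iff every allocated round is distinct (no conflicting or
// divergent rounds across "shards").
bool AllRoundsDistinct(int shards, int rounds_per_shard) {
    RoundAllocator alloc;
    std::mutex out_mu;
    std::vector<uint64_t> out;
    std::vector<std::thread> workers;
    for (int s = 0; s < shards; ++s) {
        workers.emplace_back([&] {
            for (int i = 0; i < rounds_per_shard; ++i) {
                uint64_t r = alloc.AllocateRound();
                std::lock_guard<std::mutex> lk(out_mu);
                out.push_back(r);
            }
        });
    }
    for (auto& w : workers) w.join();
    std::set<uint64_t> unique(out.begin(), out.end());
    return unique.size() == out.size();
}
```

The real Test 12 would additionally interleave `StandbyStartFollowing` / `ResetStandbySequenceId` validation and read `latest_subscribe_round_[node_id]` from each shard's path, but the uniqueness invariant sketched here is the core assertion.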
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 146 - 159,
Add a new Test 12 to the test matrix that exercises cross-shard concurrency for
round state: simulate concurrent StandbyStartFollowing and
ResetStandbySequenceId operations targeting the same node_id across multiple
shards and assert that the shared state latest_subscribe_round_[node_id] is
identical on all shards, that reads from different shards observe the same round
value, and that concurrent round allocations never produce conflicting or
divergent rounds; ensure the test also covers the case where round ownership is
centralized (the code paths that reference or update
latest_subscribe_round_[node_id] and any allocator logic invoked by
resubscribe_intent) and fails if any shard reports a different round or multiple
distinct rounds are allocated for the same node.
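The single-owner invariant Test 12 checks can be made concrete with a small sketch. A hypothetical `RoundOwner` type stands in for the `Sharder`-owned state (`AllocateRound`/`LatestRound` are illustrative names, not the real tx_service API); because every shard consults the same owner under one mutex, per-shard copies — and therefore drift — cannot exist:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical centralized round ownership: one table, one lock, consulted
// by every shard, so all shards observe the identical latest round.
class RoundOwner
{
public:
    uint32_t AllocateRound(uint32_t node_id)
    {
        std::lock_guard<std::mutex> lk(mux_);
        uint32_t round = ++next_subscribe_id_;  // GetNextSubscribeId() stand-in
        latest_subscribe_round_[node_id] = round;
        return round;
    }

    uint32_t LatestRound(uint32_t node_id)
    {
        std::lock_guard<std::mutex> lk(mux_);
        auto it = latest_subscribe_round_.find(node_id);
        return it == latest_subscribe_round_.end() ? 0 : it->second;
    }

private:
    std::mutex mux_;
    uint32_t next_subscribe_id_{0};
    std::unordered_map<uint32_t, uint32_t> latest_subscribe_round_;
};
```

Test 12 then reduces to asserting that reads from any shard return the value of the most recent allocation, which holds trivially when there is exactly one table.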
- State Changes:
  - Reuse: `subscribe_counter_`.
  - Add: `latest_subscribe_round_[node_id]` (recommended in `Sharder` for centralized ownership).
  - Add: `out_of_sync_ctrl_[node_id]` (stores `{inflight_round, inflight, last_send_ts, last_progress_ts}` or equivalent) for deduplicating repeated out-of-sync triggers.
  - Add: concurrency protection for this map (`mutex`/`shared_mutex`).
🛠️ Refactor suggestion | 🟠 Major
Clarify data structures and concurrency strategy.
Lines 92-93 introduce `out_of_sync_ctrl_[node_id]` and mention "concurrency protection" but lack specificity:

- Data structure: Make explicit that `out_of_sync_ctrl_` is likely `std::unordered_map<uint32_t, OutOfSyncControl>` where `OutOfSyncControl` is a struct containing `{inflight_round, inflight, last_send_ts, last_progress_ts}`.
- Concurrency: Specify whether to use `std::shared_mutex` or another mechanism. Given that round allocation (write-heavy) and multiple read operations will contend, the choice affects performance.
📋 Suggested clarification

```diff
-  - Add: `out_of_sync_ctrl_[node_id]` (stores `{inflight_round, inflight, last_send_ts, last_progress_ts}` or equivalent) for deduplicating repeated out-of-sync triggers.
-  - Add: concurrency protection for this map (`mutex`/`shared_mutex`).
+  - Add: `std::unordered_map<uint32_t, OutOfSyncControl> out_of_sync_ctrl_` where `OutOfSyncControl` holds `{uint32_t inflight_round, bool inflight, uint64_t last_send_ts, uint64_t last_progress_ts}` for deduplicating repeated out-of-sync triggers.
+  - Add: `std::shared_mutex out_of_sync_ctrl_mux_` for protecting concurrent access to the map (readers check inflight state, writers allocate rounds).
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 89 - 93,
Clarify and concretely define the out-of-sync control state: introduce a typed
alias/field out_of_sync_ctrl_ as std::unordered_map<uint32_t, OutOfSyncControl>
and define struct OutOfSyncControl { uint64_t inflight_round; bool inflight;
std::chrono::steady_clock::time_point last_send_ts;
std::chrono::steady_clock::time_point last_progress_ts; /* add a per-entry mutex
if needed */ }; update references to out_of_sync_ctrl_, latest_subscribe_round_,
and subscribe_counter_ to use these types; for concurrency, prefer a hybrid
strategy: protect the map itself with a single std::shared_mutex for concurrent
reads/writes to map entries, and add a per-OutOfSyncControl std::mutex (or
std::atomic flags for simple fields) to avoid contention during write-heavy
round allocation (i.e., Sharder updates) while allowing concurrent readers to
inspect entries. Ensure all accesses follow this locking scheme (lock shared for
map lookups, upgrade/unique for insert/erase, and lock per-entry mutex for
modifying the OutOfSyncControl fields).
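The structure and locking scheme discussed above can be sketched as follows. All names (`OutOfSyncTable`, `ShouldSend`, `Mark`) are illustrative, not the real tx_service API; the 2s/20s thresholds come from the design's stated dedup policy, and a single `shared_mutex` is used instead of the per-entry hybrid for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Hypothetical per-node dedup state, mirroring the suggested shape:
// {inflight_round, inflight, last_send_ts, last_progress_ts}.
struct OutOfSyncControl
{
    uint32_t inflight_round{0};
    bool inflight{false};
    uint64_t last_send_ts{0};      // monotonic clock, microseconds
    uint64_t last_progress_ts{0};
};

class OutOfSyncTable
{
public:
    // Returns true if an out-of-sync request may be sent for this node now.
    // Policy: same-round resend after 2s; newer-round escalation only after
    // 20s without progress.
    bool ShouldSend(uint32_t node_id, uint32_t round, uint64_t now_us)
    {
        std::unique_lock<std::shared_mutex> lk(mux_);
        OutOfSyncControl &ctrl = ctrl_[node_id];
        if (!ctrl.inflight)
        {
            Mark(ctrl, round, now_us);
            return true;
        }
        if (round == ctrl.inflight_round &&
            now_us - ctrl.last_send_ts >= 2'000'000)  // 2s resend window
        {
            ctrl.last_send_ts = now_us;
            return true;
        }
        if (round > ctrl.inflight_round &&
            now_us - ctrl.last_progress_ts >= 20'000'000)  // 20s escalation
        {
            Mark(ctrl, round, now_us);
            return true;
        }
        return false;  // deduplicated
    }

private:
    static void Mark(OutOfSyncControl &ctrl, uint32_t round, uint64_t now_us)
    {
        ctrl.inflight = true;
        ctrl.inflight_round = round;
        ctrl.last_send_ts = now_us;
        ctrl.last_progress_ts = now_us;
    }

    std::shared_mutex mux_;
    std::unordered_map<uint32_t, OutOfSyncControl> ctrl_;
};
```

The per-entry-mutex hybrid described in the prompt only pays off once read contention on the map is measurable; the single-lock version above is the simpler correct baseline.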
- Candidate state must include round/standby_term (cannot remain `start_seq_id` only).
- Recommended upgrade for `candidate_standby_nodes_`:
  `node_id -> {start_seq_id, standby_term}`.
🛠️ Refactor suggestion | 🟠 Major
Specify the current structure of candidate_standby_nodes_ for clarity.
Lines 120-122 recommend upgrading `candidate_standby_nodes_` to `node_id -> {start_seq_id, standby_term}` but don't document the current structure being replaced. This omission makes it difficult for implementers to understand the migration path.
📋 Suggested clarification

```diff
- Recommended upgrade for `candidate_standby_nodes_`:
-   `node_id -> {start_seq_id, standby_term}`.
+ Current structure: `candidate_standby_nodes_` is `std::unordered_map<uint32_t, uint64_t>` mapping `node_id -> start_seq_id`.
+ Recommended upgrade: Change to `std::unordered_map<uint32_t, CandidateStandbyInfo>` where `CandidateStandbyInfo` contains `{uint64_t start_seq_id, int64_t standby_term}` to support round-aware candidate tracking.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 120 - 122,
Document that the existing candidate_standby_nodes_ mapping is currently node_id
-> start_seq_id (i.e., each entry stores only start_seq_id), then show the
upgraded mapping node_id -> {start_seq_id, standby_term} and explicitly state
that standby_term (round) must be populated during migration; update any
mentions of candidate_standby_nodes_ and related migration steps to reflect the
old schema, the new schema, and the required transformation for each entry.
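A minimal sketch of the upgraded candidate map, using the hypothetical `CandidateStandbyInfo` / `HasCandidateStandby` names from the suggestion (not the actual `CcShard` API) and assuming the term-match rule is what makes a candidate "still valid":

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Old schema stored node_id -> start_seq_id only; the upgraded value also
// carries the standby term/round so stale candidates can be rejected.
struct CandidateStandbyInfo
{
    uint64_t start_seq_id{0};
    int64_t standby_term{-1};
};

using CandidateStandbyMap =
    std::unordered_map<uint32_t, CandidateStandbyInfo>;

// Candidate is valid only when the stored term matches the requester's term.
inline bool HasCandidateStandby(const CandidateStandbyMap &candidates,
                                uint32_t node_id,
                                int64_t standby_term)
{
    auto it = candidates.find(node_id);
    return it != candidates.end() && it->second.standby_term == standby_term;
}
```

The migration per entry is mechanical: copy the existing `start_seq_id` into the struct and populate `standby_term` from the round assigned at candidate registration.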
- Function Changes:
  - `CcNodeService::ResetStandbySequenceId(...)`
    - Parse `(term, round)` from `standby_node_term`.
    - Validate `term == current leader term`.
    - Validate `round == latest_subscribe_round_[node_id]`.
    - Validate candidate is still valid; otherwise reject as stale.
    - If already subscribed with same `(term, round)`, return idempotent success.
  - Add helper APIs in `CcShard`:
    - `HasCandidateStandby(node_id, standby_term)`
    - `HasSubscribedStandby(node_id, standby_term)`
    - `PromoteCandidateToSubscribed(...)` (optional)
- Definition of Done:
🛠️ Refactor suggestion | 🟠 Major
Explicitly document the AddHeartbeatTargetNode ordering fix.
The high-risk section (line 306) identifies that ResetStandbySequenceId currently adds the heartbeat target before validation, which can cause stale pollution. However, T5's function changes don't explicitly mention moving this call.
Context snippet 2 shows the current code adds the heartbeat target at the beginning (before any validation). The fix should move this call to after all triple checks pass.
📋 Suggested addition to Function Changes

```diff
  - `CcNodeService::ResetStandbySequenceId(...)`
    - Parse `(term, round)` from `standby_node_term`.
    - Validate `term == current leader term`.
    - Validate `round == latest_subscribe_round_[node_id]`.
    - Validate candidate is still valid; otherwise reject as stale.
    - If already subscribed with same `(term, round)`, return idempotent success.
+   - **Ordering fix**: Move `AddHeartbeatTargetNode(node_id, standby_node_term)` call to after all validation checks pass to prevent stale state pollution.
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
  - Function Changes:
    - `CcNodeService::ResetStandbySequenceId(...)`
      - Parse `(term, round)` from `standby_node_term`.
      - Validate `term == current leader term`.
      - Validate `round == latest_subscribe_round_[node_id]`.
      - Validate candidate is still valid; otherwise reject as stale.
      - If already subscribed with same `(term, round)`, return idempotent success.
+     - **Ordering fix**: Move `AddHeartbeatTargetNode(node_id, standby_node_term)` call to after all validation checks pass to prevent stale state pollution.
    - Add helper APIs in `CcShard`:
      - `HasCandidateStandby(node_id, standby_term)`
      - `HasSubscribedStandby(node_id, standby_term)`
      - `PromoteCandidateToSubscribed(...)` (optional)
  - Definition of Done:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 148 - 159,
ResetStandbySequenceId currently calls AddHeartbeatTargetNode before validating
the standby, which can pollute heartbeat state with stale candidates; update
CcNodeService::ResetStandbySequenceId to perform all validations first (parse
(term, round) from standby_node_term, verify term == current leader term, verify
round == latest_subscribe_round_[node_id], verify candidate still valid and not
stale, and return idempotent success if already subscribed with same (term,
round)) and only call AddHeartbeatTargetNode after those checks succeed; also
add/use the CcShard helper APIs HasCandidateStandby(node_id, standby_term) and
HasSubscribedStandby(node_id, standby_term) (and optionally
PromoteCandidateToSubscribed) to encapsulate candidate/subscribed checks and
make the final heartbeat-addition point unambiguous.
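The validation-first ordering can be illustrated with a toy model. `PrimaryState`, the result enum, and the 32/32-bit packing of `(term, round)` into `standby_node_term` are all assumptions made for the sketch, not the real encoding or API:

```cpp
#include <cassert>
#include <cstdint>

enum class ResetResult { kStaleTerm, kStaleRound, kNoCandidate, kOk };

// Toy stand-in for the primary's per-node view.
struct PrimaryState
{
    int64_t leader_term{0};
    uint32_t latest_round{0};
    bool has_candidate{false};
    bool heartbeat_added{false};
};

inline ResetResult ResetStandbySequenceId(PrimaryState &st,
                                          uint64_t standby_node_term)
{
    // Assumed packing: term in the high 32 bits, round in the low 32 bits.
    int64_t term = static_cast<int64_t>(standby_node_term >> 32);
    uint32_t round = static_cast<uint32_t>(standby_node_term & 0xFFFFFFFFu);

    // Triple check first; nothing is mutated on the reject paths.
    if (term != st.leader_term) return ResetResult::kStaleTerm;
    if (round != st.latest_round) return ResetResult::kStaleRound;
    if (!st.has_candidate) return ResetResult::kNoCandidate;

    // Only now is the node registered as a heartbeat target, so stale
    // requests can no longer pollute heartbeat state.
    st.heartbeat_added = true;
    return ResetResult::kOk;
}
```

The point of the ordering is visible in the reject paths: a stale request returns before any state mutation, which is exactly what moving `AddHeartbeatTargetNode(...)` below the checks achieves.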
Actionable comments posted: 3
♻️ Duplicate comments (1)
.cursor_knowledge/cc_resubscribe_epoch_design_en.md (1)
36-38: ⚠️ Potential issue | 🟠 Major

Keep round ownership in `Sharder`, not `CcShard`.

`CcNodeService::StandbyStartFollowing(...)` walks shards independently, so storing `latest_subscribe_round_[node_id]` and `out_of_sync_ctrl_[node_id]` in `CcShard` would create per-shard round drift and multiple sources of truth for the same standby. These tables need a single owner in `Sharder`, with `CcShard` keeping only per-shard candidate/subscribed bookkeeping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 36 - 38, The round ownership fields latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] are placed in CcShard which leads to per-shard drift because CcNodeService::StandbyStartFollowing walks shards independently; move these tables into the single owner Sharder and keep CcShard limited to per-shard candidate/subscribed bookkeeping (e.g., retain subscribe_counter_ in Sharder semantics but relocate latest_subscribe_round_ and out_of_sync_ctrl_ state and any access/update logic from CcShard to Sharder), update any reads/writes in CcNodeService::StandbyStartFollowing and related methods to call into Sharder APIs for latest_subscribe_round_ and out_of_sync_ctrl_ so there is one source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 74-85: Clarify and implement an explicit initial-join rule in
StandbyStartFollowing: when a request arrives with the default shape
(subscribe_round_id == 0 and resubscribe_intent == false) and there is no known
current round context for node_id (latest_subscribe_round_[node_id] is
empty/zero), treat it as an initial join and allocate a new round via
GetNextSubscribeId(), update latest_subscribe_round_[node_id], and return the
new subscribe_id; alternatively, if you prefer to require callers to signal
intent, add validation that rejects subscribe_round_id == 0 unless
resubscribe_intent == true and document that requirement. Ensure the behavior
references the existing branches comparing subscribe_round_id to
latest_subscribe_round_[node_id] and that response populates the current
subscribe_id field.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md:
- Around line 67-73: The host-manager path must explicitly handle initial joins
so a StandbyStartFollowingRequest with subscribe_round_id==0 and
resubscribe_intent==false does not get ignored; update the round-resolution
logic invoked by CcNodeService::OnStartFollowing to treat (subscribe_round_id ==
0 && resubscribe_intent == false) as a valid initial-join that allocates the
first round (same as the T4 path that allocates on resubscribe_intent==true), or
alternatively change the host-manager caller to send resubscribe_intent=true for
the very first join; modify the code that resolves
subscribe_round_id/resubscribe_intent to implement one of these two behaviors so
initial subscribes are accepted and a round is created.
- Around line 182-183: Do not call ClearOutOfSyncInflight from the
StandbyStartFollowing success path; StandbyStartFollowing only proves acceptance
of the round and subsequent reset/snapshot/subscribed-state work may still be in
flight, so clearing out_of_sync_ctrl_ here permits a second out-of-sync trigger
and breaks deduping. Instead, remove the ClearOutOfSyncInflight(node_id, round)
invocation from the StandbyStartFollowing success block and move it to the real
completion checkpoint (e.g., the handler that marks reset/promotion to
subscribed or any terminal "subscribed-complete" callback), ensuring
ClearOutOfSyncInflight only runs when the node reaches a terminal
subscribed/progress state.
---
Duplicate comments:
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md:
- Around line 36-38: The round ownership fields latest_subscribe_round_[node_id]
and out_of_sync_ctrl_[node_id] are placed in CcShard which leads to per-shard
drift because CcNodeService::StandbyStartFollowing walks shards independently;
move these tables into the single owner Sharder and keep CcShard limited to
per-shard candidate/subscribed bookkeeping (e.g., retain subscribe_counter_ in
Sharder semantics but relocate latest_subscribe_round_ and out_of_sync_ctrl_
state and any access/update logic from CcShard to Sharder), update any
reads/writes in CcNodeService::StandbyStartFollowing and related methods to call
into Sharder APIs for latest_subscribe_round_ and out_of_sync_ctrl_ so there is
one source of truth.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a111cdad-49ed-4437-9bb0-4f1758271904
📒 Files selected for processing (2)
- .cursor_knowledge/cc_resubscribe_epoch_design_en.md
- .cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md
#### 6.3 Round Resolution in `StandbyStartFollowing`

- If `resubscribe_intent=true` (request a newer round):
  - If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
  - If `subscribe_round_id == latest_subscribe_round_[node_id]`, allocate a new round (reuse `GetNextSubscribeId()`) and update latest.
  - If `subscribe_round_id == 0` and there is no known current round context, allocate a new round and update latest (initial/join fallback).
  - If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- If `resubscribe_intent=false` (join/retry existing round):
  - If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale.
  - If `subscribe_round_id == latest_subscribe_round_[node_id]`, handle as idempotent success.
  - If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order.
- Response returns the final round (the current `subscribe_id` field).
Add an explicit initial-join rule for the default request shape.
Section 6.1 says the host-manager path does not carry round/intention, but this section never defines how the primary resolves a `StandbyStartFollowingRequest` that arrives with the new field defaults. Since first-join traffic will naturally look like `(subscribe_round_id=0, resubscribe_intent=false)` unless the caller overrides it, the current rules leave initial subscribe undefined. Please either require first join to use `0 + resubscribe_intent=true`, or add a dedicated `0 + false + no current round` branch that allocates the first round.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_epoch_design_en.md around lines 74 - 85,
Clarify and implement an explicit initial-join rule in StandbyStartFollowing:
when a request arrives with the default shape (subscribe_round_id == 0 and
resubscribe_intent == false) and there is no known current round context for
node_id (latest_subscribe_round_[node_id] is empty/zero), treat it as an initial
join and allocate a new round via GetNextSubscribeId(), update
latest_subscribe_round_[node_id], and return the new subscribe_id;
alternatively, if you prefer to require callers to signal intent, add validation
that rejects subscribe_round_id == 0 unless resubscribe_intent == true and
document that requirement. Ensure the behavior references the existing branches
comparing subscribe_round_id to latest_subscribe_round_[node_id] and that
response populates the current subscribe_id field.
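The branch table in section 6.3 reduces to a small decision function. This is a toy model under stated assumptions: a single per-node `latest_round`, a plain increment standing in for `GetNextSubscribeId()`, and illustrative names throughout:

```cpp
#include <cassert>
#include <cstdint>

enum class RoundDecision { kReject, kIdempotent, kNewRound };

struct RoundResolver
{
    uint32_t latest_round{0};

    RoundDecision Resolve(uint32_t req_round, bool resubscribe_intent)
    {
        // Both intents reject stale (<) and out-of-order (>) rounds;
        // only req_round == latest_round proceeds.
        if (req_round != latest_round)
            return RoundDecision::kReject;
        if (resubscribe_intent)
        {
            ++latest_round;  // stand-in for GetNextSubscribeId()
            return RoundDecision::kNewRound;
        }
        return RoundDecision::kIdempotent;  // join/retry existing round
    }
};
```

Note that with `latest_round` starting at 0, the initial join resolves cleanly only when the caller sends `resubscribe_intent=true`; with `(0, false)` this model returns idempotent success for a round that was never allocated, which is precisely the underspecified case the comment above flags.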
- Keep `CcNodeService::OnStartFollowing(...)` protocol unchanged; it remains a trigger entry.
- Update trigger call sites: lag, stream idle timeout, out-of-sync path.
- Semantic constraints:
  - lag / idle timeout: pass `current_round + resubscribe_intent=true`.
  - out-of-sync (round pre-assigned by primary): pass `subscribe_round_id=<msg_round> + resubscribe_intent=false`.
  - host manager path: no round/intention payload (default values).
- Definition of Done:
Define how the host-manager path creates the first round.
This task leaves the host-manager entry on default values, but T4 only allocates a new round for resubscribe_intent=true. That makes an initial StandbyStartFollowingRequest with (subscribe_round_id=0, resubscribe_intent=false) undefined, so implementers can accidentally reject or no-op the first subscribe. Either require the first join to send 0 + resubscribe_intent=true, or add an explicit initial-join rule for 0 + false in the round-resolution task.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 67 - 73,
The host-manager path must explicitly handle initial joins so a
StandbyStartFollowingRequest with subscribe_round_id==0 and
resubscribe_intent==false does not get ignored; update the round-resolution
logic invoked by CcNodeService::OnStartFollowing to treat (subscribe_round_id ==
0 && resubscribe_intent == false) as a valid initial-join that allocates the
first round (same as the T4 path that allocates on resubscribe_intent==true), or
alternatively change the host-manager caller to send resubscribe_intent=true for
the very first join; modify the code that resolves
subscribe_round_id/resubscribe_intent to implement one of these two behaviors so
initial subscribes are accepted and a round is created.
- Coordination with `T4`:
  - After `StandbyStartFollowing` validation succeeds, call `ClearOutOfSyncInflight(node_id, round)`.
Don't clear the out-of-sync inflight marker this early.
StandbyStartFollowing only proves the standby accepted the round; reset/snapshot/subscribed-state work can still be in flight afterwards. Clearing out_of_sync_ctrl_ here reopens the window for a second out-of-sync trigger to allocate a newer round during the same recovery, which defeats the dedup policy. Keep the marker until a real completion/progress checkpoint, such as successful reset/promotion to subscribed or an equivalent terminal state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md around lines 182 - 183,
Do not call ClearOutOfSyncInflight from the StandbyStartFollowing success path;
StandbyStartFollowing only proves acceptance of the round and subsequent
reset/snapshot/subscribed-state work may still be in flight, so clearing
out_of_sync_ctrl_ here permits a second out-of-sync trigger and breaks deduping.
Instead, remove the ClearOutOfSyncInflight(node_id, round) invocation from the
StandbyStartFollowing success block and move it to the real completion
checkpoint (e.g., the handler that marks reset/promotion to subscribed or any
terminal "subscribed-complete" callback), ensuring ClearOutOfSyncInflight only
runs when the node reaches a terminal subscribed/progress state.
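The "clear only at a terminal checkpoint" rule can be sketched as a tiny phase check. `StandbyPhase` and `InflightMarker` are illustrative names for this sketch, not the real tx_service types:

```cpp
#include <cassert>
#include <cstdint>

// Phases a standby passes through during recovery, in order.
enum class StandbyPhase { kFollowRequested, kResetDone, kSubscribed };

struct InflightMarker
{
    bool inflight{false};
    uint32_t round{0};

    void OnPhase(StandbyPhase phase, uint32_t observed_round)
    {
        // Round acceptance and intermediate progress keep the dedup window
        // open; only the terminal subscribed state (for the same round)
        // clears the marker.
        if (phase == StandbyPhase::kSubscribed && observed_round == round)
            inflight = false;
    }
};
```

Clearing at `kFollowRequested` — the behavior this comment objects to — would reopen the window while reset and snapshot work is still in flight, allowing a second out-of-sync trigger to allocate a newer round mid-recovery.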
♻️ Duplicate comments (8)
.cursor_knowledge/standby_resubscribe_design.md (3)
107-108: ⚠️ Potential issue | 🟠 Major

Do not clear out-of-sync inflight state on `StandbyStartFollowing` success.

A valid `StandbyStartFollowing` only proves the standby accepted the round; reset/snapshot/promotion can still be running afterwards. If the inflight marker is cleared here, a second out-of-sync trigger can allocate a newer round during the same recovery window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_design.md around lines 107 - 108, The note incorrectly clears the out-of-sync inflight marker on processing a valid StandbyStartFollowing; change the behavior so that when handling StandbyStartFollowing(node_id, round=inflight_round, resubscribe_intent=false) you only update last_progress_ts (or record the validated acceptance) but do NOT clear/reset the inflight marker or treat this event as full progress completion in the logic that manages inflight rounds; ensure any code paths using the “progress” definition (StandbyStartFollowing with resubscribe_intent=false) continue to treat reset/snapshot/promotion as separate events and only clear inflight state when those specific completion signals occur.
65-65: ⚠️ Potential issue | 🟠 Major

Define the initial-join rule for the default request shape.

Section 6.1 says the host-manager path sends default values, but Section 6.3 still only defines allocation for `subscribe_round_id == 0` under `resubscribe_intent=true`. A first `StandbyStartFollowingRequest` arriving as `(0, false)` is still underspecified, so please either allocate the first round in that branch or require callers to set `resubscribe_intent=true` for initial join.

Also applies to: 76-85
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_design.md at line 65, The behavior for an initial join with default request shape is underspecified: update the host manager's OnStartFollowing handling or the StandbyStartFollowingRequest contract so a request with subscribe_round_id == 0 and resubscribe_intent == false is deterministic; either (A) treat OnStartFollowing as a trigger-only path and explicitly allocate/assign the first round when receiving StandbyStartFollowingRequest((0, false)) inside the host-manager logic (e.g., in the code that processes StandbyStartFollowingRequest) or (B) change the API/docs to require callers set resubscribe_intent=true for initial joins; pick one approach and implement it consistently so the allocation branch currently guarded by resubscribe_intent=true also covers the initial (0, false) case or the callers are validated to send resubscribe_intent=true.
36-38: ⚠️ Potential issue | 🟠 Major

Keep per-node round ownership in `Sharder`, not `CcShard`.

Documenting `latest_subscribe_round_[node_id]` and `out_of_sync_ctrl_[node_id]` as shard-local reintroduces multiple sources of truth for the same standby. This PR’s own task breakdown centralizes that state in `Sharder`; `CcShard` should keep only per-shard candidate/subscribed bookkeeping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_design.md around lines 36 - 38, The docs and design incorrectly list latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] as CcShard-local; instead make Sharder the single owner of per-node round state: remove those per-node fields from CcShard and document them under Sharder (alongside subscribe_counter_), update any code references to access latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] via Sharder APIs/fields rather than CcShard, and keep CcShard limited to per-shard candidate/subscribed bookkeeping only.

.cursor_knowledge/standby_resubscribe_impl_tasks.md (5)
67-73: ⚠️ Potential issue | 🟠 Major

T4 still leaves the first join ambiguous.

T2 keeps the host-manager path on default values, but T4 only defines behavior once a latest round already exists. Please add an explicit `(subscribe_round_id=0, resubscribe_intent=false, no current round)` branch there, or change the host-manager path to request the first round explicitly.

Also applies to: 121-125
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 67 - 73, The review points out T4 doesn't handle a first-join case; update the logic around CcNodeService::OnStartFollowing and the host-manager trigger so the first join is explicit: either add a branch where (subscribe_round_id=0, resubscribe_intent=false, no current round) is passed by the host-manager path before any latest round exists, or modify the host-manager path to request the first round explicitly (i.e., set subscribe_round_id to the initial round and resubscribe_intent appropriately); ensure all trigger call sites (lag, stream idle timeout, out-of-sync, host-manager) follow the documented semantic constraints and that CcNodeService::OnStartFollowing correctly distinguishes the first-join branch from resubscribe flows.
167-170: ⚠️ Potential issue | 🟡 Minor

Include the `Sharder` files in T6’s implementation surface.

This task updates `out_of_sync_ctrl_` and calls `MarkOutOfSyncInflight(...)` / `ClearOutOfSyncInflight(...)`, but the listed files stop short of the `Sharder` state and APIs that back that logic. Add `data_substrate/tx_service/include/sharder.h` and `data_substrate/tx_service/src/sharder.cpp` here so T6 is implementable on its own.

Also applies to: 173-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 167 - 170, The file list omitted the Sharder implementation needed by T6; update the task’s file list to include the Sharder headers and source so the out_of_sync_ctrl_ usages and calls to MarkOutOfSyncInflight(...) and ClearOutOfSyncInflight(...) are resolvable — specifically add data_substrate/tx_service/include/sharder.h and data_substrate/tx_service/src/sharder.cpp to the listed files so the Sharder state/APIs used by out_of_sync_ctrl_, MarkOutOfSyncInflight, and ClearOutOfSyncInflight are present.
5-5: ⚠️ Potential issue | 🟡 Minor

Point this checklist at the current design doc.

The filename here does not match the design document added in this PR (`.cursor_knowledge/standby_resubscribe_design.md`), so this task list can send implementers to the wrong spec.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_impl_tasks.md at line 5, The checklist file .cursor_knowledge/standby_resubscribe_impl_tasks.md currently references cc_resubscribe_epoch_design.md; update the reference text and any links in this file to point to the actual design doc added in the PR (.cursor_knowledge/standby_resubscribe_design.md) so implementers are directed to the correct spec, and remove or replace any mentions of cc_resubscribe_epoch_design.md to avoid confusion.
182-183: ⚠️ Potential issue | 🟠 Major

Don’t clear out-of-sync inflight at `StandbyStartFollowing` success.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 182 - 183, The current logic calls ClearOutOfSyncInflight(node_id, round) immediately when StandbyStartFollowing validation succeeds, which is incorrect because recovery (reset/snapshot/subscribed promotion) may still be in progress; remove that ClearOutOfSyncInflight call from the StandbyStartFollowing success path and instead invoke ClearOutOfSyncInflight only after the recovery/promotion has fully completed (e.g., in the completion callback/handler that signals reset/snapshot/subscribed-promotion finished), or at the explicit point where the deduplication window is guaranteed safe; update any references in the StandbyStartFollowing handler to rely on the recovery-complete event rather than clearing the marker there.
146-156: ⚠️ Potential issue | 🟠 Major

Call out the heartbeat-ordering fix in T5.

The current `CcNodeService::ResetStandbySequenceId(...)` adds the heartbeat target before any validation in `tx_service/src/remote/cc_node_service.cpp:1847-1888`. T5 should explicitly move `AddHeartbeatTargetNode(...)` to after the term/round/candidate checks, otherwise the stale-state pollution bug is easy to preserve during implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor_knowledge/standby_resubscribe_impl_tasks.md around lines 146 - 156, The heartbeat-target is being added too early in CcNodeService::ResetStandbySequenceId; move the call to AddHeartbeatTargetNode(...) so it runs only after you parse and validate (term, round) from standby_node_term and confirm term == current leader term, round == latest_subscribe_round_[node_id], and the candidate is still valid (reject if stale); also preserve idempotent success when already subscribed with the same (term, round). Update ResetStandbySequenceId in tx_service/src/remote/cc_node_service.cpp to perform all term/round/candidate checks first and only call AddHeartbeatTargetNode after those checks (and after any Promoting/Subscribe state transitions via CcShard helper APIs like HasCandidateStandby, HasSubscribedStandby, PromoteCandidateToSubscribed if used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.cursor_knowledge/standby_resubscribe_design.md:
- Around line 107-108: The note incorrectly clears the out-of-sync inflight
marker on processing a valid StandbyStartFollowing; change the behavior so that
when handling StandbyStartFollowing(node_id, round=inflight_round,
resubscribe_intent=false) you only update last_progress_ts (or record the
validated acceptance) but do NOT clear/reset the inflight marker or treat this
event as full progress completion in the logic that manages inflight rounds;
ensure any code paths using the “progress” definition (StandbyStartFollowing
with resubscribe_intent=false) continue to treat reset/snapshot/promotion as
separate events and only clear inflight state when those specific completion
signals occur.
- Line 65: The behavior for an initial join with default request shape is
underspecified: update the host manager's OnStartFollowing handling or the
StandbyStartFollowingRequest contract so a request with subscribe_round_id == 0
and resubscribe_intent == false is deterministic; either (A) treat
OnStartFollowing as a trigger-only path and explicitly allocate/assign the first
round when receiving StandbyStartFollowingRequest((0, false)) inside the
host-manager logic (e.g., in the code that processes
StandbyStartFollowingRequest) or (B) change the API/docs to require callers set
resubscribe_intent=true for initial joins; pick one approach and implement it
consistently so the allocation branch currently guarded by
resubscribe_intent=true also covers the initial (0, false) case or the callers
are validated to send resubscribe_intent=true.
- Around line 36-38: The docs and design incorrectly list
latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] as
CcShard-local; instead make Sharder the single owner of per-node round state:
remove those per-node fields from CcShard and document them under Sharder
(alongside subscribe_counter_), update any code references to access
latest_subscribe_round_[node_id] and out_of_sync_ctrl_[node_id] via Sharder
APIs/fields rather than CcShard, and keep CcShard limited to per-shard
candidate/subscribed bookkeeping only.
In @.cursor_knowledge/standby_resubscribe_impl_tasks.md:
- Around line 67-73: The review points out T4 doesn't handle a first-join case;
update the logic around CcNodeService::OnStartFollowing and the host-manager
trigger so the first join is explicit: either add a branch where
(subscribe_round_id=0, resubscribe_intent=false, no current round) is passed by
the host-manager path before any latest round exists, or modify the host-manager
path to request the first round explicitly (i.e., set subscribe_round_id to the
initial round and resubscribe_intent appropriately); ensure all trigger call
sites (lag, stream idle timeout, out-of-sync, host-manager) follow the
documented semantic constraints and that CcNodeService::OnStartFollowing
correctly distinguishes the first-join branch from resubscribe flows.
- Around line 167-170: The file list omitted the Sharder implementation needed
by T6; update the task’s file list to include the Sharder headers and source so
the out_of_sync_ctrl_ usages and calls to MarkOutOfSyncInflight(...) and
ClearOutOfSyncInflight(...) are resolvable — specifically add
data_substrate/tx_service/include/sharder.h and
data_substrate/tx_service/src/sharder.cpp to the listed files so the Sharder
state/APIs used by out_of_sync_ctrl_, MarkOutOfSyncInflight, and
ClearOutOfSyncInflight are present.
- Line 5: The checklist file .cursor_knowledge/standby_resubscribe_impl_tasks.md
currently references cc_resubscribe_epoch_design.md; update the reference text
and any links in this file to point to the actual design doc added in the PR
(.cursor_knowledge/standby_resubscribe_design.md) so implementers are directed
to the correct spec, and remove or replace any mentions of
cc_resubscribe_epoch_design.md to avoid confusion.
- Around line 182-183: The current logic calls ClearOutOfSyncInflight(node_id,
round) immediately when StandbyStartFollowing validation succeeds, which is
incorrect because recovery (reset/snapshot/subscribed promotion) may still be in
progress; remove that ClearOutOfSyncInflight call from the StandbyStartFollowing
success path and instead invoke ClearOutOfSyncInflight only after the
recovery/promotion has fully completed (e.g., in the completion callback/handler
that signals reset/snapshot/subscribed-promotion finished), or at the explicit
point where the deduplication window is guaranteed safe; update any references
in the StandbyStartFollowing handler to rely on the recovery-complete event
rather than clearing the marker there.
- Around line 146-156: The heartbeat-target is being added too early in
CcNodeService::ResetStandbySequenceId; move the call to
AddHeartbeatTargetNode(...) so it runs only after you parse and validate (term,
round) from standby_node_term and confirm term == current leader term, round ==
latest_subscribe_round_[node_id], and the candidate is still valid (reject if
stale); also preserve idempotent success when already subscribed with the same
(term, round). Update ResetStandbySequenceId in
tx_service/src/remote/cc_node_service.cpp to perform all term/round/candidate
checks first and only call AddHeartbeatTargetNode after those checks (and after
any Promoting/Subscribe state transitions via CcShard helper APIs like
HasCandidateStandby, HasSubscribedStandby, PromoteCandidateToSubscribed if
used).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e3ecb577-9f95-41dd-b7e5-71039e49d666
📒 Files selected for processing (2)
.cursor_knowledge/standby_resubscribe_design.md
.cursor_knowledge/standby_resubscribe_impl_tasks.md
Fixes https://github.com/eloqdata/project_tracker/issues/224.
Summary by CodeRabbit