From c42c6df7c49c405b2262b8f6514929ae47b44c16 Mon Sep 17 00:00:00 2001 From: Kevin Chou Date: Fri, 6 Mar 2026 17:00:46 +0800 Subject: [PATCH 1/4] add standby resubscribe design and impl docs --- tx_service/cc_resubscribe_epoch_design_en.md | 164 +++++++++ tx_service/cc_resubscribe_v2_impl_tasks_en.md | 319 ++++++++++++++++++ 2 files changed, 483 insertions(+) create mode 100644 tx_service/cc_resubscribe_epoch_design_en.md create mode 100644 tx_service/cc_resubscribe_v2_impl_tasks_en.md diff --git a/tx_service/cc_resubscribe_epoch_design_en.md b/tx_service/cc_resubscribe_epoch_design_en.md new file mode 100644 index 00000000..ca04746c --- /dev/null +++ b/tx_service/cc_resubscribe_epoch_design_en.md @@ -0,0 +1,164 @@ +## Resubscribe While Previous Subscription Is Still In Progress (Revised v2) + +### 1. Background and Existing Problems + +- The current `CcNode::OnStartFollowing` only deduplicates by `requested_subscribe_primary_term_`; if a resubscribe is triggered again in the same term while the previous subscription is still running, the new request is dropped directly. +- Resubscribe is not triggered only by out-of-sync. It can also be triggered by lag threshold and stream idle timeout; all of these can happen repeatedly within the same term. +- The current system already has `subscribe_id` (encoded in the lower 32 bits of `standby_node_term`). Introducing an additional independent epoch can easily create dual sources of truth and state divergence. + +### 2. Revision Goals + +1. Fix the deadlock where a new resubscribe in the same term is dropped while an old subscription has not finished. +2. Define a single round identifier and avoid ambiguity from having both `subscribe_id` and `subscribe_epoch`. +3. Cover all trigger paths (out-of-sync / lag / stream timeout / host manager OnStartFollowing). +4. Establish strict stale-round filtering and idempotent handling for key RPCs. +5. Keep the single-version implementation simple, verifiable, and observable. + +### 3. Core Design Decisions + +#### 3.1 Single Source of Truth + +- Do not introduce a new independent round system. +- Define the resubscribe round uniformly as `subscribe_round_id`, directly reusing existing `subscribe_id` semantics. +- Keep `standby_node_term` unchanged: upper 32 bits are `primary_term`, lower 32 bits are `subscribe_round_id`. + +#### 3.2 Separation of Two Context Types + +- **Protocol-level round (cross-node):** `(primary_term, subscribe_round_id)`, used for primary/standby consistency. +- **Local request sequence (standby-local only):** `follow_req_id`, used to preempt/cancel a local in-flight subscription workflow. + +Note: `follow_req_id` is not a protocol field and does not participate in cross-node consistency checks; it only solves local concurrent preemption. + +### 4. State to Maintain + +| Role | State | Description | +| --- | --- | --- | +| Primary (`Sharder`) | `subscribe_counter_` (existing) | Round allocator that assigns `subscribe_round_id`. | +| Primary (`CcShard`) | `latest_subscribe_round_[node_id]` | Latest accepted round for this standby; used to reject stale requests. | +| Primary (`CcShard`) | `out_of_sync_ctrl_[node_id]` | Out-of-sync dedup control state (for example `{inflight_round, inflight, last_send_ts, last_progress_ts}`) to avoid repeated triggers continuously allocating new rounds. | +| Standby (`CcNode`) | `requested_follow_req_id_` | Latest local request sequence id. | +| Standby (`CcNode`) | `active_follow_req_id_` | Currently executing local request sequence id. | +| Standby (`CcNode`) | `active_subscribe_ctx_` | Current effective `(term, round)` for filtering stale messages/RPCs. | + +### 5. Protocol and Message Changes + +1. `OnStartFollowingRequest` + - Add optional `uint32 subscribe_round_id` (default `0`). + - Add optional `bool resubscribe_intent` (default `false`). + - Semantics: `subscribe_round_id` is the caller-known current round; `resubscribe_intent=true` means “request primary to allocate a newer round.” + +2. `KeyObjectStandbyForwardRequest` + - Add optional `uint32 subscribe_round_id`. + - Both out-of-sync notifications and regular forwarding messages must carry round. + +3. `StandbyStartFollowingRequest / Response` + - Request adds optional `subscribe_round_id` hint. + - Request adds optional `resubscribe_intent` (default `false`). + - Response keeps `subscribe_id`, with semantics explicitly defined as `subscribe_round_id`. + +4. `ResetStandbySequenceIdRequest / Response` + - No new mandatory field. Continue using `standby_node_term` to carry `(term, round)`. + - Server must explicitly parse and validate round. + +### 6. Key Flows + +#### 6.1 Unified Trigger Entrypoints + +- out-of-sync: primary first checks `out_of_sync_ctrl_[node_id]` for dedup, then decides whether to allocate/reuse `subscribe_round_id` before notifying standby. +- lag threshold (standby local trigger): `OnStartFollowing(..., subscribe_round_id=current_round, resubscribe_intent=true, resubscribe=true)`. +- stream idle timeout (standby local trigger): same as above. +- host manager `OnStartFollowing`: can also trigger a new round with `current_round + resubscribe_intent=true`. + +#### 6.2 Standby Local Preemption + +- Every `OnStartFollowing` generates a larger `follow_req_id` and overrides the target request. +- In-flight `SubscribePrimaryNode` checks `follow_req_id` before and after each blocking point: + - If it is no longer the latest request, exit immediately without committing state. +- This allows a new request in the same term to preempt old flow instead of being dropped by term-level deduplication. + +#### 6.3 Round Resolution in `StandbyStartFollowing` + +- If `resubscribe_intent=true` (request a newer round): + - If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale. + - If `subscribe_round_id == latest_subscribe_round_[node_id]`, allocate a new round (reuse `GetNextSubscribeId()`) and update latest. + - If `subscribe_round_id == 0` and there is no known current round context, allocate a new round and update latest (initial/join fallback). + - If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order. +- If `resubscribe_intent=false` (join/retry existing round): + - If `subscribe_round_id < latest_subscribe_round_[node_id]`, reject as stale. + - If `subscribe_round_id == latest_subscribe_round_[node_id]`, handle as idempotent success. + - If `subscribe_round_id > latest_subscribe_round_[node_id]`, reject as invalid/out-of-order. +- Response returns the final round (the current `subscribe_id` field). + +#### 6.3.1 Dedup and Round-Carry Rules for Standby Self-Initiated Resubscribe + +- When standby is self-triggered by lag / stream idle timeout, it should send “current round + `resubscribe_intent=true`” to request a newer round. +- If standby has no valid current round context, it may use `subscribe_round_id=0 + resubscribe_intent=true`. +- Standby must deduplicate locally: if there is already an in-flight “request-new-round” workflow or a newly resolved round still in progress in the same term, new local triggers must be coalesced/ignored. +- Once `StandbyStartFollowing` returns round=`R_new`, standby must switch request context to `(term, R_new)`; all retries in that round must carry `R_new` with `resubscribe_intent=false`. +- Primary applies a unified rule for round-bearing requests: reject if `< latest`, process by intent if `== latest` (allocate new or idempotent), reject if `> latest`. + +#### 6.3.2 Primary-side Dedup for Repeated Out-of-Sync Triggers (Simplified) + +- Default parameters (recommended, configurable): + - `same_round_resend_interval_ms = 2000`: minimum resend interval for the same round. + - `no_progress_timeout_ms = 20000`: no-progress timeout; allow newer-round escalation after timeout. + +- If the same standby already has an in-flight out-of-sync notification and `inflight_round == latest_subscribe_round_[node_id]`: + - if `now - last_progress_ts < no_progress_timeout_ms`: + - resend same-round notification only when `now - last_send_ts >= same_round_resend_interval_ms`; + - otherwise ignore repeated trigger; + - if `now - last_progress_ts >= no_progress_timeout_ms`: allocate and send a newer round (escalation). +- By default, repeated triggers reuse the same inflight round; escalation is timeout-based, not retry-count-based. +- After primary receives valid `StandbyStartFollowing` for that assigned round (`resubscribe_intent=false`), update `last_progress_ts` (and clear/reset inflight marker per implementation policy). +- Progress definition (recommended): a validated `StandbyStartFollowing(node_id, round=inflight_round, resubscribe_intent=false)`. + +#### 6.4 Strict Validation in `ResetStandbySequenceId` + +- Parse `(term, round)` from `standby_node_term`. +- Execute candidate -> subscribed only if all conditions hold: + - `term == current leader term` + - `round == latest_subscribe_round_[node_id]` + - corresponding candidate is still valid +- If already the same `(term, round)`, return idempotent success; if older, reject. + +#### 6.5 Stale Message Filtering + +- When standby handles forwarded messages, if message carries round and it does not equal `active_subscribe_ctx_.round`, drop directly. +- If a forwarded message does not carry round, reject it directly and record a warning metric. + +### 7. Concurrency and Consistency Constraints + +1. Any step that commits effective state (setting `CandidateStandbyNodeTerm`, `StandbyNodeTerm`, Reset completion) must re-check that `follow_req_id` is still the latest. +2. Any RPC response (StartFollowing/Reset/Snapshot) must validate `(term, round)` against current request context before applying. +3. `latest_subscribe_round_[node_id]` must be monotonic increasing only. +4. `resubscribe_intent=true` is only for requesting a newer round; retries in the same round must use resolved round with `resubscribe_intent=false`. +5. Repeated out-of-sync triggers should reuse the same in-flight round by default and must not allocate unbounded newer rounds. + +### 8. Parameters and Runtime Policy + +1. Default parameters: + - `same_round_resend_interval_ms = 2000` + - `no_progress_timeout_ms = 20000` +2. Parameter requirements: + - `same_round_resend_interval_ms < no_progress_timeout_ms` + - both parameters should be configurable and observable by metrics. + +### 9. Test Matrix (Must Be Covered) + +1. Two consecutive resubscribe requests in the same term: the second must preempt successfully while the first is still in progress. +2. Concurrent out-of-sync notification + lag trigger: only the latest round becomes effective. +3. Late `StandbyStartFollowingResponse` (old round) is dropped. +4. Late `ResetStandbySequenceId` (old round) is rejected or handled idempotently as expected. +5. Snapshot request coverage and round monotonicity are consistent (no rollback). +6. Host manager `OnStartFollowing` (`current_round + resubscribe_intent=true`) does not conflict with concurrent resubscribe. +7. Forwarding messages missing round field are rejected, and warning metrics are correct. +8. Multiple standby-local triggers in the same term (lag/idle) must result in only one “request-new-round (`resubscribe_intent=true`)” request; all others are locally coalesced. +9. After standby gets round=`R`, all follow-up retries must carry `R` with `resubscribe_intent=false`. +10. Repeated primary out-of-sync triggers in a short interval must not allocate multiple new rounds for the same standby unless escalation policy allows it. +11. Out-of-sync dedup thresholds are enforced: allow same-round resend after 2s; allow newer-round escalation after 20s no progress. + +### 10. Expected Benefits + +- Remove the deadlock point where a same-term resubscribe is dropped while subscription is in progress. +- Unify round semantics and avoid dual-source-of-truth conflict between `subscribe_id` and a newly introduced epoch. +- Define explicit stale/new round rules across all trigger paths and key RPCs, reducing race-condition pollution risk. diff --git a/tx_service/cc_resubscribe_v2_impl_tasks_en.md b/tx_service/cc_resubscribe_v2_impl_tasks_en.md new file mode 100644 index 00000000..06ff7f54 --- /dev/null +++ b/tx_service/cc_resubscribe_v2_impl_tasks_en.md @@ -0,0 +1,319 @@ +## Resubscribe v2 Implementation Task Breakdown + +### 0. Purpose + +This checklist translates `cc_resubscribe_epoch_design.md` (revised v2) into executable implementation tasks. +Each task includes: + +- `Goal` +- `Status` (`TODO` / `IN_PROGRESS` / `DONE` / `BLOCKED`) +- `Files Involved` +- `State Changes` +- `Function Changes` +- `Definition of Done` + +--- + +### 1. Recommended Execution Order + +1. Complete protocol fields and end-to-end propagation first (T1/T2/T3). +2. Then complete primary-side round source-of-truth and strict validation (T4/T5/T6/T7). +3. Finally complete standby local preemption and strict message filtering (T8/T9). +4. Use the test matrix as the regression gate (T10). + +--- + +### T1. Proto Field Extensions and Semantic Comments + +- Status: `TODO` +- Goal: provide protocol carriers for round propagation and validation. +- Files Involved: + - `data_substrate/tx_service/include/proto/cc_request.proto` + - Generated pb outputs (build-generated) +- State Changes: + - No runtime state additions. +- Function Changes: + - No direct function logic changes in this task (protocol-only). +- Detailed Changes: + - Add `uint32 subscribe_round_id` to `OnStartFollowingRequest` (optional, default 0). + - Add `bool resubscribe_intent` to `OnStartFollowingRequest` (optional, default false). + - Add `uint32 subscribe_round_id` to `StandbyStartFollowingRequest` (optional hint). + - Add `bool resubscribe_intent` to `StandbyStartFollowingRequest` (optional, default false). + - Add `uint32 subscribe_round_id` to `KeyObjectStandbyForwardRequest` (optional). + - Keep `StandbyStartFollowingResponse.subscribe_id`, and clarify in comments that its semantic is `subscribe_round_id`. +- Definition of Done: + - Build passes. + - New fields are correctly propagated and used in validation. + +--- + +### T2. End-to-End `subscribe_round_id + resubscribe_intent` Propagation in `OnStartFollowing` + +- Status: `TODO` +- Goal: allow host manager triggers, standby-local triggers, and out-of-sync triggers to enter the unified path with round hints. +- Files Involved: + - `data_substrate/tx_service/include/sharder.h` + - `data_substrate/tx_service/src/sharder.cpp` + - `data_substrate/tx_service/include/fault/cc_node.h` + - `data_substrate/tx_service/src/fault/cc_node.cpp` + - `data_substrate/tx_service/include/remote/cc_node_service.h` + - `data_substrate/tx_service/src/remote/cc_node_service.cpp` + - `data_substrate/tx_service/src/cc/cc_shard.cpp` + - `data_substrate/tx_service/src/remote/cc_stream_receiver.cpp` + - `data_substrate/tx_service/include/cc/catalog_cc_map.h` (fault-inject path) +- State Changes: + - No new global state (parameter propagation only). +- Function Changes: + - Add `subscribe_round_id` parameter to `Sharder::OnStartFollowing(...)` (default 0). + - Add `resubscribe_intent` parameter to `Sharder::OnStartFollowing(...)` (default false). + - Add `subscribe_round_id` parameter to `CcNode::OnStartFollowing(...)`. + - Add `resubscribe_intent` parameter to `CcNode::OnStartFollowing(...)`. + - Update `CcNodeService::OnStartFollowing(...)` to read and pass through new field. + - Update trigger call sites: lag, stream idle timeout, out-of-sync, host manager path. + - Semantic constraints: + - lag / idle timeout: pass `current_round + resubscribe_intent=true`. + - out-of-sync (round pre-assigned by primary): pass `subscribe_round_id= + resubscribe_intent=false`. +- Definition of Done: + - All compile-time call sites updated. + - `subscribe_round_id + resubscribe_intent` are propagated correctly end-to-end. + +--- + +### T3. Primary Single Source of Truth: Round Allocation and Latest-Round Table + +- Status: `TODO` +- Goal: unify `subscribe_round_id` source of truth and prevent cross-shard drift. +- Files Involved: + - `data_substrate/tx_service/include/sharder.h` + - `data_substrate/tx_service/src/sharder.cpp` +- State Changes: + - Reuse: `subscribe_counter_`. + - Add: `latest_subscribe_round_[node_id]` (recommended in `Sharder` for centralized ownership). + - Add: `out_of_sync_ctrl_[node_id]` (stores `{inflight_round, inflight, last_send_ts, last_progress_ts}` or equivalent) for deduplicating repeated out-of-sync triggers. + - Add: concurrency protection for this map (`mutex`/`shared_mutex`). +- Config parameters (default): + - `same_round_resend_interval_ms=2000` + - `no_progress_timeout_ms=20000` +- Function Changes (recommended new APIs): + - `uint32_t ResolveSubscribeRound(uint32_t node_id, uint32_t hint_round)` + - `uint32_t LatestSubscribeRound(uint32_t node_id) const` + - `bool IsLatestSubscribeRound(uint32_t node_id, uint32_t round) const` + - `bool ShouldReuseOutOfSyncRound(uint32_t node_id, uint32_t latest_round) const` + - `void MarkOutOfSyncInflight(uint32_t node_id, uint32_t round)` + - `void ClearOutOfSyncInflight(uint32_t node_id, uint32_t round)` + - `bool ShouldEscalateOutOfSyncRound(uint32_t node_id, uint64_t now_ms) const` +- Definition of Done: + - Round values are monotonic increasing only. + - Stale hint rounds are detectable as stale. + +--- + +### T4. Strict Round Resolution and Return in `StandbyStartFollowing` + +- Status: `TODO` +- Goal: centralize round decision logic in primary RPC entry and make “requested round -> final round” traceable. +- Files Involved: + - `data_substrate/tx_service/src/remote/cc_node_service.cpp` + - `data_substrate/tx_service/include/cc/cc_shard.h` + - `data_substrate/tx_service/src/cc/cc_shard.cpp` +- State Changes: + - Candidate state must include round/standby_term (cannot remain `start_seq_id` only). + - Recommended upgrade for `candidate_standby_nodes_`: + `node_id -> {start_seq_id, standby_term}`. +- Function Changes: + - `CcNodeService::StandbyStartFollowing(...)` + - Resolve round from `subscribe_round_id + resubscribe_intent`. + - If `resubscribe_intent=true` and `hint_round==latest`, allocate a newer round. + - If `resubscribe_intent=false` and `hint_round==latest`, return idempotent success. + - Build `standby_term=(term<<32)|round` and pass it into shard candidate registration. + - Extend `CcShard::AddCandidateStandby(...)` signature to include `standby_term` (or round). + - Add precise removal capability in `CcShard::RemoveCandidateStandby(...)` by `standby_term`. +- Definition of Done: + - Stale-round requests are rejected. + - Returned `subscribe_id` matches primary-side record. + +--- + +### T5. Strict Validation + Idempotency in `ResetStandbySequenceId` + +- Status: `TODO` +- Goal: prevent late reset requests from polluting newer rounds; enforce triple checks: term + round + candidate validity. +- Files Involved: + - `data_substrate/tx_service/src/remote/cc_node_service.cpp` + - `data_substrate/tx_service/include/cc/cc_shard.h` + - `data_substrate/tx_service/src/cc/cc_shard.cpp` + - `data_substrate/tx_service/src/cc/local_cc_shards.cpp` (heartbeat target ordering) +- State Changes: + - Both subscribed/candidate states must support round-aware (`standby_term`) checks. +- Function Changes: + - `CcNodeService::ResetStandbySequenceId(...)` + - Parse `(term, round)` from `standby_node_term`. + - Validate `term == current leader term`. + - Validate `round == latest_subscribe_round_[node_id]`. + - Validate candidate is still valid; otherwise reject as stale. + - If already subscribed with same `(term, round)`, return idempotent success. + - Add helper APIs in `CcShard`: + - `HasCandidateStandby(node_id, standby_term)` + - `HasSubscribedStandby(node_id, standby_term)` + - `PromoteCandidateToSubscribed(...)` (optional) +- Definition of Done: + - Old-round reset cannot override new-round state. + - Repeated same-round reset returns idempotent success. + +--- + +### T6. Explicit Round in Out-of-Sync Path + +- Status: `TODO` +- Goal: make the key passive-resubscribe path round-directed, so standby does not retry under stale context. +- Files Involved: + - `data_substrate/tx_service/src/cc/cc_shard.cpp` + - `data_substrate/tx_service/include/proto/cc_request.proto` + - `data_substrate/tx_service/include/cc/cc_request.h` +- State Changes: + - Before sending out-of-sync notification, primary checks/updates `out_of_sync_ctrl_`; repeated triggers should reuse inflight round by default. +- Function Changes: + - `CcShard::NotifyStandbyOutOfSync(uint32_t node_id)` + - If this standby already has inflight out-of-sync round: + - if `now-last_progress_ts < 20s` and `now-last_send_ts >= 2s`: resend same round; + - if `now-last_progress_ts < 20s` and `now-last_send_ts < 2s`: ignore; + - if `now-last_progress_ts >= 20s`: allocate newer round. + - Otherwise allocate/update round and set `req->set_subscribe_round_id(...)`. + - After sending, call `MarkOutOfSyncInflight(node_id, round)`. + - Standby out-of-sync handling calls + `OnStartFollowing(..., subscribe_round_id=, resubscribe_intent=false)`. +- Coordination with `T4`: + - After `StandbyStartFollowing` validation succeeds, call `ClearOutOfSyncInflight(node_id, round)`. +- Definition of Done: + - Round used by standby follow-up flow matches primary-side round after out-of-sync trigger. + - Repeated out-of-sync triggers for the same standby during inflight do not allocate multiple newer rounds. + - Threshold behavior is correct: same-round resend after >=2s; newer-round escalation after >=20s no progress. + +--- + +### T7. Standby Local Preemption: `follow_req_id` Framework + +- Status: `TODO` +- Goal: fix “old flow not finished, new same-term request dropped.” +- Files Involved: + - `data_substrate/tx_service/include/fault/cc_node.h` + - `data_substrate/tx_service/src/fault/cc_node.cpp` +- State Changes: + - Add `requested_follow_req_id_`. + - Add `active_follow_req_id_`. + - Add `active_subscribe_ctx_` (packed term+round). + - Add `pending_resubscribe_intent_` (or equivalent) to coalesce repeated same-term local “request-new-round” triggers. + - Remove `requested_subscribe_primary_term_` after refactor completion. +- Function Changes: + - `CcNode::OnStartFollowing(...)` + - Generate a larger `follow_req_id` on every call. + - Stop relying only on term-based suppression for same-term requests. + - Coalesce repeated same-term local triggers (if a “request-new-round in progress” or “new round resolved but still running” flow already exists, do not issue another `resubscribe_intent=true` request). + - `CcNode::SubscribePrimaryNode(...)` + - Extend signature with `follow_req_id`, `subscribe_round_id_hint`, and `resubscribe_intent`. + - Check “still latest follow_req_id” before/after every blocking point. + - After `StandbyStartFollowing` returns round, update request context and force all follow-up retries to use that round with `resubscribe_intent=false`. + - Non-latest request exits immediately without committing state. +- Definition of Done: + - Later same-term request can preempt earlier in-flight request. + +--- + +### T8. Context-Consistency Checks in `SubscribePrimaryNode` + +- Status: `TODO` +- Goal: enforce `(term, round, follow_req_id)` consistency checks before applying any key RPC response. +- Files Involved: + - `data_substrate/tx_service/src/fault/cc_node.cpp` + - `data_substrate/tx_service/src/remote/cc_node_service.cpp` + - `data_substrate/tx_service/include/store/data_store_handler.h` (interface remains reused) +- State Changes: + - Update `active_subscribe_ctx_` only when request is confirmed effective. +- Function Changes: + - In `SubscribePrimaryNode`: + - Send `StandbyStartFollowingRequest.subscribe_round_id` and `resubscribe_intent`. + - After response, verify request is still current before setting `standby_term` and before candidate/snapshot/reset state commits. + - Ensure reset/snapshot follow-up requests always carry the resolved round (retry semantics should use `resubscribe_intent=false`). + - Add follow-request checks to reset/snapshot retry-loop exit conditions. +- Definition of Done: + - Late responses cannot overwrite newer request state. + +--- + +### T9. Strict Forward Message Filtering + +- Status: `TODO` +- Goal: enforce strict round filtering and reject stale/legacy forwarding messages directly. +- Files Involved: + - `data_substrate/tx_service/include/cc/cc_request.h` + - `data_substrate/tx_service/src/cc/cc_shard.cpp` + - `data_substrate/tx_service/include/standby.h` +- State Changes: + - No new core state (reuse `active_subscribe_ctx_` / `standby_term`). +- Function Changes: + - `KeyObjectStandbyForwardCc::ValidTermCheck()` + - If message carries round, compare with active round. + - If message has no round, reject it as invalid. + - `CcShard::ForwardStandbyMessage(...)` + - Always include round for both normal forwarding messages and out-of-sync notifications. +- Definition of Done: + - Stale-round or no-round forwarding messages are never applied. + +--- + +### T10. Test Work Items (Matrix Implementation) + +- Status: `TODO` +- Goal: cover same-term preemption, late responses, and concurrent triggers. +- Files Involved: + - Relevant integration tests under `tests/` + - Fault-injection scenarios (`CODE_FAULT_INJECTOR` points) +- State Changes: + - None. +- Function Changes: + - Mainly add/adjust tests; no production-logic changes required. +- Minimum Test Cases: + 1. Two same-term resubscribe requests; the second preempts the first in-flight request. + 2. Concurrent out-of-sync + lag trigger; only latest round becomes effective. + 3. Late `StandbyStartFollowingResponse` is dropped. + 4. Late `ResetStandbySequenceId` is rejected or handled idempotently. + 5. Snapshot coverage remains monotonic by round (no rollback). + 6. Host manager trigger using `current_round + resubscribe_intent=true` does not conflict with concurrent resubscribe. + 7. Forwarding messages that lack round fields are rejected. + 8. Repeated same-term standby-local triggers (lag/idle) produce only one `resubscribe_intent=true` request; all others are coalesced. + 9. After round=`R` is resolved, all follow-up retries carry `R` with `resubscribe_intent=false`. + 10. Repeated primary out-of-sync triggers for the same standby in a short interval reuse inflight round and do not continuously allocate newer rounds. + 11. Repeated out-of-sync triggers with `now-last_send_ts < 2s` are ignored and do not send duplicate notifications. + 12. After 2s and while `now-last_progress_ts < 20s`, same inflight round is resent. + 13. After 20s no progress, a newer round is allocated for escalation. +- Definition of Done: + - All cases pass, with no new deadlocks or stalls. + +--- + +### 2. Recommended Task Dependency Graph + +1. `T1 -> T2 -> T7 -> T8` +2. `T1 -> T3 -> T4 -> T5` +3. `T1 -> T6 -> T9` +4. `T10` is final gate. + +--- + +### 3. High-Risk Points to Confirm Before Coding + +1. `candidate_standby_nodes_` currently has no round dimension; without upgrade, cross-round wrong remove/promote can happen. +2. `ResetStandbySequenceId` currently adds heartbeat target before validation; ordering should be adjusted to avoid stale pollution. +3. `SubscribePrimaryNode` has multiple blocking retry loops; missing follow_req checks in any one loop leaves a race-commit window. +4. In multi-standby forwarding, round must be stamped and validated per standby strictly to avoid cross-standby round pollution. +5. Wrong cleanup timing of out-of-sync dedup state can cause either “never allocate newer round again” or “allocate too many newer rounds”. + +--- + +### 4. Current Execution Status + +- Global Status: `READY_FOR_IMPLEMENTATION` +- Completed Documents: + - Design (Chinese v2) + - Design (English version) + - This implementation task breakdown From eaef85f9e3e0de9c86ce9bd4a0344752a8ac3d39 Mon Sep 17 00:00:00 2001 From: Kevin Chou Date: Fri, 6 Mar 2026 17:13:53 +0800 Subject: [PATCH 2/4] move doc location --- .../cc_resubscribe_epoch_design_en.md | 0 .../cc_resubscribe_v2_impl_tasks_en.md | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename {tx_service => .cursor_knowledge}/cc_resubscribe_epoch_design_en.md (100%) rename {tx_service => .cursor_knowledge}/cc_resubscribe_v2_impl_tasks_en.md (100%) diff --git a/tx_service/cc_resubscribe_epoch_design_en.md b/.cursor_knowledge/cc_resubscribe_epoch_design_en.md similarity index 100% rename from tx_service/cc_resubscribe_epoch_design_en.md rename to .cursor_knowledge/cc_resubscribe_epoch_design_en.md diff --git a/tx_service/cc_resubscribe_v2_impl_tasks_en.md b/.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md similarity index 100% rename from tx_service/cc_resubscribe_v2_impl_tasks_en.md rename to .cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md From d1ac798d75c39460ac4ba9ec92f6236fde6edfab Mon Sep 17 00:00:00 2001 From: Kevin Chou Date: Fri, 6 Mar 2026 19:35:53 +0800 Subject: [PATCH 3/4] update doc --- .../cc_resubscribe_epoch_design_en.md | 17 ++++++----------- .../cc_resubscribe_v2_impl_tasks_en.md | 17 +++++++---------- 2 files changed, 13 insertions(+), 21 deletions(-) diff --git a/.cursor_knowledge/cc_resubscribe_epoch_design_en.md b/.cursor_knowledge/cc_resubscribe_epoch_design_en.md index ca04746c..5c77d787 100644 --- a/.cursor_knowledge/cc_resubscribe_epoch_design_en.md +++ b/.cursor_knowledge/cc_resubscribe_epoch_design_en.md @@ -42,21 +42,16 @@ Note: `follow_req_id` is not a protocol field and does not participate in cross- ### 5. Protocol and Message Changes -1. `OnStartFollowingRequest` - - Add optional `uint32 subscribe_round_id` (default `0`). - - Add optional `bool resubscribe_intent` (default `false`). - - Semantics: `subscribe_round_id` is the caller-known current round; `resubscribe_intent=true` means “request primary to allocate a newer round.” - -2. `KeyObjectStandbyForwardRequest` +1. `KeyObjectStandbyForwardRequest` - Add optional `uint32 subscribe_round_id`. - Both out-of-sync notifications and regular forwarding messages must carry round. -3. `StandbyStartFollowingRequest / Response` +2. `StandbyStartFollowingRequest / Response` - Request adds optional `subscribe_round_id` hint. - Request adds optional `resubscribe_intent` (default `false`). - Response keeps `subscribe_id`, with semantics explicitly defined as `subscribe_round_id`. -4. `ResetStandbySequenceIdRequest / Response` +3. `ResetStandbySequenceIdRequest / Response` - No new mandatory field. Continue using `standby_node_term` to carry `(term, round)`. - Server must explicitly parse and validate round. @@ -65,9 +60,9 @@ Note: `follow_req_id` is not a protocol field and does not participate in cross- #### 6.1 Unified Trigger Entrypoints - out-of-sync: primary first checks `out_of_sync_ctrl_[node_id]` for dedup, then decides whether to allocate/reuse `subscribe_round_id` before notifying standby. -- lag threshold (standby local trigger): `OnStartFollowing(..., subscribe_round_id=current_round, resubscribe_intent=true, resubscribe=true)`. +- lag threshold (standby local trigger): call `OnStartFollowing(..., resubscribe=true)` to start the resubscribe workflow; carry `current_round + resubscribe_intent=true` in subsequent `StandbyStartFollowingRequest`. - stream idle timeout (standby local trigger): same as above. -- host manager `OnStartFollowing`: can also trigger a new round with `current_round + resubscribe_intent=true`. +- host manager `OnStartFollowing`: trigger-only entry; it does not carry round/intention, and standby still provides them in `StandbyStartFollowingRequest`. #### 6.2 Standby Local Preemption @@ -150,7 +145,7 @@ Note: `follow_req_id` is not a protocol field and does not participate in cross- 3. Late `StandbyStartFollowingResponse` (old round) is dropped. 4. Late `ResetStandbySequenceId` (old round) is rejected or handled idempotently as expected. 5. Snapshot request coverage and round monotonicity are consistent (no rollback). -6. Host manager `OnStartFollowing` (`current_round + resubscribe_intent=true`) does not conflict with concurrent resubscribe. +6. Host manager `OnStartFollowing` (without round/intention payload) does not conflict with concurrent resubscribe. 7. Forwarding messages missing round field are rejected, and warning metrics are correct. 8. Multiple standby-local triggers in the same term (lag/idle) must result in only one “request-new-round (`resubscribe_intent=true`)” request; all others are locally coalesced. 9. After standby gets round=`R`, all follow-up retries must carry `R` with `resubscribe_intent=false`. diff --git a/.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md b/.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md index 06ff7f54..b1edff71 100644 --- a/.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md +++ b/.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md @@ -35,8 +35,6 @@ Each task includes: - Function Changes: - No direct function logic changes in this task (protocol-only). - Detailed Changes: - - Add `uint32 subscribe_round_id` to `OnStartFollowingRequest` (optional, default 0). - - Add `bool resubscribe_intent` to `OnStartFollowingRequest` (optional, default false). - Add `uint32 subscribe_round_id` to `StandbyStartFollowingRequest` (optional hint). - Add `bool resubscribe_intent` to `StandbyStartFollowingRequest` (optional, default false). - Add `uint32 subscribe_round_id` to `KeyObjectStandbyForwardRequest` (optional). @@ -47,17 +45,15 @@ Each task includes: --- -### T2. End-to-End `subscribe_round_id + resubscribe_intent` Propagation in `OnStartFollowing` +### T2. Local Trigger Propagation of `subscribe_round_id + resubscribe_intent` - Status: `TODO` -- Goal: allow host manager triggers, standby-local triggers, and out-of-sync triggers to enter the unified path with round hints. +- Goal: ensure standby-local and out-of-sync triggers can carry round hints and land uniformly in `StandbyStartFollowingRequest`; keep host manager entry semantics unchanged. - Files Involved: - `data_substrate/tx_service/include/sharder.h` - `data_substrate/tx_service/src/sharder.cpp` - `data_substrate/tx_service/include/fault/cc_node.h` - `data_substrate/tx_service/src/fault/cc_node.cpp` - - `data_substrate/tx_service/include/remote/cc_node_service.h` - - `data_substrate/tx_service/src/remote/cc_node_service.cpp` - `data_substrate/tx_service/src/cc/cc_shard.cpp` - `data_substrate/tx_service/src/remote/cc_stream_receiver.cpp` - `data_substrate/tx_service/include/cc/catalog_cc_map.h` (fault-inject path) @@ -68,14 +64,15 @@ Each task includes: - Add `resubscribe_intent` parameter to `Sharder::OnStartFollowing(...)` (default false). - Add `subscribe_round_id` parameter to `CcNode::OnStartFollowing(...)`. - Add `resubscribe_intent` parameter to `CcNode::OnStartFollowing(...)`. - - Update `CcNodeService::OnStartFollowing(...)` to read and pass through new field. - - Update trigger call sites: lag, stream idle timeout, out-of-sync, host manager path. + - Keep `CcNodeService::OnStartFollowing(...)` protocol unchanged; it remains a trigger entry. + - Update trigger call sites: lag, stream idle timeout, out-of-sync path. - Semantic constraints: - lag / idle timeout: pass `current_round + resubscribe_intent=true`. - out-of-sync (round pre-assigned by primary): pass `subscribe_round_id= + resubscribe_intent=false`. + - host manager path: no round/intention payload (default values). - Definition of Done: - All compile-time call sites updated. - - `subscribe_round_id + resubscribe_intent` are propagated correctly end-to-end. + - `subscribe_round_id + resubscribe_intent` are propagated correctly along the “local trigger -> StandbyStartFollowingRequest” path. --- @@ -278,7 +275,7 @@ Each task includes: 3. Late `StandbyStartFollowingResponse` is dropped. 4. Late `ResetStandbySequenceId` is rejected or handled idempotently. 5. Snapshot coverage remains monotonic by round (no rollback). - 6. Host manager trigger using `current_round + resubscribe_intent=true` does not conflict with concurrent resubscribe. + 6. Host manager `OnStartFollowing` trigger (without round/intention payload) does not conflict with concurrent resubscribe. 7. Forwarding messages that lack round fields are rejected. 8. Repeated same-term standby-local triggers (lag/idle) produce only one `resubscribe_intent=true` request; all others are coalesced. 9. After round=`R` is resolved, all follow-up retries carry `R` with `resubscribe_intent=false`. From 4fca194bc601200d3967f4ee1356c8bfcc6c56df Mon Sep 17 00:00:00 2001 From: Kevin Chou Date: Mon, 9 Mar 2026 14:28:57 +0800 Subject: [PATCH 4/4] rename --- ...subscribe_epoch_design_en.md => standby_resubscribe_design.md} | 0 ...ribe_v2_impl_tasks_en.md => standby_resubscribe_impl_tasks.md} | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename .cursor_knowledge/{cc_resubscribe_epoch_design_en.md => standby_resubscribe_design.md} (100%) rename .cursor_knowledge/{cc_resubscribe_v2_impl_tasks_en.md => standby_resubscribe_impl_tasks.md} (100%) diff --git a/.cursor_knowledge/cc_resubscribe_epoch_design_en.md b/.cursor_knowledge/standby_resubscribe_design.md similarity index 100% rename from .cursor_knowledge/cc_resubscribe_epoch_design_en.md rename to .cursor_knowledge/standby_resubscribe_design.md diff --git a/.cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md b/.cursor_knowledge/standby_resubscribe_impl_tasks.md similarity index 100% rename from .cursor_knowledge/cc_resubscribe_v2_impl_tasks_en.md rename to .cursor_knowledge/standby_resubscribe_impl_tasks.md