
Allow run on candidate #392

Open

yi-xmu wants to merge 7 commits into main from allow_run_on_candidate

Conversation

@yi-xmu
Collaborator

@yi-xmu yi-xmu commented Feb 2, 2026

To keep OUT_OF_MEMORY errors from leaving log replay unable to complete during cc node recovery, DataSync operations should be allowed under candidate-leader conditions. Additionally:

  1. In this case we can flush data and kick out the cce, but we cannot truncate the redo log based on this data sync ts, since we might miss data that has not been recovered yet.
  2. We cannot update the last sync ts of the range/catalog.
  3. The term check is placed uniformly in the Execute method of CcRequest, using the ValidTermCheck method.
  4. A flag named allow_run_on_candidate_ is added to CcRequest to determine whether to check the candidate term when performing term checks.
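The candidate-aware term check described in points 3 and 4 can be sketched as follows. `Sharder`, `CcRequest`, `ValidTermCheck`, and `allow_run_on_candidate_` come from the PR summary; the concrete types and signatures below are illustrative stand-ins, not the project's actual interfaces:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative stand-in for the Sharder term sources; -1 means
// "this node is not in that role".
struct MockSharder
{
    int64_t leader_term = -1;           // > 0 only on the current leader
    int64_t candidate_leader_term = -1; // > 0 while log replay is ongoing

    int64_t LeaderTerm() const { return leader_term; }
    int64_t CandidateLeaderTerm() const { return candidate_leader_term; }
};

// Mirrors the idea of CcRequest::ValidTermCheck(): a request tagged with
// allow_run_on_candidate_ also accepts the candidate leader term, so
// DataSync-style work can proceed before recovery finishes.
bool ValidTermCheck(const MockSharder &sharder,
                    int64_t req_term,
                    bool allow_run_on_candidate)
{
    int64_t ng_term = sharder.LeaderTerm();
    if (allow_run_on_candidate)
    {
        ng_term = std::max(ng_term, sharder.CandidateLeaderTerm());
    }
    return ng_term > 0 && ng_term == req_term;
}
```

With this shape, an ordinary request on a candidate node fails the term check, while a request carrying the flag passes once the candidate term matches.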

Summary by CodeRabbit

  • Refactor
    • Improved internal leadership validation and term management to enhance system stability during recovery and cluster transitions.
    • Enhanced transaction execution logic to better handle candidate node scenarios during failover.
    • Refined data synchronization and term resolution to ensure consistency and reliability across cluster members.
    • Strengthened error handling and diagnostics for cluster operations.

fix release metadata readlocc

fix update last sync ts

update debug log

debug log and debug print ccpage function

debug log for range split datasyncscan and debug print ccmap function

disable debug log

fix slice size mismatch
@yi-xmu yi-xmu self-assigned this Feb 2, 2026
@coderabbitai

coderabbitai bot commented Feb 2, 2026

Walkthrough

This pull request extends transaction service components to support running operations on candidate leader nodes, in addition to current leaders. It refactors term resolution across multiple CC request types and handlers to use request-provided terms with candidate/standby fallbacks, updates method signatures to propagate an allow_run_on_candidate flag, and adds a check_candidate parameter to leader-term verification.

Changes

Cohort / File(s) — Summary

  • Term Resolution Refactoring (cc_catalog_cc_map.h, cluster_config_cc_map.h, range_bucket_cc_map.h, range_cc_map.h): Replaced direct Sharder::Instance().LeaderTerm() calls with req.NodeGroupTerm() and introduced multi-step fallback logic: if the term is negative, attempt CandidateLeaderTerm() when allowed, then fall back to LeaderTerm() and StandbyNodeTerm(), with assertions enforcing positive terms.
  • Template CC Map Updates (template_cc_map.h): Refactored term derivation across Execute paths in multiple CC types (AcquireCc, AcquireAllCc, PostWriteCc, ReadCc, ScanNextBatchCc) to use request-sourced terms and candidate-aware logic; removed cascading external leader-term validation blocks.
  • CC Request Infrastructure (cc_request.h, cc_shard.h, cc_req_misc.h): Extended request classes with new term accessors (NodeGroupTerm(), GetNodeGroupLeaderTerm()), added ValidTermCheck() overrides incorporating candidate terms, and introduced allow_run_on_candidate flag propagation through constructors and reset methods.
  • Handler Method Signatures (cc_handler.h, local_cc_handler.h, remote_cc_handler.h): Added a bool allow_run_on_candidate parameter to public methods including AcquireWrite, PostWrite, PostRead, Read, ReadLocal, and NewTxn across all handler implementations, with defaults preserving compatibility.
  • Transaction Execution (tx_execution.h, tx_execution.cpp, tx_request.h, tx_util.h): Extended the InitTx signature and TransactionExecution with an allow_run_on_candidate parameter; removed the is_recovering flag from ReadTxRequest; integrated the new flag into candidate-aware term checking via CheckLeaderTerm().
  • Sharder and Support (sharder.h, sharder.cpp): Extended CheckLeaderTerm() with an optional check_candidate parameter to incorporate candidate leader terms when validating leadership, altering error handling for non-leader conditions.
  • Partition and Sequence Management (local_cc_shards.h, local_cc_shards.cpp, sequences.h, sequences.cpp): Added a NodeGroupId ng_id parameter to partition ID allocation (GetNextRangePartitionId, ApplyIdOfTableRangePartition); updated call sites to pass ng_id and extended NewTxInit invocations with additional transaction context parameters.
  • Data Sync Infrastructure (data_sync_task.h, data_sync_task.cpp, checkpointer.cpp): Added a run_on_leader_node_ member flag to DataSyncTask; updated CheckLeaderTerm usages to include the new boolean parameter for candidate-aware validation.
  • Remote CC and Proto (remote_cc_handler.h, remote_cc_handler.cpp, remote_cc_request.cpp, cc_request.proto): Added an allow_run_on_candidate parameter to RemoteCcHandler::Read and propagated it into ReadRequest; added a corresponding proto field to the ReadRequest message.
  • Logging and Diagnostics (cc_req_misc.h, tx_operation.cpp): Enhanced abort logging in FillStoreSliceCc with the table name; improved error diagnostics in SplitFlushRangeOp by appending error codes to log messages.
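The multi-step fallback described in the Term Resolution Refactoring cohort can be sketched as a pure function. The three term parameters stand in for `Sharder::Instance().CandidateLeaderTerm(ng_id)`, `LeaderTerm(ng_id)`, and `StandbyNodeTerm()`; the function name and signature are illustrative, not the project's API:

```cpp
#include <algorithm>
#include <cstdint>

// Resolve the node-group term for a request. -1 in any source means
// "this node does not hold that role".
int64_t ResolveNodeGroupTerm(int64_t req_term,
                             bool allow_run_on_candidate,
                             int64_t candidate_term,
                             int64_t leader_term,
                             int64_t standby_term)
{
    int64_t ng_term = req_term;
    if (ng_term < 0)
    {
        // Step 1: use the candidate leader term only when the request
        // explicitly allows running on a candidate.
        if (allow_run_on_candidate)
        {
            ng_term = candidate_term;
        }
        // Step 2: fall back to the leader term, then the standby term.
        if (ng_term < 0)
        {
            ng_term = std::max(leader_term, standby_term);
        }
    }
    return ng_term; // call sites assert(ng_term > 0) before using it
}
```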

Sequence Diagram

sequenceDiagram
    actor Client
    participant TransactionExecution
    participant CcHandler as CcHandler<br/>(LocalCcHandler)
    participant Sharder
    participant CcRequest as CcRequest<br/>(e.g., ReadCc)
    participant CcShard

    Client->>TransactionExecution: InitTx(allow_run_on_candidate=true)
    TransactionExecution->>TransactionExecution: Store allow_run_on_candidate_
    TransactionExecution->>CcHandler: NewTxn(..., allow_run_on_candidate=true)
    CcHandler->>CcRequest: Reset(..., allow_run_on_candidate=true)
    CcRequest->>CcRequest: Store allow_run_on_candidate flag

    Client->>TransactionExecution: Read(...)
    TransactionExecution->>CcHandler: Read(..., allow_run_on_candidate_=true)
    CcHandler->>CcRequest: Reset(..., allow_run_on_candidate=true)
    
    CcRequest->>Sharder: CandidateLeaderTerm(ng_id)
    Sharder-->>CcRequest: candidate_term (if available)
    
    alt candidate_term available and allowed
        CcRequest->>CcRequest: ng_term = candidate_term
    else fallback
        CcRequest->>Sharder: LeaderTerm(ng_id)
        Sharder-->>CcRequest: leader_term
        CcRequest->>Sharder: StandbyNodeTerm()
        Sharder-->>CcRequest: standby_term
        CcRequest->>CcRequest: ng_term = max(leader_term, standby_term)
    end
    
    CcRequest->>CcShard: Execute(ng_term)
    CcShard->>CcShard: Use ng_term for lock/data operations
    CcShard-->>CcRequest: Result
    CcRequest-->>CcHandler: Result
    CcHandler-->>TransactionExecution: Result
    TransactionExecution-->>Client: Data

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Store ng leader term with ng_leader_cache. #129: Modifies Sharder leader-term propagation and term handling, overlapping with this PR's refactoring of leader-term checks and introduction of candidate-term logic.
  • Fix eloqkv report oom #203: Updates the same CcHandler/Read/Acquire/PostWrite method signatures, adding boolean parameters to parallel interfaces affected by this candidate-aware execution feature.

Suggested reviewers

  • liunyl
  • thweetkomputer

🐰 In meadows of leaders, a shift takes place,
Candidates rise to join the race,
No longer waiting on the side,
They execute with fallback pride,
Terms resolve with grace and care—
Distributed consensus everywhere!

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 7.44%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: The title 'Allow run on candidate' is vague and generic, using non-descriptive terminology that doesn't convey the specific technical change or its business purpose. Resolution: consider a more specific title such as 'Allow DataSync operations on candidate leaders during recovery' or 'Support candidate-leader execution to prevent OOM during log replay'.
✅ Passed checks (1 passed)
  • Description check — ✅ Passed: The PR description explains the motivation and implementation approach well, but the required template checklist items are not addressed.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
tx_service/src/remote/remote_cc_request.cpp (1)

478-495: ⚠️ Potential issue | 🔴 Critical

Fix ReadCc::Reset argument order for outside reads.
After inserting allow_run_on_candidate, the outside‑read call shifts arguments (-1 → bool, abort_if_oom → partition_id). Pass point_read_on_miss explicitly to keep parameters aligned.

🛠️ Suggested fix
         ReadCc::Reset(&remote_table_name_,
                       req.schema_version(),
                       &req.key(),
                       req.key_shard_code(),
                       out_record,
                       read_type,
                       input_msg->tx_number(),
                       input_msg->tx_term(),
                       req.ts(),
                       &cc_res_,
                       ToLocalType::ConvertIsolation(req.iso_level()),
                       ToLocalType::ConvertProtocol(req.protocol()),
                       req.is_for_write(),
                       req.is_covering_keys(),
                       nullptr,
-                      false,
-                      -1,
-                      req.abort_if_oom());
+                      false,  // allow_run_on_candidate
+                      false,  // point_read_on_miss
+                      -1,     // partition_id
+                      req.abort_if_oom());
tx_service/src/tx_execution.cpp (1)

2006-2014: ⚠️ Potential issue | 🟠 Major

Avoid stale bucket-lock state on cache hit.
Marking lock_range_bucket_result_ finished here can cause later ops (e.g., object commands) to skip locking and reuse a stale bucket_rec_ when FastToGetBucket() misses, which can misroute shard codes. Prefer clearing the lock result on cache hits instead of setting it finished.

Proposed fix
-                    // Set the lock range bucket result
-                    lock_range_bucket_result_.SetFinished();
+                    // Cache hit doesn't populate bucket_rec_; avoid leaking
+                    // prior lock state into later ops.
+                    lock_range_bucket_result_.Value().Reset();
+                    lock_range_bucket_result_.Reset();
tx_service/include/cc/template_cc_map.h (2)

504-512: ⚠️ Potential issue | 🔴 Critical

Add the trailing semicolon after CODE_FAULT_INJECTOR.

Line 512 now ends the macro invocation with }). If the macro expands to a do { ... } while (0) block (common pattern), omitting the semicolon breaks compilation. Other call sites use });.

🛠️ Proposed fix
-        })
+        });
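The `do { ... } while (0)` pattern mentioned above is the standard way to make a multi-statement macro behave like a single statement, which is exactly why the trailing semicolon is required at the call site. A minimal illustration (the `RUN_ONCE` macro here is a stand-in; `CODE_FAULT_INJECTOR` itself is project-specific):

```cpp
// A do-while(0) macro expands to a statement that still needs its
// terminating ';' at the call site.
#define RUN_ONCE(body)    \
    do                    \
    {                     \
        body;             \
    } while (0)

int CountedCall(int &counter)
{
    // Without the ';' after RUN_ONCE(...), the following `else` would not
    // compile: `while (0)` expects a ';' to terminate the if-branch
    // statement before `else` can attach.
    if (counter >= 0)
        RUN_ONCE(++counter);
    else
        RUN_ONCE(--counter);
    return counter;
}
```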

6053-6087: ⚠️ Potential issue | 🔴 Critical

Replace the leader_term > 0 check with CheckLeaderTerm() to properly verify this node is the leader.

GetNodeGroupLeaderTerm() returns std::max(LeaderTerm(ng_id), CandidateLeaderTerm(ng_id)), not an indicator of whether the local node is leader. A candidate with a higher term would cause leader_term > 0 even though this node is not the leader, triggering the assert incorrectly. Use Sharder::Instance().CheckLeaderTerm(cc_ng_id_, current_term) instead to verify the node actually holds leadership before asserting buffered commands are invalid. Also verify that current_term (from req.NodeGroupTerm()) is guaranteed positive at this point.

tx_service/include/cc/cc_request.h (1)

402-551: ⚠️ Potential issue | 🟠 Major

Reset allow_run_on_candidate_ in the string-key Reset to avoid stale state.

These requests are pooled; if a prior call set the flag to true, the string-key Reset currently leaves it unchanged and can unintentionally run remote acquires on candidates.

🔧 Suggested fix
         is_local_ = false;
         block_by_lock_ = false;
         abort_if_oom_ = abort_if_oom;
+        allow_run_on_candidate_ = false;
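The pooled-request hazard flagged above is a general pattern: every `Reset` overload must restore every member to its default, or a recycled object carries the previous caller's state. A minimal sketch, with names that are illustrative rather than the actual AcquireCc interface:

```cpp
// Sketch of a pooled request object. If any Reset overload forgets a
// member, the next user of the pooled object inherits stale state.
struct PooledRequest
{
    bool allow_run_on_candidate_ = false;
    bool block_by_lock_ = false;

    // Overload without the flag must still clear it back to the default.
    void Reset(bool allow_run_on_candidate = false)
    {
        block_by_lock_ = false;
        allow_run_on_candidate_ = allow_run_on_candidate;
    }
};
```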
tx_service/src/cc/local_cc_shards.cpp (1)

3390-3497: ⚠️ Potential issue | 🟠 Major

Unpin should also run when term == 0.

With candidate-enabled flows, TryPinNodeGroupData can return 0 on success. The current if (ng_term > 0) skips unpinning in that case and can leak a pin. Switch to >= 0 to match the pin-success contract.

🛠️ Suggested fix
-    if (ng_term > 0)
+    if (ng_term >= 0)
     {
         Sharder::Instance().UnpinNodeGroupData(task->node_group_id_);
     }
Based on learnings: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded).
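The pin/unpin contract described above boils down to a boundary condition: success is signaled by a term `>= 0`, and with candidate-enabled flows `0` is now a legitimate success value, so a `> 0` guard leaks the pin. A tiny sketch with illustrative types:

```cpp
#include <cstdint>

// Stand-in for the result of a TryPinNodeGroupData-style call:
// term >= 0 means the pin succeeded; term < 0 means it failed.
struct PinResult
{
    int64_t term;
};

bool NeedsUnpin(const PinResult &r)
{
    // Using `> 0` here would skip the unpin when a candidate flow
    // returns term == 0, leaking the pin.
    return r.term >= 0;
}
```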
tx_service/src/cc/local_cc_handler.cpp (4)

108-123: ⚠️ Potential issue | 🟠 Major

Propagate allow_run_on_candidate to remote AcquireWrite.
Local requests pass the flag into AcquireCc::Reset, but the remote path drops it, so candidate leaders on other nodes may still be rejected.

Proposed fix
         remote_hd_.AcquireWrite(cc_shards_.node_id_,
                                 ng_id,
                                 table_name,
                                 schema_version,
                                 key,
                                 key_shard_code,
                                 tx_number,
                                 tx_term,
                                 command_id,
                                 ts,
                                 is_insert,
                                 hres,
                                 hd_res_idx,
                                 proto,
                                 iso_level,
-                                abort_if_oom);
+                                abort_if_oom,
+                                allow_run_on_candidate);

310-319: ⚠️ Potential issue | 🟠 Major

Propagate allow_run_on_candidate to remote PostWrite.
The local path forwards the flag, but the remote call omits it, creating inconsistent candidate-leader support.

Proposed fix
         remote_hd_.PostWrite(cc_shards_.node_id_,
                              tx_number,
                              tx_term,
                              command_id,
                              commit_ts,
                              cce_addr,
                              record,
                              operation_type,
                              key_shard_code,
-                             hres);
+                             hres,
+                             allow_run_on_candidate);

413-423: ⚠️ Potential issue | 🟠 Major

Propagate allow_run_on_candidate to remote PostRead.
PostReadCc::Reset gets the flag locally, but the remote call omits it, so candidate-term validation may fail remotely.

Proposed fix
         remote_hd_.PostRead(cc_shards_.node_id_,
                             tx_number,
                             tx_term,
                             command_id,
                             key_ts,
                             gap_ts,
                             commit_ts,
                             cce_addr,
                             hres,
                             need_remote_resp,
-                            post_read_type);
+                            post_read_type,
+                            allow_run_on_candidate);

792-807: ⚠️ Potential issue | 🟠 Major

Pass allow_run_on_candidate into the string-key ReadCc::Reset.
The TxKey overload forwards the flag, but the string-key overload stops at the archive pointer; candidate runs may still be rejected in request execution.

Proposed fix
     read_req->Reset(&table_name,
                     0,
                     key_str,
                     shard_code,
                     &record,
                     read_type,
                     tx_number,
                     tx_term,
                     ts,
                     &hres,
                     iso_level,
                     proto,
                     is_for_write,
                     false,
-                    nullptr);
+                    nullptr,
+                    allow_run_on_candidate);
🤖 Fix all issues with AI agents
In `@tx_service/include/cc/catalog_cc_map.h`:
- Around line 1143-1160: The code currently asserts ng_term > 0 after falling
back via Sharder::Instance() calls; replace this assertion with a runtime error
path: if ng_term <= 0, log a clear error referencing req.NodeGroupId() and the
failed lookups (NodeGroupTerm/StandbyNodeTerm/CandidateLeaderTerm), set the
function's error/Status return (or populate the response error fields used by
this module) and return early instead of proceeding to use ng_term; remove the
assert(ng_term > 0) and ensure no lock metadata is written when ng_term is
invalid.
- Around line 197-199: Replace the assert on ng_term with a runtime guard: check
req.NodeGroupTerm() into ng_term and if ng_term <= 0 return a proper error (or
take the same fallback path used by ReadCc) instead of crashing; update the
block around ng_term,
CODE_FAULT_INJECTOR("term_CatalogCcMap_Execute_PostWriteAllCc", ...) and
subsequent logic to handle the invalid term via the error/fallback path so
downstream code never receives a negative term.

In `@tx_service/include/cc/cluster_config_cc_map.h`:
- Around line 69-71: The assert on ng_term should be replaced with a graceful
validation like other cc_map implementations: after reading uint32_t ng_id =
req.NodeGroupId(); int64_t ng_term = req.NodeGroupTerm(); remove assert(ng_term
> 0) and instead check if (ng_term < 0) and return REQUESTED_NODE_NOT_LEADER (or
the function's error return consistent with catalog_cc_map.h / range_cc_map.h
patterns); ensure you reference ng_term and REQUESTED_NODE_NOT_LEADER so the
function exits cleanly when term is invalid rather than crashing.

In `@tx_service/include/cc/local_cc_handler.h`:
- Around line 110-121: The remote PostRead declaration and plumbing must be
updated to accept and forward the allow_run_on_candidate flag: modify the remote
handler's PostRead signature in remote_cc_handler.h to include the bool
allow_run_on_candidate parameter (matching local_cc_handler.h), update the
implementation that constructs the ValidateRequest RPC message to set the
corresponding field on ValidateRequest, and propagate the flag through any
intermediate call sites and RPC stubs (e.g., methods that build/send
ValidateRequest and the RPC handler) so remote PostRead honors
allow_run_on_candidate. Ensure symbols updated include PostRead,
allow_run_on_candidate, and ValidateRequest.

In `@tx_service/include/cc/local_cc_shards.h`:
- Around line 2450-2454: The compilation fails because
BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId use override but no matching virtual
exists in DataStoreHandler, and their parameter lists differ from
LocalCcShards::GetNextRangePartitionId; fix by adding a virtual declaration in
DataStoreHandler with the canonical signature used by LocalCcShards — bool
GetNextRangePartitionId(const TableName &tablename, const TableSchema
*table_schema, NodeGroupId ng_id, uint32_t range_cnt, int32_t
&out_next_partition_id) — or alternatively remove override from the handlers;
then update BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId to match that canonical parameter list
(add table_schema and ng_id, remove retry_count) so all declarations/definitions
are consistent.

In `@tx_service/include/cc/range_bucket_cc_map.h`:
- Around line 92-108: The code currently relies on assert(ng_term > 0) after
trying req.NodeGroupTerm(), CandidateLeaderTerm(), LeaderTerm() and
StandbyNodeTerm(); replace this assert with a runtime error path: after
computing ng_term check if ng_term <= 0 and if so log or construct a descriptive
error including req.NodeGroupId(), the original req.NodeGroupTerm(),
CandidateLeaderTerm(...), LeaderTerm(...), and StandbyNodeTerm(), then throw a
std::runtime_error (or return a failure status according to your error model)
instead of asserting so callers receive a deterministic error rather than
crashing or continuing with invalid metadata.

In `@tx_service/include/cc/range_cc_map.h`:
- Around line 220-236: The code currently asserts ng_term > 0 after trying
CandidateLeaderTerm, LeaderTerm and StandbyNodeTerm; replace the assert with a
proper early return that fails fast when no valid term exists: check if ng_term
<= 0 after those fallbacks and return an appropriate NOT_LEADER or RETRY error
code (the same error path the caller expects) instead of proceeding to call
AcquireCceKeyLock and constructing a CcEntryAddr; reference the
Sharder::Instance() calls and methods CandidateLeaderTerm, LeaderTerm,
StandbyNodeTerm and ensure the function returns the error before any use of
AcquireCceKeyLock/CcEntryAddr.

In `@tx_service/include/cc/template_cc_map.h`:
- Around line 202-203: Replace the assert-based term validation with a runtime
check: in the locations where you do "int64_t ng_term = req.NodeGroupTerm();"
(including the Execute() path and the other site around the current asserts), if
ng_term <= 0 return the same error used by ReadCc (REQUESTED_NODE_NOT_LEADER)
instead of asserting; update the function that returns the RPC/Status to
propagate REQUESTED_NODE_NOT_LEADER so callers handle it like ReadCc (see ReadCc
handling around lines ~1391-1406 for the exact behavior) and remove the
assert(ng_term > 0) usages.

In `@tx_service/include/data_sync_task.h`:
- Around line 186-189: The SetSyncOnLeader() setter and unsynchronized reads of
sync_on_leader_ in DataSyncTask can race; change the DataSyncTask member
sync_on_leader_ to std::atomic<bool> and update SetSyncOnLeader() to call
sync_on_leader_.store(sync_on_leader, std::memory_order_release) and replace all
unsynchronized reads (e.g., in finalization expressions like
task->sync_on_leader_ ? task->data_sync_ts_ : 0) with
sync_on_leader_.load(std::memory_order_acquire) so reads and writes are properly
synchronized.

In `@tx_service/src/cc/cc_req_misc.cpp`:
- Around line 1171-1199: The const method UpdateCceCkptTsCc::ValidTermCheck() is
missing the assert guard used by other const variants; add assert(term_ > 0);
near the start of ValidTermCheck (before the comparison against current_term) so
uninitialized/invalid term_ values are caught early; ensure the assertion is
placed in the body of UpdateCceCkptTsCc::ValidTermCheck() prior to the if
(current_term < 0 || current_term != term_) check and that <cassert> is
available in the translation unit.
🧹 Nitpick comments (2)
tx_service/src/cc/cc_shard.cpp (1)

1072-1075: Resolve the candidate-leader recovery TODO before merge.

Please either implement the intended candidate-term behavior here or document why recovery must remain leader-only. I can help draft the fix or open a tracking issue if needed.

tx_service/include/cc/range_cc_map.h (1)

615-618: Keep term resolution consistent for key‑cache init

Line 615‑618: this path now relies solely on req.NodeGroupTerm() with a hard assert(ng_term > 0) but doesn’t reuse the candidate/leader/standby fallback from ReadCc. If NodeGroupTerm isn’t populated for this path, the cache init will assert and bail. Consider reusing the same resolution helper or gracefully skipping cache init when the term is invalid.

Comment on lines +197 to 199
int64_t ng_term = req.NodeGroupTerm();
assert(ng_term > 0);
CODE_FAULT_INJECTOR("term_CatalogCcMap_Execute_PostWriteAllCc", {

⚠️ Potential issue | 🟠 Major

Guard invalid NodeGroupTerm() instead of assert‑only.

If a request arrives with a negative term, this assert can crash debug builds; in release builds the negative term is used downstream. Prefer a runtime error path (optionally mirroring the ReadCc fallback) so invalid terms are rejected cleanly.

🔧 Proposed fix (robust guard + fallback)
-        int64_t ng_term = req.NodeGroupTerm();
-        assert(ng_term > 0);
+        int64_t ng_term = req.NodeGroupTerm();
+        if (ng_term < 0)
+        {
+            if (req.AllowRunOnCandidate())
+            {
+                ng_term = Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
+            }
+            if (ng_term < 0)
+            {
+                ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
+                int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
+                ng_term = std::max(ng_term, standby_node_term);
+            }
+        }
+        if (ng_term <= 0)
+        {
+            req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            return true;
+        }

Comment on lines 1143 to 1160
     uint32_t ng_id = req.NodeGroupId();
-    int64_t ng_term = Sharder::Instance().LeaderTerm(ng_id);
-    ng_term = std::max(ng_term, Sharder::Instance().StandbyNodeTerm());
-
-    if (req.IsInRecovering())
+    int64_t ng_term = req.NodeGroupTerm();
+    if (ng_term < 0)
     {
-        ng_term = ng_term > 0
-                      ? ng_term
-                      : Sharder::Instance().CandidateLeaderTerm(ng_id);
+        if (req.AllowRunOnCandidate())
+        {
+            ng_term = Sharder::Instance().CandidateLeaderTerm(ng_id);
+        }
+        if (ng_term < 0)
+        {
+            ng_term = Sharder::Instance().LeaderTerm(ng_id);
+            int64_t standby_node_term =
+                Sharder::Instance().StandbyNodeTerm();
+            ng_term = std::max(ng_term, standby_node_term);
+        }
     }
+    assert(ng_term > 0);


⚠️ Potential issue | 🟠 Major

Handle unresolved term after fallback instead of asserting.

If the leader/standby/candidate lookup still yields a negative term, the current assert can crash and negative terms can leak into lock metadata in release builds. Return a proper error instead.

🔧 Proposed fix (runtime error on invalid term)
-        assert(ng_term > 0);
+        if (ng_term <= 0)
+        {
+            req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            return true;
+        }

Comment on lines 69 to +71
     uint32_t ng_id = req.NodeGroupId();
-    int64_t ng_term = Sharder::Instance().LeaderTerm(ng_id);
-
-    if (ng_term < 0)
-    {
-        hd_res->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
-        return true;
-    }
+    int64_t ng_term = req.NodeGroupTerm();
+    assert(ng_term > 0);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the file and context around lines 69-71
fd -e h -e cc -e cpp "cluster_config_cc_map" | head -5

Repository: eloqdata/tx_service

Length of output: 108


🏁 Script executed:

# Search for ValidTermCheck usage and NodeGroupTerm patterns
rg -n "ValidTermCheck" --max-count=20 | head -30

Repository: eloqdata/tx_service

Length of output: 2440


🏁 Script executed:

# Search for NodeGroupTerm usage and see the context
rg -n "NodeGroupTerm" -B2 -A2 --max-count=30 | head -50

Repository: eloqdata/tx_service

Length of output: 3600


🏁 Script executed:

# Look for AcquireAllCc method and surrounding validation
rg -n "AcquireAllCc" -B5 -A15 | head -60

Repository: eloqdata/tx_service

Length of output: 5083


🏁 Script executed:

# Read the cluster_config_cc_map.h file around lines 69-71 with broader context
cat tx_service/include/cc/cluster_config_cc_map.h | head -100

Repository: eloqdata/tx_service

Length of output: 4212


🏁 Script executed:

# Search for TryPinNodeGroupData usage in cluster_config_cc_map.h
rg -n "TryPinNodeGroupData" tx_service/include/cc/cluster_config_cc_map.h -B5 -A5

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if there's any validation on NodeGroupTerm in the request class
rg -n "class.*AcquireAllCc\|struct.*AcquireAllCc" -A30 | head -80

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look for where NodeGroupTerm is set/validated in AcquireAll request
rg -n "NodeGroupTerm\(\)" -B10 -A2 tx_service/include/cc/cluster_config_cc_map.h | head -50

Repository: eloqdata/tx_service

Length of output: 605


🏁 Script executed:

# Read more of the Execute method to see how ng_term is used
sed -n '61,150p' tx_service/include/cc/cluster_config_cc_map.h

Repository: eloqdata/tx_service

Length of output: 4187


🏁 Script executed:

# Check if AcquireAllCc has ValidTermCheck method
rg -n "class AcquireAllCc\|struct AcquireAllCc" -A40 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Search for where NodeGroupTerm is set/populated in the request creation
rg -n "NodeGroupTerm\(" --max-count=20 | grep -E "Set|Reset|="

Repository: eloqdata/tx_service

Length of output: 1216


🏁 Script executed:

# Check remote_cc_request.h for RemoteAcquireAll::Reset and ValidTermCheck
sed -n '85,200p' tx_service/include/remote/remote_cc_request.h

Repository: eloqdata/tx_service

Length of output: 2884


🏁 Script executed:

# Search for where ClusterConfigCcMap::Execute is called and what validation happens before
rg -n "ClusterConfigCcMap.*Execute\|cluster_config_ccm_\|cluster_config_ccm\|AcquireAllCc" tx_service/src/cc/local_cc_handler.cpp -B5 -A5 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check the critical finding: template_cc_map.h line 2379 checks ng_term < 0
sed -n '2375,2385p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 365


🏁 Script executed:

# Check what happens when term < 0 in other methods
rg -n "if.*ng_term.*<.*0\|if.*term.*<.*0" tx_service/include/cc/*.h | head -30

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Search for AcquireAllCc class definition to see if ValidTermCheck exists
rg -n "class AcquireAllCc\|struct AcquireAllCc" tx_service/include/cc/cc_request.h -A50 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if NodeGroupTerm() comes from the protobuf request message and what values it can have
rg -n "NodeGroupTerm\(\)" tx_service/include/cc/cc_request.h -B2 -A5 | head -40

Repository: eloqdata/tx_service

Length of output: 761


🏁 Script executed:

# Search for the base class AcquireAllCc and its NodeGroupTerm implementation
rg -n "class AcquireAllCc" tx_service/include/cc/cc_request.h -A80 | head -120

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check similar patterns in other cc_maps for comparison
rg -n "req\.NodeGroupTerm\(\)" tx_service/include/cc/*.h -B1 -A1 | grep -E "assert|if.*<|if.*error" | head -20

Repository: eloqdata/tx_service

Length of output: 1098


🏁 Script executed:

# Get the full AcquireAllCc class definition
rg -n "struct AcquireAllCc\|class AcquireAllCc" tx_service/include/cc/cc_request.h -A100 | head -150

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if the NodeGroupTerm_ is set from protobuf message without validation
rg -n "ng_term_\s*=" tx_service/include/cc/cc_request.h -B2 -A2 | head -40

Repository: eloqdata/tx_service

Length of output: 857


🏁 Script executed:

# Verify: Does Sharder::Instance().LeaderTerm() return negative on invalid state?
rg -n "LeaderTerm" tx_service/src/cc/local_cc_shards.cpp -B3 -A10 | head -60

Repository: eloqdata/tx_service

Length of output: 2533


Replace assert with graceful term validation to match other cc_map implementations.

This assert aborts the process in debug builds whenever ng_term ≤ 0 and is compiled out entirely in release builds, yet the codebase shows ng_term can legitimately be negative: (1) ng_term_ is initialized to -1 in cc_request.h and only conditionally patched later, (2) Sharder::Instance().LeaderTerm() can return negative values (see local_cc_shards.cpp:1594), and (3) other cc_maps such as catalog_cc_map.h, range_bucket_cc_map.h, and range_cc_map.h use graceful if (ng_term < 0) checks instead. Replace the assert with a proper validation check that returns REQUESTED_NODE_NOT_LEADER on an invalid term, consistent with the pattern used elsewhere in the codebase.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/cluster_config_cc_map.h` around lines 69 - 71, The
assert on ng_term should be replaced with a graceful validation like other
cc_map implementations: after reading uint32_t ng_id = req.NodeGroupId();
int64_t ng_term = req.NodeGroupTerm(); remove assert(ng_term > 0) and instead
check if (ng_term < 0) and return REQUESTED_NODE_NOT_LEADER (or the function's
error return consistent with catalog_cc_map.h / range_cc_map.h patterns); ensure
you reference ng_term and REQUESTED_NODE_NOT_LEADER so the function exits
cleanly when term is invalid rather than crashing.
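The graceful check this comment asks for can be sketched as follows. The types here (`CcErrorCode`, `CcResult`, `FakeRequest`) are reduced stand-ins for the real declarations in cc_request.h, so treat this as a pattern sketch under those assumptions, not the actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-ins for the real tx_service types (hypothetical).
enum class CcErrorCode { NO_ERROR, REQUESTED_NODE_NOT_LEADER };

struct CcResult {
    CcErrorCode code = CcErrorCode::NO_ERROR;
    void SetError(CcErrorCode c) { code = c; }
};

struct FakeRequest {
    int64_t ng_term;
    CcResult result;
    int64_t NodeGroupTerm() const { return ng_term; }
    CcResult *Result() { return &result; }
};

// Pattern: validate the term at runtime instead of assert(ng_term > 0),
// matching the `if (ng_term < 0)` checks in catalog_cc_map.h / range_cc_map.h.
bool Execute(FakeRequest &req) {
    int64_t ng_term = req.NodeGroupTerm();
    if (ng_term < 0) {
        // Exit cleanly instead of crashing: the caller sees NOT_LEADER
        // and can retry against the current leader.
        req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
        return true;  // request finished, with an error set
    }
    // ... proceed with a valid ng_term ...
    return true;
}
```

Unlike the assert, this path behaves identically in debug and release builds.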

Comment on lines +110 to +121
CcReqStatus PostRead(uint64_t tx_number,
int64_t tx_term,
uint16_t command_id,
uint64_t key_ts,
uint64_t gap_ts,
uint64_t commit_ts,
const CcEntryAddr &ccentry_addr,
CcHandlerResult<PostProcessResult> &hres,
bool is_local = false,
bool need_remote_resp = true,
PostReadType post_read_type = PostReadType::Release,
bool allow_run_on_candidate = false) override;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate RemoteCcHandler::PostRead and candidate flag handling
rg -n --type=cpp 'RemoteCcHandler::PostRead' -A 15
rg -n --type=cpp 'allow_run_on_candidate' --type=cpp

Repository: eloqdata/tx_service

Length of output: 9489


🏁 Script executed:

#!/bin/bash
# Check the header file for all PostRead overloads in RemoteCcHandler
echo "=== RemoteCcHandler header (remote_cc_handler.h) ==="
rg -n -B2 -A12 'class RemoteCcHandler' tx_service/include/remote/remote_cc_handler.h | head -50
echo ""
echo "=== All PostRead declarations in header ==="
rg -n -B2 -A10 'PostRead' tx_service/include/remote/remote_cc_handler.h

echo ""
echo "=== Check line 329 and context in remote_cc_handler.cpp ==="
sed -n '320,395p' tx_service/src/remote/remote_cc_handler.cpp

Repository: eloqdata/tx_service

Length of output: 4203


🏁 Script executed:

#!/bin/bash
echo "=== Full RemoteCcHandler::PostRead implementation ==="
sed -n '272,305p' tx_service/src/remote/remote_cc_handler.cpp

echo ""
echo "=== LocalCcHandler::PostRead signature from header ==="
sed -n '110,121p' tx_service/include/cc/local_cc_handler.h

echo ""
echo "=== Check if there are other PostRead overloads in remote handler ==="
rg -n 'PostRead' tx_service/include/remote/remote_cc_handler.h

Repository: eloqdata/tx_service

Length of output: 2415


Remote PostRead must propagate the allow_run_on_candidate parameter.

The local PostRead signature now includes allow_run_on_candidate, but the remote PostRead (declared in tx_service/include/remote/remote_cc_handler.h lines 103-114) is missing this parameter. The implementation does not set it on the ValidateRequest message. This means candidate-enabled PostReads routed remotely silently drop the flag, potentially allowing invalid operations on candidate nodes. Update the remote PostRead signature and RPC message plumbing to match the local API.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/local_cc_handler.h` around lines 110 - 121, The remote
PostRead declaration and plumbing must be updated to accept and forward the
allow_run_on_candidate flag: modify the remote handler's PostRead signature in
remote_cc_handler.h to include the bool allow_run_on_candidate parameter
(matching local_cc_handler.h), update the implementation that constructs the
ValidateRequest RPC message to set the corresponding field on ValidateRequest,
and propagate the flag through any intermediate call sites and RPC stubs (e.g.,
methods that build/send ValidateRequest and the RPC handler) so remote PostRead
honors allow_run_on_candidate. Ensure symbols updated include PostRead,
allow_run_on_candidate, and ValidateRequest.
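The plumbing the comment describes — a trailing flag on the handler API copied onto the outgoing RPC message — can be sketched like this. `ValidateRequest` and its field name are hypothetical stand-ins for the real protobuf message, whose actual field name may differ:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the protobuf ValidateRequest message.
struct ValidateRequest {
    bool allow_run_on_candidate = false;
    void set_allow_run_on_candidate(bool v) { allow_run_on_candidate = v; }
};

// Pattern: the remote handler accepts the same trailing flag as the local
// handler and sets it on the RPC message, so the receiving node applies the
// same candidate-term check instead of silently dropping the flag.
ValidateRequest BuildPostReadRequest(uint64_t tx_number,
                                     bool allow_run_on_candidate = false) {
    ValidateRequest msg;
    // ... fill tx_number, key_ts, commit_ts, etc. on the real message ...
    (void)tx_number;
    msg.set_allow_run_on_candidate(allow_run_on_candidate);
    return msg;
}
```

Defaulting the flag to `false` keeps existing call sites source-compatible while letting recovery paths opt in.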

Comment on lines 2450 to 2454
bool GetNextRangePartitionId(const TableName &tablename,
const TableSchema *table_schema,
NodeGroupId ng_id,
uint32_t range_cnt,
int32_t &out_next_partition_id);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, verify the function signature at the specified location
echo "=== Function signature at lines 2450-2454 ==="
sed -n '2450,2454p' tx_service/include/cc/local_cc_shards.h

echo -e "\n=== All occurrences of GetNextRangePartitionId ==="
rg -n -C3 'GetNextRangePartitionId\s*\(' --no-heading

Repository: eloqdata/tx_service

Length of output: 3970


🏁 Script executed:

#!/bin/bash
# Check if GetNextRangePartitionId is a virtual method in a base class
echo "=== Search for base class or interface definitions ==="
rg -n 'class.*Handler.*{' store_handler/ -A 5

echo -e "\n=== Check for virtual keyword in declarations ==="
rg -n 'virtual.*GetNextRangePartitionId' store_handler/

echo -e "\n=== Check bigtable_handler.h for method signatures ==="
sed -n '220,230p' store_handler/bigtable_handler.h

echo -e "\n=== Check dynamo_handler.h for method signatures ==="
sed -n '165,175p' store_handler/dynamo_handler.h

Repository: eloqdata/tx_service

Length of output: 1364


🏁 Script executed:

#!/bin/bash
# Find base class more simply
echo "=== Looking for StoreHandler or base class interface ==="
fd -n "\.h$" store_handler/ | head -5 | xargs -I {} sh -c 'echo "=== {} ==="; head -30 {}'

echo -e "\n=== Search for GetNextRangePartitionId in base/interface files ==="
rg -n 'GetNextRangePartitionId' store_handler/ -B 2 -A 1

echo -e "\n=== Check if there's a StoreHandler base class ==="
fd -n "store_handler" --maxdepth 1 -type f -name "*.h" -exec grep -l "class.*StoreHandler" {} \;

Repository: eloqdata/tx_service

Length of output: 1960


🏁 Script executed:

#!/bin/bash
# Find base class definition
echo "=== List store_handler header files ==="
find store_handler -maxdepth 1 -type f -name "*.h"

echo -e "\n=== Search for class declarations in store_handler ==="
rg 'class.*{' store_handler/ -m 1

echo -e "\n=== Check bigtable_handler.h class declaration ==="
head -50 store_handler/bigtable_handler.h | grep -A 20 'class'

echo -e "\n=== Check dynamo_handler.h class declaration ==="
head -50 store_handler/dynamo_handler.h | grep -A 20 'class'

echo -e "\n=== Look for store_handler base/interface ==="
find store_handler -name "*.h" | xargs grep -l 'virtual.*GetNextRangePartitionId'

Repository: eloqdata/tx_service

Length of output: 1262


🏁 Script executed:

#!/bin/bash
# Find DataStoreHandler base class definition
echo "=== Search for DataStoreHandler class definition ==="
rg 'class DataStoreHandler' -A 30

echo -e "\n=== Look for GetNextRangePartitionId in base class ==="
rg 'DataStoreHandler' -l | head -5

echo -e "\n=== Check store directory or include files ==="
find . -path ./tx_service -prune -o -name "*.h" -type f -exec grep -l 'DataStoreHandler' {} \; | head -10

Repository: eloqdata/tx_service

Length of output: 10061


🏁 Script executed:

#!/bin/bash
# Find GetNextRangePartitionId in the base class definition
echo "=== Extract GetNextRangePartitionId from base class ==="
rg -A 10 'GetNextRangePartitionId' tx_service/include/store/data_store_handler.h

echo -e "\n=== Summary of all signatures ==="
echo "LocalCcShards (tx_service/include/cc/local_cc_shards.h:2450):"
sed -n '2450,2454p' tx_service/include/cc/local_cc_shards.h

echo -e "\nBigTableHandler (store_handler/bigtable_handler.h:222):"
sed -n '222,225p' store_handler/bigtable_handler.h

echo -e "\nDynamoHandler (store_handler/dynamo_handler.h:168):"
sed -n '168,171p' store_handler/dynamo_handler.h

Repository: eloqdata/tx_service

Length of output: 1126


🏁 Script executed:

#!/bin/bash
# Search entire data_store_handler.h file for GetNextRangePartitionId
echo "=== Full content of data_store_handler.h ==="
wc -l tx_service/include/store/data_store_handler.h

echo -e "\n=== Search for GetNextRangePartitionId ==="
grep -n "GetNextRangePartitionId" tx_service/include/store/data_store_handler.h

echo -e "\n=== Check if DataStoreHandler has any method named GetNext ==="
grep -n "GetNext" tx_service/include/store/data_store_handler.h

echo -e "\n=== Get entire DataStoreHandler class definition ==="
sed -n '/^class DataStoreHandler/,/^}/p' tx_service/include/store/data_store_handler.h | head -150

Repository: eloqdata/tx_service

Length of output: 5725


🏁 Script executed:

#!/bin/bash
# Verify that GetNextRangePartitionId is not in the base class
echo "=== Full DataStoreHandler file length ==="
wc -l tx_service/include/store/data_store_handler.h

echo -e "\n=== Search entire file for GetNextRangePartitionId ==="
grep -c "GetNextRangePartitionId" tx_service/include/store/data_store_handler.h

echo -e "\n=== All usages of GetNextRangePartitionId in codebase ==="
rg 'GetNextRangePartitionId' --no-heading -l

echo -e "\n=== Check if handler methods are actually called ==="
rg 'handler.*GetNextRangePartitionId|GetNextRangePartitionId.*handler' -C 3

echo -e "\n=== Verify LocalCcShards call site ==="
sed -n '5648,5658p' tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 1035


Fix compilation errors: override keyword on non-existent base class methods.

BigTableHandler and DynamoHandler declare GetNextRangePartitionId with the override keyword, but this method does not exist in the base class DataStoreHandler. This causes a compilation error. Either remove the override keyword or define the method in the base class.

Additionally, the handler signatures are incompatible with LocalCcShards::GetNextRangePartitionId:

  • LocalCcShards: GetNextRangePartitionId(tablename, table_schema, ng_id, range_cnt, out_partition_id)
  • Handlers: GetNextRangePartitionId(tablename, range_cnt, out_partition_id, retry_count)

The handlers are missing table_schema and ng_id parameters while adding retry_count. Reconcile the signatures across all implementations.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/local_cc_shards.h` around lines 2450 - 2454, The
compilation fails because BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId use override but no matching virtual
exists in DataStoreHandler, and their parameter lists differ from
LocalCcShards::GetNextRangePartitionId; fix by adding a virtual declaration in
DataStoreHandler with the canonical signature used by LocalCcShards — bool
GetNextRangePartitionId(const TableName &tablename, const TableSchema
*table_schema, NodeGroupId ng_id, uint32_t range_cnt, int32_t
&out_next_partition_id) — or alternatively remove override from the handlers;
then update BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId to match that canonical parameter list
(add table_schema and ng_id, remove retry_count) so all declarations/definitions
are consistent.
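The first remedy — declaring the canonical virtual in the base class so `override` is compiler-checked — looks roughly like this. `TableName`, `TableSchema`, and `NodeGroupId` are simplified stand-ins for the real tx_service types:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-ins (hypothetical) for the real tx_service types.
using TableName = std::string;
struct TableSchema {};
using NodeGroupId = uint32_t;

// Pattern: declare the canonical signature once in the base class so each
// handler's `override` is verified by the compiler; a mismatched parameter
// list then fails at compile time instead of silently shadowing.
struct DataStoreHandler {
    virtual ~DataStoreHandler() = default;
    virtual bool GetNextRangePartitionId(const TableName &tablename,
                                         const TableSchema *table_schema,
                                         NodeGroupId ng_id,
                                         uint32_t range_cnt,
                                         int32_t &out_next_partition_id) = 0;
};

struct FakeHandler : DataStoreHandler {
    // `override` compiles because the base declares a matching virtual.
    bool GetNextRangePartitionId(const TableName &,
                                 const TableSchema *,
                                 NodeGroupId,
                                 uint32_t range_cnt,
                                 int32_t &out_next_partition_id) override {
        out_next_partition_id = static_cast<int32_t>(range_cnt) + 1;
        return true;
    }
};
```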

Comment on lines +92 to +108
int64_t ng_term = req.NodeGroupTerm();
if (ng_term < 0)
{
req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
return true;
if (req.AllowRunOnCandidate())
{
ng_term =
Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
}
if (ng_term < 0)
{
ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
int64_t standby_node_term =
Sharder::Instance().StandbyNodeTerm();
ng_term = std::max(ng_term, standby_node_term);
}
}
assert(ng_term > 0);

⚠️ Potential issue | 🟠 Major

Avoid assert‑only handling for unresolved ng_term.

If all fallbacks still yield a negative term, the assert can crash or (in release) proceed with invalid metadata. Prefer a runtime error.

🔧 Proposed fix (runtime error on invalid term)
-        assert(ng_term > 0);
+        if (ng_term <= 0)
+        {
+            req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            return true;
+        }
🤖 Prompt for AI Agents
In `@tx_service/include/cc/range_bucket_cc_map.h` around lines 92 - 108, The code
currently relies on assert(ng_term > 0) after trying req.NodeGroupTerm(),
CandidateLeaderTerm(), LeaderTerm() and StandbyNodeTerm(); replace this assert
with a runtime error path: after computing ng_term check if ng_term <= 0 and if
so log or construct a descriptive error including req.NodeGroupId(), the
original req.NodeGroupTerm(), CandidateLeaderTerm(...), LeaderTerm(...), and
StandbyNodeTerm(), then throw a std::runtime_error (or return a failure status
according to your error model) instead of asserting so callers receive a
deterministic error rather than crashing or continuing with invalid metadata.

Comment on lines +220 to +236
int64_t ng_term = req.NodeGroupTerm();
if (ng_term < 0)
{
hd_result->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
return true;
if (req.AllowRunOnCandidate())
{
ng_term =
Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
}
if (ng_term < 0)
{
ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
int64_t standby_node_term =
Sharder::Instance().StandbyNodeTerm();
ng_term = std::max(ng_term, standby_node_term);
}
}
assert(ng_term > 0);

⚠️ Potential issue | 🟠 Major

Handle missing node‑group term without relying on assert

Lines 220-236: after the candidate/leader/standby fallback, ng_term can still be < 0 when no valid term exists (e.g., during elections). In release builds, where the assert is compiled out, this value flows into AcquireCceKeyLock and the returned CcEntryAddr, risking invalid lock ownership. Prefer returning a NOT_LEADER/RETRY error before proceeding.

💡 Proposed guard to fail fast when no valid term is available
         if (ng_term < 0)
         {
             if (req.AllowRunOnCandidate())
             {
                 ng_term =
                     Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
             }
             if (ng_term < 0)
             {
                 ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
                 int64_t standby_node_term =
                     Sharder::Instance().StandbyNodeTerm();
                 ng_term = std::max(ng_term, standby_node_term);
             }
         }
-        assert(ng_term > 0);
+        if (ng_term < 0)
+        {
+            hd_result->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            hd_result->SetFinished();
+            return true;
+        }
+        assert(ng_term > 0);
🤖 Prompt for AI Agents
In `@tx_service/include/cc/range_cc_map.h` around lines 220 - 236, The code
currently asserts ng_term > 0 after trying CandidateLeaderTerm, LeaderTerm and
StandbyNodeTerm; replace the assert with a proper early return that fails fast
when no valid term exists: check if ng_term <= 0 after those fallbacks and
return an appropriate NOT_LEADER or RETRY error code (the same error path the
caller expects) instead of proceeding to call AcquireCceKeyLock and constructing
a CcEntryAddr; reference the Sharder::Instance() calls and methods
CandidateLeaderTerm, LeaderTerm, StandbyNodeTerm and ensure the function returns
the error before any use of AcquireCceKeyLock/CcEntryAddr.

Comment on lines +202 to +203
int64_t ng_term = req.NodeGroupTerm();
assert(ng_term > 0);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual code at the specified lines
cat -n tx_service/include/cc/template_cc_map.h | sed -n '200,210p'
cat -n tx_service/include/cc/template_cc_map.h | sed -n '690,705p'

Repository: eloqdata/tx_service

Length of output: 1259


🏁 Script executed:

# Search for ValidTermCheck function definition and usage
rg -n "ValidTermCheck" -A 5 tx_service/

Repository: eloqdata/tx_service

Length of output: 23652


🏁 Script executed:

# Find the broader context of AcquireCc and AcquireAllCc to understand the term lifecycle
rg -n "void AcquireCc\|void AcquireAllCc" -A 20 tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check for callers of AcquireCc to understand where NodeGroupTerm comes from
rg -n "\.AcquireCc\(|\.AcquireAllCc\(" -B 5 -A 2 tx_service/

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Find the Execute method that contains line 202-203
sed -n '175,250p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 3430


🏁 Script executed:

# Look for the request class definition and its entry point
rg -n "class.*Acquire.*Cc" tx_service/include/cc/template_cc_map.h | head -20

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if there's a base class Execute that calls ValidTermCheck before AcquireCc
rg -n "bool Execute" tx_service/include/cc/template_cc_map.h | head -10

Repository: eloqdata/tx_service

Length of output: 566


🏁 Script executed:

# Search for how requests are dispatched to AcquireCc - look at request lifecycle
rg -n "req.NodeGroupTerm()" tx_service/include/cc/template_cc_map.h -B 5 -A 10 | head -80

Repository: eloqdata/tx_service

Length of output: 2864


🏁 Script executed:

# Look at base request class to see if there's pre-Execute validation
sed -n '115,145p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 755


🏁 Script executed:

# Check if Execute() is called directly or through a wrapper that validates
rg -n "ValidTermCheck" tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Search for how Execute() is invoked - look for dispatch patterns
rg -n "\.Execute\(" tx_service/include/cc/ -B 2 | grep -A 2 "AcquireCc\|AcquireAllCc" | head -40

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check the CcRequest base class wrapper
rg -n "class.*AcquireCc\b" tx_service/include/cc/cc_request.h -A 30 | head -50

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Get more context from the base class wrapper to see exact flow
sed -n '130,165p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 1244


🏁 Script executed:

# Check what NodeGroupTerm() returns - look for its definition
rg -n "NodeGroupTerm()" tx_service/include/cc/ --type-list | head -5

Repository: eloqdata/tx_service

Length of output: 452


🏁 Script executed:

# Find where NodeGroupTerm is set on the request object
rg -n "SetNodeGroupTerm|NodeGroupTerm_|ng_term_" tx_service/include/cc/cc_request.h | head -20

Repository: eloqdata/tx_service

Length of output: 980


🏁 Script executed:

# Verify: does NodeGroupTerm() return the validated ng_term_ or something else?
rg -n "int64_t NodeGroupTerm()" tx_service/include/cc/ -A 3

Repository: eloqdata/tx_service

Length of output: 1491


🏁 Script executed:

# Look at how the base class wrapper invokes the derived Execute
sed -n '145,165p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 958


🏁 Script executed:

# Verify the flow: TemplatedCcRequest::Execute calls ValidTermCheck, then invokes typed_req->Execute
rg -n "typed_req->Execute" tx_service/include/cc/cc_request.h -B 5 -A 2

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look at PostWriteCc Execute to see if it uses runtime guards like ReadCc
sed -n '478,520p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 1631


🏁 Script executed:

# Confirm the exact flow: does TemplatedCcRequest::Execute call ValidTermCheck before invoking typed_req->Execute?
sed -n '136,180p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 2111


🏁 Script executed:

# Check what happens at the call site - is typed_req->Execute called after ValidTermCheck returns true?
sed -n '136,170p' tx_service/include/cc/cc_request.h | tail -35

Repository: eloqdata/tx_service

Length of output: 1418


🏁 Script executed:

# Now check if AcquireCc is overriding or if it's directly called
# Look for class AcquireCc definition
rg -n "^class AcquireCc" tx_service/include/cc/ -A 10

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Critical: Check if req.NodeGroupTerm() can change between ValidTermCheck and Execute
# Look at what NodeGroupTerm represents - is it a volatile value or cached?
rg -n "ng_term_\s*=" tx_service/include/cc/cc_request.h | head -20

Repository: eloqdata/tx_service

Length of output: 889


🏁 Script executed:

# Confirm: Does ValidTermCheck set ng_term_ for ALL request types including AcquireCc?
rg -n "class AcquireCc\b" tx_service/include/cc/cc_request.h -A 40 | head -50

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check the fault injector test more carefully - it sets ng_term = -1 AFTER ValidTermCheck
# This proves ng_term can become invalid even after validation
sed -n '200,220p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 795


🏁 Script executed:

# Check if there are any other places where ng_term is reassigned after ValidTermCheck
rg -n "ng_term\s*=" tx_service/include/cc/template_cc_map.h | grep -A 5 -B 5 "202\|696"

Repository: eloqdata/tx_service

Length of output: 507


🏁 Script executed:

# Compare consistency: PostWriteCc doesn't have assert for ng_term. Check what it does.
rg -n "ng_term" tx_service/include/cc/template_cc_map.h | head -30

Repository: eloqdata/tx_service

Length of output: 1583


🏁 Script executed:

# Final check: look at the learning pattern - unpinning should only happen if term >= 0
# Does AcquireCc or AcquireAllCc call any pin/unpin that should be guarded?
rg -n "TryPinNodeGroupData\|UnpinNodeGroupData" tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 45


Add runtime guard for invalid ng_term instead of relying on assert.

Lines 202-203 and 695-697 use assertions for term validation, which are compiled out in release builds and therefore let an invalid term pass silently. The fault injector test (lines 204-209) explicitly demonstrates that ng_term can become invalid within Execute(). Use a runtime check like ReadCc does (lines 1391-1406): if ng_term <= 0, return a REQUESTED_NODE_NOT_LEADER error rather than proceeding. This aligns with the codebase pattern for term safety and matches the fault injector's expected error handling.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/template_cc_map.h` around lines 202 - 203, Replace the
assert-based term validation with a runtime check: in the locations where you do
"int64_t ng_term = req.NodeGroupTerm();" (including the Execute() path and the
other site around the current asserts), if ng_term <= 0 return the same error
used by ReadCc (REQUESTED_NODE_NOT_LEADER) instead of asserting; update the
function that returns the RPC/Status to propagate REQUESTED_NODE_NOT_LEADER so
callers handle it like ReadCc (see ReadCc handling around lines ~1391-1406 for
the exact behavior) and remove the assert(ng_term > 0) usages.

Comment on lines +186 to +189
void SetSyncOnLeader(bool sync_on_leader)
{
sync_on_leader_ = sync_on_leader;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd tx_service && find . -name "data_sync_task.h" -type f

Repository: eloqdata/tx_service

Length of output: 89


🏁 Script executed:

wc -l tx_service/include/data_sync_task.h

Repository: eloqdata/tx_service

Length of output: 102


🏁 Script executed:

sed -n '1,100p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 4097


🏁 Script executed:

sed -n '180,250p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 2728


🏁 Script executed:

rg "SetSyncOnLeader" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 1058


🏁 Script executed:

rg "sync_on_leader_" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 3431


🏁 Script executed:

sed -n '240,250p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 403


🏁 Script executed:

rg "SetSyncOnLeader" --type cpp -B 5 -A 5

Repository: eloqdata/tx_service

Length of output: 2508


🏁 Script executed:

rg "sync_on_leader_.*?" --type cpp -B 3 -A 3 | head -100

Repository: eloqdata/tx_service

Length of output: 4643


🏁 Script executed:

rg "SetSyncOnLeader" --type cpp -B 15 -A 5

Repository: eloqdata/tx_service

Length of output: 4562


🏁 Script executed:

rg "sync_on_leader_" --type cpp -B 5 -A 5 | head -150

Repository: eloqdata/tx_service

Length of output: 7268


🏁 Script executed:

rg "std::shared_ptr.*DataSyncTask" --type cpp --type h -A 3 | head -80

Repository: eloqdata/tx_service

Length of output: 5275


🏁 Script executed:

rg "DataSyncForRangePartition|DataSyncForHashPartition" --type cpp -B 3 -A 20 | head -150

Repository: eloqdata/tx_service

Length of output: 10230


🏁 Script executed:

grep -n "SetSyncOnLeader" tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 176


🏁 Script executed:

sed -n '1,50p' tx_service/src/cc/local_cc_shards.cpp | head -50

Repository: eloqdata/tx_service

Length of output: 1573


🏁 Script executed:

rg "data_sync_task_queue_" --type cpp --type h -B 2 -A 2 | head -80

Repository: eloqdata/tx_service

Length of output: 5807


🏁 Script executed:

sed -n '3620,3650p' tx_service/src/cc/local_cc_shards.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 1581


🏁 Script executed:

sed -n '4680,4710p' tx_service/src/cc/local_cc_shards.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 1596


🏁 Script executed:

rg "void.*DataSyncForRangePartition" --type cpp -A 30 | head -80

Repository: eloqdata/tx_service

Length of output: 5032


🏁 Script executed:

sed -n '180,250p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 2728


🏁 Script executed:

rg "void.*Sync.*Adjustable|void.*Set.*OnLeader" --type h --type cpp -B 2 -A 5

Repository: eloqdata/tx_service

Length of output: 754


🏁 Script executed:

rg "last_sync_ts.*=.*sync_on_leader_" --type cpp -B 10 | head -100

Repository: eloqdata/tx_service

Length of output: 911


🏁 Script executed:

rg "PostProcessFlushTaskEntries|FinishDataSyncTask" --type cpp --type h -B 2 -A 10 | head -120

Repository: eloqdata/tx_service

Length of output: 2966


🏁 Script executed:

rg "FinishDataSync|SetScanTaskFinished|SetFinish" --type cpp -B 3 -A 8 | head -100

Repository: eloqdata/tx_service

Length of output: 5644


🏁 Script executed:

rg "void.*Worker\(\)" --type cpp --type h | grep -i "data\|sync\|flush" | head -20

Repository: eloqdata/tx_service

Length of output: 898


🏁 Script executed:

sed -n '4700,4750p' tx_service/src/cc/local_cc_shards.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 2323


🏁 Script executed:

rg "sync_on_leader_.*\?" --type cpp -B 2 -A 5 | head -80

Repository: eloqdata/tx_service

Length of output: 2741


🏁 Script executed:

rg "need_update_ckpt_ts_" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 1370


Guard sync_on_leader_ updates against data races.

The non-atomic write in SetSyncOnLeader() (lines 186-189) can race with reads in finalization code paths. While SetSyncOnLeader() is called during worker thread execution in DataSyncForRangePartition(), the DataSyncTask object is shared via shared_ptr across threads. The reads of sync_on_leader_ in finalization (e.g., task->sync_on_leader_ ? task->data_sync_ts_ : 0) occur without synchronization.

The same issue applies to the member declaration itself (lines 246-248), where sync_on_leader_ is a plain bool with no atomic or mutex protection. Use std::atomic<bool> or guard access with a mutex to ensure thread safety.

🤖 Prompt for AI Agents
In `@tx_service/include/data_sync_task.h` around lines 186 - 189, The
SetSyncOnLeader() setter and unsynchronized reads of sync_on_leader_ in
DataSyncTask can race; change the DataSyncTask member sync_on_leader_ to
std::atomic<bool> and update SetSyncOnLeader() to call
sync_on_leader_.store(sync_on_leader, std::memory_order_release) and replace all
unsynchronized reads (e.g., in finalization expressions like
task->sync_on_leader_ ? task->data_sync_ts_ : 0) with
sync_on_leader_.load(std::memory_order_acquire) so reads and writes are properly
synchronized.
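The atomic variant the prompt describes can be sketched as below. Member names follow data_sync_task.h, but the class is reduced to the two relevant members, and `CkptTsToPersist` is a hypothetical helper wrapping the finalization expression `sync_on_leader_ ? data_sync_ts_ : 0`:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct DataSyncTask {
    // std::atomic<bool> with release/acquire ordering makes the worker-thread
    // write in SetSyncOnLeader() visible to finalization-path readers.
    std::atomic<bool> sync_on_leader_{true};
    uint64_t data_sync_ts_ = 0;

    void SetSyncOnLeader(bool sync_on_leader) {
        sync_on_leader_.store(sync_on_leader, std::memory_order_release);
    }

    // Finalization read: previously the raw `sync_on_leader_ ? data_sync_ts_ : 0`.
    uint64_t CkptTsToPersist() const {
        return sync_on_leader_.load(std::memory_order_acquire) ? data_sync_ts_
                                                               : 0;
    }
};
```

For a lone bool flag with no invariant spanning multiple members, the atomic is cheaper than a mutex; a mutex would only be needed if sync_on_leader_ must change together with data_sync_ts_ atomically.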

Comment on lines +1171 to +1199
bool UpdateCceCkptTsCc::ValidTermCheck() const
{
int64_t ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
int64_t candidate_ng_term =
Sharder::Instance().CandidateLeaderTerm(node_group_id_);
ng_term = std::max(ng_term, candidate_ng_term);
int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
int64_t current_term = std::max(ng_term, standby_node_term);

if (current_term < 0 || current_term != term_)
{
LOG(INFO)
<< "UpdateCceCkptTsCc::ValidTermCheck failed with current term: "
<< current_term << ", term_: " << term_;
return false;
}

return true;
}

bool UpdateCceCkptTsCc::Execute(CcShard &ccs)
{
if (!ValidTermCheck())
{
LOG(INFO) << "UpdateCceCkptTsCc::ValidTermCheck failed on shard: "
<< ccs.core_id_;
SetFinished();
return false;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find the UpdateCceCkptTsCc class definition and understand term_ initialization
cd tx_service && git ls-files | grep -E "\.(h|hpp|cpp)$" | head -20

Repository: eloqdata/tx_service

Length of output: 593


🏁 Script executed:

# Look at the header file for cc_req_misc to understand the class structure
rg -A 30 "class UpdateCceCkptTsCc" --type cpp --type h

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Find other ValidTermCheck implementations to understand the pattern
rg "ValidTermCheck\(" --type cpp --type h -B 2 -A 10

Repository: eloqdata/tx_service

Length of output: 44362


🏁 Script executed:

# Search for callers of UpdateCceCkptTsCc constructor to see how term_ is set
rg "UpdateCceCkptTsCc" --type cpp --type h -B 3 -A 3

Repository: eloqdata/tx_service

Length of output: 5534


🏁 Script executed:

# Find the full UpdateCceCkptTsCc constructor definition
rg -A 20 "UpdateCceCkptTsCc\(" tx_service/include/cc/cc_req_misc.h | head -40

Repository: eloqdata/tx_service

Length of output: 1131


🏁 Script executed:

# Search for where UpdateCceCkptTsCc is constructed
rg "UpdateCceCkptTsCc\s*\(" --type cpp --type h -B 2 -A 5

Repository: eloqdata/tx_service

Length of output: 1091


🏁 Script executed:

# Check the construction site more carefully
rg -B 5 -A 10 "UpdateCceCkptTsCc update_cce_req" --type cpp

Repository: eloqdata/tx_service

Length of output: 1336


🏁 Script executed:

# Check data_sync_task structure for node_group_term_ initialization
rg -A 30 "struct DataSyncTask" tx_service/include/data_sync_task.h | head -50

Repository: eloqdata/tx_service

Length of output: 1647


🏁 Script executed:

# Look for where node_group_term_ is assigned in data_sync_task
rg "node_group_term_" tx_service/include/data_sync_task.h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 616


🏁 Script executed:

# Check if there's documentation or patterns showing when node_group_term_ is set to -1 vs positive values
rg "node_group_term_.*=" tx_service/include/data_sync_task.h -B 3 -A 3

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look for where DataSyncTask is instantiated to see what ng_term is passed
rg "DataSyncTask\(" --type cpp -B 3 -A 8 | head -60

Repository: eloqdata/tx_service

Length of output: 4315


🏁 Script executed:

# Check other const ValidTermCheck implementations to understand the pattern better
rg -A 8 "bool ValidTermCheck\(\) const" --type h

Repository: eloqdata/tx_service

Length of output: 1317


🏁 Script executed:

# Check if DataSyncTask constructor validates ng_term > 0
rg -A 50 "DataSyncTask::DataSyncTask" tx_service/src/cc/local_cc_shards.cpp | head -80

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check calls to EnqueueRangeDataSyncTask and EnqueueHashPartitionDataSyncTask to see what ng_term is passed
rg "EnqueueRangeDataSyncTask\|EnqueueHashPartitionDataSyncTask" --type cpp -B 3 -A 3 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Specifically search for assertions or checks on node_group_term_ or ng_term in data_sync_task initialization
rg "assert.*ng_term\|assert.*node_group_term_" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look at the context where UpdateCceCkptTsCc is constructed more carefully
rg -B 20 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | tail -30

Repository: eloqdata/tx_service

Length of output: 963


🏁 Script executed:

# Check the data_sync_task that's being passed
rg -B 50 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | grep -A 50 "if (cce_entries_map.size" | head -20

Repository: eloqdata/tx_service

Length of output: 182


🏁 Script executed:

# Look for the actual PostProcessFlushTaskEntries function context
rg -B 10 -A 30 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | head -60

Repository: eloqdata/tx_service

Length of output: 1423


🏁 Script executed:

# Get more context of PostProcessFlushTaskEntries to see how ng_term is handled
rg -A 80 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | head -100

Repository: eloqdata/tx_service

Length of output: 3355


🏁 Script executed:

# Look for where UpdateCceCkptTsCc is enqueued with this ng_term
rg -A 150 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | grep -A 30 "UpdateCceCkptTsCc"

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Get the full PostProcessFlushTaskEntries function
rg -A 250 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | tail -150

Repository: eloqdata/tx_service

Length of output: 5846


🏁 Script executed:

# Search more specifically for UpdateCceCkptTsCc in the function
sed -n '2949,3188p' tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 8890


🏁 Script executed:

# Let's find where UpdateCceCkptTsCc is instantiated to see what term values are passed
rg -B 10 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | head -25

Repository: eloqdata/tx_service

Length of output: 545


🏁 Script executed:

# Check if there's any code that could create a DataSyncTask with a -1 term
rg "node_group_term_\s*=" tx_service/src/cc/local_cc_shards.cpp | head -20

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check term_ field in UpdateCceCkptTsCc to see if there's any assertion or check
rg -B 5 -A 15 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | tail -30

Repository: eloqdata/tx_service

Length of output: 547


🏁 Script executed:

# Get the complete instantiation of UpdateCceCkptTsCc with all parameters
rg -A 4 "UpdateCceCkptTsCc update_cce_req\(" tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 324


🏁 Script executed:

# Check if the data_sync_task is guaranteed to have a positive term by searching how it's constructed
rg -B 20 "EnqueueRangeDataSyncTask" tx_service/src/cc/local_cc_shards.cpp | grep -A 20 "DataSyncTask"

Repository: eloqdata/tx_service

Length of output: 1334


🏁 Script executed:

# Look for all places where UpdateCceCkptTsCc is instantiated
rg "UpdateCceCkptTsCc" tx_service/src/cc/local_cc_shards.cpp -B 5 -A 8

Repository: eloqdata/tx_service

Length of output: 1563


🏁 Script executed:

# Check if there are any DataSyncTask creations with ng_term that might be -1 or uninitialized
rg "DataSyncTask\(" tx_service/src/cc/local_cc_shards.cpp -A 8 | grep -B 5 -A 3 "ng_term"

Repository: eloqdata/tx_service

Length of output: 1900


🏁 Script executed:

# Look for the actual field declarations in UpdateCceCkptTsCc to see term_ initialization
rg -A 50 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | grep -A 5 "term_\|int64_t"

Repository: eloqdata/tx_service

Length of output: 764


🏁 Script executed:

# Check if there's an assertion or check before UpdateCceCkptTsCc is constructed
rg -B 30 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | grep -A 30 "if (cce_entries_map.size"

Repository: eloqdata/tx_service

Length of output: 182


🏁 Script executed:

# Check if there's any place where node_group_term_ might be left as -1 in DataSyncTask before being used for UpdateCceCkptTsCc
rg -B 5 -A 5 "data_sync_task_->node_group_term_" tx_service/src/cc/local_cc_shards.cpp | head -40

Repository: eloqdata/tx_service

Length of output: 554


🏁 Script executed:

# Check if term_ field in UpdateCceCkptTsCc has any member initialization or assertions
rg -A 100 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | tail -50

Repository: eloqdata/tx_service

Length of output: 1235


🏁 Script executed:

# Verify if there's any code path where UpdateCceCkptTsCc can be created with term_ < 0
rg "UpdateCceCkptTsCc" tx_service --type cpp -B 15 -A 3 | grep -A 15 "node_group_term_"

Repository: eloqdata/tx_service

Length of output: 240


🏁 Script executed:

# Check if term_ could potentially be -1 by looking at how DataSyncTask is initialized
rg "DataSyncTask(" tx_service/src/cc/local_cc_shards.cpp -A 12 | grep -B 5 -A 7 "ng_term"

Repository: eloqdata/tx_service

Length of output: 139


🏁 Script executed:

# Verify the concern: check if there's any scenario where term_ is not explicitly set or could be -1
rg -A 5 "node_group_term_\{" tx_service/include/data_sync_task.h

Repository: eloqdata/tx_service

Length of output: 185


🏁 Script executed:

# Found the issue! node_group_term_ defaults to -1. Now let's check if DataSyncTask is ever created without explicitly setting ng_term
rg "std::make_shared<DataSyncTask>" tx_service/src/cc/local_cc_shards.cpp -A 15 | head -50

Repository: eloqdata/tx_service

Length of output: 2177


🏁 Script executed:

# Check all DataSyncTask constructor calls to see if ng_term is always provided
rg "make_shared.*DataSyncTask\|std::shared_ptr.*DataSyncTask" tx_service/src/cc/local_cc_shards.cpp -A 5

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look at the DataSyncTask constructors more carefully
rg -A 30 "DataSyncTask::" tx_service/src/cc/data_sync_task.cpp | head -80

Repository: eloqdata/tx_service

Length of output: 139


🏁 Script executed:

# Let's check what values are passed for ng_term when EnqueueRangeDataSyncTask is called
rg "EnqueueRangeDataSyncTask" tx_service/src/cc/local_cc_shards.cpp -B 5 -A 1 | grep -B 5 "ng_"

Repository: eloqdata/tx_service

Length of output: 556


🏁 Script executed:

# Check if there's any DataSyncTask created without explicit ng_term parameter (would use default -1)
rg "std::make_shared.*DataSyncTask\|DataSyncTask(" tx_service/include/data_sync_task.h -A 15 | head -60

Repository: eloqdata/tx_service

Length of output: 203


🏁 Script executed:

# Review the key creation site again - does node_group_term_ ever stay at default -1?
rg -B 30 "if \(cce_entries_map.size\(\) > 0\)" tx_service/src/cc/local_cc_shards.cpp | tail -40

Repository: eloqdata/tx_service

Length of output: 1316


Align ValidTermCheck() with const-variant patterns by adding an assertion.

The const ValidTermCheck() lacks a safeguard that similar implementations in cc_request.h use. Add assert(term_ > 0); before the term comparison to catch cases where term_ might not be properly initialized, consistent with how other const ValidTermCheck() methods guard against uninitialized stored terms.

🤖 Prompt for AI Agents
In `@tx_service/src/cc/cc_req_misc.cpp` around lines 1171 - 1199, The const method
UpdateCceCkptTsCc::ValidTermCheck() is missing the assert guard used by other
const variants; add assert(term_ > 0); near the start of ValidTermCheck (before
the comparison against current_term) so uninitialized/invalid term_ values are
caught early; ensure the assertion is placed in the body of
UpdateCceCkptTsCc::ValidTermCheck() prior to the if (current_term < 0 ||
current_term != term_) check and that <cassert> is available in the translation
unit.
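A minimal sketch of the suggested guard, assuming the same semantics as the surrounding review (stored `term_` defaults to an invalid value and must be set before the check runs). The free-function form and parameter names are illustrative, not the actual method signature.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative version of the guarded check: assert that the stored term
// was initialized (default is -1 in the codebase) before comparing it
// against the resolved current term.
bool ValidTermCheck(int64_t term_, int64_t current_term)
{
    assert(term_ > 0);  // catch an uninitialized stored term early
    return current_term >= 0 && current_term == term_;
}
```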

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/include/cc/cc_request.h (1)

4167-4204: ⚠️ Potential issue | 🟠 Major

RunOnCandidateNode flag is currently unused in term checks.

RangePartitionDataSyncScanCc stores run_on_candidate_node_ but ValidTermCheck always includes candidate term. This undermines the flag’s purpose and can allow candidate execution even when not intended.

Possible fix
-        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
-        int64_t candidate_ng_term =
-            Sharder::Instance().CandidateLeaderTerm(node_group_id_);
-        int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
-        int64_t current_term =
-            std::max({cc_ng_term, standby_node_term, candidate_ng_term});
+        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
+        int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
+        int64_t current_term = std::max(cc_ng_term, standby_node_term);
+        if (run_on_candidate_node_)
+        {
+            int64_t candidate_ng_term =
+                Sharder::Instance().CandidateLeaderTerm(node_group_id_);
+            current_term = std::max(current_term, candidate_ng_term);
+        }

Also applies to: 4239-4261, 4324-4355, 4560-4563, 4739-4739

🤖 Fix all issues with AI agents
In `@tx_service/include/cc/cc_request.h`:
- Around line 565-585: The ValidTermCheck currently unconditionally accepts
candidate leader terms by taking the max of Sharder::Instance().LeaderTerm and
CandidateLeaderTerm; add an explicit boolean flag (e.g., allow_run_on_candidate_
or a parameter to ValidTermCheck) and change ValidTermCheck (and any callers) to
only consider CandidateLeaderTerm when that flag is true; update callers such as
AcquireAllCc and PostWriteAllCc to pass/set the flag only when they should
permit running on candidate leaders (mirroring AcquireCc's behavior), and ensure
ng_term_ and the comparison logic remain unchanged when the flag is false so
candidate terms are not accepted implicitly.

In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3201-3204: The CheckLeaderTerm call currently passes 'true'
unconditionally which permits candidate-term acceptance; change the call to only
allow candidate term when the task explicitly allows it (e.g., use
!task->sync_on_leader_ or an explicit allow_candidate flag) so leader-only tasks
still fail fast—update the call site
Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
task->node_group_term_, /*allow_candidate=*/!task->sync_on_leader_) and apply
the same conditional change to the other occurrences (the other CheckLeaderTerm
call sites flagged in the review) so leader-only logic uses false for
allow_candidate.

Comment on lines +565 to +585
bool ValidTermCheck() override
{
int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
int64_t candidate_ng_term =
Sharder::Instance().CandidateLeaderTerm(node_group_id_);
cc_ng_term = std::max(cc_ng_term, candidate_ng_term);

if (ng_term_ < 0)
{
ng_term_ = cc_ng_term;
}

if (cc_ng_term < 0 || cc_ng_term != ng_term_)
{
return false;
}
else
{
return true;
}
}

⚠️ Potential issue | 🟠 Major

Gate candidate-term acceptance behind an explicit flag.

AcquireAllCc and PostWriteAllCc now always accept candidate terms, which bypasses the new “allow_run_on_candidate” intent. If only some flows should run on candidates, add a flag (as in AcquireCc) or otherwise document why these paths are unconditional.

Suggested direction
-        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
-        int64_t candidate_ng_term =
-            Sharder::Instance().CandidateLeaderTerm(node_group_id_);
-        cc_ng_term = std::max(cc_ng_term, candidate_ng_term);
+        int64_t cc_ng_term = -1;
+        if (allow_run_on_candidate_)
+        {
+            cc_ng_term = Sharder::Instance().CandidateLeaderTerm(node_group_id_);
+        }
+        if (cc_ng_term < 0)
+        {
+            cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
+        }

Also applies to: 996-1016

🤖 Prompt for AI Agents
In `@tx_service/include/cc/cc_request.h` around lines 565 - 585, The
ValidTermCheck currently unconditionally accepts candidate leader terms by
taking the max of Sharder::Instance().LeaderTerm and CandidateLeaderTerm; add an
explicit boolean flag (e.g., allow_run_on_candidate_ or a parameter to
ValidTermCheck) and change ValidTermCheck (and any callers) to only consider
CandidateLeaderTerm when that flag is true; update callers such as AcquireAllCc
and PostWriteAllCc to pass/set the flag only when they should permit running on
candidate leaders (mirroring AcquireCc's behavior), and ensure ng_term_ and the
comparison logic remain unchanged when the flag is false so candidate terms are
not accepted implicitly.

Comment on lines 3201 to +3204
 if (!Sharder::Instance().CheckLeaderTerm(
-    task->node_group_id_, task->node_group_term_))
+    task->node_group_id_,
+    task->node_group_term_,
+    true))

⚠️ Potential issue | 🟠 Major

Don’t always allow candidate term in post‑processing checks.

Passing true unconditionally to CheckLeaderTerm can let leader‑only tasks complete after leadership loss (depending on term semantics). Consider gating allow‑candidate by task intent (e.g., !task->sync_on_leader_ or an explicit allow‑candidate flag) so leader‑only tasks still fail fast.

💡 Suggested fix pattern
- if (!Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
-                                         task->node_group_term_,
-                                         true))
+ if (!Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
+                                         task->node_group_term_,
+                                         !task->sync_on_leader_))

Also applies to: 3291-3292, 3394-3395, 3468-3469, 4489-4490

🤖 Prompt for AI Agents
In `@tx_service/src/cc/local_cc_shards.cpp` around lines 3201 - 3204, The
CheckLeaderTerm call currently passes 'true' unconditionally which permits
candidate-term acceptance; change the call to only allow candidate term when the
task explicitly allows it (e.g., use !task->sync_on_leader_ or an explicit
allow_candidate flag) so leader-only tasks still fail fast—update the call site
Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
task->node_group_term_, /*allow_candidate=*/!task->sync_on_leader_) and apply
the same conditional change to the other occurrences (the other CheckLeaderTerm
call sites flagged in the review) so leader-only logic uses false for
allow_candidate.

@yi-xmu yi-xmu linked an issue Feb 2, 2026 that may be closed by this pull request
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3852-3855: The condition incorrectly excludes term 0; update the
check on Sharder::Instance().CandidateLeaderTerm(ng_id) to use >= 0 (not > 0) so
term 0 is treated as a valid candidate leader term; modify the if condition
around CandidateLeaderTerm(ng_id) and ensure
data_sync_task->SetRunOnLeaderNode(false) still executes for any non-negative
term.
🧹 Nitpick comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)

3254-3256: UpdateLastDataSyncTS is monotonic and safe with 0 values; for clarity, skip the call on candidate runs.

Both UpdateLastDataSyncTS implementations guard with if (last_sync_ts > last_sync_ts_), preventing any regression. Passing 0 on candidate runs is functionally safe and will be ignored by the monotonic check. However, to improve clarity and avoid unnecessary function calls, skip the call entirely unless run_on_leader_node_ is true. This aligns with the PR intent and existing patterns (e.g., local_cc_shards.h:986-989).

Apply to: 3254-3256, 3411-3413, 3930-3933, 4514-4516, 5737-5739.
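The monotonic guard the nitpick relies on can be sketched as below. The holder struct is hypothetical; only the `last_sync_ts_` member and the `if (last_sync_ts > last_sync_ts_)` guard mirror the review.

```cpp
#include <cstdint>

// Illustrative holder: the last-sync timestamp only moves forward, so
// passing 0 (as candidate runs do) is a harmless no-op.
struct SyncTsHolder
{
    uint64_t last_sync_ts_{0};

    void UpdateLastDataSyncTS(uint64_t last_sync_ts)
    {
        if (last_sync_ts > last_sync_ts_)
        {
            last_sync_ts_ = last_sync_ts;
        }
    }
};
```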

Comment on lines +3852 to +3855
if (Sharder::Instance().CandidateLeaderTerm(ng_id) > 0)
{
data_sync_task->SetRunOnLeaderNode(false);
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for CandidateLeaderTerm usage to understand term semantics
rg "CandidateLeaderTerm" -C 3 -n

Repository: eloqdata/tx_service

Length of output: 36871


🏁 Script executed:

#!/bin/bash
# Look for term-related constants (INIT_TERM, etc.)
rg "INIT_TERM|MIN_TERM|INVALID_TERM" -n

Repository: eloqdata/tx_service

Length of output: 1442


🏁 Script executed:

#!/bin/bash
# Search for other candidate term comparisons to see common patterns
rg "CandidateLeaderTerm.*[<>=]" -n

Repository: eloqdata/tx_service

Length of output: 2199


🏁 Script executed:

#!/bin/bash
# Check the context around line 3852 to understand the logic better
sed -n '3840,3870p' tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 1098


Change CandidateLeaderTerm(ng_id) > 0 to >= 0 to include term 0. Term 0 is a valid candidate leader term (INIT_TERM = -1 is invalid). Other validity checks in the codebase use >= 0 for this purpose (e.g., line 1595 assert, line 275 in tx_execution.h), making > 0 inconsistent and causing it to incorrectly skip the first candidate leader term.

🤖 Prompt for AI Agents
In `@tx_service/src/cc/local_cc_shards.cpp` around lines 3852 - 3855, The
condition incorrectly excludes term 0; update the check on
Sharder::Instance().CandidateLeaderTerm(ng_id) to use >= 0 (not > 0) so term 0
is treated as a valid candidate leader term; modify the if condition around
CandidateLeaderTerm(ng_id) and ensure data_sync_task->SetRunOnLeaderNode(false)
still executes for any non-negative term.
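The off-by-one the comment describes reduces to a one-line predicate. `INIT_TERM = -1` is taken from the review above; the helper name is illustrative.

```cpp
#include <cstdint>

// INIT_TERM is -1 in the codebase; term 0 is the first valid term, so the
// validity predicate must be `>= 0`, not `> 0`, or the first candidate
// leader term is incorrectly skipped.
constexpr int64_t INIT_TERM = -1;

bool IsValidCandidateTerm(int64_t term)
{
    return term >= 0;
}
```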

@yi-xmu yi-xmu requested a review from liunyl February 2, 2026 10:32


Development

Successfully merging this pull request may close these issues.

Allow checkpointing to be performed on the candidate leader node.

1 participant