
Allow run on candidate #392

Open

yi-xmu wants to merge 7 commits into main from allow_run_on_candidate

Conversation

@yi-xmu
Collaborator

@yi-xmu yi-xmu commented Feb 2, 2026

To keep OUT_OF_MEMORY errors from leaving log replay unable to complete during cc node recovery, DataSync operations should be allowed under candidate-leader conditions. Additionally:

  1. In this case we can flush data and kick out the cce, but we cannot truncate the redo log based on this data sync ts, since we might miss data that has not been recovered yet.
  2. We cannot update the last sync ts of the range/catalog.
  3. The term check is placed uniformly in the Execute method of CcRequest, using the ValidTermCheck method.
  4. A flag named allow_run_on_candidate_ is added to CcRequest to determine whether to check the candidate term when performing term checks.
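The candidate-aware term check described in points 3 and 4 can be sketched as follows. `Sharder`, `CcRequest`, `ValidTermCheck`, and `allow_run_on_candidate_` come from the PR summary; the concrete types and signatures below are illustrative stand-ins, not the project's actual interfaces:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative stand-in for the Sharder term sources; -1 means
// "this node is not in that role".
struct MockSharder
{
    int64_t leader_term = -1;           // > 0 only on the current leader
    int64_t candidate_leader_term = -1; // > 0 while log replay is ongoing

    int64_t LeaderTerm() const { return leader_term; }
    int64_t CandidateLeaderTerm() const { return candidate_leader_term; }
};

// Mirrors the idea of CcRequest::ValidTermCheck(): a request tagged with
// allow_run_on_candidate_ also accepts the candidate leader term, so
// DataSync-style work can proceed before recovery finishes.
bool ValidTermCheck(const MockSharder &sharder,
                    int64_t req_term,
                    bool allow_run_on_candidate)
{
    int64_t ng_term = sharder.LeaderTerm();
    if (allow_run_on_candidate)
    {
        ng_term = std::max(ng_term, sharder.CandidateLeaderTerm());
    }
    return ng_term > 0 && ng_term == req_term;
}
```

With this shape, an ordinary request on a candidate node fails the term check, while a request carrying the flag passes once the candidate term matches.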

Summary by CodeRabbit

  • Refactor
    • Improved internal leadership validation and term management to enhance system stability during recovery and cluster transitions.
    • Enhanced transaction execution logic to better handle candidate node scenarios during failover.
    • Refined data synchronization and term resolution to ensure consistency and reliability across cluster members.
    • Strengthened error handling and diagnostics for cluster operations.

fix release metadata readlocc

fix update last sync ts

update debug log

debug log and debug print ccpage function

debug log for range split datasyncscan and debug print ccmap function

disable debug log

fix slice size mismatch
@yi-xmu yi-xmu self-assigned this Feb 2, 2026
@coderabbitai

coderabbitai bot commented Feb 2, 2026

Walkthrough

This pull request extends transaction service components to support running operations on candidate leader nodes, in addition to current leaders. It refactors term resolution across multiple CC request types and handlers to use request-provided terms with candidate/standby fallbacks, updates method signatures to propagate an allow_run_on_candidate flag, and adds a check_candidate parameter to leader-term verification.

Changes

Cohort / File(s) — Summary

  • Term Resolution Refactoring (cc_catalog_cc_map.h, cluster_config_cc_map.h, range_bucket_cc_map.h, range_cc_map.h): Replaced direct Sharder::Instance().LeaderTerm() calls with req.NodeGroupTerm() and introduced multi-step fallback logic: if the term is negative, attempt CandidateLeaderTerm() when allowed, then fall back to LeaderTerm() and StandbyNodeTerm(), with assertions enforcing positive terms.
  • Template CC Map Updates (template_cc_map.h): Refactored term derivation across Execute paths in multiple CC types (AcquireCc, AcquireAllCc, PostWriteCc, ReadCc, ScanNextBatchCc) to use request-sourced terms and candidate-aware logic; removed cascading external leader-term validation blocks.
  • CC Request Infrastructure (cc_request.h, cc_shard.h, cc_req_misc.h): Extended request classes with new term accessors (NodeGroupTerm(), GetNodeGroupLeaderTerm()), added ValidTermCheck() overrides incorporating candidate terms, and introduced allow_run_on_candidate flag propagation through constructors and reset methods.
  • Handler Method Signatures (cc_handler.h, local_cc_handler.h, remote_cc_handler.h): Added a bool allow_run_on_candidate parameter to public methods including AcquireWrite, PostWrite, PostRead, Read, ReadLocal, and NewTxn across all handler implementations, with defaults preserving compatibility.
  • Transaction Execution (tx_execution.h, tx_execution.cpp, tx_request.h, tx_util.h): Extended the InitTx signature and TransactionExecution with an allow_run_on_candidate parameter; removed the is_recovering flag from ReadTxRequest; integrated the new flag into candidate-aware term checking via CheckLeaderTerm().
  • Sharder and Support (sharder.h, sharder.cpp): Extended CheckLeaderTerm() with an optional check_candidate parameter to incorporate candidate leader terms when validating leadership, altering error handling for non-leader conditions.
  • Partition and Sequence Management (local_cc_shards.h, local_cc_shards.cpp, sequences.h, sequences.cpp): Added a NodeGroupId ng_id parameter to partition ID allocation (GetNextRangePartitionId, ApplyIdOfTableRangePartition); updated call sites to pass ng_id and extended NewTxInit invocations with additional transaction context parameters.
  • Data Sync Infrastructure (data_sync_task.h, data_sync_task.cpp, checkpointer.cpp): Added a run_on_leader_node_ member flag to DataSyncTask; updated CheckLeaderTerm usages to include the new boolean parameter for candidate-aware validation.
  • Remote CC and Proto (remote_cc_handler.h, remote_cc_handler.cpp, remote_cc_request.cpp, cc_request.proto): Added an allow_run_on_candidate parameter to RemoteCcHandler::Read and propagated it into ReadRequest; added a corresponding proto field to the ReadRequest message.
  • Logging and Diagnostics (cc_req_misc.h, tx_operation.cpp): Enhanced abort logging in FillStoreSliceCc with the table name; improved error diagnostics in SplitFlushRangeOp by appending error codes to log messages.
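The multi-step fallback described in the Term Resolution Refactoring cohort can be sketched as a pure function. The three term parameters stand in for `Sharder::Instance().CandidateLeaderTerm(ng_id)`, `LeaderTerm(ng_id)`, and `StandbyNodeTerm()`; the function name and signature are illustrative, not the project's API:

```cpp
#include <algorithm>
#include <cstdint>

// Resolve the node-group term for a request. -1 in any source means
// "this node does not hold that role".
int64_t ResolveNodeGroupTerm(int64_t req_term,
                             bool allow_run_on_candidate,
                             int64_t candidate_term,
                             int64_t leader_term,
                             int64_t standby_term)
{
    int64_t ng_term = req_term;
    if (ng_term < 0)
    {
        // Step 1: use the candidate leader term only when the request
        // explicitly allows running on a candidate.
        if (allow_run_on_candidate)
        {
            ng_term = candidate_term;
        }
        // Step 2: fall back to the leader term, then the standby term.
        if (ng_term < 0)
        {
            ng_term = std::max(leader_term, standby_term);
        }
    }
    return ng_term; // call sites assert(ng_term > 0) before using it
}
```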

Sequence Diagram

sequenceDiagram
    actor Client
    participant TransactionExecution
    participant CcHandler as CcHandler<br/>(LocalCcHandler)
    participant Sharder
    participant CcRequest as CcRequest<br/>(e.g., ReadCc)
    participant CcShard

    Client->>TransactionExecution: InitTx(allow_run_on_candidate=true)
    TransactionExecution->>TransactionExecution: Store allow_run_on_candidate_
    TransactionExecution->>CcHandler: NewTxn(..., allow_run_on_candidate=true)
    CcHandler->>CcRequest: Reset(..., allow_run_on_candidate=true)
    CcRequest->>CcRequest: Store allow_run_on_candidate flag

    Client->>TransactionExecution: Read(...)
    TransactionExecution->>CcHandler: Read(..., allow_run_on_candidate_=true)
    CcHandler->>CcRequest: Reset(..., allow_run_on_candidate=true)
    
    CcRequest->>Sharder: CandidateLeaderTerm(ng_id)
    Sharder-->>CcRequest: candidate_term (if available)
    
    alt candidate_term available and allowed
        CcRequest->>CcRequest: ng_term = candidate_term
    else fallback
        CcRequest->>Sharder: LeaderTerm(ng_id)
        Sharder-->>CcRequest: leader_term
        CcRequest->>Sharder: StandbyNodeTerm()
        Sharder-->>CcRequest: standby_term
        CcRequest->>CcRequest: ng_term = max(leader_term, standby_term)
    end
    
    CcRequest->>CcShard: Execute(ng_term)
    CcShard->>CcShard: Use ng_term for lock/data operations
    CcShard-->>CcRequest: Result
    CcRequest-->>CcHandler: Result
    CcHandler-->>TransactionExecution: Result
    TransactionExecution-->>Client: Data

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Store ng leader term with ng_leader_cache. #129: Modifies Sharder leader-term propagation and term handling, overlapping with this PR's refactoring of leader-term checks and introduction of candidate-term logic.
  • Fix eloqkv report oom #203: Updates the same CcHandler/Read/Acquire/PostWrite method signatures, adding boolean parameters to parallel interfaces affected by this candidate-aware execution feature.

Suggested reviewers

  • liunyl
  • thweetkomputer

🐰 In meadows of leaders, a shift takes place,
Candidates rise to join the race,
No longer waiting on the side,
They execute with fallback pride,
Terms resolve with grace and care—
Distributed consensus everywhere!

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 7.44%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: The title 'Allow run on candidate' is vague and generic, using non-descriptive terminology that doesn't convey the specific technical change or its business purpose. Resolution: consider a more specific title such as 'Allow DataSync operations on candidate leaders during recovery' or 'Support candidate-leader execution to prevent OOM during log replay'.
✅ Passed checks (1 passed)
  • Description check — ✅ Passed: The PR description explains the motivation and implementation approach well, but the required template checklist items are not addressed.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
tx_service/src/remote/remote_cc_request.cpp (1)

478-495: ⚠️ Potential issue | 🔴 Critical

Fix ReadCc::Reset argument order for outside reads.
After inserting allow_run_on_candidate, the outside‑read call shifts arguments (-1 → bool, abort_if_oom → partition_id). Pass point_read_on_miss explicitly to keep parameters aligned.

🛠️ Suggested fix
         ReadCc::Reset(&remote_table_name_,
                       req.schema_version(),
                       &req.key(),
                       req.key_shard_code(),
                       out_record,
                       read_type,
                       input_msg->tx_number(),
                       input_msg->tx_term(),
                       req.ts(),
                       &cc_res_,
                       ToLocalType::ConvertIsolation(req.iso_level()),
                       ToLocalType::ConvertProtocol(req.protocol()),
                       req.is_for_write(),
                       req.is_covering_keys(),
                       nullptr,
-                      false,
-                      -1,
-                      req.abort_if_oom());
+                      false,  // allow_run_on_candidate
+                      false,  // point_read_on_miss
+                      -1,     // partition_id
+                      req.abort_if_oom());
tx_service/src/tx_execution.cpp (1)

2006-2014: ⚠️ Potential issue | 🟠 Major

Avoid stale bucket-lock state on cache hit.
Marking lock_range_bucket_result_ finished here can cause later ops (e.g., object commands) to skip locking and reuse a stale bucket_rec_ when FastToGetBucket() misses, which can misroute shard codes. Prefer clearing the lock result on cache hits instead of setting it finished.

Proposed fix
-                    // Set the lock range bucket result
-                    lock_range_bucket_result_.SetFinished();
+                    // Cache hit doesn't populate bucket_rec_; avoid leaking
+                    // prior lock state into later ops.
+                    lock_range_bucket_result_.Value().Reset();
+                    lock_range_bucket_result_.Reset();
tx_service/include/cc/template_cc_map.h (2)

504-512: ⚠️ Potential issue | 🔴 Critical

Add the trailing semicolon after CODE_FAULT_INJECTOR.

Line 512 now ends the macro invocation with }). If the macro expands to a do { ... } while (0) block (common pattern), omitting the semicolon breaks compilation. Other call sites use });.

🛠️ Proposed fix
-        })
+        });
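The `do { ... } while (0)` pattern mentioned above is the standard way to make a multi-statement macro behave like a single statement, which is exactly why the trailing semicolon is required at the call site. A minimal illustration (the `RUN_ONCE` macro here is a stand-in; `CODE_FAULT_INJECTOR` itself is project-specific):

```cpp
// A do-while(0) macro expands to a statement that still needs its
// terminating ';' at the call site.
#define RUN_ONCE(body)    \
    do                    \
    {                     \
        body;             \
    } while (0)

int CountedCall(int &counter)
{
    // Without the ';' after RUN_ONCE(...), the following `else` would not
    // compile: `while (0)` expects a ';' to terminate the if-branch
    // statement before `else` can attach.
    if (counter >= 0)
        RUN_ONCE(++counter);
    else
        RUN_ONCE(--counter);
    return counter;
}
```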

6053-6087: ⚠️ Potential issue | 🔴 Critical

Replace the leader_term > 0 check with CheckLeaderTerm() to properly verify this node is the leader.

GetNodeGroupLeaderTerm() returns std::max(LeaderTerm(ng_id), CandidateLeaderTerm(ng_id)), not an indicator of whether the local node is leader. A candidate with a higher term would cause leader_term > 0 even though this node is not the leader, triggering the assert incorrectly. Use Sharder::Instance().CheckLeaderTerm(cc_ng_id_, current_term) instead to verify the node actually holds leadership before asserting buffered commands are invalid. Also verify that current_term (from req.NodeGroupTerm()) is guaranteed positive at this point.

tx_service/include/cc/cc_request.h (1)

402-551: ⚠️ Potential issue | 🟠 Major

Reset allow_run_on_candidate_ in the string-key Reset to avoid stale state.

These requests are pooled; if a prior call set the flag to true, the string-key Reset currently leaves it unchanged and can unintentionally run remote acquires on candidates.

🔧 Suggested fix
         is_local_ = false;
         block_by_lock_ = false;
         abort_if_oom_ = abort_if_oom;
+        allow_run_on_candidate_ = false;
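The pooled-request hazard flagged above is a general pattern: every `Reset` overload must restore every member to its default, or a recycled object carries the previous caller's state. A minimal sketch, with names that are illustrative rather than the actual AcquireCc interface:

```cpp
// Sketch of a pooled request object. If any Reset overload forgets a
// member, the next user of the pooled object inherits stale state.
struct PooledRequest
{
    bool allow_run_on_candidate_ = false;
    bool block_by_lock_ = false;

    // Overload without the flag must still clear it back to the default.
    void Reset(bool allow_run_on_candidate = false)
    {
        block_by_lock_ = false;
        allow_run_on_candidate_ = allow_run_on_candidate;
    }
};
```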
tx_service/src/cc/local_cc_shards.cpp (1)

3390-3497: ⚠️ Potential issue | 🟠 Major

Unpin should also run when term == 0.

With candidate-enabled flows, TryPinNodeGroupData can return 0 on success. The current if (ng_term > 0) skips unpinning in that case and can leak a pin. Switch to >= 0 to match the pin-success contract.

🛠️ Suggested fix
-    if (ng_term > 0)
+    if (ng_term >= 0)
     {
         Sharder::Instance().UnpinNodeGroupData(task->node_group_id_);
     }
Based on learnings: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded).
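The pin/unpin contract described above boils down to a boundary condition: success is signaled by a term `>= 0`, and with candidate-enabled flows `0` is now a legitimate success value, so a `> 0` guard leaks the pin. A tiny sketch with illustrative types:

```cpp
#include <cstdint>

// Stand-in for the result of a TryPinNodeGroupData-style call:
// term >= 0 means the pin succeeded; term < 0 means it failed.
struct PinResult
{
    int64_t term;
};

bool NeedsUnpin(const PinResult &r)
{
    // Using `> 0` here would skip the unpin when a candidate flow
    // returns term == 0, leaking the pin.
    return r.term >= 0;
}
```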
tx_service/src/cc/local_cc_handler.cpp (4)

108-123: ⚠️ Potential issue | 🟠 Major

Propagate allow_run_on_candidate to remote AcquireWrite.
Local requests pass the flag into AcquireCc::Reset, but the remote path drops it, so candidate leaders on other nodes may still be rejected.

Proposed fix
         remote_hd_.AcquireWrite(cc_shards_.node_id_,
                                 ng_id,
                                 table_name,
                                 schema_version,
                                 key,
                                 key_shard_code,
                                 tx_number,
                                 tx_term,
                                 command_id,
                                 ts,
                                 is_insert,
                                 hres,
                                 hd_res_idx,
                                 proto,
                                 iso_level,
-                                abort_if_oom);
+                                abort_if_oom,
+                                allow_run_on_candidate);

310-319: ⚠️ Potential issue | 🟠 Major

Propagate allow_run_on_candidate to remote PostWrite.
The local path forwards the flag, but the remote call omits it, creating inconsistent candidate-leader support.

Proposed fix
         remote_hd_.PostWrite(cc_shards_.node_id_,
                              tx_number,
                              tx_term,
                              command_id,
                              commit_ts,
                              cce_addr,
                              record,
                              operation_type,
                              key_shard_code,
-                             hres);
+                             hres,
+                             allow_run_on_candidate);

413-423: ⚠️ Potential issue | 🟠 Major

Propagate allow_run_on_candidate to remote PostRead.
PostReadCc::Reset gets the flag locally, but the remote call omits it, so candidate-term validation may fail remotely.

Proposed fix
         remote_hd_.PostRead(cc_shards_.node_id_,
                             tx_number,
                             tx_term,
                             command_id,
                             key_ts,
                             gap_ts,
                             commit_ts,
                             cce_addr,
                             hres,
                             need_remote_resp,
-                            post_read_type);
+                            post_read_type,
+                            allow_run_on_candidate);

792-807: ⚠️ Potential issue | 🟠 Major

Pass allow_run_on_candidate into the string-key ReadCc::Reset.
The TxKey overload forwards the flag, but the string-key overload stops at the archive pointer; candidate runs may still be rejected in request execution.

Proposed fix
     read_req->Reset(&table_name,
                     0,
                     key_str,
                     shard_code,
                     &record,
                     read_type,
                     tx_number,
                     tx_term,
                     ts,
                     &hres,
                     iso_level,
                     proto,
                     is_for_write,
                     false,
-                    nullptr);
+                    nullptr,
+                    allow_run_on_candidate);
🤖 Fix all issues with AI agents
In `@tx_service/include/cc/catalog_cc_map.h`:
- Around line 1143-1160: The code currently asserts ng_term > 0 after falling
back via Sharder::Instance() calls; replace this assertion with a runtime error
path: if ng_term <= 0, log a clear error referencing req.NodeGroupId() and the
failed lookups (NodeGroupTerm/StandbyNodeTerm/CandidateLeaderTerm), set the
function's error/Status return (or populate the response error fields used by
this module) and return early instead of proceeding to use ng_term; remove the
assert(ng_term > 0) and ensure no lock metadata is written when ng_term is
invalid.
- Around line 197-199: Replace the assert on ng_term with a runtime guard: check
req.NodeGroupTerm() into ng_term and if ng_term <= 0 return a proper error (or
take the same fallback path used by ReadCc) instead of crashing; update the
block around ng_term,
CODE_FAULT_INJECTOR("term_CatalogCcMap_Execute_PostWriteAllCc", ...) and
subsequent logic to handle the invalid term via the error/fallback path so
downstream code never receives a negative term.

In `@tx_service/include/cc/cluster_config_cc_map.h`:
- Around line 69-71: The assert on ng_term should be replaced with a graceful
validation like other cc_map implementations: after reading uint32_t ng_id =
req.NodeGroupId(); int64_t ng_term = req.NodeGroupTerm(); remove assert(ng_term
> 0) and instead check if (ng_term < 0) and return REQUESTED_NODE_NOT_LEADER (or
the function's error return consistent with catalog_cc_map.h / range_cc_map.h
patterns); ensure you reference ng_term and REQUESTED_NODE_NOT_LEADER so the
function exits cleanly when term is invalid rather than crashing.

In `@tx_service/include/cc/local_cc_handler.h`:
- Around line 110-121: The remote PostRead declaration and plumbing must be
updated to accept and forward the allow_run_on_candidate flag: modify the remote
handler's PostRead signature in remote_cc_handler.h to include the bool
allow_run_on_candidate parameter (matching local_cc_handler.h), update the
implementation that constructs the ValidateRequest RPC message to set the
corresponding field on ValidateRequest, and propagate the flag through any
intermediate call sites and RPC stubs (e.g., methods that build/send
ValidateRequest and the RPC handler) so remote PostRead honors
allow_run_on_candidate. Ensure symbols updated include PostRead,
allow_run_on_candidate, and ValidateRequest.

In `@tx_service/include/cc/local_cc_shards.h`:
- Around line 2450-2454: The compilation fails because
BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId use override but no matching virtual
exists in DataStoreHandler, and their parameter lists differ from
LocalCcShards::GetNextRangePartitionId; fix by adding a virtual declaration in
DataStoreHandler with the canonical signature used by LocalCcShards — bool
GetNextRangePartitionId(const TableName &tablename, const TableSchema
*table_schema, NodeGroupId ng_id, uint32_t range_cnt, int32_t
&out_next_partition_id) — or alternatively remove override from the handlers;
then update BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId to match that canonical parameter list
(add table_schema and ng_id, remove retry_count) so all declarations/definitions
are consistent.

In `@tx_service/include/cc/range_bucket_cc_map.h`:
- Around line 92-108: The code currently relies on assert(ng_term > 0) after
trying req.NodeGroupTerm(), CandidateLeaderTerm(), LeaderTerm() and
StandbyNodeTerm(); replace this assert with a runtime error path: after
computing ng_term check if ng_term <= 0 and if so log or construct a descriptive
error including req.NodeGroupId(), the original req.NodeGroupTerm(),
CandidateLeaderTerm(...), LeaderTerm(...), and StandbyNodeTerm(), then throw a
std::runtime_error (or return a failure status according to your error model)
instead of asserting so callers receive a deterministic error rather than
crashing or continuing with invalid metadata.

In `@tx_service/include/cc/range_cc_map.h`:
- Around line 220-236: The code currently asserts ng_term > 0 after trying
CandidateLeaderTerm, LeaderTerm and StandbyNodeTerm; replace the assert with a
proper early return that fails fast when no valid term exists: check if ng_term
<= 0 after those fallbacks and return an appropriate NOT_LEADER or RETRY error
code (the same error path the caller expects) instead of proceeding to call
AcquireCceKeyLock and constructing a CcEntryAddr; reference the
Sharder::Instance() calls and methods CandidateLeaderTerm, LeaderTerm,
StandbyNodeTerm and ensure the function returns the error before any use of
AcquireCceKeyLock/CcEntryAddr.

In `@tx_service/include/cc/template_cc_map.h`:
- Around line 202-203: Replace the assert-based term validation with a runtime
check: in the locations where you do "int64_t ng_term = req.NodeGroupTerm();"
(including the Execute() path and the other site around the current asserts), if
ng_term <= 0 return the same error used by ReadCc (REQUESTED_NODE_NOT_LEADER)
instead of asserting; update the function that returns the RPC/Status to
propagate REQUESTED_NODE_NOT_LEADER so callers handle it like ReadCc (see ReadCc
handling around lines ~1391-1406 for the exact behavior) and remove the
assert(ng_term > 0) usages.

In `@tx_service/include/data_sync_task.h`:
- Around line 186-189: The SetSyncOnLeader() setter and unsynchronized reads of
sync_on_leader_ in DataSyncTask can race; change the DataSyncTask member
sync_on_leader_ to std::atomic<bool> and update SetSyncOnLeader() to call
sync_on_leader_.store(sync_on_leader, std::memory_order_release) and replace all
unsynchronized reads (e.g., in finalization expressions like
task->sync_on_leader_ ? task->data_sync_ts_ : 0) with
sync_on_leader_.load(std::memory_order_acquire) so reads and writes are properly
synchronized.

In `@tx_service/src/cc/cc_req_misc.cpp`:
- Around line 1171-1199: The const method UpdateCceCkptTsCc::ValidTermCheck() is
missing the assert guard used by other const variants; add assert(term_ > 0);
near the start of ValidTermCheck (before the comparison against current_term) so
uninitialized/invalid term_ values are caught early; ensure the assertion is
placed in the body of UpdateCceCkptTsCc::ValidTermCheck() prior to the if
(current_term < 0 || current_term != term_) check and that <cassert> is
available in the translation unit.
🧹 Nitpick comments (2)
tx_service/src/cc/cc_shard.cpp (1)

1072-1075: Resolve the candidate-leader recovery TODO before merge.

Please either implement the intended candidate-term behavior here or document why recovery must remain leader-only. I can help draft the fix or open a tracking issue if needed.

tx_service/include/cc/range_cc_map.h (1)

615-618: Keep term resolution consistent for key‑cache init

Line 615‑618: this path now relies solely on req.NodeGroupTerm() with a hard assert(ng_term > 0) but doesn’t reuse the candidate/leader/standby fallback from ReadCc. If NodeGroupTerm isn’t populated for this path, the cache init will assert and bail. Consider reusing the same resolution helper or gracefully skipping cache init when the term is invalid.

Comment on lines +197 to 199
int64_t ng_term = req.NodeGroupTerm();
assert(ng_term > 0);
CODE_FAULT_INJECTOR("term_CatalogCcMap_Execute_PostWriteAllCc", {

⚠️ Potential issue | 🟠 Major

Guard invalid NodeGroupTerm() instead of assert‑only.

If a request arrives with a negative term, this assert can crash debug builds; in release builds the negative term is used downstream. Prefer a runtime error path (optionally mirroring the ReadCc fallback) so invalid terms are rejected cleanly.

🔧 Proposed fix (robust guard + fallback)
-        int64_t ng_term = req.NodeGroupTerm();
-        assert(ng_term > 0);
+        int64_t ng_term = req.NodeGroupTerm();
+        if (ng_term < 0)
+        {
+            if (req.AllowRunOnCandidate())
+            {
+                ng_term = Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
+            }
+            if (ng_term < 0)
+            {
+                ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
+                int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
+                ng_term = std::max(ng_term, standby_node_term);
+            }
+        }
+        if (ng_term <= 0)
+        {
+            req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            return true;
+        }

Comment on lines 1143 to 1160
     uint32_t ng_id = req.NodeGroupId();
-    int64_t ng_term = Sharder::Instance().LeaderTerm(ng_id);
-    ng_term = std::max(ng_term, Sharder::Instance().StandbyNodeTerm());
-
-    if (req.IsInRecovering())
+    int64_t ng_term = req.NodeGroupTerm();
+    if (ng_term < 0)
     {
-        ng_term = ng_term > 0
-                      ? ng_term
-                      : Sharder::Instance().CandidateLeaderTerm(ng_id);
+        if (req.AllowRunOnCandidate())
+        {
+            ng_term = Sharder::Instance().CandidateLeaderTerm(ng_id);
+        }
+        if (ng_term < 0)
+        {
+            ng_term = Sharder::Instance().LeaderTerm(ng_id);
+            int64_t standby_node_term =
+                Sharder::Instance().StandbyNodeTerm();
+            ng_term = std::max(ng_term, standby_node_term);
+        }
     }
+    assert(ng_term > 0);


⚠️ Potential issue | 🟠 Major

Handle unresolved term after fallback instead of asserting.

If the leader/standby/candidate lookup still yields a negative term, the current assert can crash and negative terms can leak into lock metadata in release builds. Return a proper error instead.

🔧 Proposed fix (runtime error on invalid term)
-        assert(ng_term > 0);
+        if (ng_term <= 0)
+        {
+            req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            return true;
+        }

Comment on lines 69 to +71
     uint32_t ng_id = req.NodeGroupId();
-    int64_t ng_term = Sharder::Instance().LeaderTerm(ng_id);
-
-    if (ng_term < 0)
-    {
-        hd_res->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
-        return true;
-    }
+    int64_t ng_term = req.NodeGroupTerm();
+    assert(ng_term > 0);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the file and context around lines 69-71
fd -e h -e cc -e cpp "cluster_config_cc_map" | head -5

Repository: eloqdata/tx_service

Length of output: 108


🏁 Script executed:

# Search for ValidTermCheck usage and NodeGroupTerm patterns
rg -n "ValidTermCheck" --max-count=20 | head -30

Repository: eloqdata/tx_service

Length of output: 2440


🏁 Script executed:

# Search for NodeGroupTerm usage and see the context
rg -n "NodeGroupTerm" -B2 -A2 --max-count=30 | head -50

Repository: eloqdata/tx_service

Length of output: 3600


🏁 Script executed:

# Look for AcquireAllCc method and surrounding validation
rg -n "AcquireAllCc" -B5 -A15 | head -60

Repository: eloqdata/tx_service

Length of output: 5083


🏁 Script executed:

# Read the cluster_config_cc_map.h file around lines 69-71 with broader context
cat tx_service/include/cc/cluster_config_cc_map.h | head -100

Repository: eloqdata/tx_service

Length of output: 4212


🏁 Script executed:

# Search for TryPinNodeGroupData usage in cluster_config_cc_map.h
rg -n "TryPinNodeGroupData" tx_service/include/cc/cluster_config_cc_map.h -B5 -A5

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if there's any validation on NodeGroupTerm in the request class
rg -n "class.*AcquireAllCc\|struct.*AcquireAllCc" -A30 | head -80

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look for where NodeGroupTerm is set/validated in AcquireAll request
rg -n "NodeGroupTerm\(\)" -B10 -A2 tx_service/include/cc/cluster_config_cc_map.h | head -50

Repository: eloqdata/tx_service

Length of output: 605


🏁 Script executed:

# Read more of the Execute method to see how ng_term is used
sed -n '61,150p' tx_service/include/cc/cluster_config_cc_map.h

Repository: eloqdata/tx_service

Length of output: 4187


🏁 Script executed:

# Check if AcquireAllCc has ValidTermCheck method
rg -n "class AcquireAllCc\|struct AcquireAllCc" -A40 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Search for where NodeGroupTerm is set/populated in the request creation
rg -n "NodeGroupTerm\(" --max-count=20 | grep -E "Set|Reset|="

Repository: eloqdata/tx_service

Length of output: 1216


🏁 Script executed:

# Check remote_cc_request.h for RemoteAcquireAll::Reset and ValidTermCheck
sed -n '85,200p' tx_service/include/remote/remote_cc_request.h

Repository: eloqdata/tx_service

Length of output: 2884


🏁 Script executed:

# Search for where ClusterConfigCcMap::Execute is called and what validation happens before
rg -n "ClusterConfigCcMap.*Execute\|cluster_config_ccm_\|cluster_config_ccm\|AcquireAllCc" tx_service/src/cc/local_cc_handler.cpp -B5 -A5 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check the critical finding: template_cc_map.h line 2379 checks ng_term < 0
sed -n '2375,2385p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 365


🏁 Script executed:

# Check what happens when term < 0 in other methods
rg -n "if.*ng_term.*<.*0\|if.*term.*<.*0" tx_service/include/cc/*.h | head -30

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Search for AcquireAllCc class definition to see if ValidTermCheck exists
rg -n "class AcquireAllCc\|struct AcquireAllCc" tx_service/include/cc/cc_request.h -A50 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if NodeGroupTerm() comes from the protobuf request message and what values it can have
rg -n "NodeGroupTerm\(\)" tx_service/include/cc/cc_request.h -B2 -A5 | head -40

Repository: eloqdata/tx_service

Length of output: 761


🏁 Script executed:

# Search for the base class AcquireAllCc and its NodeGroupTerm implementation
rg -n "class AcquireAllCc" tx_service/include/cc/cc_request.h -A80 | head -120

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check similar patterns in other cc_maps for comparison
rg -n "req\.NodeGroupTerm\(\)" tx_service/include/cc/*.h -B1 -A1 | grep -E "assert|if.*<|if.*error" | head -20

Repository: eloqdata/tx_service

Length of output: 1098


🏁 Script executed:

# Get the full AcquireAllCc class definition
rg -n "struct AcquireAllCc\|class AcquireAllCc" tx_service/include/cc/cc_request.h -A100 | head -150

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if the NodeGroupTerm_ is set from protobuf message without validation
rg -n "ng_term_\s*=" tx_service/include/cc/cc_request.h -B2 -A2 | head -40

Repository: eloqdata/tx_service

Length of output: 857


🏁 Script executed:

# Verify: Does Sharder::Instance().LeaderTerm() return negative on invalid state?
rg -n "LeaderTerm" tx_service/src/cc/local_cc_shards.cpp -B3 -A10 | head -60

Repository: eloqdata/tx_service

Length of output: 2533


Replace assert with graceful term validation to match other cc_map implementations.

This assert aborts the process in debug builds whenever ng_term ≤ 0 and is compiled out entirely in release builds, yet the codebase shows ng_term can legitimately be negative: (1) ng_term_ is initialized to -1 in cc_request.h and only conditionally patched later, (2) Sharder::Instance().LeaderTerm() can return negative values (see local_cc_shards.cpp:1594), and (3) other cc_maps such as catalog_cc_map.h, range_bucket_cc_map.h, and range_cc_map.h use graceful if (ng_term < 0) checks instead. Replace the assert with a proper validation check that returns REQUESTED_NODE_NOT_LEADER on an invalid term, consistent with the pattern used elsewhere in the codebase.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/cluster_config_cc_map.h` around lines 69 - 71, The
assert on ng_term should be replaced with a graceful validation like other
cc_map implementations: after reading uint32_t ng_id = req.NodeGroupId();
int64_t ng_term = req.NodeGroupTerm(); remove assert(ng_term > 0) and instead
check if (ng_term < 0) and return REQUESTED_NODE_NOT_LEADER (or the function's
error return consistent with catalog_cc_map.h / range_cc_map.h patterns); ensure
you reference ng_term and REQUESTED_NODE_NOT_LEADER so the function exits
cleanly when term is invalid rather than crashing.
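The graceful check this comment asks for can be sketched as follows. The types here (`CcErrorCode`, `CcResult`, `FakeRequest`) are reduced stand-ins for the real declarations in cc_request.h, so treat this as a pattern sketch under those assumptions, not the actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-ins for the real tx_service types (hypothetical).
enum class CcErrorCode { NO_ERROR, REQUESTED_NODE_NOT_LEADER };

struct CcResult {
    CcErrorCode code = CcErrorCode::NO_ERROR;
    void SetError(CcErrorCode c) { code = c; }
};

struct FakeRequest {
    int64_t ng_term;
    CcResult result;
    int64_t NodeGroupTerm() const { return ng_term; }
    CcResult *Result() { return &result; }
};

// Pattern: validate the term at runtime instead of assert(ng_term > 0),
// matching the `if (ng_term < 0)` checks in catalog_cc_map.h / range_cc_map.h.
bool Execute(FakeRequest &req) {
    int64_t ng_term = req.NodeGroupTerm();
    if (ng_term < 0) {
        // Exit cleanly instead of crashing: the caller sees NOT_LEADER
        // and can retry against the current leader.
        req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
        return true;  // request finished, with an error set
    }
    // ... proceed with a valid ng_term ...
    return true;
}
```

Unlike the assert, this path behaves identically in debug and release builds.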

Comment on lines +110 to +121
CcReqStatus PostRead(uint64_t tx_number,
int64_t tx_term,
uint16_t command_id,
uint64_t key_ts,
uint64_t gap_ts,
uint64_t commit_ts,
const CcEntryAddr &ccentry_addr,
CcHandlerResult<PostProcessResult> &hres,
bool is_local = false,
bool need_remote_resp = true,
PostReadType post_read_type = PostReadType::Release,
bool allow_run_on_candidate = false) override;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate RemoteCcHandler::PostRead and candidate flag handling
rg -n --type=cpp 'RemoteCcHandler::PostRead' -A 15
rg -n --type=cpp 'allow_run_on_candidate' --type=cpp

Repository: eloqdata/tx_service

Length of output: 9489


🏁 Script executed:

#!/bin/bash
# Check the header file for all PostRead overloads in RemoteCcHandler
echo "=== RemoteCcHandler header (remote_cc_handler.h) ==="
rg -n -B2 -A12 'class RemoteCcHandler' tx_service/include/remote/remote_cc_handler.h | head -50
echo ""
echo "=== All PostRead declarations in header ==="
rg -n -B2 -A10 'PostRead' tx_service/include/remote/remote_cc_handler.h

echo ""
echo "=== Check line 329 and context in remote_cc_handler.cpp ==="
sed -n '320,395p' tx_service/src/remote/remote_cc_handler.cpp

Repository: eloqdata/tx_service

Length of output: 4203


🏁 Script executed:

#!/bin/bash
echo "=== Full RemoteCcHandler::PostRead implementation ==="
sed -n '272,305p' tx_service/src/remote/remote_cc_handler.cpp

echo ""
echo "=== LocalCcHandler::PostRead signature from header ==="
sed -n '110,121p' tx_service/include/cc/local_cc_handler.h

echo ""
echo "=== Check if there are other PostRead overloads in remote handler ==="
rg -n 'PostRead' tx_service/include/remote/remote_cc_handler.h

Repository: eloqdata/tx_service

Length of output: 2415


Remote PostRead must propagate the allow_run_on_candidate parameter.

The local PostRead signature now includes allow_run_on_candidate, but the remote PostRead (declared in tx_service/include/remote/remote_cc_handler.h lines 103-114) is missing this parameter. The implementation does not set it on the ValidateRequest message. This means candidate-enabled PostReads routed remotely silently drop the flag, potentially allowing invalid operations on candidate nodes. Update the remote PostRead signature and RPC message plumbing to match the local API.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/local_cc_handler.h` around lines 110 - 121, The remote
PostRead declaration and plumbing must be updated to accept and forward the
allow_run_on_candidate flag: modify the remote handler's PostRead signature in
remote_cc_handler.h to include the bool allow_run_on_candidate parameter
(matching local_cc_handler.h), update the implementation that constructs the
ValidateRequest RPC message to set the corresponding field on ValidateRequest,
and propagate the flag through any intermediate call sites and RPC stubs (e.g.,
methods that build/send ValidateRequest and the RPC handler) so remote PostRead
honors allow_run_on_candidate. Ensure symbols updated include PostRead,
allow_run_on_candidate, and ValidateRequest.
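The plumbing the comment describes — a trailing flag on the handler API copied onto the outgoing RPC message — can be sketched like this. `ValidateRequest` and its field name are hypothetical stand-ins for the real protobuf message, whose actual field name may differ:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the protobuf ValidateRequest message.
struct ValidateRequest {
    bool allow_run_on_candidate = false;
    void set_allow_run_on_candidate(bool v) { allow_run_on_candidate = v; }
};

// Pattern: the remote handler accepts the same trailing flag as the local
// handler and sets it on the RPC message, so the receiving node applies the
// same candidate-term check instead of silently dropping the flag.
ValidateRequest BuildPostReadRequest(uint64_t tx_number,
                                     bool allow_run_on_candidate = false) {
    ValidateRequest msg;
    // ... fill tx_number, key_ts, commit_ts, etc. on the real message ...
    (void)tx_number;
    msg.set_allow_run_on_candidate(allow_run_on_candidate);
    return msg;
}
```

Defaulting the flag to `false` keeps existing call sites source-compatible while letting recovery paths opt in.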

Comment on lines 2450 to 2454
bool GetNextRangePartitionId(const TableName &tablename,
const TableSchema *table_schema,
NodeGroupId ng_id,
uint32_t range_cnt,
int32_t &out_next_partition_id);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, verify the function signature at the specified location
echo "=== Function signature at lines 2450-2454 ==="
sed -n '2450,2454p' tx_service/include/cc/local_cc_shards.h

echo -e "\n=== All occurrences of GetNextRangePartitionId ==="
rg -n -C3 'GetNextRangePartitionId\s*\(' --no-heading

Repository: eloqdata/tx_service

Length of output: 3970


🏁 Script executed:

#!/bin/bash
# Check if GetNextRangePartitionId is a virtual method in a base class
echo "=== Search for base class or interface definitions ==="
rg -n 'class.*Handler.*{' store_handler/ -A 5

echo -e "\n=== Check for virtual keyword in declarations ==="
rg -n 'virtual.*GetNextRangePartitionId' store_handler/

echo -e "\n=== Check bigtable_handler.h for method signatures ==="
sed -n '220,230p' store_handler/bigtable_handler.h

echo -e "\n=== Check dynamo_handler.h for method signatures ==="
sed -n '165,175p' store_handler/dynamo_handler.h

Repository: eloqdata/tx_service

Length of output: 1364


🏁 Script executed:

#!/bin/bash
# Find base class more simply
echo "=== Looking for StoreHandler or base class interface ==="
fd -n "\.h$" store_handler/ | head -5 | xargs -I {} sh -c 'echo "=== {} ==="; head -30 {}'

echo -e "\n=== Search for GetNextRangePartitionId in base/interface files ==="
rg -n 'GetNextRangePartitionId' store_handler/ -B 2 -A 1

echo -e "\n=== Check if there's a StoreHandler base class ==="
fd -n "store_handler" --maxdepth 1 -type f -name "*.h" -exec grep -l "class.*StoreHandler" {} \;

Repository: eloqdata/tx_service

Length of output: 1960


🏁 Script executed:

#!/bin/bash
# Find base class definition
echo "=== List store_handler header files ==="
find store_handler -maxdepth 1 -type f -name "*.h"

echo -e "\n=== Search for class declarations in store_handler ==="
rg 'class.*{' store_handler/ -m 1

echo -e "\n=== Check bigtable_handler.h class declaration ==="
head -50 store_handler/bigtable_handler.h | grep -A 20 'class'

echo -e "\n=== Check dynamo_handler.h class declaration ==="
head -50 store_handler/dynamo_handler.h | grep -A 20 'class'

echo -e "\n=== Look for store_handler base/interface ==="
find store_handler -name "*.h" | xargs grep -l 'virtual.*GetNextRangePartitionId'

Repository: eloqdata/tx_service

Length of output: 1262


🏁 Script executed:

#!/bin/bash
# Find DataStoreHandler base class definition
echo "=== Search for DataStoreHandler class definition ==="
rg 'class DataStoreHandler' -A 30

echo -e "\n=== Look for GetNextRangePartitionId in base class ==="
rg 'DataStoreHandler' -l | head -5

echo -e "\n=== Check store directory or include files ==="
find . -path ./tx_service -prune -o -name "*.h" -type f -exec grep -l 'DataStoreHandler' {} \; | head -10

Repository: eloqdata/tx_service

Length of output: 10061


🏁 Script executed:

#!/bin/bash
# Find GetNextRangePartitionId in the base class definition
echo "=== Extract GetNextRangePartitionId from base class ==="
rg -A 10 'GetNextRangePartitionId' tx_service/include/store/data_store_handler.h

echo -e "\n=== Summary of all signatures ==="
echo "LocalCcShards (tx_service/include/cc/local_cc_shards.h:2450):"
sed -n '2450,2454p' tx_service/include/cc/local_cc_shards.h

echo -e "\nBigTableHandler (store_handler/bigtable_handler.h:222):"
sed -n '222,225p' store_handler/bigtable_handler.h

echo -e "\nDynamoHandler (store_handler/dynamo_handler.h:168):"
sed -n '168,171p' store_handler/dynamo_handler.h

Repository: eloqdata/tx_service

Length of output: 1126


🏁 Script executed:

#!/bin/bash
# Search entire data_store_handler.h file for GetNextRangePartitionId
echo "=== Full content of data_store_handler.h ==="
wc -l tx_service/include/store/data_store_handler.h

echo -e "\n=== Search for GetNextRangePartitionId ==="
grep -n "GetNextRangePartitionId" tx_service/include/store/data_store_handler.h

echo -e "\n=== Check if DataStoreHandler has any method named GetNext ==="
grep -n "GetNext" tx_service/include/store/data_store_handler.h

echo -e "\n=== Get entire DataStoreHandler class definition ==="
sed -n '/^class DataStoreHandler/,/^}/p' tx_service/include/store/data_store_handler.h | head -150

Repository: eloqdata/tx_service

Length of output: 5725


🏁 Script executed:

#!/bin/bash
# Verify that GetNextRangePartitionId is not in the base class
echo "=== Full DataStoreHandler file length ==="
wc -l tx_service/include/store/data_store_handler.h

echo -e "\n=== Search entire file for GetNextRangePartitionId ==="
grep -c "GetNextRangePartitionId" tx_service/include/store/data_store_handler.h

echo -e "\n=== All usages of GetNextRangePartitionId in codebase ==="
rg 'GetNextRangePartitionId' --no-heading -l

echo -e "\n=== Check if handler methods are actually called ==="
rg 'handler.*GetNextRangePartitionId|GetNextRangePartitionId.*handler' -C 3

echo -e "\n=== Verify LocalCcShards call site ==="
sed -n '5648,5658p' tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 1035


Fix compilation errors: override keyword on non-existent base class methods.

BigTableHandler and DynamoHandler declare GetNextRangePartitionId with the override keyword, but this method does not exist in the base class DataStoreHandler. This causes a compilation error. Either remove the override keyword or define the method in the base class.

Additionally, the handler signatures are incompatible with LocalCcShards::GetNextRangePartitionId:

  • LocalCcShards: GetNextRangePartitionId(tablename, table_schema, ng_id, range_cnt, out_partition_id)
  • Handlers: GetNextRangePartitionId(tablename, range_cnt, out_partition_id, retry_count)

The handlers are missing table_schema and ng_id parameters while adding retry_count. Reconcile the signatures across all implementations.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/local_cc_shards.h` around lines 2450 - 2454, The
compilation fails because BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId use override but no matching virtual
exists in DataStoreHandler, and their parameter lists differ from
LocalCcShards::GetNextRangePartitionId; fix by adding a virtual declaration in
DataStoreHandler with the canonical signature used by LocalCcShards — bool
GetNextRangePartitionId(const TableName &tablename, const TableSchema
*table_schema, NodeGroupId ng_id, uint32_t range_cnt, int32_t
&out_next_partition_id) — or alternatively remove override from the handlers;
then update BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId to match that canonical parameter list
(add table_schema and ng_id, remove retry_count) so all declarations/definitions
are consistent.
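The first remedy — declaring the canonical virtual in the base class so `override` is compiler-checked — looks roughly like this. `TableName`, `TableSchema`, and `NodeGroupId` are simplified stand-ins for the real tx_service types:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-ins (hypothetical) for the real tx_service types.
using TableName = std::string;
struct TableSchema {};
using NodeGroupId = uint32_t;

// Pattern: declare the canonical signature once in the base class so each
// handler's `override` is verified by the compiler; a mismatched parameter
// list then fails at compile time instead of silently shadowing.
struct DataStoreHandler {
    virtual ~DataStoreHandler() = default;
    virtual bool GetNextRangePartitionId(const TableName &tablename,
                                         const TableSchema *table_schema,
                                         NodeGroupId ng_id,
                                         uint32_t range_cnt,
                                         int32_t &out_next_partition_id) = 0;
};

struct FakeHandler : DataStoreHandler {
    // `override` compiles because the base declares a matching virtual.
    bool GetNextRangePartitionId(const TableName &,
                                 const TableSchema *,
                                 NodeGroupId,
                                 uint32_t range_cnt,
                                 int32_t &out_next_partition_id) override {
        out_next_partition_id = static_cast<int32_t>(range_cnt) + 1;
        return true;
    }
};
```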

Comment on lines +92 to +108
int64_t ng_term = req.NodeGroupTerm();
if (ng_term < 0)
{
req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
return true;
if (req.AllowRunOnCandidate())
{
ng_term =
Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
}
if (ng_term < 0)
{
ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
int64_t standby_node_term =
Sharder::Instance().StandbyNodeTerm();
ng_term = std::max(ng_term, standby_node_term);
}
}
assert(ng_term > 0);

⚠️ Potential issue | 🟠 Major

Avoid assert‑only handling for unresolved ng_term.

If all fallbacks still yield a negative term, the assert can crash or (in release) proceed with invalid metadata. Prefer a runtime error.

🔧 Proposed fix (runtime error on invalid term)
-        assert(ng_term > 0);
+        if (ng_term <= 0)
+        {
+            req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            return true;
+        }
🤖 Prompt for AI Agents
In `@tx_service/include/cc/range_bucket_cc_map.h` around lines 92 - 108, The code
currently relies on assert(ng_term > 0) after trying req.NodeGroupTerm(),
CandidateLeaderTerm(), LeaderTerm() and StandbyNodeTerm(); replace this assert
with a runtime error path: after computing ng_term check if ng_term <= 0 and if
so log or construct a descriptive error including req.NodeGroupId(), the
original req.NodeGroupTerm(), CandidateLeaderTerm(...), LeaderTerm(...), and
StandbyNodeTerm(), then throw a std::runtime_error (or return a failure status
according to your error model) instead of asserting so callers receive a
deterministic error rather than crashing or continuing with invalid metadata.

Comment on lines +220 to +236
int64_t ng_term = req.NodeGroupTerm();
if (ng_term < 0)
{
hd_result->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
return true;
if (req.AllowRunOnCandidate())
{
ng_term =
Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
}
if (ng_term < 0)
{
ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
int64_t standby_node_term =
Sharder::Instance().StandbyNodeTerm();
ng_term = std::max(ng_term, standby_node_term);
}
}
assert(ng_term > 0);

⚠️ Potential issue | 🟠 Major

Handle missing node‑group term without relying on assert

Lines 220-236: after the candidate/leader/standby fallback, ng_term can still be < 0 when no valid term exists (e.g., during elections). In release builds, where the assert is compiled out, this value flows into AcquireCceKeyLock and the returned CcEntryAddr, risking invalid lock ownership. Prefer returning a NOT_LEADER/RETRY error before proceeding.

💡 Proposed guard to fail fast when no valid term is available
         if (ng_term < 0)
         {
             if (req.AllowRunOnCandidate())
             {
                 ng_term =
                     Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
             }
             if (ng_term < 0)
             {
                 ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
                 int64_t standby_node_term =
                     Sharder::Instance().StandbyNodeTerm();
                 ng_term = std::max(ng_term, standby_node_term);
             }
         }
-        assert(ng_term > 0);
+        if (ng_term < 0)
+        {
+            hd_result->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+            hd_result->SetFinished();
+            return true;
+        }
+        assert(ng_term > 0);
🤖 Prompt for AI Agents
In `@tx_service/include/cc/range_cc_map.h` around lines 220 - 236, The code
currently asserts ng_term > 0 after trying CandidateLeaderTerm, LeaderTerm and
StandbyNodeTerm; replace the assert with a proper early return that fails fast
when no valid term exists: check if ng_term <= 0 after those fallbacks and
return an appropriate NOT_LEADER or RETRY error code (the same error path the
caller expects) instead of proceeding to call AcquireCceKeyLock and constructing
a CcEntryAddr; reference the Sharder::Instance() calls and methods
CandidateLeaderTerm, LeaderTerm, StandbyNodeTerm and ensure the function returns
the error before any use of AcquireCceKeyLock/CcEntryAddr.

Comment on lines +202 to +203
int64_t ng_term = req.NodeGroupTerm();
assert(ng_term > 0);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual code at the specified lines
cat -n tx_service/include/cc/template_cc_map.h | sed -n '200,210p'
cat -n tx_service/include/cc/template_cc_map.h | sed -n '690,705p'

Repository: eloqdata/tx_service

Length of output: 1259


🏁 Script executed:

# Search for ValidTermCheck function definition and usage
rg -n "ValidTermCheck" -A 5 tx_service/

Repository: eloqdata/tx_service

Length of output: 23652


🏁 Script executed:

# Find the broader context of AcquireCc and AcquireAllCc to understand the term lifecycle
rg -n "void AcquireCc\|void AcquireAllCc" -A 20 tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check for callers of AcquireCc to understand where NodeGroupTerm comes from
rg -n "\.AcquireCc\(|\.AcquireAllCc\(" -B 5 -A 2 tx_service/

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Find the Execute method that contains line 202-203
sed -n '175,250p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 3430


🏁 Script executed:

# Look for the request class definition and its entry point
rg -n "class.*Acquire.*Cc" tx_service/include/cc/template_cc_map.h | head -20

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check if there's a base class Execute that calls ValidTermCheck before AcquireCc
rg -n "bool Execute" tx_service/include/cc/template_cc_map.h | head -10

Repository: eloqdata/tx_service

Length of output: 566


🏁 Script executed:

# Search for how requests are dispatched to AcquireCc - look at request lifecycle
rg -n "req.NodeGroupTerm()" tx_service/include/cc/template_cc_map.h -B 5 -A 10 | head -80

Repository: eloqdata/tx_service

Length of output: 2864


🏁 Script executed:

# Look at base request class to see if there's pre-Execute validation
sed -n '115,145p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 755


🏁 Script executed:

# Check if Execute() is called directly or through a wrapper that validates
rg -n "ValidTermCheck" tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Search for how Execute() is invoked - look for dispatch patterns
rg -n "\.Execute\(" tx_service/include/cc/ -B 2 | grep -A 2 "AcquireCc\|AcquireAllCc" | head -40

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check the CcRequest base class wrapper
rg -n "class.*AcquireCc\b" tx_service/include/cc/cc_request.h -A 30 | head -50

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Get more context from the base class wrapper to see exact flow
sed -n '130,165p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 1244


🏁 Script executed:

# Check what NodeGroupTerm() returns - look for its definition
rg -n "NodeGroupTerm()" tx_service/include/cc/ --type-list | head -5

Repository: eloqdata/tx_service

Length of output: 452


🏁 Script executed:

# Find where NodeGroupTerm is set on the request object
rg -n "SetNodeGroupTerm|NodeGroupTerm_|ng_term_" tx_service/include/cc/cc_request.h | head -20

Repository: eloqdata/tx_service

Length of output: 980


🏁 Script executed:

# Verify: does NodeGroupTerm() return the validated ng_term_ or something else?
rg -n "int64_t NodeGroupTerm()" tx_service/include/cc/ -A 3

Repository: eloqdata/tx_service

Length of output: 1491


🏁 Script executed:

# Look at how the base class wrapper invokes the derived Execute
sed -n '145,165p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 958


🏁 Script executed:

# Verify the flow: TemplatedCcRequest::Execute calls ValidTermCheck, then invokes typed_req->Execute
rg -n "typed_req->Execute" tx_service/include/cc/cc_request.h -B 5 -A 2

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look at PostWriteCc Execute to see if it uses runtime guards like ReadCc
sed -n '478,520p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 1631


🏁 Script executed:

# Confirm the exact flow: does TemplatedCcRequest::Execute call ValidTermCheck before invoking typed_req->Execute?
sed -n '136,180p' tx_service/include/cc/cc_request.h

Repository: eloqdata/tx_service

Length of output: 2111


🏁 Script executed:

# Check what happens at the call site - is typed_req->Execute called after ValidTermCheck returns true?
sed -n '136,170p' tx_service/include/cc/cc_request.h | tail -35

Repository: eloqdata/tx_service

Length of output: 1418


🏁 Script executed:

# Now check if AcquireCc is overriding or if it's directly called
# Look for class AcquireCc definition
rg -n "^class AcquireCc" tx_service/include/cc/ -A 10

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Critical: Check if req.NodeGroupTerm() can change between ValidTermCheck and Execute
# Look at what NodeGroupTerm represents - is it a volatile value or cached?
rg -n "ng_term_\s*=" tx_service/include/cc/cc_request.h | head -20

Repository: eloqdata/tx_service

Length of output: 889


🏁 Script executed:

# Confirm: Does ValidTermCheck set ng_term_ for ALL request types including AcquireCc?
rg -n "class AcquireCc\b" tx_service/include/cc/cc_request.h -A 40 | head -50

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check the fault injector test more carefully - it sets ng_term = -1 AFTER ValidTermCheck
# This proves ng_term can become invalid even after validation
sed -n '200,220p' tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 795


🏁 Script executed:

# Check if there are any other places where ng_term is reassigned after ValidTermCheck
rg -n "ng_term\s*=" tx_service/include/cc/template_cc_map.h | grep -A 5 -B 5 "202\|696"

Repository: eloqdata/tx_service

Length of output: 507


🏁 Script executed:

# Compare consistency: PostWriteCc doesn't have assert for ng_term. Check what it does.
rg -n "ng_term" tx_service/include/cc/template_cc_map.h | head -30

Repository: eloqdata/tx_service

Length of output: 1583


🏁 Script executed:

# Final check: look at the learning pattern - unpinning should only happen if term >= 0
# Does AcquireCc or AcquireAllCc call any pin/unpin that should be guarded?
rg -n "TryPinNodeGroupData\|UnpinNodeGroupData" tx_service/include/cc/template_cc_map.h

Repository: eloqdata/tx_service

Length of output: 45


Add runtime guard for invalid ng_term instead of relying on assert.

Lines 202-203 and 695-697 use assertions for term validation, which are compiled out in release builds and therefore let an invalid term pass silently. The fault injector test (lines 204-209) explicitly demonstrates that ng_term can become invalid within Execute(). Use a runtime check like ReadCc does (lines 1391-1406): if ng_term <= 0, return a REQUESTED_NODE_NOT_LEADER error rather than proceeding. This aligns with the codebase pattern for term safety and matches the fault injector's expected error handling.

🤖 Prompt for AI Agents
In `@tx_service/include/cc/template_cc_map.h` around lines 202 - 203, Replace the
assert-based term validation with a runtime check: in the locations where you do
"int64_t ng_term = req.NodeGroupTerm();" (including the Execute() path and the
other site around the current asserts), if ng_term <= 0 return the same error
used by ReadCc (REQUESTED_NODE_NOT_LEADER) instead of asserting; update the
function that returns the RPC/Status to propagate REQUESTED_NODE_NOT_LEADER so
callers handle it like ReadCc (see ReadCc handling around lines ~1391-1406 for
the exact behavior) and remove the assert(ng_term > 0) usages.

Comment on lines +186 to +189
void SetSyncOnLeader(bool sync_on_leader)
{
sync_on_leader_ = sync_on_leader;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd tx_service && find . -name "data_sync_task.h" -type f

Repository: eloqdata/tx_service

Length of output: 89


🏁 Script executed:

wc -l tx_service/include/data_sync_task.h

Repository: eloqdata/tx_service

Length of output: 102


🏁 Script executed:

sed -n '1,100p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 4097


🏁 Script executed:

sed -n '180,250p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 2728


🏁 Script executed:

rg "SetSyncOnLeader" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 1058


🏁 Script executed:

rg "sync_on_leader_" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 3431


🏁 Script executed:

sed -n '240,250p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 403


🏁 Script executed:

rg "SetSyncOnLeader" --type cpp -B 5 -A 5

Repository: eloqdata/tx_service

Length of output: 2508


🏁 Script executed:

rg "sync_on_leader_.*?" --type cpp -B 3 -A 3 | head -100

Repository: eloqdata/tx_service

Length of output: 4643


🏁 Script executed:

rg "SetSyncOnLeader" --type cpp -B 15 -A 5

Repository: eloqdata/tx_service

Length of output: 4562


🏁 Script executed:

rg "sync_on_leader_" --type cpp -B 5 -A 5 | head -150

Repository: eloqdata/tx_service

Length of output: 7268


🏁 Script executed:

rg "std::shared_ptr.*DataSyncTask" --type cpp --type h -A 3 | head -80

Repository: eloqdata/tx_service

Length of output: 5275


🏁 Script executed:

rg "DataSyncForRangePartition|DataSyncForHashPartition" --type cpp -B 3 -A 20 | head -150

Repository: eloqdata/tx_service

Length of output: 10230


🏁 Script executed:

grep -n "SetSyncOnLeader" tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 176


🏁 Script executed:

sed -n '1,50p' tx_service/src/cc/local_cc_shards.cpp | head -50

Repository: eloqdata/tx_service

Length of output: 1573


🏁 Script executed:

rg "data_sync_task_queue_" --type cpp --type h -B 2 -A 2 | head -80

Repository: eloqdata/tx_service

Length of output: 5807


🏁 Script executed:

sed -n '3620,3650p' tx_service/src/cc/local_cc_shards.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 1581


🏁 Script executed:

sed -n '4680,4710p' tx_service/src/cc/local_cc_shards.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 1596


🏁 Script executed:

rg "void.*DataSyncForRangePartition" --type cpp -A 30 | head -80

Repository: eloqdata/tx_service

Length of output: 5032


🏁 Script executed:

sed -n '180,250p' tx_service/include/data_sync_task.h | cat -n

Repository: eloqdata/tx_service

Length of output: 2728


🏁 Script executed:

rg "void.*Sync.*Adjustable|void.*Set.*OnLeader" --type h --type cpp -B 2 -A 5

Repository: eloqdata/tx_service

Length of output: 754


🏁 Script executed:

rg "last_sync_ts.*=.*sync_on_leader_" --type cpp -B 10 | head -100

Repository: eloqdata/tx_service

Length of output: 911


🏁 Script executed:

rg "PostProcessFlushTaskEntries|FinishDataSyncTask" --type cpp --type h -B 2 -A 10 | head -120

Repository: eloqdata/tx_service

Length of output: 2966


🏁 Script executed:

rg "FinishDataSync|SetScanTaskFinished|SetFinish" --type cpp -B 3 -A 8 | head -100

Repository: eloqdata/tx_service

Length of output: 5644


🏁 Script executed:

rg "void.*Worker\(\)" --type cpp --type h | grep -i "data\|sync\|flush" | head -20

Repository: eloqdata/tx_service

Length of output: 898


🏁 Script executed:

sed -n '4700,4750p' tx_service/src/cc/local_cc_shards.cpp | cat -n

Repository: eloqdata/tx_service

Length of output: 2323


🏁 Script executed:

rg "sync_on_leader_.*\?" --type cpp -B 2 -A 5 | head -80

Repository: eloqdata/tx_service

Length of output: 2741


🏁 Script executed:

rg "need_update_ckpt_ts_" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 1370


Guard sync_on_leader_ updates against data races.

The non-atomic write in SetSyncOnLeader() (lines 186-189) can race with reads in finalization code paths. While SetSyncOnLeader() is called during worker thread execution in DataSyncForRangePartition(), the DataSyncTask object is shared via shared_ptr across threads. The reads of sync_on_leader_ in finalization (e.g., task->sync_on_leader_ ? task->data_sync_ts_ : 0) occur without synchronization.

The same issue applies to the member declaration itself (lines 246-248), where sync_on_leader_ is a plain bool with no atomic or mutex protection. Use std::atomic<bool> or guard access with a mutex to ensure thread safety.

🤖 Prompt for AI Agents
In `@tx_service/include/data_sync_task.h` around lines 186 - 189, The
SetSyncOnLeader() setter and unsynchronized reads of sync_on_leader_ in
DataSyncTask can race; change the DataSyncTask member sync_on_leader_ to
std::atomic<bool> and update SetSyncOnLeader() to call
sync_on_leader_.store(sync_on_leader, std::memory_order_release) and replace all
unsynchronized reads (e.g., in finalization expressions like
task->sync_on_leader_ ? task->data_sync_ts_ : 0) with
sync_on_leader_.load(std::memory_order_acquire) so reads and writes are properly
synchronized.
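The atomic variant the prompt describes can be sketched as below. Member names follow data_sync_task.h, but the class is reduced to the two relevant members, and `CkptTsToPersist` is a hypothetical helper wrapping the finalization expression `sync_on_leader_ ? data_sync_ts_ : 0`:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct DataSyncTask {
    // std::atomic<bool> with release/acquire ordering makes the worker-thread
    // write in SetSyncOnLeader() visible to finalization-path readers.
    std::atomic<bool> sync_on_leader_{true};
    uint64_t data_sync_ts_ = 0;

    void SetSyncOnLeader(bool sync_on_leader) {
        sync_on_leader_.store(sync_on_leader, std::memory_order_release);
    }

    // Finalization read: previously the raw `sync_on_leader_ ? data_sync_ts_ : 0`.
    uint64_t CkptTsToPersist() const {
        return sync_on_leader_.load(std::memory_order_acquire) ? data_sync_ts_
                                                               : 0;
    }
};
```

For a lone bool flag with no invariant spanning multiple members, the atomic is cheaper than a mutex; a mutex would only be needed if sync_on_leader_ must change together with data_sync_ts_ atomically.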

Comment on lines +1171 to +1199
bool UpdateCceCkptTsCc::ValidTermCheck() const
{
int64_t ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
int64_t candidate_ng_term =
Sharder::Instance().CandidateLeaderTerm(node_group_id_);
ng_term = std::max(ng_term, candidate_ng_term);
int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
int64_t current_term = std::max(ng_term, standby_node_term);

if (current_term < 0 || current_term != term_)
{
LOG(INFO)
<< "UpdateCceCkptTsCc::ValidTermCheck failed with current term: "
<< current_term << ", term_: " << term_;
return false;
}

return true;
}

bool UpdateCceCkptTsCc::Execute(CcShard &ccs)
{
if (!ValidTermCheck())
{
LOG(INFO) << "UpdateCceCkptTsCc::ValidTermCheck failed on shard: "
<< ccs.core_id_;
SetFinished();
return false;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find the UpdateCceCkptTsCc class definition and understand term_ initialization
cd tx_service && git ls-files | grep -E "\.(h|hpp|cpp)$" | head -20

Repository: eloqdata/tx_service

Length of output: 593


🏁 Script executed:

# Look at the header file for cc_req_misc to understand the class structure
rg -A 30 "class UpdateCceCkptTsCc" --type cpp --type h

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Find other ValidTermCheck implementations to understand the pattern
rg "ValidTermCheck\(" --type cpp --type h -B 2 -A 10

Repository: eloqdata/tx_service

Length of output: 44362


🏁 Script executed:

# Search for callers of UpdateCceCkptTsCc constructor to see how term_ is set
rg "UpdateCceCkptTsCc" --type cpp --type h -B 3 -A 3

Repository: eloqdata/tx_service

Length of output: 5534


🏁 Script executed:

# Find the full UpdateCceCkptTsCc constructor definition
rg -A 20 "UpdateCceCkptTsCc\(" tx_service/include/cc/cc_req_misc.h | head -40

Repository: eloqdata/tx_service

Length of output: 1131


🏁 Script executed:

# Search for where UpdateCceCkptTsCc is constructed
rg "UpdateCceCkptTsCc\s*\(" --type cpp --type h -B 2 -A 5

Repository: eloqdata/tx_service

Length of output: 1091


🏁 Script executed:

# Check the construction site more carefully
rg -B 5 -A 10 "UpdateCceCkptTsCc update_cce_req" --type cpp

Repository: eloqdata/tx_service

Length of output: 1336


🏁 Script executed:

# Check data_sync_task structure for node_group_term_ initialization
rg -A 30 "struct DataSyncTask" tx_service/include/data_sync_task.h | head -50

Repository: eloqdata/tx_service

Length of output: 1647


🏁 Script executed:

# Look for where node_group_term_ is assigned in data_sync_task
rg "node_group_term_" tx_service/include/data_sync_task.h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 616


🏁 Script executed:

# Check if there's documentation or patterns showing when node_group_term_ is set to -1 vs positive values
rg "node_group_term_.*=" tx_service/include/data_sync_task.h -B 3 -A 3

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look for where DataSyncTask is instantiated to see what ng_term is passed
rg "DataSyncTask\(" --type cpp -B 3 -A 8 | head -60

Repository: eloqdata/tx_service

Length of output: 4315


🏁 Script executed:

# Check other const ValidTermCheck implementations to understand the pattern better
rg -A 8 "bool ValidTermCheck\(\) const" --type h

Repository: eloqdata/tx_service

Length of output: 1317


🏁 Script executed:

# Check if DataSyncTask constructor validates ng_term > 0
rg -A 50 "DataSyncTask::DataSyncTask" tx_service/src/cc/local_cc_shards.cpp | head -80

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check calls to EnqueueRangeDataSyncTask and EnqueueHashPartitionDataSyncTask to see what ng_term is passed
rg "EnqueueRangeDataSyncTask\|EnqueueHashPartitionDataSyncTask" --type cpp -B 3 -A 3 | head -100

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Specifically search for assertions or checks on node_group_term_ or ng_term in data_sync_task initialization
rg "assert.*ng_term\|assert.*node_group_term_" --type cpp --type h -B 2 -A 2

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look at the context where UpdateCceCkptTsCc is constructed more carefully
rg -B 20 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | tail -30

Repository: eloqdata/tx_service

Length of output: 963


🏁 Script executed:

# Check the data_sync_task that's being passed
rg -B 50 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | grep -A 50 "if (cce_entries_map.size" | head -20

Repository: eloqdata/tx_service

Length of output: 182


🏁 Script executed:

# Look for the actual PostProcessFlushTaskEntries function context
rg -B 10 -A 30 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | head -60

Repository: eloqdata/tx_service

Length of output: 1423


🏁 Script executed:

# Get more context of PostProcessFlushTaskEntries to see how ng_term is handled
rg -A 80 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | head -100

Repository: eloqdata/tx_service

Length of output: 3355


🏁 Script executed:

# Look for where UpdateCceCkptTsCc is enqueued with this ng_term
rg -A 150 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | grep -A 30 "UpdateCceCkptTsCc"

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Get the full PostProcessFlushTaskEntries function
rg -A 250 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | tail -150

Repository: eloqdata/tx_service

Length of output: 5846


🏁 Script executed:

# Search more specifically for UpdateCceCkptTsCc in the function
sed -n '2949,3188p' tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 8890


🏁 Script executed:

# Let's find where UpdateCceCkptTsCc is instantiated to see what term values are passed
rg -B 10 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | head -25

Repository: eloqdata/tx_service

Length of output: 545


🏁 Script executed:

# Check if there's any code that could create a DataSyncTask with a -1 term
rg "node_group_term_\s*=" tx_service/src/cc/local_cc_shards.cpp | head -20

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Check term_ field in UpdateCceCkptTsCc to see if there's any assertion or check
rg -B 5 -A 15 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | tail -30

Repository: eloqdata/tx_service

Length of output: 547


🏁 Script executed:

# Get the complete instantiation of UpdateCceCkptTsCc with all parameters
rg -A 4 "UpdateCceCkptTsCc update_cce_req\(" tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 324


🏁 Script executed:

# Check if the data_sync_task is guaranteed to have a positive term by searching how it's constructed
rg -B 20 "EnqueueRangeDataSyncTask" tx_service/src/cc/local_cc_shards.cpp | grep -A 20 "DataSyncTask"

Repository: eloqdata/tx_service

Length of output: 1334


🏁 Script executed:

# Look for all places where UpdateCceCkptTsCc is instantiated
rg "UpdateCceCkptTsCc" tx_service/src/cc/local_cc_shards.cpp -B 5 -A 8

Repository: eloqdata/tx_service

Length of output: 1563


🏁 Script executed:

# Check if there are any DataSyncTask creations with ng_term that might be -1 or uninitialized
rg "DataSyncTask\(" tx_service/src/cc/local_cc_shards.cpp -A 8 | grep -B 5 -A 3 "ng_term"

Repository: eloqdata/tx_service

Length of output: 1900


🏁 Script executed:

# Look for the actual field declarations in UpdateCceCkptTsCc to see term_ initialization
rg -A 50 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | grep -A 5 "term_\|int64_t"

Repository: eloqdata/tx_service

Length of output: 764


🏁 Script executed:

# Check if there's an assertion or check before UpdateCceCkptTsCc is constructed
rg -B 30 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | grep -A 30 "if (cce_entries_map.size"

Repository: eloqdata/tx_service

Length of output: 182


🏁 Script executed:

# Check if there's any place where node_group_term_ might be left as -1 in DataSyncTask before being used for UpdateCceCkptTsCc
rg -B 5 -A 5 "data_sync_task_->node_group_term_" tx_service/src/cc/local_cc_shards.cpp | head -40

Repository: eloqdata/tx_service

Length of output: 554


🏁 Script executed:

# Check if term_ field in UpdateCceCkptTsCc has any member initialization or assertions
rg -A 100 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | tail -50

Repository: eloqdata/tx_service

Length of output: 1235


🏁 Script executed:

# Verify if there's any code path where UpdateCceCkptTsCc can be created with term_ < 0
rg "UpdateCceCkptTsCc" tx_service --type cpp -B 15 -A 3 | grep -A 15 "node_group_term_"

Repository: eloqdata/tx_service

Length of output: 240


🏁 Script executed:

# Check if term_ could potentially be -1 by looking at how DataSyncTask is initialized
rg "DataSyncTask(" tx_service/src/cc/local_cc_shards.cpp -A 12 | grep -B 5 -A 7 "ng_term"

Repository: eloqdata/tx_service

Length of output: 139


🏁 Script executed:

# Verify the concern: check if there's any scenario where term_ is not explicitly set or could be -1
rg -A 5 "node_group_term_\{" tx_service/include/data_sync_task.h

Repository: eloqdata/tx_service

Length of output: 185


🏁 Script executed:

# Found the issue! node_group_term_ defaults to -1. Now let's check if DataSyncTask is ever created without explicitly setting ng_term
rg "std::make_shared<DataSyncTask>" tx_service/src/cc/local_cc_shards.cpp -A 15 | head -50

Repository: eloqdata/tx_service

Length of output: 2177


🏁 Script executed:

# Check all DataSyncTask constructor calls to see if ng_term is always provided
rg "make_shared.*DataSyncTask\|std::shared_ptr.*DataSyncTask" tx_service/src/cc/local_cc_shards.cpp -A 5

Repository: eloqdata/tx_service

Length of output: 45


🏁 Script executed:

# Look at the DataSyncTask constructors more carefully
rg -A 30 "DataSyncTask::" tx_service/src/cc/data_sync_task.cpp | head -80

Repository: eloqdata/tx_service

Length of output: 139


🏁 Script executed:

# Let's check what values are passed for ng_term when EnqueueRangeDataSyncTask is called
rg "EnqueueRangeDataSyncTask" tx_service/src/cc/local_cc_shards.cpp -B 5 -A 1 | grep -B 5 "ng_"

Repository: eloqdata/tx_service

Length of output: 556


🏁 Script executed:

# Check if there's any DataSyncTask created without explicit ng_term parameter (would use default -1)
rg "std::make_shared.*DataSyncTask\|DataSyncTask(" tx_service/include/data_sync_task.h -A 15 | head -60

Repository: eloqdata/tx_service

Length of output: 203


🏁 Script executed:

# Review the key creation site again - does node_group_term_ ever stay at default -1?
rg -B 30 "if \(cce_entries_map.size\(\) > 0\)" tx_service/src/cc/local_cc_shards.cpp | tail -40

Repository: eloqdata/tx_service

Length of output: 1316


Align ValidTermCheck() with const-variant patterns by adding an assertion.

The const ValidTermCheck() lacks a safeguard that similar implementations in cc_request.h use. Add assert(term_ > 0); before the term comparison to catch cases where term_ might not be properly initialized, consistent with how other const ValidTermCheck() methods guard against uninitialized stored terms.

🤖 Prompt for AI Agents
In `@tx_service/src/cc/cc_req_misc.cpp` around lines 1171 - 1199, The const method
UpdateCceCkptTsCc::ValidTermCheck() is missing the assert guard used by other
const variants; add assert(term_ > 0); near the start of ValidTermCheck (before
the comparison against current_term) so uninitialized/invalid term_ values are
caught early; ensure the assertion is placed in the body of
UpdateCceCkptTsCc::ValidTermCheck() prior to the if (current_term < 0 ||
current_term != term_) check and that <cassert> is available in the translation
unit.
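A minimal sketch of the suggested guard, assuming the same semantics as the surrounding review (stored `term_` defaults to an invalid value and must be set before the check runs). The free-function form and parameter names are illustrative, not the actual method signature.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative version of the guarded check: assert that the stored term
// was initialized (default is -1 in the codebase) before comparing it
// against the resolved current term.
bool ValidTermCheck(int64_t term_, int64_t current_term)
{
    assert(term_ > 0);  // catch an uninitialized stored term early
    return current_term >= 0 && current_term == term_;
}
```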

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/include/cc/cc_request.h (1)

4167-4204: ⚠️ Potential issue | 🟠 Major

RunOnCandidateNode flag is currently unused in term checks.

RangePartitionDataSyncScanCc stores run_on_candidate_node_ but ValidTermCheck always includes candidate term. This undermines the flag’s purpose and can allow candidate execution even when not intended.

Possible fix
-        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
-        int64_t candidate_ng_term =
-            Sharder::Instance().CandidateLeaderTerm(node_group_id_);
-        int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
-        int64_t current_term =
-            std::max({cc_ng_term, standby_node_term, candidate_ng_term});
+        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
+        int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
+        int64_t current_term = std::max(cc_ng_term, standby_node_term);
+        if (run_on_candidate_node_)
+        {
+            int64_t candidate_ng_term =
+                Sharder::Instance().CandidateLeaderTerm(node_group_id_);
+            current_term = std::max(current_term, candidate_ng_term);
+        }

Also applies to: 4239-4261, 4324-4355, 4560-4563, 4739-4739

🤖 Fix all issues with AI agents
In `@tx_service/include/cc/cc_request.h`:
- Around line 565-585: The ValidTermCheck currently unconditionally accepts
candidate leader terms by taking the max of Sharder::Instance().LeaderTerm and
CandidateLeaderTerm; add an explicit boolean flag (e.g., allow_run_on_candidate_
or a parameter to ValidTermCheck) and change ValidTermCheck (and any callers) to
only consider CandidateLeaderTerm when that flag is true; update callers such as
AcquireAllCc and PostWriteAllCc to pass/set the flag only when they should
permit running on candidate leaders (mirroring AcquireCc's behavior), and ensure
ng_term_ and the comparison logic remain unchanged when the flag is false so
candidate terms are not accepted implicitly.

In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3201-3204: The CheckLeaderTerm call currently passes 'true'
unconditionally which permits candidate-term acceptance; change the call to only
allow candidate term when the task explicitly allows it (e.g., use
!task->sync_on_leader_ or an explicit allow_candidate flag) so leader-only tasks
still fail fast—update the call site
Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
task->node_group_term_, /*allow_candidate=*/!task->sync_on_leader_) and apply
the same conditional change to the other occurrences (the other CheckLeaderTerm
call sites flagged in the review) so leader-only logic uses false for
allow_candidate.

Comment on lines +565 to +585
bool ValidTermCheck() override
{
int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
int64_t candidate_ng_term =
Sharder::Instance().CandidateLeaderTerm(node_group_id_);
cc_ng_term = std::max(cc_ng_term, candidate_ng_term);

if (ng_term_ < 0)
{
ng_term_ = cc_ng_term;
}

if (cc_ng_term < 0 || cc_ng_term != ng_term_)
{
return false;
}
else
{
return true;
}
}

⚠️ Potential issue | 🟠 Major

Gate candidate-term acceptance behind an explicit flag.

AcquireAllCc and PostWriteAllCc now always accept candidate terms, which bypasses the new “allow_run_on_candidate” intent. If only some flows should run on candidates, add a flag (as in AcquireCc) or otherwise document why these paths are unconditional.

Suggested direction
-        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
-        int64_t candidate_ng_term =
-            Sharder::Instance().CandidateLeaderTerm(node_group_id_);
-        cc_ng_term = std::max(cc_ng_term, candidate_ng_term);
+        int64_t cc_ng_term = -1;
+        if (allow_run_on_candidate_)
+        {
+            cc_ng_term = Sharder::Instance().CandidateLeaderTerm(node_group_id_);
+        }
+        if (cc_ng_term < 0)
+        {
+            cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
+        }

Also applies to: 996-1016

🤖 Prompt for AI Agents
In `@tx_service/include/cc/cc_request.h` around lines 565 - 585, The
ValidTermCheck currently unconditionally accepts candidate leader terms by
taking the max of Sharder::Instance().LeaderTerm and CandidateLeaderTerm; add an
explicit boolean flag (e.g., allow_run_on_candidate_ or a parameter to
ValidTermCheck) and change ValidTermCheck (and any callers) to only consider
CandidateLeaderTerm when that flag is true; update callers such as AcquireAllCc
and PostWriteAllCc to pass/set the flag only when they should permit running on
candidate leaders (mirroring AcquireCc's behavior), and ensure ng_term_ and the
comparison logic remain unchanged when the flag is false so candidate terms are
not accepted implicitly.

Comment on lines 3201 to +3204
 if (!Sharder::Instance().CheckLeaderTerm(
-    task->node_group_id_, task->node_group_term_))
+    task->node_group_id_,
+    task->node_group_term_,
+    true))

⚠️ Potential issue | 🟠 Major

Don’t always allow candidate term in post‑processing checks.

Passing true unconditionally to CheckLeaderTerm can let leader‑only tasks complete after leadership loss (depending on term semantics). Consider gating allow‑candidate by task intent (e.g., !task->sync_on_leader_ or an explicit allow‑candidate flag) so leader‑only tasks still fail fast.

💡 Suggested fix pattern
- if (!Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
-                                         task->node_group_term_,
-                                         true))
+ if (!Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
+                                         task->node_group_term_,
+                                         !task->sync_on_leader_))

Also applies to: 3291-3292, 3394-3395, 3468-3469, 4489-4490

🤖 Prompt for AI Agents
In `@tx_service/src/cc/local_cc_shards.cpp` around lines 3201 - 3204, The
CheckLeaderTerm call currently passes 'true' unconditionally which permits
candidate-term acceptance; change the call to only allow candidate term when the
task explicitly allows it (e.g., use !task->sync_on_leader_ or an explicit
allow_candidate flag) so leader-only tasks still fail fast—update the call site
Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
task->node_group_term_, /*allow_candidate=*/!task->sync_on_leader_) and apply
the same conditional change to the other occurrences (the other CheckLeaderTerm
call sites flagged in the review) so leader-only logic uses false for
allow_candidate.

@yi-xmu yi-xmu linked an issue Feb 2, 2026 that may be closed by this pull request
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3852-3855: The condition incorrectly excludes term 0; update the
check on Sharder::Instance().CandidateLeaderTerm(ng_id) to use >= 0 (not > 0) so
term 0 is treated as a valid candidate leader term; modify the if condition
around CandidateLeaderTerm(ng_id) and ensure
data_sync_task->SetRunOnLeaderNode(false) still executes for any non-negative
term.
🧹 Nitpick comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)

3254-3256: UpdateLastDataSyncTS is monotonic and safe with 0 values; for clarity, skip the call on candidate runs.

Both UpdateLastDataSyncTS implementations guard with if (last_sync_ts > last_sync_ts_), preventing any regression. Passing 0 on candidate runs is functionally safe and will be ignored by the monotonic check. However, to improve clarity and avoid unnecessary function calls, skip the call entirely unless run_on_leader_node_ is true. This aligns with the PR intent and existing patterns (e.g., local_cc_shards.h:986-989).

Apply to: 3254-3256, 3411-3413, 3930-3933, 4514-4516, 5737-5739.
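The monotonic guard the nitpick relies on can be sketched as below. The holder struct is hypothetical; only the `last_sync_ts_` member and the `if (last_sync_ts > last_sync_ts_)` guard mirror the review.

```cpp
#include <cstdint>

// Illustrative holder: the last-sync timestamp only moves forward, so
// passing 0 (as candidate runs do) is a harmless no-op.
struct SyncTsHolder
{
    uint64_t last_sync_ts_{0};

    void UpdateLastDataSyncTS(uint64_t last_sync_ts)
    {
        if (last_sync_ts > last_sync_ts_)
        {
            last_sync_ts_ = last_sync_ts;
        }
    }
};
```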

Comment on lines +3852 to +3855
if (Sharder::Instance().CandidateLeaderTerm(ng_id) > 0)
{
data_sync_task->SetRunOnLeaderNode(false);
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for CandidateLeaderTerm usage to understand term semantics
rg "CandidateLeaderTerm" -C 3 -n

Repository: eloqdata/tx_service

Length of output: 36871


🏁 Script executed:

#!/bin/bash
# Look for term-related constants (INIT_TERM, etc.)
rg "INIT_TERM|MIN_TERM|INVALID_TERM" -n

Repository: eloqdata/tx_service

Length of output: 1442


🏁 Script executed:

#!/bin/bash
# Search for other candidate term comparisons to see common patterns
rg "CandidateLeaderTerm.*[<>=]" -n

Repository: eloqdata/tx_service

Length of output: 2199


🏁 Script executed:

#!/bin/bash
# Check the context around line 3852 to understand the logic better
sed -n '3840,3870p' tx_service/src/cc/local_cc_shards.cpp

Repository: eloqdata/tx_service

Length of output: 1098


Change CandidateLeaderTerm(ng_id) > 0 to >= 0 to include term 0. Term 0 is a valid candidate leader term (INIT_TERM = -1 is invalid). Other validity checks in the codebase use >= 0 for this purpose (e.g., line 1595 assert, line 275 in tx_execution.h), making > 0 inconsistent and causing it to incorrectly skip the first candidate leader term.

🤖 Prompt for AI Agents
In `@tx_service/src/cc/local_cc_shards.cpp` around lines 3852 - 3855, The
condition incorrectly excludes term 0; update the check on
Sharder::Instance().CandidateLeaderTerm(ng_id) to use >= 0 (not > 0) so term 0
is treated as a valid candidate leader term; modify the if condition around
CandidateLeaderTerm(ng_id) and ensure data_sync_task->SetRunOnLeaderNode(false)
still executes for any non-negative term.
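The off-by-one the comment describes reduces to a one-line predicate. `INIT_TERM = -1` is taken from the review above; the helper name is illustrative.

```cpp
#include <cstdint>

// INIT_TERM is -1 in the codebase; term 0 is the first valid term, so the
// validity predicate must be `>= 0`, not `> 0`, or the first candidate
// leader term is incorrectly skipped.
constexpr int64_t INIT_TERM = -1;

bool IsValidCandidateTerm(int64_t term)
{
    return term >= 0;
}
```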

@yi-xmu yi-xmu requested a review from liunyl February 2, 2026 10:32


Development

Successfully merging this pull request may close these issues.

Allow checkpointing to be performed on the candidate leader node.

1 participant