Skip to content

Fix recent bugs#362

Closed
liunyl wants to merge 1 commit intomainfrom
store_ci
Closed

Fix recent bugs#362
liunyl wants to merge 1 commit intomainfrom
store_ci

Conversation

@liunyl
Copy link
Contributor

@liunyl liunyl commented Jan 14, 2026

  1. Abort blocking cmd does not try to recycle key lock struct after abort.
  2. Range partition data sync did not return mem quota after scan fail or flush fail

Summary by CodeRabbit

  • Bug Fixes
    • Improved transaction coordination state management to ensure consistent reset behavior during operations.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 14, 2026

Walkthrough

This change adds two unconditional reset statements to the RangePartitionDataSyncScanCc::SliceCoordinator::Reset() method, resetting min_paused_key_ to a default TxKey and min_paused_slice_index_ to 0.

Changes

Cohort / File(s) Summary
State Reset in SliceCoordinator
tx_service/include/cc/cc_request.h
Adds initialization of min_paused_key_ and min_paused_slice_index_ in the Reset() method to ensure consistent state cleanup.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

  • Bug fix #368: Both PRs modify RangePartitionDataSyncScanCc::SliceCoordinator's handling of min_paused_key_ and min_paused_slice_index_ (reset/initialization).
  • Fix datasyncscan paused key with debug log #332: Both PRs modify paused-key handling in RangePartitionDataSyncScanCc, with this PR resetting min_paused_key/index on Reset while the other fixes pause key computation/validation.

Suggested reviewers

  • yi-xmu
  • xiexiaoy

Poem

🐰 Two resets, clean and bright,
Keys and indices set right,
Paused state flows with care,
Reset() clears the air! ✨

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description identifies two specific bugs but does not follow the template structure with required sections like tests, documentation, issue references, or test suite confirmation. Add missing template sections: reference the issue using 'fixes eloqdb/tx_service#issue_id', confirm test execution with './mtr --suite=mono_main,mono_multi,mono_basic', and document any changes made.
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Fix recent bugs' is vague and does not clearly summarize the specific changes; it lacks descriptive detail about what bugs were fixed or which components are affected. Consider using a more specific title that describes the actual bugs fixed, such as 'Fix key lock recycling and memory quota return in range partition sync' or similar.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch store_ci

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tx_service/include/cc/cc_entry.h (1)

367-379: Fix the return value: the function should return ret instead of always true.

The fix correctly adds RecycleKeyLock(*ccs) when the abort succeeds, addressing the PR objective. However, the function now always returns true regardless of whether the inner AbortBlockCmdRequest actually found and processed the blocked request.

The inner NonBlockingLock::AbortBlockCmdRequest returns false when the transaction is not found in either queue. The caller in object_cc_map.h (line 408) explicitly checks for this failure case and logs a warning:

if (!succeed) {
    DLOG(WARNING) << "AbortBlockCmdRequest fail to find tx in queue_block_cmds_ and blocking_queue_...";
}

With the current change, this warning can never trigger. The function should return ret to preserve the original semantic meaning—that false indicates the abort did not find a matching request.

tx_service/src/cc/local_cc_shards.cpp (1)

3490-3493: Change unpin condition from > 0 to >= 0 to avoid pin leak when term equals 0.

Line 3490-3493 uses if (ng_term > 0) to guard UnpinNodeGroupData(), but TryPinNodeGroupData() returns a term value where >= 0 indicates success. This creates a resource leak when term == 0, a valid pinned state. Other calls in the same file guard with >= 0 (lines 3303, 3352, 3613), making this inconsistent.

Fix
-    if (ng_term > 0)
+    if (ng_term >= 0)
     {
         Sharder::Instance().UnpinNodeGroupData(task->node_group_id_);
     }
🧹 Nitpick comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)

4314-4324: Potential quota-accounting bug if “empty after erase” still retains carry-over records

Deallocating flush_data_size here is only correct if no records are retained for the next loop via data_sync_vecs[cc_shards_.size()] / archive_vecs[...] / mv_base_vecs[...]. If this branch can happen with non-empty “remaining” vectors, quota would be returned while memory is still held (and subsequent iterations also won’t re-account that carry-over memory).

Proposed guard (assert or conditional)
             if (data_sync_vec->empty())
             {
                 LOG(WARNING) << "data_sync_vec becomes empty after erase, old "
                                 "size of data_sync_vec_size: "
                              << data_sync_vec_size;
                 // Reset
                 scan_cc.Reset();
+                // Ensure we are not retaining carry-over records while returning quota.
+                assert(data_sync_vecs[cc_shards_.size()].empty());
+                assert(archive_vecs[cc_shards_.size()].empty());
+                assert(mv_base_vecs[cc_shards_.size()].empty());
                 // Return the quota to flush data memory usage pool since the
                 // flush data task is not put into flush worker.
                 data_sync_mem_controller_.DeallocateFlushMemQuota(
                     flush_data_size);
                 continue;
             }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25dd6a3 and e41c385.

📒 Files selected for processing (2)
  • tx_service/include/cc/cc_entry.h
  • tx_service/src/cc/local_cc_shards.cpp
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
📚 Learning: 2025-11-11T07:10:40.346Z
Learnt from: lzxddz
Repo: eloqdata/tx_service PR: 199
File: include/cc/local_cc_shards.h:233-234
Timestamp: 2025-11-11T07:10:40.346Z
Learning: In the LocalCcShards class in include/cc/local_cc_shards.h, the EnqueueCcRequest methods use `shard_code & 0x3FF` followed by `% cc_shards_.size()` to distribute work across processor cores for load balancing. This is intentional and separate from partition ID calculation. The 0x3FF mask creates a consistent distribution range (0-1023) before modulo by actual core count.

Applied to files:

  • tx_service/src/cc/local_cc_shards.cpp
🔇 Additional comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)

4364-4376: Good: return flush quota when breaking before enqueueing flush work

This closes the leak when a previous flush failure aborts the scan loop before the batch is handed off to flush workers. One nuance: DeallocateFlushMemQuota() is called while holding flight_task_mux_; if DeallocateFlushMemQuota() can block/contend, consider moving it outside the lock.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/include/cc/cc_request.h (1)

4421-4435: Initialize SliceCoordinator::min_paused_slice_index_ (and clarify Reset semantics).

With the union removed, min_paused_slice_index_ is now always a live size_t (Line 4528) but is only assigned in the non-continuous branch of the ctor (Line 4431-4433). In continuous mode it’s left uninitialized, which is an unnecessary UB footgun if it’s ever read (even accidentally / via future refactor / logging / sanitizer).

Also worth double-checking whether SliceCoordinator::Reset() (Line 4461+) is intended to preserve or reset the “min paused” state; right now it preserves both.

Proposed fix (deterministic initialization)
 struct SliceCoordinator
 {
     static constexpr uint16_t MaxBatchSliceCount = 128;

     SliceCoordinator(bool is_continuous,
                      std::vector<std::pair<TxKey, bool>> *slices_keys)
         : is_continuous_(is_continuous), slices_keys_ptr_(slices_keys)
     {
         pinned_slices_.reserve(MaxBatchSliceCount);
-        if (is_continuous_)
-        {
-            min_paused_key_ = TxKey();
-        }
-        else
-        {
-            min_paused_slice_index_ = 0;
-        }
+        // Always initialize both fields since both are now always present.
+        min_paused_key_ = TxKey();
+        min_paused_slice_index_ = 0;
         batch_end_slice_index_ = 0;
     }
@@
-    TxKey min_paused_key_;
-    size_t min_paused_slice_index_;
+    TxKey min_paused_key_{};
+    size_t min_paused_slice_index_{0};
     std::vector<std::pair<TxKey, bool>> *slices_keys_ptr_{nullptr};

Also applies to: 4461-4469, 4527-4529

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d9d61b7 and de62617.

📒 Files selected for processing (1)
  • tx_service/include/cc/cc_request.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/cc/cc_request.h

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/include/cc/cc_request.h (1)

4385-4402: Guard TxKey comparisons against a default-initialized min_paused_key_ to avoid UB.

When min_paused_key_ is default-constructed, GetConstPtr() returns nullptr. The comparison *key < slice_coordinator_.min_paused_key_ calls interface_->LessThan(key_ptr, nullptr), which attempts to cast and dereference a null pointer in the lambda, causing undefined behavior. A guard prevents this by treating the default (negative infinity) case specially.

Proposed fix
 void UpdateMinPausedSlice(const TxKey *key)
 {
     assert(key->KeyPtr() != nullptr);
     assert(export_base_table_item_);
-    if (*key < slice_coordinator_.min_paused_key_)
+    if (slice_coordinator_.min_paused_key_.KeyPtr() == nullptr ||
+        *key < slice_coordinator_.min_paused_key_)
     {
         slice_coordinator_.min_paused_key_ =
             std::move(key->GetShallowCopy());
     }
 }
🧹 Nitpick comments (1)
tx_service/include/cc/cc_request.h (1)

4417-4465: SliceCoordinator cleanup is good; consider clarifying/strengthening init invariants for min_paused_ (post-union).*
Removing the anonymous union + destructor reduces lifetime footguns, but now correctness depends on what “empty” TxKey() and min_paused_slice_index_ = 0 mean (esp. across Reset/reuse). If 0 / empty-key are sentinels, please document that (or encode it explicitly).

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de62617 and 49e2619.

📒 Files selected for processing (2)
  • store_handler/eloq_data_store_service/eloqstore
  • tx_service/include/cc/cc_request.h
✅ Files skipped from review due to trivial changes (1)
  • store_handler/eloq_data_store_service/eloqstore
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 357
File: tx_service/src/tx_operation.cpp:4174-4195
Timestamp: 2026-01-12T13:27:46.584Z
Learning: In tx_service/src/tx_operation.cpp, SplitFlushRangeOp::Forward must only directly recycle (Finish(false), ClearInfos(), pop, return the op to the pool) when op_ == nullptr; otherwise let the state machine unwind to avoid recycling while sub-ops (including AsyncOp workers) may still be running.
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
📚 Learning: 2025-10-20T04:30:07.884Z
Learnt from: liunyl
Repo: eloqdata/tx_service PR: 149
File: include/cc/cc_request.h:1876-1927
Timestamp: 2025-10-20T04:30:07.884Z
Learning: ScanNextBatchCc in include/cc/cc_request.h is used only for hash-partition scans; range-partition scans are handled by ScanSliceCc.

Applied to files:

  • tx_service/include/cc/cc_request.h
🔇 Additional comments (1)
tx_service/include/cc/cc_request.h (1)

4455-4504: SliceCoordinator::Reset() clears only preparation-phase state; no behavioral change to resume logic.

Reset() clears min_paused_key_ and min_paused_slice_index_ only as part of initializing a new scan. These fields are used exclusively during the one-time slice preparation phase (Phase 1). Actual batch-to-batch resumption during parallel scanning (Phase 2) relies on the separate pause_pos_[core_id] tracking, which is never cleared by SliceCoordinator::Reset(). No impact on resume semantics.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants