fix: reorder BackFill operations to prevent dirty_data_key_count_ underflow #465

Open
githubzilla wants to merge 1 commit into main from fix/backfill-dirty-key-count-underflow
githubzilla (Collaborator) commented Mar 18, 2026

Summary

Fix an intermittent assertion failure in AdjustDataKeyStats (cc_shard.cpp:428) where dirty_data_key_count_ underflows during cluster scale-out (test_add_node_with_double_write).

Root Cause

BackFill() in TemplateCcMap (template_cc_map.h) called operations in the wrong order:

  1. SetCkptTs(commit_ts): sets the flush bit (marks the entry as persistent)
  2. OnFlushed(): runs dirty-key accounting (sees the entry as persistent, so no dirty increment)
  3. SetCommitTsPayloadStatus(commit_ts, status): overwrites commit_ts_and_status_ entirely, clearing the flush bit

After step 3, the entry is dirty again, but dirty_data_key_count_ was never incremented for it. When a later checkpoint flushes the entry and decrements the counter, the counter underflows and triggers the assertion.
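The sequence above can be reproduced in a small toy model. This is only a sketch under stated assumptions: `ToyEntry`, `kFlushBit`, the bit layout, and the signed counter are hypothetical stand-ins, not the real tx_service types (the real counter trips an assertion rather than going negative).

```cpp
#include <cassert>
#include <cstdint>

namespace backfill_bug_sketch {

// Assumption: the top bit of one packed 64-bit word acts as the flush bit.
constexpr uint64_t kFlushBit = 1ULL << 63;

struct ToyEntry {
    uint64_t commit_ts_and_status = 0;

    bool IsPersistent() const {
        return (commit_ts_and_status & kFlushBit) != 0;
    }

    // Sets the flush bit, marking the entry as persistent.
    void SetCkptTs(uint64_t /*ckpt_ts*/) { commit_ts_and_status |= kFlushBit; }

    // Overwrites the whole word, so a previously set flush bit is lost.
    void SetCommitTsPayloadStatus(uint64_t commit_ts, uint8_t status) {
        commit_ts_and_status = (commit_ts << 8) | status;
    }
};

// Replays the old BackFill order followed by a checkpoint flush.
// Returns the dirty-key counter; a signed type makes the underflow visible.
inline int64_t BuggyBackFillThenCheckpoint() {
    ToyEntry entry;
    int64_t dirty_data_key_count = 0;

    entry.SetCkptTs(100);                    // step 1: flush bit set
    if (!entry.IsPersistent()) {             // step 2: accounting sees a
        ++dirty_data_key_count;              //         persistent entry, no increment
    }
    entry.SetCommitTsPayloadStatus(100, 1);  // step 3: flush bit cleared

    // Later checkpoint: the entry looks dirty, so the counter is decremented.
    if (!entry.IsPersistent()) {
        --dirty_data_key_count;
        entry.SetCkptTs(100);
    }
    return dirty_data_key_count;
}

}  // namespace backfill_bug_sketch
```

Calling `backfill_bug_sketch::BuggyBackFillThenCheckpoint()` from a `main` function yields a counter below zero, which corresponds to the underflow that the assertion in AdjustDataKeyStats catches.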

Changes

1. template_cc_map.h: BackFill() (primary fix)

Reordered operations so SetCommitTsPayloadStatus() runs before SetCkptTs():

  • SetCommitTsPayloadStatus(commit_ts, status) — updates the payload and commit timestamp
  • SetCkptTs(commit_ts) — sets the flush bit after, so it is preserved
  • OnFlushed() + OnCommittedUpdate() — reconciles dirty-key stats with the entry in its correct final state

This matches the correct ordering already used in ObjectCcMap::BackFill().
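The reordered sequence can be sketched in a toy model as follows. The types, bit layout, and counter handling here are hypothetical illustrations, not the actual TemplateCcMap code; the point is only that setting the flush bit after the overwrite leaves the entry persistent and the counter balanced.

```cpp
#include <cassert>
#include <cstdint>

namespace backfill_fix_sketch {

// Assumption: the top bit of one packed 64-bit word acts as the flush bit.
constexpr uint64_t kFlushBit = 1ULL << 63;

struct ToyEntry {
    uint64_t commit_ts_and_status = 0;

    bool IsPersistent() const {
        return (commit_ts_and_status & kFlushBit) != 0;
    }
    void SetCkptTs(uint64_t /*ckpt_ts*/) { commit_ts_and_status |= kFlushBit; }
    void SetCommitTsPayloadStatus(uint64_t commit_ts, uint8_t status) {
        commit_ts_and_status = (commit_ts << 8) | status;  // full overwrite
    }
};

struct Outcome {
    bool persistent;
    int64_t dirty_data_key_count;
};

// Replays the reordered BackFill followed by a checkpoint pass.
inline Outcome FixedBackFillThenCheckpoint() {
    ToyEntry entry;
    int64_t dirty_data_key_count = 0;

    entry.SetCommitTsPayloadStatus(100, 1);  // payload and commit ts first
    entry.SetCkptTs(100);                    // flush bit set last, so it survives
    if (!entry.IsPersistent()) {             // accounting sees the final state
        ++dirty_data_key_count;
    }

    // Later checkpoint: the entry is already persistent, nothing to decrement.
    if (!entry.IsPersistent()) {
        --dirty_data_key_count;
    }
    return {entry.IsPersistent(), dirty_data_key_count};
}

}  // namespace backfill_fix_sketch
```

In this model the entry ends up persistent and the counter stays at zero, because the accounting step and the later checkpoint both observe the same final state.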

2. template_cc_map.h: RemoteReadOutsideCc()

Added an OnCommittedUpdate() call for consistency with the other backfill paths. This is a defensive change that ensures dirty-key stats are properly reconciled.

3. cc_req_misc.cpp: UpdateCceCkptTsCc::Execute()

Relaxed 4 assertions (versioned/non-versioned × range/non-range) from:

assert(v_entry->CommitTs() > 1 && !v_entry->IsPersistent());

to:

assert(v_entry->CommitTs() > 1);

With the BackFill fix, backfilled entries are now correctly marked persistent. But a checkpoint callback that was already scheduled before BackFill ran can still arrive and find the entry already persistent. This is benign — the existing was_dirty / OnEntryFlushed logic handles it correctly (if the entry is already persistent, was_dirty is false and no double-decrement occurs).
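A minimal sketch of why the late callback is harmless, assuming a guard of the shape described above. The `Entry` and `Shard` structs and the `OnEntryFlushed(bool)` signature are hypothetical simplifications, not the real cc_req_misc.cpp code.

```cpp
#include <cassert>
#include <cstdint>

namespace ckpt_callback_sketch {

struct Entry {
    uint64_t commit_ts;
    bool persistent;
};

struct Shard {
    int64_t dirty_data_key_count;

    // Only entries that were dirty before the flush reduce the counter.
    void OnEntryFlushed(bool was_dirty) {
        if (was_dirty) {
            --dirty_data_key_count;
        }
    }
};

// Hypothetical stand-in for the checkpoint callback with the relaxed assertion.
inline void UpdateCkptTs(Entry &entry, Shard &shard) {
    assert(entry.commit_ts > 1);  // no !IsPersistent() requirement any more
    const bool was_dirty = !entry.persistent;
    entry.persistent = true;
    shard.OnEntryFlushed(was_dirty);
}

// Callback arrives after BackFill already marked the entry persistent.
inline int64_t LateCallbackOnPersistentEntry() {
    Shard shard{1};           // one unrelated dirty key is tracked
    Entry entry{100, true};   // already persistent
    UpdateCkptTs(entry, shard);
    return shard.dirty_data_key_count;
}

// Callback arrives for an entry that is still dirty.
inline int64_t CallbackOnDirtyEntry() {
    Shard shard{1};           // this entry is the tracked dirty key
    Entry entry{100, false};
    UpdateCkptTs(entry, shard);
    return shard.dirty_data_key_count;
}

}  // namespace ckpt_callback_sketch
```

In this model the late callback leaves the counter untouched, while the normal path decrements it exactly once.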

Testing

Ran test_add_node_with_double_write (cluster scale-out test) 6 consecutive times — all passed with zero assertion failures in any node logs:

| Run | Result | Time |
|-----|--------|------|
| 1 | OK | 589s |
| 2 | OK | 560s |
| 3 | OK | 583s |
| 4 | OK | 665s |
| 5 | OK | 621s |
| 6 | OK | 603s |

Before the fix, this test would intermittently crash with the dirty_data_key_count_ underflow assertion.

Summary by CodeRabbit

  • Chores
    • Internal improvements to transaction service state management and checkpoint handling for enhanced reliability during concurrent operations.

…erflow

BackFill() in TemplateCcMap called SetCkptTs() (which sets the flush
bit) before SetCommitTsPayloadStatus() (which overwrites the entire
commit_ts_and_status_ field, clearing the flush bit).  This left
backfilled entries dirty without a corresponding dirty-count increment.
When those entries were later flushed by checkpoint, the decrement
caused dirty_data_key_count_ to underflow, triggering the assertion in
AdjustDataKeyStats.

Fix:
1. template_cc_map.h BackFill(): call SetCommitTsPayloadStatus()
   before SetCkptTs() so the flush bit set by SetCkptTs() is
   preserved.  Add OnCommittedUpdate() to reconcile dirty-key stats.
2. template_cc_map.h RemoteReadOutsideCc(): add OnCommittedUpdate()
   for consistency with other backfill paths.
3. cc_req_misc.cpp UpdateCceCkptTsCc::Execute(): relax 4 assertions
   that checked !IsPersistent() — BackFill can legitimately mark
   entries as persistent before the checkpoint callback fires.  The
   existing was_dirty/OnEntryFlushed logic already handles this case
   correctly.
coderabbitai bot commented Mar 18, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 1a4729b and 60c8665.

📒 Files selected for processing (2)
  • tx_service/include/cc/template_cc_map.h
  • tx_service/src/cc/cc_req_misc.cpp

Walkthrough

This PR reorders flush-related operations in the template concurrency control map, ensuring payload updates occur before checkpoint timestamp settings and checkpoint flushing. Assertions in the commit-timestamp callback are relaxed to permit concurrent BackFill operations that may have already marked entries as flushed.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Flush and Commit Ordering: tx_service/include/cc/template_cc_map.h | Reordered operations in Execute and BackFill paths to perform payload updates before SetCkptTs, with OnFlushed and OnCommittedUpdate calls repositioned accordingly. Added comments clarifying that flush bits should not be cleared by payload updates and checkpointing must occur post-update. |
| Checkpoint Assertion Relaxation: tx_service/src/cc/cc_req_misc.cpp | Removed !IsPersistent() assertion requirement across four code paths in UpdateCceCkptTsCc::Execute, allowing concurrent BackFill operations to mark entries as flushed before checkpoint callbacks without breaking dirty-key reconciliation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • liunyl
  • MrGuin

Poem

🐰 Flush and dirty states now dance in the right order—
Payloads bow before the checkpoint's honor.
Assertions soften to allow concurrent friends,
While hooks record the journey's proper ends!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title accurately describes the primary bug fix: reordering BackFill operations to prevent dirty_data_key_count_ underflow. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, but the PR template checklist items are not marked as completed (tests, documentation, issue references). |

githubzilla added a commit to githubzilla/eloqkv that referenced this pull request Mar 18, 2026
…erflow

Update data_substrate submodule to include the fix for an intermittent
assertion failure in AdjustDataKeyStats where dirty_data_key_count_
underflows during cluster scale-out.

See eloqdata/tx_service#465 for details.