Skip to content

LO-LRU: fix UpdateLruList early-return bug + add large↔small state-flip tests#452

Draft
Copilot wants to merge 5 commits intomainfrom
copilot/add-large-obj-page-state-tests
Draft

LO-LRU: fix UpdateLruList early-return bug + add large↔small state-flip tests#452
Copilot wants to merge 5 commits intomainfrom
copilot/add-large-obj-page-state-tests

Conversation

Copy link

Copilot AI commented Mar 12, 2026

Supplements PR #426 (LO-LRU) with state-flip coverage and fixes a partition-invariant bug discovered during test authoring.

Bug fix — CcShard::UpdateLruList (cc_shard.cpp)

When a page already sitting just-before-tail had large_obj_page_ flipped false→true (small→large), the early-return path skipped the protected_head_page_ update. Result: protected_head_page_ stayed at tail_ccp_ even with a large page in the list, putting that page in the "small" eviction segment — the opposite of what LO-LRU intends.

Fix: in the early-return block, if policy is LO_LRU and the page is now large and protected_head_page_==tail, set protected_head_page_=page.

// early-return block in UpdateLruList
if (GetCacheEvictPolicy() == CacheEvictPolicy::LO_LRU &&
    page->large_obj_page_ && protected_head_page_ == &tail_ccp_)
{
    protected_head_page_ = page;   // repair invariant after in-place flip
}
return;

New tests — LargeObjLRU-Test.cpp

VerifyLruListInvariant helper — walks the full LRU list asserting:

  • protected_head_page_ is tail or a large page
  • [head, protected_head) contains only small pages
  • [protected_head, tail) contains only large pages
  • every lru_prev back-pointer is consistent

Six flip scenarios:

Test Scenario Key assertion
Flip A1 Sole large page (protected_head) → small protected_head_page_ resets to tail
Flip A2 protected_head → small, more larges after protected_head_page_ advances to next large
Flip B1 Small page just-before-tail → large (empty protected segment) protected_head_page_ set to that page (exercises bug fix)
Flip B2 Small page → large, protected segment non-empty Page moved to large segment; protected_head_page_ unchanged
Flip A via ELOOPA EnsureLargeObjOccupyPageAlone, entry deleted Page demoted to small; protected_head_page_ resets
Flip B via ELOOPA EnsureLargeObjOccupyPageAlone, single entry exceeds threshold Page promoted to large; protected_head_page_ updated (exercises bug fix)

Minor

  • Fix typo "Invalidate""Invalid" in tx_service_init.cpp error log.

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic
Original prompt

在 repo eloqdata/tx_service 的 PR #426 (Support optional large object LRU policy) 上补足单测覆盖“状态翻转(page large_obj_page_ true/false 的翻转)”场景。

背景/现状:

  • 该 PR 引入 LO-LRU:LRU list 按 small pages + large pages 分区,使用 protected_head_page_ 作为分割点。
  • Page 上新增 bool large_obj_page_。
  • EnsureLargeObjOccupyPageAlone() 会根据 entry serialized size 与阈值决定:
    • 迁移 entry 到新 page 并把新 page 标记为 large_obj_page_=true(large object occupy one page exclusively);
    • 或在 entry deleted/变小等情况下把 page.large_obj_page_ 置为 false。
  • 现有 tests/LargeObjLRU-Test.cpp 覆盖了插入顺序和基本迁移,但缺少“page 已在 LRU 链上,large<->small 状态翻转”导致 protected_head_page_ 与链表不变量被破坏的边界覆盖。

需求:

  1. 在 tests/LargeObjLRU-Test.cpp 中新增测试用例(或新增测试文件亦可,但优先放在现有 LargeObjLRU-Test.cpp),覆盖至少以下两类“状态翻转”场景,并且要验证 LRU 分区不变量:

    • 场景 A:page 初始为 large(large_obj_page_=true)并位于 protected 段,随后因 entry 变小/被删除/被迁移导致 page.large_obj_page_ 被设置为 false(large->small flip),并在 flip 后对该 page 调用 UpdateLruList 或触发 DetachLru/AttachLru 流程,确保:
      • 不触发 assert / 不崩溃;
      • protected_head_page_ 被正确推进/重置;
      • VerifyLruList(或等价检查)通过:small 段中无 large pages,large 段中无 small pages。
    • 场景 B:page 初始为 small(large_obj_page_=false)并位于 small 段,随后因 entry 变大(超过阈值)触发 EnsureLargeObjOccupyPageAlone 迁移/标记导致出现 large page(small->large flip)。需要覆盖:
      • protected 段原本为空(protected_head_page_ == tail)与非空两种子情况至少一种;
      • flip 后调用 UpdateLruList 仍保持分区不变量;
      • 迁移后的新 page 在 large 段,且原 page 若仍含其他 entries 则保持 small。
  2. 测试应尽量使用 PR 里已有的测试 helper(例如 VerifyLruList、CreatePage、InsertEntry/Emplace 等,如果存在),并在断言中明确检查:

    • protected_head_page_ 是否为 tail 或指向 large page;
    • LRU 链表 next/prev 连接关系未断裂(可通过遍历计数/查重/首尾检查)。
  3. 测试必须在 Debug/Release 都稳定:不要依赖触发 assert 来证明正确性。

  4. 如为达成测试需要,对生产代码做最小必要的可测试性改动是允许的(例如暴露只读访问器来读取 protected_head_page_ 或提供测试 hook),但要:

    • 不改变生产行为;
    • 尽量只在测试编译单元可见(例如 friend test / #ifdef UNIT_TEST / 仅加 const getter)。

交付:

  • 更新/新增单测文件及必要的最小生产代码改动;
  • 确保 bazel test/现有 test suite(按 repo 现状的构建系统)通过;
  • 在 PR 中解释新增测试覆盖了哪些 flip 场景,以及为何能防止 protected_head_page_ / LRU 不变量回归。

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: 根据这个PR描述的设计,仔细深刻地review这个PR

Assistant: 下面是基于 PR #426(“Support optional large object LRU policy”)描述的设计与当前代码改动,做一次“按设计逐条对照 + 深挖边界/并发/一致性/可维护性”的 review。重点会放在:LO-LRU 分区不变量是否真的被所有路径维护entry 迁移是否满足 lock/状态契约、以及 ccmp_/page key/LRU 指针在迁移时是否可能被破坏。


1) 先复述一下 PR 设计(我理解的“不变量”)

PR 描述的核心目标:

  • 对象可能非常大(从存储加载很重),希望 大对象尽量常驻内存,相比小对象更不容易被淘汰。
  • 引入一个 LRU 变体:LO-LRU(Large-Object LRU)
    • 逻辑上把 LRU list 分成两段:small pages 段 + large pages 段
    • 回收内存时 先淘汰 small 段,small 段清掉再淘汰 large 段
    • 实现上仍是一个双链表,但维护一个“分割点指针” protected_head_page_
      head --- [small pages] --- protected_head_page_ --- [large pages] --- tail
      
  • “Large object occupies one page exclusively”:
    • 当某条记录 serialized size 超过阈值(lolru_large_obj_threshold_kb),要确保它所在 page 是 “large_obj_page_ == true” 且 page 上只有一个 object(如果原 page >1 entry 就把该 entry 迁移到新 page)。
  • 在 insert/update 的工作流上:FindEmplace 创建 entry + commit payload;在 commit payload 阶段调用 EnsureLargeObjOccupyPageAlone() 来维护上面不变量。

配置项:

  • cache_evict_policy: LRU | LO_LRU(默认 LRU)
  • lolru_large_obj_threshold_kb(默认 1024KB)

2) 代码改动总览(结构上你做了什么)

2.1 配置与策略开关

  • core/src/tx_service_init.cpp
    • 新增 gflags:cache_evict_policylolru_large_obj_threshold_kb
    • 从 cmdline 或 ini 读配置,调用:
      • tx_service_->SetupPolicyLRU()
      • tx_service_->SetupPolicyLoLRU(threshold_kb)
  • tx_service/include/type.h
    • 新增 CacheEvictPolicy enum + 两个 policy struct(带字符串 NAME)
  • tx_service/include/cc/local_cc_shards.h
    • LocalCcShards 增加 policy 字段 + union 存 option
    • SetupPolicyLoLRUlarge_obj_threshold_kb * 1024ull 保存 bytes(这一点你已经修掉了 CodeRabbit 提醒的潜在溢出)

2.2 Page/Entry 迁移基础设施

  • tx_service/include/cc/cc_entry.h
    • LruPage 增加 bool large_obj_page_{false}
    • CcPage 增加:
      • Emplace(key, unique_ptr<CcEntry>):用于迁移后的插入(并维护 smallest_ttl_ / last_dirty_commit_ts_)
      • MoveEntry(idx):move 出 unique_ptr
      • UpdateSmallestTTL():迁移/删除后重算最小 TTL

2.3 LO-LRU 分区指针与插入规则

  • tx_service/include/cc/cc_shard.h / tx_service/src/cc/cc_shard.cpp
    • 增加 LruPage* protected_head_page_,初始化为 &tail_ccp_
    • UpdateLruList() 在 LO_LRU 下:
      • large page 插到 tail 前
      • small page 插到 protected_head_page_ 前(从而不会跑到 large 段之后)
      • 并维护 protected_head_page_ 的移动逻辑

2.4 EnsureLargeObjOccupyPageAlone(核心)

  • tx_service/include/cc/template_cc_map.h
    • 新增 EnsureLargeObjOccupyPageAlone(ccp, cce)
      • 如果 entry Deleted:page.large_obj_page_=false
      • 如果 entry Normal 且 SerializedLength > threshold:
        • 若 page.Size()<=1:page.large_obj_page_=true
        • 否则:从原 page move 出该 entry,创建新 page(large_obj_page_=true),插入 ccmp_,更新 LRU
      • 最后 shard_->UpdateLruList(ccp, false)(对原 page 也更新 LRU)
    • 同时对借 entry / merge page / borrow entry 逻辑增加 “large_obj_page_ 不能参与” 的保护(避免大页/小页互相合并/借用破坏“一页一个大对象”的约束)

2.5 ObjectCcMap 多处调用 EnsureLargeObjOccupyPageAlone

  • tx_service/include/cc/object_cc_map.h
    • 在多个写入/commit/回放相关路径中,只要 policy 是 LO_LRU 就调用 `...

This pull request was created from Copilot chat.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ xiexiaoy
❌ Copilot
You have signed the CLA already but the status is still pending? Let us recheck it.

…ly-return bug

Co-authored-by: xiexiaoy <5558602+xiexiaoy@users.noreply.github.com>
Copilot AI changed the title [WIP] Add tests for page large_obj_page state transitions LO-LRU: fix UpdateLruList early-return bug + add large↔small state-flip tests Mar 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants