
Support multi data store shards#141

Merged
lzxddz merged 2 commits into main from dss_multi_shard-rebase
Nov 21, 2025

Conversation

Collaborator

@lzxddz lzxddz commented Nov 10, 2025

Related PR:
eloqdata/tx_service#199
eloqdata/eloqdoc#298
eloqdata/eloqsql#158
eloqdata/eloqkv#265

Summary by CodeRabbit

  • New Features

    • Shard-aware routing: operations now carry shard context for per-shard reads, writes, scans, archives, snapshots, and shard enumeration.
  • Refactor

    • Per-shard lifecycle and routing: services, scanners, RPCs, datastores and factories manage shards independently; shard context propagated end-to-end.
    • Thread pool initialization and shutdown improved.
  • Chores

    • RPC/messages standardized to single-key blobs and explicit shard IDs; removed legacy bucket/key helpers; refined startup/timing logs.
  • Bug Fixes

    • Several fatal asserts converted to warnings/no-ops to improve runtime resilience.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 10, 2025

Walkthrough

Implements shard-aware routing and per-shard DataShard lifecycle: threads shard_id through client APIs, closures, RPCs, protos, scanners, and internal requests; adds bucket_infos/InitBucketsInfo and GetShardIdByPartitionId; converts partition_id to int32; replaces single global datastore with per-shard DataShard array; downgrades several assert/fatal paths to warnings/no-ops.
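The partition-to-shard resolution the walkthrough describes can be sketched as follows. This is a minimal illustration of the routing idea only — the names `kTotalRangeBuckets`, `BucketOfPartition`, and the modulo mapping are assumptions; the real logic lives in `GetShardIdByPartitionId` and the `bucket_infos_` map built by `InitBucketsInfo`.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: partitions fold into a fixed bucket space, and each
// bucket is assigned to one shard; a lookup then yields the owning shard id.
constexpr uint16_t kTotalRangeBuckets = 1024;  // assumed bucket-space size

uint16_t BucketOfPartition(int32_t partition_id)
{
    // Partitions map into buckets by simple modulo in this sketch.
    return static_cast<uint16_t>(partition_id % kTotalRangeBuckets);
}

uint32_t GetShardIdByPartitionId(
    int32_t partition_id,
    const std::unordered_map<uint16_t, uint32_t> &bucket_to_shard)
{
    auto it = bucket_to_shard.find(BucketOfPartition(partition_id));
    assert(it != bucket_to_shard.end());  // every bucket must be owned
    return it->second;
}
```

Once the shard id is known, the client decides between the local `DataShard` and a remote RPC, as shown in the sequence diagram below.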

Changes

Cohort / File(s) Summary
Core per-shard service
eloq_data_store_service/data_store_service.h, eloq_data_store_service/data_store_service.cpp
Replace global DataStore with std::array<DataShard,1000> data_shards_; add DataShard struct, per-shard lifecycle, per-shard RPC routing and counters; StartService signature simplified; RPCs and handlers now accept and route by shard_id.
Client shard resolution & config
data_store_service_client.h, data_store_service_client.cpp
Add bucket/shard metadata and ownership APIs (InitBucketsInfo, GetShardIdByPartitionId, GetAllDataShards, GetOwnerNodeIndexOfShard, UpdateShardOwner); change TxConfigsToDssClusterConfig signature; propagate shard_id through client APIs; remove legacy Encode/Decode helpers.
Closures & client call-paths
data_store_service_client_closure.h, data_store_service_client_closure.cpp
Change partition_id to int32_t; add shard_id_ and ShardId() accessors; Key() returns std::string_view; PrepareRequest/Reset now store shard_id/key; forward shard_id across callbacks and BatchWrite flows.
Scanner / partition scanning
data_store_service_scanner.h, data_store_service_scanner.cpp
SinglePartitionScanner uses int32_t partition_id_ and new data_shard_id_; Reset signature extended to accept data_shard_id; scanners pass shard into ScanNext/ScanClose; HASH_PARTITION_COUNT set to txservice::total_hash_partitions.
Protobuf / wire
eloq_data_store_service/ds_request.proto
Add shard_id fields to Read/BatchWriteRecords/DeleteRange/Scan; change Read/Scan partition_id to int32; change Read.key_str to single bytes; adjust BatchWriteRecords field ordering.
Internal requests & request impls
eloq_data_store_service/internal_request.h, *LocalRequest/*RpcRequest
Add virtual GetShardId() to request interfaces; store/propagate shard_id_ in local/RPC requests; update Reset/Fill/SetFinish to carry shard_id and call DecreaseWriteReqCount(shard).
RocksDB / cloud & datastore common
eloq_data_store_service/rocksdb_cloud_data_store_factory.h, eloq_data_store_service/rocksdb_cloud_data_store.cpp, eloq_data_store_service/rocksdb_data_store_common.{h,cpp}, eloq_data_store_service/rocksdb_data_store.cpp
Construct per-shard cloud object_path and pass shard_id into cloud store; add RocksDBDataStoreCommon::Shutdown(); remove BuildKey(ReadRequest) overload and use BuildKey(..., key); adjust shutdown ordering and timing/logging.
Thread pool & worker init
eloq_data_store_service/thread_worker_pool.cpp, eloq_data_store_service/thread_worker_pool.h
Introduce Initialize() to move thread startup out of ctor; redesign worker loop and Shutdown join semantics; add Initialize declaration.
Build config
eloq_data_store_service/CMakeLists.txt
Remove USE_ONE_ELOQDSS_PARTITION_ENABLED option and associated compile-definition gating.
Cross-cutting signature changes
data_store_service_client.*, data_store_service_client_closure.*, data_store_service_scanner.*, eloq_data_store_service/internal_request.h, *LocalRequest/*RpcRequest, eloq_data_store_service/*.h
Widespread API/signature updates to pass shard_id into Read/Scan/BatchWrite/DeleteRange/ScanClose and related closures/requests; many partition types changed to int32_t; propagate shard_id throughout call chains and RPC payloads.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DSC as DataStoreServiceClient
    participant CM as ClusterManager
    participant Node as RemoteNode
    participant DSS as DataStoreService
    participant Shard as DataShard

    Client->>DSC: Read(table, partition_id, key)
    DSC->>DSC: shard_id = GetShardIdByPartitionId(partition_id)
    DSC->>CM: resolve owner for shard_id
    alt owner is local
        DSC->>DSS: Read(table, partition_id, shard_id, key)
        DSS->>Shard: data_store_->Read(..., shard_id, ...)
        Shard-->>DSS: result
        DSS-->>DSC: result
    else owner is remote
        DSC->>Node: RPC Read(..., shard_id, ...)
        Node-->>DSC: response
    end
    DSC-->>Client: return result
sequenceDiagram
    participant Main
    participant DSS as DataStoreService
    participant Shard as DataShard[*]

    Main->>DSS: StartService(create_db_if_missing)
    DSS->>DSS: for each shard_id in GetAllDataShards()
    DSS->>Shard: allocate/open DataShard(shard_id)
    Shard-->>DSS: shard_status -> Open/ReadWrite
    DSS-->>Main: Start complete

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

  • Attention areas:
    • Per-shard lifecycle, initialization and shutdown in eloq_data_store_service/data_store_service.{h,cpp}
    • Protobuf wire changes (field reorder and new shard_id) — verify schema compatibility and regenerated stubs
    • Widespread signature changes across closures, internal requests, scanners and RPCs — ensure all call sites and marshaling updated
    • Shard ownership / InitBucketsInfo / TxConfigsToDssClusterConfig correctness and mapping logic
    • Concurrency around per-shard counters, DecreaseWriteReqCount(shard) and shutdown races
    • RocksDB cloud factory per-shard object_path construction and RocksDBDataStoreCommon::Shutdown behavior


Suggested reviewers

  • liunyl
  • githubzilla

"🐇 I hop through shards at dawn's bright ray,
Keys snug in fur, then off I bound away,
Buckets bloom to shard maps neat,
Messages hop on nimble feet,
Carrots of data — shared and on their way! 🥕"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 16.33%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Support multi data store shards' accurately describes the main objective of the pull request: introducing per-shard routing and multi-shard support throughout the codebase.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
data_store_service_client.cpp (4)

316-318: Don’t crash in ScheduleTimerTasks stub

Asserting false here will kill the process if the stub is ever exercised. Prefer a no-op or TODO log.

-    LOG(ERROR) << "ScheduleTimerTasks not implemented";
-    assert(false);
+    LOG(WARNING) << "ScheduleTimerTasks not implemented (noop)";

2624-2624: Remove assert(false) in FetchArchives or guard behind feature flag

This function asserts and then proceeds with fully-formed logic. The assert will crash in debug builds.

-    assert(false);
+    // TODO: validate behavior and remove when complete; leaving enabled.

2734-2734: Remove assert(false) in FetchVisibleArchive

Same pattern; the code below is functional. Don’t crash.

-    assert(false);
+    // TODO: keep functional path; no-op assert removed.

3070-3117: CloseDataStore uses hard-coded 0; should close the ng_id passed

Lines 3081-3082 use CloseDataStore(0); this breaks multi-shard. Use CloseDataStore(ng_id).

-        // Now, only support one shard.
-        data_store_service_->CloseDataStore(0);
+        data_store_service_->CloseDataStore(ng_id);
eloq_data_store_service/internal_request.h (1)

754-763: Pass shard_id into PrepareShardingError.

PrepareShardingError now expects a shard identifier, but this code is still forwarding the partition id. On a multi-shard node the partition id can easily exceed the number of local shards, so this ends up indexing data_shards_ with the wrong value (or throwing) and returns sharding metadata for the wrong shard. Please pass the closure’s shard_id instead, and mirror the same fix in the other PrepareShardingError call sites in this file (ReadLocalRequest::SetFinish, ScanRpcRequest::SetFinish, and ScanLocalRequest::SetFinish).

-            ds_service_->PrepareShardingError(req_->partition_id(),
+            ds_service_->PrepareShardingError(req_->shard_id(),
                                              resp_->mutable_result());
🧹 Nitpick comments (5)
data_store_service_client.cpp (1)

3178-3234: InitBucketsInfo: avoid srand/rand and global RNG side effects

  • srand/rand mutate global state and aren’t thread-safe.
  • Prefer per-call mt19937_64 seeded by ng, use uniform_int_distribution<uint16_t>. Also avoid modulo bias.
-    for (auto ng : node_groups)
-    {
-        srand(ng);
-        size_t generated = 0;
-        while (generated < 64)
-        {
-            uint16_t rand_num = rand() % txservice::total_range_buckets;
+    for (auto ng : node_groups) {
+        std::mt19937_64 rng(static_cast<uint64_t>(ng));
+        std::uniform_int_distribution<uint16_t> dist(0, txservice::total_range_buckets - 1);
+        size_t generated = 0;
+        while (generated < 64) {
+            uint16_t rand_num = dist(rng);
             if (rand_num_to_ng.find(rand_num) == rand_num_to_ng.end())
             {
                 generated++;
                 rand_num_to_ng.emplace(rand_num, ng);
             }
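The suggested replacement above can be demonstrated standalone: a per-call engine seeded from the node-group id gives every node the same deterministic bucket choices without touching global RNG state, and `uniform_int_distribution` avoids the modulo bias of `rand() % N`. In this sketch `kTotalRangeBuckets` stands in for `txservice::total_range_buckets`, and the function name is illustrative.

```cpp
#include <cstdint>
#include <random>
#include <set>

constexpr uint16_t kTotalRangeBuckets = 16384;  // assumed bucket-space size

std::set<uint16_t> PickBucketsForNodeGroup(uint32_t ng, size_t count)
{
    // Per-call engine: deterministic for a given node group, thread-safe,
    // and no srand/rand global state is mutated.
    std::mt19937_64 rng(static_cast<uint64_t>(ng));
    std::uniform_int_distribution<uint16_t> dist(0, kTotalRangeBuckets - 1);
    std::set<uint16_t> picked;
    while (picked.size() < count)
    {
        picked.insert(dist(rng));  // duplicate draws are simply retried
    }
    return picked;
}
```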
eloq_data_store_service/rocksdb_data_store.cpp (1)

61-67: Destructor guard relies on pointer nulling

Given the change above, destructor won’t re-enter Shutdown. Without nulling, the guard stays true and Shutdown will run again.

No code change beyond the previous diff is needed.

eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)

155-163: Avoid const_cast; mutate member directly

cloud_config_ is a non-const member. Use it directly and simplify path normalization; ensure idempotence if constructed multiple times.

-    auto *cloud_config_ptr =
-        const_cast<EloqDS::RocksDBCloudConfig *>(&cloud_config_);
-    assert(!cloud_config_ptr->object_path_.empty());
-    if (cloud_config_ptr->object_path_.back() != '/')
+    assert(!cloud_config_.object_path_.empty());
+    if (cloud_config_.object_path_.back() != '/')
     {
-        cloud_config_ptr->object_path_.append("/");
+        cloud_config_.object_path_.append("/");
     }
-    cloud_config_ptr->object_path_.append("ds_" + std::to_string(shard_id));
+    cloud_config_.object_path_.append("ds_" + std::to_string(shard_id));
eloq_data_store_service/data_store_service.h (2)

714-714: Replace magic number with named constant.

The shard count 1000 is hardcoded. This should be a named constant for maintainability and clarity.

Consider defining a constant:

+    static constexpr uint32_t MAX_SHARD_COUNT = 1000;
+
-    std::array<DataShard, 1000> data_shards_;
+    std::array<DataShard, MAX_SHARD_COUNT> data_shards_;

This makes the architecture more flexible for future changes and clarifies the design intent.


683-690: Remove commented-out code before merging.

The old single-shard member declarations (lines 683-689) are commented out. These should be removed entirely rather than left as comments, as they add noise and can cause confusion.

-    // uint32_t shard_id_{UINT32_MAX};
-    // std::unique_ptr<DataStore> data_store_{nullptr};
-    // std::atomic<DSShardStatus> shard_status_{DSShardStatus::Closed};
-    // std::atomic<uint64_t> ongoing_write_requests_{0};
-
-    // // scan iterator cache
-    // TTLWrapperCache scan_iter_cache_;
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce0e931 and cf10320.

📒 Files selected for processing (17)
  • data_store_service_client.cpp (90 hunks)
  • data_store_service_client.h (14 hunks)
  • data_store_service_client_closure.cpp (17 hunks)
  • data_store_service_client_closure.h (34 hunks)
  • data_store_service_scanner.cpp (3 hunks)
  • data_store_service_scanner.h (4 hunks)
  • eloq_data_store_service/CMakeLists.txt (0 hunks)
  • eloq_data_store_service/data_store_service.cpp (44 hunks)
  • eloq_data_store_service/data_store_service.h (13 hunks)
  • eloq_data_store_service/data_store_service_util.h (1 hunks)
  • eloq_data_store_service/ds_request.proto (5 hunks)
  • eloq_data_store_service/internal_request.h (55 hunks)
  • eloq_data_store_service/main.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (6 hunks)
  • eloq_data_store_service/rocksdb_data_store.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.cpp (2 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.h (0 hunks)
💤 Files with no reviewable changes (2)
  • eloq_data_store_service/rocksdb_data_store_common.h
  • eloq_data_store_service/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client_closure.cpp
  • data_store_service_client.h
  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client_closure.cpp
  • eloq_data_store_service/data_store_service.cpp
  • data_store_service_client.cpp
🧬 Code graph analysis (10)
eloq_data_store_service/rocksdb_data_store.cpp (1)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
data_store_service_scanner.cpp (1)
data_store_service_scanner.h (1)
  • data_shard_id_ (266-266)
data_store_service_scanner.h (1)
eloq_data_store_service/internal_request.h (9)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (818-826)
  • partition_id_ (647-647)
  • partition_id_ (869-869)
  • partition_id_ (1528-1528)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
eloq_data_store_service/data_store_service.h (2)
eloq_data_store_service/data_store_service.cpp (14)
  • StartService (245-314)
  • StartService (245-245)
  • Read (400-433)
  • Read (400-403)
  • Read (435-478)
  • Read (435-443)
  • ScanClose (996-1029)
  • ScanClose (996-999)
  • ScanClose (1031-1063)
  • ScanClose (1031-1036)
  • CloseDataStore (1846-1891)
  • CloseDataStore (1846-1846)
  • OpenDataStore (1893-1929)
  • OpenDataStore (1893-1893)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
eloq_data_store_service/internal_request.h (1)
data_store_service_client_closure.h (31)
  • Reset (81-85)
  • Reset (144-153)
  • Reset (210-216)
  • Reset (269-274)
  • Reset (385-392)
  • Reset (600-619)
  • Reset (933-948)
  • Reset (1154-1177)
  • Reset (1393-1408)
  • Reset (1598-1624)
  • Reset (1627-1658)
  • Reset (1660-1694)
  • Reset (1933-2251)
  • Reset (2328-2347)
  • Reset (2606-2612)
  • Reset (3002-3010)
  • vector (1078-1081)
  • vector (1121-1128)
  • vector (2253-2256)
  • string (804-809)
  • string (2173-2178)
  • string_view (311-368)
  • string_view (752-755)
  • string_view (767-770)
  • string_view (792-797)
  • string_view (1291-1294)
  • string_view (1533-1536)
  • string_view (2133-2136)
  • string_view (2138-2141)
  • string_view (2241-2246)
  • string_view (2453-2456)
data_store_service_client_closure.h (2)
eloq_data_store_service/internal_request.h (13)
  • partition_id_ (647-647)
  • partition_id_ (869-869)
  • partition_id_ (1528-1528)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (818-826)
  • string_view (988-996)
  • string_view (1123-1131)
  • string_view (1425-1433)
  • string_view (1440-1448)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
eloq_data_store_service/data_store_service.cpp (3)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (618-623)
  • PrepareShardingError (525-532)
  • IncreaseWriteReqCount (606-610)
  • DecreaseWriteReqCount (612-616)
data_store_service_client.cpp (6)
  • Read (1151-1160)
  • Read (1151-1156)
  • Read (3574-3590)
  • Read (3574-3579)
  • ScanNext (3766-3799)
  • ScanNext (3766-3780)
eloq_data_store_service/rocksdb_data_store_common.cpp (9)
  • Read (416-477)
  • Read (416-416)
  • key (832-832)
  • start_key (367-367)
  • start_key (679-679)
  • end_key (382-382)
  • end_key (684-684)
  • ScanNext (735-1045)
  • ScanNext (735-735)
data_store_service_client.h (2)
data_store_service_client.cpp (20)
  • FindFreeNodeIndex (3313-3332)
  • FindFreeNodeIndex (3313-3313)
  • InitBucketsInfo (3178-3234)
  • InitBucketsInfo (3178-3182)
  • OnLeaderStart (3032-3047)
  • OnLeaderStart (3032-3033)
  • OnLeaderStop (3049-3061)
  • OnLeaderStop (3049-3049)
  • OnStartFollowing (3070-3117)
  • OnStartFollowing (3070-3073)
  • GetShardIdByPartitionId (3145-3162)
  • GetShardIdByPartitionId (3145-3146)
  • IsLocalShard (3135-3143)
  • IsLocalShard (3135-3135)
  • GetOwnerNodeIndexOfShard (3253-3258)
  • GetOwnerNodeIndexOfShard (3253-3254)
  • GetAllDataShards (3164-3176)
  • GetAllDataShards (3164-3164)
  • UpdateOwnerNodeIndexOfShard (3260-3311)
  • UpdateOwnerNodeIndexOfShard (3260-3261)
eloq_data_store_service/data_store_service_config.cpp (2)
  • GetShardIdByPartitionId (802-807)
  • GetShardIdByPartitionId (802-803)
data_store_service_client.cpp (3)
data_store_service_client.h (3)
  • TxPort2DssPort (126-129)
  • need_bootstrap_ (683-816)
  • KvPartitionIdOf (621-626)
eloq_data_store_service/data_store_service.cpp (8)
  • BatchWriteRecords (854-902)
  • BatchWriteRecords (854-858)
  • BatchWriteRecords (1124-1189)
  • BatchWriteRecords (1124-1137)
  • Read (400-433)
  • Read (400-403)
  • Read (435-478)
  • Read (435-443)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🪛 Clang (14.0.6)
eloq_data_store_service/data_store_service_util.h

[error] 24-24: 'atomic' file not found

(clang-diagnostic-error)

🔇 Additional comments (21)
eloq_data_store_service/data_store_service_util.h (1)

24-29: Static analysis false positive: ignore the 'atomic' error.

The static analysis error about 'atomic' file not found is a false positive. <atomic> is a standard C++11 header and should compile correctly in the target environment. The new includes support the multi-shard data structures introduced in this PR.

data_store_service_client_closure.h (3)

602-614: LGTM! Type changes and shard_id propagation are consistent.

The updated signature correctly changes partition_id to int32_t and adds uint32_t shard_id, aligning with the shard-aware architecture. The initialization of key_ directly from the parameter simplifies key handling.


1850-1858: LGTM! Accessor methods follow consistent patterns.

The PartitionId() and ShardId() accessor methods are well-defined and return the correct types, enabling proper shard routing throughout the system.


631-633: Sentinel values are intentionally inconsistent across different modules but adequately scoped.

The review comment's concern about consistency has merit, but the codebase shows intentional design differences rather than errors:

  • data_store_service_client_closure.h: Uses INT32_MAX as a sentinel for "uninitialized/reset" state in closure classes (lines 631, 1605, 1913)
  • eloq_data_store_service/internal_request.h and data_store_service_scanner.h: Use 0 as the sentinel value

These are separate modules with different semantics—not shared classes. The shard_id_ sentinel (UINT32_MAX) is consistently checked in assertions throughout the codebase (e.g., data_store_service.cpp lines 326, 1089, 1097, 1105, etc.), validating sentinel-aware code paths exist.

However, partition_id_ sentinels lack comparable defensive checks—there are no comparisons or validations against INT32_MAX or 0 found in the codebase. This represents a minor defensive programming gap rather than an outright inconsistency issue.

data_store_service_client.h (4)

90-102: LGTM! Shard initialization logic is sound.

The initialization correctly:

  1. Populates dss_shards_ array with node indices for each shard
  2. Builds dss_shard_ids_ set from the cluster manager
  3. Calls InitBucketsInfo to establish bucket-to-shard mappings

This properly sets up the multi-shard architecture.


387-395: Leader callback signatures correctly updated for node group awareness.

The updated signatures for OnLeaderStart, OnLeaderStop, and OnStartFollowing now accept ng_id (node group ID) parameters, enabling proper per-shard leadership management in the multi-shard architecture.


624-625: Verify hash consistency for system table partition mapping.

The KvPartitionIdOf method uses std::hash<std::string_view> to compute partition IDs for system tables. Ensure that:

  1. The hash function produces consistent results across different nodes
  2. The hash distribution is uniform to avoid hotspots

Based on learnings
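The concern is real: std::hash is implementation-defined, so two nodes built against different standard libraries (or even different library versions) may disagree on the partition id of the same table name. A specified hash such as FNV-1a sidesteps this. The sketch below illustrates the concern and one stable alternative — it is not the actual KvPartitionIdOf implementation, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a is fully specified, so every node computes the same value for the
// same bytes regardless of standard-library implementation.
constexpr uint64_t Fnv1a64(std::string_view s)
{
    uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : s)
    {
        h ^= c;
        h *= 1099511628211ull;  // FNV prime
    }
    return h;
}

int32_t StablePartitionIdOf(std::string_view table_name, int32_t partitions)
{
    return static_cast<int32_t>(Fnv1a64(table_name) %
                                static_cast<uint64_t>(partitions));
}
```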


751-755: LGTM! New shard-aware state members are well-structured.

The additions of:

  • dss_shard_ids_ (set of active shard IDs)
  • bucket_infos_ (bucket-to-shard mapping)

provide the necessary infrastructure for multi-shard routing and management. The use of std::unique_ptr for BucketInfo ensures proper ownership semantics.

data_store_service_scanner.cpp (1)

417-419: The code at lines 417-419 correctly handles partition ID boundaries and produces valid shard IDs.

The loop iterates for (uint32_t part_cnt = 0; part_cnt < HASH_PARTITION_COUNT; part_cnt++) and passes static_cast<int32_t>(part_cnt) to GetShardIdByPartitionId. This guarantees the partition ID is always in the range [0, HASH_PARTITION_COUNT). The implementation uses two assertions to validate the returned bucket ID and shard ID exist in their respective maps, providing runtime safety. All 34 callers throughout the codebase follow similar patterns with documented, computed, or hardcoded valid partition IDs.

data_store_service_client.cpp (4)

3560-3563: Local vs remote routing: pattern looks consistent

All Read/DeleteRange/ScanNext/ScanClose paths check IsLocalShard and route to the right target with timeouts and owner index lookup. Good.

Also applies to: 3446-3450, 3498-3504, 3750-3763, 3804-3838, 3870-3894, 3594-3621, 3648-3676


4384-4390: Catalog paths: shard resolution for partition 0 is now consistent

Passing shard_id for partition 0 catalogs eliminates implicit shard 0 assumptions. Looks good.

Also applies to: 4610-4621, 4655-4672


4715-4749: Batch-size accounting improvements

Incrementing batch_size for key sizes prevents oversize batches. This is correct and matches MAX_WRITE_BATCH_SIZE enforcement.

Also applies to: 4761-4787, 4801-4810, 4916-4931


399-404: Per-partition coordinator decrement verified correct

The batch callback chaining at lines 493-538 ensures OnPartitionCompleted() is called exactly once per partition lifecycle:

  • If any batch fails: called at line 507
  • If all batches succeed: called at line 536 (when no more batches exist)

The same callback_data (containing one partition_state and global_coordinator) persists across all batches for a partition, and the callback chains subsequent batches before calling OnPartitionCompleted(). This confirms the global coordinator's unfinished_request_cnt_ decrements only when the partition fully completes or fails, not per batch. The routing logic at lines 399-404 using batch[0].partition_id is correct and properly grouped.

eloq_data_store_service/main.cpp (1)

321-323: Startup wiring LGTM

StartService signature update is correctly applied and respects bootstrap/single-node. No issues spotted.

eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)

370-393: Timing/logging improvements LGTM

Switch to steady_clock and explicit NewCloudFileSystem timing are good operationally.

Also applies to: 647-671


612-636: Paranoid checks disabled with max_open_files=0: note operational tradeoff

This is fine for open-speed and to avoid LRU churn, but document the rationale in code; already noted. No change requested.

eloq_data_store_service/data_store_service.h (5)

628-631: Verify GetClusterManager visibility change is intentional.

The GetClusterManager() method returns a non-const reference and appears to have been made public (indicated by the ~ marker on line 628). This exposes the cluster manager for external modification.

Ensure that:

  • This visibility change is intentional and necessary for the multi-shard architecture
  • External callers have a legitimate need to access/modify cluster manager state
  • Thread-safety is maintained if multiple threads access the cluster manager

If external modification is not required, consider returning a const reference instead:

const DataStoreServiceClusterManager& GetClusterManager() const

606-616: LGTM: Per-shard write request tracking.

The refactoring to per-shard write request counters is correctly implemented. The use of .at() provides bounds checking, and the atomic operations with appropriate memory ordering are correct.


585-593: LGTM: Defensive shard status check.

The implementation correctly verifies that the shard is initialized (shard_id_ matches the requested shard_id) before returning the actual status, otherwise returns Closed. This is good defensive programming.


634-644: LGTM: Safe per-shard data store accessor.

The private GetDataStore method correctly validates that the shard is initialized before returning the data store pointer, returning nullptr otherwise. The use of .at() provides bounds checking.


190-190: StartService signature appropriately simplified for multi-shard architecture.

The signature has been narrowed to a single create_db_if_missing parameter, which is appropriate for the new multi-shard architecture where shard-specific configuration is obtained from the cluster manager rather than passed as parameters.

@lzxddz lzxddz self-assigned this Nov 10, 2025
@lzxddz lzxddz requested a review from liunyl November 10, 2025 09:25
@lzxddz lzxddz linked an issue Nov 10, 2025 that may be closed by this pull request
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
data_store_service_client.cpp (1)

3069-3116: Inconsistent use of ng_id in OnStartFollowing

Lines 3040 and 3057 correctly use the ng_id parameter when calling OpenDataStore(ng_id) and CloseDataStore(ng_id). However, line 3080 hardcodes 0 instead of using the ng_id parameter. This inconsistency could cause the wrong shard to be closed when following a leader.

Apply this fix:

     if (data_store_service_ != nullptr)
     {
-        // Now, only support one shard.
-        data_store_service_->CloseDataStore(0);
+        data_store_service_->CloseDataStore(ng_id);
     }
♻️ Duplicate comments (5)
data_store_service_client.cpp (5)

313-317: Still unresolved: Fatal assert will crash the process

The unconditional assert(false) at line 316 will abort the process if this method is called. This was flagged in past reviews but remains unaddressed.

Replace with graceful error handling:

 void DataStoreServiceClient::ScheduleTimerTasks()
 {
     LOG(ERROR) << "ScheduleTimerTasks not implemented";
-    assert(false);
+    // TODO: Implement timer-based tasks when needed
 }

2616-2624: Still unresolved: Fatal assert in FetchArchives

The unconditional assert(false) at line 2623 was flagged in past reviews but remains unaddressed. This will crash the process if called.

Replace with proper error handling:

 bool DataStoreServiceClient::FetchArchives(
     const txservice::TableName &table_name,
     const txservice::KVCatalogInfo *kv_info,
     const txservice::TxKey &key,
     std::vector<txservice::VersionTxRecord> &archives,
     uint64_t from_ts)
 {
-    assert(false);
-
     LOG(INFO) << "FetchArchives: table_name: " << table_name.StringView();
+    LOG(ERROR) << "FetchArchives not fully implemented yet";
+    return false;

2724-2734: Still unresolved: Fatal assert in FetchVisibleArchive

The unconditional assert(false) at line 2733 was flagged in past reviews but remains unaddressed.

Replace with proper error handling:

 bool DataStoreServiceClient::FetchVisibleArchive(
     const txservice::TableName &table_name,
     const txservice::KVCatalogInfo *kv_info,
     const txservice::TxKey &key,
     const uint64_t upper_bound_ts,
     txservice::TxRecord &rec,
     txservice::RecordStatus &rec_status,
     uint64_t &commit_ts)
 {
-    assert(false);
-
+    LOG(ERROR) << "FetchVisibleArchive not fully implemented yet";
+    return false;

213-273: Still unresolved: Redundant check and unguarded map access

Past review flagged two issues that remain unaddressed:

  1. Redundant check (lines 257-262): The condition if (leader_id == node_id) is checked twice—once implicitly when calling SwitchShardToReadWrite and again immediately after. Consolidate into a single block.

  2. Unguarded map access (line 255): nodes_map.at(leader_id) is called without verifying that leader_id exists in nodes_map. The ng_leaders map is independent of nodes_map, so a leader_id from ng_leaders might not exist in nodes_map. The assert on line 252 comes after this access, so it doesn't prevent the exception.

Apply this fix:

         // set primary node
         if (ng_leaders.find(ng_id) != ng_leaders.end())
         {
             uint32_t leader_id = ng_leaders.at(ng_id);
-            assert(nodes_map.find(leader_id) != nodes_map.end());
-            if (nodes_map.find(leader_id) != nodes_map.end())
+            auto leader_it = nodes_map.find(leader_id);
+            if (leader_it != nodes_map.end())
             {
-                cluster_manager.UpdatePrimaryNode(ng_id,
-                                                  nodes_map.at(leader_id));
-            }
-            if (leader_id == node_id)
-            {
-                contain_this_node = true;
+                cluster_manager.UpdatePrimaryNode(ng_id, leader_it->second);
+                if (leader_id == node_id)
+                {
+                    contain_this_node = true;
+                    cluster_manager.SwitchShardToReadWrite(ng_id,
+                                                           DSShardStatus::Closed);
+                }
+            }
+            else
+            {
+                LOG(ERROR) << "Leader " << leader_id << " not found in nodes_map for ng " << ng_id;
+            }
+            if (leader_id == node_id && leader_it == nodes_map.end())
+            {
+                contain_this_node = true;
                 cluster_manager.SwitchShardToReadWrite(ng_id,
                                                        DSShardStatus::Closed);
             }
         }

3144-3167: Still unresolved: Unsynchronized access to shared containers

Past review flagged that bucket_infos_ and dss_shard_ids_ are read without synchronization, which can race with topology updates. This concern remains unaddressed.

Add synchronization to prevent races:

 uint32_t DataStoreServiceClient::GetShardIdByPartitionId(
     int32_t partition_id, bool is_range_partition) const
 {
     uint16_t bucket_id;
     if (is_range_partition)
     {
         bucket_id = txservice::Sharder::MapRangeIdToBucketId(partition_id);
     }
     else
     {
         bucket_id =
             txservice::Sharder::MapHashPartitionIdToBucketId(partition_id);
     }
 
+    // TODO: Add shared_lock when bucket_infos_mutex_ is introduced
+    // std::shared_lock lk(bucket_infos_mutex_);
     auto it = bucket_infos_.find(bucket_id);
     assert(it != bucket_infos_.end());
     if (it != bucket_infos_.end())
     {
         uint32_t shard_id = bucket_infos_.at(bucket_id)->BucketOwner();
         assert(dss_shard_ids_.find(shard_id) != dss_shard_ids_.end());
         return shard_id;
     }
     return 0;
 }

Verify whether bucket_infos_ is modified concurrently:

#!/bin/bash
# Search for writes to bucket_infos_ to assess race potential
rg -nP '\bbucket_infos_\s*(\[|\.emplace|\.insert|\.erase|\.clear|\s*=)' -C2
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf10320 and 1b98556.

📒 Files selected for processing (4)
  • data_store_service_client.cpp (89 hunks)
  • data_store_service_client.h (13 hunks)
  • eloq_data_store_service/data_store_service.cpp (44 hunks)
  • eloq_data_store_service/data_store_service.h (13 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • eloq_data_store_service/data_store_service.cpp
  • data_store_service_client.cpp
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client.h
  • data_store_service_client.cpp
🧬 Code graph analysis (4)
eloq_data_store_service/data_store_service.h (1)
eloq_data_store_service/data_store_service.cpp (14)
  • StartService (236-305)
  • StartService (236-236)
  • Read (391-424)
  • Read (391-394)
  • Read (426-469)
  • Read (426-434)
  • ScanClose (987-1020)
  • ScanClose (987-990)
  • ScanClose (1022-1054)
  • ScanClose (1022-1027)
  • CloseDataStore (1837-1882)
  • CloseDataStore (1837-1837)
  • OpenDataStore (1884-1920)
  • OpenDataStore (1884-1884)
eloq_data_store_service/data_store_service.cpp (3)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (615-625)
  • PrepareShardingError (525-529)
  • IncreaseWriteReqCount (603-607)
  • DecreaseWriteReqCount (609-613)
data_store_service_client.cpp (12)
  • Read (1150-1159)
  • Read (1150-1155)
  • Read (3580-3596)
  • Read (3580-3585)
  • DeleteRange (3629-3651)
  • DeleteRange (3629-3636)
  • ScanNext (3772-3805)
  • ScanNext (3772-3786)
  • ScanClose (3847-3871)
  • ScanClose (3847-3852)
  • BatchWriteRecords (4172-4207)
  • BatchWriteRecords (4172-4185)
eloq_data_store_service/rocksdb_data_store_common.cpp (15)
  • Read (416-477)
  • Read (416-416)
  • key (832-832)
  • DeleteRange (339-414)
  • DeleteRange (339-339)
  • start_key (367-367)
  • start_key (679-679)
  • end_key (382-382)
  • end_key (684-684)
  • ScanNext (735-1045)
  • ScanNext (735-735)
  • ScanClose (1047-1071)
  • ScanClose (1047-1047)
  • BatchWriteRecords (479-641)
  • BatchWriteRecords (479-480)
data_store_service_client.h (2)
data_store_service_client.cpp (24)
  • FindFreeNodeIndex (3319-3338)
  • FindFreeNodeIndex (3319-3319)
  • InitBucketsInfo (3184-3240)
  • InitBucketsInfo (3184-3188)
  • OnLeaderStart (3031-3046)
  • OnLeaderStart (3031-3032)
  • OnLeaderStop (3048-3060)
  • OnLeaderStop (3048-3048)
  • OnStartFollowing (3069-3116)
  • OnStartFollowing (3069-3072)
  • GetShardIdByPartitionId (3144-3167)
  • GetShardIdByPartitionId (3144-3145)
  • IsLocalShard (3134-3142)
  • IsLocalShard (3134-3134)
  • GetOwnerNodeIndexOfShard (3259-3264)
  • GetOwnerNodeIndexOfShard (3259-3260)
  • GetAllDataShards (3169-3182)
  • GetAllDataShards (3169-3169)
  • UpdateOwnerNodeIndexOfShard (3266-3317)
  • UpdateOwnerNodeIndexOfShard (3266-3267)
  • HandleShardingError (3340-3374)
  • HandleShardingError (3340-3341)
  • UpgradeShardVersion (3376-3424)
  • UpgradeShardVersion (3376-3379)
eloq_data_store_service/data_store_service_config.cpp (4)
  • GetShardIdByPartitionId (802-807)
  • GetShardIdByPartitionId (802-803)
  • HandleShardingError (950-992)
  • HandleShardingError (950-951)
data_store_service_client.cpp (2)
data_store_service_client.h (3)
  • TxPort2DssPort (126-129)
  • need_bootstrap_ (680-804)
  • KvPartitionIdOf (621-626)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (3)
data_store_service_client.cpp (3)

286-289: LGTM: Early return when bootstrap not needed

The guard correctly skips table initialization when need_bootstrap_ is false, improving efficiency.


467-492: LGTM: Proper shard-aware routing in PutAll

The method correctly determines partition type and resolves shard_id before batch writes. The is_range_partitioned flag is properly derived and propagated.


3169-3182: LGTM: Thread-safety issue resolved with mutex

Past review flagged unsynchronized access to dss_shard_ids_. The addition of dss_shard_ids_mutex_ at line 3173 properly addresses this concern. The lock scope is minimal and appropriate.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
data_store_service_client.cpp (5)

313-317: Remove fatal assert in ScheduleTimerTasks; return gracefully

Asserting false crashes callers. Replace with a no-op + TODO until implemented.

 void DataStoreServiceClient::ScheduleTimerTasks()
 {
-    LOG(ERROR) << "ScheduleTimerTasks not implemented";
-    assert(false);
+    DLOG(WARNING) << "ScheduleTimerTasks not implemented yet";
+    // TODO: implement periodic tasks (health checks, cleanup, etc.)
 }

1871-1885: DropKvTableAsync: avoid assert(false); make it a documented no-op or implement

This assert crashes the process if ever called. Either remove the API or implement a safe async path; minimally, return without asserting.

 void DataStoreServiceClient::DropKvTableAsync(const std::string &kv_table_name)
 {
-    // FIXME(lzx): this function may not be used now.
-    assert(false);
+    DLOG(WARNING) << "DropKvTableAsync is not implemented and will be a no-op for now: " << kv_table_name;
+    return;

3069-3116: OnStartFollowing: CloseDataStore(0) ignores ng_id

Hardcoding shard 0 contradicts multi-shard support. Use ng_id consistently.

-    if (data_store_service_ != nullptr)
-    {
-        // Now, only support one shard.
-        data_store_service_->CloseDataStore(0);
-    }
+    if (data_store_service_ != nullptr) {
+        data_store_service_->CloseDataStore(ng_id);
+    }

3015-3021: RestoreTxCache: avoid assert(false); log and return

Don’t abort the process on unimplemented path.

 void DataStoreServiceClient::RestoreTxCache(txservice::NodeGroupId cc_ng_id,
                                             int64_t cc_ng_term)
 {
-    LOG(ERROR) << "RestoreTxCache not implemented";
-    assert(false);
+    DLOG(WARNING) << "RestoreTxCache not implemented for ng_id=" << cc_ng_id
+                  << " term=" << cc_ng_term;
+    return;
 }

3580-3627: Add bounds validation in GetOwnerNodeIndexOfShard before use

The review comment identifies a legitimate vulnerability. GetOwnerNodeIndexOfShard currently asserts only that the returned value is not UINT32_MAX, but does not validate that the value is within the valid range [0, 1024) for indexing dss_nodes_.

The evidence shows:

  • dss_nodes_ is a fixed-size array of 1024 elements (valid indices 0-1023)
  • FindFreeNodeIndex() returns 1024 as an error sentinel when no free slot exists
  • If this error value (1024) gets stored in dss_shards_[shard_id], GetOwnerNodeIndexOfShard will return it
  • All call sites directly index dss_nodes_[node_index] without bounds checking (lines 2980, 3618, 3674, 3717, 3761, 3836, 3892, 4237)
  • Even HandleShardingError (line 3390) loads and directly indexes without validation

The fix should add an upper-bound check in GetOwnerNodeIndexOfShard to ensure the returned value is less than dss_nodes_.size() (1024), or validate at each call site before indexing.
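A minimal sketch of that upper-bound check, under the sizes the review cites (1000 shards, 1024 node slots); the globals here are illustrative, not the client's actual members:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed sizes from the review: 1000 shard slots, 1024 node slots.
constexpr size_t kMaxNodes = 1024;
constexpr size_t kMaxShards = 1000;

std::array<std::atomic<uint32_t>, kMaxShards> dss_shards_{};

// Returns kMaxNodes as an explicit "invalid" sentinel so a bad index
// (e.g. FindFreeNodeIndex's 1024 error value) never reaches dss_nodes_[...].
uint32_t GetOwnerNodeIndexOfShard(uint32_t shard_id)
{
    if (shard_id >= kMaxShards)
    {
        return kMaxNodes;
    }
    uint32_t idx = dss_shards_[shard_id].load(std::memory_order_acquire);
    return idx < kMaxNodes ? idx : kMaxNodes;
}
```

Call sites then check against `kMaxNodes` once instead of indexing blindly.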

data_store_service_client.h (2)

414-424: Signature/type mismatch between header and source: uint64_t vs size_t

The .cpp defines the SerializeTxRecord helpers with std::vector<size_t>, but the header declares std::vector<uint64_t>. On platforms where size_t is not the same type as uint64_t, the declared overloads are never defined and callers fail at link time. Align both to size_t.

-    static void SerializeTxRecord(bool is_deleted,
-                                  const txservice::TxRecord *rec,
-                                  std::vector<uint64_t> &record_tmp_mem_area,
-                                  std::vector<std::string_view> &record_parts,
-                                  size_t &write_batch_size);
+    static void SerializeTxRecord(bool is_deleted,
+                                  const txservice::TxRecord *rec,
+                                  std::vector<size_t> &record_tmp_mem_area,
+                                  std::vector<std::string_view> &record_parts,
+                                  size_t &write_batch_size);
...
-    static void SerializeTxRecord(const txservice::TxRecord *rec,
-                                  std::vector<uint64_t> &record_tmp_mem_area,
-                                  std::vector<std::string_view> &record_parts,
-                                  size_t &write_batch_size);
+    static void SerializeTxRecord(const txservice::TxRecord *rec,
+                                  std::vector<size_t> &record_tmp_mem_area,
+                                  std::vector<std::string_view> &record_parts,
+                                  size_t &write_batch_size);

73-116: Constructor: initialization and bucket-map wiring look good; add a bounds check

Before writing dss_shards_[shard_id], assert shard_id < dss_shards_.size() to avoid OOB if configs drift.

-            dss_shards_[shard_id].store(node_idx);
+            assert(shard_id < dss_shards_.size());
+            dss_shards_[shard_id].store(node_idx);
♻️ Duplicate comments (4)
data_store_service_client.cpp (4)

2623-2669: Remove unconditional assert in FetchArchives and fail gracefully

This crashes in release scenarios invoking the API. Return false with an error path instead of asserting.

 bool DataStoreServiceClient::FetchArchives(
     const txservice::TableName &table_name,
     const txservice::KVCatalogInfo *kv_info,
     const txservice::TxKey &key,
     std::vector<txservice::VersionTxRecord> &archives,
     uint64_t from_ts)
 {
-    assert(false);
+    // TODO: implement or route to async variant
+    LOG(ERROR) << "FetchArchives (sync) not implemented";
+    return false;

2724-2809: Remove unconditional assert in FetchVisibleArchive and fail gracefully

Same as above—avoid crashing production.

 bool DataStoreServiceClient::FetchVisibleArchive(
     const txservice::TableName &table_name,
     const txservice::KVCatalogInfo *kv_info,
     const txservice::TxKey &key,
     const uint64_t upper_bound_ts,
     txservice::TxRecord &rec,
     txservice::RecordStatus &rec_status,
     uint64_t &commit_ts)
 {
-    assert(false);
+    // TODO: implement or route to async variant
+    LOG(ERROR) << "FetchVisibleArchive (sync) not implemented";
+    return false;

3144-3167: GetShardIdByPartitionId: add synchronization and avoid silent 0 fallback

  • bucket_infos_ access is unsynchronized; races with topology updates.
  • Returning 0 on miss can misroute requests.

Proposed fix:

  • Guard bucket_infos_ with a shared lock.
  • If bucket not found, return an error or throw/log and avoid defaulting to 0.
 uint32_t DataStoreServiceClient::GetShardIdByPartitionId(
     int32_t partition_id, bool is_range_partition) const
 {
-    uint16_t bucket_id;
+    uint16_t bucket_id;
     if (is_range_partition)
     {
         bucket_id = txservice::Sharder::MapRangeIdToBucketId(partition_id);
     }
     else
     {
         bucket_id =
             txservice::Sharder::MapHashPartitionIdToBucketId(partition_id);
     }

-    auto it = bucket_infos_.find(bucket_id);
+    // TODO: protect bucket_infos_ with a mutex; see header change to add bucket_infos_mutex_
+    auto it = bucket_infos_.find(bucket_id);
     assert(it != bucket_infos_.end());
     if (it != bucket_infos_.end())
     {
         uint32_t shard_id = bucket_infos_.at(bucket_id)->BucketOwner();
         assert(dss_shard_ids_.find(shard_id) != dss_shard_ids_.end());
         return shard_id;
     }
-    return 0;
+    LOG(ERROR) << "Bucket not found for partition_id=" << partition_id
+               << " (bucket_id=" << bucket_id << ")";
+    return UINT32_MAX; // explicit invalid
 }
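On the caller side, the `UINT32_MAX` sentinel can be wrapped so a miss is never confused with a legitimate shard 0; `ResolveShard` below is a hypothetical helper, not part of the client:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical caller-side wrapper: an empty optional signals that the
// topology is stale and the caller should refresh and retry.
std::optional<uint32_t> ResolveShard(uint32_t raw_shard_id)
{
    if (raw_shard_id == UINT32_MAX)
    {
        return std::nullopt;
    }
    return raw_shard_id;
}
```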

3184-3240: Replace srand/rand with thread-safe RNG and deterministic sampling

rand()/srand() are not thread-safe and use global state. Use a local std::mt19937 seeded by ng.

-    std::map<uint16_t, uint32_t> rand_num_to_ng;
-    // use ng id as seed to generate random numbers
+    std::map<uint16_t, uint32_t> rand_num_to_ng;
+    // use ng id as seed to generate deterministic virtual nodes per group
     for (auto ng : node_groups)
     {
-        srand(ng);
+        std::mt19937 rng(ng);
+        std::uniform_int_distribution<uint16_t> dist(
+            0, txservice::total_range_buckets - 1);
         size_t generated = 0;
         while (generated < 64)
         {
-            uint16_t rand_num = rand() % txservice::total_range_buckets;
+            uint16_t rand_num = dist(rng);
             if (rand_num_to_ng.find(rand_num) == rand_num_to_ng.end())
             {
                 generated++;
                 rand_num_to_ng.emplace(rand_num, ng);
             }
             if (rand_num_to_ng.size() >= txservice::total_range_buckets)
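The determinism property the diff relies on can be seen in isolation: a `std::mt19937` seeded with the node-group id regenerates the same virtual-node positions on every call, with no global state. `kTotalBuckets` is an illustrative stand-in for `txservice::total_range_buckets`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>

// Illustrative stand-in for txservice::total_range_buckets.
constexpr uint16_t kTotalBuckets = 16384;

// Deterministic per-seed sampling: same ng_id -> same bucket positions.
std::set<uint16_t> VirtualNodes(uint32_t ng_id, size_t count = 64)
{
    std::mt19937 rng(ng_id);
    std::uniform_int_distribution<uint16_t> dist(0, kTotalBuckets - 1);
    std::set<uint16_t> picked;
    while (picked.size() < count)
    {
        picked.insert(dist(rng));  // the set dedupes collisions
    }
    return picked;
}
```

Unlike `srand`/`rand`, each call owns its engine, so concurrent callers cannot perturb each other's sequences.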
🧹 Nitpick comments (8)
data_store_service_client.cpp (4)

111-141: Avoid redundant re-initialization and relying on nodes_[0] as primary

  • Use strict version check to skip no-op upgrades when versions are equal.
  • Don’t assume nodes_[0] is the owner; prefer the cluster manager’s recorded primary if available.

Suggested tweak:

-    if (current_version <= cluster_manager.GetTopologyVersion() &&
+    if (current_version < cluster_manager.GetTopologyVersion() &&
         dss_topology_version_.compare_exchange_strong(current_version,
                                                       new_version))
 {
-            // The first node is the owner of shard.
-            assert(group.nodes_.size() > 0);
-            while (!UpgradeShardVersion(group.shard_id_,
-                                        group.version_,
-                                        group.nodes_[0].host_name_,
-                                        group.nodes_[0].port_))
+            assert(!group.nodes_.empty());
+            // TODO: consider using the primary node from cluster_manager instead of nodes_[0]
+            while (!UpgradeShardVersion(group.shard_id_,
+                                        group.version_,
+                                        group.nodes_[0].host_name_,
+                                        group.nodes_[0].port_))
             {

135-147: Fix log typos to aid on-call grepability

“DataStoreServiceCliet” → “DataStoreServiceClient”.

-            LOG(INFO) << "DataStoreServiceCliet UpgradeShardVersion success, "
+            LOG(INFO) << "DataStoreServiceClient UpgradeShardVersion success, "
...
-        LOG(INFO)
-            << "DataStoreServiceCliet SetupConfig skipped, current_version:"
+        LOG(INFO)
+            << "DataStoreServiceClient SetupConfig skipped, current_version:"

213-272: TxConfigsToDssClusterConfig: good guards; set “this node” before switching to RW

The leader existence check and duplicate condition consolidation look good. SetThisNode should be called before switching to ReadWrite to ensure local metadata is in place.

-        // set primary node
+        // set primary node
         if (ng_leaders.find(ng_id) != ng_leaders.end())
         {
             uint32_t leader_id = ng_leaders.at(ng_id);
             assert(nodes_map.find(leader_id) != nodes_map.end());
             if (nodes_map.find(leader_id) != nodes_map.end())
             {
                 cluster_manager.UpdatePrimaryNode(ng_id,
                                                   nodes_map.at(leader_id));
             }
             if (leader_id == node_id)
             {
-                contain_this_node = true;
-                cluster_manager.SwitchShardToReadWrite(ng_id,
-                                                       DSShardStatus::Closed);
+                contain_this_node = true;
             }
         }

         // set this node
         if (contain_this_node && nodes_map.find(node_id) != nodes_map.end())
         {
             auto &this_node_config = nodes_map.at(node_id);
             cluster_manager.SetThisNode(this_node_config.host_name_,
                                         this_node_config.port_);
+            // if we are leader, flip to RW after local-node metadata is set
+            if (ng_leaders.count(ng_id) && ng_leaders.at(ng_id) == node_id) {
+                cluster_manager.SwitchShardToReadWrite(ng_id,
+                                                       DSShardStatus::Closed);
+            }
         }

3169-3182: GetAllDataShards: thread-safe copy is good; preserve deterministic order

Looks correct; iterating a std::set yields ascending order, which is useful for deterministic fanout. Consider documenting the ordering guarantee.

data_store_service_client.h (4)

640-646: Comment param mismatch: update docstring

The comment says “@param partition_id” but the method takes shard_id. Update to avoid confusion for maintainers.


479-481: Expose GetShardIdByPartitionId publicly: OK; add explicit invalid return policy

Document that it may return UINT32_MAX on failure (after cpp change) and that callers must handle it by refreshing topology.


652-661: UpdateShardOwner declared but not defined in this PR

If unused, consider removing to reduce surface area; otherwise add implementation.


736-741: Node vs shard array sizes

dss_nodes_ is 1024 and dss_shards_ is 1000; document rationale and enforce that owner indices stored in dss_shards_ are < dss_nodes_.size().

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 1b98556 and dbc6a98.

📒 Files selected for processing (2)
  • data_store_service_client.cpp (89 hunks)
  • data_store_service_client.h (13 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
data_store_service_client.cpp (2)
data_store_service_client.h (3)
  • TxPort2DssPort (126-129)
  • need_bootstrap_ (680-804)
  • KvPartitionIdOf (621-626)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
data_store_service_client.h (1)
data_store_service_client.cpp (22)
  • FindFreeNodeIndex (3319-3338)
  • FindFreeNodeIndex (3319-3319)
  • InitBucketsInfo (3184-3240)
  • InitBucketsInfo (3184-3188)
  • OnLeaderStart (3031-3046)
  • OnLeaderStart (3031-3032)
  • OnLeaderStop (3048-3060)
  • OnLeaderStop (3048-3048)
  • OnStartFollowing (3069-3116)
  • OnStartFollowing (3069-3072)
  • GetShardIdByPartitionId (3144-3167)
  • GetShardIdByPartitionId (3144-3145)
  • IsLocalShard (3134-3142)
  • IsLocalShard (3134-3134)
  • GetOwnerNodeIndexOfShard (3259-3264)
  • GetOwnerNodeIndexOfShard (3259-3260)
  • GetAllDataShards (3169-3182)
  • GetAllDataShards (3169-3169)
  • UpdateOwnerNodeIndexOfShard (3266-3317)
  • UpdateOwnerNodeIndexOfShard (3266-3267)
  • UpgradeShardVersion (3376-3424)
  • UpgradeShardVersion (3376-3379)
🔇 Additional comments (10)
data_store_service_client.cpp (3)

286-289: Early-return when not bootstrapping looks good

Avoids unnecessary init work on secondaries/clients. No issues spotted.


655-668: Shard routing for FetchCurrentTableStatistics: LGTM

Correctly derives kv partition and resolves shard_id. No changes needed.


1001-1023: Shard routing for FetchTableRanges: LGTM

Consistent partition→shard resolution and RPC path. No issues.

data_store_service_client.h (7)

736-744: Shard owner index bounds and capacity validation

  • dss_shards_ has fixed size 1000; enforce bounds wherever shard_id is used.
  • Validate configured shard_count against 1000 at startup.

Minimal guard:

-    std::array<std::atomic<uint32_t>, 1000> dss_shards_;
+    std::array<std::atomic<uint32_t>, 1000> dss_shards_; // MAX_SHARDS = 1000

And ensure callers assert shard_id < dss_shards_.size() (see cpp comments). Consider introducing a constexpr MAX_SHARDS and validating config on load.
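A sketch of that startup validation, assuming the named constant is introduced as suggested (`kMaxShards` and `ValidateShardCount` are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>

// Named bound for the fixed-size dss_shards_ array (review: 1000 slots).
constexpr size_t kMaxShards = 1000;

// Reject a configured shard count before any dss_shards_[shard_id] write
// can go out of bounds.
bool ValidateShardCount(size_t configured_shard_count)
{
    return configured_shard_count > 0 && configured_shard_count <= kMaxShards;
}
```

Running this once at config load turns a potential out-of-bounds write into an explicit startup error.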


387-395: Leadership hooks: interface looks right; ensure implementations avoid hardcoded shard 0

See cpp comment recommending CloseDataStore(ng_id) in OnStartFollowing.


620-626: KvPartitionIdOf(table): LGTM

Deterministic mapping via std::hash and Sharder is consistent with multi‑shard routing.
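The stability property being signed off on is just that hashing the same table name always lands on the same partition within a process. A toy version, where the modulus is an assumed stand-in for the real Sharder logic:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Assumed partition count; the real value comes from the Sharder config.
constexpr uint32_t kKvPartitions = 1024;

// Deterministic within a process: same name -> same partition id.
int32_t KvPartitionIdOf(const std::string &table_name)
{
    return static_cast<int32_t>(std::hash<std::string>{}(table_name) %
                                kKvPartitions);
}
```

Note that `std::hash` is only guaranteed stable within one process run, which is sufficient for in-memory routing but not for persisted placement.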


124-137: TxPort2DssPort offset: confirm contract

Port +7 mapping is unchanged; ensure DS listens on this offset in deployments to avoid misroutes.


493-506: Read/BatchWrite/Scan signatures now shard-aware: LGTM

Public → internal signatures align with cpp usage. No issues spotted.


566-590: ScanNext/ScanClose shard_id propagation: LGTM

Interfaces align with remote/local routing logic.


344-363: FetchVisibleArchive/FetchArchives async variants OK

Public async APIs exist; ensure sync counterparts don’t assert (see cpp diffs).

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
data_store_service_client.cpp (1)

3069-3081: Close the shard you’re actually following.

After making the client shard-aware, OnStartFollowing must release the shard identified by ng_id. The remaining CloseDataStore(0) call still assumes a single shard and will leave the real shard open (and incorrectly tear down shard 0) whenever ng_id != 0. That breaks follower transitions for every non-zero shard. Switch the call to use the passed-in ng_id.

-        // Now, only support one shard.
-        data_store_service_->CloseDataStore(0);
+        data_store_service_->CloseDataStore(ng_id);
♻️ Duplicate comments (1)
eloq_data_store_service/data_store_service.h (1)

615-624: Treat read-only shards as owned.

Restricting IsOwnerOfShard to ReadWrite reintroduces the regression we discussed earlier: followers (ReadOnly) stop serving reads and shutdown paths bail out with REQUESTED_NODE_NOT_OWNER. Keep the shard-id guard, but consider both ReadWrite and ReadOnly as owned states.

-        return ds_ref.shard_status_.load(std::memory_order_acquire) ==
-               DSShardStatus::ReadWrite;
+        const auto status =
+            ds_ref.shard_status_.load(std::memory_order_acquire);
+        return status == DSShardStatus::ReadWrite ||
+               status == DSShardStatus::ReadOnly;
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between dbc6a98 and 415fc21.

📒 Files selected for processing (2)
  • data_store_service_client.cpp (89 hunks)
  • eloq_data_store_service/data_store_service.h (13 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
eloq_data_store_service/data_store_service.h (3)
eloq_data_store_service/data_store_service.cpp (14)
  • StartService (236-305)
  • StartService (236-236)
  • Read (391-424)
  • Read (391-394)
  • Read (426-469)
  • Read (426-434)
  • ScanClose (987-1020)
  • ScanClose (987-990)
  • ScanClose (1022-1054)
  • ScanClose (1022-1027)
  • CloseDataStore (1837-1882)
  • CloseDataStore (1837-1837)
  • OpenDataStore (1884-1920)
  • OpenDataStore (1884-1884)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
data_store_service_client.cpp (3)
data_store_service_client.h (3)
  • TxPort2DssPort (126-129)
  • need_bootstrap_ (680-804)
  • KvPartitionIdOf (621-626)
eloq_data_store_service/data_store_service.cpp (8)
  • BatchWriteRecords (845-893)
  • BatchWriteRecords (845-849)
  • BatchWriteRecords (1115-1180)
  • BatchWriteRecords (1115-1128)
  • Read (391-424)
  • Read (391-394)
  • Read (426-469)
  • Read (426-434)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
data_store_service_client.cpp (1)

3007-3020: Use ng_id instead of hardcoded 0 for consistency.

Line 3018 hardcodes CloseDataStore(0) while the method receives ng_id as a parameter. For consistency with OnLeaderStart and OnLeaderStop (which correctly use ng_id), and to support the multi-shard design, this should be CloseDataStore(ng_id).

Apply this diff:

     if (data_store_service_ != nullptr)
     {
-        // Now, only support one shard.
-        data_store_service_->CloseDataStore(0);
+        data_store_service_->CloseDataStore(ng_id);
     }
🧹 Nitpick comments (2)
eloq_data_store_service/data_store_service.h (2)

682-691: Reset shard_id_ in ShutDown for lifecycle consistency.

CloseDataStore (data_store_service.cpp line 1837) checks ds_ref.shard_id_ == UINT32_MAX to detect unassigned shards. For consistency, reset shard_id_ to UINT32_MAX after shutdown.

Apply this diff:

 void ShutDown()
 {
     if (data_store_ != nullptr)
     {
         shard_status_.store(DSShardStatus::Closed,
                             std::memory_order_release);
         data_store_->Shutdown();
         scan_iter_cache_.Clear();
+        shard_id_ = UINT32_MAX;
     }
 }

626-629: Fix formatting: Remove leading space before return type.

Line 626 has an extra space before DataStoreServiceClusterManager.

Apply this diff:

-     DataStoreServiceClusterManager &GetClusterManager()
+    DataStoreServiceClusterManager &GetClusterManager()
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 415fc21 and dda63c8.

📒 Files selected for processing (3)
  • data_store_service_client.cpp (89 hunks)
  • data_store_service_client.h (13 hunks)
  • eloq_data_store_service/data_store_service.h (13 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client.h
  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client.cpp
🧬 Code graph analysis (3)
eloq_data_store_service/data_store_service.h (4)
eloq_data_store_service/data_store_service.cpp (14)
  • StartService (236-305)
  • StartService (236-236)
  • Read (391-424)
  • Read (391-394)
  • Read (426-469)
  • Read (426-434)
  • ScanClose (987-1020)
  • ScanClose (987-990)
  • ScanClose (1022-1054)
  • ScanClose (1022-1027)
  • CloseDataStore (1837-1882)
  • CloseDataStore (1837-1837)
  • OpenDataStore (1884-1920)
  • OpenDataStore (1884-1884)
data_store_service_client.cpp (6)
  • Read (1088-1097)
  • Read (1088-1093)
  • Read (3504-3520)
  • Read (3504-3509)
  • ScanClose (3771-3795)
  • ScanClose (3771-3776)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
data_store_service_client.h (1)
data_store_service_client.cpp (24)
  • FindFreeNodeIndex (3243-3262)
  • FindFreeNodeIndex (3243-3243)
  • InitBucketsInfo (3122-3181)
  • InitBucketsInfo (3122-3126)
  • OnLeaderStart (2969-2984)
  • OnLeaderStart (2969-2970)
  • OnLeaderStop (2986-2998)
  • OnLeaderStop (2986-2986)
  • OnStartFollowing (3007-3054)
  • OnStartFollowing (3007-3010)
  • GetShardIdByPartitionId (3082-3105)
  • GetShardIdByPartitionId (3082-3083)
  • IsLocalShard (3072-3080)
  • IsLocalShard (3072-3072)
  • GetOwnerNodeIndexOfShard (3183-3188)
  • GetOwnerNodeIndexOfShard (3183-3184)
  • GetAllDataShards (3107-3120)
  • GetAllDataShards (3107-3107)
  • UpdateOwnerNodeIndexOfShard (3190-3241)
  • UpdateOwnerNodeIndexOfShard (3190-3191)
  • HandleShardingError (3264-3298)
  • HandleShardingError (3264-3265)
  • UpgradeShardVersion (3300-3348)
  • UpgradeShardVersion (3300-3303)
data_store_service_client.cpp (4)
data_store_service_client.h (4)
  • TxPort2DssPort (132-135)
  • need_bootstrap_ (686-810)
  • KvPartitionIdOf (627-632)
  • Connect (152-157)
eloq_data_store_service/data_store_service.cpp (16)
  • BatchWriteRecords (845-893)
  • BatchWriteRecords (845-849)
  • BatchWriteRecords (1115-1180)
  • BatchWriteRecords (1115-1128)
  • Read (391-424)
  • Read (391-394)
  • Read (426-469)
  • Read (426-434)
  • ScanNext (895-951)
  • ScanNext (895-910)
  • ScanNext (953-985)
  • ScanNext (953-956)
  • DeleteRange (561-608)
  • DeleteRange (561-565)
  • DeleteRange (610-663)
  • DeleteRange (610-617)
eloq_data_store_service/rocksdb_data_store_common.cpp (9)
  • BatchWriteRecords (479-641)
  • BatchWriteRecords (479-480)
  • key (832-832)
  • Read (416-477)
  • Read (416-416)
  • ScanNext (735-1045)
  • ScanNext (735-735)
  • DeleteRange (339-414)
  • DeleteRange (339-339)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (8)
data_store_service_client.h (2)

92-97: Good: Constructor validates shard_id bounds.

The bounds check ensures shard_id doesn't exceed the hardcoded array size during initialization, addressing past concerns about the 1000-element limit.


137-142: LGTM: Signature updated for shard-aware configuration.

Adding node_id and ng_leaders parameters enables per-shard ownership resolution during cluster configuration, consistent with the multi-shard architecture.

data_store_service_client.cpp (6)

224-227: Early return logic looks correct.

The early return when !need_bootstrap_ appropriately skips initialization that's not required in shard-aware configurations.


3082-3105: Consider the fallback return value.

If bucket_infos_.find(bucket_id) returns end(), the assert at line 3097 fires, but in release builds the function would return 0 (line 3104). If 0 is not a valid shard_id or if this masks errors, consider returning an error status or throwing instead.

However, if the assert guarantees that this path never executes in correct code, and 0 is a safe sentinel, this is acceptable.


3122-3181: Consistent hashing implementation looks correct.

The deterministic virtual node generation using std::mt19937 seeded with node group ID ensures stable bucket assignments across restarts. The 64 virtual nodes per group provide good distribution.


2368-2406: Partition-to-shard mapping in CopyBaseToArchive is correct.

The logic correctly determines is_range_partitioned once (line 2368) and consistently uses it for all partition-to-shard mappings, properly handling both hash and range partitioned tables.


3107-3120: Thread-safe shard enumeration.

The shared lock correctly protects concurrent access to dss_shard_ids_, and returning a vector copy ensures safe iteration by callers without holding the lock.


2143-2146: No issues found. The double partition mapping is intentional and correct.

The code follows a consistent pattern: HashArchiveKey computes a partition ID in archive key space, then KvPartitionIdOf(partition_id, false) maps that to the KV partition space. This is confirmed by the second overload signature int32_t KvPartitionIdOf(int32_t key_partition, bool is_range_partition = true), which accepts a partition ID and transforms it. The same pattern is used throughout the codebase (lines 2570–2572, 2681–2683, 2778–2781, 2816–2818), confirming this is the intended design.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
eloq_data_store_service/data_store_service.h (1)

615-620: Critical: Add shard_id verification to IsOwnerOfShard.

IsOwnerOfShard only checks shard_status_ != Closed but does not verify ds_ref.shard_id_ == shard_id. This is inconsistent with FetchDSShardStatus (lines 584-589) which does check shard_id equality. Without this check, an uninitialized shard slot (where shard_id_ is still UINT32_MAX from line 691) or a mismatched shard could be treated as owned if its status happens to be non-Closed.

Apply this diff to add the missing check:

 bool IsOwnerOfShard(uint32_t shard_id) const
 {
     const auto &ds_ref = data_shards_.at(shard_id);
-    return ds_ref.shard_status_.load(std::memory_order_acquire) !=
-           DSShardStatus::Closed;
+    return ds_ref.shard_id_ == shard_id &&
+           ds_ref.shard_status_.load(std::memory_order_acquire) !=
+               DSShardStatus::Closed;
 }
🧹 Nitpick comments (1)
eloq_data_store_service/data_store_service.h (1)

680-689: Consider resetting shard_id_ in ShutDown for consistency.

The ShutDown() method sets shard_status_ to Closed and clears resources, but does not reset shard_id_ back to UINT32_MAX. While not strictly necessary if status checks are sufficient, this creates an inconsistency with initialization (line 691) where shard_id_ starts at UINT32_MAX. Code like GetDataStore (line 633) checks shard_id_ equality to determine if a shard is properly initialized, so resetting it after shutdown would make the lifecycle clearer.

Apply this diff if you want to maintain consistency:

 void ShutDown()
 {
     if (data_store_ != nullptr)
     {
         shard_status_.store(DSShardStatus::Closed,
                             std::memory_order_release);
         data_store_->Shutdown();
         scan_iter_cache_.Clear();
+        shard_id_ = UINT32_MAX;
     }
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dda63c8 and 7a94594.

📒 Files selected for processing (2)
  • eloq_data_store_service/data_store_service.cpp (45 hunks)
  • eloq_data_store_service/data_store_service.h (14 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • eloq_data_store_service/data_store_service.cpp
🧬 Code graph analysis (2)
eloq_data_store_service/data_store_service.h (4)
eloq_data_store_service/data_store_service.cpp (10)
  • StartService (242-310)
  • StartService (242-242)
  • Read (392-425)
  • Read (392-395)
  • Read (427-470)
  • Read (427-435)
  • ScanClose (988-1021)
  • ScanClose (988-991)
  • ScanClose (1023-1055)
  • ScanClose (1023-1028)
eloq_data_store_service/rocksdb_data_store_common.cpp (5)
  • Read (416-477)
  • Read (416-416)
  • key (832-832)
  • ScanClose (1047-1071)
  • ScanClose (1047-1047)
data_store_service_client.cpp (6)
  • Read (1088-1097)
  • Read (1088-1093)
  • Read (3504-3520)
  • Read (3504-3509)
  • ScanClose (3771-3795)
  • ScanClose (3771-3776)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
eloq_data_store_service/data_store_service.cpp (4)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (615-620)
  • PrepareShardingError (525-529)
  • IncreaseWriteReqCount (603-607)
  • DecreaseWriteReqCount (609-613)
eloq_data_store_service/data_store_service_config.cpp (6)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
eloq_data_store_service/rocksdb_data_store_common.cpp (13)
  • Read (416-477)
  • Read (416-416)
  • key (832-832)
  • DeleteRange (339-414)
  • DeleteRange (339-339)
  • start_key (367-367)
  • start_key (679-679)
  • end_key (382-382)
  • end_key (684-684)
  • ScanNext (735-1045)
  • ScanNext (735-735)
  • ScanClose (1047-1071)
  • ScanClose (1047-1047)
data_store_service_client.cpp (10)
  • Read (1088-1097)
  • Read (1088-1093)
  • Read (3504-3520)
  • Read (3504-3509)
  • DeleteRange (3553-3575)
  • DeleteRange (3553-3560)
  • ScanNext (3696-3729)
  • ScanNext (3696-3710)
  • ScanClose (3771-3795)
  • ScanClose (3771-3776)
🔇 Additional comments (11)
eloq_data_store_service/data_store_service.h (3)

190-190: LGTM: StartService signature simplified for per-shard architecture.

The signature change correctly removes per-call parameters, delegating shard ownership queries to cluster_manager_. The implementation properly initializes all owned shards.


582-613: LGTM: Per-shard accessors correctly implemented.

FetchDSShardStatus, IncreaseWriteReqCount, and DecreaseWriteReqCount properly use per-shard references with appropriate atomic memory ordering. Note that FetchDSShardStatus correctly verifies shard_id_ equality before returning status.


631-641: LGTM: GetDataStore properly validates shard_id.

The method correctly verifies shard_id_ equality before returning the data store pointer, ensuring the shard slot is properly initialized.

eloq_data_store_service/data_store_service.cpp (8)

216-221: LGTM: Per-shard initialization correctly implemented.

The constructor properly initializes all shard slots with their respective IDs and Closed status, establishing a clean initial state for the per-shard architecture.


236-238: LGTM: Destructor properly cleans up all shards.

The destructor correctly iterates through all shard slots and invokes ShutDown() to ensure proper cleanup.


242-310: LGTM: StartService correctly implements per-shard initialization.

The method properly:

  • Queries cluster_manager for owned shards
  • Uses atomic compare_exchange for safe state transitions (Closed → Starting → final mode)
  • Handles both ReadOnly and ReadWrite initial modes
  • Provides appropriate error handling and logging

312-390: LGTM: ConnectAndStartDataStore handles per-shard lifecycle correctly.

The method properly:

  • Uses atomic compare_exchange with spin-wait for concurrent initialization attempts
  • Handles both new data store creation and existing data store reinitialization
  • Synchronizes with cluster_manager for shard status updates
  • Uses appropriate atomic memory ordering

392-425: RPC handlers correctly implement per-shard dispatch pattern.

All RPC handlers follow a consistent, correct pattern:

  1. Extract shard_id from request
  2. Check ownership via IsOwnerOfShard
  3. Increment write request counter for write operations
  4. Verify shard status (ReadOnly/ReadWrite for reads; ReadWrite only for writes)
  5. Dispatch to per-shard data_store_

Note: These rely on IsOwnerOfShard which has a critical bug flagged in the header file review.

Also applies to: 472-517, 562-609
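The five dispatch steps above can be condensed into a small sketch. The types and names here (`MiniService`, `HandleWrite`) are hypothetical stand-ins for the service's real handlers, shown only to make the ordering of the ownership check, write counter, and status check explicit.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <string>

enum class DSShardStatus { Closed, ReadOnly, ReadWrite };

struct Shard
{
    uint32_t shard_id_ = UINT32_MAX;  // UINT32_MAX == unassigned slot
    std::atomic<DSShardStatus> status_{DSShardStatus::Closed};
    std::atomic<int> write_req_cnt_{0};
};

struct MiniService
{
    std::array<Shard, 8> shards_;

    bool HandleWrite(uint32_t shard_id, std::string *err)
    {
        Shard &s = shards_.at(shard_id);
        // 1-2. Ownership: slot must be assigned to this shard and not Closed.
        if (s.shard_id_ != shard_id ||
            s.status_.load(std::memory_order_acquire) == DSShardStatus::Closed)
        {
            *err = "not owner of shard";
            return false;
        }
        // 3. Count the in-flight write so a ReadWrite -> ReadOnly
        //    transition can drain before completing.
        s.write_req_cnt_.fetch_add(1, std::memory_order_acq_rel);
        // 4. Writes require ReadWrite; reads would also accept ReadOnly.
        if (s.status_.load(std::memory_order_acquire) !=
            DSShardStatus::ReadWrite)
        {
            s.write_req_cnt_.fetch_sub(1, std::memory_order_acq_rel);
            *err = "shard is read-only";
            return false;
        }
        // 5. ...dispatch to the per-shard data_store_ here...
        s.write_req_cnt_.fetch_sub(1, std::memory_order_acq_rel);
        return true;
    }
};
```

Note the counter is incremented before the status re-check, mirroring the ordering the real handlers need so a concurrent transition cannot miss an in-flight write.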


1076-1109: LGTM: Scan iterator cache operations correctly use per-shard cache.

All scan iterator cache methods properly delegate to the per-shard scan_iter_cache_ member, maintaining session isolation across shards.


1831-1904: LGTM: CloseDataStore and OpenDataStore implement proper per-shard lifecycle with good observability.

Both methods correctly:

  • Access per-shard state via ds_ref
  • Implement progressive state transitions (CloseDataStore: ReadWrite → ReadOnly → Closed)
  • Provide detailed logging with shard_id and status for debugging
  • Measure and log timing for OpenDataStore operations

Note: CloseDataStore relies on IsOwnerOfShard which has a critical bug flagged in the header file review.


2281-2365: LGTM: Shard state transition methods correctly implement atomic state machine.

All three transition methods properly:

  • Use per-shard ds_ref references
  • Employ atomic compare_exchange for safe concurrent transitions
  • Handle idempotent cases (already in target state)
  • Synchronize with cluster_manager_ for cluster-wide consistency
  • Notably, SwitchReadWriteToReadOnly correctly waits for ongoing write requests to complete before finalizing the transition

@lzxddz force-pushed the dss_multi_shard-rebase branch from 7a94594 to 19ea736 on November 11, 2025 at 10:50
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
eloq_data_store_service/rocksdb_data_store.cpp (1)

69-83: Null data_store_service_ in base class Shutdown() or enforce contract.

The issue is real. The destructor at line 60-67 checks data_store_service_ != nullptr and calls Shutdown() if true, indicating double-shutdown safety is intended. However:

  1. Base class Shutdown() (rocksdb_data_store_common.cpp:291) nulls query_worker_pool_ but not data_store_service_
  2. Derived class Shutdown() (rocksdb_data_store.cpp:69-83) also does not null data_store_service_
  3. If external code deletes data_store_service_ but this object isn't destroyed, the pointer remains non-null
  4. When the destructor runs, it calls Shutdown() again, which calls data_store_service_->ForceEraseScanIters() on a dangling pointer — a use-after-free.

Either null data_store_service_ in the base class after calling ForceEraseScanIters(), or document that the service must outlive the data store and that Shutdown() cannot be called if the service is destroyed externally.

eloq_data_store_service/internal_request.h (2)

754-763: Pass shard_id when preparing sharding errors

PrepareShardingError now expects a shard id, but both the RPC and local Read completion paths still pass partition_id. When the node isn’t the owner this will emit the wrong sharding metadata (likely returning “shard 123” instead of the real shard), breaking client re-routing. Switch these calls to use the request’s shard id.

-            ds_service_->PrepareShardingError(req_->partition_id(),
+            ds_service_->PrepareShardingError(req_->shard_id(),
                                               resp_->mutable_result());
…
-            ds_service_->PrepareShardingError(partition_id_, result_);
+            ds_service_->PrepareShardingError(shard_id_, result_);

Also applies to: 855-863


1324-1333: Fix shard_id propagation in Scan error handling

Same issue here: the Scan RPC/local paths still pass partition_id into PrepareShardingError, so clients receive the wrong shard metadata on ownership changes. Use the associated shard id instead.

-            ds_service_->PrepareShardingError(req_->partition_id(),
+            ds_service_->PrepareShardingError(req_->shard_id(),
                                              resp_->mutable_result());
…
-            ds_service_->PrepareShardingError(partition_id_, result_);
+            ds_service_->PrepareShardingError(shard_id_, result_);

Also applies to: 1513-1522

♻️ Duplicate comments (2)
eloq_data_store_service/data_store_service.h (1)

615-625: Critical: IsOwnerOfShard missing shard_id validation.

Unlike FetchDSShardStatus (line 584) and GetDataStore (line 638), IsOwnerOfShard does not validate that ds_ref.shard_id_ == shard_id. This inconsistency could cause it to return true for the wrong shard if slots are reused or not properly initialized.

The prior review comment explicitly stated: "keep the existing pattern of reading shard_id first and then loading shard_status_" and "return false for non-matching shard_id." This validation is missing.

Apply this diff to add the missing validation:

 bool IsOwnerOfShard(uint32_t shard_id) const
 {
     const auto &ds_ref = data_shards_.at(shard_id);

     DLOG(INFO) << "====IsOwnerOfShard shard_id:" << shard_id << " res:"
                << (int) (ds_ref.shard_status_.load(
                              std::memory_order_acquire) !=
                          DSShardStatus::Closed);
-    return ds_ref.shard_status_.load(std::memory_order_acquire) !=
-           DSShardStatus::Closed;
+    return ds_ref.shard_id_ == shard_id &&
+           ds_ref.shard_status_.load(std::memory_order_acquire) !=
+               DSShardStatus::Closed;
 }
data_store_service_client.cpp (1)

3985-4001: Replace assert(false) with proper error handling.

Lines 3995-4001 contain an else branch with assert(false) (line 3997) followed by fallback logic. This pattern creates inconsistent behavior:

  • Debug builds: Process crashes if Sequences::table_name_ is not hash-partitioned.
  • Release builds: Code continues with fallback kv_partition_id, potentially masking a configuration error.

If the Sequences table must be hash-partitioned (as the comment implies), replace the assert and fallback with:

 else
 {
-    assert(false);
-    // Use the init partition id.
-    kv_partition_id = KvPartitionIdOf(txservice::Sequences::table_name_);
-    data_shard_id = GetShardIdByPartitionId(kv_partition_id, false);
+    LOG(ERROR) << "Sequences table must be hash-partitioned, but is not";
+    return false;
 }

This ensures callers can handle the error gracefully.

🧹 Nitpick comments (1)
eloq_data_store_service/data_store_service.h (1)

676-705: Consider resetting shard_id_ in ShutDown() for consistency.

The documentation and cleanup logic are good. However, shard_id_ is not reset to UINT32_MAX in ShutDown(), even though other methods like FetchDSShardStatus and GetDataStore check for this sentinel value. For defensive coding and consistency with the pattern established elsewhere, consider adding:

 void ShutDown()
 {
     if (data_store_ != nullptr)
     {
         data_store_->Shutdown();
         data_store_ = nullptr;
     }

     if (scan_iter_cache_ != nullptr)
     {
         scan_iter_cache_->Clear();
         scan_iter_cache_ = nullptr;
     }
+    shard_id_ = UINT32_MAX;
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a94594 and c39a22d.

📒 Files selected for processing (17)
  • data_store_service_client.cpp (90 hunks)
  • data_store_service_client.h (12 hunks)
  • data_store_service_client_closure.cpp (17 hunks)
  • data_store_service_client_closure.h (34 hunks)
  • data_store_service_scanner.cpp (3 hunks)
  • data_store_service_scanner.h (4 hunks)
  • eloq_data_store_service/CMakeLists.txt (0 hunks)
  • eloq_data_store_service/data_store_service.cpp (45 hunks)
  • eloq_data_store_service/data_store_service.h (14 hunks)
  • eloq_data_store_service/data_store_service_util.h (1 hunks)
  • eloq_data_store_service/ds_request.proto (5 hunks)
  • eloq_data_store_service/internal_request.h (55 hunks)
  • eloq_data_store_service/main.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (5 hunks)
  • eloq_data_store_service/rocksdb_data_store.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.cpp (3 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.h (1 hunks)
💤 Files with no reviewable changes (1)
  • eloq_data_store_service/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client_closure.cpp
  • data_store_service_client.h
  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client_closure.cpp
  • eloq_data_store_service/data_store_service.cpp
  • data_store_service_client.cpp
🧬 Code graph analysis (13)
data_store_service_client_closure.cpp (1)
data_store_service_client.cpp (1)
  • callback_data (2575-2581)
eloq_data_store_service/rocksdb_data_store.cpp (2)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (173-194)
  • Shutdown (173-173)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (283-298)
  • Shutdown (283-283)
data_store_service_scanner.h (1)
eloq_data_store_service/internal_request.h (9)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (818-826)
  • partition_id_ (647-647)
  • partition_id_ (869-869)
  • partition_id_ (1528-1528)
eloq_data_store_service/rocksdb_data_store_common.h (4)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (173-194)
  • Shutdown (173-173)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (283-298)
  • Shutdown (283-283)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (91-106)
  • Shutdown (91-91)
data_store_service_scanner.cpp (1)
data_store_service_scanner.h (1)
  • data_shard_id_ (266-266)
eloq_data_store_service/data_store_service.cpp (3)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (615-625)
  • PrepareShardingError (525-529)
  • IncreaseWriteReqCount (603-607)
  • DecreaseWriteReqCount (609-613)
eloq_data_store_service/data_store_service_config.cpp (6)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
eloq_data_store_service/rocksdb_data_store_common.cpp (3)
  • key (849-849)
  • ScanNext (752-1062)
  • ScanNext (752-752)
eloq_data_store_service/rocksdb_data_store_common.cpp (5)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (173-194)
  • Shutdown (173-173)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
rocksdb_handler.cpp (4)
  • Shutdown (2323-2358)
  • Shutdown (2323-2323)
  • Shutdown (3404-3436)
  • Shutdown (3404-3404)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (91-106)
  • Shutdown (91-91)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
eloq_data_store_service/internal_request.h (2)
eloq_data_store_service/data_store_service_config.h (1)
  • uint32_t (107-110)
data_store_service_client_closure.h (31)
  • Reset (81-85)
  • Reset (144-153)
  • Reset (210-216)
  • Reset (269-274)
  • Reset (385-392)
  • Reset (600-619)
  • Reset (933-948)
  • Reset (1154-1177)
  • Reset (1393-1408)
  • Reset (1598-1624)
  • Reset (1627-1658)
  • Reset (1660-1694)
  • Reset (1933-2251)
  • Reset (2328-2347)
  • Reset (2606-2612)
  • Reset (3002-3010)
  • vector (1078-1081)
  • vector (1121-1128)
  • vector (2253-2256)
  • string (804-809)
  • string (2173-2178)
  • string_view (311-368)
  • string_view (752-755)
  • string_view (767-770)
  • string_view (792-797)
  • string_view (1291-1294)
  • string_view (1533-1536)
  • string_view (2133-2136)
  • string_view (2138-2141)
  • string_view (2241-2246)
  • string_view (2453-2456)
data_store_service_client_closure.h (2)
eloq_data_store_service/internal_request.h (13)
  • partition_id_ (647-647)
  • partition_id_ (869-869)
  • partition_id_ (1528-1528)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (818-826)
  • string_view (988-996)
  • string_view (1123-1131)
  • string_view (1425-1433)
  • string_view (1440-1448)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
data_store_service_client.h (1)
data_store_service_client.cpp (24)
  • FindFreeNodeIndex (3250-3269)
  • FindFreeNodeIndex (3250-3250)
  • InitBucketsInfo (3129-3188)
  • InitBucketsInfo (3129-3133)
  • OnLeaderStart (2969-2986)
  • OnLeaderStart (2969-2970)
  • OnLeaderStop (2988-3002)
  • OnLeaderStop (2988-2988)
  • OnStartFollowing (3011-3061)
  • OnStartFollowing (3011-3015)
  • GetShardIdByPartitionId (3089-3112)
  • GetShardIdByPartitionId (3089-3090)
  • IsLocalShard (3079-3087)
  • IsLocalShard (3079-3079)
  • GetOwnerNodeIndexOfShard (3190-3195)
  • GetOwnerNodeIndexOfShard (3190-3191)
  • GetAllDataShards (3114-3127)
  • GetAllDataShards (3114-3114)
  • UpdateOwnerNodeIndexOfShard (3197-3248)
  • UpdateOwnerNodeIndexOfShard (3197-3198)
  • HandleShardingError (3271-3305)
  • HandleShardingError (3271-3272)
  • UpgradeShardVersion (3307-3355)
  • UpgradeShardVersion (3307-3310)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (3)
eloq_data_store_service/rocksdb_cloud_data_store.h (1)
  • RocksDBCloudDataStore (38-161)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (283-298)
  • Shutdown (283-283)
eloq_data_store_service/data_store_service.h (4)
eloq_data_store_service/data_store_service.cpp (12)
  • StartService (242-312)
  • StartService (242-242)
  • Read (399-432)
  • Read (399-402)
  • Read (434-477)
  • Read (434-442)
  • ScanClose (995-1028)
  • ScanClose (995-998)
  • ScanClose (1030-1062)
  • ScanClose (1030-1035)
  • TTLWrapperCache (72-77)
  • TTLWrapperCache (79-88)
data_store_service_client.cpp (6)
  • Read (1088-1097)
  • Read (1088-1093)
  • Read (3511-3527)
  • Read (3511-3516)
  • ScanClose (3778-3802)
  • ScanClose (3778-3783)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
data_store_service_client.cpp (3)
data_store_service_client.h (3)
  • TxPort2DssPort (134-137)
  • KvPartitionIdOf (629-634)
  • Connect (154-159)
eloq_data_store_service/data_store_service.cpp (16)
  • BatchWriteRecords (853-901)
  • BatchWriteRecords (853-857)
  • BatchWriteRecords (1118-1183)
  • BatchWriteRecords (1118-1131)
  • Read (399-432)
  • Read (399-402)
  • Read (434-477)
  • Read (434-442)
  • ScanNext (903-959)
  • ScanNext (903-918)
  • ScanNext (961-993)
  • ScanNext (961-964)
  • DeleteRange (569-616)
  • DeleteRange (569-573)
  • DeleteRange (618-671)
  • DeleteRange (618-625)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🪛 Clang (14.0.6)
eloq_data_store_service/data_store_service_util.h

[error] 24-24: 'atomic' file not found

(clang-diagnostic-error)

🔇 Additional comments (28)
data_store_service_client_closure.cpp (2)

516-521: LGTM: Shard ID propagation in batch write flow.

Extracting data_shard_id from the closure and passing it to BatchWriteRecords correctly propagates shard context through the partition batch callback.


72-73: Key() method confirmed to return std::string_view.

Verification at data_store_service_client_closure.h:767-770 confirms ReadClosure::Key() returns const std::string_view. The code change from Key().back() to Key() is correct given this return type.

data_store_service_scanner.cpp (1)

417-419: LGTM: Per-partition shard routing correctly implemented.

The scanner now computes data_shard_id per partition using GetShardIdByPartitionId and passes it to Reset. This aligns with the broader shard-aware architecture.

data_store_service_client.h (1)

502-506: Approve: Shard-aware Read signature.

The updated Read signature now correctly accepts int32_t partition_id and uint32_t shard_id, aligning with the per-shard routing introduced across the PR.

eloq_data_store_service/data_store_service_util.h (1)

24-31: LGTM: Standard library includes for shard-aware containers.

The new includes (atomic, map, memory, set, unordered_map) support per-shard data structures (e.g., bucket_infos_, dss_shard_ids_) introduced across the PR. The static analysis error about <atomic> not found is a false positive due to the analysis environment lacking system headers.

eloq_data_store_service/main.cpp (1)

321-322: LGTM: StartService API simplified to single parameter.

The change from StartService(FLAGS_bootstrap || is_single_node, 0, 0) to StartService(FLAGS_bootstrap || is_single_node) aligns with the per-shard initialization refactoring mentioned in the PR summary.

eloq_data_store_service/rocksdb_data_store_common.h (1)

165-165: LGTM: Shutdown method exposed for per-shard cleanup.

Adding void Shutdown() override enables centralized cleanup of query_worker_pool_ and scanner state, as shown in the implementation (rocksdb_data_store_common.cpp:282-297). This supports the per-shard lifecycle refactoring.

eloq_data_store_service/ds_request.proto (1)

95-95: No breaking change detected — review comment is incorrect.

The current code shows bytes key_str = 3; as a single (non-repeated) field in both locations (lines 95 and 109). There is no evidence in the codebase that this field was ever repeated bytes. The repetition in FetchRecordsRequest occurs at the message container level (repeated key keys), not at the key_str field level. The review comment appears to mischaracterize the proto structure or assumes a prior state that does not exist in the current codebase.

Likely an incorrect or invalid review comment.

eloq_data_store_service/data_store_service.h (9)

190-190: LGTM: StartService signature simplified for multi-shard support.

The signature change from StartService(bool, uint32_t, uint32_t) to StartService(bool) is correct. The implementation now dynamically discovers shards via cluster_manager_.GetShardsForThisNode(), eliminating the need for explicit shard parameters.


220-228: LGTM: Read signature updated for per-shard routing.

The method signature correctly reflects the shard-aware architecture:

  • partition_id changed to int32_t (consistent with prior review resolutions)
  • shard_id parameter added for explicit shard routing
  • Changes align with implementation in data_store_service.cpp lines 433-476

256-268: LGTM: BatchWriteRecords signature updated for shard routing.

The shard_id parameter addition at line 258 is consistent with the per-shard routing architecture applied to other methods.


320-327: LGTM: DeleteRange, ScanNext, and ScanClose signatures updated consistently.

All three methods correctly adopt the per-shard routing pattern with int32_t partition_id and uint32_t shard_id parameters, consistent with the Read and BatchWriteRecords changes.

Also applies to: 403-417, 440-445


525-529: LGTM: PrepareShardingError correctly delegates to cluster manager.

The delegation to cluster_manager_.PrepareShardingError(shard_id, result) properly resolves the critical issue flagged in prior reviews where topology routing information was lost. The cluster manager's implementation populates all necessary routing metadata for client failover.


582-590: LGTM: FetchDSShardStatus properly validates shard initialization.

The method correctly checks ds_ref.shard_id_ == shard_id before returning the shard status, ensuring the shard slot is initialized for the requested shard_id.


603-613: LGTM: Write request counters operate directly on shard slots.

These methods directly access data_shards_.at(shard_id).ongoing_write_requests_ without validating shard_id_. This is acceptable since:

  • They're called in hot paths after ownership validation
  • Atomic operations on a default-constructed slot are well-defined as long as the counter is declared with an initializer (its default value is then 0)
  • Performance matters here

636-646: LGTM: GetDataStore properly validates shard initialization.

The method correctly validates data_shards_.at(shard_id).shard_id_ == shard_id before returning the data store pointer, consistent with the defensive pattern in FetchDSShardStatus.


707-707: LGTM: Per-shard data architecture established.

The std::array<DataShard, 1000> data_shards_ provides fixed-capacity shard storage with bounds-checked access via .at(). The limit of 1000 shards is a reasonable upper bound for the system's scale.

data_store_service_client_closure.h (6)

499-510: LGTM: ReadBaseForArchiveCallbackData updated consistently.

The AddResult method and partition_id_ member correctly use int32_t, consistent with the partition_id type change applied throughout the codebase.

Also applies to: 532-532


600-899: LGTM: ReadClosure updated consistently for per-shard routing.

The closure correctly integrates shard-aware changes:

  • int32_t partition_id_ with INT32_MAX sentinel (line 631)
  • uint32_t shard_id_ with UINT32_MAX sentinel (line 632)
  • std::string_view key_ simplifies key storage (line 633)
  • PrepareRequest properly populates shard_id and key_str (lines 660, 662)
  • Accessors return correct types (lines 757-770)
  • Error handling updates owner index with shard_id_ (line 688)

All changes are consistent with the per-shard routing architecture.


1154-1366: LGTM: DeleteRangeClosure updated consistently.

The closure follows the same pattern as ReadClosure:

  • int32_t partition_id and uint32_t shard_id parameters in Reset (lines 1156-1157)
  • PrepareRequest populates shard_id in RPC request (line 1200)
  • Error handling propagates shard_id_ (line 1228)
  • Accessors added (lines 1296-1304)

1598-1892: LGTM: BatchWriteRecordsClosure updated consistently.

Both Reset overloads correctly handle the new parameters:

  • Lines 1629-1630: First overload takes int32_t partition_id, uint32_t shard_id
  • Lines 1662-1663: Second overload takes the same types
  • Clear() properly resets sentinels (lines 1605-1606)
  • PrepareRequest populates shard_id (line 1777)
  • Accessors provided (lines 1850-1858)

1901-2294: LGTM: ScanNextClosure updated consistently.

The closure integrates shard-aware changes following the established pattern:

  • Reset takes int32_t partition_id, uint32_t shard_id (lines 1936-1937)
  • Sentinels properly set in Clear() (lines 1913-1914)
  • RPC request populated with shard_id (line 2008)
  • Accessors provided (lines 2123-2131)

2930-2958: LGTM: FetchArchivesCallbackData updated consistently.

The constructor parameter and member both use int32_t partition_id (lines 2931, 2949), consistent with the codebase-wide type change.

data_store_service_client.cpp (5)

222-242: LGTM: Bootstrap guard improves efficiency.

The need_bootstrap_ check prevents unnecessary initialization attempts when the client is already bootstrapped.


3114-3127: LGTM: Thread-safe shard enumeration.

Correctly uses shared_lock to protect read access to dss_shard_ids_, addressing the concurrency concern raised in earlier reviews. The returned vector preserves sorted order from the underlying std::set.


3129-3188: LGTM: Thread-safe and deterministic bucket initialization.

The use of std::mt19937 with node-group-ID seeding ensures:

  • Thread-safe random generation (no shared state).
  • Deterministic bucket-to-shard mapping across cluster nodes.

The virtual-node consistent-hashing logic correctly distributes buckets, and the early break (lines 3157-3163) guards against excessive node groups.


2969-3061: LGTM: Shard lifecycle management is consistent.

The bind_data_shard_with_ng_ flag correctly controls whether shard open/close operations are performed. The three callbacks (OnLeaderStart, OnLeaderStop, OnStartFollowing) consistently:

  • Guard shard operations behind the flag.
  • Pass ng_id to OpenDataStore/CloseDataStore.
  • Maintain symmetry (open on leader start, close on leader stop/follower start).

3089-3112: Verify Sharder contract and shard ID 0 validity.

The review identifies a real pattern concern, but resolution depends on external context:

  1. Assert-then-check redundancy depends on Sharder guarantees: The InitBucketsInfo method populates bucket_infos_ with ALL bucket IDs in range [0, txservice::total_range_buckets). If Sharder::MapRangeIdToBucketId and Sharder::MapHashPartitionIdToBucketId always return valid IDs within that range, the assert at line 3104 should never fire and the subsequent if check (line 3105) becomes dead code. However, if Sharder can return out-of-range bucket IDs, the assert would fail in debug builds while line 3111's fallback silently returns 0 in release builds—masking the error.

  2. Returning 0 as error sentinel is problematic: The code returns 0 when a bucket is not found, but there's no indication whether 0 is a valid shard ID or an error sentinel. Callers use the returned value directly without special handling for 0, suggesting they cannot distinguish between a valid shard 0 and an error condition.

To confirm the safety of this pattern:

  • Verify that txservice::Sharder guarantees returned bucket IDs are always < txservice::total_range_buckets
  • Confirm whether shard ID 0 is a valid assignment or should be reserved as an error code
  • Document the intended behavior when a bucket lookup fails (crash on invalid input vs. graceful degradation)

Comment on lines +94 to +99
if (shard_id >= dss_shards_.size())
{
    LOG(FATAL) << "Shard id " << shard_id
               << " is out of range, should expand the hard-coded "
                  "dss_shards_ size.";
}

⚠️ Potential issue | 🔴 Critical

Add bounds validation for shard_id to prevent array overrun.

The constructor now checks shard_id >= dss_shards_.size() and logs a fatal error, but this only covers initialization. Runtime accesses at lines 659 and elsewhere (e.g., GetOwnerNodeIndexOfShard, UpdateOwnerNodeIndexOfShard) directly index dss_shards_[shard_id] without bounds checks. A shard_id ≥ 1000 from a malicious/buggy caller triggers undefined behavior.

Add validation before array access:

uint32_t GetOwnerNodeIndexOfShard(uint32_t shard_id) const
{
    if (shard_id >= dss_shards_.size()) {
        LOG(ERROR) << "shard_id " << shard_id << " exceeds array bounds";
        return UINT32_MAX;
    }
    assert(dss_shards_[shard_id].load(std::memory_order_acquire) != UINT32_MAX);
    return dss_shards_[shard_id].load(std::memory_order_acquire);
}

Apply similar checks in UpdateOwnerNodeIndexOfShard, UpgradeShardVersion, and other methods that index dss_shards_.

🤖 Prompt for AI Agents
In data_store_service_client.h around lines 94 to 99 and throughout the file
(notably methods that index dss_shards_ such as GetOwnerNodeIndexOfShard,
UpdateOwnerNodeIndexOfShard, UpgradeShardVersion and any other direct
dss_shards_[shard_id] accesses), add explicit bounds validation before accessing
dss_shards_: check if shard_id >= dss_shards_.size(), log an ERROR with the
offending shard_id, and return a sentinel (e.g., UINT32_MAX) for getter-style
methods; for mutators, log and early-return or return a failure code as
appropriate; after the bounds check you may keep the existing assert and
original load/store with the proper memory order. Ensure every direct
dss_shards_[shard_id] access is guarded this way to prevent out-of-range
indexing.

@lzxddz lzxddz force-pushed the dss_multi_shard-rebase branch from c39a22d to 7a75157 Compare November 13, 2025 11:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
data_store_service_client.cpp (5)

3022-3026: Wrong shard closed on follower start; closes shard 0 instead of ng_id

Calling CloseDataStore(0) will close the wrong shard when ng_id ≠ 0. Use ng_id (the DSS shard id) to avoid data unavailability.

-        // Now, only support one shard.
-        data_store_service_->CloseDataStore(0);
+        // Close the shard corresponding to this node group.
+        data_store_service_->CloseDataStore(ng_id);

136-147: Fix typos in log/comments for clarity

Minor nits: “Cliet” → “Client”; “Binded” → “Bound”; “swith” → “switch”.

-            LOG(INFO) << "DataStoreServiceCliet UpgradeShardVersion success, "
+            LOG(INFO) << "DataStoreServiceClient UpgradeShardVersion success, "
...
-        LOG(INFO)
-            << "DataStoreServiceCliet SetupConfig skipped, current_version:"
+        LOG(INFO)
+            << "DataStoreServiceClient SetupConfig skipped, current_version:"
...
-        // Binded data store shard with ng.
+        // Bound data store shard with ng.
...
-    // swith to read only in case of data store status is read write
+    // switch to read-only in case data store status is read-write

Also applies to: 2979-2981, 2995-2999


3191-3195: Add shard_id bounds check to prevent array overrun

Directly indexing dss_shards_ without validating shard_id risks OOB access. Return a sentinel on invalid input.

 uint32_t DataStoreServiceClient::GetOwnerNodeIndexOfShard(
     uint32_t shard_id) const
 {
-    assert(dss_shards_[shard_id].load(std::memory_order_acquire) != UINT32_MAX);
-    return dss_shards_[shard_id].load(std::memory_order_acquire);
+    if (shard_id >= dss_shards_.size()) {
+        LOG(ERROR) << "shard_id " << shard_id << " exceeds dss_shards_ bounds";
+        return UINT32_MAX;
+    }
+    auto idx = dss_shards_[shard_id].load(std::memory_order_acquire);
+    assert(idx != UINT32_MAX);
+    return idx;
 }

3312-3319: Return failure for invalid shard_id; avoid assert+FATAL path

If shard_id is out of range, log and return false so callers can handle it gracefully.

-    if (shard_id >= dss_shards_.size())
-    {
-        assert(false);
-        // Now only support one shard.
-        LOG(FATAL) << "Shard id not found, shard_id: " << shard_id;
-        return true;
-    }
+    if (shard_id >= dss_shards_.size()) {
+        LOG(ERROR) << "Shard id not found, shard_id: " << shard_id;
+        return false;
+    }

4195-4245: Header/implementation mismatch: record_tmp_mem_area uses size_t in .cpp but uint64_t in header

The two SerializeTxRecord overloads here take std::vector<size_t>&, but the header declares std::vector<uint64_t>&. This will not compile. Align the header to size_t (or switch implementation to uint64_t consistently). Suggest updating the header to size_t to match current implementation.

Proposed header fix shown in data_store_service_client.h review.

data_store_service_client.h (1)

416-431: Make SerializeTxRecord signatures consistent with implementation (size_t vs uint64_t)

Implementation uses std::vector<size_t>& for record_tmp_mem_area. Header declares std::vector<uint64_t>&, causing a compile error. Align to size_t.

-    static void SerializeTxRecord(bool is_deleted,
-                                  const txservice::TxRecord *rec,
-                                  std::vector<uint64_t> &record_tmp_mem_area,
-                                  std::vector<std::string_view> &record_parts,
-                                  size_t &write_batch_size);
+    static void SerializeTxRecord(bool is_deleted,
+                                  const txservice::TxRecord *rec,
+                                  std::vector<size_t> &record_tmp_mem_area,
+                                  std::vector<std::string_view> &record_parts,
+                                  size_t &write_batch_size);

-    static void SerializeTxRecord(const txservice::TxRecord *rec,
-                                  std::vector<uint64_t> &record_tmp_mem_area,
-                                  std::vector<std::string_view> &record_parts,
-                                  size_t &write_batch_size);
+    static void SerializeTxRecord(const txservice::TxRecord *rec,
+                                  std::vector<size_t> &record_tmp_mem_area,
+                                  std::vector<std::string_view> &record_parts,
+                                  size_t &write_batch_size);
eloq_data_store_service/internal_request.h (2)

754-763: Pass shard_id, not partition_id, to PrepareShardingError.

The method signature expects uint32_t shard_id, but these SetFinish() implementations call it with partition_id. Four occurrences need fixes:

-            ds_service_->PrepareShardingError(req_->partition_id(),
+            ds_service_->PrepareShardingError(req_->shard_id(),
                                               resp_->mutable_result());
-            ds_service_->PrepareShardingError(partition_id_, result_);
+            ds_service_->PrepareShardingError(shard_id_, result_);
-            ds_service_->PrepareShardingError(req_->partition_id(),
+            ds_service_->PrepareShardingError(req_->shard_id(),
                                               resp_->mutable_result());
-            ds_service_->PrepareShardingError(partition_id_, result_);
+            ds_service_->PrepareShardingError(shard_id_, result_);

Applies to: lines 759, 858, 1327, 1518


523-531: Update GetPartitionId() and GetShardId() overrides from uint32_t to int32_t throughout the file.

Verification confirms multiple GetPartitionId() and GetShardId() overrides still return uint32_t when they should return int32_t to match the interface contract:

  • GetPartitionId() at lines 523, 611, 729, 828, 1235, 1430 use uint32_t (should be int32_t)
  • GetShardId() at lines 121, 268, 383, 450, 528, 616, 734, 833, 927, 993, 1062, 1128, 1240, 1435, 1594, 1669 use uint32_t (should be int32_t)

Update all these method signatures to return int32_t.

data_store_service_scanner.h (1)

235-238: Fix return type: align GetPartitionId() with int32_t member.

partition_id_ is int32_t, but the getter returns uint32_t. This implicit conversion can corrupt negative values and break comparisons. Change the getter to return int32_t.

-    uint32_t GetPartitionId()
+    int32_t GetPartitionId()
     {
         return partition_id_;
     }
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)

238-251: Fix unconditional TLS verification disablement in S3 client configuration.

Disabling TLS verification (config.verifySSL = false;) weakens transport security and enables man-in-the-middle attacks. The AWS SDK defaults to verifySSL = true; this setting should not override that default.

Affected locations:

  • eloq_data_store_service/rocksdb_cloud_data_store.cpp:250
  • eloq_data_store_service/rocksdb_cloud_dump.cpp:218

Fix: Remove the line config.verifySSL = false; entirely, or explicitly set it to true to respect the AWS SDK default:

-        // Disable SSL verification for HTTPS
-        config.verifySSL = false;
+        // Enable SSL verification (AWS SDK default)
+        config.verifySSL = true;

If certificate verification must be made configurable in the future, add a disable_tls_verify_ field to the RocksDBCloudConfig structure and gate the disablement behind that configuration flag. For now, keep verification enabled by default.

data_store_service_client_closure.h (1)

602-619: Fix std::string_view lifetime issue in ReadClosure by storing owned strings.

The key_ member is stored as non-owning std::string_view but must remain valid throughout the async operation. The public Read() API accepts temporary std::string_view parameters and passes them to the closure, creating a dangling reference risk if the caller's strings go out of scope before the closure completes.

Change std::string_view key_ to std::string key_ (line ~892) and update the assignment at line 614: key_ = std::string(key);. Apply the same fix to table_name_ and other non-owning std::string_view members stored in closures.

♻️ Duplicate comments (4)
eloq_data_store_service/ds_request.proto (1)

141-146: Proto wire-compat break: don’t renumber existing fields

skip_wal/items shifted (3→4 and 4→5). This breaks compatibility. Add shard_id using a new field number; keep existing numbers.

 message BatchWriteRecordsRequest {
   message Item {
     bytes key = 1;
     bytes value = 2;
     WriteOpType op_type = 3;
     uint64 ts = 4;
     uint64 ttl = 5;
   }

   string kv_table_name = 1;
   int32 partition_id = 2;
-  uint32 shard_id = 3;
-  bool skip_wal = 4;
-  repeated Item items = 5;
+  bool skip_wal = 3;          // preserve original number
+  repeated Item items = 4;    // preserve original number
+  uint32 shard_id = 6;        // new field number
 }
data_store_service_client.h (1)

648-653: Fix @param name to match signature

Doc says partition_id; function takes shard_id.

-     * @param partition_id
+     * @param shard_id
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)

173-194: Null out service pointer after shutdown to prevent re-entry/UAF.

After RocksDBDataStoreCommon::Shutdown() (which clears the worker pool), also set data_store_service_ = nullptr locally. This mirrors the guidance for the base RocksDB store and avoids accidental reuse.

 void RocksDBCloudDataStore::Shutdown()
 {
     RocksDBDataStoreCommon::Shutdown();
+    data_store_service_ = nullptr;
@@
     if (db_ != nullptr)
     {

155-163: Stop mutating a const member; normalize object_path before initializing cloud_config_.

Writing via const_cast on a const member is undefined behaviour. Compute a normalized path/copy up front and avoid mutating cloud_config_.

Example fix in this file (plus a small header change to add normalized_object_path_ and use it):

@@
-      cloud_config_(cloud_config),
+      cloud_config_(cloud_config),
       cloud_fs_(),
       cloud_env_(nullptr),
       db_(nullptr)
 {
-    auto *cloud_config_ptr =
-        const_cast<EloqDS::RocksDBCloudConfig *>(&cloud_config_);
-    assert(!cloud_config_ptr->object_path_.empty());
-    if (cloud_config_ptr->object_path_.back() != '/')
-    {
-        cloud_config_ptr->object_path_.append("/");
-    }
-    cloud_config_ptr->object_path_.append("ds_" + std::to_string(shard_id));
+    std::string obj_path = cloud_config.object_path_;
+    assert(!obj_path.empty());
+    if (obj_path.back() != '/') obj_path.push_back('/');
+    normalized_object_path_ = std::move(obj_path) + "ds_" + std::to_string(shard_id);
 }

And use the normalized path when wiring cfs options:

@@
-    cfs_options_.src_bucket.SetObjectPath(cloud_config_.object_path_);
+    cfs_options_.src_bucket.SetObjectPath(normalized_object_path_);
@@
-    cfs_options_.dest_bucket.SetObjectPath(cloud_config_.object_path_);
+    cfs_options_.dest_bucket.SetObjectPath(normalized_object_path_);

Add a private member in rocksdb_cloud_data_store.h:

-    const EloqDS::RocksDBCloudConfig cloud_config_;
+    const EloqDS::RocksDBCloudConfig cloud_config_;
+    std::string normalized_object_path_;
🧹 Nitpick comments (4)
data_store_service_client.cpp (1)

251-255: Avoid assert(false) in production paths

ScheduleTimerTasks is a virtual API; asserting unconditionally may crash debug builds. Prefer returning after logging.

 void DataStoreServiceClient::ScheduleTimerTasks()
 {
-    LOG(ERROR) << "ScheduleTimerTasks not implemented";
-    assert(false);
+    LOG(ERROR) << "ScheduleTimerTasks not implemented";
+    return;
 }
data_store_service_client.h (1)

747-750: Update outdated comment about shard support

Comment says “Now only support one shard” but code supports many shards (arrays size 1000). Refresh the comment to avoid confusion.

-    // Now only support one shard. dss_shards_ caches the index in dss_nodes_ of
-    // shard owner.
+    // dss_shards_[shard_id] caches the index in dss_nodes_ of the shard owner.
eloq_data_store_service/data_store_service.cpp (1)

1014-1019: Remove asserts in RPC error paths.

assert(false); inside handlers will crash the process on reachable error states; return structured errors instead.

-    if (shard_status != DSShardStatus::ReadWrite &&
-        shard_status != DSShardStatus::ReadOnly)
-    {
-        assert(false);
+    if (shard_status != DSShardStatus::ReadWrite &&
+        shard_status != DSShardStatus::ReadOnly)
+    {

Also applies to: 1049-1054

eloq_data_store_service/data_store_service.h (1)

680-694: Optional hygiene: reset shard_id_ on shutdown.

Not required for correctness here, but resetting shard_id_ to UINT32_MAX in DataShard::ShutDown() avoids stale identity and aligns with guard patterns elsewhere.

         void ShutDown()
         {
             if (data_store_ != nullptr)
             {
                 data_store_->Shutdown();
                 data_store_ = nullptr;
             }
             if (scan_iter_cache_ != nullptr)
             {
                 scan_iter_cache_->Clear();
                 scan_iter_cache_ = nullptr;
             }
+            shard_id_ = UINT32_MAX;
         }

Also applies to: 695-700

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c39a22d and 6ff11e0.

📒 Files selected for processing (17)
  • data_store_service_client.cpp (90 hunks)
  • data_store_service_client.h (12 hunks)
  • data_store_service_client_closure.cpp (17 hunks)
  • data_store_service_client_closure.h (34 hunks)
  • data_store_service_scanner.cpp (3 hunks)
  • data_store_service_scanner.h (4 hunks)
  • eloq_data_store_service/CMakeLists.txt (0 hunks)
  • eloq_data_store_service/data_store_service.cpp (45 hunks)
  • eloq_data_store_service/data_store_service.h (14 hunks)
  • eloq_data_store_service/data_store_service_util.h (1 hunks)
  • eloq_data_store_service/ds_request.proto (5 hunks)
  • eloq_data_store_service/internal_request.h (55 hunks)
  • eloq_data_store_service/main.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (5 hunks)
  • eloq_data_store_service/rocksdb_data_store.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.cpp (3 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.h (1 hunks)
💤 Files with no reviewable changes (1)
  • eloq_data_store_service/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (3)
  • eloq_data_store_service/rocksdb_data_store.cpp
  • eloq_data_store_service/rocksdb_data_store_common.h
  • data_store_service_scanner.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • eloq_data_store_service/data_store_service.cpp
  • data_store_service_client_closure.cpp
  • data_store_service_client.cpp
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client_closure.cpp
  • data_store_service_client.cpp
  • data_store_service_client.h
🧬 Code graph analysis (10)
data_store_service_scanner.h (1)
eloq_data_store_service/internal_request.h (9)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (818-826)
  • partition_id_ (647-647)
  • partition_id_ (869-869)
  • partition_id_ (1528-1528)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (3)
eloq_data_store_service/rocksdb_cloud_data_store.h (1)
  • RocksDBCloudDataStore (38-161)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (283-298)
  • Shutdown (283-283)
eloq_data_store_service/rocksdb_data_store_common.cpp (4)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (173-194)
  • Shutdown (173-173)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (91-106)
  • Shutdown (91-91)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
eloq_data_store_service/data_store_service.cpp (3)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (615-620)
  • PrepareShardingError (525-529)
  • IncreaseWriteReqCount (603-607)
  • DecreaseWriteReqCount (609-613)
data_store_service_client.cpp (12)
  • Read (1088-1097)
  • Read (1088-1093)
  • Read (3511-3527)
  • Read (3511-3516)
  • DeleteRange (3560-3582)
  • DeleteRange (3560-3567)
  • ScanNext (3703-3736)
  • ScanNext (3703-3717)
  • ScanClose (3778-3802)
  • ScanClose (3778-3783)
  • BatchWriteRecords (4103-4138)
  • BatchWriteRecords (4103-4116)
eloq_data_store_service/rocksdb_data_store_common.cpp (15)
  • Read (433-494)
  • Read (433-433)
  • key (849-849)
  • DeleteRange (356-431)
  • DeleteRange (356-356)
  • start_key (384-384)
  • start_key (696-696)
  • end_key (399-399)
  • end_key (701-701)
  • ScanNext (752-1062)
  • ScanNext (752-752)
  • ScanClose (1064-1088)
  • ScanClose (1064-1064)
  • BatchWriteRecords (496-658)
  • BatchWriteRecords (496-497)
data_store_service_client_closure.cpp (1)
data_store_service_client.cpp (1)
  • callback_data (2575-2581)
data_store_service_client_closure.h (2)
eloq_data_store_service/internal_request.h (13)
  • partition_id_ (647-647)
  • partition_id_ (869-869)
  • partition_id_ (1528-1528)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (818-826)
  • string_view (988-996)
  • string_view (1123-1131)
  • string_view (1425-1433)
  • string_view (1440-1448)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
data_store_service_client.cpp (3)
data_store_service_client.h (3)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
eloq_data_store_service/data_store_service.cpp (16)
  • BatchWriteRecords (853-901)
  • BatchWriteRecords (853-857)
  • BatchWriteRecords (1118-1183)
  • BatchWriteRecords (1118-1131)
  • Read (399-432)
  • Read (399-402)
  • Read (434-477)
  • Read (434-442)
  • ScanNext (903-959)
  • ScanNext (903-918)
  • ScanNext (961-993)
  • ScanNext (961-964)
  • DeleteRange (569-616)
  • DeleteRange (569-573)
  • DeleteRange (618-671)
  • DeleteRange (618-625)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
data_store_service_client.h (3)
eloq_data_store_service/data_store_service_config.h (2)
  • DataStoreServiceClusterManager (317-513)
  • uint32_t (107-110)
data_store_service_client.cpp (24)
  • FindFreeNodeIndex (3250-3269)
  • FindFreeNodeIndex (3250-3250)
  • InitBucketsInfo (3129-3188)
  • InitBucketsInfo (3129-3133)
  • OnLeaderStart (2969-2986)
  • OnLeaderStart (2969-2970)
  • OnLeaderStop (2988-3002)
  • OnLeaderStop (2988-2988)
  • OnStartFollowing (3011-3061)
  • OnStartFollowing (3011-3015)
  • GetShardIdByPartitionId (3089-3112)
  • GetShardIdByPartitionId (3089-3090)
  • IsLocalShard (3079-3087)
  • IsLocalShard (3079-3079)
  • GetOwnerNodeIndexOfShard (3190-3195)
  • GetOwnerNodeIndexOfShard (3190-3191)
  • GetAllDataShards (3114-3127)
  • GetAllDataShards (3114-3114)
  • UpdateOwnerNodeIndexOfShard (3197-3248)
  • UpdateOwnerNodeIndexOfShard (3197-3198)
  • HandleShardingError (3271-3305)
  • HandleShardingError (3271-3272)
  • UpgradeShardVersion (3307-3355)
  • UpgradeShardVersion (3307-3310)
eloq_data_store_service/data_store_service_config.cpp (4)
  • GetShardIdByPartitionId (802-807)
  • GetShardIdByPartitionId (802-803)
  • HandleShardingError (950-992)
  • HandleShardingError (950-951)
eloq_data_store_service/internal_request.h (2)
eloq_data_store_service/data_store_service_config.h (1)
  • uint32_t (107-110)
data_store_service_client_closure.h (31)
  • Reset (81-85)
  • Reset (144-153)
  • Reset (210-216)
  • Reset (269-274)
  • Reset (385-392)
  • Reset (600-619)
  • Reset (933-948)
  • Reset (1154-1177)
  • Reset (1393-1408)
  • Reset (1598-1624)
  • Reset (1627-1658)
  • Reset (1660-1694)
  • Reset (1933-2251)
  • Reset (2328-2347)
  • Reset (2606-2612)
  • Reset (3002-3010)
  • vector (1078-1081)
  • vector (1121-1128)
  • vector (2253-2256)
  • string (804-809)
  • string (2173-2178)
  • string_view (311-368)
  • string_view (752-755)
  • string_view (767-770)
  • string_view (792-797)
  • string_view (1291-1294)
  • string_view (1533-1536)
  • string_view (2133-2136)
  • string_view (2138-2141)
  • string_view (2241-2246)
  • string_view (2453-2456)
eloq_data_store_service/data_store_service.h (4)
eloq_data_store_service/data_store_service.cpp (12)
  • StartService (242-312)
  • StartService (242-242)
  • Read (399-432)
  • Read (399-402)
  • Read (434-477)
  • Read (434-442)
  • ScanClose (995-1028)
  • ScanClose (995-998)
  • ScanClose (1030-1062)
  • ScanClose (1030-1035)
  • TTLWrapperCache (72-77)
  • TTLWrapperCache (79-88)
data_store_service_client.cpp (6)
  • Read (1088-1097)
  • Read (1088-1093)
  • Read (3511-3527)
  • Read (3511-3516)
  • ScanClose (3778-3802)
  • ScanClose (3778-3783)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
🪛 Clang (14.0.6)
eloq_data_store_service/data_store_service_util.h

[error] 24-24: 'atomic' file not found

(clang-diagnostic-error)

🔇 Additional comments (16)
eloq_data_store_service/data_store_service_util.h (1)

24-29: Clarify the transitive include pattern and verify the atomic compilation issue.

The newly added includes are used by dependent files (data_store_service_scanner.h, data_store_service.h, data_store_service_config.h) but through transitive includes rather than direct usage in data_store_service_util.h itself. Specifically:

  • data_store_service_scanner.h uses <atomic> and appears to rely on util.h to provide it
  • data_store_service.h uses <atomic>, <map>, <set>, and <unordered_map>, but already includes some of these directly

While this transitive include pattern happens to work, it violates the include-what-you-use principle: each file should directly include the headers it uses rather than relying on a util header to provide them.

Additionally, the static analysis error reporting <atomic> not found should be investigated—this is a standard C++ header and should be available.

Recommendations:

  1. Remove these includes from data_store_service_util.h and have each dependent file include what it directly uses
  2. Verify whether the <atomic> compilation error is a real blocker or a false positive in your build environment
data_store_service_client_closure.cpp (5)

72-73: LGTM: Consistent key extraction refactoring.

The change from Key().back() to Key() is applied consistently across all callbacks, indicating a systematic refactoring of the key handling approach.

Also applies to: 96-96


265-265: LGTM: Simplified key handling.

Using key_str directly from the scan result aligns with the key handling refactor throughout the file.


278-287: LGTM: Correct shard-aware routing implementation.

The code correctly derives the shard ID from bucket→partition→shard mapping and propagates it to the continuation scan. This aligns with the multi-shard support objectives.


516-521: LGTM: Shard context preserved across batch continuations.

The code correctly extracts and propagates the shard ID from the closure to the next batch write operation, maintaining shard affinity throughout the batch processing flow.


619-619: LGTM: Comprehensive shard context propagation.

Shard IDs are consistently extracted from closures and propagated to all downstream Read/ScanNext operations across multiple callbacks, ensuring proper shard affinity throughout operation chains.

Also applies to: 688-688, 821-821, 919-919, 1016-1016, 1134-1134, 1241-1241, 1310-1310, 1419-1419, 1445-1445

eloq_data_store_service/rocksdb_data_store_common.cpp (2)

283-298: LGTM: Clean per-shard shutdown implementation.

The method properly cleans up resources with defensive null checks and allows for DataStore reuse by nullifying the worker pool (which will be re-created in Initialize()). The ForceEraseScanIters call ensures scan iterators are cleaned up for this specific shard.
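The close-then-reinitialize pattern described above can be reduced to a small sketch. All names here (WorkerPool, DataShardSketch) are hypothetical stand-ins, not the repository's actual types; the point is only the defensive null check and the reset that enables reuse.

```cpp
#include <cassert>
#include <memory>

// Hypothetical worker pool with an explicit stop.
struct WorkerPool
{
    bool stopped = false;
    void Shutdown() { stopped = true; }
};

// Sketch: Shutdown() releases the pool behind a null check, and
// Initialize() re-creates it, so the same shard object can be closed
// and reopened safely, even if Shutdown() is called twice.
struct DataShardSketch
{
    std::unique_ptr<WorkerPool> worker_pool_;

    void Initialize() { worker_pool_ = std::make_unique<WorkerPool>(); }

    void Shutdown()
    {
        if (worker_pool_)  // defensive: tolerate repeated Shutdown() calls
        {
            worker_pool_->Shutdown();
            worker_pool_.reset();
        }
    }
};
```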


444-455: LGTM: Cleaner BuildKey API.

The refactoring to extract the key first and then pass it directly to BuildKey (instead of passing the entire ReadRequest) results in a cleaner, more explicit API with better separation of concerns.

eloq_data_store_service/main.cpp (1)

321-323: StartService signature change verified as consistent

The signature update has been properly applied: the header declaration, implementation, and call site in main.cpp all align with the new bool create_db_if_missing parameter. No additional call sites or inconsistencies detected.

data_store_service_client_closure.h (7)

499-510: LGTM: Partition ID type change is consistent.

The change from uint32_t to int32_t for partition_id aligns with the broader shard-aware refactoring and is consistent between the method signature and member variable.

Also applies to: 532-532


660-662: LGTM: Shard ID propagation is correct.

The shard_id is properly set in the RPC request (line 660) and correctly used in retry logic (line 688). The direct usage of shard_id_ instead of deriving it from partition_id_ is the correct approach for per-shard routing.

Also applies to: 688-688


1156-1177: LGTM: DeleteRangeClosure properly updated for shard awareness.

The closure correctly:

  • Accepts both partition_id (int32_t) and shard_id (uint32_t) in Reset()
  • Populates shard_id in the RPC request
  • Uses shard_id_ directly in retry logic
  • Exposes public accessors for both IDs

Also applies to: 1200-1200, 1228-1228, 1296-1304


1605-1606: LGTM: BatchWriteRecordsClosure shard-aware updates are complete.

All Reset() overloads properly accept and store both partition_id (int32_t) and shard_id (uint32_t). The RPC request population and retry logic correctly use shard_id_. Public accessors are properly exposed.

Also applies to: 1629-1694, 1777-1777, 1850-1858


1933-1982: LGTM: ScanNextClosure correctly implements shard-aware routing.

The closure properly:

  • Accepts and stores both partition_id (int32_t) and shard_id (uint32_t)
  • Sets shard_id in the RPC request
  • Uses shard_id_ in owner node updates during retry
  • Provides public accessors with correct return types

Also applies to: 2008-2008, 2055-2055, 2123-2131


2930-2949: LGTM: FetchArchivesCallbackData partition_id type updated consistently.

The constructor parameter and member variable are both changed to int32_t, maintaining consistency with the shard-aware refactoring.


933-977: LGTM: FlushDataClosure correctly handles multi-shard operations.

The closure uses a vector of shard_ids_ and processes them sequentially. The RPC request correctly sets the current shard_id (line 974), and retry logic properly uses it for owner node updates (line 998-999).

Also applies to: 998-999


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
data_store_service_client.cpp (1)

2969-3060: Fix misleading comment in OnLeaderStop.

The comment at line 2995 says "swith to read only in case of data store status is read write," but CloseDataStore() actually performs a full close transition (ReadWrite → ReadOnly → Closed). The shard lifecycle logic is sound and correctly binds ng_id to shard operations when bind_data_shard_with_ng_ is enabled. The 1:1 ng_id/shard_id mapping is implicit and validated by the assert at line 3034 in OnStartFollowing.

Update line 2995 comment to accurately describe the full close operation:

// Close the data store shard (transitions ReadWrite → ReadOnly → Closed)
🧹 Nitpick comments (2)
data_store_service_client.cpp (2)

3101-3108: Redundant check after assertion.

Line 3102 asserts that the bucket exists, then line 3103 checks the same condition. The assert aborts in debug builds while the if silently guards release builds, so pick one behavior: remove the assert if you want graceful error handling, or remove the if check if the invariant is guaranteed (the bucket_infos_ map always contains valid bucket_ids).

Current pattern:

auto it = bucket_infos_.find(bucket_id);
assert(it != bucket_infos_.end());  // Will abort in debug if false
if (it != bucket_infos_.end())      // Redundant given the assert
{
    ...
}

Apply this diff to remove the redundant check:

     auto it = bucket_infos_.find(bucket_id);
     assert(it != bucket_infos_.end());
-    if (it != bucket_infos_.end())
-    {
-        uint32_t shard_id = it->second->BucketOwner();
-        assert(dss_shard_ids_.find(shard_id) != dss_shard_ids_.end());
-        return shard_id;
-    }
+    uint32_t shard_id = it->second->BucketOwner();
+    assert(dss_shard_ids_.find(shard_id) != dss_shard_ids_.end());
+    return shard_id;
-    LOG(ERROR) << "Bucket not found for partition_id=" << partition_id
-               << " (bucket_id=" << bucket_id << ")";
-    return UINT32_MAX;

3193-3197: Unreachable return after LOG(FATAL).

LOG(FATAL) terminates the process, making the subsequent return UINT32_MAX unreachable. Either change to LOG(ERROR) if you want to return an error value, or remove the return statement if process termination is intended.

If graceful error handling is desired:

     if (shard_id >= dss_shards_.size())
     {
-        LOG(FATAL) << "shard_id " << shard_id << " exceeds array bounds";
+        LOG(ERROR) << "shard_id " << shard_id << " exceeds array bounds";
         return UINT32_MAX;
     }

Or if process termination is intended, remove the unreachable return:

     if (shard_id >= dss_shards_.size())
     {
         LOG(FATAL) << "shard_id " << shard_id << " exceeds array bounds";
-        return UINT32_MAX;
     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff11e0 and 33a62c9.

📒 Files selected for processing (2)
  • data_store_service_client.cpp (89 hunks)
  • eloq_data_store_service/internal_request.h (55 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • eloq_data_store_service/internal_request.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client.cpp
🧬 Code graph analysis (1)
data_store_service_client.cpp (2)
data_store_service_client.h (5)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
  • Connect (154-159)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (5)
data_store_service_client.cpp (5)

3129-3188: LGTM: Thread-safe deterministic bucket assignment.

The implementation correctly uses std::mt19937 with deterministic seeding to generate consistent bucket-to-shard mappings. The consistent hashing approach with 64 virtual nodes per shard provides good load distribution, and the collision handling ensures each virtual node position is unique.
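The scheme described above can be sketched roughly as follows; the per-shard seeding choice, the virtual-node count, and the linear-probe collision handling are illustrative assumptions rather than the repository's exact code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <random>
#include <vector>

// Sketch: place kVirtualNodes pseudo-random virtual nodes per shard on a
// ring, then assign each bucket to the shard owning the first virtual
// node at or after the bucket's position (wrapping around).
std::map<uint16_t, uint32_t> BuildRing(const std::vector<uint32_t> &shards)
{
    constexpr int kVirtualNodes = 64;
    std::map<uint16_t, uint32_t> ring;
    for (uint32_t shard : shards)
    {
        std::mt19937 gen(shard);  // deterministic seed per shard (assumption)
        std::uniform_int_distribution<uint32_t> dist(0, UINT16_MAX - 1);
        for (int i = 0; i < kVirtualNodes; ++i)
        {
            uint16_t pos = static_cast<uint16_t>(dist(gen));
            while (ring.count(pos) != 0)  // collision: probe the next slot
            {
                pos = static_cast<uint16_t>((pos + 1) % UINT16_MAX);
            }
            ring[pos] = shard;
        }
    }
    return ring;
}

uint32_t BucketOwner(const std::map<uint16_t, uint32_t> &ring,
                     uint16_t bucket_id)
{
    auto it = ring.lower_bound(bucket_id);
    if (it == ring.end())
    {
        it = ring.begin();  // wrap around the ring
    }
    return it->second;
}
```

Determinism here comes from seeding each shard's generator with its own id, so every node that runs the same code over the same shard set derives the same mapping without coordination.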


3989-4003: LGTM: Proper error handling for Sequences table.

Good improvement over the previous assert(false). The code now gracefully returns false with a clear error message if the Sequences table is not hash-partitioned, allowing callers to handle the failure appropriately.


2368-2368: LGTM: Consistent shard routing in archive copy operation.

The CopyBaseToArchive method correctly propagates the is_range_partitioned flag when computing both kv_part_id and shard_id for reading base table records. The concurrent read management with MAX_FLYING_READ_COUNT throttling prevents overwhelming the system while copying data.

Also applies to: 2402-2415


3114-3127: LGTM: Thread-safe shard enumeration with proper locking.

The method now correctly acquires a shared_lock before iterating over dss_shard_ids_, addressing the concurrency concern from the previous review. The lock is held only during the copy operation, and the returned vector provides a consistent snapshot of active shard IDs.
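A minimal sketch of the locked-snapshot pattern, with hypothetical member names standing in for dss_shard_ids_ and its mutex: hold a shared lock only while copying, so readers get a consistent snapshot and never observe a partially updated set.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>
#include <shared_mutex>
#include <vector>

class ShardRegistry
{
public:
    // Copy out a consistent snapshot under a shared (reader) lock.
    std::vector<uint32_t> GetAllDataShards() const
    {
        std::shared_lock<std::shared_mutex> lk(mutex_);
        return {shard_ids_.begin(), shard_ids_.end()};
    }

    // Mutations take the exclusive (writer) lock.
    void AddShard(uint32_t shard_id)
    {
        std::unique_lock<std::shared_mutex> lk(mutex_);
        shard_ids_.insert(shard_id);
    }

private:
    mutable std::shared_mutex mutex_;
    std::set<uint32_t> shard_ids_;
};
```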


419-420: Based on an analysis of all 42 calls to GetShardIdByPartitionId, the verification result is:

Analysis Complete: No Critical Issues Found

The is_range_partition parameter is consistently and correctly used across all call sites:

  1. Catalog operations (partition_id=0): All 10 calls correctly pass false (partition 0 is always hash-partitioned)

    • Lines 1543, 1589, 1643, 1700, 1748, 1773, 4323, 4548, 4595, and one additional
  2. Catalog/KvPartitionIdOf pattern: All 14 calls using KvPartitionIdOf(ccm_table_name) correctly pass false

    • Lines 597, 627, 669, 819, 940, 988, 1385, 3843, 3896, 4052, and others
  3. Range partition operations: All 2 explicit calls correctly pass true

    • Line 1052: KvPartitionIdOf(partition_id, true) with GetShardIdByPartitionId(..., true)
    • Line 1148: KvPartitionIdOf(range_partition_id, true) with GetShardIdByPartitionId(..., true)
  4. Schema-derived operations: All 4 calls correctly derive from !IsHashPartitioned()

    • Lines 3387-3390, 3500-3503: Match pattern for both KvPartitionIdOf and GetShardIdByPartitionId calls
    • Lines 420, 2406: Compute consistency correctly
  5. Hash archive/KvPartitionIdOf(table_name) implicit false: All 10+ calls correctly pass false

    • Lines 2572, 2683, 2779, 2781, 2818, 3997, 4325, and others

The parameter consistently aligns with the partition type being passed—catalog operations and hash partitions use false, while range partitions use true. The pattern is deterministic and secure against routing errors.

@lzxddz lzxddz force-pushed the dss_multi_shard-rebase branch from 33a62c9 to deb72ed Compare November 17, 2025 06:42

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
data_store_service_client.cpp (1)

2540-2751: Deprecated archive APIs now log instead of asserting but still perform work

FetchArchives(...) and FetchVisibleArchive(...) (the table-schema-based variants) now log warnings that they “should not be used (noop)” instead of asserting, but they still execute full ScanNext/Decode logic and return real data. If these paths are truly deprecated in favor of ccmap, consider either:

  • Returning early after the warning, or
  • Updating the comments/log messages to reflect that they are deprecated but still functional.
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1)

48-84: Guard against empty object_path_ before calling .back()

Calling std::string::back() on an empty string is undefined behavior in C++17. At lines 50–53, the code calls object_path_.back() without checking if the string is empty:

if (shard_cloud_config.object_path_.back() != '/')
{
    shard_cloud_config.object_path_.append("/");
}

Even if current configurations always provide a non-empty path, this is a brittle assumption. Add a guard:

-        if (shard_cloud_config.object_path_.back() != '/')
-        {
-            shard_cloud_config.object_path_.append("/");
-        }
+        if (!shard_cloud_config.object_path_.empty() &&
+            shard_cloud_config.object_path_.back() != '/')
+        {
+            shard_cloud_config.object_path_.push_back('/');
+        }
+        else if (shard_cloud_config.object_path_.empty())
+        {
+            shard_cloud_config.object_path_ = "/";
+        }
🧹 Nitpick comments (1)
data_store_service_client.cpp (1)

32-83: Topology setup and bootstrap gating now behave sanely

Including <random>, replacing the old fatal assert(false) in SetupConfig() with version-checked, idempotent shard-upgrade logic, and short-circuiting Connect() when need_bootstrap_ is false together fix the prior crash-on-start behavior and make topology upgrades retryable. The version comparison uses <=, so redundant UpgradeShardVersion calls can occur at equal versions; they are cheap and guarded, but switching the condition to < new_version would avoid the duplicate logs.

Also applies to: 109-149, 222-242
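The comparison-direction point above can be shown in a tiny sketch; the struct and method names are hypothetical, only the `<` vs `<=` behavior is the subject.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: a version-gated, idempotent upgrade. With `<`, an equal
// version is a no-op; with `<=`, the (guarded but noisy) upgrade path
// would re-run at every equal-version notification.
struct ShardTopology
{
    uint64_t version = 0;
    int upgrade_calls = 0;

    bool TryUpgrade(uint64_t new_version)
    {
        if (version < new_version)  // `<` skips redundant equal-version work
        {
            ++upgrade_calls;
            version = new_version;
            return true;
        }
        return false;
    }
};
```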

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33a62c9 and deb72ed.

📒 Files selected for processing (4)
  • data_store_service_client.cpp (101 hunks)
  • eloq_data_store_service/internal_request.h (55 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (5 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client.cpp
🧬 Code graph analysis (4)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (3)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (283-298)
  • Shutdown (283-283)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (91-106)
  • Shutdown (91-91)
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (2)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • RocksDBCloudDataStore (138-155)
  • RocksDBCloudDataStore (157-163)
eloq_data_store_service/rocksdb_cloud_data_store.h (1)
  • RocksDBCloudDataStore (38-161)
eloq_data_store_service/internal_request.h (3)
eloq_data_store_service/data_store_service_config.h (2)
  • uint32_t (107-110)
  • EloqDS (98-514)
data_store_service_client_closure.h (30)
  • Reset (81-85)
  • Reset (144-153)
  • Reset (210-216)
  • Reset (269-274)
  • Reset (385-392)
  • Reset (600-619)
  • Reset (933-948)
  • Reset (1154-1177)
  • Reset (1393-1408)
  • Reset (1598-1624)
  • Reset (1627-1658)
  • Reset (1660-1694)
  • Reset (1933-2251)
  • Reset (2328-2347)
  • Reset (2606-2612)
  • Reset (3002-3010)
  • string (804-809)
  • string (2173-2178)
  • string_view (311-368)
  • string_view (752-755)
  • string_view (767-770)
  • string_view (792-797)
  • string_view (1291-1294)
  • string_view (1533-1536)
  • string_view (2133-2136)
  • string_view (2138-2141)
  • string_view (2241-2246)
  • string_view (2453-2456)
  • EloqDS (39-42)
  • EloqDS (44-184)
eloq_data_store_service/data_store.h (1)
  • EloqDS (28-145)
data_store_service_client.cpp (2)
data_store_service_client.h (4)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (20)
eloq_data_store_service/internal_request.h (7)

36-69: Shard-aware write request interfaces look consistent

WriteRecordsRequest and its RPC/local implementations now carry uint32_t shard_id explicitly and use it in SetFinish() via DecreaseWriteReqCount(...). Types (int32_t for partition_id, uint32_t for shard_id) and initialisation to UINT32_MAX are consistent across Clear/Reset paths; no functional issues spotted here.

Also applies to: 70-179


329-469: FlushData shard wiring is coherent

FlushDataRequest now exposes GetShardId(), with RPC mapping from req_->shard_id() and local requests storing shard_id_ (reset to UINT32_MAX in Clear()) and using it for DecreaseWriteReqCount. This matches the client changes that fan out flushes per shard.


471-654: Partition/signature fixes and shard_id on DeleteRange are correct

DeleteRangeRequest now uses int32_t GetPartitionId() and adds GetShardId(), with both RPC and local implementations returning the proto/local fields and SetFinish() decrementing per-shard write counters. This addresses previous signed/unsigned mismatches while keeping the API uniform across read/write/delete.


663-863: ReadRequest and implementations: key/partition/shard accessors align with client usage

The base ReadRequest now exposes GetKey(), GetPartitionId() (int32_t), and GetShardId() (uint32_t), with RPC/Local variants mapping cleanly to proto fields and local members. Lifetime of the std::string_view members is unchanged from previous design; shard_id defaults and resets are consistent with other request types.


865-1134: Create/Drop table shard propagation matches new client DropTable/FlushData flows

Adding GetShardId() to CreateTableRequest/DropTableRequest and wiring shard_id through RPC/local variants (including DecreaseWriteReqCount(shard_id) in SetFinish) is consistent with the per-shard Drop/Flush logic in the client. The use of UINT32_MAX as an invalid sentinel is uniform across request types.


1136-1518: ScanRequest: partition_id type and shard_id addition align with sharded scanners

ScanRequest now uses int32_t GetPartitionId() and adds GetShardId(), with RPC/local implementations correctly mapping proto and local state. This matches the new ScanNext(..., partition_id, shard_id, ...) client signature and should keep routing correct for both hash and range scans.


1520-1682: Snapshot-for-backup shard plumbing is well-integrated

CreateSnapshotForBackupRequest’s new GetShardId() (and the RPC/local implementations) allow the client’s per-shard snapshot fanout to address the right data store instance and decrement per-shard write counters in SetFinish. The default shard_id_{UINT32_MAX} and Clear/Reset logic are consistent with other request types.

data_store_service_client.cpp (9)

151-210: TxConfigsToDssClusterConfig: mapping NG configs to shards is robust

The function now builds a nodes_map from ng member configs, adds only candidate nodes as shard members, updates the primary from ng_leaders (with an existence check), and sets this_node/read-write status when the local node is a candidate leader. This addresses earlier concerns about redundant leader_id == node_id checks and unguarded nodes_map.at(...) access. Behavior looks correct given the provided cluster-manager API.


591-686: Metadata/catalog/statistics operations correctly include shard_id

FetchTableCatalog, Fetch(Current)TableStatistics, UpsertTableStatistics, and the associated DeleteRange cleanup now compute kv_partition_id via KvPartitionIdOf and derive data_shard_id via GetShardIdByPartitionId(..., /*is_range=*/false). All subsequent Read/Scan/BatchWrite/DeleteRange calls pass the same partition/shard pair, aligning metadata access with the new sharding scheme.

Also applies to: 742-921


1318-1337: CopyBaseToArchive: new shard computations and logging are reasonable

CopyBaseToArchive now derives kv_part_id and shard_id per base record using the is_range_partitioned flag, and uses those to call the sharded Read(...). The extra LOG(WARNING) when seeing EloqKv for MVCC is helpful, and the batch_size accounting (key+value sizes) correctly feeds into the constructed FlushTaskEntry. Flow control (MAX_FLYING_READ_COUNT and condition variable) remains unchanged.

Also applies to: 2352-2537


2855-2903: Shard-scoped snapshot and lifecycle hooks use the new IsLocalShard/AllDataShards helpers correctly

CreateSnapshotForBackup now calls GetAllDataShards() and fans out snapshot creation per shard, with CreateSnapshotForBackupInternal dispatching local vs remote based on IsLocalShard(shard_id) and using GetOwnerNodeIndexOfShard for RPC routing. Leader/follower transitions (OnLeaderStart, OnLeaderStop, OnStartFollowing) respect bind_data_shard_with_ng_ and open/close the appropriate data store shard via ng_id. This is aligned with the per-shard DataStore constructor changes.

Also applies to: 2969-3063, 3080-3088


3193-3203: GetOwnerNodeIndexOfShard: new FATAL guard is appropriate for out-of-range ids

The added bounds check (shard_id >= dss_shards_.size()) with LOG(FATAL) prevents undefined access to dss_shards_ in release builds, which is critical for RPC routing. This is an improvement over relying solely on assertions.


3363-3402: Core Read/Delete/Scan/BatchWrite call paths are cleanly shard-aware

The main data-plane methods (FetchRecord, FetchSnapshot, Read, DeleteRange, ScanNext, ScanClose, BatchWriteRecords and their *_Internal variants) consistently:

  • Compute kv_partition_id from logical partition_id with the appropriate is_range_partition flag.
  • Derive shard_id via GetShardIdByPartitionId.
  • Use IsLocalShard(shard_id) to choose between local data_store_service_ calls and RPC via GetOwnerNodeIndexOfShard(shard_id) and dss_nodes_[node_index].Channel().

This wiring matches the updated closure/request interfaces and should give correct per-shard routing across all operations.

Also applies to: 3485-3515, 3517-3533, 3535-3564, 3566-3620, 3710-3742, 3744-3782, 3784-3838
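The dispatch shape summarized above reduces to a hedged sketch; the types, the string labels, and the ownership model are illustrative assumptions, not the repository's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Sketch: serve a request locally when this node owns the shard,
// otherwise route it to the owner node's channel; out-of-range shard
// ids hit an explicit bounds guard rather than undefined access.
struct ShardRouter
{
    std::set<uint32_t> local_shards;       // shards this node serves
    std::vector<uint32_t> owner_node_idx;  // shard_id -> owning node index

    bool IsLocalShard(uint32_t shard_id) const
    {
        return local_shards.count(shard_id) != 0;
    }

    // Returns "local", "rpc:<node>", or "error" for an out-of-range shard.
    std::string Dispatch(uint32_t shard_id) const
    {
        if (shard_id >= owner_node_idx.size())
        {
            return "error";  // mirrors the bounds guard the review calls out
        }
        if (IsLocalShard(shard_id))
        {
            return "local";  // call the in-process data store directly
        }
        return "rpc:" + std::to_string(owner_node_idx[shard_id]);
    }
};
```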


3840-3999: Catalog/range/statistics maintenance updated to use shard_id consistently

InitTableRanges, DeleteTableRanges, InitTableLastRangePartitionId, DeleteTableStatistics, UpsertCatalog, and DeleteCatalog all now:

  • Compute a KV partition id (often 0 for global catalogs).
  • Use GetShardIdByPartitionId(partition_id or kv_partition_id, false) to choose the data shard.
  • Pass both partition and shard into BatchWriteRecords/DeleteRange.

That keeps system catalog and statistics maintenance aligned with the general sharding scheme.

Also applies to: 4051-4105, 4107-4615


286-471: PutAll: partition→shard routing is consistent for hash and range tables

All BatchWriteRecords call sites throughout data_store_service_client.cpp pass both partition_id and shard_id parameters. Hash and range partitions are routed consistently via KvPartitionIdOf(..., false/true) → partition id, and GetShardIdByPartitionId(partition_id, is_range_partitioned) → shard id. The same partition/shard pair threads through BatchWriteRecords and subsequent callbacks.


934-1166: Range tables: shard routing pattern verified as coherent

All five functions consistently apply the shard routing pattern:

  1. System table operations (FetchTableRanges, FetchRangeSlices, UpdateRangeSlices): Use KvPartitionIdOf(table_name) with GetShardIdByPartitionId(..., false), routing via MapHashPartitionIdToBucketId.

  2. Range partition operations (DeleteOutOfRangeData, LoadRangeSlice): Use KvPartitionIdOf(partition_id, true) with GetShardIdByPartitionId(..., true), routing via MapRangeIdToBucketId.

  3. Caller (UpsertRanges): Delegates to UpdateRangeSlices, maintaining consistency.

The GetShardIdByPartitionId implementation correctly interprets the is_range_partition flag to select the appropriate mapping function, and both KvPartitionIdOf overloads properly forward their inputs for consistent downstream routing.
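A minimal sketch of how such a flag-selected mapping could compose; the bucket count, the modulo scheme, and the range-id folding are assumptions for illustration only, not the actual MapHashPartitionIdToBucketId / MapRangeIdToBucketId implementations.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

constexpr uint16_t kBucketCount = 16384;  // illustrative ring size

uint16_t MapHashPartitionIdToBucketId(int32_t partition_id)
{
    return static_cast<uint16_t>(partition_id % kBucketCount);
}

uint16_t MapRangeIdToBucketId(int32_t range_partition_id)
{
    // Assumption: range ids are folded into a distinct region of the ring.
    return static_cast<uint16_t>((range_partition_id + kBucketCount / 2) %
                                 kBucketCount);
}

// The is_range_partition flag selects the mapping; the bucket owner
// table then yields the shard, with a sentinel on lookup failure.
uint32_t GetShardIdByPartitionId(
    const std::unordered_map<uint16_t, uint32_t> &bucket_owner,
    int32_t partition_id,
    bool is_range_partition)
{
    uint16_t bucket_id = is_range_partition
                             ? MapRangeIdToBucketId(partition_id)
                             : MapHashPartitionIdToBucketId(partition_id);
    auto it = bucket_owner.find(bucket_id);
    return it != bucket_owner.end() ? it->second : UINT32_MAX;
}
```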

eloq_data_store_service/rocksdb_cloud_data_store.cpp (4)

159-159: LGTM: Destructor condition simplified consistently.

The removal of data_store_service_ from the shutdown condition aligns with the pattern in RocksDBDataStore (see rocksdb_data_store.cpp:68-82). The service pointer is not owned by this class and doesn't require cleanup in the destructor.


167-185: Shutdown flow properly refactored with parent delegation.

The call to RocksDBDataStoreCommon::Shutdown() at line 167 correctly delegates query_worker_pool_ cleanup (which includes nulling) and ForceEraseScanIters invocation. The subsequent db-specific cleanup properly nulls all owned resources under lock.

Note: The parent's Shutdown() doesn't null data_store_service_ after calling ForceEraseScanIters. While data_store_service_ isn't deleted (it's not owned), nulling it would provide extra defense against use-after-free if Shutdown() is called multiple times or the object is reused.

Based on learnings


355-377: Good timing instrumentation improvement.

Using steady_clock for the NewCloudFileSystem timing measurement is the correct choice—it's monotonic and unaffected by system clock adjustments. The LOG(INFO) ensures timing is visible in production for observability.


632-655: Consistent timing improvements for DB open operation.

The switch to steady_clock and LOG(INFO) is consistent with the NewCloudFileSystem timing changes. This provides reliable timing measurements unaffected by system clock changes and ensures visibility in production logs.


@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
data_store_service_client.cpp (1)

3131-3190: Add precondition check for empty node_groups to prevent undefined behavior.

If node_groups is empty, rand_num_to_ng remains empty after the generation loop. Lines 3180-3182 will then dereference rand_num_to_ng.begin()->second on an empty map when it == rand_num_to_ng.end(), causing undefined behavior.

Apply this diff to add the missing guard:

 void DataStoreServiceClient::InitBucketsInfo(
     const std::set<uint32_t> &node_groups,
     uint64_t version,
     std::unordered_map<uint16_t, std::unique_ptr<txservice::BucketInfo>>
         &ng_bucket_infos)
 {
+    if (node_groups.empty())
+    {
+        LOG(ERROR) << "InitBucketsInfo called with empty node_groups";
+        ng_bucket_infos.clear();
+        return;
+    }
+
     ng_bucket_infos.clear();
     std::map<uint16_t, uint32_t> rand_num_to_ng;

Based on learnings (bucket_infos_ immutability was clarified in PR #80).

🧹 Nitpick comments (3)
data_store_service_client.cpp (3)

1809-1821: Clarify deprecation status or remove implementation.

DropKvTableAsync logs that it "should not be used (noop)" but continues to execute the full async drop logic. If this method is truly deprecated and shouldn't be called:

  • Remove the implementation body and return early after the warning, or
  • Remove the method entirely

If it may still be invoked, the warning is misleading.

 void DataStoreServiceClient::DropKvTableAsync(const std::string &kv_table_name)
 {
-    // FIXME(lzx): this function may not be used now, delete it.
     LOG(WARNING) << "DropKvTableAsync should not be used (noop)";
-
-    AsyncDropTableCallbackData *callback_data =
-        new AsyncDropTableCallbackData();
-    callback_data->kv_table_name_ = kv_table_name;
-    DropTable(std::string_view(callback_data->kv_table_name_.data(),
-                               callback_data->kv_table_name_.size()),
-              callback_data,
-              &AsyncDropTableCallback);
+    return;  // Early return if truly deprecated
 }

Also applies to: 2554-2644, 2664-2751


2459-2488: Consolidate redundant EloqKv MVCC warnings.

Lines 2462 and 2486-2487 both log the same warning "EloqKv engine not support mvcc feature" within the same conditional branch. This creates redundant log spam. Consider consolidating to a single warning or removing if MVCC with EloqKv is truly unsupported and this code path should not execute.

                 if (table_name.Engine() == txservice::TableEngine::EloqKv)
                 {
-                    // mvcc is not used for EloqKV
                     LOG(WARNING) << "EloqKv engine not support mvcc feature";
                     txservice::TxObject *tx_object =
                         static_cast<txservice::TxObject *>(record.get());
                     record = tx_object->DeserializeObject(val.data(), offset);
                 }
                 // ... later in same branch ...
                 if (table_name.Engine() == txservice::TableEngine::EloqKv)
                 {
-                    // should not be here
-                    LOG(WARNING)
-                        << "EloqKv engine not support mvcc feature";
                     ref.SetNonVersionedPayload(record.get());
                 }

Also applies to: 2631, 2742


2972-3062: Document the ng_id-to-shard_id mapping assumption.

OnLeaderStart, OnLeaderStop, and OnStartFollowing now accept ng_id parameters and directly treat them as dss_shard_id (line 3031). This assumes a 1:1 mapping between node groups and data store shards. Consider adding a comment documenting this invariant, especially at line 3031 where the assignment occurs.

     // Treat leader_node_id as dss_leader_node_id
     uint32_t dss_leader_node_id = leader_node_id;
+    // In multi-shard mode, node group ID maps 1:1 to data store shard ID
     uint32_t dss_shard_id = ng_id;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between deb72ed and a669993.

📒 Files selected for processing (1)
  • data_store_service_client.cpp (101 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client.cpp
🧬 Code graph analysis (1)
data_store_service_client.cpp (2)
data_store_service_client.h (4)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (1)
data_store_service_client.cpp (1)

3089-3114: LGTM: Shard routing logic is correct and handles edge cases.

GetShardIdByPartitionId correctly maps partition IDs to shard IDs through bucket ownership. The method:

  • Distinguishes range vs. hash partitions (lines 3093-3101)
  • Maps partition → bucket → shard using the bucket_infos_ map
  • Returns UINT32_MAX with error logging for missing buckets (lines 3111-3113)
  • Uses asserts for debug verification while handling errors gracefully in release builds

The consistent propagation of shard_id throughout all read/write/scan operations (lines 418-419, 596, 625-626, etc.) correctly routes requests to the appropriate shards.

Based on learnings (bucket_infos_ is immutable after initialization per PR #80).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
data_store_service_client.cpp (3)

3320-3329: Replace LOG(FATAL) with error handling for missing shard.

LOG(FATAL) at line 3327 will crash the process if shard_id is not found. The return statement is unreachable. Consider using LOG(ERROR) and returning false to indicate failure, allowing callers to handle the error appropriately.

     if (shard_id >= dss_shards_.size())
     {
-        LOG(FATAL) << "Shard id not found, shard_id: " << shard_id;
-        return true;
+        LOG(ERROR) << "Shard id not found, shard_id: " << shard_id;
+        return false;
     }

2554-2563: Confirm warnings are contradictory—functions actively used but marked "should not be used."

The review comment is valid. Both FetchArchives (line 2554) and FetchVisibleArchive (line 2664) log warnings stating they "should not be used" and "(noop)", yet both functions execute full implementations and are actively called:

  • FetchArchives is invoked at line 3392 when only_fetch_archives_ is true
  • FetchVisibleArchive is invoked at line 3505 under the same condition
  • Both are also called from data_store_service_client_closure.cpp
  • Both have corresponding implementations in handler classes (DynamoHandler, BigTableHandler, RocksDBHandler)

The "(noop)" label is misleading—these functions perform substantial work (encoding, scanning, deserialization). The TODO at line 2673 ("Remove this function if not needed") reinforces that this inconsistency was known.

Recommendation: Either remove the warnings and clarify these functions handle archive-only fetch paths, or add early returns after the warnings if they're truly deprecated.


3312-3318: Implement proper topology change handling instead of LOG(FATAL) crash.

Line 3315 LOG(FATAL) crashes the process when cluster topology undergoes a full node group change. The else branch should implement topology update logic similar to DataStoreServiceClusterManager::HandleShardingError (lines 972-990 in eloq_data_store_service/data_store_service_config.cpp), which parses the new cluster configuration and applies it via an Update() call. Callers in the closure handlers (multiple locations) expect the function to return normally so they can proceed with retry logic, not terminate the process.


📥 Commits

Reviewing files that changed from the base of the PR and between a669993 and d6e88d6.

📒 Files selected for processing (1)
  • data_store_service_client.cpp (101 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
data_store_service_client.cpp (2)
data_store_service_client.h (4)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (5)
data_store_service_client.cpp (5)

3139-3144: Good defensive programming: empty node_groups check.

The precondition check for empty node_groups is good defensive programming, even though a past review comment from the developer stated it should never be empty. This guard prevents undefined behavior if the invariant is violated and provides a clear error message.


3153-3174: LGTM: Thread-safe deterministic RNG.

The use of std::mt19937 with deterministic seeding (line 3154) and std::uniform_int_distribution (lines 3155-3156) properly addresses the thread-safety concern from previous reviews. The local RNG instance per node group ensures no shared state, and deterministic seeding guarantees consistent bucket-to-shard mapping across restarts.


3117-3130: LGTM: Thread-safe access with shared_lock.

The addition of std::shared_lock<std::shared_mutex> (line 3121) properly protects concurrent read access to dss_shard_ids_, addressing the thread-safety concern from previous reviews. The lock is held only during the copy operation, keeping the critical section minimal.


3998-4011: LGTM: Proper error handling replaces assert.

The conditional check for hash partitioning (lines 4001-4011) with error logging and return false properly addresses the previous review concern about assert(false). The code now fails gracefully if the Sequences table is unexpectedly not hash-partitioned, allowing callers to handle the error.


2973-3063: LGTM: Shard lifecycle management correctly gated.

The bind_data_shard_with_ng_ flag checks (lines 2976-2979, 2994-2997, 3020-3023) properly gate shard-specific operations. When binding is disabled, the methods return early, avoiding unnecessary work. The logic for opening/closing data stores and updating cluster manager state on leadership changes looks correct.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
data_store_service_client.cpp (1)

2561-2563: FetchArchives/FetchVisibleArchive marked as deprecated but still functional.

Both methods log warnings indicating they "should not be used because all archive versions are fetched from ccmap (noop)," yet still implement full functionality with shard routing. Consider removing the implementations if they are truly unused, or remove the warning if they remain necessary.

Also applies to: 2674-2676


📥 Commits

Reviewing files that changed from the base of the PR and between d6e88d6 and dea26cb.

📒 Files selected for processing (2)
  • data_store_service_client.cpp (101 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (2)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • RocksDBCloudDataStore (138-155)
  • RocksDBCloudDataStore (157-163)
eloq_data_store_service/rocksdb_cloud_data_store.h (1)
  • RocksDBCloudDataStore (38-161)
data_store_service_client.cpp (2)
data_store_service_client.h (5)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
  • Connect (154-159)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (15)
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1)

54-68: LGTM! Per-shard object path construction is correct.

The per-shard cloud config setup properly handles empty paths, trailing slashes, and constructs unique shard paths. The constructor call matches the updated signature with shard_id and data_store_service parameters.

data_store_service_client.cpp (14)

32-32: LGTM! Required header for thread-safe RNG.

The <random> header is necessary for std::mt19937 and std::uniform_int_distribution used in InitBucketsInfo for thread-safe random number generation.


151-211: LGTM! TxConfigsToDssClusterConfig correctly refactored for shard topology.

The method now properly:

  • Guards nodes_map.at(leader_id) access with existence check (lines 189-190)
  • Consolidates the duplicate leader_id == node_id checks into a single block (lines 195-200)
  • Sets up cluster manager with shard membership, primary nodes, and this node configuration

224-227: LGTM! Bootstrap guard prevents unnecessary initialization.

The need_bootstrap_ check allows Connect to skip initialization when bootstrap is not required, which is appropriate for shard-aware configurations.


251-254: LGTM! Replaced fatal assert with warning.

The method now logs a warning and continues instead of crashing, consistent with the PR's pattern of converting fatal asserts to graceful no-ops.


404-429: LGTM! PutAll correctly routes batches to shards.

The partition-level batching properly computes shard_id using GetShardIdByPartitionId with the appropriate is_range_partitioned flag, ensuring correct shard routing for each partition's writes.


596-604: LGTM! Consistent shard routing across all metadata and data operations.

All methods now correctly:

  • Compute shard_id via GetShardIdByPartitionId(partition_id, is_range_partition)
  • Pass shard_id to underlying Read/Scan/BatchWrite/DeleteRange operations
  • Distinguish hash vs. range partitioning appropriately

The shard routing pattern is consistent and correctly propagated throughout the codebase.

Also applies to: 625-634, 667-675, 938-960, 986-1005, 1050-1061, 1146-1163, 818-823, 842-843, 872-873, 906-907, 1384-1385, 1407-1408, 1447-1448, 1542-1549, 1588-1589, 1642-1643, 1699-1700, 1747-1754, 1772-1773, 2367-2413, 2142-2145, 2192-2193, 2211-2211, 2288-2289


2973-3063: LGTM! Leadership event handlers now support per-shard operations.

The methods correctly:

  • Accept ng_id parameter for shard identification
  • Guard shard operations with bind_data_shard_with_ng_ flag
  • Open/close data stores per shard during leadership transitions
  • Update cluster topology and refresh config in OnStartFollowing

3090-3115: LGTM! GetShardIdByPartitionId correctly maps partitions to shards.

The method:

  • Correctly computes bucket_id for hash vs. range partitions using appropriate mapper functions
  • Looks up the shard owner via bucket_infos_ (immutable after init, no lock needed per developer clarification)
  • Returns UINT32_MAX with error logging on missing bucket instead of asserting

Based on learnings: bucket_infos_ is immutable after initialization.


3117-3130: LGTM! GetAllDataShards properly synchronized.

The method correctly acquires a shared lock on dss_shard_ids_mutex_ before reading and copying dss_shard_ids_, ensuring thread-safe access. The returned vector preserves the set's sorted order.


3132-3197: LGTM! InitBucketsInfo uses thread-safe RNG with proper error handling.

The method correctly:

  • Guards against empty node_groups with early return and error log
  • Uses thread-safe std::mt19937 seeded per node group for deterministic virtual node placement
  • Implements consistent hashing with 64 virtual nodes per node group
  • Assigns buckets to the first virtual node position on the ring

The RNG is now thread-safe and deterministic, fixing the previous rand()/srand() issue.


3199-3209: LGTM! GetOwnerNodeIndexOfShard handles bounds errors gracefully.

The method now uses LOG(ERROR) and returns UINT32_MAX instead of LOG(FATAL) when shard_id exceeds bounds, allowing callers to handle the error gracefully.


3998-4011: LGTM! InitTableLastRangePartitionId handles non-hash-partitioned Sequences gracefully.

The method now returns false with an error log instead of asserting when the Sequences table is not hash-partitioned, allowing callers to handle the error appropriately.


3541-3570: LGTM! Internal routing methods correctly dispatch to local or remote shards.

All internal routing methods (ReadInternal, DeleteRangeInternal, FlushDataInternal, DropTableInternal, ScanNextInternal, ScanCloseInternal, BatchWriteRecordsInternal) consistently:

  • Check IsLocalShard(shard_id) to determine routing
  • Pass shard_id to local data_store_service_ calls
  • Set up RPC with correct node index for remote calls
  • Propagate shard_id through all code paths

Also applies to: 3596-3626, 3642-3669, 3686-3713, 3750-3788, 3816-3844, 4150-4189


3395-3407: LGTM! All public APIs and utility methods correctly implement shard routing.

The remaining methods properly:

  • Compute shard_id for their operations using GetShardIdByPartitionId
  • Pass shard_id through to internal routing functions
  • Handle both local and remote shard access
  • Apply shard routing to catalog, range, statistics, and table operations

The shard-aware refactoring is complete and consistent across the entire client.

Also applies to: 3440-3488, 3490-3521, 3523-3539, 3572-3594, 3628-3640, 3715-3748, 3790-3814, 3846-3898, 3900-3951, 4057-4111, 4113-4148, 4270-4356, 4539-4587, 4589-4632

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
eloq_data_store_service/rocksdb_data_store_common.cpp (1)

274-282: Clarify query_worker_pool_ lifecycle and align the Shutdown comment with actual behavior

Initialize() now either creates the pool or calls query_worker_pool_->Initialize() when it already exists, while Shutdown() only stops the pool via query_worker_pool_->Shutdown() and leaves the unique_ptr non-null. The in-code comment in Shutdown() says the pool “will be re-created in Initialize()”, but in practice it’s being reinitialized, not re-created, which can be confusing when reasoning about reuse and SubmitWork after Shutdown().

Since ThreadWorkerPool::Shutdown() (see thread_worker_pool.cpp, lines 103-121) joins threads and clears workers_ but keeps the object alive, the intended contract seems to be:

  • No calls that enqueue work between Shutdown() and the next Initialize().
  • On reuse, Initialize() must be called to recreate worker threads.

I’d suggest at least aligning the comment with the actual behavior, and optionally documenting this contract in the header so misuse is harder:

-        query_worker_pool_->Shutdown();
-        // If the data store be reused, query_worker_pool_ will be re-created in
-        // Initialize().
+        query_worker_pool_->Shutdown();
+        // If the data store is reused, query_worker_pool_ will be reinitialized
+        // in Initialize().

Optionally, you could also add an assertion or debug check inside ThreadWorkerPool::SubmitWork or at the start of operations here to fail fast if someone tries to submit work after Shutdown() but before Initialize().

Also applies to: 287-301

eloq_data_store_service/data_store_service.h (1)

631-641: Align GetDataStore with shard_id bounds and ownership check

GetDataStore should ideally mirror the safety checks in IsOwnerOfShard / FetchDSShardStatus:

  • Validate shard_id < data_shards_.size()
  • Optionally verify shard_id_ == shard_id before returning the store

For example:

DataStore *GetDataStore(uint32_t shard_id)
{
-    if (data_shards_.at(shard_id).shard_id_ == shard_id)
-    {
-        return data_shards_.at(shard_id).data_store_.get();
-    }
-    else
-    {
-        return nullptr;
-    }
+    if (shard_id >= data_shards_.size())
+    {
+        return nullptr;
+    }
+    auto &ds_ref = data_shards_[shard_id];
+    if (ds_ref.shard_id_ == shard_id)
+    {
+        return ds_ref.data_store_.get();
+    }
+    return nullptr;
}

This keeps ownership and bounds handling consistent and avoids accidental out-of-range access.

data_store_service_client.cpp (1)

2459-2487: Clarify EloqKv handling in MVCC archive code paths.

Multiple LOG(WARNING) statements (lines 2462, 2486-2487, 2631, 2742) indicate that EloqKv engine does not support MVCC features, yet the code contains logic to handle EloqKv tables in archive operations. If EloqKv truly doesn't support MVCC and these paths should never be reached, consider either:

  1. Adding assertions or returning errors early when EloqKv tables attempt MVCC operations
  2. Removing the EloqKv-specific handling if it's truly dead code
  3. Adding comments explaining why this fallback logic exists

This will make the intended behavior clearer and prevent confusion during future maintenance.

♻️ Duplicate comments (1)
data_store_service_client.cpp (1)

3104-3110: Replace debug-only asserts with runtime error checks.

The assert statements at lines 3105 and 3109 will crash the process in debug builds if a bucket is not found or shard_id is missing from dss_shard_ids_. This is inconsistent with the error handling at lines 3112-3114 that logs and returns UINT32_MAX. Replace the asserts with runtime checks that log errors and fall through to the existing error path, ensuring consistent behavior across debug and release builds.

-    auto it = bucket_infos_.find(bucket_id);
-    assert(it != bucket_infos_.end());
-    if (it != bucket_infos_.end())
+    auto it = bucket_infos_.find(bucket_id);
+    if (it != bucket_infos_.end())
     {
         uint32_t shard_id = it->second->BucketOwner();
-        assert(dss_shard_ids_.find(shard_id) != dss_shard_ids_.end());
+        if (dss_shard_ids_.find(shard_id) == dss_shard_ids_.end())
+        {
+            LOG(ERROR) << "Shard not found for shard_id=" << shard_id;
+            return UINT32_MAX;
+        }
         return shard_id;
     }
🧹 Nitpick comments (12)
eloq_data_store_service/rocksdb_data_store_common.cpp (1)

1150-1161: Optional: reserve enough space in BuildKey to avoid reallocations

BuildKey currently reserves table_name.size() + 2 + key.size(), which underestimates capacity by not accounting for the decimal representation of partition_id. This is harmless but can force an extra reallocation in hot paths.

You could cheaply tighten this up by materializing the partition-id string once and using its size in the reservation:

 std::string RocksDBDataStoreCommon::BuildKey(const std::string_view table_name,
                                              uint32_t partition_id,
                                              const std::string_view key)
 {
     std::string tmp_key;
-    tmp_key.reserve(table_name.size() + 2 + key.size());
+    std::string partition_str = std::to_string(partition_id);
+    tmp_key.reserve(table_name.size() + 2 + partition_str.size() + key.size());
     tmp_key.append(table_name);
     tmp_key.append(KEY_SEPARATOR);
-    tmp_key.append(std::to_string(partition_id));
+    tmp_key.append(partition_str);
     tmp_key.append(KEY_SEPARATOR);
     tmp_key.append(key);
     return tmp_key;
 }

Pure micro-optimization, feel free to defer if not latency-sensitive.

eloq_data_store_service/thread_worker_pool.cpp (3)

28-44: Ctor calling Initialize() constrains lifecycle; clarify or decouple

The ctor unconditionally calls Initialize(), so threads are always started on construction, even though Initialize() is also exposed publicly for reuse. That’s workable, but it means:

  • ThreadWorkerPool cannot be constructed in a “cold” state and then explicitly Initialize()d; it’s always hot.
  • The public Initialize() is effectively only for post‑Shutdown() reuse.

If the intent is to allow lazy startup, consider either:

  • Removing the ctor call and letting owners call Initialize() explicitly, or
  • Documenting that the public Initialize() is only for reinitialization after Shutdown().

34-84: Initialize() reuse semantics are okay but busy-wait loop is wasteful

The guarded init logic (while (shutdown_indicator_) sleep; if (workers_.empty()) spawn) correctly avoids racing with Shutdown() and prevents double-spawn. However, the spin-sleep loop is a minor inefficiency on contended shutdown/start paths.

If you want to avoid even brief busy-waiting, consider a small helper condition variable, or reuse work_queue_cv_, to signal shutdown completion instead of polling.


104-122: Callers must ensure Shutdown() is invoked before destruction

Shutdown() now cleanly joins workers, clears workers_, and resets shutdown_indicator_ so the pool can be reused. But the destructor is = default and does not call Shutdown(), so destroying a pool with live threads remains undefined behavior (destroying a joinable std::thread calls std::terminate).

Given this, either:

  • Make it explicit in comments that owners are responsible for calling Shutdown() before destruction, or
  • Consider invoking Shutdown() in the destructor (defensively guarded) to enforce RAII semantics.
eloq_data_store_service/data_store_service.h (3)

603-613: Guard write request counters against invalid shard_id

IncreaseWriteReqCount / DecreaseWriteReqCount also use data_shards_.at(shard_id) with no bounds check. Once you add a central shard-id guard (see previous comment), these should reuse it or at least early-return on invalid shard_id to avoid undefined behavior.

You might also want to use memory_order_relaxed here unless some other code relies on the release semantics; they are just counters.


615-620: Verify IsOwnerOfShard semantics across all DSShardStatus values

IsOwnerOfShard now returns true for any status that is not Closed:

const auto &ds_ref = data_shards_.at(shard_id);
return ds_ref.shard_status_.load(std::memory_order_acquire) !=
       DSShardStatus::Closed;

This means transient states such as Starting (and any future non-Closed states) are treated as “owned”. Call sites like Read/Scan* then separately gate on ReadOnly/ReadWrite, so behavior becomes:

  • non-Closed but not open → “owned but DB_NOT_OPEN”
  • Closed → “not owner → REQUESTED_NODE_NOT_OWNER”

If that’s intentional, it’s fine; if not, consider tightening to explicit allowed states (e.g., ReadWrite and ReadOnly) and/or also checking shard_id_ for extra safety, similar to GetDataStore.


671-704: DataShard::ShutDown lifecycle looks reasonable but document expectations

The new DataShard struct cleanly encapsulates per-shard state and provides a ShutDown() that:

  • Calls data_store_->Shutdown() without nulling data_store_
  • Clears and destroys scan_iter_cache_

This is fine, but the comment about not nulling data_store_ hints at concurrency assumptions (reads may still use the pointer). It would help future maintainers if you:

  • Document that callers must ensure no new reads are admitted once ShutDown() is entered, and
  • Clarify whether shard_status_ is updated to Closed elsewhere as part of the shutdown protocol.

No functional issues as-is, but the invariants are subtle.

data_store_service_client.h (4)

73-124: Ctor wiring for shard owners and bucket infos looks consistent

The updated ctor:

  • Accepts is_bootstrap and bind_data_shard_with_ng flags and stores them in need_bootstrap_ / bind_data_shard_with_ng_.
  • Builds dss_shards_ from cluster_manager.GetAllShards() with a hard bound check (LOG(FATAL) if shard_id >= dss_shards_.size()).
  • Populates dss_shard_ids_ and initializes bucket_infos_ via InitBucketsInfo.
  • Registers a config-update listener if data_store_service_ is present.

This sets up the shard→node index mapping and bucket mapping coherently. Just ensure callers only pass a DataStoreService* whose lifetime strictly outlives the client.


487-489: GetShardIdByPartitionId helper is well-aligned with bucket_infos_

The new public helper:

uint32_t GetShardIdByPartitionId(int32_t partition_id,
                                 bool is_range_partition) const;

matches the cpp implementation that:

  • Maps partition_id to a bucket id via Sharder,
  • Looks up bucket_infos_,
  • Returns the owning shard id.

It’s a clean abstraction for callers and pairs well with GetAllDataShards(). Just keep in mind that the implementation currently uses assert for bucket_infos_ and dss_shard_ids_ sanity; if you ever expect invalid configs in production, you may want non-assert error handling too.


670-676: UpdateShardOwner declaration is clear but document locking expectations

void UpdateShardOwner(uint32_t shard_id, uint32_t node_id); is a useful helper for adjusting ownership without exposing the underlying arrays.

Given that dss_shard_ids_ is now guarded by dss_shard_ids_mutex_ and dss_shards_ is shared across threads, it would be good to document in the header that UpdateShardOwner is responsible for doing any necessary locking internally (as is already done for other helpers in the cpp).


747-756: dss_shards_, dss_shard_ids_, and bucket_infos_ layout is sensible

The shard/bucket metadata fields:

std::array<std::atomic<uint32_t>, 1000> dss_shards_;
std::shared_mutex dss_shard_ids_mutex_;
std::set<uint32_t> dss_shard_ids_;
std::unordered_map<uint16_t, std::unique_ptr<txservice::BucketInfo>> bucket_infos_;

mirror the per-shard layout on the service side and give you:

  • Fast shard→node index lookup (dss_shards_)
  • A thread-safe set of known local shard ids (dss_shard_ids_)
  • A stable mapping from bucket id to bucket info (bucket_infos_)

Note: the comment above dss_shards_ still says “Now only support one shard”; consider updating it to describe the 0–999 shard range and relation to shard_count/topology config.
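A sketch of the shared-mutex pattern that guards `dss_shard_ids_`, with illustrative names (`ShardIds`, `Snapshot`, `Add`) rather than the actual client methods:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>
#include <shared_mutex>
#include <vector>

// Illustrative reader/writer wrapper over the shard-id set; mirrors the
// locking pattern described above, not the exact client code.
struct ShardIds
{
    mutable std::shared_mutex mtx;
    std::set<uint32_t> ids;

    std::vector<uint32_t> Snapshot() const
    {
        std::shared_lock<std::shared_mutex> lk(mtx);  // readers share the lock
        return {ids.begin(), ids.end()};
    }

    void Add(uint32_t id)
    {
        std::unique_lock<std::shared_mutex> lk(mtx);  // writers exclude readers
        ids.insert(id);
    }
};
```

GetAllDataShards() would use the `Snapshot()` shape; topology updates would use the exclusive-lock shape.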

data_store_service_client.cpp (1)

2554-2643: Consider removing or documenting deprecated archive fetch methods.

Both FetchArchives (lines 2554-2643) and FetchVisibleArchive (lines 2664-2751) log warnings stating they "should not be used because all archive versions are fetched from ccmap" and contain TODO comments about potential removal. If these methods are truly obsolete, consider removing them to reduce maintenance burden and code complexity. If they might be needed for backward compatibility or specific edge cases, document when and why they should be used instead of the preferred approach.

Also applies to: 2664-2751

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dea26cb and 6c34bce.

📒 Files selected for processing (6)
  • data_store_service_client.cpp (102 hunks)
  • data_store_service_client.h (12 hunks)
  • eloq_data_store_service/data_store_service.h (14 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.cpp (3 hunks)
  • eloq_data_store_service/thread_worker_pool.cpp (2 hunks)
  • eloq_data_store_service/thread_worker_pool.h (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client.h
  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client.cpp
🧬 Code graph analysis (6)
eloq_data_store_service/thread_worker_pool.h (2)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Initialize (163-285)
  • Initialize (163-163)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Initialize (34-84)
  • Initialize (34-34)
eloq_data_store_service/rocksdb_data_store_common.cpp (3)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (104-122)
  • Shutdown (104-104)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (165-186)
  • Shutdown (165-165)
eloq_data_store_service/thread_worker_pool.cpp (1)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Initialize (163-285)
  • Initialize (163-163)
data_store_service_client.h (4)
eloq_data_store_service/data_store_service_config.h (2)
  • DataStoreServiceClusterManager (317-513)
  • uint32_t (107-110)
eloq_data_store_service/data_store_service.h (1)
  • DataStoreService (180-694)
data_store_service_client.cpp (24)
  • FindFreeNodeIndex (3269-3288)
  • FindFreeNodeIndex (3269-3269)
  • InitBucketsInfo (3132-3197)
  • InitBucketsInfo (3132-3136)
  • OnLeaderStart (2973-2990)
  • OnLeaderStart (2973-2974)
  • OnLeaderStop (2992-3005)
  • OnLeaderStop (2992-2992)
  • OnStartFollowing (3014-3063)
  • OnStartFollowing (3014-3018)
  • GetShardIdByPartitionId (3090-3115)
  • GetShardIdByPartitionId (3090-3091)
  • IsLocalShard (3080-3088)
  • IsLocalShard (3080-3080)
  • GetOwnerNodeIndexOfShard (3199-3209)
  • GetOwnerNodeIndexOfShard (3199-3200)
  • GetAllDataShards (3117-3130)
  • GetAllDataShards (3117-3117)
  • UpdateOwnerNodeIndexOfShard (3211-3267)
  • UpdateOwnerNodeIndexOfShard (3211-3212)
  • HandleShardingError (3290-3323)
  • HandleShardingError (3290-3291)
  • UpgradeShardVersion (3325-3372)
  • UpgradeShardVersion (3325-3328)
eloq_data_store_service/data_store_service_config.cpp (4)
  • GetShardIdByPartitionId (802-807)
  • GetShardIdByPartitionId (802-803)
  • HandleShardingError (950-992)
  • HandleShardingError (950-951)
eloq_data_store_service/data_store_service.h (4)
eloq_data_store_service/data_store_service.cpp (12)
  • StartService (242-312)
  • StartService (242-242)
  • Read (399-432)
  • Read (399-402)
  • Read (434-477)
  • Read (434-442)
  • ScanClose (995-1028)
  • ScanClose (995-998)
  • ScanClose (1030-1062)
  • ScanClose (1030-1035)
  • TTLWrapperCache (72-77)
  • TTLWrapperCache (79-88)
eloq_data_store_service/rocksdb_data_store_common.cpp (5)
  • Read (436-497)
  • Read (436-436)
  • key (852-852)
  • ScanClose (1067-1091)
  • ScanClose (1067-1067)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
data_store_service_client.cpp (2)
data_store_service_client.h (4)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
🔇 Additional comments (15)
eloq_data_store_service/rocksdb_data_store_common.cpp (1)

445-462: Read path change to use req->GetKey() + BuildKey(table_name, partition_id, key) looks sound

The new read path:

  • Extracts table_name, partition_id, and key from ReadRequest.
  • Builds the RocksDB key via BuildKey(table_name, partition_id, key).
  • Uses the resulting key_str for the db->Get call and error logging.

This is consistent with how DeleteRange and ScanNext construct keys using the same BuildKey/BuildKeyPrefix helpers, and the lifetime of table_name/key is safe because PoolableGuard req keeps the request alive for the duration of the lambda.

No issues from a correctness or lifetime standpoint.
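A sketch of the composite-key shape BuildKey produces; the actual encoding in rocksdb_data_store_common.cpp may use a different separator or binary layout, so this is only illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Illustrative stand-in for BuildKey(table_name, partition_id, key):
// a table/partition/key composite RocksDB key. The real encoding may differ.
std::string BuildKey(std::string_view table_name,
                     int32_t partition_id,
                     std::string_view key)
{
    std::string out;
    out.reserve(table_name.size() + 12 + key.size() + 2);
    out.append(table_name);
    out.push_back('/');
    out.append(std::to_string(partition_id));
    out.push_back('/');
    out.append(key);
    return out;
}
```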

eloq_data_store_service/thread_worker_pool.h (1)

36-58: Public Initialize() API looks consistent with implementation

Exposing Initialize() publicly matches the new reuse pattern in RocksDBDataStoreCommon::Initialize() and the cpp implementation, which is idempotent (checks workers_.empty()). No issues at the declaration level.

eloq_data_store_service/data_store_service.h (1)

521-529: Delegating PrepareShardingError to cluster_manager_ is the right fix

Forwarding PrepareShardingError to cluster_manager_.PrepareShardingError restores the rich topology information (new primary, cluster config, shard version) needed for clients to reroute correctly. This addresses the earlier simplified implementation that broke client-side routing.

data_store_service_client.h (11)

139-144: TxConfigsToDssClusterConfig parameterization is clear

The updated signature for TxConfigsToDssClusterConfig (explicit uint32_t node_id and NG leader map) makes the conversion from tx-service configs to DSS cluster config more explicit. No issues at the declaration level.


395-403: Leader/follower hooks correctly coordinate with DataStoreService

The new overrides:

  • OnLeaderStart / OnLeaderStop call OpenDataStore(ng_id) / CloseDataStore(ng_id) when bind_data_shard_with_ng_ is true and data_store_service_ is non-null, then (for leader start) Connect().
  • OnStartFollowing closes the shard locally, updates cluster-manager primary and shard version, then calls SetupConfig and Connect().

This wiring cleanly ties tx-service leadership changes to per-shard data store lifecycle when co-located. The control flow and conditions look sound.


501-507: Shard-aware Read() signature is consistent with DataStoreService

The internal Read now takes both partition_id and shard_id:

void Read(const std::string_view kv_table_name,
          const int32_t partition_id,
          const uint32_t shard_id,
          const std::string_view key,
          void *callback_data,
          DataStoreCallback callback);

This aligns with the service side (DataStoreService::Read overload with partition_id + shard_id) and with the sharding config helpers in DataStoreServiceClusterManager. No issues with the interface.


510-525: BatchWriteRecords gains shard_id for proper routing

Similarly, adding uint32_t shard_id to the internal BatchWriteRecords signature matches the public service API and allows correct local-vs-remote routing per shard. The rest of the parameter list is unchanged, so call sites can be updated mechanically.


554-561: DeleteRange internal API now shard-aware and matches service

The extra uint32_t shard_id parameter on the internal DeleteRange aligns with the shard-aware service declaration and makes routing consistent with reads and writes. No functional concerns here.


574-588: ScanNext internal path correctly includes shard_id

Adding shard_id to the internal ScanNext helper mirrors the public RPC and service overloads and is necessary for per-shard scan routing. The rest of the parameters (session, search conditions, batch size) are unchanged.


592-599: ScanClose internal path also properly carries shard_id

ScanClose now includes int32_t partition_id and uint32_t shard_id, matching DataStoreService::ScanClose and ensuring scans are closed against the correct shard’s iterator cache.


627-634: KvPartitionIdOf hashing change is reasonable

Using std::hash<std::string_view>() on the table name’s string_view and feeding that into MapKeyHashToHashPartitionId is a straightforward way to derive a stable hash partition id for system tables:

auto hash_code = std::hash<std::string_view>()(sv);
return txservice::Sharder::MapKeyHashToHashPartitionId(hash_code);

Given that range/hash partitioning elsewhere already uses the sharder, this looks consistent.


648-653: IsLocalShard wrapper is thin and consistent

IsLocalShard(uint32_t shard_id) simply delegates to data_store_service_->IsOwnerOfShard(shard_id) when co-located, returning false otherwise. This is a minimal, clear abstraction over the service-side ownership check.


659-668: Shard metadata helpers and InitBucketsInfo declaration match cpp usage

The trio:

  • uint32_t GetOwnerNodeIndexOfShard(uint32_t shard_id) const;
  • std::vector<uint32_t> GetAllDataShards();
  • void InitBucketsInfo(const std::set<uint32_t> &node_groups, uint64_t version, std::unordered_map<uint16_t, std::unique_ptr<txservice::BucketInfo>> &ng_bucket_infos);

matches the cpp implementations where:

  • GetOwnerNodeIndexOfShard checks shard_id against dss_shards_.size() and returns UINT32_MAX on overflow.
  • GetAllDataShards uses dss_shard_ids_mutex_ for thread-safe iteration.
  • InitBucketsInfo populates bucket_infos_ deterministically from node groups.

The declaration set looks coherent and thread-safety concerns around dss_shard_ids_ are addressed via the shared mutex.


685-690: New flags need_bootstrap_ and bind_data_shard_with_ng_ are well-scoped

Storing need_bootstrap_ and bind_data_shard_with_ng_ as simple booleans tied to ctor parameters is straightforward and keeps bootstrap/binding policy local to the client. Combined with the leadership callbacks, they cleanly gate shard–NG coupling behavior.

data_store_service_client.cpp (1)

3132-3197: LGTM: Thread-safe and deterministic bucket assignment.

The InitBucketsInfo implementation correctly uses std::mt19937 for thread-safe, deterministic random number generation, handles the empty node_groups case gracefully, and implements consistent hashing with virtual nodes. The deterministic behavior (same node group ID produces same bucket assignments) is crucial for multi-node consistency.
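The deterministic-seeding idea can be sketched as follows, assuming each node seeds `std::mt19937` with the node group id so all nodes compute the same assignment; the seeding and distribution here are illustrative, not the exact InitBucketsInfo logic:

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative deterministic bucket assignment: the same seed yields the
// same owner sequence on every node of a given build.
std::vector<uint32_t> AssignBuckets(uint32_t node_group_id,
                                    uint16_t bucket_count,
                                    uint32_t shard_count)
{
    std::mt19937 gen(node_group_id);  // same seed -> same sequence everywhere
    std::uniform_int_distribution<uint32_t> dist(0, shard_count - 1);
    std::vector<uint32_t> owners(bucket_count);
    for (auto &owner : owners)
    {
        owner = dist(gen);
    }
    return owners;
}
```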

Comment on lines 582 to 590

DSShardStatus FetchDSShardStatus(uint32_t shard_id)
{
-    if (shard_id_ == shard_id)
+    if (data_shards_.at(shard_id).shard_id_ == shard_id)
     {
-        return shard_status_;
+        return data_shards_.at(shard_id).shard_status_.load(
+            std::memory_order_acquire);
     }
     return DSShardStatus::Closed;
}

⚠️ Potential issue | 🟠 Major

Add shard_id bounds check in FetchDSShardStatus to avoid out-of-range access

FetchDSShardStatus uses data_shards_.at(shard_id) without validating shard_id against data_shards_.size(). A malformed or misrouted request with shard_id >= 1000 will make std::array::at throw std::out_of_range, which, if unhandled, aborts the process.

Consider guarding all inline helpers that index data_shards_:

DSShardStatus FetchDSShardStatus(uint32_t shard_id)
{
-    if (data_shards_.at(shard_id).shard_id_ == shard_id)
+    if (shard_id >= data_shards_.size())
+    {
+        return DSShardStatus::Closed;
+    }
+
+    if (data_shards_[shard_id].shard_id_ == shard_id)
     {
-        return data_shards_.at(shard_id).shard_status_.load(
-            std::memory_order_acquire);
+        return data_shards_[shard_id].shard_status_.load(
+            std::memory_order_acquire);
     }
     return DSShardStatus::Closed;
}

Apply the same pattern to IncreaseWriteReqCount, DecreaseWriteReqCount, IsOwnerOfShard, and GetDataStore to harden all shard-indexed accesses.

🤖 Prompt for AI Agents
In eloq_data_store_service/data_store_service.h around lines 582 to 590,
FetchDSShardStatus indexes data_shards_ with at() without bounds checking; add a
guard that verifies shard_id < data_shards_.size() before accessing
data_shards_.at(shard_id) and return DSShardStatus::Closed if out of range, then
use the same defensive pattern in IncreaseWriteReqCount, DecreaseWriteReqCount,
IsOwnerOfShard, and GetDataStore to validate shard_id and avoid out-of-range
accesses (perform size check first, only then call at() and proceed with
existing logic).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
eloq_data_store_service/data_store_service.cpp (3)

404-432: Consider adding bounds validation for shard_id.

Throughout the RPC handlers, shard_id is extracted from requests and used directly with data_shards_.at(shard_id) (lines 414, 451, 495, 586, etc.). The at() method throws std::out_of_range if the index is invalid. While IsOwnerOfShard(shard_id) provides some validation, consider adding explicit bounds checking before dereferencing to prevent potential exceptions from malformed requests or configuration mismatches.

if (shard_id >= data_shards_.size()) {
    brpc::ClosureGuard done_guard(done);
    auto *result = response->mutable_result();
    result->set_error_code(::EloqDS::remote::DataStoreError::INVALID_SHARD_ID);
    result->set_error_msg("Invalid shard ID");
    return;
}

1010-1020: Remove assert in production error path.

Line 1014 contains assert(false) in an error path that can be reached if the shard status is neither ReadWrite nor ReadOnly. This will crash the service in any build compiled with asserts enabled (i.e., without NDEBUG). The error condition is already being handled correctly by returning an error response.

     if (shard_status != DSShardStatus::ReadWrite &&
         shard_status != DSShardStatus::ReadOnly)
     {
-        assert(false);
         brpc::ClosureGuard done_guard(done);
         auto *result = response->mutable_result();

1044-1054: Remove assert in production error path.

Line 1049 contains assert(false) in an error path similar to line 1014. This will crash the service if the shard status is neither ReadWrite nor ReadOnly during a scan close operation.

     if (shard_status != DSShardStatus::ReadWrite &&
         shard_status != DSShardStatus::ReadOnly)
     {
-        assert(false);
         brpc::ClosureGuard done_guard(done);
         result->set_error_code(::EloqDS::remote::DataStoreError::DB_NOT_OPEN);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c34bce and 429175b.

📒 Files selected for processing (1)
  • eloq_data_store_service/data_store_service.cpp (45 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • eloq_data_store_service/data_store_service.cpp
🧬 Code graph analysis (1)
eloq_data_store_service/data_store_service.cpp (4)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (615-620)
  • PrepareShardingError (525-529)
  • IncreaseWriteReqCount (603-607)
  • DecreaseWriteReqCount (609-613)
eloq_data_store_service/data_store_service_config.cpp (6)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
data_store_service_client.cpp (12)
  • Read (1087-1096)
  • Read (1087-1092)
  • Read (3528-3544)
  • Read (3528-3533)
  • DeleteRange (3577-3599)
  • DeleteRange (3577-3584)
  • ScanNext (3720-3753)
  • ScanNext (3720-3734)
  • ScanClose (3795-3819)
  • ScanClose (3795-3800)
  • BatchWriteRecords (4118-4153)
  • BatchWriteRecords (4118-4131)
eloq_data_store_service/rocksdb_data_store_common.cpp (15)
  • Read (436-497)
  • Read (436-436)
  • key (852-852)
  • DeleteRange (359-434)
  • DeleteRange (359-359)
  • start_key (387-387)
  • start_key (699-699)
  • end_key (402-402)
  • end_key (704-704)
  • ScanNext (755-1065)
  • ScanNext (755-755)
  • ScanClose (1067-1091)
  • ScanClose (1067-1067)
  • BatchWriteRecords (499-661)
  • BatchWriteRecords (499-500)
🔇 Additional comments (6)
eloq_data_store_service/data_store_service.cpp (6)

216-221: LGTM! Per-shard initialization looks correct.

The constructor properly initializes each shard in the data_shards_ array with its ID and Closed status, preparing for multi-shard operation.


254-289: LGTM! Per-shard startup logic is sound.

The startup flow correctly iterates through owned shards, creates per-shard data stores with appropriate error handling, and initializes per-shard scan iterator caches. The atomic status transitions using compare-exchange ensure thread safety.


375-378: Good defensive check for scan iterator cache.

The null check before creating scan_iter_cache_ handles the case where a shard may be reconnected after partial initialization, preventing resource leaks.


1083-1132: LGTM! Scan iterator cache methods properly handle per-shard caches.

All scan iterator cache methods consistently check for null cache before dereferencing, providing appropriate defensive behavior for partially initialized shards.


1893-1927: Excellent observability for shard opening operations.

The timing measurements and comprehensive logging with shard IDs, status, and duration provide valuable debugging information for troubleshooting multi-shard startup issues.


2311-2330: Proper synchronization for read-write to read-only transition.

The implementation correctly uses compare-exchange for atomic status transitions and waits for all in-flight write requests to complete before finalizing the switch to ReadOnly mode, ensuring data consistency.
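The CAS-then-drain pattern described above can be sketched as follows; `ShardStub` and the field names are illustrative stand-ins for the real DataShard members:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

enum class DSShardStatus { Closed, ReadWrite, ReadOnly };

// Illustrative per-shard state; mirrors the shape of the real DataShard.
struct ShardStub
{
    std::atomic<DSShardStatus> status{DSShardStatus::ReadWrite};
    std::atomic<int> ongoing_writes{0};
};

// Sketch of the transition: CAS the status so new writes are rejected,
// then drain in-flight writes before the switch is considered complete.
bool SwitchReadWriteToReadOnly(ShardStub &shard)
{
    DSShardStatus expected = DSShardStatus::ReadWrite;
    if (!shard.status.compare_exchange_strong(expected,
                                              DSShardStatus::ReadOnly))
    {
        return false;  // shard was not in ReadWrite; nothing to do
    }
    // Wait for writers that were admitted before the CAS.
    while (shard.ongoing_writes.load(std::memory_order_acquire) > 0)
    {
        std::this_thread::yield();
    }
    return true;
}
```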

@lzxddz lzxddz force-pushed the dss_multi_shard-rebase branch from 429175b to 85d911c Compare November 18, 2025 07:59
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
eloq_data_store_service/data_store_service.cpp (1)

314-397: Avoid leaving shard_status_ stuck in Starting on data store open failure.

In ConnectAndStartDataStore, all early failure returns (CreateDataStore/Initialize/StartDB failures) are taken after the shard status has been CAS'ed from Closed to Starting, but before it is updated to the target open_mode. On those paths:

  • shard_ref.shard_status_ remains Starting.
  • Subsequent calls will hit the while (expect_status == DSShardStatus::Starting) loop (Lines 333–338) and spin forever, because nothing ever moves the shard out of Starting.
  • IsOwnerOfShard will also treat this shard as owned (non‑Closed), but data_store_ may be null, so the only guard against dereference is the shard_status check in each RPC handler.

Reset the status back to Closed (or some explicit terminal error state) before returning false so the shard can be retried or treated cleanly as closed. For example:

    if (shard_ref.data_store_ == nullptr)
    {
        shard_ref.data_store_ = data_store_factory_->CreateDataStore(
            create_db_if_missing, data_shard_id, this, true);
        if (shard_ref.data_store_ == nullptr)
        {
            LOG(ERROR) << "Failed to create data store";
+           DSShardStatus expected = DSShardStatus::Starting;
+           shard_ref.shard_status_.compare_exchange_strong(
+               expected, DSShardStatus::Closed, std::memory_order_release);
            return false;
        }
    }
    else
    {
        bool res = shard_ref.data_store_->Initialize();
        if (!res)
        {
            LOG(ERROR) << "Failed to initialize data store";
+           DSShardStatus expected = DSShardStatus::Starting;
+           shard_ref.shard_status_.compare_exchange_strong(
+               expected, DSShardStatus::Closed, std::memory_order_release);
            return false;
        }

        res = shard_ref.data_store_->StartDB();
        if (!res)
        {
            LOG(ERROR) << "Failed to start db instance in data store service";
+           DSShardStatus expected = DSShardStatus::Starting;
+           shard_ref.shard_status_.compare_exchange_strong(
+               expected, DSShardStatus::Closed, std::memory_order_release);
            return false;
        }
    }

You can also factor this into a small helper to keep the pattern consistent across all failure exits.

♻️ Duplicate comments (4)
data_store_service_scanner.h (1)

207-221: Fix partition_id type mismatch in SinglePartitionScanner; new data_shard_id wiring looks good.

  • Reset now takes int32_t partition_id and uint32_t data_shard_id, assigns both, and Clear() resets data_shard_id_ along with the rest of the state. That part is consistent with the per-shard scan design.

  • However, partition_id_ is now declared as int32_t, while GetPartitionId() still returns uint32_t:

    uint32_t GetPartitionId()
    {
        return partition_id_;
    }
    ...
    int32_t partition_id_;

    This is inconsistent with the proto (int32 partition_id) and internal_request.h getters, and it can silently reinterpret negative/sentinel values if they ever arise.

Update the getter to return int32_t to match the member and the rest of the stack, and adjust any callers that expect an unsigned type:

-    uint32_t GetPartitionId()
+    int32_t GetPartitionId()
     {
         return partition_id_;
     }

The data_shard_id_ addition and its reset in Clear() look correct.

Also applies to: 235-238, 248-259, 264-267

eloq_data_store_service/ds_request.proto (1)

92-97: Avoid renumbering existing proto fields; BatchWriteRecordsRequest likely still breaks wire compatibility.

  • ReadRequest, DeleteRangeRequest, and ScanRequest add shard_id at new field numbers and switch partition_id to int32 while keeping the original field numbers. Those changes are generally compatible as long as all code treats partition IDs as non-negative.
  • In BatchWriteRecordsRequest, you now have:
    string kv_table_name = 1;
    int32 partition_id   = 2;
    uint32 shard_id      = 3;
    bool skip_wal        = 4;
    repeated Item items  = 5;
    If earlier releases had skip_wal = 3 and items = 4 (which is very likely given the previous review), inserting shard_id at 3 and shifting the existing fields changes their wire numbers and will cause old and new clients/servers to misinterpret each other’s messages (e.g., old peers reading shard_id as skip_wal, or new peers reading old skip_wal as shard_id).

To stay wire-compatible, keep the original numbers for existing fields and assign shard_id an unused number instead, e.g.:

string kv_table_name = 1;
int32 partition_id   = 2;
bool skip_wal        = 3;  // keep original number
repeated Item items  = 4;  // keep original number
uint32 shard_id      = 5;  // or another unused number

Please confirm the historical numbering from your existing deployed version and adjust accordingly. If you truly intend a breaking change, it should be versioned and coordinated across all dependent services.

Also applies to: 132-146, 161-168, 302-320

eloq_data_store_service/data_store_service.h (1)

582-590: Add shard_id bounds checks before indexing data_shards_

All these helpers index data_shards_ with .at(shard_id) without checking shard_id < data_shards_.size(). A malformed or misrouted request (or a bug in the caller’s shard routing) would turn into a std::out_of_range and likely abort the process. This was previously flagged and is still present.

Recommend hardening them with a cheap bounds guard and switching to operator[] once bounds are validated, e.g.:

DSShardStatus FetchDSShardStatus(uint32_t shard_id)
{
-    if (data_shards_.at(shard_id).shard_id_ == shard_id)
+    if (shard_id >= data_shards_.size())
+    {
+        return DSShardStatus::Closed;
+    }
+
+    const auto &ds_ref = data_shards_[shard_id];
+    if (ds_ref.shard_id_ == shard_id)
     {
-        return data_shards_.at(shard_id).shard_status_.load(
-            std::memory_order_acquire);
+        return ds_ref.shard_status_.load(std::memory_order_acquire);
     }
     return DSShardStatus::Closed;
}

void IncreaseWriteReqCount(uint32_t shard_id)
{
-    data_shards_.at(shard_id).ongoing_write_requests_.fetch_add(
-        1, std::memory_order_release);
+    if (shard_id >= data_shards_.size())
+    {
+        return;
+    }
+    data_shards_[shard_id].ongoing_write_requests_.fetch_add(
+        1, std::memory_order_release);
}

void DecreaseWriteReqCount(uint32_t shard_id)
{
-    data_shards_.at(shard_id).ongoing_write_requests_.fetch_sub(
-        1, std::memory_order_release);
+    if (shard_id >= data_shards_.size())
+    {
+        return;
+    }
+    data_shards_[shard_id].ongoing_write_requests_.fetch_sub(
+        1, std::memory_order_release);
}

bool IsOwnerOfShard(uint32_t shard_id) const
{
-    const auto &ds_ref = data_shards_.at(shard_id);
+    if (shard_id >= data_shards_.size())
+    {
+        return false;
+    }
+    const auto &ds_ref = data_shards_[shard_id];
    return ds_ref.shard_status_.load(std::memory_order_acquire) !=
           DSShardStatus::Closed;
}

DataStore *GetDataStore(uint32_t shard_id)
{
-    if (data_shards_.at(shard_id).shard_id_ == shard_id)
+    if (shard_id >= data_shards_.size())
+    {
+        return nullptr;
+    }
+    if (data_shards_[shard_id].shard_id_ == shard_id)
     {
-        return data_shards_.at(shard_id).data_store_.get();
+        return data_shards_[shard_id].data_store_.get();
     }
     else
     {
         return nullptr;
     }
}

This keeps misrouted shard_ids from turning into crashes and is consistent with the bound checks you've added on the client side.

Also applies to: 603-607, 609-613, 615-620, 631-641

eloq_data_store_service/data_store_service.cpp (1)

1854-1891: Shard close still logs state‑transition failures without enforcing a closed terminal state.

CloseDataStore now logs when SwitchReadWriteToReadOnly or SwitchReadOnlyToClosed fail, but it still treats those errors as a no‑op and returns without verifying that ds_ref.shard_status_ actually reaches Closed. This means shards can remain in ReadWrite or ReadOnly if a transition fails, with no retry or explicit error surfaced to callers.

This behavior (log‑only, no fail‑fast) was already flagged in previous reviews; consider either:

  • enforcing that a caller can detect failure (e.g., return a status), and/or
  • adding bounded retries / a final state check to guarantee that CloseDataStore either closes the shard or reports failure explicitly.
🧹 Nitpick comments (11)
eloq_data_store_service/rocksdb_data_store_common.cpp (1)

287-301: Consider documenting Shutdown() reuse semantics.

The method shuts down the worker pool but intentionally does not null query_worker_pool_ or data_store_service_, allowing Initialize() to reuse the pool. While the inline comment at line 293 hints at this, consider adding a brief doc comment on the method declaration in the header explaining the reuse pattern.

In rocksdb_data_store_common.h, add a doc comment:

/**
 * @brief Shutdown the data store resources.
 * 
 * Shuts down the query worker pool and erases scan iterators.
 * Does not null internal pointers to allow reuse via Initialize().
 */
void Shutdown() override;
data_store_service_client.h (2)

632-634: Prefer direct hash computation over temporary.

The KvPartitionIdOf method creates a temporary hash and then calls MapKeyHashToHashPartitionId. Consider computing the hash inline to avoid the intermediate variable.

 int32_t KvPartitionIdOf(const txservice::TableName &table) const
 {
-    std::string_view sv = table.StringView();
-    auto hash_code = std::hash<std::string_view>()(sv);
-    return txservice::Sharder::MapKeyHashToHashPartitionId(hash_code);
+    return txservice::Sharder::MapKeyHashToHashPartitionId(
+        std::hash<std::string_view>()(table.StringView()));
 }

749-756: Document bucket_infos_ immutability and dss_shard_ids_ protection.

Per past review discussion, bucket_infos_ size never changes after initialization, so no mutex is needed. However, dss_shard_ids_ is mutable and protected by dss_shard_ids_mutex_. Add a brief comment clarifying the synchronization guarantees.

     std::array<std::atomic<uint32_t>, 1000> dss_shards_;
     std::atomic<uint64_t> dss_topology_version_{0};
 
+    // Protects dss_shard_ids_ for concurrent access
     std::shared_mutex dss_shard_ids_mutex_;
+    // Mutable set of active shard IDs; access guarded by dss_shard_ids_mutex_
     std::set<uint32_t> dss_shard_ids_;
-    // key is bucket id, value is bucket info.
+    // Immutable after construction; maps bucket_id to BucketInfo; no lock needed
     std::unordered_map<uint16_t, std::unique_ptr<txservice::BucketInfo>>
         bucket_infos_;
eloq_data_store_service/thread_worker_pool.cpp (1)

31-84: ThreadWorkerPool lifecycle and re-init look sound; spin-wait could be tightened.

Constructor + Initialize() + Shutdown() form a consistent lifecycle: workers are started once per generation, block on the queue with a cv, and exit cleanly when shutdown_indicator_ is set, after which Shutdown() joins, clears workers_, and resets the flag so the pool can be reused.

Two minor suggestions:

  • If Initialize() may be called from arbitrary threads, consider documenting that it’s single-threaded or guarding it with a mutex/std::call_once to avoid accidental concurrent init.
  • The while (shutdown_indicator_) loop is a busy wait; if you expect long shutdowns, consider replacing it with a condition variable or making Initialize() only callable after the previous Shutdown() returns.

Also applies to: 104-122
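A minimal sketch of the second suggestion, assuming a simplified pool (names and members are illustrative, not the real ThreadWorkerPool): Initialize() blocks on a condition variable until any in-flight shutdown completes, instead of spinning on shutdown_indicator_.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ShardWorkerPool
{
public:
    void Initialize(size_t n)
    {
        std::unique_lock<std::mutex> lk(mux_);
        // Wait (without spinning) until any in-flight shutdown finishes.
        idle_cv_.wait(lk, [this] { return !shutdown_indicator_; });
        for (size_t i = 0; i < n; ++i)
        {
            workers_.emplace_back([this] { WorkerLoop(); });
        }
    }

    void Submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lk(mux_);
            tasks_.push(std::move(task));
        }
        task_cv_.notify_one();
    }

    void Shutdown()
    {
        {
            std::lock_guard<std::mutex> lk(mux_);
            shutdown_indicator_ = true;
        }
        task_cv_.notify_all();
        for (auto &w : workers_)
        {
            w.join();
        }
        workers_.clear();
        {
            std::lock_guard<std::mutex> lk(mux_);
            shutdown_indicator_ = false;
        }
        // Wake any Initialize() waiting to reuse the pool.
        idle_cv_.notify_all();
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mux_);
                task_cv_.wait(
                    lk, [this] { return shutdown_indicator_ || !tasks_.empty(); });
                if (shutdown_indicator_ && tasks_.empty())
                {
                    return;  // drain the queue before exiting
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }

    std::mutex mux_;
    std::condition_variable task_cv_;
    std::condition_variable idle_cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool shutdown_indicator_{false};
};
```

The same idle_cv_ also makes concurrent Initialize() calls safe to serialize, addressing the first suggestion without a separate std::call_once.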

data_store_service_client_closure.cpp (1)

245-267: FetchBucketDataCallback sharding and key/value handling look correct; just ensure AddDataItem matches new key semantics.

  • Moving to std::string key_str/value_str from ScanNextClosure::GetItem and pushing them via AddDataItem(std::move(key_str), std::move(value_str), ts, false) is consistent; key_str/value_str are re-assigned each iteration, so moving them is safe.
  • TTL filtering (ttl > 0 && ttl < now) skips expired entries, matching the previous behavior.
  • The partition/shard routing:
    • MapBucketIdToHashPartitionId(fetch_bucket_data_cc->bucket_id_)
    • KvPartitionIdOf(partition_id, false)
    • GetShardIdByPartitionId(kv_partition_id, false)
      then feeding kv_partition_id and data_shard_id into ScanNext is aligned with the new hash-partition + shard mapping. Given the earlier guarantee that bucket_id stays within a small range, the int32_t partition_id is safe. Based on learnings.

Just double-check that FetchBucketDataCc::AddDataItem is now defined to accept the raw key string (not the previous TxKey/hash-part form) so downstream consumers interpret keys correctly.

Also applies to: 278-285, 285-299

data_store_service_client.cpp (3)

224-227: Align ScheduleTimerTasks documentation with noop implementation

Connect()’s need_bootstrap_ guard and the new ScheduleTimerTasks() body (just LOG(WARNING) + return) are fine behaviorally, but the comment above ScheduleTimerTasks still says “Will assert and log an error if called.”

To avoid confusion for future readers, consider updating the docstring to say that the method is currently a noop that logs a warning, and that timer-based tasks are not yet implemented.

Also applies to: 251-254


1808-1813: “should not be used (noop)” logs are misleading for live code paths

DropKvTableAsync, FetchArchives (both overloads), and FetchVisibleArchive now log that they “should not be used” and/or are “noop”, but they still perform real KV operations (drop table, scans, archive decoding).

If the intent is to deprecate these APIs and truly make them no-ops, they should return immediately after logging without issuing RPCs/reads. If, instead, you just want to discourage use but keep behavior intact, consider rephrasing the messages, e.g.:

  • "DropKvTableAsync is deprecated and may be removed; prefer DropKvTable"
  • "FetchArchives is deprecated; archives are normally fetched from ccmap"

to avoid future confusion about whether these paths actually mutate state.

Also applies to: 1960-1962, 2120-2137, 2554-2563, 2629-2632, 2673-2676, 2739-2743


3090-3094: GetShardIdByPartitionId / InitBucketsInfo: document immutability and usage

One minor suggestion for maintainability: since bucket_infos_ and dss_shard_ids_ are read without locks in GetShardIdByPartitionId and other places, it would help to document the assumption that they are immutable after initialization (and that any future shard scaling must revisit this). A brief comment near their declarations in the header like:

// Immutable after initialization; safe to read without locks.
// If shard scaling is introduced, revisit thread-safety for updates.
std::unordered_map<uint16_t, std::unique_ptr<txservice::BucketInfo>> bucket_infos_;
std::set<uint32_t> dss_shard_ids_;

would make the concurrency model clearer.

Also applies to: 3151-3166

eloq_data_store_service/data_store_service.cpp (2)

217-221: Defensively guard data_shards_ indexing by shard_id in StartService.

StartService indexes data_shards_ via data_shards_[shard_id] for each shard_id returned from cluster_manager_.GetShardsForThisNode() without any bounds check. Every other access here uses .at().

If configuration/topology bugs ever produce a shard id ≥ data_shards_.size(), this becomes undefined behavior. Adding an invariant check here would make failures diagnosable and safer:

    for (uint32_t shard_id : dss_shards)
    {
-        auto &ds_ref = data_shards_[shard_id];
+        assert(shard_id < data_shards_.size());
+        auto &ds_ref = data_shards_[shard_id];

or use .at(shard_id) to match the rest of the file.

Also applies to: 254-260


903-959: Consider zeroing TTL on local Read DB_NOT_OPEN branch for consistency.

In the local Read(...) overload, when the shard status is neither ReadOnly nor ReadWrite, you clear record and ts but leave *ttl untouched before returning DB_NOT_OPEN. Downstream callers that read TTL unconditionally could observe a stale value from the previous successful read.

To keep error cases consistent with “not found” semantics (where TTL is set to 0 in RocksDB path), you might also set *ttl = 0; in this branch.
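A hypothetical sketch of the suggested error-path hygiene (names loosely mirror the Read overload; the status codes and body are stand-ins, not the real implementation): on the not-open branch, clear all output parameters, including *ttl, so callers never observe stale values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class ReadStatus { Ok, DbNotOpen };

ReadStatus ReadSketch(bool shard_open,
                      std::string *record,
                      uint64_t *ts,
                      uint64_t *ttl)
{
    if (!shard_open)
    {
        record->clear();
        *ts = 0;
        *ttl = 0;  // keep error cases consistent with "not found" semantics
        return ReadStatus::DbNotOpen;
    }
    // Placeholder success path: a real read would fill these from RocksDB.
    *record = "value";
    *ts = 100;
    *ttl = 0;
    return ReadStatus::Ok;
}
```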

eloq_data_store_service/internal_request.h (1)

329-470: Per‑shard GetShardId + DecreaseWriteReqCount pattern is coherent across all write‑like requests.

For FlushData*, DeleteRange*, CreateTable*, DropTable*, Scan* (write‑side only for scans that mutate iter cache), and CreateSnapshotForBackup*:

  • Each base request type now exposes virtual uint32_t GetShardId() const.
  • RPC implementations read shard id from the proto (req_->shard_id()); local implementations store shard_id_ in Reset and clear it in Clear.
  • SetFinish consistently calls data_store_service_->DecreaseWriteReqCount(...) with the shard id, aligning cleanly with the server‑side IncreaseWriteReqCount(shard_id) in the corresponding DataStoreService entrypoints.

This design should ensure per‑shard write counters are balanced even when multiple shards are active concurrently.

One minor nit (non‑blocking): some SetFinish implementations call DecreaseWriteReqCount(req_->shard_id()) instead of DecreaseWriteReqCount(GetShardId()). Using the interface everywhere would make the pattern slightly more uniform, but it's functionally equivalent.

Also applies to: 492-654, 865-999, 1001-1134, 1136-1518, 1520-1682
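The counter-balancing pattern described above can be sketched as follows; the types and the fixed-size counter array are illustrative stand-ins for the real request classes and DataStoreService, with kInvalidShard mirroring the UINT32_MAX sentinel:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kInvalidShard = UINT32_MAX;

struct ShardWriteCounters
{
    std::array<std::atomic<uint32_t>, 8> counts_{};

    void IncreaseWriteReqCount(uint32_t shard_id)
    {
        counts_.at(shard_id).fetch_add(1, std::memory_order_relaxed);
    }

    void DecreaseWriteReqCount(uint32_t shard_id)
    {
        counts_.at(shard_id).fetch_sub(1, std::memory_order_relaxed);
    }

    uint32_t InFlight(uint32_t shard_id) const
    {
        return counts_.at(shard_id).load(std::memory_order_relaxed);
    }
};

// Illustrative request: shard_id_ is set in Reset and cleared in Clear, and
// SetFinish releases the count taken when the request entered the service.
struct WriteRequestSketch
{
    uint32_t shard_id_{kInvalidShard};

    void Reset(uint32_t shard_id) { shard_id_ = shard_id; }
    void Clear() { shard_id_ = kInvalidShard; }
    uint32_t GetShardId() const { return shard_id_; }

    void SetFinish(ShardWriteCounters &svc)
    {
        // Going through GetShardId() keeps RPC and local implementations
        // uniform, per the nit above.
        svc.DecreaseWriteReqCount(GetShardId());
    }
};
```

As long as every entrypoint pairs IncreaseWriteReqCount with exactly one SetFinish, the per-shard counters return to zero regardless of how many shards are active concurrently.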

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 429175b and 85d911c.

📒 Files selected for processing (20)
  • data_store_service_client.cpp (102 hunks)
  • data_store_service_client.h (12 hunks)
  • data_store_service_client_closure.cpp (17 hunks)
  • data_store_service_client_closure.h (34 hunks)
  • data_store_service_scanner.cpp (3 hunks)
  • data_store_service_scanner.h (4 hunks)
  • eloq_data_store_service/CMakeLists.txt (0 hunks)
  • eloq_data_store_service/data_store_service.cpp (45 hunks)
  • eloq_data_store_service/data_store_service.h (14 hunks)
  • eloq_data_store_service/data_store_service_util.h (1 hunks)
  • eloq_data_store_service/ds_request.proto (5 hunks)
  • eloq_data_store_service/internal_request.h (55 hunks)
  • eloq_data_store_service/main.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (5 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1 hunks)
  • eloq_data_store_service/rocksdb_data_store.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.cpp (3 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.h (1 hunks)
  • eloq_data_store_service/thread_worker_pool.cpp (2 hunks)
  • eloq_data_store_service/thread_worker_pool.h (1 hunks)
💤 Files with no reviewable changes (1)
  • eloq_data_store_service/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client_closure.cpp
  • data_store_service_client.cpp
  • data_store_service_client.h
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client_closure.cpp
  • eloq_data_store_service/data_store_service.cpp
  • data_store_service_client.cpp
🧬 Code graph analysis (14)
eloq_data_store_service/thread_worker_pool.h (2)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Initialize (163-285)
  • Initialize (163-163)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Initialize (34-84)
  • Initialize (34-34)
eloq_data_store_service/thread_worker_pool.cpp (1)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Initialize (163-285)
  • Initialize (163-163)
eloq_data_store_service/rocksdb_data_store.cpp (2)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (287-301)
  • Shutdown (287-287)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (165-186)
  • Shutdown (165-165)
data_store_service_scanner.h (1)
eloq_data_store_service/internal_request.h (13)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (811-819)
  • string_view (975-983)
  • string_view (1110-1118)
  • string_view (1406-1414)
  • string_view (1421-1429)
  • partition_id_ (647-647)
  • partition_id_ (856-856)
  • partition_id_ (1504-1504)
data_store_service_scanner.cpp (1)
data_store_service_scanner.h (1)
  • data_shard_id_ (266-266)
eloq_data_store_service/rocksdb_data_store_common.h (4)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (287-301)
  • Shutdown (287-287)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (104-122)
  • Shutdown (104-104)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (165-186)
  • Shutdown (165-165)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/internal_request.h (3)
eloq_data_store_service/data_store_service_config.h (2)
  • uint32_t (107-110)
  • EloqDS (98-514)
data_store_service_client_closure.h (33)
  • Reset (81-85)
  • Reset (144-153)
  • Reset (210-216)
  • Reset (269-274)
  • Reset (385-392)
  • Reset (600-619)
  • Reset (933-948)
  • Reset (1154-1177)
  • Reset (1393-1408)
  • Reset (1598-1624)
  • Reset (1627-1658)
  • Reset (1660-1694)
  • Reset (1933-2251)
  • Reset (2328-2347)
  • Reset (2606-2612)
  • Reset (3002-3010)
  • vector (1078-1081)
  • vector (1121-1128)
  • vector (2253-2256)
  • string (804-809)
  • string (2173-2178)
  • string_view (311-368)
  • string_view (752-755)
  • string_view (767-770)
  • string_view (792-797)
  • string_view (1291-1294)
  • string_view (1533-1536)
  • string_view (2133-2136)
  • string_view (2138-2141)
  • string_view (2241-2246)
  • string_view (2453-2456)
  • EloqDS (39-42)
  • EloqDS (44-184)
eloq_data_store_service/data_store.h (1)
  • EloqDS (28-145)
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (2)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • RocksDBCloudDataStore (138-155)
  • RocksDBCloudDataStore (157-163)
eloq_data_store_service/rocksdb_cloud_data_store.h (1)
  • RocksDBCloudDataStore (38-161)
eloq_data_store_service/data_store_service.h (6)
eloq_data_store_service/data_store_service.cpp (12)
  • StartService (242-312)
  • StartService (242-242)
  • Read (399-432)
  • Read (399-402)
  • Read (434-477)
  • Read (434-442)
  • ScanClose (995-1028)
  • ScanClose (995-998)
  • ScanClose (1030-1062)
  • ScanClose (1030-1035)
  • TTLWrapperCache (72-77)
  • TTLWrapperCache (79-88)
eloq_data_store_service/rocksdb_data_store_common.cpp (5)
  • Read (436-497)
  • Read (436-436)
  • key (852-852)
  • ScanClose (1067-1091)
  • ScanClose (1067-1067)
data_store_service_client.cpp (6)
  • Read (1087-1096)
  • Read (1087-1092)
  • Read (3528-3544)
  • Read (3528-3533)
  • ScanClose (3795-3819)
  • ScanClose (3795-3800)
eloq_data_store_service/internal_request.h (10)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (811-819)
  • string_view (975-983)
  • string_view (1110-1118)
  • string_view (1406-1414)
  • string_view (1421-1429)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
eloq_data_store_service/rocksdb_data_store_common.cpp (4)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (104-122)
  • Shutdown (104-104)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (165-186)
  • Shutdown (165-165)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
eloq_data_store_service/data_store_service.cpp (4)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (615-620)
  • PrepareShardingError (525-529)
  • IncreaseWriteReqCount (603-607)
  • DecreaseWriteReqCount (609-613)
eloq_data_store_service/data_store_service_config.cpp (6)
  • IsOwnerOfShard (285-293)
  • IsOwnerOfShard (285-285)
  • IsOwnerOfShard (690-701)
  • IsOwnerOfShard (690-691)
  • PrepareShardingError (904-948)
  • PrepareShardingError (904-905)
eloq_data_store_service/rocksdb_data_store_common.cpp (11)
  • Read (436-497)
  • Read (436-436)
  • key (852-852)
  • start_key (387-387)
  • start_key (699-699)
  • end_key (402-402)
  • end_key (704-704)
  • ScanNext (755-1065)
  • ScanNext (755-755)
  • ScanClose (1067-1091)
  • ScanClose (1067-1067)
data_store_service_client.cpp (8)
  • Read (1087-1096)
  • Read (1087-1092)
  • Read (3528-3544)
  • Read (3528-3533)
  • ScanNext (3720-3753)
  • ScanNext (3720-3734)
  • ScanClose (3795-3819)
  • ScanClose (3795-3800)
data_store_service_client_closure.h (2)
eloq_data_store_service/internal_request.h (13)
  • partition_id_ (647-647)
  • partition_id_ (856-856)
  • partition_id_ (1504-1504)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (811-819)
  • string_view (975-983)
  • string_view (1110-1118)
  • string_view (1406-1414)
  • string_view (1421-1429)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
data_store_service_client.cpp (3)
data_store_service_client.h (4)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
eloq_data_store_service/data_store_service.cpp (8)
  • BatchWriteRecords (853-901)
  • BatchWriteRecords (853-857)
  • BatchWriteRecords (1134-1199)
  • BatchWriteRecords (1134-1147)
  • Read (399-432)
  • Read (399-402)
  • Read (434-477)
  • Read (434-442)
data_store_service_client_closure.cpp (2)
  • PartitionBatchCallback (493-538)
  • PartitionBatchCallback (493-496)
data_store_service_client.h (1)
data_store_service_client.cpp (24)
  • FindFreeNodeIndex (3269-3288)
  • FindFreeNodeIndex (3269-3269)
  • InitBucketsInfo (3132-3197)
  • InitBucketsInfo (3132-3136)
  • OnLeaderStart (2973-2990)
  • OnLeaderStart (2973-2974)
  • OnLeaderStop (2992-3005)
  • OnLeaderStop (2992-2992)
  • OnStartFollowing (3014-3063)
  • OnStartFollowing (3014-3018)
  • GetShardIdByPartitionId (3090-3115)
  • GetShardIdByPartitionId (3090-3091)
  • IsLocalShard (3080-3088)
  • IsLocalShard (3080-3080)
  • GetOwnerNodeIndexOfShard (3199-3209)
  • GetOwnerNodeIndexOfShard (3199-3200)
  • GetAllDataShards (3117-3130)
  • GetAllDataShards (3117-3117)
  • UpdateOwnerNodeIndexOfShard (3211-3267)
  • UpdateOwnerNodeIndexOfShard (3211-3212)
  • HandleShardingError (3290-3323)
  • HandleShardingError (3290-3291)
  • UpgradeShardVersion (3325-3372)
  • UpgradeShardVersion (3325-3328)
🪛 Clang (14.0.6)
eloq_data_store_service/data_store_service_util.h

[error] 24-24: 'atomic' file not found

(clang-diagnostic-error)

🔇 Additional comments (35)
eloq_data_store_service/thread_worker_pool.h (1)

42-42: LGTM!

Adding Initialize() enables explicit lifecycle management, allowing the thread pool to be reused after Shutdown(). This aligns with the per-shard data store reuse pattern introduced in the broader PR.

eloq_data_store_service/rocksdb_data_store_common.cpp (2)

279-282: LGTM!

The else branch correctly reinitializes an existing worker pool, enabling data store reuse after shutdown. This aligns with the per-shard lifecycle management introduced across the PR.


447-458: LGTM!

The updated Read path now extracts the key via req->GetKey() and passes it directly to BuildKey(table_name, partition_id, key). This aligns with the shard-aware proto changes where keys are passed as key_str fields rather than embedded in request objects.

data_store_service_client.h (1)

74-86: LGTM!

The constructor parameters is_bootstrap and bind_data_shard_with_ng correctly initialize the corresponding member variables, enabling flexible shard binding and bootstrap modes for different deployment scenarios.

eloq_data_store_service/rocksdb_data_store.cpp (1)

69-83: LGTM!

The Shutdown sequence correctly delegates to RocksDBDataStoreCommon::Shutdown() first (handling worker pool and scan iterator cleanup), then acquires the lock and closes the database. The db_ pointer is properly nulled after deletion, preventing use-after-free.

eloq_data_store_service/main.cpp (1)

321-322: LGTM!

The simplified StartService(bool) signature reflects the refactored per-shard initialization flow, where shard-specific setup is now handled internally rather than passed as parameters. This improves encapsulation.

eloq_data_store_service/rocksdb_data_store_common.h (1)

165-165: LGTM!

The Shutdown() override enables centralized resource cleanup for query worker pools and scan iterators, supporting the per-shard lifecycle management pattern introduced in this PR.

data_store_service_scanner.cpp (3)

64-67: LGTM!

The ScanNext call now correctly propagates data_shard_id_ to route scan requests to the appropriate shard, aligning with the per-shard architecture introduced in this PR.


109-111: LGTM!

The ScanClose call now correctly includes data_shard_id_ for shard-aware session cleanup, consistent with the scan opening path.


417-419: LGTM!

The per-partition shard_id computation correctly calls GetShardIdByPartitionId with is_range_partition=false for hash partitions, then passes data_shard_id to Reset. This ensures each partition scanner targets the correct shard.

eloq_data_store_service/data_store_service_util.h (1)

24-30: No issues found; the atomic header is properly supported.

The build environment is correctly configured with C++17 (eloq_data_store_service/CMakeLists.txt:27) and C++20 (build_eloq_store.cmake:3), both of which fully support the standard <atomic> header. The static analysis warning appears to be a false positive, as the code is valid and should compile without issues.

data_store_service_client_closure.cpp (3)

71-82: Key/logging changes to use ReadClosure::Key() are consistent.

Using read_closure->Key() instead of Key().back() for logging and key_str extraction aligns with the refactored closure API and avoids truncating multi-part keys. Lifetime and ownership are unchanged vs the previous string_view-from-key-part pattern, so error logs (decode failures, key-not-found paths) and AddResult calls remain safe.

Also applies to: 96-107, 176-181, 352-355


570-632: ShardId propagation in Scan/Read callbacks is consistent with per-shard routing.

Across callbacks (FetchAllDatabaseCallback, DiscoverAllTableNamesCallback, FetchTableRangesCallback, FetchRangeSlicesCallback, FetchTableStatsCallback, LoadRangeSliceCallback, FetchArchivesCallback, FetchRecordArchivesCallback):

  • Using scan_next_closure->ShardId() (or read_closure->ShardId() in the range-slice flows) when issuing follow-up ScanNext/Read calls preserves the shard affinity of the original request across pagination and multi-step reads.
  • SessionId()/kv_session_id_ handling is unchanged, so scan sessions remain intact.
  • The new kv_start_key_ assignment via std::string_view(key.Data(), key.Size()) in LoadRangeSliceCallback simply copies the bytes into a std::string and does not introduce lifetime issues.

This wiring matches the new shard_id field in ScanRequest/ReadRequest and looks correct end-to-end.

Also applies to: 635-701, 819-833, 912-923, 1132-1145, 1239-1252, 1308-1321, 1417-1430, 1443-1456


493-533: Verification confirms the cast is architecturally safe, but the review's refactoring suggestions remain valid.

The investigation found that PartitionBatchCallback is only ever invoked from BatchWriteRecordsClosure::Run() (line 1754), which calls the callback with this (the closure object) as the second parameter. Since both call sites (cpp:427 and cpp:529) register the callback exclusively with BatchWriteRecords, which always creates a BatchWriteRecordsClosure, the static cast to BatchWriteRecordsClosure* is guaranteed to be valid in the current architecture.

However, the review's concerns and suggestions remain legitimate:

  • The cast lacks null-checking or defensive validation
  • The tight coupling to BatchWriteRecordsClosure could become a maintenance burden if the callback mechanism is refactored
  • Storing data_shard_id in PartitionCallbackData instead would decouple the callback from the concrete closure type, improving architectural resilience
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1)

54-66: Per-shard cloud object_path_ is wired correctly; confirm the new default for empty paths is acceptable.

  • Copying cloud_config_ to shard_cloud_config and appending "ds_<shard_id>" (after normalizing the trailing slash) cleanly isolates each shard’s RocksDB instance under its own prefix.
  • The RocksDBCloudDataStore constructor call matches the updated signature (cloud_config, config, create_if_missing, tx_enable_cache_replacement, shard_id, data_store_service).

One thing to verify: previously, an empty object_path_ likely meant “use the bucket root/prefix directly”. Now it is coerced to "rocksdb_cloud_object_path/ds_<shard_id>". If any existing deployments rely on an empty object_path_ mapping to the old layout, this change will move data into a different prefix. If that’s intentional for the multi-shard rollout, great; otherwise, you may want to preserve the old behavior for empty paths and require explicit configuration for the new layout.

Also applies to: 67-75

data_store_service_scanner.h (1)

284-285: Using txservice::total_hash_partitions as HASH_PARTITION_COUNT is a good centralization.

Replacing local or conditional definitions with txservice::total_hash_partitions ensures the hash-partition scanner stays in sync with the global partition-count configuration and removes the previous compile-time branching.

eloq_data_store_service/data_store_service.h (3)

190-228: Shard-aware RPC/local APIs look consistent

The added shard_id parameter and int32_t partition_id across Read, BatchWriteRecords, DeleteRange, ScanNext, and ScanClose match the implementations in data_store_service.cpp and the client-side call sites. Routing now consistently uses (partition_id, shard_id) and aligns with internal_request.h and rocksdb_data_store_common.cpp usage. No issues here.

Also applies to: 256-268, 320-327, 403-417, 440-445


525-529: Good: PrepareShardingError now preserves full topology info

Delegating PrepareShardingError to cluster_manager_.PrepareShardingError restores the richer error contract (error type, new cluster config/primary, versioning) instead of a generic message. This keeps client-side sharding/failover logic working.


678-701: Verify shard_id_ initialization for DataShard helpers

DataShard::shard_id_ defaults to UINT32_MAX and is used as a guard in both FetchDSShardStatus and GetDataStore (== shard_id check). In StartService you initialize data_store_, scan_iter_cache_, and shard_status_, but the snippet provided doesn’t show shard_id_ being set there.

If shard_id_ is only assigned in other open paths (e.g., OpenDataStore/ConnectAndStartDataStore), shards created during StartService might still appear “unbound” to these helpers even though data_store_ is valid. Please double-check that every path that successfully opens a shard sets shard_id_ consistently, or relax the equality check if it’s no longer needed as an ownership guard.

Also applies to: 703-703

data_store_service_client.cpp (9)

151-211: TxConfigsToDssClusterConfig: per-node topology mapping looks correct

The refactored TxConfigsToDssClusterConfig builds nodes_map once, adds only candidate nodes as shard members, updates the primary via ng_leaders, and sets ThisNode + switches the shard to ReadWrite when this node is leader. This is a clean separation of concerns and avoids redundant leader_id == node_id checks from the earlier version.

One small nit: you call nodes_map.find(leader_id) twice; caching the iterator after the assert would avoid the duplicate lookup, but this is purely cosmetic.


335-341: PutAll: shard routing via GetShardIdByPartitionId – confirm partition_id semantics

PutAll now routes each partition’s batches using:

  • partition_state->partition_id set from KvPartitionIdOf(...) for both hash and range tables, and
  • GetShardIdByPartitionId(partition_state->partition_id, is_range_partitioned) where is_range_partitioned = !table_name.IsHashPartitioned().

This is structurally sound, but correctness hinges on KvPartitionIdOf(partition_id, /*is_range=*/true/false) returning the same partition id flavor (hash vs range) that txservice::Sharder::Map{HashPartition,RangeId}ToBucketId expect. If KvPartitionIdOf already normalizes to the “KV-partition id” used everywhere else, then this is fine; if it mixes range ids with hash-partition ids, you could misroute to the wrong shard.

Please double-check the KvPartitionIdOf overloads against the Sharder mapping to ensure the combination (partition_state->partition_id, is_range_partitioned) always lands on the intended shard.

Also applies to: 349-375, 404-420


592-605: Shard-aware metadata and maintenance calls are wired consistently

All the catalog/statistics/range/DB maintenance paths now consistently:

  • compute a KV partition via KvPartitionIdOf(...), and
  • derive shard_id via GetShardIdByPartitionId(kv_partition_id, /*is_range=*/false) (or true where appropriate),
  • then pass both into Read, ScanNext, BatchWriteRecords, or DeleteRange.

This matches the server-side DataStoreService signatures and the new bucket-based shard mapping semantics. I don’t see mismatches between is_range_partition flags and the table’s partitioning mode in these call sites.

Also applies to: 619-635, 650-686, 816-821, 870-873, 903-911, 934-949, 973-1005, 1050-1055, 1133-1153, 1384-1408, 1445-1455, 1510-1516, 1618-1661, 1676-1719, 1736-1755, 1760-1787, 1859-1862


2973-2990: Leader/follower hooks: shard–node-group binding assumption

OnLeaderStart, OnLeaderStop, and OnStartFollowing now conditionally open/close shards on the local DataStoreService when bind_data_shard_with_ng_ is true, and they treat ng_id as the data_shard_id (DSS shard id) throughout. They also update the cluster manager’s primary node and shard version and then call SetupConfig + Connect().

This is a clean and explicit coupling, but it assumes a 1:1 mapping between node-group id and data shard id. Please confirm that this invariant is guaranteed by the control plane; if it ever changes (e.g., multiple DSS shards per node group), these hooks will need to derive the proper shard set instead of using ng_id directly.

Also applies to: 2992-3003, 3014-3063


3080-3085: IsLocalShard is a good abstraction over server-side ownership

IsLocalShard simply delegates to data_store_service_->IsOwnerOfShard(shard_id), centralizing shard ownership semantics in the server. All local-vs-remote routing branches (Read/DeleteRange/ScanNext/ScanClose/FlushData/DropTable/CreateSnapshotForBackup) use this helper consistently, which keeps the client side free of duplicated status logic.

No issues here.


3090-3115: Bucket-based shard mapping & InitBucketsInfo: invariants look sound

The new shard lookup:

  • maps partition_id to a bucket_id via txservice::Sharder::MapRangeIdToBucketId or MapHashPartitionIdToBucketId depending on is_range_partition, and
  • uses bucket_infos_[bucket_id]->BucketOwner() to pick a shard, validating that the shard id exists in dss_shard_ids_.

InitBucketsInfo:

  • rejects empty node_groups early,
  • generates up to 64 virtual node positions per node group using local std::mt19937 RNGs (no shared global state), and
  • fills ng_bucket_infos so that every bucket_id in [0, total_range_buckets) is assigned to some group, wrapping to the first virtual node when iterating past the last random position.

Given the learnings that bucket_id is guaranteed < UINT16_MAX, the use of uint16_t here is safe, and the ring-walk logic matches the comment (“first virtual node larger than bucket_id” with wrap-around).

A couple of minor notes:

  • GetShardIdByPartitionId returns UINT32_MAX on a missing bucket; callers that propagate this as a shard_id should treat it as an error instead of blindly routing (see separate comment on GetOwnerNodeIndexOfShard).
  • The comment near dss_shard_ids_ and bucket_infos_ could explicitly state that these structures are immutable after initialization, justifying unsynchronized reads outside GetAllDataShards().

Overall, the design looks correct.

Also applies to: 3117-3130, 3132-3197
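The virtual-node ring walk described above can be sketched as below; the bucket count, 64-positions-per-group figure, and seeding scheme are illustrative assumptions, not the real InitBucketsInfo: each node group gets random positions on the bucket-id ring, and a bucket is owned by the group holding the first virtual node strictly larger than its id, wrapping to the smallest position.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <random>
#include <vector>

constexpr uint16_t kTotalBuckets = 1024;
constexpr int kVirtualNodesPerGroup = 64;

std::map<uint16_t, uint32_t> BuildBucketOwners(
    const std::vector<uint32_t> &node_groups)
{
    // Ring position -> owning node group. Duplicate positions keep the
    // first inserter, which is fine for a sketch.
    std::map<uint16_t, uint32_t> ring;
    for (uint32_t ng : node_groups)
    {
        std::mt19937 rng(ng);  // local RNG, no shared global state
        std::uniform_int_distribution<uint32_t> dist(0, kTotalBuckets - 1);
        for (int i = 0; i < kVirtualNodesPerGroup; ++i)
        {
            ring.emplace(static_cast<uint16_t>(dist(rng)), ng);
        }
    }

    std::map<uint16_t, uint32_t> owners;
    for (uint16_t bucket_id = 0; bucket_id < kTotalBuckets; ++bucket_id)
    {
        // First virtual node larger than bucket_id, wrapping to the start.
        auto it = ring.upper_bound(bucket_id);
        if (it == ring.end())
        {
            it = ring.begin();
        }
        owners[bucket_id] = it->second;
    }
    return owners;
}
```

With per-group seeded RNGs the assignment is deterministic across nodes, which is what lets the map be built once and read lock-free thereafter.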


4003-4016: InitTableLastRangePartitionId: safer handling of Sequences partitioning

The updated InitTableLastRangePartitionId now:

  • computes kv_partition_id/data_shard_id using GetShardIdByPartitionId only if Sequences::table_name_ is hash-partitioned, and
  • logs an error and returns false otherwise.

This removes the previous unconditional assert(false) and unreachable code, which is a clear improvement for robustness. The write to the sequences KV table then uses (kv_partition_id, data_shard_id) consistently.

Looks good.

Also applies to: 4030-4041


3548-3560: Shard-aware local/remote routing in data path looks correct

In ReadInternal, DeleteRangeInternal, ScanNextInternal, ScanCloseInternal, and BatchWriteRecordsInternal you now route based on IsLocalShard(shard_id) and thread shard_id through to either:

  • the local DataStoreService (Read, DeleteRange, ScanNext, ScanClose, BatchWriteRecords), or
  • remote RPCs using GetOwnerNodeIndexOfShard(shard_id).

Combined with the earlier changes to client-facing APIs (always passing both partition_id and shard_id), this yields a coherent per-shard data path.

Aside from the missing guard on GetOwnerNodeIndexOfShard in the remote branches (covered in a separate comment), the routing logic itself looks correct.

Also applies to: 3604-3615, 3758-3777, 3825-3833, 4160-4176


4547-4573: Catalog upsert/delete now shard-aware without changing semantics

UpsertCatalog and DeleteCatalog now compute shard_id via GetShardIdByPartitionId(partition_id, false) (with partition_id fixed at 0), and pass it alongside the partition into BatchWriteRecords. This matches how other catalog operations (e.g., FetchTable, DiscoverAllTableNames) derive shard ids and doesn’t alter the logical behavior—catalogs are still effectively stored in a single shard.

No issues here.

Also applies to: 4597-4624

eloq_data_store_service/data_store_service.cpp (1)

1893-1927: Per‑shard OpenDataStore logging looks good.

The new OpenDataStore helper cleanly gates on ds_ref.shard_status_ == Closed, delegates to ConnectAndStartDataStore, and logs success/failure with shard id, current status, and duration. This should be very helpful for diagnosing shard‑level open latency.

eloq_data_store_service/internal_request.h (1)

36-68: Shard‑aware write request interface and accounting look consistent.

Introducing GetShardId() on WriteRecordsRequest and plumbing it through WriteRecordsRpcRequest and WriteRecordsLocalRequest (with a stored shard_id_) gives a clean way to:

  • route writes by shard in the store, and
  • call DecreaseWriteReqCount(shard_id) in SetFinish to mirror the per‑shard IncreaseWriteReqCount(shard_id) in DataStoreService.

The Clear/Reset implementations correctly initialize shard_id_ (to UINT32_MAX and then to the real ID) and all overrides use the proto shard_id field or the stored shard_id_ consistently.

Also applies to: 70-179, 181-327

data_store_service_client_closure.h (5)

439-537: ReadClosure’s shard‑aware routing and accessors look correct.

The ReadClosure changes:

  • Add explicit partition_id_ (int32_t) and shard_id_ (uint32_t) members, reset to sentinel values in Clear().
  • Populate both fields in Reset(...), and use them in PrepareRequest to set partition_id and shard_id on the proto.
  • Use shard_id_ when calling UpdateOwnerNodeIndexOfShard on RPC failures and sharding errors.
  • Expose PartitionId(), ShardId(), and Key() accessors, which align with how other closures surface this context.

This keeps the client’s routing state per‑shard and per‑partition, and correctly ties retry logic to the shard id.

Also applies to: 592-899


1130-1366: DeleteRangeClosure now correctly carries shard_id through request and retry paths.

For DeleteRangeClosure:

  • Reset(...) stores both partition_id_ (int32_t) and shard_id_ (uint32_t).
  • PrepareRequest sets both partition_id and shard_id on the outgoing proto.
  • On RPC failure, Run() uses shard_id_ with UpdateOwnerNodeIndexOfShard, matching the new per‑shard owner mapping.
  • New accessors PartitionId() and ShardId() expose this state to callers.

The Clear/reset logic keeps shard_id_ in a known sentinel state when the closure returns to the pool, which is good for debugging.


1585-1892: BatchWriteRecordsClosure’s per‑shard state is wired consistently.

BatchWriteRecordsClosure now:

  • Tracks partition_id_ as int32_t and shard_id_ as uint32_t, reset to sentinel values in Clear()/Reset().
  • Accepts shard_id in both overloads of Reset(...) and stores it.
  • Sets partition_id and shard_id on the proto in PrepareRequest.
  • Uses shard_id_ when calling UpdateOwnerNodeIndexOfShard on RPC failures.
  • Exposes PartitionId() and ShardId() const accessors for external use.

This matches the server‑side expectation that batch writes are routed by shard id and aligns retry/sharding updates with the correct shard.


1894-2294: ScanNextClosure correctly threads shard_id and partition_id through scan requests.

In ScanNextClosure:

  • partition_id_ is now int32_t and shard_id_ is uint32_t, both reset in Clear().
  • Reset(...) stores both and PrepareRequest sets partition_id and shard_id on the ScanRequest proto.
  • On RPC failures, Run() uses shard_id_ with UpdateOwnerNodeIndexOfShard, so sharding metadata is updated for the correct shard.
  • New accessors PartitionId() and ShardId() provide this context to callers, which is useful for higher‑level routing/metrics.

Session id and search‑condition handling remain unchanged, so scan semantics should be preserved while becoming shard‑aware.


2982-3022: FetchArchivesCallbackData and CreateSnapshotForBackupCallbackData type adjustments align with upstream changes.

FetchArchivesCallbackData now stores partition_id_ as int32_t, and CreateSnapshotForBackupCallbackData remains unchanged except for minor formatting. This lines up with the int32_t partition id change in internal_request.h and maintains consistent typing across client and server.

Comment on lines 3097 to +3205
uint32_t DataStoreServiceClient::GetOwnerNodeIndexOfShard(
    uint32_t shard_id) const
{
    if (shard_id >= dss_shards_.size())
    {
        LOG(ERROR) << "shard_id " << shard_id << " exceeds array bounds";
        return UINT32_MAX;


⚠️ Potential issue | 🔴 Critical

Critical: guard GetOwnerNodeIndexOfShard error paths before indexing dss_nodes_

GetOwnerNodeIndexOfShard and UpdateOwnerNodeIndexOfShard now log an error and return an out-of-band value when shard_id >= dss_shards_.size():

if (shard_id >= dss_shards_.size())
{
    LOG(ERROR) << "shard_id " << shard_id << " exceeds array bounds";
    return UINT32_MAX;  // or false
}

However, all the call sites assume the returned node index is valid and immediately index into dss_nodes_:

  • ReadInternal (remote path)
  • DeleteRangeInternal
  • FlushDataInternal
  • DropTableInternal
  • ScanNextInternal
  • ScanCloseInternal
  • CreateSnapshotForBackupInternal (remote branch)

If shard_id is ever invalid or if GetShardIdByPartitionId falls back to UINT32_MAX, these paths will end up doing dss_nodes_[UINT32_MAX].Channel(), which is undefined behavior and likely a crash. Previously, the LOG(FATAL) made this fail-fast; after downgrading to LOG(ERROR) you need an explicit guard.

A minimal pattern for each remote branch would be:

-    uint32_t node_index = GetOwnerNodeIndexOfShard(read_closure->ShardId());
+    uint32_t node_index = GetOwnerNodeIndexOfShard(read_closure->ShardId());
+    if (node_index == UINT32_MAX)
+    {
+        // Fill error into result/closure and complete the RPC instead of crashing.
+        read_closure->PrepareError(
+            ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER,
+            "Invalid shard_id or missing owner node");
+        read_closure->Run();
+        return;
+    }
     read_closure->SetRemoteNodeIndex(node_index);
     auto *channel = dss_nodes_[node_index].Channel();

and similarly for the other closures (DeleteRangeClosure, FlushDataClosure, DropTableClosure, ScanNextClosure, CreateSnapshotForBackupClosure).

This preserves your softer error-handling while avoiding out-of-bounds access when the shard topology is inconsistent or the mapping fails.

Also applies to: 3214-3218, 3330-3334, 3366-3367, 3548-3574, 3604-3630, 3662-3673, 3707-3717, 3781-3792, 3837-3848, 2918-2930

🤖 Prompt for AI Agents
In data_store_service_client.cpp around lines 3199-3205 (and similarly at
3214-3218, 3330-3334, 3366-3367, 3548-3574, 3604-3630, 3662-3673, 3707-3717,
3781-3792, 3837-3848, and 2918-2930) the function now returns UINT32_MAX on
invalid shard_id but callers immediately index dss_nodes_ with that value; add
an explicit guard after calling GetOwnerNodeIndexOfShard (and after
GetShardIdByPartitionId where it can return UINT32_MAX) to detect the
out-of-band value, log an error, set the RPC/closure to failed (or invoke the
appropriate error callback), and return early from the remote branch so you
never do dss_nodes_[owner_idx].Channel(); apply this pattern to each listed
closure/remote path to prevent out-of-bounds access while preserving the softer
error handling.

Comment on lines 88 to +110
for (auto &[shard_id, shard] : all_shards)
{
    if (shard_id >= dss_shards_.size())
    {
        LOG(FATAL) << "Shard id " << shard_id
                   << " is out of range, should expand the hard-coded "
                      "dss_shards_ size.";
    }
    uint32_t node_idx = FindFreeNodeIndex();
    auto &node_ref = dss_nodes_[node_idx];
    node_ref.Reset(shard.nodes_[0].host_name_,
                   shard.nodes_[0].port_,
                   shard.version_);
    dss_shards_[shard_id].store(shard_id);
    dss_shards_[shard_id].store(node_idx);
    dss_shard_ids_.insert(shard_id);
}

// init bucket infos
InitBucketsInfo(dss_shard_ids_, 0, bucket_infos_);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Good: Fatal error on shard_id bounds violation during construction.

The constructor now validates shard_id < dss_shards_.size() and logs a fatal error if exceeded. This prevents undefined behavior during initialization. Combined with bounds checking in runtime accessors like GetOwnerNodeIndexOfShard (see data_store_service_client.cpp:3198-3208), the array access pattern is safer.

However, as noted in past reviews, ensure config validation limits shard_count to ≤ 1000 during parsing to prevent invalid configurations from reaching this point. Verify that data_store_service_config.cpp enforces this constraint.


🏁 Script executed:

#!/bin/bash
# Verify shard_count validation in config parsing
echo "=== Checking config validation for shard_count bounds ==="
rg -n "shard_count.*1000|shard_count.*>.*999|MAX.*SHARD" eloq_data_store_service/data_store_service_config.cpp -C 3

echo ""
echo "=== Checking for shard_count assignment/validation ==="
ast-grep --pattern 'shard_count_ = $_' 

Length of output: 595


I need to examine the actual validation logic around these shard_count assignments to determine if the review comment's concern is valid.


🏁 Script executed:

#!/bin/bash
# Examine context around shard_count assignment from config file
echo "=== Context around shard_count assignment (lines 65-75) ==="
sed -n '65,75p' eloq_data_store_service/data_store_service_config.cpp

echo ""
echo "=== Context around shard_count assignment from shards_map (lines 180-190) ==="
sed -n '180,190p' eloq_data_store_service/data_store_service_config.cpp

echo ""
echo "=== Searching for validation functions or checks on shard_count ==="
rg -n "shard_count" eloq_data_store_service/data_store_service_config.cpp | head -20

echo ""
echo "=== Looking for dss_shards_ initialization to determine its size ==="
rg -n "dss_shards_.*size\|dss_shards_\s*\(" eloq_data_store_service/data_store_service_client.h -C 2

Length of output: 1391


🏁 Script executed:

#!/bin/bash
# Examine context around shard_count assignments
echo "=== Context around config file read (lines 65-75) ==="
sed -n '65,75p' eloq_data_store_service/data_store_service_config.cpp

echo ""
echo "=== Context around shards_map assignment (lines 180-190) ==="
sed -n '180,190p' eloq_data_store_service/data_store_service_config.cpp

echo ""
echo "=== Search for any validation or bounds checks on shard_count ==="
rg -n "shard_count.*[<>]|shard_count.*validate|shard_count.*check" eloq_data_store_service/ -C 2

echo ""
echo "=== Looking for MAX_SHARD or similar constants ==="
rg -n "MAX.*SHARD|SHARD.*MAX|1000" eloq_data_store_service/data_store_service_client.h -B 2 -A 2

echo ""
echo "=== Check dss_shards_ declaration and size ==="
rg -n "dss_shards_\s*\[|std::array.*dss_shards_" eloq_data_store_service/data_store_service_client.h -C 1

Length of output: 1517


I need to locate the correct path for the header file and dss_shards_ declaration. Let me search for the file.


🏁 Script executed:

#!/bin/bash
# Find the correct path for data_store_service_client.h
echo "=== Finding data_store_service_client.h ==="
fd -t f "data_store_service_client.h" 

echo ""
echo "=== Finding dss_shards_ declaration ==="
rg -n "dss_shards_\s*\[|std::array.*dss_shards_|dss_shards_\s*;" -C 2

echo ""
echo "=== Looking for array size constants (1000, MAX) ==="
rg -n "1000|MAX_SHARD" . -C 1 | head -30

Length of output: 4792


Config validation for shard_count is missing—add bounds check during parsing.

Verification confirms the concern: data_store_service_config.cpp (lines 69-70) reads shard_count from the config file with no upper bounds validation. While the constructor adds a runtime guard (shard_id < dss_shards_.size()), dss_shards_ is fixed at 1000 elements (line 749 of data_store_service_client.h), and invalid configs can be parsed and stored without detection.

Add validation during config parsing to reject shard_count > 1000 before it reaches the client, ensuring invalid configurations fail fast at parse time rather than at runtime.

🤖 Prompt for AI Agents
In data_store_service_config.cpp around lines 69-70, add validation to reject
parsed shard_count values greater than the hard limit (1000) before storing them
in the config: after reading shard_count, check if shard_count == 0 or
shard_count > 1000 and return/log a clear parsing/configuration error (or throw)
so the config load fails fast; update any config parsing return path to
propagate the failure to callers so invalid configs are not accepted and later
cause runtime FATALs in data_store_service_client.h.

data store service support multi shard

fix error on leader changed

change total_range_buckets and total_hash_partitions to 1024

get shard_id based on kv_partition_id

update shards object path on s3, "object_path/ds_0, object_path/ds_1 ..."

Don't save the bucket_id as the prefix of key in datastore.

Remove the compile option 'USE_ONE_ELOQDSS_PARTITION_ENABLED'.

Add log to print open rocksdb cloud time

Replace srand/rand with std::mt19937 to calculate the distribution of buckets

fix RocksdbCloudDataStore shutdown

change TTLWrapperCache to unique_ptr on data_shard

add arg bind_data_shard_with_ng

don't set data_store_ to null on shutdown
@lzxddz lzxddz force-pushed the dss_multi_shard-rebase branch from 511ee98 to dab3aa8 on November 20, 2025 12:00

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
eloq_data_store_service/data_store_service.cpp (1)

352-385: Race between ConnectAndStartDataStore and ProcessSyncFileCache around is_file_sync_running_

ConnectAndStartDataStore waits for file cache sync to finish before opening a shard:

while (shard_ref.is_file_sync_running_.load(std::memory_order_relaxed))
{
    bthread_usleep(10000);
}

but ProcessSyncFileCache only sets this flag after some initial work:

// after building directory iterator
ds_ref.is_file_sync_running_.store(true, std::memory_order_release);
for (const auto &entry : dir_ite) {
    if (ds_ref.shard_status_.load(...) != DSShardStatus::Closed) {
        ...
        ds_ref.is_file_sync_running_.store(false, std::memory_order_release);
        ...
        return;
    }
    ...
}

A timing window exists where:

  1. SyncFileCache RPC has been accepted for a Closed shard but ProcessSyncFileCache has not yet set is_file_sync_running_ = true;
  2. ConnectAndStartDataStore sees is_file_sync_running_ == false, skips the wait, and starts opening the RocksDB instance;
  3. ProcessSyncFileCache then starts deleting/downloading SST files in db_path concurrently with the DB open.

This can lead to undefined behavior or corruption at the filesystem/DB level.

Consider tightening the contract by:

  • Setting is_file_sync_running_ to true before queuing work (in SyncFileCache) or at the very start of ProcessSyncFileCache, and
  • Clearing it in a single RAII guard at all exit paths (success and error), e.g.:
void DataStoreService::ProcessSyncFileCache(SyncFileCacheLocalRequest *req,
                                            uint32_t shard_id)
{
    auto &ds_ref = data_shards_.at(shard_id);
    ds_ref.is_file_sync_running_.store(true, std::memory_order_release);
    auto guard = gsl::finally([&ds_ref] {
        ds_ref.is_file_sync_running_.store(false, std::memory_order_release);
    });

    ...
}

and changing SyncFileCache to pass shard_id explicitly. This ensures ConnectAndStartDataStore never races with an in‑progress file sync.

Also applies to: 1405-1477

♻️ Duplicate comments (3)
eloq_data_store_service/ds_request.proto (1)

135-149: Fix BatchWriteRecordsRequest field renumbering to preserve wire compatibility.

BatchWriteRecordsRequest currently renumbers existing fields:

  • skip_wal: 3 → 4
  • items: 4 → 5
  • New shard_id: 3

Changing field numbers of existing fields breaks compatibility with any deployed code that still uses the old schema. Protobuf field numbers are part of the serialization contract and must be stable.

You should keep the original numbers for existing fields and assign shard_id a new, unused tag (e.g., 6), for example:

  string kv_table_name = 1;
  int32 partition_id = 2;
  bool skip_wal = 3;        // keep original tag
  repeated Item items = 4;  // keep original tag
  uint32 shard_id = 6;      // new field number

Then wire all new code to use the new shard_id field.

data_store_service_scanner.h (1)

235-238: Align GetPartitionId() return type with the int32_t member.

SinglePartitionScanner now stores:

int32_t partition_id_;

but GetPartitionId() still returns uint32_t. This causes an implicit signed→unsigned conversion on every call and is inconsistent with:

  • The proto definitions (int32 partition_id), and
  • Other interfaces like internal_request.h that expose int32_t partition_id.

Even though current partition IDs are non‑negative and within range, keeping the types aligned avoids subtle bugs and makes the API clearer.

Recommend changing the getter to:

int32_t GetPartitionId()
{
    return partition_id_;
}

(and updating any callers that depend on an unsigned return type).

Also applies to: 264-267

eloq_data_store_service/data_store_service.h (1)

603-611: Add bounds checks before indexing data_shards_ to avoid out-of-range aborts

The helper methods:

  • FetchDSShardStatus(uint32_t shard_id),
  • IncreaseWriteReqCount(uint32_t shard_id),
  • DecreaseWriteReqCount(uint32_t shard_id),
  • IsOwnerOfShard(uint32_t shard_id) const,
  • GetDataStore(uint32_t shard_id),

all use data_shards_.at(shard_id) without validating that shard_id < data_shards_.size(). Since many RPC handlers take shard_id directly from the request and call these helpers, a malformed or misrouted request with a large shard id will trigger std::array::at out-of-range and likely terminate the process.

To harden the service, add bounds checks and treat out-of-range shard ids as “closed / not owner”, e.g.:

DSShardStatus FetchDSShardStatus(uint32_t shard_id)
{
    if (shard_id >= data_shards_.size()) {
        return DSShardStatus::Closed;
    }

    const auto &ds_ref = data_shards_[shard_id];
    if (ds_ref.shard_id_ == shard_id) {
        return ds_ref.shard_status_.load(std::memory_order_acquire);
    }
    return DSShardStatus::Closed;
}

void IncreaseWriteReqCount(uint32_t shard_id)
{
    if (shard_id >= data_shards_.size()) {
        return;
    }
    data_shards_[shard_id].ongoing_write_requests_.fetch_add(
        1, std::memory_order_release);
}

// Apply similar guards in DecreaseWriteReqCount, IsOwnerOfShard, and GetDataStore.

This preserves existing semantics for valid shard ids while making the service robust against bad input.

Also applies to: 624-634, 636-641, 652-662

🧹 Nitpick comments (6)
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1)

54-66: Shard-specific object_path_ handling changes behavior; double-check defaults and reporting

The per‑shard copy/normalization looks good, but there are two behavioral points worth confirming:

  • Line 58: when cloud_config_.object_path_ is empty, you now force "rocksdb_cloud_object_path/" as the base prefix before appending ds_<shard_id>. If existing deployments relied on an empty object_path_ meaning “bucket root” (or some other implicit default), this will redirect new shards into a different S3 subprefix and may not see existing data. If backward compatibility is important, consider either:

    • Keeping the empty‑string semantics and only appending ds_<shard_id> (e.g., "ds_0", "ds_1", …), or
    • Making the new default prefix an explicit config option rather than hard‑coding it here.
  • Lines 100‑103: GetS3ObjectPath() still exposes the unsharded cloud_config_.object_path_, while RocksDBCloudDataStore instances now use the per‑shard shard_cloud_config.object_path_ (which includes /ds_<shard_id>). If any callers assume GetS3ObjectPath() reflects the actual DB path, this discrepancy could be confusing. It may be worth either documenting that this returns the base (unsharded) prefix, or updating it to make the sharding semantics explicit where it’s used.

Also applies to: 68-68, 100-103

rocksdb_handler.cpp (1)

2105-2129: ng_id parameter currently unused in RocksDBHandler leader/follower callbacks

OnLeaderStart(uint32_t ng_id, ...) and OnStartFollowing(uint32_t ng_id, ...) correctly adopt the new interface but ignore ng_id in their bodies. If this is only a placeholder for future sharding logic, consider explicitly marking it unused (e.g., /*ng_id*/ or [[maybe_unused]]) to avoid compiler warnings in stricter builds.

Also applies to: 2174-2199

eloq_data_store_service/thread_worker_pool.cpp (1)

28-32: Re-init semantics are reasonable; consider explicit work-queue clearing policy.

The new Initialize() plus the updated Shutdown() give you a reusable pool object:

  • Ctor immediately calls Initialize(), so first-use behavior is unchanged.
  • Shutdown() now joins all workers, clears workers_, and resets shutdown_indicator_ to false, allowing Initialize() to spin up a fresh set of threads on the same pool instance.

Two points to be explicit about:

  1. Initialize() spin-waits while shutdown_indicator_ is true. This is fine under the intended usage (no concurrent Initialize()/Shutdown()), but worth documenting as non–thread safe for concurrent callers.

  2. Shutdown() does not clear work_queue_. Any unprocessed work present when shutdown_indicator_ is set will remain in the deque, and after a subsequent Initialize() those leftovers will be picked up by the new workers. If the intent is “drain then forget” vs. “drain across restarts”, you may want to either:

    • explicitly clear work_queue_ in Shutdown() under the same mutex, or
    • add a comment that this cross-restart delivery is intentional.

Given the current usage in RocksDBDataStoreCommon::Initialize(), this is more of a behavioral clarification than a bug.

Also applies to: 34-84, 104-122

eloq_data_store_service/rocksdb_data_store_common.cpp (2)

274-283: Query worker pool reuse and datastore Shutdown sequencing look sound

The new reuse path in Initialize() (re‑initializing an existing query_worker_pool_ instead of always recreating it) plus the explicit Shutdown() that stops the pool and then erases scan iterators align well with the per‑shard open/close flow driven by ConnectAndStartDataStore and DataShard::ShutDown. This should allow clean reopen of a shard without leaking threads or leaving stale iterators behind.

If ThreadWorkerPool::Initialize() has any preconditions after Shutdown(), it’d be worth documenting them; otherwise this pattern looks fine as‑is.

Also applies to: 287-301


436-459: Guard against negative partition ids in Read when casting to uint32_t

Read() now obtains the partition via int32_t GetPartitionId() (through the interface) and immediately stores it in a uint32_t:

uint32_t partition_id = req->GetPartitionId();

If a caller ever passes a negative partition_id, this will silently wrap to a large unsigned value and produce an unexpected key prefix.

Consider defensively asserting or clamping before the cast, e.g.:

int32_t part = req->GetPartitionId();
assert(part >= 0);
uint32_t partition_id = static_cast<uint32_t>(part);

or return a DB_NOT_OPEN / argument error if the invariant partition_id >= 0 is not guaranteed elsewhere.

eloq_data_store_service/internal_request.h (1)

471-655: Read/Scan/CreateSnapshot requests now correctly carry shard id; only minor robustness nit

The expanded interfaces:

  • ReadRequest with GetKey(), GetPartitionId(), and GetShardId(),
  • ScanRequest with GetPartitionId() and GetShardId(),
  • CreateSnapshotForBackupRequest with GetShardId(),

and their implementations in the RPC and local variants (ReadRpcRequest/ReadLocalRequest, ScanRpcRequest/ScanLocalRequest, CreateSnapshotForBackupRpcRequest/CreateSnapshotForBackupLocalRequest) are consistent with:

  • ds_request.proto field usage (key_str, partition_id, shard_id, etc.),
  • The datastore APIs (RocksDBDataStoreCommon::Read, ScanNext, ScanClose, snapshot methods),
  • The per-shard write-counting logic (snapshot SetFinish uses DecreaseWriteReqCount(shard_id)).

One small optional improvement: ScanLocalRequest::GetSearchConditions assumes search_conditions_ is non-null, whereas GetSearchConditionsSize() already guards on null. To make it self-contained and future-proof, you could add:

if (!search_conditions_) {
    return nullptr;
}

at the top of GetSearchConditions.

Overall, the shard-aware request interfaces look correct.

Also applies to: 663-863, 865-1134, 1136-1518, 1520-1682

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 511ee98 and dab3aa8.

📒 Files selected for processing (22)
  • data_store_service_client.cpp (102 hunks)
  • data_store_service_client.h (12 hunks)
  • data_store_service_client_closure.cpp (17 hunks)
  • data_store_service_client_closure.h (34 hunks)
  • data_store_service_scanner.cpp (3 hunks)
  • data_store_service_scanner.h (4 hunks)
  • eloq_data_store_service/CMakeLists.txt (0 hunks)
  • eloq_data_store_service/data_store_service.cpp (54 hunks)
  • eloq_data_store_service/data_store_service.h (14 hunks)
  • eloq_data_store_service/data_store_service_util.h (1 hunks)
  • eloq_data_store_service/ds_request.proto (5 hunks)
  • eloq_data_store_service/internal_request.h (55 hunks)
  • eloq_data_store_service/main.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (5 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store_factory.h (1 hunks)
  • eloq_data_store_service/rocksdb_data_store.cpp (1 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.cpp (3 hunks)
  • eloq_data_store_service/rocksdb_data_store_common.h (1 hunks)
  • eloq_data_store_service/thread_worker_pool.cpp (2 hunks)
  • eloq_data_store_service/thread_worker_pool.h (1 hunks)
  • rocksdb_handler.cpp (2 hunks)
  • rocksdb_handler.h (1 hunks)
💤 Files with no reviewable changes (1)
  • eloq_data_store_service/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (2)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp
  • eloq_data_store_service/main.cpp
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-17T06:36:11.137Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 80
File: rocksdb_handler.cpp:1226-1335
Timestamp: 2025-09-17T06:36:11.137Z
Learning: In the EloqData store_handler codebase, the maximum bucket ID is guaranteed to be smaller than UINT16_MAX (65535), so bucket_id + 1 operations do not risk integer overflow when used in FetchBucketData methods.

Applied to files:

  • data_store_service_client_closure.cpp
  • data_store_service_client.h
  • data_store_service_client.cpp
📚 Learning: 2025-09-17T11:08:35.322Z
Learnt from: liunyl
Repo: eloqdata/store_handler PR: 83
File: data_store_service_client_closure.cpp:408-448
Timestamp: 2025-09-17T11:08:35.322Z
Learning: In the DataStoreServiceClient batch write system, the global coordinator (SyncPutAllData) manages concurrency at the partition level, not at the individual batch level. The global unfinished_request_cnt_ is only decremented when an entire partition is fully completed or failed, not after each individual batch within a partition.

Applied to files:

  • data_store_service_client_closure.cpp
  • eloq_data_store_service/data_store_service.cpp
  • data_store_service_client.cpp
🧬 Code graph analysis (17)
data_store_service_client_closure.cpp (1)
data_store_service_client.cpp (1)
  • callback_data (2576-2582)
data_store_service_scanner.h (1)
eloq_data_store_service/internal_request.h (9)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (811-819)
  • partition_id_ (647-647)
  • partition_id_ (856-856)
  • partition_id_ (1504-1504)
eloq_data_store_service/rocksdb_data_store_common.h (5)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (167-188)
  • Shutdown (167-167)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (287-301)
  • Shutdown (287-287)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (104-122)
  • Shutdown (104-104)
rocksdb_handler.cpp (4)
  • Shutdown (2324-2359)
  • Shutdown (2324-2324)
  • Shutdown (3405-3437)
  • Shutdown (3405-3405)
eloq_data_store_service/rocksdb_cloud_data_store_factory.h (2)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • RocksDBCloudDataStore (140-157)
  • RocksDBCloudDataStore (159-165)
eloq_data_store_service/rocksdb_cloud_data_store.h (1)
  • RocksDBCloudDataStore (38-175)
rocksdb_handler.h (2)
data_store_service_client.cpp (4)
  • OnLeaderStart (2973-2990)
  • OnLeaderStart (2973-2974)
  • OnStartFollowing (3014-3063)
  • OnStartFollowing (3014-3018)
rocksdb_handler.cpp (4)
  • OnLeaderStart (2105-2129)
  • OnLeaderStart (2105-2105)
  • OnStartFollowing (2174-2199)
  • OnStartFollowing (2174-2178)
rocksdb_handler.cpp (1)
data_store_service_client.cpp (4)
  • OnLeaderStart (2973-2990)
  • OnLeaderStart (2973-2974)
  • OnStartFollowing (3014-3063)
  • OnStartFollowing (3014-3018)
eloq_data_store_service/rocksdb_data_store_common.cpp (4)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (167-188)
  • Shutdown (167-167)
eloq_data_store_service/rocksdb_data_store.cpp (2)
  • Shutdown (69-83)
  • Shutdown (69-69)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (104-122)
  • Shutdown (104-104)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
eloq_data_store_service/thread_worker_pool.h (2)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Initialize (163-285)
  • Initialize (163-163)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Initialize (34-84)
  • Initialize (34-34)
eloq_data_store_service/internal_request.h (2)
eloq_data_store_service/data_store_service_config.h (1)
  • uint32_t (107-110)
data_store_service_client_closure.h (31)
  • Reset (81-85)
  • Reset (144-153)
  • Reset (210-216)
  • Reset (269-274)
  • Reset (385-392)
  • Reset (600-619)
  • Reset (933-948)
  • Reset (1154-1177)
  • Reset (1393-1408)
  • Reset (1598-1624)
  • Reset (1627-1658)
  • Reset (1660-1694)
  • Reset (1933-2251)
  • Reset (2328-2347)
  • Reset (2606-2612)
  • Reset (3002-3010)
  • vector (1078-1081)
  • vector (1121-1128)
  • vector (2253-2256)
  • string (804-809)
  • string (2173-2178)
  • string_view (311-368)
  • string_view (752-755)
  • string_view (767-770)
  • string_view (792-797)
  • string_view (1291-1294)
  • string_view (1533-1536)
  • string_view (2133-2136)
  • string_view (2138-2141)
  • string_view (2241-2246)
  • string_view (2453-2456)
data_store_service_scanner.cpp (1)
data_store_service_scanner.h (1)
  • data_shard_id_ (266-266)
eloq_data_store_service/data_store_service.cpp (2)
eloq_data_store_service/data_store_service.h (4)
  • IsOwnerOfShard (636-641)
  • PrepareShardingError (546-550)
  • IncreaseWriteReqCount (624-628)
  • DecreaseWriteReqCount (630-634)
data_store_service_client.cpp (10)
  • Read (1087-1096)
  • Read (1087-1092)
  • Read (3528-3544)
  • Read (3528-3533)
  • DeleteRange (3577-3599)
  • DeleteRange (3577-3584)
  • ScanNext (3720-3753)
  • ScanNext (3720-3734)
  • ScanClose (3795-3819)
  • ScanClose (3795-3800)
data_store_service_client_closure.h (2)
eloq_data_store_service/internal_request.h (13)
  • partition_id_ (647-647)
  • partition_id_ (856-856)
  • partition_id_ (1504-1504)
  • string_view (106-109)
  • string_view (126-129)
  • string_view (273-276)
  • string_view (606-614)
  • string_view (621-629)
  • string_view (811-819)
  • string_view (975-983)
  • string_view (1110-1118)
  • string_view (1406-1414)
  • string_view (1421-1429)
eloq_data_store_service/data_store_service_util.h (1)
  • shard_id_ (123-123)
eloq_data_store_service/thread_worker_pool.cpp (1)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Initialize (163-285)
  • Initialize (163-163)
eloq_data_store_service/data_store_service.h (5)
eloq_data_store_service/data_store_service.cpp (12)
  • StartService (265-350)
  • StartService (265-265)
  • Read (443-476)
  • Read (443-446)
  • Read (478-521)
  • Read (478-486)
  • ScanClose (1039-1072)
  • ScanClose (1039-1042)
  • ScanClose (1074-1106)
  • ScanClose (1074-1079)
  • TTLWrapperCache (78-83)
  • TTLWrapperCache (85-94)
data_store_service_client.cpp (6)
  • Read (1087-1096)
  • Read (1087-1092)
  • Read (3528-3544)
  • Read (3528-3533)
  • ScanClose (3795-3819)
  • ScanClose (3795-3800)
eloq_data_store_service/rocksdb_data_store_common.cpp (5)
  • Read (436-497)
  • Read (436-436)
  • key (852-852)
  • ScanClose (1067-1091)
  • ScanClose (1067-1067)
eloq_data_store_service/data_store_service_config.cpp (6)
  • PrepareShardingError (918-962)
  • PrepareShardingError (918-919)
  • IsOwnerOfShard (286-294)
  • IsOwnerOfShard (286-286)
  • IsOwnerOfShard (691-702)
  • IsOwnerOfShard (691-692)
eloq_data_store_service/data_store.h (1)
  • DataStore (42-143)
data_store_service_client.h (1)
data_store_service_client.cpp (22)
  • FindFreeNodeIndex (3269-3288)
  • FindFreeNodeIndex (3269-3269)
  • InitBucketsInfo (3132-3197)
  • InitBucketsInfo (3132-3136)
  • OnLeaderStart (2973-2990)
  • OnLeaderStart (2973-2974)
  • OnLeaderStop (2992-3005)
  • OnLeaderStop (2992-2992)
  • OnStartFollowing (3014-3063)
  • OnStartFollowing (3014-3018)
  • GetShardIdByPartitionId (3090-3115)
  • GetShardIdByPartitionId (3090-3091)
  • GetOwnerNodeIndexOfShard (3199-3209)
  • GetOwnerNodeIndexOfShard (3199-3200)
  • GetAllDataShards (3117-3130)
  • GetAllDataShards (3117-3117)
  • UpdateOwnerNodeIndexOfShard (3211-3267)
  • UpdateOwnerNodeIndexOfShard (3211-3212)
  • HandleShardingError (3290-3323)
  • HandleShardingError (3290-3291)
  • UpgradeShardVersion (3325-3372)
  • UpgradeShardVersion (3325-3328)
eloq_data_store_service/rocksdb_data_store.cpp (3)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (2)
  • Shutdown (167-188)
  • Shutdown (167-167)
eloq_data_store_service/rocksdb_data_store_common.cpp (2)
  • Shutdown (287-301)
  • Shutdown (287-287)
eloq_data_store_service/thread_worker_pool.cpp (2)
  • Shutdown (104-122)
  • Shutdown (104-104)
data_store_service_client.cpp (2)
data_store_service_client.h (4)
  • TxPort2DssPort (134-137)
  • need_bootstrap_ (688-859)
  • KvPartitionIdOf (629-634)
  • bind_data_shard_with_ng_ (689-813)
eloq_data_store_service/data_store_service.cpp (16)
  • BatchWriteRecords (897-945)
  • BatchWriteRecords (897-901)
  • BatchWriteRecords (1178-1243)
  • BatchWriteRecords (1178-1191)
  • Read (443-476)
  • Read (443-446)
  • Read (478-521)
  • Read (478-486)
  • ScanNext (947-1003)
  • ScanNext (947-962)
  • ScanNext (1005-1037)
  • ScanNext (1005-1008)
  • DeleteRange (613-660)
  • DeleteRange (613-617)
  • DeleteRange (662-715)
  • DeleteRange (662-669)
🪛 Clang (14.0.6)
eloq_data_store_service/data_store_service_util.h

[error] 24-24: 'atomic' file not found

(clang-diagnostic-error)

🔇 Additional comments (28)
eloq_data_store_service/thread_worker_pool.h (1)

42-42: Initialize() addition matches reusable pool design

Exposing Initialize() publicly is consistent with RocksDBDataStoreCommon::Initialize() reusing an existing ThreadWorkerPool after Shutdown(). Given the cpp implementation only starts threads when workers_.empty() and shutdown_indicator_ is false, repeated Initialize()/Shutdown() cycles look safe and idempotent.

eloq_data_store_service/rocksdb_data_store_common.h (1)

165-166: Common Shutdown() centralizes worker‑pool and scan‑iter cleanup

Adding RocksDBDataStoreCommon::Shutdown() to stop query_worker_pool_ and call ForceEraseScanIters(shard_id_) cleanly separates common lifecycle from DB‑specific teardown done in derived classes. Given ThreadWorkerPool::Shutdown() is idempotent and the pointer is kept for reuse via Initialize(), this override looks correct.

rocksdb_handler.h (1)

531-537: Handler interface updated consistently to be ng‑aware

The added ng_id parameter in OnLeaderStart/OnStartFollowing matches the updated DataStoreHandler/DataStoreServiceClient interfaces, keeping RocksDBHandler aligned with the new per‑NG callback contract. No further changes needed here.

data_store_service_client_closure.h (6)

499-505: Archive read callback now uses signed partition_id consistently

Switching ReadBaseForArchiveCallbackData::AddResult and partition_id_ to int32_t aligns this callback with the rest of the codebase (internal requests and proto using signed partition IDs). The change is local and only widens the accepted domain; looks safe.

Also applies to: 532-537


600-619: ReadClosure shard‑aware routing and idempotent reset look correct

  • Reset(...) now captures int32_t partition_id_, uint32_t shard_id_, and std::string_view key_, and PrepareRequest sets both partition_id and shard_id plus key_str on the RPC.
  • Shard re‑routing on transport failure now correctly calls UpdateOwnerNodeIndexOfShard(shard_id_, remote_node_index_, ...), matching the new shard‑based ownership model.
  • Clear() initializing partition_id_ to INT32_MAX and shard_id_ to UINT32_MAX gives a clear “invalid” sentinel for debugging and reuse without impacting normal flows.

Overall, the refactor cleanly separates partition vs shard and uses shard_id consistently in retry logic.

Also applies to: 631-639, 661-663, 688-689, 757-770, 886-890


1154-1177: DeleteRangeClosure correctly propagates shard_id into RPC and retries

The updated Reset(...) signature and state storing int32_t partition_id_ plus uint32_t shard_id_, and wiring shard_id_ into DeleteRangeRequest::set_shard_id(...) and UpdateOwnerNodeIndexOfShard(shard_id_, ...), bring DeleteRange into line with the shard‑centric routing model. The new ShardId() accessor is also useful for local handling paths. No functional issues spotted.

Also applies to: 1197-1201, 1226-1229, 1296-1305, 1356-1358


1605-1607: BatchWriteRecordsClosure shard‑aware write path looks consistent

  • Both single‑record and batched Reset(...) overloads now accept and store a uint32_t shard_id, used in BatchWriteRecordsRequest::set_shard_id(shard_id_).
  • Retry on transport failure updates ownership via UpdateOwnerNodeIndexOfShard(shard_id_, remote_node_index_, ...), matching the other closures.
  • Clear() uses INT32_MAX/UINT32_MAX sentinels and the new PartitionId()/ShardId() const accessors expose the stored routing context cleanly.

The changes are cohesive with the rest of the shard‑aware client logic.

Also applies to: 1629-1631, 1660-1664, 1675-1681, 1714-1716, 1777-1778, 1850-1858, 1868-1871


1913-1915: ScanNextClosure: shard_id threading and retry behavior are wired correctly

  • Reset(...) now takes int32_t partition_id and uint32_t shard_id and stores them; PrepareRequest writes both fields into ScanRequest (partition_id + shard_id).
  • On RPC/channel errors, UpdateOwnerNodeIndexOfShard(shard_id_, remote_node_index_, ...) plus retry mirrors the pattern used in Read/Delete/Write.
  • Accessors PartitionId()/ShardId() keep callers shard‑aware for local scans as well.

Search condition handling and session_id logic remain unchanged; only routing context has been extended, which looks correct.

Also applies to: 1935-1938, 1956-1958, 2008-2009, 2055-2056, 2123-2126, 2128-2131, 2275-2277


2931-2950: FetchArchivesCallbackData partition_id type now matches other components

FetchArchivesCallbackData now stores partition_id_ as int32_t, matching other request/closure/types. Since this value is only forwarded and compared elsewhere, the signedness change should be a no‑op behaviorally and simplifies interop with updated protos and internal requests.

eloq_data_store_service/data_store_service_util.h (1)

24-29: Additional standard headers are fine; static‑analysis error is likely tooling‑related

Including <atomic>, <map>, <memory>, <set>, and <unordered_map> here is standard practice and prepares this utility header for richer data structures/state. The reported 'atomic' file not found from clang static analysis is almost certainly due to how the tool is invoked (e.g., wrong language mode or include paths) rather than an issue with these includes themselves.

eloq_data_store_service/rocksdb_data_store.cpp (1)

69-83: Revised Shutdown() ordering cleanly separates common vs DB teardown

Calling RocksDBDataStoreCommon::Shutdown() first to stop the query worker pool and erase scan iterators, then taking db_mux_ and closing/deleting db_, gives a sensible and race‑resistant shutdown sequence: no new work can be enqueued once the pool is stopped, and all DB access is serialized during close. The destructor’s conditional Shutdown() (which may run after an explicit Shutdown) is safe, since both the worker‑pool Shutdown and DB close paths are effectively idempotent after the first call.

data_store_service_scanner.cpp (2)

64-81: Shard-aware ScanNext/ScanClose wiring looks consistent.

SinglePartitionScanner now threads data_shard_id_ into client->ScanNext and client->ScanClose, which matches the new shard-aware client surface and the added data_shard_id_ member on the scanner. Given Reset always sets data_shard_id_ before any fetch/close, these calls look safe and correct.

Also applies to: 102-115


413-423: Confirm partition_id → shard_id mapping correctness in Init().

Here part_cnt (0…HASH_PARTITION_COUNT-1) is:

  • cast to int32_t and passed to GetShardIdByPartitionId, and
  • also stored as partition_id_ via Reset.

This assumes that the logical hash-partition id and the KV partition id expected by GetShardIdByPartitionId are identical. That matches the existing usage pattern but is worth double‑checking against KvPartitionIdOf semantics to avoid subtle misrouting if those ever diverge.

eloq_data_store_service/ds_request.proto (2)

164-171: New shard_id fields in DeleteRangeRequest and ScanRequest look compatible.

For DeleteRangeRequest and ScanRequest, shard_id is appended at new, higher tag numbers (6 and 12 respectively), and existing fields keep their original numbers. This is the correct pattern for evolving the schema with additional routing context; older peers will ignore shard_id, and newer peers will default it to 0 when talking to older senders.

Also applies to: 305-323
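In proto syntax, the append-only evolution described above looks roughly like this (the `uint32` type and the elided existing fields are assumptions; the tag numbers 6 and 12 come from the review comment):

```proto
message DeleteRangeRequest {
  // ... fields 1-5 unchanged, keeping their original tag numbers ...
  uint32 shard_id = 6;   // appended at the next free tag; old peers ignore it
}

message ScanRequest {
  // ... fields 1-11 unchanged ...
  uint32 shard_id = 12;  // proto3 default of 0 applies when a sender omits it
}
```

Appending at fresh tags, rather than renumbering, is what keeps mixed-version peers wire-compatible during a rolling upgrade.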


95-100: Review comment is incorrect—no cardinality change to key_str occurred.

The review claims ReadRequest.key_str was changed from repeated bytes to bytes (singular), but evidence shows this never happened:

  1. Current proto file: bytes key_str = 3; (singular)
  2. Multi-shard commit (dfd01f7): bytes key_str = 3; (identical, singular)
  3. C++ code: Uses set_key_str() and key_str() (singular accessors), not add_key_str() or repeated patterns

The field has always been singular. The review may have conflated the actual BatchWriteRecordsRequest field renumbering (skip_wal and items) with a non-existent change to ReadRequest.key_str.

Likely an incorrect or invalid review comment.

data_store_service_client_closure.cpp (5)

71-82: Key handling via ReadClosure::Key() looks safe and clearer.

Using read_closure->Key() (and storing it as std::string_view) instead of accessing Key().back():

  • Avoids the somewhat opaque “last element” semantics and directly reflects the single-key nature of these callbacks.
  • Makes log messages and archive decode errors include the exact packed key bytes, which is useful for debugging.

Assuming ReadClosure::Key() already returns a stable view for the lifetime of the callback (as it did before), these changes are behaviorally safe.

Also applies to: 96-106, 177-181


278-299: Shard-aware pagination for FetchBucketData is wired correctly.

When the bucket isn’t drained and bucket_data_items_ is still empty, you now:

  • Map bucket_id_ → hash partition (MapBucketIdToHashPartitionId),
  • Map that to a KV partition via KvPartitionIdOf,
  • Derive data_shard_id with GetShardIdByPartitionId(kv_partition_id, false), and
  • Use both kv_partition_id and data_shard_id in the recursive ScanNext call.

Given the retrieved learning that bucket IDs are bounded well below UINT16_MAX, the int32_t intermediates are safe. This preserves the invariant that all subsequent scans for this bucket stay on the same KV partition and shard.


516-532: PartitionBatchCallback now reuses the original shard_id; this matches partition-scoped coordination.

The callback now casts closure to BatchWriteRecordsClosure, reads ShardId() once, and reuses that data_shard_id for all follow‑up BatchWriteRecords calls for the same partition.

This keeps all batches for a partition on the same shard and aligns with the existing design where the global coordinator tracks completion at the partition level, not per-batch. As long as PartitionBatchCallback is only ever invoked with a BatchWriteRecordsClosure (which matches the call sites), this change is correct.


617-631: ShardId propagation across ScanNext/Read continuations looks coherent.

All the changed continuations (FetchAllDatabaseCallback, DiscoverAllTableNamesCallback, FetchTableRangesCallback, FetchRangeSlicesCallback, FetchTableStatsCallback, LoadRangeSliceCallback, FetchArchivesCallback, FetchRecordArchivesCallback) now:

  • Extract ShardId() from the current ScanNextClosure or ReadClosure, and
  • Pass it into subsequent ScanNext/Read calls along with the same partition/session id.

This is the right pattern for multi-shard routing: once the initial request determined the shard, all pagination and follow-up reads stay pinned to that shard rather than re-deriving it. Provided the initial closures are constructed with the correct shard_id (per DataStoreServiceClient changes), this wiring looks consistent.

Also applies to: 686-700, 819-833, 917-923, 1013-1020, 1132-1145, 1239-1252, 1308-1321, 1417-1431, 1443-1456


245-276: Verify: FetchBucketDataCc and AddDataItem cannot be located in the repository.

The review comment raises concerns about a key format change, but critical components cannot be found for verification:

  • FetchBucketDataCc class definition not found in any codebase files
  • AddDataItem method signature not found in headers
  • DecodeKvKeyForHashPart function confirmed NOT to exist

Additionally, the reviewed code at lines 264-265 uses the same pattern already established in rocksdb_handler.cpp:1383, suggesting this is not a novel change. The commented code shows a previous TxKey variant, but rocksdb_handler.cpp already uses raw key_str—indicating the current pattern may have been working.

Please confirm:

  1. Whether FetchBucketDataCc is defined externally or generated
  2. What AddDataItem expects downstream (raw KV key or wrapped form)
  3. Whether the rocksdb_handler.cpp usage (same pattern as lines 264-265) has been validated
data_store_service_scanner.h (2)

207-221: SinglePartitionScanner now carries data_shard_id; lifecycle/reset behavior is sound.

Reset now initializes both partition_id_ (int32_t) and data_shard_id_ (uint32_t), and Clear() zeroes them along with the rest of the scanner state. Combined with the use in data_store_service_scanner.cpp (Init → Reset → FetchNextBatch/ScanClose), this ensures:

  • data_shard_id_ is always set before use, and
  • pooled scanners are fully scrubbed when returned to the ObjectPool.

This looks correct for per-shard scanner reuse.

Also applies to: 248-259, 264-273


279-286: HASH_PARTITION_COUNT tied to txservice::total_hash_partitions is a good centralization.

Defining:

static constexpr uint32_t HASH_PARTITION_COUNT = txservice::total_hash_partitions;

removes the previous local duplication/ifdefs and keeps the scanner aligned with the system-wide partition count. This simplifies configuration and reduces the chance of mismatched constants.

eloq_data_store_service/data_store_service.cpp (4)

76-154: TTLWrapperCache lifecycle and concurrency look correct

The TTL-based iterator cache:

  • Uses a dedicated ThreadWorkerPool with a single worker to run TTLCheckWorker().
  • Protects ttl_wrapper_cache_, in_use_, and last_access_time_ behind bthread::Mutex, both in the worker and in Emplace/Borrow/Return/Clear/ForceEraseIters.
  • Shuts down the worker in ~TTLWrapperCache() before calling Clear(), ensuring no background thread races with destruction.

Given ForceEraseIters() is only used from shutdown paths (e.g., ForceEraseScanIters from RocksDBDataStoreCommon::Shutdown), the combination of ForceEraseIters() followed by Clear() is safe (the map is already empty, so the InUse()==false assertion doesn’t fire).

Overall this design is reasonable for managing scan iterators per shard.

Also applies to: 188-207, 1127-1176


443-945: Per-shard gating and ownership checks across read and write paths are consistent

The refactor of Read, FlushData, DeleteRange, CreateTable, DropTable, and BatchWriteRecords to:

  • Derive shard_id from the RPC/local request,
  • Guard with IsOwnerOfShard(shard_id) and PrepareShardingError,
  • Then load ds_ref.shard_status_ and require appropriate states (ReadOnly/ReadWrite for reads/scans, ReadWrite only for writes), and
  • Call into ds_ref.data_store_ with per‑shard request objects

is internally consistent and matches the new multi‑shard design. The write paths pair IncreaseWriteReqCount(shard_id) with the corresponding DecreaseWriteReqCount calls in SetFinish() of the request wrappers, maintaining the per‑shard in‑flight write counters.


1341-1397: Shard-status checks in SyncFileCache and ProcessSyncFileCache are sensible

The new SyncFileCache / ProcessSyncFileCache implementation:

  • Only processes requests when the target shard’s status is Closed, treating non-Closed shards as non-standby and logging a warning.
  • Validates the storage path via data_store_factory_ before mutating the filesystem.
  • During directory cleanup and S3 download, repeatedly checks that ds_ref.shard_status_ remains Closed and aborts if the shard is opened mid-operation.
  • Properly finishes the RPC via SyncFileCacheLocalRequest::Finish() and, on the S3 path, ensures is_file_sync_running_ is reset.

Aside from the is_file_sync_running_ race noted separately, the shard-status gating and I/O error handling here look correct and aligned with the intended “only sync when DB is closed on a standby” semantics.

Also applies to: 1459-1477, 1505-1512, 1571-1579


2888-2987: FileCacheSyncWorker multi-shard iteration and primary-only sync behavior look correct

The new FileCacheSyncWorker:

  • Iterates all entries in data_shards_ and only considers shards whose shard_status_ is ReadWrite, i.e., primaries.
  • Skips shards with data_store_ == nullptr, then dynamic_casts to RocksDBCloudDataStore and continues only if the cast succeeds.
  • Collects cached SST files via CollectCachedSstFiles, then sends a SyncFileCache RPC with the per-shard shard_id and file list to each non-self member node.

This aligns with the expectation that primaries push cache state to their standbys. The per-shard status check prevents followers or closed shards from attempting to sync, and the dynamic_cast keeps this path specific to cloud-backed stores.

eloq_data_store_service/internal_request.h (1)

36-327: Shard-aware write requests and per-shard write counters are wired correctly

The additions of GetShardId() to WriteRecordsRequest, FlushDataRequest, and DeleteRangeRequest, plus the corresponding implementations in:

  • WriteRecordsRpcRequest / WriteRecordsLocalRequest,
  • FlushDataRpcRequest / FlushDataLocalRequest,
  • DeleteRangeRpcRequest / DeleteRangeLocalRequest,

ensure that each write request now carries its shard id explicitly. The SetFinish methods consistently call:

data_store_service_->DecreaseWriteReqCount(GetShardId());
// or DecreaseWriteReqCount(shard_id_);

which matches the per-shard IncreaseWriteReqCount(shard_id) calls in DataStoreService. Clear()/Reset() correctly initialize shard_id_ to UINT32_MAX and then to the real shard on Reset, preventing stale counters.

This looks coherent with the new multi-shard design.

Also applies to: 329-469

eloq_data_store_service/data_store_service.h (1)

692-728: Per-shard DataShard encapsulation and shutdown logic align with multi-shard design

The new DataShard struct cleanly encapsulates all per-shard state:

  • shard_id_, data_store_, shard_status_, ongoing_write_requests_,
  • scan_iter_cache_ (TTLWrapperCache instance),
  • is_file_sync_running_ to coordinate with file-cache sync.

DataShard::ShutDown() calls data_store_->Shutdown() and clears the scan iterator cache (destroying its TTL worker) without nulling data_store_, which is appropriate given the 2‑phase close semantics (status moves to Closed and the pointer remains until DataStoreService itself is destroyed). This matches the usage in the service destructor and the shard lifecycle helpers.

Comment on lines +3090 to 3115
uint32_t DataStoreServiceClient::GetShardIdByPartitionId(
    int32_t partition_id, bool is_range_partition) const
{
    uint16_t bucket_id;
    if (is_range_partition)
    {
        bucket_id = txservice::Sharder::MapRangeIdToBucketId(partition_id);
    }
    else
    {
        bucket_id =
            txservice::Sharder::MapHashPartitionIdToBucketId(partition_id);
    }

    auto it = bucket_infos_.find(bucket_id);
    assert(it != bucket_infos_.end());
    if (it != bucket_infos_.end())
    {
        uint32_t shard_id = it->second->BucketOwner();
        assert(dss_shard_ids_.find(shard_id) != dss_shard_ids_.end());
        return shard_id;
    }
    LOG(ERROR) << "Bucket not found for partition_id=" << partition_id
               << " (bucket_id=" << bucket_id << ")";
    return UINT32_MAX;
}

⚠️ Potential issue | 🔴 Critical

Do not return a sentinel shard_id that callers ignore

GetShardIdByPartitionId now logs and returns UINT32_MAX when a bucket lookup fails, but every caller still assumes the value is usable and immediately indexes dss_shards_/dss_nodes_. In a bootstrap configuration with zero shards (or any inconsistent topology), we hit the LOG(ERROR) path, return UINT32_MAX, and then execute dss_nodes_[UINT32_MAX].Channel() in the remote paths (e.g. ReadInternal, BatchWriteRecordsInternal), which is undefined behaviour.

Previously this path LOG(FATAL)'d, so we never proceeded with an invalid shard_id. Please either keep the fatal guard or propagate a real failure to callers and abort the request before touching dss_nodes_. For example:

-    auto it = bucket_infos_.find(bucket_id);
-    assert(it != bucket_infos_.end());
-    if (it != bucket_infos_.end())
-    {
-        uint32_t shard_id = it->second->BucketOwner();
-        assert(dss_shard_ids_.find(shard_id) != dss_shard_ids_.end());
-        return shard_id;
-    }
-    LOG(ERROR) << "Bucket not found for partition_id=" << partition_id
-               << " (bucket_id=" << bucket_id << ")";
-    return UINT32_MAX;
+    auto it = bucket_infos_.find(bucket_id);
+    if (it == bucket_infos_.end())
+    {
+        LOG(FATAL) << "Bucket not found for partition_id=" << partition_id
+                   << " (bucket_id=" << bucket_id << ")";
+    }
+    uint32_t shard_id = it->second->BucketOwner();
+    if (dss_shard_ids_.find(shard_id) == dss_shard_ids_.end())
+    {
+        LOG(FATAL) << "Shard " << shard_id
+                   << " not present in dss_shard_ids_ for partition_id="
+                   << partition_id << " (bucket_id=" << bucket_id << ")";
+    }
+    return shard_id;

Without reinstating a hard guard (or updating every caller to bail when UINT32_MAX is returned), we're still one invalid config away from corrupting memory.

@lzxddz lzxddz merged commit 9fc1352 into main Nov 21, 2025
1 check passed


Development

Successfully merging this pull request may close these issues.

[Feature] DataStoreService supports multi data shards

2 participants