Skip to content

Update cluster config#256

Merged
githubzilla merged 5 commits intomainfrom
update_cluster_config
Nov 4, 2025
Merged

Update cluster config#256
githubzilla merged 5 commits intomainfrom
update_cluster_config

Conversation

@githubzilla
Copy link
Collaborator

@githubzilla githubzilla commented Nov 3, 2025

Summary by CodeRabbit

  • Refactor
    • Startup now builds DSS cluster configuration in memory, establishes node/leader identities, and fetches or constructs cluster topology at runtime. Datastore connections are deferred unless bootstrapping or the node becomes leader/follower, improving startup flexibility and reliability.
  • Chores
    • Updated submodule references.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 3, 2025

Walkthrough

Replaces on-disk DSS config with an in‑memory DSS cluster config in src/redis_service.cpp; computes dss_node_id and dss_leader_id, populates topology (peer-fetch or local build), passes IDs into DataStoreService::StartService(...), and makes store_hd_.Connect() conditional. Two submodule pointers updated.

Changes

Cohort / File(s) Summary
DSS cluster config & startup
src/redis_service.cpp
Removed file-based dss_config.ini usage and introduced in-memory EloqDS::DataStoreServiceClusterManager ds_config. Added dss_node_id / dss_leader_id declarations and logic to compute/set them. Populate topology via TxConfigsToDssClusterConfig(...) or FetchConfigFromPeer(...). Updated DataStoreService::StartService call to include (bool bootstrap, uint32_t dss_leader_id, uint32_t dss_node_id). Made store_hd_.Connect() run only for bootstrap or single-node, deferring otherwise.
Submodule pointer updates
store_handler (manifest), tx_service (manifest)
Advanced submodule commit references for store_handler and tx_service (manifest-only pointer updates). No code/interface changes in those submodules within this diff.

Sequence Diagram(s)

sequenceDiagram
    actor Operator
    participant RedisSvc as redis_service.cpp
    participant ConfigMaker as Config builder / Peer fetch
    participant DataStore as DataStoreService

    Operator->>RedisSvc: start node
    RedisSvc->>RedisSvc: determine native_ng_id and node identity
    RedisSvc->>RedisSvc: compute dss_node_id and dss_leader_id (bootstrap/single-node logic)
    alt eloq_dss_peer_node set
        RedisSvc->>ConfigMaker: SetThisNode(...) & FetchConfigFromPeer(...)
        ConfigMaker-->>RedisSvc: ds_config populated / error
    else no peer configured
        RedisSvc->>ConfigMaker: TxConfigsToDssClusterConfig(dss_node_id, native_ng_id, ng_configs, dss_leader_id, ds_config)
        ConfigMaker-->>RedisSvc: ds_config populated
    end
    RedisSvc->>DataStore: StartService(bootstrap, dss_leader_id, dss_node_id)
    DataStore-->>RedisSvc: started
    alt bootstrap or single-node
        RedisSvc->>RedisSvc: store_hd_.Connect() immediately
    else non-bootstrap
        RedisSvc->>RedisSvc: defer store_hd_.Connect() until role assigned
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Check all call sites and tests updated for the changed StartService signature.
  • Validate TxConfigsToDssClusterConfig and FetchConfigFromPeer usage to ensure ds_config is correctly populated and errors handled.
  • Review startup timing and potential race conditions around deferred store_hd_.Connect() when role transitions occur.

Possibly related PRs

Poem

🐇 I hopped from INI into RAM tonight,
spun nodes and leaders neat and bright,
fetched peers or built the map,
connections wait for the right clap,
carrots code — the cluster's light! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Update cluster config' is directly related to the main change—replacing on-disk dss_config.ini with in-memory cluster configuration and updating DSS topology initialization.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch update_cluster_config

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/redis_service.cpp (1)

1161-1168: Connection to store_hd_ is missing for non-bootstrap/non-single-node cases—store_hd_ will remain unconnected despite being used.

The code calls store_hd_->AppendPreBuiltTable() (line 1158) before the Connect condition, and for non-bootstrap/non-single-node scenarios, the Connect() call at line 1163 is skipped entirely. Additionally, RegisterKvMetrics() (line 1481) and SetTxService() (line 1484) are called unconditionally without any guarantee that store_hd_ was connected.

The comment at line 1162 states "connect when become leader or follower," but this deferred connection mechanism does not exist in the codebase—there are no OnLeaderStart, OnFollower, or equivalent callbacks that invoke Connect(). This means non-bootstrap/non-single-node nodes will have an unconnected store handler used for table operations, metrics registration, and transaction service binding, which will likely cause failures.

Either move AppendPreBuiltTable() inside the Connect conditional block, or implement the promised deferred connection logic in the leadership/follower initialization code path.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1339b88 and f5540a1.

📒 Files selected for processing (3)
  • src/redis_service.cpp (4 hunks)
  • store_handler (1 hunks)
  • tx_service (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tx_service
  • store_handler
🧰 Additional context used
🧬 Code graph analysis (1)
src/redis_service.cpp (1)
include/redis_service.h (1)
  • node_id (137-137)
🔇 Additional comments (1)
src/redis_service.cpp (1)

1020-1020: Add error handling for TxConfigsToDssClusterConfig and verify empty file path handling.

The redesign switches from on-disk to in-memory DSS cluster configuration. While the empty dss_config_file_path may be intentional, you should verify that EloqDS::DataStoreService correctly handles an empty config file path parameter (line 1137).

Additionally, line 1048 calls TxConfigsToDssClusterConfig() without checking for errors—this could silently fail and produce an invalid ds_config. Add error handling or validation after this call to ensure the config is properly initialized.

#!/bin/bash
# Check if TxConfigsToDssClusterConfig has a return value or error reporting mechanism
rg -n "TxConfigsToDssClusterConfig" --type=cpp --type=h -B2 -A8

Also applies to: 1048-1048, 1135-1139

Comment on lines +1048 to +1054
EloqDS::DataStoreServiceClient::TxConfigsToDssClusterConfig(
dss_node_id,
native_ng_id,
ng_configs,
dss_leader_id,
ds_config);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Add error handling for TxConfigsToDssClusterConfig.

The call to TxConfigsToDssClusterConfig lacks error handling, unlike the peer fetch path (lines 1038-1044). If this function can fail, add appropriate error checking and return false on failure to maintain consistency with the peer fetch error handling.

🤖 Prompt for AI Agents
In src/redis_service.cpp around lines 1048 to 1054, the call to
EloqDS::DataStoreServiceClient::TxConfigsToDssClusterConfig has no error
handling; mirror the peer-fetch path at lines 1038–1044 by checking the
function's return value, logging a clear error (including context such as
dss_node_id/native_ng_id) when it fails, and returning false from the
surrounding function on failure so behavior is consistent with the peer fetch
error handling.

@githubzilla githubzilla force-pushed the update_cluster_config branch from 27fef8f to 49de4ff Compare November 4, 2025 04:20
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/redis_service.cpp (1)

1048-1054: Missing error handling on TxConfigsToDssClusterConfig.

Mirror the peer‑fetch path: check the return status, log context, and fail fast on error.

Apply this diff:

-            EloqDS::DataStoreServiceClient::TxConfigsToDssClusterConfig(
-                dss_node_id,
-                native_ng_id,
-                ng_configs,
-                dss_leader_id,
-                ds_config);
+            if (!EloqDS::DataStoreServiceClient::TxConfigsToDssClusterConfig(
+                    dss_node_id,
+                    native_ng_id,
+                    ng_configs,
+                    dss_leader_id,
+                    ds_config)) {
+                LOG(ERROR) << "TxConfigsToDssClusterConfig failed: "
+                           << "dss_node_id=" << dss_node_id
+                           << " native_ng_id=" << native_ng_id
+                           << " leader_id=" << dss_leader_id;
+                return false;
+            }
🧹 Nitpick comments (5)
src/redis_service.cpp (5)

1024-1031: Ensure DSS node ID matches final DSS topology.

Setting dss_node_id = node_id assumes DSS IDs mirror Tx IDs. After building or fetching ds_config, derive/validate the DSS node id from ds_config (e.g., “this node” entry) to avoid ID drift, especially in the peer‑fetch path. Add a sanity check and log both IDs.


1032-1045: Peer‑fetch: validate “this node” presence and improve error context.

After FetchConfigFromPeer(...), verify ds_config contains this node and matches local_ip/TxPort2DssPort(local_tx_port). On failure, log ip, dss_port, node_id, and proposed dss_node_id to ease diagnosis.


1020-1021: Empty dss_config_file_path: confirm supported semantics.

dss_config_file_path is set to "" but passed into DataStoreService. Ensure an empty path is explicitly supported (no attempts to read/write a file). If not, gate usage or add a clear comment.


1056-1058: Remove stale commented‑out file path.

The legacy on‑disk path is dead code. Delete to reduce noise.


1161-1164: Deferred Connect requires a role‑change hook; clarify single‑node flow.

Connecting only for bootstrap || is_single_node implies later Connect() on role changes (leader or follower). Point to the actual callbacks that invoke it; otherwise this can leave the client disconnected in steady state. Also, re the earlier reviewer’s question at Line 1163: verify whether single‑node triggers OnLeaderStart (or equivalent). If not, the explicit Connect() here is the only activation; document this behavior.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27fef8f and 167fd89.

📒 Files selected for processing (3)
  • src/redis_service.cpp (4 hunks)
  • store_handler (1 hunks)
  • tx_service (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tx_service
  • store_handler
🧰 Additional context used
🧬 Code graph analysis (1)
src/redis_service.cpp (1)
include/redis_service.h (1)
  • node_id (137-137)
🔇 Additional comments (1)
src/redis_service.cpp (1)

1144-1144: StartService parameters: UNKNOWN leader on multi‑node path.

On non‑bootstrap, non‑single‑node starts, dss_leader_id remains UNKNOWN. Confirm StartService(..., dss_leader_id, dss_node_id) tolerates this and that leadership will be resolved via DSS membership changes. Also confirm dss_node_id matches ds_config (see earlier comment).

@githubzilla githubzilla merged commit 76334a4 into main Nov 4, 2025
4 of 5 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Nov 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants