@yuwmao yuwmao commented Dec 29, 2025

We found a corner case in which a new member rejects the join_cluster_req while it is being added. Here is what happened:
T1. The leader invited the new member to join the cluster for the first time. The follower received the request, saved the state, and called reconfigure to apply the cluster config. However, the leader never received the response (timeout).
leader's logs

[12/26/25 09:54:22.824943] [I] [122] [handle_join_leave.cxx:149:invite_srv_to_join_cluster] sent join request to peer 1557217429, 6a887216-0d1d-4e40-81b3-9ed02222b626 [group=1b36bd51-c679-497a-8f42-92cba8713bd8]
...
[12/26/25 09:54:24.825554] [I] [24] [raft_server.cxx:1639:handle_ext_resp_err] receive an rpc error response from peer server, Deadline Exceeded 12 [group=1b36bd51-c679-497a-8f42-92cba8713bd8] 
[12/26/25 09:54:26.025] [storage_mgr] [error] [122] [raft_repl_dev.cpp:490:do_add_member] [traceID=9333964952249785610] [rdev236:1b36bd51-c679-497a-8f42-92cba8713bd8] Add member failed, member=6a887216-0d1d-4e40-81b3-9ed02222b626, err=-1
--

follower's logs

[12/26/25 09:54:25.831] [storage_mgr] [debug] [28] [raft_repl_dev.h:364:become_follower_cb] [traceID=n/a] [rdev9:1b36bd51-c679-497a-8f42-92cba8713bd8] become_follower_cb called! 
[12/26/25 09:54:25.831] [storage_mgr] [info] [28] [raft_repl_dev.cpp:1902:save_config] [traceID=n/a] [rdev9:1b36bd51-c679-497a-8f42-92cba8713bd8] Saved config {"eventual_consistency":false,"log_idx":59074,"prev_log_idx":59073,"servers":[{"aux":"","dc_id":0,"endpoint":"02b868b0-0c40-46d5-9aa5-a03e5c3b0edd","id":402439255,"learner":false,"priority":66},{"aux":"","dc_id":0,"endpoint":"f5018a5a-99a8-4e47-b600-b74a1ac22438","id":53101537,"learner":false,"priority":66},{"aux":"","dc_id":0,"endpoint":"9727e555-0988-43c7-bbfb-109a00b7125a","id":234785214,"learner":true,"priority":100}],"user_ctx":""}
[12/26/25 09:54:25.831170] [I] [28] [handle_commit.cxx:769:reconfigure] new config log idx 59074, prev log idx 59073, cur config log idx 0, prev log idx 0 [group=1b36bd51-c679-497a-8f42-92cba8713bd8] 
[12/26/25 09:54:25.833996] [I] [28] [handle_commit.cxx:854:reconfigure] server 402439255 is added to cluster [group=1b36bd51-c679-497a-8f42-92cba8713bd8] |  
[12/26/25 09:54:25.840217] [I] [28] [handle_commit.cxx:854:reconfigure] server 53101537 is added to cluster [group=1b36bd51-c679-497a-8f42-92cba8713bd8] |  
[12/26/25 09:54:25.844525] [I] [28] [handle_commit.cxx:854:reconfigure] server 234785214 is added to cluster [group=1b36bd51-c679-497a-8f42-92cba8713bd8] |  
[12/26/25 09:54:25.844527] [I] [28] [handle_commit.cxx:948:reconfigure] peer 1557217429 cannot be found, no action for removing [group=1b36bd51-c679-497a-8f42-92cba8713bd8] 
add peer 234785214, 9727e555-0988-43c7-bbfb-109a00b7125a, learner, regular [group=1b36bd51-c679-497a-8f42-92cba8713bd8] | my id: 1557217429, leader: 402439255, term: 2005209 [group=1b36bd51-c679-497a-8f42-92cba8713bd8]

T2. We retried the add-member operation. The leader sent join_cluster_req again, but the follower believed it was already in the cluster and returned a response with accept=false. The leader received accept=false and concluded that the follower had rejected the request. We were then trapped in an endless retry loop.
follower:

[12/26/25 09:54:56.029128] [I] [28] [handle_join_leave.cxx:170:handle_join_cluster_req] this server is already in a cluster, ignore the request [group=1b36bd51-c679-497a-8f42-92cba8713bd8]

leader:

[12/26/25 09:54:56.028576] [I] [122] [handle_join_leave.cxx:149:invite_srv_to_join_cluster] sent join request to peer 1557217429, 6a887216-0d1d-4e40-81b3-9ed02222b626 [group=1b36bd51-c679-497a-8f42-92cba8713bd8]
[12/26/25 09:54:56.031] [storage_mgr] [warning] [24] [handle_join_leave.cxx:229:handle_join_cluster_resp] new server (1557217429) cannot accept the invitation, give up [group=1b36bd51-c679-497a-8f42-92cba8713bd8]

Because the follower also saved is_catching_up=true, it skips voting in handle_election_timeout. As a result, it never gets a chance to realize that it is not in the cluster at all.
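To make the stuck state concrete, here is a toy model of the follower after T1. It is only a sketch: `FollowerState` and the three helper functions are hypothetical names, not NuRaft code, and the `> 1` check is assumed from the condition in the handler quoted in this PR.

```cpp
#include <cstddef>

// Toy model of the follower after T1. All names here are illustrative,
// not NuRaft's actual types or members.
struct FollowerState {
    std::size_t servers_in_config;  // number of servers in the saved cluster config
    bool catching_up;               // mirrors the saved is_catching_up flag
};

// Assumed pre-fix behavior: a config with more than one server is treated as
// "already in a cluster", so the join request is answered with accept=false.
bool rejects_join_request(const FollowerState& s) {
    return s.servers_in_config > 1;
}

// As described above: a server that is catching up skips voting on election timeout.
bool skips_election(const FollowerState& s) {
    return s.catching_up;
}

// The follower is stuck when it can neither accept a retried join request
// nor discover through an election that it is not a member.
bool is_stuck(const FollowerState& s) {
    return rejects_join_request(s) && skips_election(s);
}
```

With the state from the follower's logs (three servers saved, catching up), `is_stuck(FollowerState{3, true})` is true: every retry from the leader is rejected and nothing on the follower's side resets it.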

ptr<cluster_config> cur_config = get_config();
if (cur_config->get_servers().size() > 1) {
    p_in("this server is already in a cluster, ignore the request");
    resp->accept(quick_commit_index_.load() + 1);
Contributor

We can't always accept the request. The main purpose of this if condition is to reject requests coming from a different cluster; it should never accept such requests.

Please change the logic so that the carried cluster_config is validated, and the request is accepted only if the cluster_config in the request exactly matches the cluster configuration of this server.

Contributor

Actually the cluster_config carried in the request cannot match exactly. The request should be accepted only if:

req->cluster_config == this->cluster_config - this_server

In other words, please update the logic to accept the request only when the request’s cluster configuration matches the server’s cluster configuration excluding the server itself.
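A minimal sketch of that acceptance rule, assuming configurations are compared by server id only. `ids_without` and `should_accept_join` are hypothetical helpers, not NuRaft APIs; the real change may also need to compare endpoints, learner flags, or config log indexes.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Collect server ids into a set, leaving out `excluded_id`.
std::set<int32_t> ids_without(const std::vector<int32_t>& ids, int32_t excluded_id) {
    std::set<int32_t> out;
    for (int32_t id : ids) {
        if (id != excluded_id) {
            out.insert(id);
        }
    }
    return out;
}

// Hypothetical check: accept the join request only when the server ids carried
// in the request equal this server's saved config minus this server itself.
// Assumption: comparing ids is enough to tell two clusters apart.
bool should_accept_join(const std::vector<int32_t>& req_ids,
                        const std::vector<int32_t>& local_ids,
                        int32_t my_id) {
    std::set<int32_t> req(req_ids.begin(), req_ids.end());
    return req == ids_without(local_ids, my_id);
}
```

In the scenario from the logs, the retried request would carry servers 402439255, 53101537, and 234785214, which matches the config the follower saved at T1, so the retry would be accepted. A request from a different cluster would carry a different id set and would still be rejected.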

Contributor Author

Makes sense, will update it.

@greensky00 greensky00 left a comment

Looks good, thanks!

@greensky00 greensky00 merged commit af42900 into eBay:master Jan 7, 2026
1 check passed