
Conversation

@JiaQiTang98

Reopen the peer if it was shut down previously.

I found that raft_server::handle_election_timeout calls raft_server::cancel_schedulers when steps_to_down_ == 0:

cancel_schedulers();

This function will shut down all peers in raft_server::peers_:

for (peer_itor it = peers_.begin(); it != peers_.end(); ++it) {
    const ptr<peer>& p = it->second;
    if (p->get_hb_task()) {
        cancel_task(p->get_hb_task());
    }
    // Shutdown peer to cut off smart pointers.
    p->shutdown();
}

Later, if the node becomes the leader, raft_server::become_leader calls raft_server::enable_hb_for_peer:

enable_hb_for_peer(*pp);

raft_server::enable_hb_for_peer then tries to reschedule peer::hb_task_:

void raft_server::enable_hb_for_peer(peer& p) {
    p.enable_hb(true);
    p.resume_hb_speed();
    p_tr("peer %d, interval: %d\n", p.get_id(), p.get_current_hb_interval());
    schedule_task(p.get_hb_task(), p.get_current_hb_interval());
}

However, peer::hb_task_ was already reset by peer::shutdown, so it is now nullptr, and scheduling it leads to the crash shown below:

2026.01.26 15:36:00.377212 [ 4308615 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2026.01.26 15:36:00.377236 [ 4308615 ] {} <Fatal> BaseDaemon: (version 26.1.1.1, build id: , git hash: a5a2d547eaff3f390c60ad4b424850864ae7c14f, architecture: arm64) (from thread 4308643) Received signal 11
2026.01.26 15:36:00.377245 [ 4308615 ] {} <Fatal> BaseDaemon: Signal description: Segmentation fault: 11
2026.01.26 15:36:00.377247 [ 4308615 ] {} <Fatal> BaseDaemon: Address: 0x10. Access: <not available on Darwin>. Attempted access has violated the permissions assigned to the memory area.
2026.01.26 15:36:00.377257 [ 4308615 ] {} <Fatal> BaseDaemon: Stack trace: 0x00000001073ffee8 0x00000001073fb778 0x0000000196ef3744 0x0000000111f08a20 0x0000000111f95fe8 0x0000000111f6d21c 0x0000000111f91004 0x0000000111f66e1c 0x0000000111eff56c 0x0000000111efeed0 0x0000000111f00af4 0x0000000111f3567c 0x0000000111f3c264 0x0000000111f0e350 0x0000000111f0df10 0x0000000111f079e0 0x0000000111f166b0 0x000000010746c594 0x000000010747934c 0x0000000196ee9c08 0x0000000196ee4ba8
2026.01.26 15:36:00.377258 [ 4308615 ] {} <Fatal> BaseDaemon: ########################################
2026.01.26 15:36:00.377260 [ 4308615 ] {} <Fatal> BaseDaemon: (version 26.1.1.1, build id: , git hash: a5a2d547eaff3f390c60ad4b424850864ae7c14f) (from thread 4308643) (no query) Received signal Segmentation fault: 11 (11)
2026.01.26 15:36:00.377260 [ 4308615 ] {} <Fatal> BaseDaemon: Address: 0x10. Access: <not available on Darwin>. Attempted access has violated the permissions assigned to the memory area.
2026.01.26 15:36:00.377261 [ 4308615 ] {} <Fatal> BaseDaemon: Stack trace: 0x00000001073ffee8 0x00000001073fb778 0x0000000196ef3744 0x0000000111f08a20 0x0000000111f95fe8 0x0000000111f6d21c 0x0000000111f91004 0x0000000111f66e1c 0x0000000111eff56c 0x0000000111efeed0 0x0000000111f00af4 0x0000000111f3567c 0x0000000111f3c264 0x0000000111f0e350 0x0000000111f0df10 0x0000000111f079e0 0x0000000111f166b0 0x000000010746c594 0x000000010747934c 0x0000000196ee9c08 0x0000000196ee4ba8
2026.01.26 15:36:00.430302 [ 4308615 ] {} <Fatal> BaseDaemon: 0. _ZN10StackTraceC1ERK17__darwin_ucontext @ 0x00000001073ffee8
2026.01.26 15:36:00.430315 [ 4308615 ] {} <Fatal> BaseDaemon: 1. _ZL13signalHandleriP9__siginfoPv @ 0x00000001073fb778
2026.01.26 15:36:00.430316 [ 4308615 ] {} <Fatal> BaseDaemon: 2. _sigtramp @ 0x0000000196ef3744
2026.01.26 15:36:00.430323 [ 4308615 ] {} <Fatal> BaseDaemon: 3. _ZN6nuraft12asio_service8scheduleERNSt3__110shared_ptrINS_12delayed_taskEEEi @ 0x0000000111f08a20
2026.01.26 15:36:00.430325 [ 4308615 ] {} <Fatal> BaseDaemon: 4. _ZN6nuraft11raft_server18enable_hb_for_peerERNS_4peerE @ 0x0000000111f95fe8
2026.01.26 15:36:00.430327 [ 4308615 ] {} <Fatal> BaseDaemon: 5. _ZN6nuraft11raft_server13become_leaderEv @ 0x0000000111f6d21c
2026.01.26 15:36:00.430329 [ 4308615 ] {} <Fatal> BaseDaemon: 6. _ZN6nuraft11raft_server16handle_vote_respERNS_8resp_msgE @ 0x0000000111f91004
2026.01.26 15:36:00.430331 [ 4308615 ] {} <Fatal> BaseDaemon: 7. _ZN6nuraft11raft_server16handle_peer_respERNSt3__110shared_ptrINS_8resp_msgEEERNS2_INS_13rpc_exceptionEEE @ 0x0000000111f66e1c
2026.01.26 15:36:00.430333 [ 4308615 ] {} <Fatal> BaseDaemon: 8. _ZN6nuraft10cmd_resultINSt3__110shared_ptrINS_8resp_msgEEENS2_INS_13rpc_exceptionEEEE10set_resultERS4_RS6_NS_15cmd_result_codeE @ 0x0000000111eff56c
2026.01.26 15:36:00.430335 [ 4308615 ] {} <Fatal> BaseDaemon: 9. _ZN6nuraft4peer17handle_rpc_resultENSt3__110shared_ptrIS0_EENS2_INS_10rpc_clientEEERNS2_INS_7req_msgEEERNS2_INS_10cmd_resultINS2_INS_8resp_msgEEENS2_INS_13rpc_exceptionEEEEEEEbmRSB_RSD_ @ 0x0000000111efeed0
2026.01.26 15:36:00.430338 [ 4308615 ] {} <Fatal> BaseDaemon: 10. _ZNSt3__110__function13__policy_funcIFvRNS_10shared_ptrIN6nuraft8resp_msgEEERNS2_INS3_13rpc_exceptionEEEEE11__call_funcB8se210105INS_6__bindIMNS3_4peerEFvNS2_ISE_EENS2_INS3_10rpc_clientEEERNS2_INS3_7req_msgEEERNS2_INS3_10cmd_resultIS5_S8_EEEEbmS6_S9_EJPSE_RSF_RSH_SK_SO_RbRmRKNS_12placeholders4__phILi1EEERKNSX_ILi2EEEEEEEEvPKNS0_16__policy_storageES6_S9_ @ 0x0000000111f00af4
2026.01.26 15:36:00.430340 [ 4308615 ] {} <Fatal> BaseDaemon: 11. _ZN6nuraft15asio_rpc_client13response_readERNSt3__110shared_ptrINS_7req_msgEEERNS1_8functionIFvRNS2_INS_8resp_msgEEERNS2_INS_13rpc_exceptionEEEEEERNS2_INS_6bufferEEENS1_10error_codeEm @ 0x0000000111f3567c
2026.01.26 15:36:00.430347 [ 4308615 ] {} <Fatal> BaseDaemon: 12. _ZN5boost4asio6detail23reactive_socket_recv_opINS0_17mutable_buffers_1ENS1_7read_opINS0_19basic_stream_socketINS0_2ip3tcpENS0_15any_io_executorEEES3_PKNS0_14mutable_bufferENS1_14transfer_all_tENSt3__16__bindIMN6nuraft15asio_rpc_clientEFvRNSE_10shared_ptrINSG_7req_msgEEERNSE_8functionIFvRNSI_INSG_8resp_msgEEERNSI_INSG_13rpc_exceptionEEEEEERNSI_INSG_6bufferEEENSE_10error_codeEmEJRNSI_ISH_EESL_SV_SY_RKNSE_12placeholders4__phILi1EEERKNS15_ILi2EEEEEEEES8_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm @ 0x0000000111f3c264
2026.01.26 15:36:00.430349 [ 4308615 ] {} <Fatal> BaseDaemon: 13. _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE @ 0x0000000111f0e350
2026.01.26 15:36:00.430351 [ 4308615 ] {} <Fatal> BaseDaemon: 14. _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE @ 0x0000000111f0df10
2026.01.26 15:36:00.430353 [ 4308615 ] {} <Fatal> BaseDaemon: 15. _ZN6nuraft17asio_service_impl12worker_entryEv @ 0x0000000111f079e0
2026.01.26 15:36:00.430355 [ 4308615 ] {} <Fatal> BaseDaemon: 16. _ZZN24ThreadFromGlobalPoolImplILb1ELb1EEC1INSt3__16__bindIMN6nuraft17asio_service_implEFvvEJPS5_EEEJEEEOT_DpOT0_ENUlvE_clEv @ 0x0000000111f166b0
2026.01.26 15:36:00.430356 [ 4308615 ] {} <Fatal> BaseDaemon: 17. _ZN14ThreadPoolImplINSt3__16threadEE20ThreadFromThreadPool6workerEv @ 0x000000010746c594
2026.01.26 15:36:00.430358 [ 4308615 ] {} <Fatal> BaseDaemon: 18. _ZNSt3__114__thread_proxyB8se210105INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN14ThreadPoolImplINS_6threadEE20ThreadFromThreadPoolEFvvEPSA_EEEEEPvSF_ @ 0x000000010747934c
2026.01.26 15:36:00.430360 [ 4308615 ] {} <Fatal> BaseDaemon: 19. _pthread_start @ 0x0000000196ee9c08
2026.01.26 15:36:00.430364 [ 4308615 ] {} <Fatal> BaseDaemon: 20. thread_start @ 0x0000000196ee4ba8
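
For context, here is a minimal defensive sketch of the crashing spot, using only the accessors already quoted above; the early-return guard is an illustrative addition, not the change made by this PR:

void raft_server::enable_hb_for_peer(peer& p) {
    p.enable_hb(true);
    p.resume_hb_speed();
    if (!p.get_hb_task()) {
        // peer::shutdown() already released hb_task_ (it was cancelled and
        // reset by cancel_schedulers() on election timeout), so scheduling
        // it would hand a null task to the asio service and crash here.
        return;
    }
    p_tr("peer %d, interval: %d\n", p.get_id(), p.get_current_hb_interval());
    schedule_task(p.get_hb_task(), p.get_current_hb_interval());
}

A guard like this only avoids the segfault; this PR instead reopens the peer that was shut down, so that heartbeats can actually resume once the node becomes leader.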

@JiaQiTang98
Author

Hi @antonio2368, could you help review this PR? Thanks!

antonio2368 self-assigned this Jan 26, 2026
@antonio2368
Member

Thanks for the submission!
I'm trying to understand: how did a node that was removed from the cluster become a leader?

@JiaQiTang98
Author

> I'm trying to understand: how did a node that was removed from the cluster become a leader?

Actually, the node was not removed from the cluster.
I reproduced this issue with the following steps:

  1. Create a 3-node cluster.
  2. Add a new node to the cluster's config.xml, and reduce election_timeout_lower_bound_ms and election_timeout_upper_bound_ms so that raft_server::handle_election_timeout triggers easily (a config sketch follows the snippet below).
  3. Start the new node; it joins the cluster, executes raft_server::reconfigure, and then calls raft_server::restart_election_timer, whose timer invokes raft_server::handle_election_timeout when a timeout occurs:

if (id_ == (*it)->get_id()) {
    my_priority_ = (*it)->get_priority();
    im_learner_ = (*it)->is_learner();
    steps_to_down_ = 0;
    if (!(*it)->is_new_joiner() &&
        role_ == srv_role::follower &&
        state_->is_catching_up()) {
        // Except for new joiner type, if this server is added
        // to the cluster config, that means the sync is done.
        // Start election timer without waiting for
        // the next append_entries message.
        //
        // If this server is a new joiner, `catching_up_` flag
        // will be cleared when it becomes a regular member,
        // that is also notified by a new cluster config.
        p_in("now this node is the part of cluster, "
             "catch-up process is done, clearing the flag");
        state_->set_catching_up(false);
        ctx_->state_mgr_->save_state(*state_);
        restart_election_timer();
    }
}
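
For step 2, a fragment along these lines can be used in the new node's config.xml (this assumes ClickHouse Keeper's coordination_settings section; the timeout values and the server entry are purely illustrative, not the exact ones used to reproduce):

<keeper_server>
    <coordination_settings>
        <!-- Lowered so that the election timer fires quickly. -->
        <election_timeout_lower_bound_ms>200</election_timeout_lower_bound_ms>
        <election_timeout_upper_bound_ms>400</election_timeout_upper_bound_ms>
    </coordination_settings>
    <raft_configuration>
        <!-- Entries for the three existing servers are omitted; the new node: -->
        <server>
            <id>4</id>
            <hostname>node4</hostname>
            <port>9234</port>
        </server>
    </raft_configuration>
</keeper_server>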
