
Conversation

@JiaQiTang98

Reopen the peer if it was shut down previously.

I found that raft_server::handle_election_timeout calls raft_server::cancel_schedulers when steps_to_down_ == 0:

cancel_schedulers();

This function will shut down all peers in raft_server::peers_:

for (peer_itor it = peers_.begin(); it != peers_.end(); ++it) {
    const ptr<peer>& p = it->second;
    if (p->get_hb_task()) {
        cancel_task(p->get_hb_task());
    }
    // Shutdown peer to cut off smart pointers.
    p->shutdown();
}

Later, if the node becomes the leader, raft_server::become_leader calls raft_server::enable_hb_for_peer:

enable_hb_for_peer(*pp);

raft_server::enable_hb_for_peer then tries to reschedule peer::hb_task_:

void raft_server::enable_hb_for_peer(peer& p) {
    p.enable_hb(true);
    p.resume_hb_speed();
    p_tr("peer %d, interval: %d\n", p.get_id(), p.get_current_hb_interval());
    schedule_task(p.get_hb_task(), p.get_current_hb_interval());
}

However, peer::hb_task_ was already reset by peer::shutdown, so it is now nullptr, and scheduling it leads to the crash shown below:

2026.01.26 15:36:00.377212 [ 4308615 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2026.01.26 15:36:00.377236 [ 4308615 ] {} <Fatal> BaseDaemon: (version 26.1.1.1, build id: , git hash: a5a2d547eaff3f390c60ad4b424850864ae7c14f, architecture: arm64) (from thread 4308643) Received signal 11
2026.01.26 15:36:00.377245 [ 4308615 ] {} <Fatal> BaseDaemon: Signal description: Segmentation fault: 11
2026.01.26 15:36:00.377247 [ 4308615 ] {} <Fatal> BaseDaemon: Address: 0x10. Access: <not available on Darwin>. Attempted access has violated the permissions assigned to the memory area.
2026.01.26 15:36:00.377257 [ 4308615 ] {} <Fatal> BaseDaemon: Stack trace: 0x00000001073ffee8 0x00000001073fb778 0x0000000196ef3744 0x0000000111f08a20 0x0000000111f95fe8 0x0000000111f6d21c 0x0000000111f91004 0x0000000111f66e1c 0x0000000111eff56c 0x0000000111efeed0 0x0000000111f00af4 0x0000000111f3567c 0x0000000111f3c264 0x0000000111f0e350 0x0000000111f0df10 0x0000000111f079e0 0x0000000111f166b0 0x000000010746c594 0x000000010747934c 0x0000000196ee9c08 0x0000000196ee4ba8
2026.01.26 15:36:00.377258 [ 4308615 ] {} <Fatal> BaseDaemon: ########################################
2026.01.26 15:36:00.377260 [ 4308615 ] {} <Fatal> BaseDaemon: (version 26.1.1.1, build id: , git hash: a5a2d547eaff3f390c60ad4b424850864ae7c14f) (from thread 4308643) (no query) Received signal Segmentation fault: 11 (11)
2026.01.26 15:36:00.377260 [ 4308615 ] {} <Fatal> BaseDaemon: Address: 0x10. Access: <not available on Darwin>. Attempted access has violated the permissions assigned to the memory area.
2026.01.26 15:36:00.377261 [ 4308615 ] {} <Fatal> BaseDaemon: Stack trace: 0x00000001073ffee8 0x00000001073fb778 0x0000000196ef3744 0x0000000111f08a20 0x0000000111f95fe8 0x0000000111f6d21c 0x0000000111f91004 0x0000000111f66e1c 0x0000000111eff56c 0x0000000111efeed0 0x0000000111f00af4 0x0000000111f3567c 0x0000000111f3c264 0x0000000111f0e350 0x0000000111f0df10 0x0000000111f079e0 0x0000000111f166b0 0x000000010746c594 0x000000010747934c 0x0000000196ee9c08 0x0000000196ee4ba8
2026.01.26 15:36:00.430302 [ 4308615 ] {} <Fatal> BaseDaemon: 0. _ZN10StackTraceC1ERK17__darwin_ucontext @ 0x00000001073ffee8
2026.01.26 15:36:00.430315 [ 4308615 ] {} <Fatal> BaseDaemon: 1. _ZL13signalHandleriP9__siginfoPv @ 0x00000001073fb778
2026.01.26 15:36:00.430316 [ 4308615 ] {} <Fatal> BaseDaemon: 2. _sigtramp @ 0x0000000196ef3744
2026.01.26 15:36:00.430323 [ 4308615 ] {} <Fatal> BaseDaemon: 3. _ZN6nuraft12asio_service8scheduleERNSt3__110shared_ptrINS_12delayed_taskEEEi @ 0x0000000111f08a20
2026.01.26 15:36:00.430325 [ 4308615 ] {} <Fatal> BaseDaemon: 4. _ZN6nuraft11raft_server18enable_hb_for_peerERNS_4peerE @ 0x0000000111f95fe8
2026.01.26 15:36:00.430327 [ 4308615 ] {} <Fatal> BaseDaemon: 5. _ZN6nuraft11raft_server13become_leaderEv @ 0x0000000111f6d21c
2026.01.26 15:36:00.430329 [ 4308615 ] {} <Fatal> BaseDaemon: 6. _ZN6nuraft11raft_server16handle_vote_respERNS_8resp_msgE @ 0x0000000111f91004
2026.01.26 15:36:00.430331 [ 4308615 ] {} <Fatal> BaseDaemon: 7. _ZN6nuraft11raft_server16handle_peer_respERNSt3__110shared_ptrINS_8resp_msgEEERNS2_INS_13rpc_exceptionEEE @ 0x0000000111f66e1c
2026.01.26 15:36:00.430333 [ 4308615 ] {} <Fatal> BaseDaemon: 8. _ZN6nuraft10cmd_resultINSt3__110shared_ptrINS_8resp_msgEEENS2_INS_13rpc_exceptionEEEE10set_resultERS4_RS6_NS_15cmd_result_codeE @ 0x0000000111eff56c
2026.01.26 15:36:00.430335 [ 4308615 ] {} <Fatal> BaseDaemon: 9. _ZN6nuraft4peer17handle_rpc_resultENSt3__110shared_ptrIS0_EENS2_INS_10rpc_clientEEERNS2_INS_7req_msgEEERNS2_INS_10cmd_resultINS2_INS_8resp_msgEEENS2_INS_13rpc_exceptionEEEEEEEbmRSB_RSD_ @ 0x0000000111efeed0
2026.01.26 15:36:00.430338 [ 4308615 ] {} <Fatal> BaseDaemon: 10. _ZNSt3__110__function13__policy_funcIFvRNS_10shared_ptrIN6nuraft8resp_msgEEERNS2_INS3_13rpc_exceptionEEEEE11__call_funcB8se210105INS_6__bindIMNS3_4peerEFvNS2_ISE_EENS2_INS3_10rpc_clientEEERNS2_INS3_7req_msgEEERNS2_INS3_10cmd_resultIS5_S8_EEEEbmS6_S9_EJPSE_RSF_RSH_SK_SO_RbRmRKNS_12placeholders4__phILi1EEERKNSX_ILi2EEEEEEEEvPKNS0_16__policy_storageES6_S9_ @ 0x0000000111f00af4
2026.01.26 15:36:00.430340 [ 4308615 ] {} <Fatal> BaseDaemon: 11. _ZN6nuraft15asio_rpc_client13response_readERNSt3__110shared_ptrINS_7req_msgEEERNS1_8functionIFvRNS2_INS_8resp_msgEEERNS2_INS_13rpc_exceptionEEEEEERNS2_INS_6bufferEEENS1_10error_codeEm @ 0x0000000111f3567c
2026.01.26 15:36:00.430347 [ 4308615 ] {} <Fatal> BaseDaemon: 12. _ZN5boost4asio6detail23reactive_socket_recv_opINS0_17mutable_buffers_1ENS1_7read_opINS0_19basic_stream_socketINS0_2ip3tcpENS0_15any_io_executorEEES3_PKNS0_14mutable_bufferENS1_14transfer_all_tENSt3__16__bindIMN6nuraft15asio_rpc_clientEFvRNSE_10shared_ptrINSG_7req_msgEEERNSE_8functionIFvRNSI_INSG_8resp_msgEEERNSI_INSG_13rpc_exceptionEEEEEERNSI_INSG_6bufferEEENSE_10error_codeEmEJRNSI_ISH_EESL_SV_SY_RKNSE_12placeholders4__phILi1EEERKNS15_ILi2EEEEEEEES8_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm @ 0x0000000111f3c264
2026.01.26 15:36:00.430349 [ 4308615 ] {} <Fatal> BaseDaemon: 13. _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE @ 0x0000000111f0e350
2026.01.26 15:36:00.430351 [ 4308615 ] {} <Fatal> BaseDaemon: 14. _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE @ 0x0000000111f0df10
2026.01.26 15:36:00.430353 [ 4308615 ] {} <Fatal> BaseDaemon: 15. _ZN6nuraft17asio_service_impl12worker_entryEv @ 0x0000000111f079e0
2026.01.26 15:36:00.430355 [ 4308615 ] {} <Fatal> BaseDaemon: 16. _ZZN24ThreadFromGlobalPoolImplILb1ELb1EEC1INSt3__16__bindIMN6nuraft17asio_service_implEFvvEJPS5_EEEJEEEOT_DpOT0_ENUlvE_clEv @ 0x0000000111f166b0
2026.01.26 15:36:00.430356 [ 4308615 ] {} <Fatal> BaseDaemon: 17. _ZN14ThreadPoolImplINSt3__16threadEE20ThreadFromThreadPool6workerEv @ 0x000000010746c594
2026.01.26 15:36:00.430358 [ 4308615 ] {} <Fatal> BaseDaemon: 18. _ZNSt3__114__thread_proxyB8se210105INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN14ThreadPoolImplINS_6threadEE20ThreadFromThreadPoolEFvvEPSA_EEEEEPvSF_ @ 0x000000010747934c
2026.01.26 15:36:00.430360 [ 4308615 ] {} <Fatal> BaseDaemon: 19. _pthread_start @ 0x0000000196ee9c08
2026.01.26 15:36:00.430364 [ 4308615 ] {} <Fatal> BaseDaemon: 20. thread_start @ 0x0000000196ee4ba8
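
For context, here is a minimal defensive sketch of the crashing spot, using only the accessors already quoted above; the early-return guard is an illustrative addition, not the change made by this PR:

void raft_server::enable_hb_for_peer(peer& p) {
    p.enable_hb(true);
    p.resume_hb_speed();
    if (!p.get_hb_task()) {
        // peer::shutdown() already released hb_task_ (it was cancelled and
        // reset by cancel_schedulers() on election timeout), so scheduling
        // it would hand a null task to the asio service and crash here.
        return;
    }
    p_tr("peer %d, interval: %d\n", p.get_id(), p.get_current_hb_interval());
    schedule_task(p.get_hb_task(), p.get_current_hb_interval());
}

A guard like this only avoids the segfault; this PR instead reopens the peer that was shut down, so that heartbeats can actually resume once the node becomes leader.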

@JiaQiTang98
Author

Hi @antonio2368, could you help review this PR? Thanks!

antonio2368 self-assigned this Jan 26, 2026
@antonio2368
Member

Thanks for the submission!
I'm trying to understand: how did a node that was removed from the cluster become a leader?

@JiaQiTang98
Author

> I'm trying to understand: how did a node that was removed from the cluster become a leader?

Actually, the node was not removed from the cluster.
I reproduced this issue with the following steps:

  1. Create a 3-node cluster.
  2. Add a new node to the cluster's config.xml, and reduce election_timeout_lower_bound_ms and election_timeout_upper_bound_ms so that raft_server::handle_election_timeout triggers easily (a config sketch follows the snippet below).
  3. Start the new node; it joins the cluster, executes raft_server::reconfigure, and then calls raft_server::restart_election_timer, whose timer invokes raft_server::handle_election_timeout when a timeout occurs:

if (id_ == (*it)->get_id()) {
    my_priority_ = (*it)->get_priority();
    im_learner_ = (*it)->is_learner();
    steps_to_down_ = 0;
    if (!(*it)->is_new_joiner() &&
        role_ == srv_role::follower &&
        state_->is_catching_up()) {
        // Except for new joiner type, if this server is added
        // to the cluster config, that means the sync is done.
        // Start election timer without waiting for
        // the next append_entries message.
        //
        // If this server is a new joiner, `catching_up_` flag
        // will be cleared when it becomes a regular member,
        // that is also notified by a new cluster config.
        p_in("now this node is the part of cluster, "
             "catch-up process is done, clearing the flag");
        state_->set_catching_up(false);
        ctx_->state_mgr_->save_state(*state_);
        restart_election_timer();
    }
}
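
For step 2, a fragment along these lines can be used in the new node's config.xml (this assumes ClickHouse Keeper's coordination_settings section; the timeout values and the server entry are purely illustrative, not the exact ones used to reproduce):

<keeper_server>
    <coordination_settings>
        <!-- Lowered so that the election timer fires quickly. -->
        <election_timeout_lower_bound_ms>200</election_timeout_lower_bound_ms>
        <election_timeout_upper_bound_ms>400</election_timeout_upper_bound_ms>
    </coordination_settings>
    <raft_configuration>
        <!-- Entries for the three existing servers are omitted; the new node: -->
        <server>
            <id>4</id>
            <hostname>node4</hostname>
            <port>9234</port>
        </server>
    </raft_configuration>
</keeper_server>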
