Move Non-Read/Write Operations to Dedicated Reactor #847

@Besroy

Found a deadlock in the SH test (more details in SH issue#88). Three threads are involved:

  1. Thread 1 (nuraft-reconfigure): During the replace_member process, after removing the old member, nuraft acquires the nuraft lock to trigger a reconfiguration and clears the snapshot_sync_ctx. The cleanup operation requires the current user_snp_ctx to stop, which in turn depends on all pending prefetch blobs being read. However, this operation is blocked, waiting for an I/O reactor to handle the read.

  2. Thread 2 (IO reactor worker 1): This thread calls monitor_replace_member_replication_status, detects that the replace member task is completed, and attempts to reset the quorum size. That attempt blocks on the nuraft lock, which is held by Thread 1. Thread 2 still holds the m_rd_map_mtx mutex while it waits.

  3. Thread 3 (IO reactor worker 2): This thread calls gc_repl_reqs, which attempts to acquire the m_rd_map_mtx mutex held by Thread 2. As a result, Thread 3 is blocked.

Since both I/O reactor threads (Thread 2 and Thread 3) are blocked, no I/O operations can proceed. This prevents Thread 1 from completing the read operation required to release the nuraft lock, leading to a deadlock.
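The wait chain above can be checked with a small model. This is only a sketch under simplifying assumptions: each thread is reduced to a list of threads, any one of which can unblock it, and a thread counts as stuck if nothing it waits on can ever progress. The names find_stuck, reported_scenario, with_dedicated_reactor and "maintenance-reactor" are made up for illustration and are not part of HomeStore, IOMgr or nuraft.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// waits_on[t] lists the threads, any one of which can unblock t.
// An empty list means t is not blocked on anything.
using WaitGraph = std::map<std::string, std::vector<std::string>>;

// Returns the threads that can never make progress in this simplified model.
std::set<std::string> find_stuck(const WaitGraph& waits_on) {
    std::set<std::string> runnable;
    bool changed = true;
    while (changed) { // iterate until no more threads become runnable
        changed = false;
        for (const auto& entry : waits_on) {
            if (runnable.count(entry.first)) continue;
            bool ok = entry.second.empty();
            for (const auto& dep : entry.second) {
                assert(waits_on.count(dep) && "every dependency must be a known thread");
                if (runnable.count(dep)) ok = true;
            }
            if (ok) {
                runnable.insert(entry.first);
                changed = true;
            }
        }
    }
    std::set<std::string> stuck;
    for (const auto& entry : waits_on) {
        if (!runnable.count(entry.first)) stuck.insert(entry.first);
    }
    return stuck;
}

// The situation described in this issue.
WaitGraph reported_scenario() {
    return {
        // Thread 1: holds the nuraft lock, needs any I/O reactor to serve the read.
        {"nuraft-reconfigure", {"io-worker-1", "io-worker-2"}},
        // Thread 2: holds m_rd_map_mtx, needs the nuraft lock from Thread 1.
        {"io-worker-1", {"nuraft-reconfigure"}},
        // Thread 3: needs m_rd_map_mtx from Thread 2.
        {"io-worker-2", {"io-worker-1"}},
    };
}

// Assumed alternative: the monitor task runs on a separate (hypothetical) reactor,
// so both I/O workers stay free to serve reads.
WaitGraph with_dedicated_reactor() {
    return {
        {"nuraft-reconfigure", {"io-worker-1", "io-worker-2"}},
        {"io-worker-1", {}},
        {"io-worker-2", {}},
        {"maintenance-reactor", {"nuraft-reconfigure"}},
    };
}
```

In this model, find_stuck(reported_scenario()) reports all three threads, while find_stuck(with_dedicated_reactor()) reports none: the maintenance thread still waits for the nuraft lock, but only until Thread 1 finishes its read.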

Since monitor_replace_member_replication_status and gc_repl_reqs are not typical write/read operations, should we consider isolating them from the default IOMgr workers? Below are the timers currently using default IOMgr workers:

- m_rdev_gc_timer_hdl: Triggers gc_repl_reqs and gc_repl_devs every minute.
- m_rdev_fetch_timer_hdl: Triggers fetch_pending_data every second.
- m_flush_durable_commit_timer_hdl: Triggers flush_durable_commit_lsn every 500 ms.
- m_replace_member_sync_check_timer_hdl: Triggers monitor_replace_member_replication_status every minute.
- m_res_audit_timer_hdl: Triggers trigger_truncate every 2 minutes.
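To make the proposal concrete, here is a rough sketch of what "a dedicated reactor" could mean, written with the C++ standard library only. DedicatedReactor is a hypothetical stand-in, not an IOMgr class; a real change would presumably use IOMgr's own reactor and timer facilities, which are not shown here. The idea is that the timer callbacks listed above would be posted to this thread, so that a callback blocking on the nuraft lock or m_rd_map_mtx cannot stall a default I/O worker.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a dedicated reactor: one thread draining one FIFO queue.
class DedicatedReactor {
public:
    DedicatedReactor() : m_worker([this] { run(); }) {}

    // Runs every task that was already posted, then joins the worker thread.
    ~DedicatedReactor() {
        {
            std::lock_guard<std::mutex> lg(m_mtx);
            m_stopping = true;
        }
        m_cv.notify_one();
        m_worker.join();
    }

    DedicatedReactor(const DedicatedReactor&) = delete;
    DedicatedReactor& operator=(const DedicatedReactor&) = delete;

    // Queues a task to run on the dedicated thread; never runs it on the caller.
    void post(std::function<void()> task) {
        assert(task && "task must be callable");
        {
            std::lock_guard<std::mutex> lg(m_mtx);
            m_tasks.push_back(std::move(task));
        }
        m_cv.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_mtx);
        for (;;) {
            m_cv.wait(lk, [this] { return m_stopping || !m_tasks.empty(); });
            if (m_tasks.empty()) return; // stopping and fully drained
            auto task = std::move(m_tasks.front());
            m_tasks.pop_front();
            lk.unlock(); // do not hold the queue lock while the task runs
            task();
            lk.lock();
        }
    }

    std::mutex m_mtx;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_tasks;
    bool m_stopping = false;
    std::thread m_worker; // declared last so the queue state exists before the thread starts
};
```

A timer firing on any thread would then only call post() with something like a wrapper around gc_repl_reqs or monitor_replace_member_replication_status, and return immediately. One trade-off of a single dedicated thread is that a long-blocking maintenance task delays the other maintenance tasks queued behind it, though it no longer delays reads and writes.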