module: Reset peer notification state on normal recoveries #175
Open
eschoeller wants to merge 1 commit into ITRS-Group:master from
handle_notification_data() propagates notification metadata from the sending peer to receivers. For normal recovery notifications, that metadata carries current_notification_number as it existed at NEBTYPE_NOTIFICATION_END: after the recovery notification increment, but before Naemon's local post-recovery reset.

On receiving peers, the recovery check result is replayed through Naemon first, which correctly resets current_notification_number to 0. The later recovery notification packet then overwrites that reset with the sender's stale nonzero value.

That stale counter persists on non-sender peers. If notification ownership later shifts for the same host, the new sender inherits the stale nonzero current_notification_number and Naemon skips the first_notification_delay gate, causing an immediate notification at HARD DOWN.

Context:
- 94f8aab introduced cross-peer notification-state sync for later renotify/escalation.
- e32d4f5 added add_notified_on() so recoveries could be sent from a different node than the one that sent the problem notification.
- Normal recoveries are different because they represent the terminal post-problem state and should not preserve the pre-reset counter.

Fix this by treating NOTIFICATION_NORMAL + STATE_UP/STATE_OK as a reset on receipt: clear current_notification_number and notified_on. Problem-state notifications continue to sync normally.

Ref: ITRS-Group#126
I have a lot more testing to do, but figured I'd get this posted in the meantime.
Summary
Fixes #126. Normal recovery notification packets carry a stale nonzero current_notification_number that overwrites the correct post-recovery reset on receiving peers, causing first_notification_delay to be bypassed on subsequent incidents when sender ownership changes.
Root Cause
When the sender emits a recovery notification:
1. host_notification() increments current_notification_number (e.g., to 2)
2. NEBTYPE_NOTIFICATION_END fires; merlin serializes the counter as 2
3. Naemon sets current_notification_number = 0 locally (handle_host_state())
4. hook_host_result() sends the check result, then flushes the held notification

On receiving peers:
5. Recovery CHECK_DATA replays through Naemon → counter reset to 0 (correct)
6. Recovery NOTIFICATION_DATA arrives → handle_notification_data() overwrites counter to 2 (stale)

On the next incident, if a different peer becomes sender, it has notif_num=2. Naemon's delay gate (check_host_notification_viability()) requires notif_num == 0 to check the delay. With 2, the block is skipped entirely and the notification fires at the exact second of HARD DOWN.
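The sequence on a receiving peer can be modeled in a few lines of C. This is only an illustrative sketch, not the code in naemon-core or merlin: struct sketch_host, every sketch_* function, and the use of a problem start time as the reference point for the delay are assumptions made for the example.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified stand-in for the per-host state discussed above. */
struct sketch_host {
    int current_notification_number; /* notifications sent for the current problem */
    time_t problem_start;            /* assumed reference point for the delay */
    double first_notification_delay; /* configured delay, in minutes */
};

/* Step 5: replaying the recovery check result resets the counter. */
void sketch_replay_recovery_check(struct sketch_host *h)
{
    h->current_notification_number = 0;
}

/* Step 6 (unpatched): the notification packet overwrites the counter
 * with whatever the sender serialized at NEBTYPE_NOTIFICATION_END. */
void sketch_apply_packet_unpatched(struct sketch_host *h, int sender_counter)
{
    h->current_notification_number = sender_counter;
}

/* Simplified delay gate: the delay is only consulted while the counter is 0. */
bool sketch_problem_notification_viable(const struct sketch_host *h, time_t now)
{
    if (h->current_notification_number == 0) {
        time_t earliest = h->problem_start
                        + (time_t)(h->first_notification_delay * 60.0);
        if (now < earliest)
            return false; /* still inside first_notification_delay */
    }
    return true; /* counter != 0: the delay block is skipped entirely */
}
```

With a 5-minute delay, a peer whose counter is 0 is held back at the moment the problem starts, while a peer that has applied the stale packet (counter 2) passes the gate at that same second.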
Additional Impact
In addition to the reproduced first_notification_delay bypass, stale peer notification state can also influence escalation step selection, $NOTIFICATIONNUMBER$ metadata, and recovery-notification eligibility/routing.

This patch resets both current_notification_number and notified_on on normal recoveries, so peers return to the same post-recovery state Naemon maintains locally.
Fix
In handle_notification_data(), when receiving a NOTIFICATION_NORMAL with STATE_UP (hosts) or STATE_OK (services), reset current_notification_number to 0 and clear notified_on instead of preserving the stale counter.

Problem-state notifications continue to sync normally, preserving cross-peer renotify/escalation (94f8aab) and recovery eligibility for active problems (e32d4f5).
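The receive-side rule can be sketched roughly as follows. This is not the patch itself: the struct layouts, the sketch_* names, the constant values (assumed to mirror Naemon's NOTIFICATION_NORMAL and STATE_UP/STATE_OK, all taken to be 0 here), and the bitmask model of notified_on are assumptions made for illustration.

```c
/* Assumed to mirror Naemon's NOTIFICATION_NORMAL and STATE_UP/STATE_OK. */
enum { SKETCH_NOTIFICATION_NORMAL = 0 };
enum { SKETCH_STATE_OK = 0 };

/* Hypothetical receiver-side notification state for a host or service. */
struct sketch_notify_state {
    int current_notification_number;
    unsigned int notified_on; /* assumed: one bit per state already notified on */
};

/* Hypothetical subset of the fields carried in a notification packet. */
struct sketch_notification_packet {
    int reason_type;
    int state;
    int current_notification_number;
};

void sketch_apply_notification(struct sketch_notify_state *obj,
                               const struct sketch_notification_packet *pkt)
{
    if (pkt->reason_type == SKETCH_NOTIFICATION_NORMAL
        && pkt->state == SKETCH_STATE_OK) {
        /* Normal recovery: treat as a reset rather than a sync. */
        obj->current_notification_number = 0;
        obj->notified_on = 0;
        return;
    }

    /* Problem-state notification: keep syncing the sender's values. */
    obj->current_notification_number = pkt->current_notification_number;
    obj->notified_on |= 1u << pkt->state;
}
```

Handling the recovery case first and returning early keeps the existing sync path untouched for problem states, which is the behavior that 94f8aab and e32d4f5 rely on.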
Testing
Reproduced and verified on a local 3-peer cluster (AlmaLinux 8, naemon-core 1.5.1, merlin 2024.10.14) with diagnostic instrumentation (delay_diag in naemon-core's check_host_notification_viability() and notif_diag in merlin's hook_notification()).

The stale state persists indefinitely after recovery (confirmed at 5 minutes post-recovery with no decay). The bypass requires only that sender ownership shifts for the same host, which occurs in production during peer restarts, network events, or active_peers changes.