module: Reset peer notification state on normal recoveries by eschoeller · Pull Request #175 · ITRS-Group/monitor-merlin

eschoeller · 2026-04-04T00:59:00Z

Summary

Fixes #126. Normal recovery notification packets carry a stale nonzero
current_notification_number that overwrites the correct post-recovery
reset on receiving peers, causing first_notification_delay to be
bypassed on subsequent incidents when sender ownership changes.

Root Cause

When the sender emits a recovery notification:

1. host_notification() increments current_notification_number (e.g., to 2).
2. NEBTYPE_NOTIFICATION_END fires, and merlin serializes the counter as 2.
3. Naemon resets current_notification_number = 0 locally (handle_host_state()).
4. hook_host_result() sends the check result, then flushes the held notification.

On receiving peers:
5. Recovery CHECK_DATA replays through Naemon → counter reset to 0 (correct)
6. Recovery NOTIFICATION_DATA arrives → handle_notification_data() overwrites counter to 2 (stale)

On the next incident, if a different peer becomes the sender, it starts with
current_notification_number = 2. Naemon's delay gate in
check_host_notification_viability() only evaluates first_notification_delay when
current_notification_number == 0. With a value of 2, that block is skipped
entirely and the notification fires at the exact second of HARD DOWN.
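The ordering in steps 5 and 6 can be replayed in a toy model. This is a minimal sketch in Python, not merlin's or Naemon's C code: `HostState`, `replay_recovery_check`, `apply_notification_packet`, and `delay_gate_is_evaluated` are hypothetical names, and the gate is reduced to the single "counter must be 0" condition described in this PR.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Toy stand-in for the per-host notification state a peer keeps."""
    current_notification_number: int = 0


def replay_recovery_check(host: HostState) -> None:
    # Step 5: the recovery check result replays through Naemon,
    # which resets the counter on the receiving peer.
    host.current_notification_number = 0


def apply_notification_packet(host: HostState, packet_counter: int) -> None:
    # Step 6 (pre-fix behaviour): copy the sender's serialized counter verbatim.
    host.current_notification_number = packet_counter


def delay_gate_is_evaluated(host: HostState) -> bool:
    # Simplified condition: first_notification_delay is only
    # checked when the counter is 0.
    return host.current_notification_number == 0


# Counter as captured on the sender at NEBTYPE_NOTIFICATION_END,
# i.e. before the sender's own post-recovery reset.
serialized_counter = 2

peer = HostState(current_notification_number=1)  # synced during the problem
replay_recovery_check(peer)
after_check = peer.current_notification_number
apply_notification_packet(peer, serialized_counter)
after_packet = peer.current_notification_number

print(after_check, after_packet, delay_gate_is_evaluated(peer))
# prints: 0 2 False
```

The final `False` is the bypass: the peer that later takes over as sender never evaluates the delay.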

Additional Impact

In addition to the reproduced first_notification_delay bypass, stale peer
notification state can also influence escalation step selection,
$NOTIFICATIONNUMBER$ metadata, and recovery-notification eligibility/routing.
This patch resets both current_notification_number and notified_on on normal
recoveries, so peers return to the same post-recovery state Naemon maintains locally.
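To illustrate the escalation point, here is a small Python sketch of how a stale counter could shift which escalation applies to the first notification of a new incident. It assumes a simplified reading of the first_notification/last_notification range rule (with 0 meaning no upper bound). The `Escalation` class, the two escalation entries, and the helper names are hypothetical and not taken from this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Escalation:
    name: str
    first_notification: int
    last_notification: int  # 0 means "no upper bound" in this sketch


def matching_escalations(escalations: List[Escalation], notification_number: int) -> List[str]:
    """Return the names of escalations whose range covers notification_number."""
    matches = []
    for esc in escalations:
        if esc.first_notification > notification_number:
            continue  # escalation has not started yet
        if esc.last_notification != 0 and esc.last_notification < notification_number:
            continue  # escalation has already ended
        matches.append(esc.name)
    return matches


def next_problem_notification_number(current_notification_number: int) -> int:
    # A problem notification increments the counter before it is sent.
    return current_notification_number + 1


# Hypothetical escalation setup, for illustration only.
escalations = [
    Escalation("oncall", first_notification=1, last_notification=2),
    Escalation("managers", first_notification=3, last_notification=0),
]

clean = matching_escalations(escalations, next_problem_notification_number(0))
stale = matching_escalations(escalations, next_problem_notification_number(2))
print(clean, stale)
# prints: ['oncall'] ['managers']
```

Under these assumptions, a peer holding a stale counter of 2 would treat the first notification of the next incident as notification 3 and route it to the later escalation step.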

Fix

In handle_notification_data(), when receiving a NOTIFICATION_NORMAL with
STATE_UP (hosts) or STATE_OK (services), reset current_notification_number
to 0 and clear notified_on instead of preserving the stale counter.

Problem-state notifications continue to sync normally, preserving cross-peer
renotify/escalation (94f8aab) and recovery eligibility for active problems (e32d4f5).
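The receive-side rule can be sketched as follows. This is a Python model of the decision only, not the actual C patch: `PeerObject`, `NotificationPacket`, its `notification_number` field, and `apply_notification_data_sketch` are hypothetical names, the numeric constants are illustrative, and modelling notified_on as a bitmask with one bit per state is an assumption.

```python
from dataclasses import dataclass

# Illustrative values for this sketch.
NOTIFICATION_NORMAL = 0
STATE_UP = 0    # hosts; STATE_OK for services is treated the same way here
STATE_DOWN = 1


@dataclass
class PeerObject:
    current_notification_number: int = 0
    notified_on: int = 0  # bitmask of states already notified on (assumption)


@dataclass
class NotificationPacket:
    reason_type: int
    state: int
    notification_number: int  # hypothetical name for the serialized counter


def apply_notification_data_sketch(obj: PeerObject, pkt: NotificationPacket) -> None:
    is_normal_recovery = (
        pkt.reason_type == NOTIFICATION_NORMAL and pkt.state == STATE_UP
    )
    if is_normal_recovery:
        # Terminal post-problem state: mirror the reset Naemon performs locally.
        obj.current_notification_number = 0
        obj.notified_on = 0
        return
    # Problem-state notifications keep syncing the sender's state.
    obj.current_notification_number = pkt.notification_number
    obj.notified_on |= 1 << pkt.state


peer = PeerObject()
apply_notification_data_sketch(peer, NotificationPacket(NOTIFICATION_NORMAL, STATE_DOWN, 1))
during_problem = (peer.current_notification_number, peer.notified_on)
apply_notification_data_sketch(peer, NotificationPacket(NOTIFICATION_NORMAL, STATE_UP, 2))
after_recovery = (peer.current_notification_number, peer.notified_on)

print(during_problem, after_recovery)
# prints: (1, 2) (0, 0)
```

The recovery packet still carries the stale value 2, but the receiver no longer stores it.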

Testing

Reproduced and verified on a local 3-peer cluster (AlmaLinux 8, naemon-core
1.5.1, merlin 2024.10.14) with diagnostic instrumentation (delay_diag in
naemon-core's check_host_notification_viability() and notif_diag in
merlin's hook_notification()):

| Test | Incident 1 sender | Incident 2 sender | notif_num | delay_diag | Delta | Bypass? |
|---|---|---|---|---|---|---|
| Before fix (owner shift) | local01 | local03 | 2 | SKIPPED | 0s | YES |
| Before fix (owner shift) | local03 | local01 | 2 | SKIPPED | 0s | YES |
| Before fix (5min gap shift) | local01 | local02 | 2 | SKIPPED | 0s | YES |
| After fix (owner shift) | local03 | local01 | 0 | BLOCKED→PASSED | 120s | no |
| Negative control (no shift) | local01 | local01 | 0 | BLOCKED→PASSED | 120s | no |

The stale state persists indefinitely after recovery (confirmed at 5 minutes
post-recovery with no decay). The bypass requires only that sender ownership
shifts for the same host, which occurs in production during peer restarts,
network events, or active_peers changes.

handle_notification_data() propagates notification metadata from the sending peer to receivers. For normal recovery notifications, that metadata carries current_notification_number as it existed at NEBTYPE_NOTIFICATION_END: after the recovery notification increment, but before Naemon's local post-recovery reset.

On receiving peers, the recovery check result is replayed through Naemon first, which correctly resets current_notification_number to 0. The later recovery notification packet then overwrites that reset with the sender's stale nonzero value.

That stale counter persists on non-sender peers. If notification ownership later shifts for the same host, the new sender inherits the stale nonzero current_notification_number and Naemon skips the first_notification_delay gate, causing an immediate notification at HARD DOWN.

Context:
- 94f8aab introduced cross-peer notification-state sync for later renotify/escalation.
- e32d4f5 added add_notified_on() so recoveries could be sent from a different node than the one that sent the problem notification.
- Normal recoveries are different because they represent the terminal post-problem state and should not preserve the pre-reset counter.

Fix this by treating NOTIFICATION_NORMAL + STATE_UP/STATE_OK as a reset on receipt: clear current_notification_number and notified_on. Problem-state notifications continue to sync normally.

Ref: ITRS-Group#126

eschoeller · 2026-04-04T01:07:35Z

I have a lot more testing to do, but figured I'd get this posted in the meantime.

eschoeller wants to merge 1 commit into ITRS-Group:master from eschoeller:fix-recovery-notification-stale-counter