module: Reset peer notification state on normal recoveries #175

Open
eschoeller wants to merge 1 commit into ITRS-Group:master from eschoeller:fix-recovery-notification-stale-counter

@eschoeller eschoeller commented Apr 4, 2026

Summary

Fixes #126. Normal recovery notification packets carry a stale nonzero
current_notification_number that overwrites the correct post-recovery
reset on receiving peers, causing first_notification_delay to be
bypassed on subsequent incidents when sender ownership changes.

Root Cause

When the sender emits a recovery notification:

  1. host_notification() increments current_notification_number (e.g., to 2)
  2. NEBTYPE_NOTIFICATION_END fires — merlin serializes the counter as 2
  3. Naemon resets current_notification_number = 0 locally (handle_host_state())
  4. hook_host_result() sends the check result, then flushes the held notification

On receiving peers:

  5. Recovery CHECK_DATA replays through Naemon → counter reset to 0 (correct)
  6. Recovery NOTIFICATION_DATA arrives → handle_notification_data() overwrites counter to 2 (stale)
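
The ordering in steps 1 to 6 can be sketched as a small C model. This is an illustration only: `toy_host`, `sender_emit_recovery()` and `receiver_apply_unpatched()` are hypothetical names, not Naemon or merlin code, and the struct keeps only the one counter the steps discuss.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the single host field this sketch needs. */
struct toy_host {
    int current_notification_number;
};

/* Sender side (steps 1-4): returns the counter value captured in the packet. */
int sender_emit_recovery(struct toy_host *sender)
{
    assert(sender != NULL);
    sender->current_notification_number++;                 /* step 1: increment */
    int serialized = sender->current_notification_number;  /* step 2: snapshot at NOTIFICATION_END */
    sender->current_notification_number = 0;               /* step 3: local post-recovery reset */
    return serialized;                                     /* step 4: held packet is flushed */
}

/* Receiver side (steps 5-6), modelling the behaviour before this patch. */
void receiver_apply_unpatched(struct toy_host *peer, int serialized)
{
    assert(peer != NULL);
    peer->current_notification_number = 0;           /* step 5: check-result replay resets */
    peer->current_notification_number = serialized;  /* step 6: notification packet overwrites */
}
```

Starting both nodes at 1 (one problem notification already sent), `sender_emit_recovery()` returns 2 and leaves the sender at 0, while `receiver_apply_unpatched()` leaves the peer at 2.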

On the next incident, if a different peer becomes the sender, it starts with
current_notification_number (notif_num) = 2. Naemon's delay gate in
check_host_notification_viability() applies first_notification_delay only when
notif_num == 0; with 2, the delay check is skipped entirely and the notification
fires at the exact second of HARD DOWN.
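
A reduced model of the delay gate shows why a nonzero counter bypasses the delay. This is a sketch under assumptions, not the actual body of check_host_notification_viability(): `toy_host`, `toy_notification_viable()` and the `problem_start` field are hypothetical, and the delay is expressed directly in seconds for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a host; all times are plain seconds. */
struct toy_host {
    int  current_notification_number;
    long problem_start;            /* when the host went HARD DOWN */
    long first_notification_delay; /* assumed to be in seconds here */
};

/* Simplified gate: the delay is only consulted while no notification
 * has been counted for the current problem. */
bool toy_notification_viable(const struct toy_host *h, long now)
{
    assert(h != NULL);
    if (h->current_notification_number == 0 &&
        now < h->problem_start + h->first_notification_delay)
        return false; /* still inside the delay window */
    return true;      /* counter nonzero, or delay elapsed */
}
```

With a 120-second delay, a host whose counter is 0 is blocked at the moment of HARD DOWN and passes 120 seconds later; a host carrying a stale counter of 2 passes immediately.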

Additional Impact

In addition to the reproduced first_notification_delay bypass, stale peer
notification state can also influence escalation step selection,
$NOTIFICATIONNUMBER$ metadata, and recovery-notification eligibility/routing.
This patch resets both current_notification_number and notified_on on normal
recoveries, so peers return to the same post-recovery state Naemon maintains locally.

Fix

In handle_notification_data(), when receiving a NOTIFICATION_NORMAL with
STATE_UP (hosts) or STATE_OK (services), reset current_notification_number
to 0 and clear notified_on instead of preserving the stale counter.

Problem-state notifications continue to sync normally, preserving cross-peer
renotify/escalation (94f8aab) and recovery eligibility for active problems (e32d4f5).
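
The receive-side rule described above might look roughly like the following. This is a minimal sketch, not the patch itself: the `toy_` types, constants and the bitmask layout of `notified_on` are assumptions made for illustration, and the real handle_notification_data() operates on Naemon's host and service structures.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in constants; the values are assumptions for this sketch. */
enum { TOY_NOTIFICATION_NORMAL = 0, TOY_NOTIFICATION_ACKNOWLEDGEMENT = 1 };
enum { TOY_STATE_UP = 0, TOY_STATE_DOWN = 1 };

struct toy_host {
    int      current_notification_number;
    unsigned notified_on; /* assumed: one bit per state already notified */
};

struct toy_packet {
    int notification_type;
    int state;
    int current_notification_number; /* sender's value at NOTIFICATION_END */
};

void toy_handle_notification_data(struct toy_host *h, const struct toy_packet *p)
{
    assert(h != NULL && p != NULL);
    if (p->notification_type == TOY_NOTIFICATION_NORMAL &&
        p->state == TOY_STATE_UP) {
        /* Normal recovery: mirror the post-recovery reset, ignore the stale counter. */
        h->current_notification_number = 0;
        h->notified_on = 0;
        return;
    }
    /* Problem-state notification: keep syncing peer state as before. */
    h->current_notification_number = p->current_notification_number;
    h->notified_on |= 1u << p->state;
}
```

In this model a recovery packet carrying counter 2 leaves the receiver at 0 with `notified_on` cleared, while a DOWN packet carrying counter 3 is still copied through and sets the DOWN bit.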

Testing

Reproduced and verified on a local 3-peer cluster (AlmaLinux 8, naemon-core
1.5.1, merlin 2024.10.14) with diagnostic instrumentation (delay_diag in
naemon-core's check_host_notification_viability() and notif_diag in
merlin's hook_notification()):

| Test | Inc1 Sender | Inc2 Sender | notif_num | delay_diag | Delta | Bypass? |
|---|---|---|---|---|---|---|
| Before fix (owner shift) | local01 | local03 | 2 | SKIPPED | 0s | YES |
| Before fix (owner shift) | local03 | local01 | 2 | SKIPPED | 0s | YES |
| Before fix (5min gap shift) | local01 | local02 | 2 | SKIPPED | 0s | YES |
| After fix (owner shift) | local03 | local01 | 0 | BLOCKED→PASSED | 120s | no |
| Negative control (no shift) | local01 | local01 | 0 | BLOCKED→PASSED | 120s | no |

The stale state persists indefinitely after recovery (confirmed at 5 minutes
post-recovery with no decay). The bypass requires only that sender ownership
shifts for the same host, which occurs in production during peer restarts,
network events, or active_peers changes.

handle_notification_data() propagates notification metadata from the
sending peer to receivers. For normal recovery notifications, that
metadata carries current_notification_number as it existed at
NEBTYPE_NOTIFICATION_END: after the recovery notification increment,
but before Naemon's local post-recovery reset.

On receiving peers, the recovery check result is replayed through
Naemon first, which correctly resets current_notification_number to 0.
The later recovery notification packet then overwrites that reset with
the sender's stale nonzero value.

That stale counter persists on non-sender peers. If notification
ownership later shifts for the same host, the new sender inherits the
stale nonzero current_notification_number and Naemon skips the
first_notification_delay gate, causing an immediate notification at
HARD DOWN.

Context:
- 94f8aab introduced cross-peer notification-state sync for later
  renotify/escalation.
- e32d4f5 added add_notified_on() so recoveries could be sent from a
  different node than the one that sent the problem notification.
- Normal recoveries are different because they represent the terminal
  post-problem state and should not preserve the pre-reset counter.

Fix this by treating NOTIFICATION_NORMAL + STATE_UP/STATE_OK as a reset
on receipt: clear current_notification_number and notified_on.
Problem-state notifications continue to sync normally.

Ref: ITRS-Group#126

@eschoeller (Author)

I have a lot more testing to do, but figured I'd get this posted in the meantime.

Development

Successfully merging this pull request may close these issues.

Host Notification issue with Loadbalancing
