@dustman9000
Member

Summary

This PR splits the MachineOutOfComplianceSRE alert into two separate alerts to reduce noise while maintaining visibility into compliance-monkey failures.

Problem

The current alert fires when any machine is >28 days old. However, when many machines age out simultaneously (e.g., all created during MC provisioning), compliance-monkey processes them at 1 machine per 15 minutes. This creates a queue where some machines exceed 28 days while waiting their turn, triggering critical alerts even though the automation is working correctly.

Solution

Two-tier alert system:

  1. MachineOutOfComplianceSRE (critical, >35 days, 1h for):

    • Fires when ANY machine exceeds 35 days old
    • Indicates a clear compliance-monkey failure requiring immediate attention
    • 7 days of buffer beyond the 28-day threshold allows adequate time for queue processing
  2. MachineOutOfComplianceSREWarning (warning, >5 machines >28 days, 4h for):

    • Fires when multiple machines (>5) are >28 days old for 4+ hours
    • Indicates a queue backup that warrants monitoring
    • Expected behavior when many machines age out simultaneously
    • Provides visibility without generating critical pages
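The two tiers above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch under stated assumptions, not the PR's actual diff: the critical expression and its 60m duration match the rule lines quoted in the review thread, but the `count(...)` form of the warning expression, the group name, and the `severity` labels are guesses at how the description maps onto a rule file.

```yaml
# Sketch only. The warning expression, group name, and labels are assumptions;
# only the critical expr and its 60m duration come from the PR itself.
groups:
  - name: machine-compliance-sketch   # hypothetical group name
    rules:
      - alert: MachineOutOfComplianceSRE
        # 35 days * 86400 s/day = 3024000 s
        expr: (time() - mapi_machine_created_timestamp_seconds) > 3024000
        for: 60m
        labels:
          severity: critical
      - alert: MachineOutOfComplianceSREWarning
        # 28 days * 86400 s/day = 2419200 s.
        # The inner comparison keeps only machines past 28 days;
        # count() tallies them and the outer comparison requires more than 5.
        expr: count((time() - mapi_machine_created_timestamp_seconds) > 2419200) > 5
        for: 4h
        labels:
          severity: warning
```

As written, `count()` aggregates across every series the rule evaluator sees. If the metric is collected from several clusters into one Prometheus, a `by (...)` clause on a cluster-identifying label would likely be needed so that the >5 threshold applies per cluster.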

Benefits

  • Reduced noise: Normal queue processing no longer triggers critical alerts
  • Better signal-to-noise: Critical alerts reserved for true failures (>35 days)
  • Maintained visibility: Warning alerts for queue backlogs provide awareness
  • Clear escalation path: Warning → Critical based on severity and duration

Testing

  • Alert logic validated against current MC cluster state showing 72 machines >21 days old
  • At 1 machine per 15 minutes, a 72-machine queue drains in 18 hours (72 × 15 min), well within the 7-day buffer between the 28-day and 35-day thresholds
  • Warning alert would fire for large backlogs, critical only for stuck machines

References

Split the MachineOutOfComplianceSRE alert into two separate alerts to
reduce noise while maintaining visibility:

1. MachineOutOfComplianceSRE (critical): Fires when ANY machine is >35
   days old, indicating a clear compliance-monkey failure that requires
   immediate attention.

2. MachineOutOfComplianceSREWarning (warning): Fires when >5 machines
   are >28 days old for 4+ hours, indicating a queue backup that may
   warrant monitoring but is expected behavior when many machines age
   out simultaneously.

This change addresses the issue where compliance-monkey processes
machines at 1 per 15 minutes, causing some machines to exceed 28 days
while waiting in the normal replacement queue. The new thresholds
provide better signal-to-noise ratio for on-call responders.
@openshift-ci openshift-ci bot requested review from boranx and rogbas December 6, 2025 00:05
@openshift-ci
Contributor

openshift-ci bot commented Dec 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 6, 2025

@dustman9000: all tests passed!


expr: (time() - mapi_machine_created_timestamp_seconds) > 2419200
for: 60m
# Fires when ANY machine exceeds 35 days old, indicating compliance-monkey failed to replace it.
expr: (time() - mapi_machine_created_timestamp_seconds) > 3024000
Contributor

Isn't there a requirement to replace machines older than 28 days? If so, firing at 35 days means we are already out of compliance.

@joshbranham
Contributor

We now have jitter in compliance-monkey, so machines with the same creation time are no longer phased out at the same time. We may not need these changes anymore, but I'll defer to you 👍
