Skip to content

Conversation

@clcollins
Copy link
Member

@clcollins clcollins commented Dec 23, 2025

Summary

This PR adds pre-write validation of Alertmanager configurations to prevent invalid configs from being deployed, which could cause Alertmanager to crash on restart.

Fixes: SREP-2967

Key Changes:

  • ✅ Validate configs using Prometheus Alertmanager's official config.Load() function
  • ✅ Block invalid config writes, preserving last-known-good configuration
  • ✅ Create Kubernetes Events on validation failures with actionable SRE guidance
  • ✅ Expose alertmanager_config_validation_status metric (0=valid, 1=invalid)
  • ✅ Add comprehensive unit tests and e2e tests
  • ✅ Update README with validation documentation

Motivation

This addresses production incident ITN-2025-00331 where an invalid label name route-to-cad (containing hyphens) caused Alertmanager to crash, resulting in a 6-hour monitoring outage. The incident occurred because CAMO generated a configuration with an invalid Prometheus label name - Prometheus labels must match [a-zA-Z_][a-zA-Z0-9_]* (hyphens are not allowed).

How Validation Works

  1. Pre-write Validation: Before writing any configuration to alertmanager-main, the operator validates it using the exact same config.Load() function that Alertmanager uses on startup
  2. Validation Failure Handling:
    • Invalid config is not written to the secret (preserves last-known-good config)
    • Kubernetes Event created with reason AlertmanagerConfigValidationFailure
    • Metric alertmanager_config_validation_status set to 1 (invalid)
    • Reconcile loop returns error, triggering automatic retry
  3. Validation Success:
    • Config is written to alertmanager-main
    • Metric set to 0 (valid)

Testing

Unit Tests:

  • Valid config validation
  • Invalid config detection (bad duration format)
  • Regression test for ITN-2025-00331 (invalid label name with hyphens)
  • Duplicate receiver name detection
  • Missing route detection
  • Kubernetes Event creation verification

E2E Tests:

  • Config validation using amconfig.Load() on real cluster
  • Validation metric exposure via Prometheus

Local Testing:
All unit tests pass:

go test ./controllers/... -v

E2E tests compile successfully and are ready for cluster validation.

Monitoring

SREs can monitor validation status via:

Prometheus Metric:

alertmanager_config_validation_status{name="configure-alertmanager-operator"}

Kubernetes Events:

oc get events -n openshift-monitoring --field-selector reason=AlertmanagerConfigValidationFailure

Files Changed

  • go.mod / go.sum - Added github.com/prometheus/alertmanager dependency
  • pkg/metrics/metrics.go - Added validation status metric
  • controllers/secret_controller.go - Added validation functions and updated write logic
  • controllers/secret_controller_test.go - Added 6 unit tests
  • test/e2e/configure_alertmanager_operator_tests.go - Added 2 e2e tests
  • README.md - Added validation documentation

Pre-Merge Checklist

  • Unit tests pass
  • E2E tests compile
  • E2E tests validated on real cluster (pending - will run locally before merge)
  • Code review completed

🤖 Generated with Claude Code

This change adds pre-write validation of Alertmanager configurations to
prevent invalid configs from being deployed, which could cause
Alertmanager to crash on restart.

Changes:
- Use Prometheus Alertmanager's official config.Load() for validation
- Block invalid config writes, preserving last-known-good configuration
- Create Kubernetes Events on validation failures with actionable guidance
- Expose alertmanager_config_validation_status metric (0=valid, 1=invalid)
- Add unit tests for validation logic and event creation
- Add e2e tests for config validation and metric exposure
- Update README with validation documentation

This addresses production incident ITN-2025-00331 where an invalid label
name "route-to-cad" (containing hyphens) caused Alertmanager to crash,
resulting in a 6-hour monitoring outage.

Regression test added: Test_validateAlertManagerConfig_InvalidLabelNameWithHyphens

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@clcollins
Copy link
Member Author

/hold

Holding for local e2e test validation before merge.

@openshift-ci openshift-ci bot requested review from nephomaniac and typeid December 23, 2025 01:20
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clcollins

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 23, 2025
@codecov
Copy link

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.66%. Comparing base (7b24c7e) to head (4da9496).

Files with missing lines Patch % Lines
controllers/secret_controller.go 78.00% 8 Missing and 3 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
+ Coverage   66.99%   67.66%   +0.66%     
==========================================
  Files           8        8              
  Lines        1021     1073      +52     
==========================================
+ Hits          684      726      +42     
- Misses        312      319       +7     
- Partials       25       28       +3     
Files with missing lines Coverage Δ
pkg/metrics/metrics.go 76.34% <100.00%> (+1.34%) ⬆️
controllers/secret_controller.go 88.42% <78.00%> (-0.68%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Check error return values from writeAlertManagerConfig() calls in test
setup to satisfy errcheck linter requirement.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@clcollins
Copy link
Member Author

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Dec 23, 2025
- Test validation failure path with invalid config
- Test success path with valid config
- Verify secret creation/non-creation behavior
- Improves code coverage for config validation feature

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@clcollins
Copy link
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 6, 2026

@clcollins: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

// UpdateAlertmanagerConfigValidationMetric updates the validation status metric
// validationPassed should be true if validation succeeded, false if it failed
func UpdateAlertmanagerConfigValidationMetric(validationPassed bool) {
if validationPassed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checking if the 0 vs 1 logic as intended here?

}

// recordConfigValidationEvent creates a Kubernetes event to alert SRE when Alertmanager config validation fails
func (r *SecretReconciler) recordConfigValidationEvent(err error) {
Copy link
Contributor

@nephomaniac nephomaniac Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially consider passing (reconcile) context from r.Reconcile() to writeAlertManagerConfig() -> recordConfigValidationEvent() so can be used in place of context.TODO() ?

@nephomaniac
Copy link
Contributor

Nice work @clcollins, thank you for the extra validations, metrics, and e2e tests!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants