Add Alertmanager config validation to prevent invalid configurations #473
This change adds pre-write validation of Alertmanager configurations to prevent invalid configs from being deployed, which could cause Alertmanager to crash on restart.

Changes:
- Use Prometheus Alertmanager's official config.Load() for validation
- Block invalid config writes, preserving last-known-good configuration
- Create Kubernetes Events on validation failures with actionable guidance
- Expose alertmanager_config_validation_status metric (0=valid, 1=invalid)
- Add unit tests for validation logic and event creation
- Add e2e tests for config validation and metric exposure
- Update README with validation documentation

This addresses production incident ITN-2025-00331 where an invalid label name "route-to-cad" (containing hyphens) caused Alertmanager to crash, resulting in a 6-hour monitoring outage.

Regression test added: Test_validateAlertManagerConfig_InvalidLabelNameWithHyphens

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
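To illustrate why "route-to-cad" is rejected: Prometheus label names have traditionally been restricted to the pattern [a-zA-Z_][a-zA-Z0-9_]*, which excludes hyphens. The sketch below checks a name against that pattern using only the standard library. The helper name isValidLabelName is hypothetical and is not part of this PR, which relies on Alertmanager's own config.Load() instead.

```go
package main

import (
	"fmt"
	"regexp"
)

// labelNameRE is the classic Prometheus label-name pattern, anchored so the
// whole name must match.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// isValidLabelName is a hypothetical helper for illustration only; the PR
// itself delegates validation to Alertmanager's config.Load().
func isValidLabelName(name string) bool {
	return labelNameRE.MatchString(name)
}

func main() {
	for _, name := range []string{"route-to-cad", "route_to_cad"} {
		fmt.Printf("%q valid: %v\n", name, isValidLabelName(name))
	}
}
```

Replacing the hyphens with underscores (route_to_cad) satisfies the pattern.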
|
/hold Holding for local e2e test validation before merge. |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: clcollins
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
+ Coverage   66.99%   67.66%   +0.66%
==========================================
  Files           8        8
  Lines        1021     1073      +52
==========================================
+ Hits          684      726      +42
- Misses        312      319       +7
- Partials       25       28       +3
|
Check error return values from writeAlertManagerConfig() calls in test setup to satisfy errcheck linter requirement.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
/label tide/merge-method-squash |
- Test validation failure path with invalid config
- Test success path with valid config
- Verify secret creation/non-creation behavior
- Improves code coverage for config validation feature

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/hold cancel |
|
@clcollins: all tests passed! Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
// UpdateAlertmanagerConfigValidationMetric updates the validation status metric
// validationPassed should be true if validation succeeded, false if it failed
func UpdateAlertmanagerConfigValidationMetric(validationPassed bool) {
	if validationPassed {
Double-checking: is the 0 vs 1 logic as intended here?
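For reference, the PR description documents the metric as 0=valid, 1=invalid, so the gauge value is inverted relative to the validationPassed boolean. A minimal sketch of that mapping follows; validationStatusValue is a hypothetical helper, not the PR's actual gauge code, and it only shows the intended direction of the values.

```go
package main

import "fmt"

// validationStatusValue is a hypothetical helper that maps a validation
// result to the documented metric value: 0 = valid, 1 = invalid.
func validationStatusValue(validationPassed bool) float64 {
	if validationPassed {
		return 0 // valid
	}
	return 1 // invalid
}

func main() {
	fmt.Println("passed:", validationStatusValue(true))
	fmt.Println("failed:", validationStatusValue(false))
}
```

With this convention an alert can fire on any non-zero value, which is the usual shape for a "something is wrong" gauge.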
}

// recordConfigValidationEvent creates a Kubernetes event to alert SRE when Alertmanager config validation fails
func (r *SecretReconciler) recordConfigValidationEvent(err error) {
Consider passing the reconcile context from r.Reconcile() through writeAlertManagerConfig() to recordConfigValidationEvent(), so that it can be used in place of context.TODO()?
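A rough sketch of what that suggestion could look like, using only the standard library. The reconciler type, the simplified method signatures, and the eventCtxErr field are all assumptions made for illustration; the real SecretReconciler methods take other arguments and create an actual Kubernetes Event.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// reconciler is a hypothetical, stripped-down stand-in for SecretReconciler.
type reconciler struct {
	// eventCtxErr records ctx.Err() as seen by the event helper (demo only).
	eventCtxErr error
}

// recordConfigValidationEvent receives the caller's context instead of
// creating its own with context.TODO().
func (r *reconciler) recordConfigValidationEvent(ctx context.Context, err error) {
	r.eventCtxErr = ctx.Err()
}

// writeAlertManagerConfig forwards ctx to the event helper on failure.
func (r *reconciler) writeAlertManagerConfig(ctx context.Context, validationErr error) error {
	if validationErr != nil {
		r.recordConfigValidationEvent(ctx, validationErr)
		return fmt.Errorf("config not written: %w", validationErr)
	}
	return nil
}

// Reconcile passes its own context down the call chain.
func (r *reconciler) Reconcile(ctx context.Context) error {
	return r.writeAlertManagerConfig(ctx, errors.New("invalid label name"))
}

// cancelledContextReachesEvent reports whether a cancellation made by the
// caller of Reconcile is visible inside the event helper.
func cancelledContextReachesEvent() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	r := &reconciler{}
	_ = r.Reconcile(ctx)
	return errors.Is(r.eventCtxErr, context.Canceled)
}

func main() {
	fmt.Println("cancellation visible in event helper:", cancelledContextReachesEvent())
}
```

The benefit is that cancellation and deadlines set on the reconcile request also apply to the event-creation call.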
|
Nice work @clcollins, thank you for the extra validations, metrics, and e2e tests! |
Summary
This PR adds pre-write validation of Alertmanager configurations to prevent invalid configs from being deployed, which could cause Alertmanager to crash on restart.
Fixes: SREP-2967
Key Changes:
- Validate configurations with Alertmanager's official config.Load() function
- Expose alertmanager_config_validation_status metric (0=valid, 1=invalid)
Motivation
This addresses production incident ITN-2025-00331 where an invalid label name route-to-cad (containing hyphens) caused Alertmanager to crash, resulting in a 6-hour monitoring outage. The incident occurred because CAMO generated a configuration with an invalid Prometheus label name: Prometheus labels must match [a-zA-Z_][a-zA-Z0-9_]* (hyphens are not allowed).
How Validation Works
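- Before a configuration is written to alertmanager-main, the operator validates it using the exact same config.Load() function that Alertmanager uses on startup
- If validation fails: a Kubernetes Event AlertmanagerConfigValidationFailure is created, alertmanager_config_validation_status is set to 1 (invalid), and the write to alertmanager-main is blocked so the last-known-good configuration is preserved
- If validation succeeds: the configuration is written to alertmanager-main and the metric is set to 0 (valid)
Testing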
Unit Tests:
E2E Tests:
amconfig.Load() on real cluster
Local Testing:
All unit tests pass:
go test ./controllers/... -v
E2E tests compile successfully and are ready for cluster validation.
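The behaviour exercised by these tests (validate first, write only on success, otherwise keep the previous configuration and flag the metric) can be sketched with standard-library stand-ins. Everything here is an assumption for illustration: configStore replaces the alertmanager-main secret, validateFunc replaces Alertmanager's config.Load(), and rejectHyphenLabel is a toy validator that only looks for the label name from the incident.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateFunc stands in for Alertmanager's config.Load(), which the PR uses
// for real validation.
type validateFunc func(cfg string) error

// configStore is a hypothetical in-memory stand-in for the stored config.
type configStore struct {
	current string  // last-known-good configuration
	status  float64 // mirrors the metric: 0 = valid, 1 = invalid
}

// write validates cfg and replaces the stored config only on success.
func (s *configStore) write(cfg string, validate validateFunc) error {
	if err := validate(cfg); err != nil {
		s.status = 1 // invalid; s.current is left untouched
		return fmt.Errorf("validation failed, keeping previous config: %w", err)
	}
	s.current = cfg
	s.status = 0 // valid
	return nil
}

// rejectHyphenLabel is a toy validator used only in this sketch.
func rejectHyphenLabel(cfg string) error {
	if strings.Contains(cfg, "route-to-cad") {
		return errors.New(`invalid label name "route-to-cad"`)
	}
	return nil
}

func main() {
	s := &configStore{current: "good-v1"}
	err := s.write("matchers: route-to-cad", rejectHyphenLabel)
	fmt.Println("after invalid write:", err, "| stored:", s.current, "| status:", s.status)
	err = s.write("good-v2", rejectHyphenLabel)
	fmt.Println("after valid write:", err, "| stored:", s.current, "| status:", s.status)
}
```

Injecting the validator as a function keeps the write path testable without a cluster, which is one plausible way to structure unit tests for both the failure and success paths.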
Monitoring
SREs can monitor validation status via:
Prometheus Metric:
Kubernetes Events:
Files Changed
- go.mod/go.sum - Added github.com/prometheus/alertmanager dependency
- pkg/metrics/metrics.go - Added validation status metric
- controllers/secret_controller.go - Added validation functions and updated write logic
- controllers/secret_controller_test.go - Added 6 unit tests
- test/e2e/configure_alertmanager_operator_tests.go - Added 2 e2e tests
- README.md - Added validation documentation
Pre-Merge Checklist
🤖 Generated with Claude Code