68 changes: 57 additions & 11 deletions README.md
@@ -41,6 +41,51 @@ The Secret Controller watches over the resources in the table below. Changes to
| ConfigMap | `openshift-monitoring/managed-namespaces` | Defines a list of OpenShift "managed" namespaces. The operator will route alerts originating from these namespaces to PagerDuty and/or GoAlert. |
| ConfigMap | `openshift-monitoring/ocp-namespaces` | Defines a list of OpenShift Container Platform namespaces. The operator will route alerts originating from these namespaces to PagerDuty and/or GoAlert. |

## Alertmanager Config Validation

The operator validates all Alertmanager configurations before writing them to the `alertmanager-main` secret. This prevents invalid configurations from being deployed, which could cause Alertmanager to fail on restart.

### How Validation Works

1. **Pre-write Validation**: Before writing any configuration to `alertmanager-main`, the operator validates it using Prometheus Alertmanager's official `config.Load()` function - the same validation Alertmanager performs on startup (a standalone sketch follows this list).

2. **Validation Failure Handling**: If validation fails:
- The invalid config is **not written** to the secret (preserving the last-known-good configuration)
   - A Kubernetes Event is created in the `openshift-monitoring` namespace with reason `AlertmanagerConfigValidationFailure`
- The `alertmanager_config_validation_status` metric is set to `1` (invalid)
- The reconcile loop returns an error, triggering automatic retry

3. **Validation Success**: If validation succeeds:
- The config is written to `alertmanager-main`
- The `alertmanager_config_validation_status` metric is set to `0` (valid)
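
For reference, the same check can be reproduced outside the operator. The snippet below is a minimal standalone sketch, not part of the operator: it assumes a rendered `alertmanager.yaml` on disk (for example, extracted from the `alertmanager-main` secret) and uses only `config.Load` from `github.com/prometheus/alertmanager/config`, the function the operator calls.

```go
package main

import (
	"fmt"
	"os"

	amconfig "github.com/prometheus/alertmanager/config"
)

func main() {
	// Read a rendered Alertmanager configuration, e.g. one extracted from
	// the alertmanager-main secret with `oc extract`.
	raw, err := os.ReadFile("alertmanager.yaml")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}

	// config.Load is the same entry point the operator (and Alertmanager
	// itself, on startup) uses to parse and validate the configuration.
	if _, err := amconfig.Load(string(raw)); err != nil {
		fmt.Fprintln(os.Stderr, "validation failed:", err)
		os.Exit(1)
	}
	fmt.Println("configuration is valid")
}
```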

### Monitoring Validation Status

**Via Prometheus Metric**:
```promql
alertmanager_config_validation_status{name="configure-alertmanager-operator"}
```
- Value `0` = configuration is valid
- Value `1` = configuration validation failed

**Via Kubernetes Events**:
```bash
oc get events -n openshift-monitoring --field-selector reason=AlertmanagerConfigValidationFailure
```

Failed validation events include:
- The specific validation error from Alertmanager
- Guidance to check source secrets and configmaps for invalid data
- A reference to operator logs for detailed debugging

### Common Validation Failures

- **Invalid label names**: Prometheus label names must match `[a-zA-Z_][a-zA-Z0-9_]*` (no hyphens allowed); see the sketch after this list for an example
- **Duplicate receiver names**: Each receiver must have a unique name
- **Missing required fields**: Route and at least one receiver are required
- **Invalid duration formats**: Must use valid Go duration strings (e.g., "5m", "1h")
- **Invalid regex patterns**: MatchRE patterns must be valid regular expressions
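
To illustrate the first failure mode, the hypothetical config below uses the hyphenated label name `my-label` in a route matcher. `config.Load()` rejects it; the exact error text depends on the Alertmanager version in use.

```go
package main

import (
	"fmt"

	amconfig "github.com/prometheus/alertmanager/config"
)

// badConfig is a hypothetical config whose route matcher uses the
// hyphenated label name "my-label", which is not a valid Prometheus
// label name.
const badConfig = `
route:
  receiver: default
  routes:
    - receiver: default
      match:
        my-label: some-value
receivers:
  - name: default
`

func main() {
	if _, err := amconfig.Load(badConfig); err != nil {
		// Expected path: the config is rejected before it could ever
		// reach the alertmanager-main secret.
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("unexpectedly accepted")
}
```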

## Cluster Readiness
To avoid alert noise while a cluster is in the early stages of being installed and configured, this operator waits to configure Pager Duty -- effectively silencing alerts -- until a predetermined set of health checks, performed by [osd-cluster-ready](https://github.com/openshift/osd-cluster-ready/), has completed.

@@ -49,17 +94,18 @@ This determination is made through the presence of a completed `Job` named `osd-
## Metrics
The Configure Alertmanager Operator exposes the following Prometheus metrics:

| Metric name | Purpose |
|---------------------------------------|-------------------------------------------------------------------------------------------------------|
| `ga_secret_exists` | indicates that a Secret named `goalert-secret` exists in the `openshift-monitoring` namespace. |
| `pd_secret_exists` | indicates that a Secret named `pd-secret` exists in the `openshift-monitoring` namespace. |
| `dms_secret_exists` | indicates that a Secret named `dms-secret` exists in the `openshift-monitoring` namespace. |
| `am_secret_exists` | indicates that a Secret named `alertmanager-main` exists in the `openshift-monitoring` namespace. |
| `managed_namespaces_configmap_exists` | indicates that a ConfigMap named `managed-namespaces` exists in the `openshift-monitoring` namespace. |
| `ocp_namespaces_configmap_exists` | indicates that a ConfigMap named `ocp-namespaces` exists in the `openshift-monitoring` namespace. |
| `am_secret_contains_ga` | indicates the GoAlert receiver is present in alertmanager.yaml. |
| `am_secret_contains_pd` | indicates the Pager Duty receiver is present in alertmanager.yaml. |
| `am_secret_contains_dms` | indicates the Dead Man's Snitch receiver is present in alertmanager.yaml. |
| Metric name | Purpose |
|------------------------------------------------|-------------------------------------------------------------------------------------------------------|
| `ga_secret_exists` | indicates that a Secret named `goalert-secret` exists in the `openshift-monitoring` namespace. |
| `pd_secret_exists` | indicates that a Secret named `pd-secret` exists in the `openshift-monitoring` namespace. |
| `dms_secret_exists` | indicates that a Secret named `dms-secret` exists in the `openshift-monitoring` namespace. |
| `am_secret_exists` | indicates that a Secret named `alertmanager-main` exists in the `openshift-monitoring` namespace. |
| `managed_namespaces_configmap_exists` | indicates that a ConfigMap named `managed-namespaces` exists in the `openshift-monitoring` namespace. |
| `ocp_namespaces_configmap_exists` | indicates that a ConfigMap named `ocp-namespaces` exists in the `openshift-monitoring` namespace. |
| `am_secret_contains_ga` | indicates the GoAlert receiver is present in alertmanager.yaml. |
| `am_secret_contains_pd` | indicates the Pager Duty receiver is present in alertmanager.yaml. |
| `am_secret_contains_dms` | indicates the Dead Man's Snitch receiver is present in alertmanager.yaml. |
| `alertmanager_config_validation_status` | indicates Alertmanager config validation status: `0` = valid, `1` = invalid. |

The operator creates a `Service` and `ServiceMonitor` named `configure-alertmanager-operator` to expose these metrics to Prometheus.
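
The validation metric is updated from the controller through `metrics.UpdateAlertmanagerConfigValidationMetric`. That helper lives in the operator's `pkg/metrics` package and is not shown in this diff; the sketch below is only an illustration, using `client_golang`, of how a gauge with these semantics could be defined (registry wiring and naming in the real package may differ).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative only: a gauge matching the documented metric, labelled by
// operator name as in the PromQL example above.
var alertmanagerConfigValidationStatus = promauto.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "alertmanager_config_validation_status",
		Help: "0 when the generated Alertmanager config passed validation, 1 when it failed.",
	},
	[]string{"name"},
)

// UpdateAlertmanagerConfigValidationMetric records the outcome of the most
// recent config validation (signature taken from the controller code below;
// the body is an assumption).
func UpdateAlertmanagerConfigValidationMetric(valid bool) {
	value := 1.0 // invalid
	if valid {
		value = 0.0 // valid
	}
	alertmanagerConfigValidationStatus.WithLabelValues("configure-alertmanager-operator").Set(value)
}
```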

82 changes: 79 additions & 3 deletions controllers/secret_controller.go
@@ -36,6 +36,7 @@ import (

"github.com/go-logr/logr"
configv1 "github.com/openshift/api/config/v1"
amconfig "github.com/prometheus/alertmanager/config"

"github.com/openshift/configure-alertmanager-operator/config"
"github.com/openshift/configure-alertmanager-operator/pkg/metrics"
@@ -272,7 +273,10 @@ func (r *SecretReconciler) Reconcile(ctx context.Context, request ctrl.Request)
osdNamespaces)

	// write the alertmanager Config
	writeAlertManagerConfig(r, reqLogger, alertmanagerconfig)
	if err := writeAlertManagerConfig(r, reqLogger, alertmanagerconfig); err != nil {
		reqLogger.Error(err, "Failed to write alertmanager config")
		return reconcile.Result{}, err
	}

	// Update metrics after all reconcile operations are complete.
	metrics.UpdateSecretsMetrics(secretList, alertmanagerconfig)
@@ -1259,11 +1263,78 @@ func readSecretKey(r *SecretReconciler, secretName string, secretNamespace strin
return string(secret.Data[fieldName])
}

// validateAlertManagerConfig validates the alertmanager config using Alertmanager's official validation
// This ensures the config will be accepted by Alertmanager on startup/reload
func validateAlertManagerConfig(reqLogger logr.Logger, cfg *alertmanager.Config) error {
	// Marshal our config to YAML
	cfgBytes, marshalerr := yaml.Marshal(cfg)
	if marshalerr != nil {
		return fmt.Errorf("failed to marshal alertmanager config for validation: %w", marshalerr)
	}

	// Use Alertmanager's official config.Load() to validate
	// This is the same validation Alertmanager performs on startup
	_, err := amconfig.Load(string(cfgBytes))
	if err != nil {
		return fmt.Errorf("alertmanager config validation failed: %w", err)
	}

	reqLogger.Info("INFO: Alertmanager config validation passed")
	return nil
}

// recordConfigValidationEvent creates a Kubernetes event to alert SRE when Alertmanager config validation fails
func (r *SecretReconciler) recordConfigValidationEvent(err error) {
> Review comment from @nephomaniac (Contributor), Jan 7, 2026: Potentially consider passing the (reconcile) context from r.Reconcile() to writeAlertManagerConfig() -> recordConfigValidationEvent(), so it can be used in place of context.TODO()?

	event := &corev1.Event{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("alertmanager-config-validation-failure-%d", time.Now().Unix()),
			Namespace: "openshift-monitoring",
		},
		InvolvedObject: corev1.ObjectReference{
			Kind:      "Secret",
			Namespace: "openshift-monitoring",
			Name:      secretNameAlertmanager,
		},
		Reason:  "AlertmanagerConfigValidationFailure",
		Message: fmt.Sprintf("CRITICAL: Failed to validate Alertmanager configuration - config will not be written to prevent Alertmanager from failing on restart. Error: %v. Action required: Check source configmaps and secrets for invalid data. Review operator logs for details.", err),
		Type:    corev1.EventTypeWarning,
		EventTime: metav1.MicroTime{
			Time: time.Now(),
		},
		FirstTimestamp: metav1.Time{
			Time: time.Now(),
		},
		LastTimestamp: metav1.Time{
			Time: time.Now(),
		},
		Count: 1,
	}

	// Best effort event creation - don't fail reconciliation if event creation fails
	if createErr := r.Client.Create(context.TODO(), event); createErr != nil {
		log.Error(createErr, "Failed to create Alertmanager config validation failure event")
	}
}

// writeAlertManagerConfig writes the updated alertmanager config to the `alertmanager-main` secret in namespace `openshift-monitoring`.
func writeAlertManagerConfig(r *SecretReconciler, reqLogger logr.Logger, amconfig *alertmanager.Config) {
// It validates the config before writing to prevent Alertmanager from failing on restart.
func writeAlertManagerConfig(r *SecretReconciler, reqLogger logr.Logger, amconfig *alertmanager.Config) error {
	// Validate the config before writing
	if err := validateAlertManagerConfig(reqLogger, amconfig); err != nil {
		reqLogger.Error(err, "ERROR: Alertmanager config validation failed - will not write to secret")
		// Record Kubernetes event to alert SRE
		r.recordConfigValidationEvent(err)
		// Update metric to indicate validation failure
		metrics.UpdateAlertmanagerConfigValidationMetric(false)
		return fmt.Errorf("alertmanager config validation failed, config not written: %w", err)
	}

	// Config is valid, proceed with marshaling and writing
	amconfigbyte, marshalerr := yaml.Marshal(amconfig)
	if marshalerr != nil {
		reqLogger.Error(marshalerr, "ERROR: failed to marshal Alertmanager config")
		metrics.UpdateAlertmanagerConfigValidationMetric(false)
		return fmt.Errorf("failed to marshal alertmanager config: %w", marshalerr)
	}
	// This is commented out because it prints secrets, but it might be useful for debugging when running locally.
	//reqLogger.Info("DEBUG: Marshalled Alertmanager config:", string(amconfigbyte))
@@ -1290,7 +1361,12 @@ func writeAlertManagerConfig(r *SecretReconciler, reqLogger logr.Logger, amconfi

	if err != nil {
		reqLogger.Error(err, "ERROR: Could not write secret alertmanager-main", "namespace", secret.Namespace)
		return
		metrics.UpdateAlertmanagerConfigValidationMetric(false)
		return fmt.Errorf("failed to write alertmanager-main secret: %w", err)
	}

	reqLogger.Info("INFO: Secret alertmanager-main successfully updated")
	// Update metric to indicate validation and write success
	metrics.UpdateAlertmanagerConfigValidationMetric(true)
	return nil
}