Title
Alert on GraphQL Gateway Supergraph Fetch Failures via Prometheus + Alertmanager (Slack Notification)
Description
Context
On 2026-03-01, GraphQL in staging became unavailable. The UI surfaced missing organizations, and the network console showed:
POST /api/graphql/user/me - HTTP 500
The gateway logs reported:
Error: Failed to fetch the supergraph, check your supergraph configuration.
at UnifiedGraphManager.handleLoadedUnifiedGraph (...)
message: "Failed to fetch the supergraph, check your supergraph configuration."
This originated from @graphql-mesh/fusion-runtime, indicating the GraphQL Gateway could not fetch or compose the supergraph.
Root Cause (Observed Behavior)
- The issue was triggered by invalid / breaking autogenerated OpenAPI specs from the search-service.
- The GraphQL Gateway failed while attempting to load the federated schema (supergraph).
- Restarting the milo-apiserver was required to clear cached search API endpoints.
- Removing the search-service deployment temporarily restored GraphQL functionality.
This qualifies as an incident because:
- GraphQL was returning 500s.
- The UI showed inconsistent state (no organizations).
- The failure was not surfaced clearly in the frontend.
- There was no alerting mechanism to detect supergraph composition failures.
Problem
We currently do not have automated alerting when:
- The GraphQL Gateway fails to fetch/compose the supergraph.
- A downstream service (e.g. search-service) publishes broken OpenAPI specs.
- Supergraph discovery or composition enters a failed state.
- /api/graphql/* begins returning elevated 5xx responses.
This results in silent degradation until someone manually notices.
Goal
Implement Slack alerting for GraphQL Gateway health failures, ideally leveraging existing Prometheus metrics and Alertmanager.
Scope of Investigation
1. Check Existing Prometheus Metrics
Investigate whether the GraphQL Gateway (or underlying runtime) already exposes:
- Supergraph composition status metrics
- Discovery health metrics
- Upstream service health metrics
- HTTP 5xx error rate metrics
- Dependency failure counters
If available, define alerts such as:
- Supergraph load failure > N occurrences in M minutes
- HTTP 5xx rate > threshold
- Dependency health = down
- Gateway readiness probe failing
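The first two conditions above can be sketched as plain functions to make the intended semantics explicit. This is only an illustration in TypeScript: in practice the evaluation would live in Prometheus alert rules (for example using PromQL's `increase()` and `rate()`), and the `CounterSample` shape, the function names, and the thresholds below are all hypothetical.

```typescript
// Hypothetical sample shape; real samples would be scraped by Prometheus.
interface CounterSample {
  timestampMs: number; // scrape time in milliseconds
  value: number;       // cumulative counter value at that time
}

// Increase of a counter inside the trailing window ending at nowMs.
// Simplified on purpose: assumes samples are sorted by time and ignores
// the counter-reset handling and extrapolation that Prometheus performs.
function increaseInWindow(
  samples: CounterSample[],
  nowMs: number,
  windowMs: number,
): number {
  let first: CounterSample | undefined;
  let last: CounterSample | undefined;
  for (const s of samples) {
    if (s.timestampMs < nowMs - windowMs || s.timestampMs > nowMs) continue;
    if (first === undefined) first = s;
    last = s;
  }
  if (first === undefined || last === undefined) return 0;
  return Math.max(0, last.value - first.value);
}

// "Supergraph load failure > N occurrences in M minutes"
function supergraphLoadAlert(
  samples: CounterSample[],
  nowMs: number,
  windowMs: number,
  maxFailures: number,
): boolean {
  return increaseInWindow(samples, nowMs, windowMs) > maxFailures;
}

// "HTTP 5xx rate > threshold", expressed as a ratio of 5xx to all requests.
function errorRatioAlert(
  totalRequests: number,
  serverErrors: number,
  threshold: number,
): boolean {
  if (totalRequests === 0) return false; // no traffic, nothing to alert on
  return serverErrors / totalRequests > threshold;
}
```

Expressing the 5xx condition as a ratio rather than an absolute count keeps the alert meaningful in low-traffic environments such as staging, though the right threshold would still need tuning against real traffic.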
2. If Metrics Are Insufficient
If no relevant metrics exist, evaluate:
- Adding custom Prometheus metrics to the GraphQL Gateway:
  - supergraph_load_failures_total
  - supergraph_last_success_timestamp
  - dependency_discovery_status
- Instrumenting OpenAPI fetch failures
- Exposing a health endpoint that validates supergraph integrity
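To make the proposed instrumentation more concrete, here is a minimal TypeScript sketch of the three metrics rendered in the Prometheus text exposition format, plus a readiness check. It is a self-contained stand-in, not the gateway's actual code: the `SupergraphMetrics` class, its method names, the `service` label, and the rule that "ready" means the most recent supergraph load succeeded are all assumptions made for illustration. A real implementation would more likely use an existing metrics library such as prom-client and hook into wherever the gateway handles supergraph load results.

```typescript
// Minimal in-memory stand-in for a metrics library (illustration only).
class SupergraphMetrics {
  private loadFailures = 0;        // supergraph_load_failures_total
  private lastSuccessSeconds = 0;  // supergraph_last_success_timestamp
  private lastLoadSucceeded = false;
  private readonly dependencyStatus = new Map<string, number>();

  // Call when fetching or composing the supergraph fails.
  recordLoadFailure(): void {
    this.loadFailures += 1;
    this.lastLoadSucceeded = false;
  }

  // Call when the supergraph loads; nowSeconds is a Unix timestamp.
  recordLoadSuccess(nowSeconds: number): void {
    this.lastSuccessSeconds = nowSeconds;
    this.lastLoadSucceeded = true;
  }

  // 1 = discovery for this service is healthy, 0 = it is not.
  setDependencyStatus(service: string, healthy: boolean): void {
    this.dependencyStatus.set(service, healthy ? 1 : 0);
  }

  // Assumed readiness rule: the most recent load attempt succeeded.
  isReady(): boolean {
    return this.lastLoadSucceeded;
  }

  // Render all metrics in the Prometheus text exposition format.
  render(): string {
    const lines: string[] = [
      "# TYPE supergraph_load_failures_total counter",
      `supergraph_load_failures_total ${this.loadFailures}`,
      "# TYPE supergraph_last_success_timestamp gauge",
      `supergraph_last_success_timestamp ${this.lastSuccessSeconds}`,
      "# TYPE dependency_discovery_status gauge",
    ];
    this.dependencyStatus.forEach((status, service) => {
      lines.push(`dependency_discovery_status{service="${service}"} ${status}`);
    });
    return lines.join("\n") + "\n";
  }
}
```

A health endpoint could return HTTP 200 when `isReady()` is true and 503 otherwise, while a `/metrics` route serves the output of `render()`. Exposing the last-success timestamp as well as the failure counter lets an alert catch a gateway that has stopped attempting loads entirely, which a failure counter alone would miss.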
Acceptance Criteria
- Identify existing Prometheus metrics that can detect this failure mode
- If not available, implement required instrumentation
- Create Prometheus alert rules for:
  - Supergraph fetch/composition failures
  - Elevated GraphQL 5xx rates
- Configure Alertmanager to send Slack notifications
- Validate alert in staging via controlled failure scenario
- Document runbook for this incident type