Alert on GraphQL Gateway Supergraph Fetch Failures via Prometheus + Alertmanager (Slack Notification) #14

@JoseSzycho

Description

Context

On 2026-03-01, GraphQL in staging became unavailable. The UI showed no organizations, and the network console reported:

  • POST /api/graphql/user/me
  • HTTP 500

The gateway logs reported:

Error: Failed to fetch the supergraph, check your supergraph configuration.
    at UnifiedGraphManager.handleLoadedUnifiedGraph (...)
message: "Failed to fetch the supergraph, check your supergraph configuration."

This originated from @graphql-mesh/fusion-runtime, indicating the GraphQL Gateway could not fetch or compose the supergraph.


Root Cause (Observed Behavior)

  • The issue was triggered by invalid or breaking autogenerated OpenAPI specs from the search-service.
  • The GraphQL Gateway failed while attempting to load the federated schema (supergraph).
  • Restarting the milo-apiserver was required to clear cached search API endpoints.
  • Removing the search-service deployment temporarily restored GraphQL functionality.

This qualifies as an incident because:

  • GraphQL was returning 500s.
  • The UI showed inconsistent state (no organizations).
  • The failure was not surfaced clearly in the frontend.
  • There was no alerting mechanism to detect supergraph composition failures.

Problem

We currently do not have automated alerting when:

  • The GraphQL Gateway fails to fetch/compose the supergraph.
  • A downstream service (e.g. search-service) publishes broken OpenAPI specs.
  • Supergraph discovery or composition enters a failed state.
  • /api/graphql/* begins returning elevated 5xx responses.

This results in silent degradation until someone manually notices.


Goal

Implement Slack alerting for GraphQL Gateway health failures, ideally leveraging existing Prometheus metrics and Alertmanager.


Scope of Investigation

1. Check Existing Prometheus Metrics

Investigate whether the GraphQL Gateway (or underlying runtime) already exposes:

  • Supergraph composition status metrics
  • Discovery health metrics
  • Upstream service health metrics
  • HTTP 5xx error rate metrics
  • Dependency failure counters

If available, define alerts such as:

  • Supergraph load failure > N occurrences in M minutes
  • HTTP 5xx rate > threshold
  • Dependency health = down
  • Gateway readiness probe failing
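
To make the first two alert conditions concrete, here is a minimal sketch of the logic they describe. It is written in Python purely for illustration; in practice Prometheus would evaluate these conditions as alert expressions. The threshold values and all names (Thresholds, supergraph_alert, http_5xx_alert) are placeholders, not values taken from the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are placeholders to be tuned; none come from the incident.
    max_failures: int = 3           # N
    window_seconds: float = 300.0   # M minutes, expressed in seconds
    max_5xx_ratio: float = 0.05     # share of responses that may be 5xx


def failures_in_window(failure_times, now, window_seconds):
    """Count failure timestamps inside the trailing window (now - window, now]."""
    return sum(1 for t in failure_times if now - window_seconds < t <= now)


def supergraph_alert(failure_times, now, th=Thresholds()):
    """True when more than N supergraph load failures occurred in the last M minutes."""
    return failures_in_window(failure_times, now, th.window_seconds) > th.max_failures


def http_5xx_alert(total_requests, error_requests, th=Thresholds()):
    """True when the share of 5xx responses exceeds the threshold."""
    if total_requests == 0:
        # No traffic, so there is no ratio to judge.
        return False
    return error_requests / total_requests > th.max_5xx_ratio
```

Note that the zero-traffic branch stays silent, so a gateway that receives no requests at all would need a separate check (for example the readiness probe condition).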

2. If Metrics Are Insufficient

If no relevant metrics exist, evaluate:

  • Adding custom Prometheus metrics to the GraphQL Gateway:
    • supergraph_load_failures_total
    • supergraph_last_success_timestamp
    • dependency_discovery_status
  • Instrumenting OpenAPI fetch failures
  • Exposing a health endpoint that validates supergraph integrity
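
As a rough sketch of what the three proposed metrics could look like, the snippet below tracks them in memory and renders them in the Prometheus text exposition format. It is in Python for illustration only; the gateway runs on @graphql-mesh/fusion-runtime, so a real implementation would more likely use a Prometheus client library for that runtime. The SupergraphMetrics class, its method names, and the "dependency" label are assumptions, and dependency names are assumed not to need label escaping.

```python
import time


class SupergraphMetrics:
    """Tracks the three proposed metrics in memory (illustrative only)."""

    def __init__(self):
        self.load_failures_total = 0        # counter: only ever increases
        self.last_success_timestamp = 0.0   # gauge: Unix seconds, 0 = never succeeded
        self.discovery_status = {}          # gauge per dependency: 1 = healthy, 0 = failed

    def record_load(self, ok, now=None):
        """Record the outcome of one supergraph fetch/compose attempt."""
        if ok:
            self.last_success_timestamp = time.time() if now is None else now
        else:
            self.load_failures_total += 1

    def record_discovery(self, dependency, ok):
        """Record whether discovery for one downstream service succeeded."""
        self.discovery_status[dependency] = 1 if ok else 0

    def render(self):
        """Render the metrics in the Prometheus text exposition format."""
        lines = [
            "# TYPE supergraph_load_failures_total counter",
            f"supergraph_load_failures_total {self.load_failures_total}",
            "# TYPE supergraph_last_success_timestamp gauge",
            f"supergraph_last_success_timestamp {self.last_success_timestamp}",
            "# TYPE dependency_discovery_status gauge",
        ]
        for name in sorted(self.discovery_status):
            status = self.discovery_status[name]
            lines.append(f'dependency_discovery_status{{dependency="{name}"}} {status}')
        return "\n".join(lines) + "\n"
```

The last-success timestamp is useful alongside the failure counter because it also catches the case where the gateway stops attempting loads altogether: an alert can fire when the timestamp is older than some allowed age.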

Acceptance Criteria

  • Identify existing Prometheus metrics that can detect this failure mode
  • If not available, implement required instrumentation
  • Create Prometheus alert rules for:
    • Supergraph fetch/composition failures
    • Elevated GraphQL 5xx rates
  • Configure Alertmanager to send Slack notifications
  • Validate alert in staging via controlled failure scenario
  • Document runbook for this incident type
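
For the alert-rule criterion, the sketch below renders a Prometheus alerting-rule group as YAML text from Python. It assumes the supergraph_load_failures_total counter has been added to the gateway; the http_requests_total metric, its path and status labels, the alert names, and every threshold and duration are hypothetical and would need to be replaced with whatever the gateway actually exposes.

```python
import json


def render_rule_group(group_name, rules):
    """Render a Prometheus alerting-rule group as YAML text.

    String values are emitted with json.dumps: a JSON string is also a valid
    YAML double-quoted scalar, so PromQL with quotes and braces stays intact.
    """
    out = ["groups:", f"  - name: {json.dumps(group_name)}", "    rules:"]
    for rule in rules:
        out.append(f"      - alert: {json.dumps(rule['alert'])}")
        out.append(f"        expr: {json.dumps(rule['expr'])}")
        out.append(f"        for: {json.dumps(rule['for'])}")
        out.append("        labels:")
        out.append(f"          severity: {json.dumps(rule['severity'])}")
        out.append("        annotations:")
        out.append(f"          summary: {json.dumps(rule['summary'])}")
    return "\n".join(out) + "\n"


RULES = [
    {
        "alert": "SupergraphLoadFailing",
        # Assumes the proposed failure counter exists on the gateway.
        "expr": "increase(supergraph_load_failures_total[5m]) > 0",
        "for": "1m",
        "severity": "critical",
        "summary": "GraphQL Gateway failed to fetch or compose the supergraph",
    },
    {
        "alert": "GraphQLHigh5xxRate",
        # http_requests_total and its labels are hypothetical placeholders.
        "expr": (
            'sum(rate(http_requests_total{path=~"/api/graphql/.*",status=~"5.."}[5m]))'
            ' / sum(rate(http_requests_total{path=~"/api/graphql/.*"}[5m])) > 0.05'
        ),
        "for": "5m",
        "severity": "critical",
        "summary": "Elevated 5xx rate on /api/graphql/*",
    },
]

if __name__ == "__main__":
    print(render_rule_group("graphql-gateway", RULES))
```

Slack delivery itself needs no code: Alertmanager can route these alerts to a receiver that has a slack_configs entry, with the webhook URL and channel supplied by the environment.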
