Alert on GraphQL Gateway Supergraph Fetch Failures via Prometheus + Alertmanager (Slack Notification)

## Description

### Context

On **2026-03-01**, GraphQL in **staging** became unavailable. The UI showed no organizations, and the browser network console showed:

- `POST /api/graphql/user/me`
- **HTTP 500**

The gateway logs reported:

```
Error: Failed to fetch the supergraph, check your supergraph configuration.
    at UnifiedGraphManager.handleLoadedUnifiedGraph (...)
message: "Failed to fetch the supergraph, check your supergraph configuration."
```

The error originated in `@graphql-mesh/fusion-runtime`, indicating that the **GraphQL Gateway could not fetch or compose the supergraph**.

---

### Root Cause (Observed Behavior)

- The issue was triggered by invalid / breaking autogenerated OpenAPI specs from the `search-service`.
- The GraphQL Gateway failed while attempting to load the federated schema (supergraph).
- Restarting the `milo-apiserver` was required to clear cached search API endpoints.
- Removing the `search-service` deployment temporarily restored GraphQL functionality.

This qualifies as an incident because:

- GraphQL was returning 500s.
- The UI showed inconsistent state (no organizations).
- The failure was not surfaced clearly in the frontend.
- There was no alerting mechanism to detect supergraph composition failures.

---

## Problem

We currently **do not have automated alerting** when:

- The GraphQL Gateway fails to fetch/compose the supergraph.
- A downstream service (e.g. `search-service`) publishes broken OpenAPI specs.
- Supergraph discovery or composition enters a failed state.
- `/api/graphql/*` begins returning elevated 5xx responses.

As a result, the gateway degrades silently until someone notices the failure manually.

---

## Goal

Implement Slack alerting for GraphQL Gateway health failures, ideally leveraging existing Prometheus metrics and Alertmanager.

---

## Scope of Investigation

### 1. Check Existing Prometheus Metrics

Investigate whether the GraphQL Gateway (or underlying runtime) already exposes:

- Supergraph composition status metrics
- Discovery health metrics
- Upstream service health metrics
- HTTP 5xx error rate metrics
- Dependency failure counters

If available, define alerts such as:

- Supergraph load failure > N occurrences in M minutes
- HTTP 5xx rate > threshold
- Dependency health = down
- Gateway readiness probe failing
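
As a starting point for discussion, the first two candidate alerts above could be written as Prometheus alerting rules. The sketch below builds a rule group as a typed object and serializes it to JSON, which can then be converted to the YAML that rule files normally use. Everything specific here is an assumption: the metric `supergraph_load_failures_total` is only proposed later in this issue, `http_requests_total` with a `status` label may not match what the gateway actually exposes, and the thresholds, `for` durations, and alert names are placeholders.

```typescript
// Sketch only: metric names, labels, thresholds, and alert names are
// assumptions, not metrics the gateway is known to expose.

interface AlertRule {
  alert: string;
  expr: string; // PromQL expression that triggers the alert
  for: string; // how long the condition must hold before firing
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

interface RuleGroup {
  name: string;
  rules: AlertRule[];
}

// "Supergraph load failure > N occurrences in M minutes"
function supergraphFailureRule(n: number, windowMinutes: number): AlertRule {
  return {
    alert: "GraphQLSupergraphLoadFailing",
    expr: `increase(supergraph_load_failures_total[${windowMinutes}m]) > ${n}`,
    for: "0m",
    labels: { severity: "critical" },
    annotations: {
      summary: "GraphQL Gateway cannot fetch or compose the supergraph",
    },
  };
}

// "HTTP 5xx rate > threshold", as the ratio of 5xx responses to all responses
function http5xxRateRule(threshold: number, windowMinutes: number): AlertRule {
  const window = `${windowMinutes}m`;
  const errors = `sum(rate(http_requests_total{status=~"5.."}[${window}]))`;
  const total = `sum(rate(http_requests_total[${window}]))`;
  return {
    alert: "GraphQLGatewayHigh5xxRate",
    expr: `${errors} / ${total} > ${threshold}`,
    for: "5m",
    labels: { severity: "warning" },
    annotations: { summary: "Elevated 5xx rate on the GraphQL Gateway" },
  };
}

function buildRuleGroup(): RuleGroup {
  return {
    name: "graphql-gateway",
    rules: [supergraphFailureRule(3, 10), http5xxRateRule(0.05, 5)],
  };
}

// Prometheus rule files have a top-level `groups` list.
const ruleFileJson = JSON.stringify({ groups: [buildRuleGroup()] }, null, 2);
```

Generating the rules from parameters keeps N, M, and the threshold easy to tune once real traffic data from staging is available.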

---

### 2. If Metrics Are Insufficient

If no relevant metrics exist, evaluate:

- Adding custom Prometheus metrics to the GraphQL Gateway:
  - `supergraph_load_failures_total`
  - `supergraph_last_success_timestamp`
  - `dependency_discovery_status`
- Instrumenting OpenAPI fetch failures
- Exposing a health endpoint that validates supergraph integrity
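
To make the proposal more concrete, here is a minimal sketch of how two of the metrics above and a health check could be tracked. It is a dependency-free stand-in, not the gateway's actual instrumentation: in practice a Prometheus client library would likely replace the hand-written `render()` method. The class `SupergraphMetrics`, the helper `loadWithMetrics`, and the idea of wrapping the supergraph loader are hypothetical; how `@graphql-mesh/fusion-runtime` exposes its load step still needs to be investigated.

```typescript
// Sketch only: class and function names are hypothetical.

class SupergraphMetrics {
  private loadFailuresTotal = 0;
  private lastSuccessTimestampSeconds = 0;
  private lastLoadOk = false;

  recordFailure(): void {
    this.loadFailuresTotal += 1;
    this.lastLoadOk = false;
  }

  recordSuccess(nowMs: number = Date.now()): void {
    this.lastSuccessTimestampSeconds = Math.floor(nowMs / 1000);
    this.lastLoadOk = true;
  }

  // Health check: true only if the most recent load attempt succeeded.
  isHealthy(): boolean {
    return this.lastLoadOk;
  }

  // Render the metrics in the Prometheus text exposition format.
  render(): string {
    return [
      "# HELP supergraph_load_failures_total Failed supergraph fetch/compose attempts.",
      "# TYPE supergraph_load_failures_total counter",
      `supergraph_load_failures_total ${this.loadFailuresTotal}`,
      "# HELP supergraph_last_success_timestamp Unix time of the last successful supergraph load.",
      "# TYPE supergraph_last_success_timestamp gauge",
      `supergraph_last_success_timestamp ${this.lastSuccessTimestampSeconds}`,
      "",
    ].join("\n");
  }
}

// Wrap any async supergraph loader so each attempt updates the metrics.
async function loadWithMetrics<T>(
  metrics: SupergraphMetrics,
  load: () => Promise<T>,
): Promise<T> {
  try {
    const result = await load();
    metrics.recordSuccess();
    return result;
  } catch (err) {
    metrics.recordFailure();
    throw err; // preserve the original failure for the caller
  }
}
```

A `/metrics` handler would return `render()`, and a health endpoint could return HTTP 503 whenever `isHealthy()` is false, so the same state would serve both the alert rules and the readiness probe.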

---

## Acceptance Criteria

- [ ] Identify existing Prometheus metrics that can detect this failure mode
- [ ] If none are available, implement the required instrumentation
- [ ] Create Prometheus alert rules for:
  - Supergraph fetch/composition failures
  - Elevated GraphQL 5xx rates
- [ ] Configure Alertmanager to send Slack notifications
- [ ] Validate alert in staging via controlled failure scenario
- [ ] Document runbook for this incident type
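
One way to check the Alertmanager-to-Slack route before running a real failure scenario is to post a synthetic alert to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below is an assumption about how that could look: the label values, the `synthetic` label, and the helper names are hypothetical, and the HTTP call is injected so the code does not depend on a particular client. This only exercises routing and the Slack receiver; it does not validate the Prometheus rules themselves, which still needs the controlled failure in staging.

```typescript
// Sketch only: label values and helper names are hypothetical.

interface PostableAlert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string; // RFC 3339 timestamp
  endsAt: string; // RFC 3339 timestamp
}

// Build a short-lived alert that is clearly marked as a test.
function buildSyntheticAlert(now: Date, durationMinutes: number): PostableAlert {
  const endsAt = new Date(now.getTime() + durationMinutes * 60000);
  return {
    labels: {
      alertname: "GraphQLSupergraphLoadFailing",
      severity: "critical",
      environment: "staging",
      synthetic: "true",
    },
    annotations: {
      summary: "Synthetic test alert: verifying Slack notification routing",
    },
    startsAt: now.toISOString(),
    endsAt: endsAt.toISOString(),
  };
}

// Injected HTTP poster: sends a JSON body and resolves to the HTTP status.
type PostJson = (url: string, body: string) => Promise<number>;

async function sendSyntheticAlert(
  alertmanagerUrl: string,
  alert: PostableAlert,
  post: PostJson,
): Promise<void> {
  // The v2 API accepts a JSON array of alerts.
  const status = await post(
    `${alertmanagerUrl}/api/v2/alerts`,
    JSON.stringify([alert]),
  );
  if (status < 200 || status >= 300) {
    throw new Error(`Alertmanager returned HTTP ${status}`);
  }
}
```

In a Node 18+ script, `post` could wrap the global `fetch` with a `Content-Type: application/json` header and return `response.status`.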


