[Feature]: Create SLA monitoring dashboard for API latency targets #234

@Vraj1234

Pre-flight checklist

  • I have searched the existing issues

Problem to solve

There is no dashboard that visualizes the PRD latency targets (<1s / <2s) over time, and no alerting when those targets are violated. Without it, incoming maintainers at UC Berkeley/LBL will have no operational visibility into the service.

Proposed solution or API

  • Create a Cloud Monitoring dashboard with panels for:
    • p50 / p95 / p99 request latency by endpoint
    • Error rate (4xx, 5xx) by endpoint
    • Cloud Run instance count and cold-start frequency
    • Cloud SQL query latency
  • Add an alerting policy that fires when p95 request latency stays above 1s for 5 minutes
  • Export the dashboard JSON to deployment/cloud/gcp/monitoring/ so it can be applied via gcloud monitoring dashboards create
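As a rough starting point, the latency panels and the p95 alert above could be generated by a small script like the one below. This is a sketch under assumptions, not the agreed design: it uses Cloud Run's built-in `run.googleapis.com/request_latencies` metric (reported in milliseconds) as a stand-in data source, which has no per-endpoint label, so the "by endpoint" breakdown would still need the OpenTelemetry metrics this issue depends on. The function names and display names are hypothetical.

```python
import json

# Assumption: Cloud Run's built-in request latency metric (a distribution, in ms).
# The per-endpoint metric from the OpenTelemetry issue would replace this filter.
LATENCY_FILTER = (
    'metric.type="run.googleapis.com/request_latencies" '
    'resource.type="cloud_run_revision"'
)


def latency_widget(percentile: int) -> dict:
    """Build one xyChart widget plotting a latency percentile (50, 95 or 99)."""
    return {
        "title": f"p{percentile} request latency (ms)",
        "xyChart": {
            "dataSets": [{
                "plotType": "LINE",
                "timeSeriesQuery": {
                    "timeSeriesFilter": {
                        "filter": LATENCY_FILTER,
                        "aggregation": {
                            "alignmentPeriod": "60s",
                            "perSeriesAligner": f"ALIGN_PERCENTILE_{percentile}",
                        },
                    }
                },
            }]
        },
    }


def build_dashboard() -> dict:
    """Dashboard body with one panel each for p50, p95 and p99 latency."""
    return {
        "displayName": "API latency SLA (sketch)",  # hypothetical name
        "gridLayout": {
            "columns": "3",
            "widgets": [latency_widget(p) for p in (50, 95, 99)],
        },
    }


def build_p95_alert_policy(threshold_ms: float = 1000.0,
                           duration_s: int = 300) -> dict:
    """Alert policy body: p95 latency above threshold_ms for duration_s seconds."""
    return {
        "displayName": "p95 latency above 1s for 5 minutes (sketch)",
        "combiner": "OR",
        "conditions": [{
            "displayName": "p95 request latency > threshold",
            "conditionThreshold": {
                "filter": LATENCY_FILTER,
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold_ms,  # metric is in milliseconds
                "duration": f"{duration_s}s",
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_PERCENTILE_95",
                }],
            },
        }],
    }


if __name__ == "__main__":
    # Print the dashboard JSON; redirect it into deployment/cloud/gcp/monitoring/.
    print(json.dumps(build_dashboard(), indent=2))
```

The printed JSON could be saved under deployment/cloud/gcp/monitoring/ (the file name is up to the implementer) and applied with `gcloud monitoring dashboards create --config-from-file=<file>`. The alert policy body would need a separate step, for example `gcloud alpha monitoring policies create --policy-from-file=<file>`. Error rate, instance count, cold-start, and Cloud SQL panels are left out here and would follow the same widget pattern with different metric filters.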

Alternatives considered

Grafana with Prometheus scraping: more portable, but it adds infrastructure. Cloud Monitoring is already available on GCP and requires no extra setup.

Additional context

Ref: #135 (SLA monitoring sub-task). Depends on the OpenTelemetry issue (see related issues) for the latency metrics data source.

Labels

enhancement (New feature or request)
