Pre-flight checklist
Problem to solve
There is no dashboard to visualize the PRD latency targets (<1s / <2s) over time or to trigger alerts when they are violated. Incoming maintainers at UC Berkeley/LBL will have no operational visibility into the service.
Proposed solution or API
- Create a Cloud Monitoring dashboard with panels for:
  - p50 / p95 / p99 request latency by endpoint
  - Error rate (4xx, 5xx) by endpoint
  - Cloud Run instance count and cold-start frequency
  - Cloud SQL query latency
- Add alerting policies for p95 > 1s sustained over 5 minutes
- Export dashboard JSON to deployment/cloud/gcp/monitoring/ so it can be applied via gcloud monitoring dashboards create
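A minimal sketch of how the dashboard JSON and the p95 alerting policy could be generated, offered as a starting point rather than the final design. The per-endpoint metrics depend on the OpenTelemetry work and are not defined yet, so this sketch substitutes Cloud Run's built-in service-level metric run.googleapis.com/request_latencies (a distribution in milliseconds) as a stand-in. The function names, display names, and the 60s alignment period are assumptions made for illustration.

```python
import json

# Assumption: stand-in filter using Cloud Run's built-in latency metric.
# The per-endpoint OpenTelemetry metric this issue depends on would replace it.
LATENCY_FILTER = (
    'metric.type="run.googleapis.com/request_latencies" '
    'AND resource.type="cloud_run_revision"'
)


def latency_widget(percentile: int) -> dict:
    """Build one xyChart widget plotting a latency percentile per service."""
    return {
        "title": f"p{percentile} request latency",
        "xyChart": {
            "dataSets": [{
                "timeSeriesQuery": {
                    "timeSeriesFilter": {
                        "filter": LATENCY_FILTER,
                        "aggregation": {
                            "alignmentPeriod": "60s",
                            "perSeriesAligner": "ALIGN_DELTA",
                            "crossSeriesReducer": f"REDUCE_PERCENTILE_{percentile}",
                            "groupByFields": ["resource.label.service_name"],
                        },
                    }
                }
            }]
        },
    }


def build_dashboard() -> dict:
    """Dashboard with p50 / p95 / p99 latency panels (hypothetical layout)."""
    return {
        "displayName": "Service latency (sketch)",
        "gridLayout": {
            "columns": 3,
            "widgets": [latency_widget(p) for p in (50, 95, 99)],
        },
    }


def build_alert_policy(threshold_ms: float = 1000.0, duration_s: int = 300) -> dict:
    """Alert when p95 latency stays above threshold_ms for duration_s seconds."""
    return {
        "displayName": "p95 latency above threshold (sketch)",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"p95 > {threshold_ms:.0f} ms for {duration_s}s",
            "conditionThreshold": {
                "filter": LATENCY_FILTER,
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_PERCENTILE_95",
                }],
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold_ms,  # metric unit is milliseconds
                "duration": f"{duration_s}s",
            },
        }],
    }


if __name__ == "__main__":
    # Print the dashboard JSON; redirect to a file under
    # deployment/cloud/gcp/monitoring/ to keep it in version control.
    print(json.dumps(build_dashboard(), indent=2))
```

The printed JSON could be saved under deployment/cloud/gcp/monitoring/ (file name left open here) and applied with gcloud monitoring dashboards create --config-from-file=<path>. Panels for error rate, instance count, and Cloud SQL latency would follow the same widget shape with different filters.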
Alternatives considered
Grafana with Prometheus scraping — more portable but adds infrastructure. Cloud Monitoring is already available on GCP at no extra setup cost.
Additional context
Ref: #135 (SLA monitoring sub-task). Depends on the OpenTelemetry issue (see related issues) for the latency metrics data source.