Pre-flight checklist
Problem to solve
There is no distributed tracing, structured metrics collection, or /metrics endpoint. When a slow request is reported, there is no trace to identify whether latency is in the DB query, the SQLAlchemy ORM layer, or the FastAPI handler itself. Without a Prometheus endpoint, Cloud Monitoring cannot scrape custom application metrics (request counts, DB pool utilization, cache hit rates). The PRD requires latency percentile visibility to enforce SLAs (<1s standard, <2s complex spatial at 95th percentile).
Proposed solution or API
- Add
opentelemetry-sdk, opentelemetry-instrumentation-fastapi, opentelemetry-instrumentation-sqlalchemy, and opentelemetry-exporter-prometheus to the webservice dependencies
- Configure an OTLP exporter pointing to Cloud Trace (GCP's OpenTelemetry-compatible backend — no extra infra required)
- Auto-instrument FastAPI and SQLAlchemy so all requests and DB queries emit spans automatically
- Expose a
/metrics endpoint via opentelemetry-exporter-prometheus, emitting at minimum: request count by endpoint + status, request duration histogram, DB connection pool size
- Configure Cloud Monitoring to scrape
/metrics via a managed Prometheus collector
Alternatives considered
Google Cloud Profiler for CPU profiling only — doesn't cover request-level traces. prometheus-fastapi-instrumentator as a standalone alternative if full OTel is too heavyweight — but a unified OTel stack avoids duplicate instrumentation.
Additional context
Ref: #135 (observability + Prometheus sub-tasks). Cloud Run already emits basic request logs to Cloud Logging automatically; this issue covers application-level instrumentation for per-endpoint, per-query latency breakdown.
Pre-flight checklist
Problem to solve
There is no distributed tracing, structured metrics collection, or
/metricsendpoint. When a slow request is reported, there is no trace to identify whether latency is in the DB query, the SQLAlchemy ORM layer, or the FastAPI handler itself. Without a Prometheus endpoint, Cloud Monitoring cannot scrape custom application metrics (request counts, DB pool utilization, cache hit rates). The PRD requires latency percentile visibility to enforce SLAs (<1s standard, <2s complex spatial at 95th percentile).Proposed solution or API
opentelemetry-sdk,opentelemetry-instrumentation-fastapi,opentelemetry-instrumentation-sqlalchemy, andopentelemetry-exporter-prometheusto the webservice dependencies/metricsendpoint viaopentelemetry-exporter-prometheus, emitting at minimum: request count by endpoint + status, request duration histogram, DB connection pool size/metricsvia a managed Prometheus collectorAlternatives considered
Google Cloud Profiler for CPU profiling only — doesn't cover request-level traces.
prometheus-fastapi-instrumentatoras a standalone alternative if full OTel is too heavyweight — but a unified OTel stack avoids duplicate instrumentation.Additional context
Ref: #135 (observability + Prometheus sub-tasks). Cloud Run already emits basic request logs to Cloud Logging automatically; this issue covers application-level instrumentation for per-endpoint, per-query latency breakdown.