Issue #119 — Automated health monitoring for TryVit.
```
UptimeRobot / cron ──► GET /api/health ──► service_role client ──► api_health_check() RPC
                              │
                              ▼
                       200 or 503 JSON

Admin dashboard ──► /app/admin/monitoring ──► fetch(/api/health) ──► auto-refresh 60 s
```
- URL: `GET /api/health`
- Authentication: none required (the endpoint calls Supabase via the service_role key server-side)
- Cache: `Cache-Control: no-store`, so every request is live
```json
{
  "status": "healthy",
  "checks": {
    "connectivity": true,
    "mv_staleness": {
      "mv_ingredient_frequency": {
        "mv_rows": 487,
        "source_rows": 487,
        "stale": false
      },
      "v_product_confidence": {
        "mv_rows": 3012,
        "source_rows": 3012,
        "stale": false
      }
    },
    "row_counts": {
      "products": 3012,
      "ceiling": 15000,
      "utilization_pct": 20.1
    }
  },
  "timestamp": "2026-02-22T14:35:00Z"
}
```

| Code | Meaning |
|---|---|
| 200 | `healthy` or `degraded`: system is operational |
| 503 | `unhealthy` or connection failure: investigation required |
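The table above boils down to a single mapping; a sketch (illustrative only, the actual logic lives in the API route):

```typescript
// Illustrative mapping of overall status to HTTP code.
// Note that "degraded" still returns 200: the system is operational, just worth watching.
function httpCodeFor(status: "healthy" | "degraded" | "unhealthy"): number {
  return status === "unhealthy" ? 503 : 200;
}
```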
| Status | Trigger |
|---|---|
| `healthy` | All checks pass, utilization < 80% |
| `degraded` | MV is stale OR utilization 80–95% |
| `unhealthy` | Product count = 0 OR utilization > 95% OR DB connection failure |
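The trigger rules above can be condensed into a small decision function. This is a sketch assuming the RPC exposes these four inputs; the field names are illustrative, not the actual implementation in `api_health_check()`:

```typescript
type Status = "healthy" | "degraded" | "unhealthy";

// Illustrative derivation of the overall status from the individual checks.
// Field names are hypothetical; the real logic lives in api_health_check().
function deriveStatus(checks: {
  connectivity: boolean;   // DB reachable?
  anyMvStale: boolean;     // any materialized view out of sync?
  productCount: number;    // active (non-deprecated) products
  utilizationPct: number;  // productCount / ceiling * 100
}): Status {
  if (!checks.connectivity || checks.productCount === 0 || checks.utilizationPct > 95) {
    return "unhealthy";
  }
  if (checks.anyMvStale || checks.utilizationPct >= 80) {
    return "degraded";
  }
  return "healthy";
}
```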
The connectivity check returns `true` if the RPC executes successfully; if the Supabase database is unreachable, it reports `false` and the endpoint responds 503.
Compares row counts between:

- `mv_ingredient_frequency` vs `COUNT(DISTINCT ingredient_id)` in `product_ingredient`
- `v_product_confidence` vs the active product count
If the counts differ, the MV is flagged as stale. This usually means `REFRESH MATERIALIZED VIEW` hasn't run after the last pipeline execution.
Fix: run the MV refresh (triggered automatically by `ci_post_pipeline.sql`).
Tracks active products (non-deprecated) against a ceiling of 15,000. Designed for Supabase Free tier capacity planning.
| Utilization | Status | Action |
|---|---|---|
| < 80% | healthy | None |
| 80–95% | degraded | Plan cleanup or tier upgrade |
| > 95% | unhealthy | Immediate action: deprecate unused products or upgrade plan |
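As a worked example, the sample payload's `utilization_pct` of 20.1 is 3,012 active products against the 15,000 ceiling. Rounding to one decimal place is an assumption inferred from the sample value; the real computation happens inside `api_health_check()`:

```typescript
// Hypothetical recomputation of utilization_pct (the real value comes from the RPC).
function utilizationPct(activeProducts: number, ceiling: number = 15000): number {
  return Math.round((activeProducts / ceiling) * 1000) / 10; // one decimal place
}
// 3012 / 15000 * 100 = 20.08 → 20.1
```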
URL: /app/admin/monitoring
Access: Requires authentication (admin role recommended). Protected by existing auth middleware.
Features:
- Overall status banner with color-coded indicators (green/yellow/red)
- MV staleness cards for each materialized view
- Product row count gauge with progress bar
- Auto-refresh every 60 seconds
- TanStack Query with 30 s stale time
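The polling behaviour described above maps onto TanStack Query options along these lines. This is a sketch: the query key and endpoint handling are assumed, not taken from the actual page component:

```typescript
// Hypothetical query options for the monitoring page's health poll.
const healthQueryOptions = {
  queryKey: ["health"] as const,
  queryFn: () => fetch("/api/health").then((res) => res.json()),
  refetchInterval: 60_000, // auto-refresh every 60 seconds
  staleTime: 30_000,       // 30 s stale time
};
```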
UptimeRobot provides 50 free HTTP monitors with 5-minute intervals.
- Create an account at https://uptimerobot.com (no credit card required)
- Go to Dashboard → Add New Monitor
- Configure the monitor:
| Setting | Value |
|---|---|
| Monitor Type | HTTP(s) |
| Friendly Name | TryVit — Production |
| URL | https://your-domain.vercel.app/api/health |
| Monitoring Interval | 5 minutes |
| Monitor Timeout | 10 seconds |
- Under Alert Contacts, add your preferred notification method:
- Email (included free)
- Slack webhook (included free)
- Discord webhook (included free)
- Under Advanced Settings:
  - HTTP Method: `GET`
  - Alert after: 2 consecutive failures (avoids false positives from transient Vercel cold starts)
  - HTTP status: alert when status ≠ `200`
- Click Create Monitor
Add a keyword monitor alongside the HTTP monitor:
| Setting | Value |
|---|---|
| Monitor Type | Keyword |
| URL | Same as above |
| Keyword Value | "status":"unhealthy" |
| Alert When | Keyword exists |
This catches degraded-but-200 responses that the HTTP status check alone would miss.
BetterStack offers a generous free tier with more granular alerting.
- Create an account at https://betterstack.com
- Go to Uptime → Monitors → Create Monitor
- Configure:
| Setting | Value |
|---|---|
| URL | https://your-domain.vercel.app/api/health |
| Check period | 3 minutes |
| Request timeout | 10 seconds |
| Regions | EU West + US East (at minimum) |
| Expected status | 200 |
- Under On-call → Escalation Policies, configure the escalation chain (see below)
- Create an incident when HTTP status ≠ 200 for 2 consecutive checks
| Condition | Threshold | Action |
|---|---|---|
| HTTP status ≠ 200 | 2 consecutive failures | Trigger alert |
| Response time | > 5000 ms | Trigger warning |
| Response time | > 10000 ms | Trigger critical alert |
| Downtime duration | > 2 minutes | Escalate (see below) |
| SSL certificate expiry | < 14 days | Email warning |
| Time after alert | Channel | Contact |
|---|---|---|
| 0 min | Email | your-email@example.com |
| 5 min | Slack / Discord | #alerts channel webhook |
| 15 min | Phone / SMS | +1-XXX-XXX-XXXX (on-call) |
Note: Replace placeholder contacts with actual values when configuring. Phone escalation is optional — only configure if you have a paid plan that supports it.
| Metric | Target | Budget |
|---|---|---|
| Uptime | 99.5% | ~3.6 hours downtime / month |
| Health endpoint response time | < 2 seconds | p95 |
| Incident acknowledgment | < 15 minutes | During business hours |
| Incident resolution | < 2 hours | For unhealthy status |
The 99.5% SLA is a documented target, not an enforced SLO. It serves as a planning guide for monitoring frequency and escalation urgency.
After every deployment, the CI pipeline automatically verifies health:

- `deploy.yml` pushes migrations to Supabase
- Waits 30 seconds for edge functions to stabilize
- Curls `/api/health` and asserts HTTP 200
- If the health check fails, the deployment is marked as failed in GitHub Actions
- Inspect the step summary for response body details
```shell
# Quick check
curl -s https://your-domain.vercel.app/api/health | jq '.status'
# Expected: "healthy"

# Full response
curl -s https://your-domain.vercel.app/api/health | jq .

# With timing
curl -o /dev/null -s -w "HTTP %{http_code} in %{time_total}s\n" \
  https://your-domain.vercel.app/api/health
```

If unhealthy after deploy, see Rollback Procedures (Issue #121).
Suite #30: Monitoring & Health Check — 7 checks in QA__monitoring.sql
| # | Check |
|---|---|
| 1 | api_health_check() returns valid JSONB |
| 2 | Status is valid enum (healthy/degraded/unhealthy) |
| 3 | Top-level keys present (status, checks, timestamp) |
| 4 | MV staleness values are non-negative |
| 5 | Row count matches actual product count |
| 6 | Connectivity is true |
| 7 | Timestamp is valid ISO-8601 format |
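Checks 2, 3, and 7 can also be mirrored client-side without a schema library. A minimal sketch (the real suite runs in SQL, and the frontend contract test uses Zod):

```typescript
// Hypothetical validator mirroring QA checks 2, 3, and 7 above.
function validateHealthPayload(payload: any): string[] {
  const errors: string[] = [];
  // Check 2: status is a valid enum value
  if (!["healthy", "degraded", "unhealthy"].includes(payload?.status)) {
    errors.push("status is not a valid enum value");
  }
  // Check 3: top-level keys present
  for (const key of ["status", "checks", "timestamp"]) {
    if (!payload || !(key in payload)) errors.push(`missing top-level key: ${key}`);
  }
  // Check 7: timestamp parses as an ISO-8601 date
  if (Number.isNaN(Date.parse(payload?.timestamp))) {
    errors.push("timestamp is not a valid ISO-8601 date");
  }
  return errors;
}
```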
The health endpoint requires these server-side environment variables:
| Variable | Purpose |
|---|---|
| `NEXT_PUBLIC_SUPABASE_URL` | Supabase project URL |
| `SUPABASE_SERVICE_ROLE_KEY` | Service role key (server-side only, never exposed to client) |
Both are already configured in Vercel for production.
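A common guard, shown here as a hypothetical helper (not part of the codebase), is to fail fast at startup when either variable is missing rather than surfacing an opaque 503 later:

```typescript
// Hypothetical fail-fast check for the variables in the table above.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}

// Usage (variable names from the table above):
// const url = requireEnv("NEXT_PUBLIC_SUPABASE_URL");
```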
- `api_health_check()` is `SECURITY DEFINER` and runs as the function owner
- Access is restricted: `REVOKE ALL` from `PUBLIC`/`anon`/`authenticated`, `GRANT` to `service_role` only
- The API route sanitizes the response shape to prevent data leaks
- No secrets, connection strings, or infrastructure details are exposed
- The `/api/health` route is excluded from auth middleware (the matcher already excludes `/api`)
| Condition | Who | Action | SLA |
|---|---|---|---|
| Status `degraded` for > 1 hour | Developer | Check MV refresh schedule, run pipeline | Acknowledge < 15 min |
| Status `unhealthy` | Developer | Check DB connectivity, verify product count | Resolve < 2 hours |
| Utilization > 90% | Project lead | Plan capacity: cleanup deprecated products or upgrade Supabase tier | Plan within 24 hours |
| Post-deploy health check fails | Developer | Inspect step summary, consider rollback | Immediate |
See Alert Thresholds and SLA Target above for numeric targets.
| File | Purpose |
|---|---|
| `supabase/migrations/20260222000400_health_check_monitoring.sql` | RPC function |
| `frontend/src/app/api/health/route.ts` | Next.js API route |
| `frontend/src/app/api/health/route.test.ts` | Unit tests (15 tests) |
| `frontend/src/app/api/health/health-contract.test.ts` | Zod contract test (16 tests) |
| `frontend/src/app/app/admin/monitoring/page.tsx` | Admin dashboard |
| `frontend/src/lib/supabase/service.ts` | Service-role client |
| `.github/workflows/deploy.yml` | Deploy workflow with post-deploy health check |
| `db/qa/QA__monitoring.sql` | QA suite (7 checks) |