simsteward · simsteward · Apr 1, 2026 · Mar 26, 2026 · Apr 1, 2026
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
@@ -150,4 +150,31 @@ All session context fields fall back to `"not in session"` (use `SessionLogging.
 - [ ] New iRacing SDK event handler → structured log with `domain="iracing"`
 - [ ] `iracing_incident` / `incident_detected` log → full uniqueness signature (`unique_user_id`, start/end frame, camera)
 
-**Canonical reference:** [docs/RULES-ActionCoverage.md](../docs/RULES-ActionCoverage.md)
+**Canonical reference:** [docs/RULES-ActionCoverage.md](../docs/RULES-ActionCoverage.md)
+
+---
+
+## Grafana Alert Covenant
+
+Every behavioral change to the plugin, dashboard, or LLM integration MUST include a Grafana alert review. **Alert silence ≠ alert passing.**
+
+### Change → Domain quick-reference
+
+| Change type | Domain to check |
+|---|---|
+| New `DispatchAction` branch | Domain 3 — `action-failure-streak` thresholds |
+| New iRacing SDK event | Domains 3 + 7 — session/replay rules |
+| New Claude API integration | Domains 4 + 5 — session health + cost |
+| New MCP tool | Domain 4 — `mcp-service-errors`, `tool-loop-detected` |
+| Log event renamed/removed | Search alert YAMLs — alert will go **silent**, not fire |
+| New log event/field | Consider whether a new alert rule is warranted |
+| Sentinel code change | Domain 6 — self-health rules |
+
+### PR Checklist addition
+
+- [ ] Reviewed impacted Grafana alert domains (see table above)
+- [ ] Verified no alert queries break silently if log events were renamed/removed
+- [ ] Considered new alert rule if new log events were added
+
+**Alert YAML files:** `observability/local/grafana/provisioning/alerting/` (46 rules, 8 domains)
+**Canonical reference:** [docs/RULES-GrafanaAlerts.md](../docs/RULES-GrafanaAlerts.md)
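The "search alert YAMLs" step for renamed or removed log events could be scripted rather than done by eye. Below is a minimal sketch, assuming the rules live as `*.yml` files in the alerting directory named above. The function name `find_event_references` is hypothetical and not part of the repository, and the match is a plain substring search, so it may also report unrelated occurrences of the same text.

```python
from pathlib import Path


def find_event_references(alert_dir, event_name):
    """Return (file name, line number, stripped line) for each line mentioning event_name.

    Plain substring match over every *.yml file in alert_dir; the YAML is not parsed.
    """
    hits = []
    for path in sorted(Path(alert_dir).glob("*.yml")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if event_name in line:
                hits.append((path.name, lineno, line.strip()))
    return hits
```

Calling `find_event_references("observability/local/grafana/provisioning/alerting", "action_result")` before renaming `action_result` would list the rule lines that still reference the old name. An empty list suggests that no provisioned rule depends on the event.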
diff --git a/.playwright-mcp/console-2026-03-26T21-06-47-084Z.log b/.playwright-mcp/console-2026-03-26T21-06-47-084Z.log
@@ -0,0 +1,2 @@
+[     885ms] [WARNING] <meta name="apple-mobile-web-app-capable" content="yes"> is deprecated. Please include <meta name="mobile-web-app-capable" content="yes"> @ http://localhost:3000/login:0
+[   23600ms] [ERROR] Failed to load resource: the server responded with a status of 401 (Unauthorized) @ http://localhost:3000/login:0
diff --git a/.playwright-mcp/console-2026-03-26T21-07-25-658Z.log b/.playwright-mcp/console-2026-03-26T21-07-25-658Z.log
@@ -0,0 +1,3 @@
+[     292ms] [WARNING] <meta name="apple-mobile-web-app-capable" content="yes"> is deprecated. Please include <meta name="mobile-web-app-capable" content="yes"> @ http://localhost:3000/login:0
+[   15683ms] [ERROR] Failed to load resource: the server responded with a status of 401 (Unauthorized) @ http://localhost:3000/login:0
+[   49060ms] [ERROR] Failed to load resource: the server responded with a status of 401 (Unauthorized) @ http://localhost:3000/login:0
diff --git a/.playwright-mcp/console-2026-03-26T21-11-17-962Z.log b/.playwright-mcp/console-2026-03-26T21-11-17-962Z.log
@@ -0,0 +1,2 @@
+[     602ms] [WARNING] <meta name="apple-mobile-web-app-capable" content="yes"> is deprecated. Please include <meta name="mobile-web-app-capable" content="yes"> @ http://localhost:3000/login:0
+[   18682ms] [WARNING] <meta name="apple-mobile-web-app-capable" content="yes"> is deprecated. Please include <meta name="mobile-web-app-capable" content="yes"> @ http://localhost:3000/d/claude-token-cost?orgId=1&from=now-7d&to=now&kiosk:0
diff --git a/.playwright-mcp/console-2026-03-26T21-11-55-773Z.log b/.playwright-mcp/console-2026-03-26T21-11-55-773Z.log
@@ -0,0 +1 @@
+[     648ms] [WARNING] <meta name="apple-mobile-web-app-capable" content="yes"> is deprecated. Please include <meta name="mobile-web-app-capable" content="yes"> @ http://localhost:3000/d/claude-cache-context?orgId=1&from=now-7d&to=now:0
diff --git a/.playwright-mcp/console-2026-03-26T21-12-23-787Z.log b/.playwright-mcp/console-2026-03-26T21-12-23-787Z.log
@@ -0,0 +1,2 @@
+[     636ms] [WARNING] <meta name="apple-mobile-web-app-capable" content="yes"> is deprecated. Please include <meta name="mobile-web-app-capable" content="yes"> @ http://localhost:3000/d/simsteward-log-sentinel?orgId=1&from=now-6h&to=now:0
+[   59082ms] [ERROR] WebSocket connection to 'ws://localhost:3000/api/live/ws' failed: Connection closed before receiving a handshake response @ http://localhost:3000/public/build/3855.c53eb219979d7cb3b2d4.js:1312
diff --git a/cache-context-dashboard.png b/cache-context-dashboard.png
diff --git a/docs/RULES-GrafanaAlerts.md b/docs/RULES-GrafanaAlerts.md
@@ -0,0 +1,93 @@
+# Grafana Alert Rules — Development Covenant
+
+Every behavioral change to the plugin, dashboard, or LLM integration **must include a
+corresponding Grafana alert review**. Silence is not the same as passing.
+
+**Canonical spec:** `docs/superpowers/specs/2026-03-30-grafana-alerts-design.md`
+**Alert YAML files:** `observability/local/grafana/provisioning/alerting/`
+
+---
+
+## Change → Domain Mapping
+
+| Change type | Domain(s) to review |
+|---|---|
+| New action handler in `DispatchAction` | Domain 3 — check `action-failure-streak` thresholds |
+| New iRacing SDK event handler | Domain 3 and/or Domain 7 — check incident/replay rules |
+| New Claude API integration | Domains 4 + 5 — session health and cost rules |
+| New MCP tool added | Domain 4 — `mcp-service-errors`, `tool-loop-detected` |
+| New log event or field added | Check all domains — does it need a new alert? |
+| Removing or renaming a log event | Search alert YAMLs for old name — alert will go **silent**, not fire |
+| Changing cost fields in token metrics | Domain 5 — all cost threshold alerts |
+| Changing session lifecycle events | Domains 3, 4, 8 — session start/end correlation |
+| Sentinel code change | Domain 6 — self-health rules |
+| Grafana dashboard change | Domain 8 — cross-stream rules may need annotation updates |
+
+---
+
+## Alert Silence ≠ Alert Passing
+
+When you rename or remove a log event:
+- The alert query will return **no data** (not 0)
+- If `noDataState: OK` — the alert silently stops firing
+- This is a **silent regression** — harder to detect than a real alert
+
+Always check `noDataState` when modifying events that existing alerts depend on.
+
+---
+
+## Testing New Alerts
+
+To verify an alert fires correctly before relying on it:
+
+1. **Write a test event to Loki** via the gateway:
+   ```bash
+   curl -X POST http://localhost:3500/loki/api/v1/push \
+     -H "Content-Type: application/json" \
+     -d '{
+       "streams": [{
+         "stream": {"app": "sim-steward", "env": "local", "level": "ERROR"},
+         "values": [["'"$(date +%s%N)"'", "{\"level\":\"ERROR\",\"event\":\"test\",\"message\":\"test alert\"}"]]
+       }]
+     }'
+   ```
+
+2. **Temporarily lower the threshold** in the alert rule to `0` and set the evaluation interval to `10s` in Grafana UI (do not commit this change).
+
+3. **Verify the alert fires** in Grafana UI → Alerting → Alert Rules within the evaluation window.
+
+4. **Verify the `/trigger` webhook** receives the payload:
+   ```bash
+   # Check log-sentinel logs
+   docker compose logs log-sentinel --tail=20
+   ```
+
+5. **Restore the threshold** before committing any YAML changes.
+
+---
+
+## Alert Catalog Summary
+
+| File | Domains | Count |
+|---|---|---|
+| `rules-infrastructure.yml` | 1+2: Infrastructure & Deploy Quality | 10 |
+| `rules-iracing.yml` | 3+7: iRacing Session + Replay | 10 |
+| `rules-claude-sessions.yml` | 4: Claude Code Session Health | 7 |
+| `rules-token-cost.yml` | 5: Token & Cost Budget | 7 |
+| `rules-sentinel-health.yml` | 6: Sentinel Self-Health | 7 |
+| `rules-cross-stream.yml` | 8: Cross-Stream Correlation | 5 |
+| **Total** | | **46** |
+
+T2-tier alerts (skip `needs_t2` gate, escalate immediately):
+`subagent-explosion`, `tool-loop-detected`, `session-cost-critical`, `daily-spend-critical`,
+`ws-claude-coinflict`, `session-token-abandon`, `action-fail-session-fail`, `deploy-triple-signal`
+
+---
+
+## PR Checklist Addition
+
+For any PR modifying plugin behavior, add to the review checklist:
+
+- [ ] Reviewed Grafana alert domains for impacted change type (see table above)
+- [ ] If log events were renamed/removed: verified no alert queries silently break
+- [ ] If new log events added: considered whether a new alert rule is warranted
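The test-event push in "Testing New Alerts" can also be written in Python, which avoids the nested shell quoting and does not depend on `date +%s%N` (not supported by every `date` implementation). This is a sketch under assumptions: it reuses the gateway URL and labels from the curl example, and the helper names `build_push_payload` and `push_test_event` are illustrative only.

```python
import json
import time
import urllib.request

# Gateway URL taken from the curl example; adjust if your stack differs.
LOKI_PUSH_URL = "http://localhost:3500/loki/api/v1/push"


def build_push_payload(labels, event, timestamp_ns=None):
    """Build a Loki push body carrying one JSON-encoded log line."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()  # Loki expects nanoseconds since the epoch
    line = json.dumps(event)
    # The timestamp is sent as a string, matching the curl example.
    return {"streams": [{"stream": labels, "values": [[str(timestamp_ns), line]]}]}


def push_test_event(payload, url=LOKI_PUSH_URL):
    """POST the payload to Loki and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

To reproduce the curl call, build the payload with the labels `{"app": "sim-steward", "env": "local", "level": "ERROR"}` and the event `{"level": "ERROR", "event": "test", "message": "test alert"}`, then pass it to `push_test_event`.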
diff --git a/docs/superpowers/specs/2026-03-30-grafana-alerts-design.md b/docs/superpowers/specs/2026-03-30-grafana-alerts-design.md
@@ -0,0 +1,218 @@
+# Grafana Alerts Design — Log Sentinel Layer 0
+**Date:** 2026-03-30
+**Status:** Approved
+
+---
+
+## Context
+
+The log-sentinel V2 LLM investigation pipeline (T1 triage + T2 agentic tool loop) is expensive to run continuously — qwen3:8b T1 scan on a 6700 XT takes 60-90 seconds, T2 takes 3-4 minutes. Running this on a fixed hourly poll means real incidents can sit undetected for up to 60 minutes, and the models waste cycles on quiet periods.
+
+Grafana Alerts solves this as **Layer 0**: always-on, no GPU cost, fires webhooks only when something is actually wrong. The sentinel switches from polling to event-driven. When Grafana fires, it delivers structured alert context (labels, values, timeframe) directly in the webhook payload — T1 skips cold-start gathering for the relevant domain and goes straight to targeted investigation.
+
+**Layer 0 (Grafana Alerts) → Layer 1 (T1 fast triage) → Layer 2 (T2 agentic tool loop)**
+
+---
+
+## Alert Architecture
+
+### Transport: Webhook-Only
+Grafana alert notifications route exclusively to log-sentinel's `/trigger` HTTP endpoint. No email, Slack, or PagerDuty at this stage. The sentinel logs every trigger, runs the appropriate tier, and emits findings to Loki (queryable by Grafana dashboards).
+
+### Provisioning Structure
+All alerts are provisioned as code — no manual UI configuration:
+```
+observability/local/grafana/provisioning/alerting/
+  contact-points.yml           # webhook endpoint definition
+  notification-policies.yml    # routing: all alerts → webhook
+  rules-infrastructure.yml     # Domains 1+2
+  rules-iracing.yml            # Domains 3+7
+  rules-claude-sessions.yml    # Domain 4
+  rules-token-cost.yml         # Domain 5
+  rules-sentinel-health.yml    # Domain 6
+  rules-cross-stream.yml       # Domain 8
+```
+
+### Trigger Tier Labeling
+Every alert rule carries a `trigger_tier` label (`t1` or `t2`). The sentinel's `/trigger` endpoint reads this label and routes accordingly — T1 for most alerts, T2 for critical multi-signal correlations.
+
+---
+
+## Alert Catalog
+
+### Domain 1+2: Infrastructure & Deploy Quality (10 alerts)
+
+| Alert ID | LogQL / Condition | Severity | Tier |
+|---|---|---|---|
+| `bridge-start-failed` | `count_over_time({app="sim-steward"} \| json \| event="plugin_lifecycle" \| level="ERROR" [5m]) > 0` | critical | T1 |
+| `plugin-never-ready` | plugin_lifecycle start, no ready within 60s | warn | T1 |
+| `sentinel-cycle-stalled` | No `sentinel_cycle` event in 90 min | critical | T1 |
+| `ollama-unreachable` | `sentinel_health` event with `ollama_reachable=false` | critical | T1 |
+| `loki-circuit-open` | `sentinel_health` with `loki_circuit_open=true` | critical | T1 |
+| `post-deploy-warn-rate` | WARN rate > 5/min in 10 min after lifecycle event | warn | T1 |
+| `bridge-failure-post-deploy` | ERROR in sim-steward within 15 min of plugin_start | critical | T1 |
+| `plugin-slow-start` | Time from plugin_lifecycle start → ready > 30s | warn | T1 |
+| `error-spike-post-deploy` | Error count doubles vs prior 15 min window after deploy | warn | T1 |
+| `error-spike-general` | `count_over_time({app="sim-steward"} \| json \| level="ERROR" [10m]) > 10` | warn | T1 |
+
+### Domain 3: iRacing Session Behavior (5 alerts)
+
+| Alert ID | Condition | Severity | Tier |
+|---|---|---|---|
+| `session-no-actions` | Session active 15+ min, zero `action_dispatched` events | warn | T1 |
+| `session-no-end` | `iracing_session_start` with no `iracing_session_end` within 4h | warn | T1 |
+| `action-failure-streak` | 3+ consecutive `action_result` errors in same session | critical | T1 |
+| `websocket-disconnect-spike` | 3+ `websocket_disconnect` events in 5 min | warn | T1 |
+| `incident-detection-zero` | iRacing session > 30 min, zero `iracing_incident` events | warn | T1 |
+
+### Domain 4: Claude Code Session Health (7 alerts)
+
+| Alert ID | Condition | Severity | Tier |
+|---|---|---|---|
+| `session-abandoned` | Session start, no completion token entry, no activity for 30 min | warn | T1 |
+| `claude-error-spike` | 5+ ERROR entries in claude-dev-logging in 5 min | warn | T1 |
+| `permission-flood` | 10+ permission-related log entries in 5 min | warn | T1 |
+| `subagent-explosion` | Subagent spawn count > 20 in single session | warn | T2 |
+| `mcp-service-errors` | MCP call failures > 5 in 10 min | warn | T1 |
+| `tool-loop-detected` | Same tool called 5+ times in same session without progress | warn | T2 |
+| `session-zero-output` | Session completes (token entry exists), zero assistant messages logged | warn | T1 |
+
+### Domain 5: Token/Cost Budget (7 alerts)
+
+| Alert ID | Condition | Severity | Tier |
+|---|---|---|---|
+| `session-cost-spike` | Single session cost > $1.00 | warn | T1 |
+| `session-cost-critical` | Single session cost > $3.00 | critical | T2 |
+| `daily-spend-warning` | Rolling 24h spend > $10.00 | warn | T1 |
+| `daily-spend-critical` | Rolling 24h spend > $25.00 | critical | T2 |
+| `tool-use-flood` | Tool calls per session > 100 | warn | T1 |
+| `unexpected-model` | Model field not in approved set (claude-opus-4, claude-sonnet-4-6, etc.) | warn | T1 |
+| `cache-hit-rate-low` | Cache hit rate < 20% over 1h (when sessions active) | info | T1 |
+
+### Domain 6: Sentinel Self-Health (7 alerts)
+
+| Alert ID | Condition | Severity | Tier |
+|---|---|---|---|
+| `sentinel-cycle-stalled` | No `sentinel_cycle` event in 90 min | critical | T1 |
+| `detector-error-rate` | Detector errors > 3 in single cycle | warn | T1 |
+| `t1-slow` | T1 inference duration > 120s | warn | T1 |
+| `t2-slow` | T2 tool loop duration > 300s | warn | T1 |
+| `sentry-flood` | Sentry-worthy findings > 5 in 1h | warn | T1 |
+| `findings-flood` | Total findings > 20 in single cycle | warn | T1 |
+| `zero-findings-48h` | No findings at all in 48h (system may be suppressing) | info | T1 |
+
+### Domain 7: Replay & Incident Investigation (5 alerts)
+
+| Alert ID | Condition | Severity | Tier |
+|---|---|---|---|
+| `replay-no-seeks` | Replay started, zero `iracing_replay_seek` in 5 min | warn | T1 |
+| `incident-detection-stall` | iRacing session active > 30 min, zero `iracing_incident` events in replay mode | warn | T1 |
+| `incident-camera-stuck` | Same `camera_view` on 3+ consecutive incidents | info | T1 |
+| `replay-session-no-close` | Replay session start, no session_end within 2h | warn | T1 |
+| `action-incident-gap` | Incident detected, no `action_dispatched` within 10 min | info | T1 |
+
+### Domain 8: Cross-Stream Correlation (5 alerts)
+*Implemented as multi-query rules using Grafana `math` expressions; each rule fires only when both conditions are true simultaneously.*
+
+| Alert ID | Streams | Condition | Severity | Tier |
+|---|---|---|---|---|
+| `ws-claude-coinflict` | sim-steward + claude-dev-logging | WebSocket disconnect + Claude ERROR in same 5-min window | warn | T2 |
+| `session-token-abandon` | claude-dev-logging + claude-token-metrics | Session ERROR + no token entry for that session_id | warn | T2 |
+| `action-fail-session-fail` | sim-steward + claude-dev-logging | `action_result` errors + Claude session ERROR within 10 min | critical | T2 |
+| `deploy-triple-signal` | all 3 streams | 2+ streams show elevated error rate within 15 min of plugin lifecycle event | critical | T2 |
+| `cost-spike-tool-flood` | claude-dev-logging + claude-token-metrics | Tool call count spike + session cost spike in same cycle | warn | T1 |
+
+**Total: 46 alerts across 8 domains.**
+
+---
+
+## `/trigger` Endpoint Design
+
+The log-sentinel app gains a new HTTP endpoint:
+
+```
+POST /trigger
+Content-Type: application/json
+
+{
+  "alerts": [{
+    "labels": {
+      "alertname": "ws-claude-coinflict",
+      "trigger_tier": "t2",
+      "severity": "warn"
+    },
+    "annotations": {
+      "summary": "WebSocket disconnects co-occurring with Claude errors",
+      "description": "3 ws_disconnect events and 2 Claude ERROR entries in 5-min window ending 14:32:00"
+    },
+    "startsAt": "2026-03-30T14:32:00Z",
+    "endsAt": "0001-01-01T00:00:00Z"
+  }]
+}
+```
+
+Sentinel behavior on receipt:
+1. Parse alert labels — extract `alertname`, `trigger_tier`, `severity`
+2. Derive lookback window from `startsAt` (default: 30 min before alert fired)
+3. If `trigger_tier=t1`: run T1 with alert context injected into summary prompt
+4. If `trigger_tier=t2`: run T1 (for context) then immediately run T2 — skip the `needs_t2` gate
+5. Deduplicate: if the same `alertname` already triggered within the last `SENTINEL_DEDUP_WINDOW_SEC` seconds, skip the T1/T2 run in steps 3-4 for this alert
+6. Log `sentinel_trigger` event to Loki with alert metadata
+
+Alert context injection into T1 prompt:
+```
+ALERT CONTEXT (from Grafana):
+  Alert: ws-claude-coinflict (warn)
+  Fired: 2026-03-30 14:32:00 UTC
+  Description: 3 ws_disconnect events and 2 Claude ERROR entries in 5-min window
+  → Focus investigation on this signal. Do not suppress even if recent history is quiet.
+```
+
+---
+
+## Alert Covenant (Living Document)
+
+**Every behavioral change to the plugin, dashboard, or LLM integration must include a corresponding Grafana alert review.**
+
+When adding or changing:
+- A new action handler → check Domain 3 (action-failure-streak thresholds)
+- A new Claude integration → check Domain 4 + 5
+- A new log event or field → check if it should trigger an alert in the relevant domain
+- Removing a log event → check if any alert depends on it (alert will go silent, not fire)
+
+Alert silence ≠ alert passing. Test new alerts by writing a test event to Loki via the gateway and verifying the alert fires within its evaluation window.
+
+**Canonical reference: `docs/RULES-GrafanaAlerts.md`** (to be added to CLAUDE.md)
+
+---
+
+## Implementation Files
+
+### New files
+- `observability/local/grafana/provisioning/alerting/contact-points.yml`
+- `observability/local/grafana/provisioning/alerting/notification-policies.yml`
+- `observability/local/grafana/provisioning/alerting/rules-infrastructure.yml`
+- `observability/local/grafana/provisioning/alerting/rules-iracing.yml`
+- `observability/local/grafana/provisioning/alerting/rules-claude-sessions.yml`
+- `observability/local/grafana/provisioning/alerting/rules-token-cost.yml`
+- `observability/local/grafana/provisioning/alerting/rules-sentinel-health.yml`
+- `observability/local/grafana/provisioning/alerting/rules-cross-stream.yml`
+- `docs/RULES-GrafanaAlerts.md`
+
+### Modified files
+- `observability/local/log-sentinel/app.py` — add `POST /trigger` endpoint
+- `observability/local/log-sentinel/sentinel.py` — add `trigger_cycle()` method (alert-context-aware T1/T2 dispatch)
+- `observability/local/log-sentinel/config.py` — no new fields needed (uses existing dedup window)
+- `observability/local/docker-compose.yml` — no changes needed (grafana already provisioned, port 3000)
+- `.claude/CLAUDE.md` — add alert covenant reference
+
+---
+
+## Verification
+
+1. **Provisioning loads**: run `docker compose up grafana`, then confirm that Grafana UI → Alerting → Alert Rules shows all 46 rules
+2. **Webhook fires**: Manually set an alert rule to always-firing in Grafana UI, verify `/trigger` receives POST and logs `sentinel_trigger` event to Loki
+3. **T1 trigger path**: Confirm T1 runs after a non-critical alert fires, `sentinel_analyst_run` appears in logs with `trigger_source=grafana_alert`
+4. **T2 direct trigger**: Confirm T2 runs immediately (skipping `needs_t2` gate) when `trigger_tier=t2` alert fires
+5. **Dedup**: Fire the same alert twice within the dedup window, verify the second is silently skipped
+6. **Cross-stream rule**: Write test events to both sim-steward and claude-dev-logging streams via Loki push API, verify `ws-claude-coinflict` fires
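The `/trigger` behavior described in the spec (label parsing, lookback window, tier routing, dedup) can be sketched as a pure planning function. This is not the actual `app.py` or `sentinel.py` code. The name `plan_trigger`, the in-memory `last_seen` dict, the fallback to `t1` when the label is missing, and checking dedup before dispatch are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=30)  # default lookback stated in the spec


def plan_trigger(payload, last_seen, now, dedup_window_sec):
    """Turn a Grafana webhook payload into a list of dispatch decisions.

    last_seen maps alertname -> unix time of the last accepted trigger and is
    updated in place. Nothing is executed here; the caller would run the tiers.
    """
    plans = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unknown")
        tier = labels.get("trigger_tier", "t1")  # assumption: default to T1

        # Dedup first, so a repeated alert never reaches T1 or T2.
        previous = last_seen.get(name)
        if previous is not None and now - previous < dedup_window_sec:
            plans.append({"alertname": name, "action": "skip_duplicate"})
            continue
        last_seen[name] = now

        # Grafana sends RFC 3339 timestamps ending in "Z"; normalise the suffix
        # so fromisoformat also accepts it on Python versions before 3.11.
        fired_at = datetime.fromisoformat(alert["startsAt"].replace("Z", "+00:00"))
        plans.append({
            "alertname": name,
            "severity": labels.get("severity"),
            "action": "t1_then_t2" if tier == "t2" else "t1",
            "lookback_start": fired_at - LOOKBACK,
            "lookback_end": fired_at,
        })
    return plans
```

Keeping the decision separate from the inference calls makes the dedup and tier-routing checks from the Verification list testable without a GPU or a running Grafana instance.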
diff --git a/log-sentinel-dashboard.png b/log-sentinel-dashboard.png