diff --git a/docs/MANUAL_TEST_CHECKLIST.md b/docs/MANUAL_TEST_CHECKLIST.md index 92d3add81..52c2cbcfc 100644 --- a/docs/MANUAL_TEST_CHECKLIST.md +++ b/docs/MANUAL_TEST_CHECKLIST.md @@ -463,3 +463,14 @@ Summary scope: 4. True-missing vs cross-user denial indistinguishability (B-90 to B-96) 5. Error payload contract verification for auth/validation/sandbox paths (B-100 to B-110) 6. Advanced controller families: ops/logs/users/abuse/llm-quota/agents/knowledge/webhooks/external-imports (B-130 to B-175) + +--- + +## Incident Rehearsals + +For operational failure diagnosis and recovery validation beyond functional testing, see the incident rehearsal program: + +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule and rotation +- `docs/ops/rehearsal-scenarios/` -- scenario templates +- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format +- `docs/ops/rehearsals/` -- completed rehearsal evidence diff --git a/docs/TESTING_GUIDE.md b/docs/TESTING_GUIDE.md index e2808f681..113eec5db 100644 --- a/docs/TESTING_GUIDE.md +++ b/docs/TESTING_GUIDE.md @@ -604,6 +604,19 @@ Recommended execution pairing: - automated: API + frontend unit + E2E capture loop (`#210` delivered, retained as active regression path) - manual: capture friction/trust checks in `docs/MANUAL_TEST_CHECKLIST.md` +## Incident Rehearsals + +Manual incident rehearsals complement automated tests by validating diagnosis and recovery workflows against realistic failure conditions. Rehearsals are scheduled monthly (lightweight, ~30 min) and quarterly (deep drill, ~2 hours). 
+ +Key resources: +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule, rotation, and process +- `docs/ops/rehearsal-scenarios/` -- scenario templates (health degradation, telemetry gaps, deployment failures) +- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format +- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how rehearsal findings become tracked issues +- `docs/ops/rehearsals/` -- completed rehearsal evidence packages + +Rehearsals are distinct from the automated failure-injection drill suite (`docs/ops/FAILURE_INJECTION_DRILLS.md`). Drills are scripted and CI-runnable; rehearsals are human-driven and focus on diagnosis speed, tooling gaps, and recovery muscle memory. + ## Development Sandbox Mode For local development only, authorization bypass can be enabled via: diff --git a/docs/ops/EVIDENCE_TEMPLATE.md b/docs/ops/EVIDENCE_TEMPLATE.md new file mode 100644 index 000000000..f539c2fa4 --- /dev/null +++ b/docs/ops/EVIDENCE_TEMPLATE.md @@ -0,0 +1,119 @@ +# Rehearsal Evidence Package Template + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Purpose + +Every rehearsal produces an evidence package that records what happened, what was found, and what follow-up is needed. This template defines the required format. + +Evidence files are stored in `docs/ops/rehearsals/` with the naming convention: + +``` +YYYY-MM-DD_scenario-name.md +``` + +Example: `2026-03-29_degraded-api-health.md` + +--- + +## Template + +Copy the block below into a new file and fill in each section. 
+ +```markdown +# Rehearsal Evidence: [Scenario Name] + +## Metadata + +| Field | Value | +| --- | --- | +| Date | YYYY-MM-DD | +| Rehearsal type | Monthly / Quarterly deep drill | +| Scenario | [scenario filename from docs/ops/rehearsal-scenarios/] | +| Lead | [GitHub username] | +| Participants | [comma-separated GitHub usernames] | +| Commit SHA | [HEAD of main at rehearsal start] | +| OS / Environment | [e.g., Windows 10 Pro, Docker Desktop 4.x] | +| Duration | [actual elapsed time] | +| Outcome | Pass / Partial / Fail | + +## Timeline + +Use ISO 8601 timestamps (UTC). Record each significant action or observation. + +| Timestamp (UTC) | Actor | Action / Observation | +| --- | --- | --- | +| 2026-03-29T14:00:00Z | @lead | Started API with injected fault | +| 2026-03-29T14:02:30Z | @lead | Observed 503 on /health/ready | +| ... | ... | ... | + +## Commands Run + +Record every command executed during the rehearsal, in order. + +```bash +# Example +dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +curl http://localhost:5000/health/ready +``` + +## Log Excerpts + +Include relevant log output. Redact any secrets or PII. + +``` +[relevant log lines here] +``` + +## Root Cause / Diagnosis Summary + +Describe what the injected fault was, how it was detected, and what the diagnosis path looked like. + +## Recovery Actions Taken + +Describe the steps taken to restore the system to a healthy state. + +## Findings + +List any issues, gaps, or improvements discovered during the rehearsal. 
+ +- [ ] Finding 1: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"] +- [ ] Finding 2: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"] + +## Sign-Off + +| Role | Name | Date | Approved | +| --- | --- | --- | --- | +| Rehearsal lead | @username | YYYY-MM-DD | [ ] | +| Observer | @username | YYYY-MM-DD | [ ] | + +## Follow-Up Issues + +Link to any issues filed as a result of this rehearsal: + +- #NNN: [title] +- #NNN: [title] +``` + +--- + +## Required Artifacts Checklist + +Every evidence package must include: + +- [ ] Completed metadata table with all fields filled +- [ ] Timeline with at least 3 entries (start, key observation, resolution) +- [ ] Commands run section with actual commands (not placeholders) +- [ ] At least one log excerpt or explanation of why logs were unavailable +- [ ] Root cause / diagnosis summary +- [ ] Recovery actions taken +- [ ] Findings list (even if empty -- state "No new findings") +- [ ] Sign-off from at least the rehearsal lead +- [ ] Follow-up issues linked (or "None" if no issues were filed) + +## Related Documents + +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation +- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how to file findings as issues +- `docs/ops/rehearsal-scenarios/` -- scenario templates diff --git a/docs/ops/INCIDENT_REHEARSAL_CADENCE.md b/docs/ops/INCIDENT_REHEARSAL_CADENCE.md new file mode 100644 index 000000000..aa14fb988 --- /dev/null +++ b/docs/ops/INCIDENT_REHEARSAL_CADENCE.md @@ -0,0 +1,90 @@ +# Incident Rehearsal Cadence + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Purpose + +Rehearsals validate that the team can diagnose and recover from production-realistic failures using real tooling and documented procedures. They also surface gaps in observability, runbooks, and recovery automation before real incidents expose them. 
+ +## Monthly Lightweight Rehearsal + +| Field | Detail | +| --- | --- | +| Cadence | First working Thursday of each month | +| Duration | ~30 minutes | +| Scope | Single scenario from `docs/ops/rehearsal-scenarios/` | +| Lead | Rotating (see assignment model below) | +| Participants | Rehearsal lead + one observer minimum | +| Artifacts | Evidence package filed in `docs/ops/rehearsals/` | + +Steps: +1. Lead selects a scenario from the scenario library (prefer unexercised or recently-failed scenarios). +2. Announce the rehearsal in the team channel at least 24 hours in advance. +3. Execute the scenario using the template's injection method and diagnosis path. +4. Record an evidence package using `docs/ops/EVIDENCE_TEMPLATE.md`. +5. File any discovered issues per `docs/ops/REHEARSAL_BACKOFF_RULES.md`. + +## Quarterly Deep Drill + +| Field | Detail | +| --- | --- | +| Cadence | Second week of Q1/Q2/Q3/Q4 (January, April, July, October) | +| Duration | ~2 hours | +| Scope | Combined or cascading scenario (e.g., degraded health + deployment failure) | +| Lead | Rotating (same rotation, offset from monthly) | +| Participants | All active contributors | +| Artifacts | Evidence package + retrospective summary | + +Steps: +1. Lead designs a combined scenario at least one week before the drill date. +2. Distribute the scenario brief (pre-conditions, scope, goals) to all participants 48 hours in advance. +3. Execute the drill with explicit role assignments: incident commander, investigator, communicator. +4. Record the evidence package and a retrospective summary covering what went well, what was slow, and what tooling or documentation was missing. +5. File findings and retrospective actions per `docs/ops/REHEARSAL_BACKOFF_RULES.md`. + +## Rotation and Assignment Model + +Rehearsal lead rotates alphabetically by GitHub username among active contributors. 
+
+| Month | Lead selection |
+| --- | --- |
+| Month N | First contributor alphabetically who has not led in the current quarter |
+| Fallback | If the assigned lead is unavailable, the next person in rotation picks up |
+
+The rotation resets each quarter. Deep drills use the same rotation but are offset (the deep-drill lead should not be the same person who led the preceding monthly rehearsal).
+
+To check the current rotation state, see the most recent evidence file in `docs/ops/rehearsals/` -- the lead is recorded in the metadata section.
+
+## Calendar Integration
+
+Add rehearsal dates to the team calendar:
+
+- **Monthly**: recurring event on the first working Thursday of each month, 30 minutes, titled `[Taskdeck] Monthly Incident Rehearsal`
+- **Quarterly**: recurring event in the second week of Jan/Apr/Jul/Oct, 2 hours, titled `[Taskdeck] Quarterly Deep Drill`
+
+Include the following in the calendar event description:
+
+```
+Scenario library: docs/ops/rehearsal-scenarios/
+Evidence template: docs/ops/EVIDENCE_TEMPLATE.md
+Backlog rules: docs/ops/REHEARSAL_BACKOFF_RULES.md
+```
+
+## Scenario Library
+
+Available scenarios in `docs/ops/rehearsal-scenarios/`:
+
+- `degraded-api-health.md` -- API health endpoint returns degraded/unhealthy status
+- `missing-telemetry-signal.md` -- Correlation ID missing from OpenTelemetry traces
+- `mcp-server-startup-regression.md` -- Optional MCP server fails at boot
+- `deployment-readiness-failure.md` -- Docker Compose startup fails readiness checks
+
+New scenarios should follow the same template structure (pre-conditions, injection, diagnosis, recovery, evidence checklist). File them in the `rehearsal-scenarios/` directory with a descriptive kebab-case filename.
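To make the expected structure concrete, a new scenario file might start from a skeleton like this (section contents are illustrative placeholders, not a real scenario):

```markdown
# Scenario: [Descriptive Name]

Last Updated: YYYY-MM-DD
Issue: [tracking issue]

## Overview

[One paragraph: the simulated failure and the diagnosis/recovery goal.]

## Pre-Conditions

- [Repo state, running services, and tooling required before injection.]

## Injection Method

[Commands or configuration changes that introduce the fault.]

## Expected Diagnosis Path

1. [First check an operator would run.]
2. [How to interpret the result and narrow the cause.]

## Recovery Steps

[Commands that restore a healthy state, with verification.]

## Evidence Checklist

- [ ] [Artifacts the evidence package must capture.]

## Related Documents

- [Links to relevant runbooks or implementation files.]
```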
+
+## Related Documents
+
+- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
+- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- issue filing and SLA rules for findings
+- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts (complementary to manual rehearsals)
+- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry and dashboard contract
diff --git a/docs/ops/REHEARSAL_BACKOFF_RULES.md b/docs/ops/REHEARSAL_BACKOFF_RULES.md
new file mode 100644
index 000000000..ec44daed7
--- /dev/null
+++ b/docs/ops/REHEARSAL_BACKOFF_RULES.md
@@ -0,0 +1,118 @@
+# Rehearsal Backlog Handoff Rules
+
+Last Updated: 2026-03-29
+Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
+
+## Purpose
+
+Rehearsals surface real gaps. This document defines how findings from rehearsals become tracked work items with clear ownership and response expectations.
+
+## Filing Issues from Rehearsal Findings
+
+Every finding recorded in the evidence package that requires follow-up must be filed as a GitHub issue within 2 working days of the rehearsal.
+
+### Issue Title Convention
+
+```
+[rehearsal-finding] <short description>
+```
+
+Example: `[rehearsal-finding] Health endpoint does not report queue worker name on stale heartbeat`
+
+### Issue Body Requirements
+
+Each rehearsal-finding issue must include:
+
+1. **Source rehearsal link**: relative path to the evidence file (e.g., `docs/ops/rehearsals/2026-03-29_degraded-api-health.md`)
+2. **Finding description**: what was observed and why it matters
+3. **Reproduction steps**: commands or conditions that trigger the finding
+4. **Suggested fix or investigation path**: concrete next step, not just "look into this"
+5. **Severity label**: one of P1/P2/P3/P4 (see below)
+
+### Template
+
+```markdown
+## Source
+
+Rehearsal: `docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md`
+Finding #N from evidence package.
+ +## Description + +[What was observed and why it matters] + +## Reproduction + +[Commands or conditions] + +## Suggested Fix + +[Concrete next step] +``` + +## Label Conventions + +Apply the following labels to every rehearsal-finding issue: + +| Label | When to apply | +| --- | --- | +| `rehearsal-finding` | Always (primary identifier) | +| `hardening` | When the finding relates to reliability or operability | +| `bug` | When the finding is a defect in existing behavior | +| `docs` | When the finding is a documentation gap | +| `testing` | When the finding reveals missing test coverage | + +Severity labels: + +| Label | Meaning | +| --- | --- | +| `P1` | Blocks production readiness or causes data loss risk | +| `P2` | Degrades reliability or operator experience significantly | +| `P3` | Minor gap, workaround exists | +| `P4` | Cosmetic or nice-to-have improvement | + +If `rehearsal-finding` does not yet exist as a GitHub label, create it with color `#D4C5F9` and description `Finding surfaced during incident rehearsal`. + +## SLA Expectations + +| Severity | Triage SLA | Resolution target | +| --- | --- | --- | +| P1 | Same day | Next release / hotfix | +| P2 | 2 working days | Within current sprint | +| P3 | 5 working days | Within current quarter | +| P4 | Best effort | Backlog; pick up when convenient | + +"Triage" means the issue has been reviewed, assigned, and prioritized -- not necessarily started. + +## Connecting Findings to Evidence + +Every rehearsal-finding issue must link back to its source evidence file. Use the following format in the issue body: + +``` +Source rehearsal: docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md +``` + +Conversely, the evidence file's "Follow-Up Issues" section must link forward to all filed issues: + +```markdown +## Follow-Up Issues + +- #NNN: [title] +``` + +This bidirectional linking ensures no finding is orphaned. + +## Escalation + +If a P1 finding is discovered during a rehearsal: + +1. 
File the issue immediately (do not wait for the 2-day window). +2. Tag the issue with `P1` and `rehearsal-finding`. +3. Notify the team channel with a link to the issue. +4. The rehearsal lead owns triage until the issue is assigned. + +## Related Documents + +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation +- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format +- `docs/ops/GITHUB_LABEL_TAXONOMY.md` -- canonical label definitions diff --git a/docs/ops/rehearsal-scenarios/degraded-api-health.md b/docs/ops/rehearsal-scenarios/degraded-api-health.md new file mode 100644 index 000000000..bbd5f625f --- /dev/null +++ b/docs/ops/rehearsal-scenarios/degraded-api-health.md @@ -0,0 +1,151 @@ +# Scenario: Degraded API Health + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a condition where the `/health/ready` endpoint returns `503 NotReady` due to a degraded subsystem (database unreachable, queue backlog exceeded, or worker heartbeat stale). Diagnose which check failed and recover the system to a healthy state. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. +- Backend builds successfully: `dotnet build backend/Taskdeck.sln -c Release` +- No other Taskdeck API instance running on port 5000. +- SQLite database file is accessible (default: `taskdeck.db` in the API project directory). +- `curl` or equivalent HTTP client available. + +## Injection Method + +Choose one of the following fault injection approaches: + +### Option A: Database Connectivity Fault + +Rename or lock the SQLite database file before starting the API so the database connectivity check fails. 
+ +```bash +# From repo root +cd backend/src/Taskdeck.Api +# Rename the DB file to simulate missing database +mv taskdeck.db taskdeck.db.bak 2>/dev/null || true +# Start the API +dotnet run --project Taskdeck.Api.csproj +``` + +Note: EF Core with SQLite will auto-create a new empty database. To truly break connectivity, set the connection string to a read-only or non-existent directory: + +```bash +ConnectionStrings__DefaultConnection="Data Source=/nonexistent/path/taskdeck.db" \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +### Option B: Worker Heartbeat Staleness + +Start the API with queue processing enabled, then observe the heartbeat staleness window. The `queueToProposal` worker is considered stale if its last heartbeat exceeds `QueuePollIntervalSeconds * 3` (minimum 30 seconds). The `proposalHousekeeping` worker goes stale after 3 minutes. + +To inject staleness without code changes: start the API, wait for workers to begin heartbeating, then suspend the worker thread (not practical without code changes). Instead, inspect the staleness values reported by `/health/ready` and understand the thresholds. + +For a realistic rehearsal: modify `appsettings.Development.json` temporarily to set `Workers:QueuePollIntervalSeconds` to 1, then observe how quickly the worker goes stale if delayed. 
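The staleness rule above reduces to a tiny calculation. The sketch below restates it for quick reference during diagnosis; the function name is illustrative, not code from the repo:

```python
def queue_worker_staleness_threshold(poll_interval_seconds: int) -> int:
    """Staleness threshold for the queueToProposal worker, per the rule
    above: QueuePollIntervalSeconds * 3, with a 30-second floor."""
    return max(poll_interval_seconds * 3, 30)

# With the rehearsal override of Workers:QueuePollIntervalSeconds = 1,
# the 30-second floor still applies, so the worker is only reported
# stale once its heartbeat is more than 30 seconds old.
print(queue_worker_staleness_threshold(1))   # 30
print(queue_worker_staleness_threshold(15))  # 45
print(queue_worker_staleness_threshold(60))  # 180
```

The `proposalHousekeeping` worker uses a fixed 3-minute window instead, so it is unaffected by the poll-interval override.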
+ +### Option C: Queue Backlog Overload + +Flood the LLM queue with pending items to exceed the threshold (`MaxBatchSize * 20`, minimum 100): + +```bash +# Start the API +dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj & + +# Register and authenticate +curl -s -X POST http://localhost:5000/api/auth/register \ + -H "Content-Type: application/json" \ + -d '{"username":"rehearsal-user","password":"Rehearsal123!"}' + +TOKEN=$(curl -s -X POST http://localhost:5000/api/auth/login \ + -H "Content-Type: application/json" \ + -d '{"username":"rehearsal-user","password":"Rehearsal123!"}' | jq -r '.token') + +# Create a board to target +BOARD_ID=$(curl -s -X POST http://localhost:5000/api/boards \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $TOKEN" \ + -d '{"name":"rehearsal-board"}' | jq -r '.id') + +# Submit many LLM queue items (adjust count to exceed threshold) +for i in $(seq 1 120); do + curl -s -X POST http://localhost:5000/api/llm-queue \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $TOKEN" \ + -d "{\"boardId\":\"$BOARD_ID\",\"requestType\":\"Suggest\",\"payload\":\"item $i\"}" > /dev/null +done +``` + +## Expected Diagnosis Path + +1. **Check health endpoints**: + ```bash + curl -s http://localhost:5000/health/live | jq . + curl -s http://localhost:5000/health/ready | jq . + ``` + +2. **Interpret the response**: The `/health/ready` response includes a `checks` object with `database`, `queue`, and `workers` sub-checks. Each has a `status` field (`Healthy`, `Degraded`, `Unhealthy`, `Stale`, `Starting`, `Disabled`). + +3. **Identify the failing check**: Look for non-`Healthy` status values. Examples: + - `checks.database.status: "Unhealthy"` with an `error` field + - `checks.queue.status: "Degraded"` with `depth` exceeding `threshold` + - `checks.workers.queueToProposal.status: "Stale"` with `stalenessSeconds` exceeding `maxStalenessSeconds` + +4. 
**Correlate with logs**: Check the API console output for error messages related to the failing subsystem. + +5. **Verify with telemetry** (if OpenTelemetry is enabled): Check for `taskdeck.automation.queue.backlog` and `taskdeck.worker.heartbeat.staleness` metrics. + +## Recovery Steps + +### Database Fault Recovery + +```bash +# Restore the original database +mv taskdeck.db.bak taskdeck.db +# Or fix the connection string and restart +# Verify recovery +curl -s http://localhost:5000/health/ready | jq .checks.database +``` + +### Queue Backlog Recovery + +The queue will drain naturally as the worker processes items. To accelerate: + +```bash +# Check current queue depth +curl -s http://localhost:5000/health/ready | jq .checks.queue + +# Wait for worker to process items, or restart with higher batch size +# Workers__MaxBatchSize=20 dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +### Worker Staleness Recovery + +```bash +# If the worker is stuck, restart the API process +# Verify heartbeats resume +curl -s http://localhost:5000/health/ready | jq .checks.workers +``` + +## Evidence Checklist + +After completing the rehearsal, verify the evidence package includes: + +- [ ] Screenshot or captured output of the degraded `/health/ready` response (503) +- [ ] Identification of which specific check failed (database, queue, or workers) +- [ ] The exact `status`, `error`, or threshold values from the response +- [ ] Commands used to inject the fault +- [ ] Commands used to diagnose the fault +- [ ] Commands used to recover +- [ ] Captured output of the recovered `/health/ready` response (200) +- [ ] Any log excerpts showing error or warning messages during the degraded state +- [ ] Findings about gaps in the health response (e.g., missing context, unclear error messages) + +## Related Documents + +- `backend/src/Taskdeck.Api/Controllers/HealthController.cs` -- health endpoint implementation +- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry 
contract +- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts diff --git a/docs/ops/rehearsal-scenarios/deployment-readiness-failure.md b/docs/ops/rehearsal-scenarios/deployment-readiness-failure.md new file mode 100644 index 000000000..717656c6c --- /dev/null +++ b/docs/ops/rehearsal-scenarios/deployment-readiness-failure.md @@ -0,0 +1,182 @@ +# Scenario: Deployment Readiness Failure + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a Docker Compose deployment where the API container starts but fails readiness checks. Triage whether the failure is in the container build, runtime configuration, networking, or dependent services. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. +- Docker Engine with `docker compose` support installed and running. +- `deploy/.env` configured per `deploy/.env.example` (at minimum, `TASKDECK_JWT_SECRET` must be set). +- No other services occupying the default proxy port (8080). +- `curl` or equivalent HTTP client available. + +## Injection Method + +### Option A: Missing Required Environment Variable + +Remove or empty the required `TASKDECK_JWT_SECRET` from `deploy/.env`: + +```bash +# Back up the env file +cp deploy/.env deploy/.env.bak + +# Remove the JWT secret (compose will fail at render time) +sed -i 's/^TASKDECK_JWT_SECRET=.*/TASKDECK_JWT_SECRET=/' deploy/.env + +# Attempt to start +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build +``` + +### Option B: Invalid Database Path + +Override the database connection string to point to a non-writable path inside the container. 
The compose file sets `ConnectionStrings__DefaultConnection` directly, so override it at the shell level: + +```bash +# Start with an invalid DB path (overrides the compose-file value) +ConnectionStrings__DefaultConnection="Data Source=/readonly/taskdeck.db" \ +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build + +# Or add the override in a docker-compose.override.yml environment block +``` + +### Option C: Port Conflict on Proxy + +Occupy the proxy port before starting compose: + +```bash +# Block port 8080 +python -m http.server 8080 & +BLOCKER_PID=$! + +# Attempt to start compose +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build + +# Clean up later +kill $BLOCKER_PID +``` + +### Option D: Corrupted Image Build + +Introduce a build failure by temporarily modifying the Dockerfile: + +```bash +# Back up +cp deploy/docker/backend.Dockerfile deploy/docker/backend.Dockerfile.bak + +# Inject a failing step (add a bad RUN command) +echo "RUN exit 1" >> deploy/docker/backend.Dockerfile + +# Attempt to build +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build +``` + +## Expected Diagnosis Path + +1. **Check container status**: + ```bash + docker compose -f deploy/docker-compose.yml --profile baseline ps + ``` + Look for containers in `Exited`, `Restarting`, or `Created` (not `Running`) state. + +2. **Check container logs**: + ```bash + docker compose -f deploy/docker-compose.yml --profile baseline logs api + docker compose -f deploy/docker-compose.yml --profile baseline logs web + docker compose -f deploy/docker-compose.yml --profile baseline logs proxy + ``` + +3. **Test readiness through the proxy**: + ```bash + curl -s http://localhost:8080/health/live | jq . + curl -s http://localhost:8080/health/ready | jq . + ``` + If the proxy is up but the API is not, expect a `502 Bad Gateway` from nginx. + +4. 
**Test the API container directly** (bypassing proxy): + ```bash + # Get the API container's internal port mapping + docker compose -f deploy/docker-compose.yml --profile baseline port api 8080 + # Or exec into the proxy container + docker compose -f deploy/docker-compose.yml --profile baseline exec proxy curl -s http://api:8080/health/ready + ``` + +5. **Check build output** (if the build failed): + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline build 2>&1 | tail -30 + ``` + +6. **Verify environment variable injection**: + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline config | grep -A5 JWT + ``` + +## Recovery Steps + +### Missing Environment Variable + +1. Restore the env file: + ```bash + mv deploy/.env.bak deploy/.env + ``` +2. Restart: + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d + ``` + +### Invalid Database Path + +1. Fix the connection string in the environment configuration. +2. If the container already created a bad state, remove the volume and restart: + ```bash + docker compose -f deploy/docker-compose.yml --profile baseline down -v + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d + ``` + +### Port Conflict + +1. Identify and stop the conflicting process. +2. Or change the proxy port: + ```bash + TASKDECK_PROXY_PORT=8081 docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d + ``` + +### Corrupted Dockerfile + +1. Restore the original Dockerfile: + ```bash + mv deploy/docker/backend.Dockerfile.bak deploy/docker/backend.Dockerfile + ``` +2. 
Rebuild: + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build + ``` + +### Full Cleanup + +```bash +docker compose -f deploy/docker-compose.yml --profile baseline down -v +``` + +## Evidence Checklist + +- [ ] `docker compose ps` output showing container states +- [ ] Container logs for the failing service(s) +- [ ] Health endpoint responses (or connection errors if unreachable) +- [ ] `docker compose config` output showing effective environment (secrets redacted) +- [ ] Build output if the failure was at build time +- [ ] Commands used to diagnose the root cause +- [ ] Commands used to recover +- [ ] Verification of healthy deployment after recovery (`/health/live` and `/health/ready` both 200) +- [ ] Any findings about error clarity in container logs or compose output + +## Related Documents + +- `docs/ops/DEPLOYMENT_CONTAINERS.md` -- container deployment baseline +- `docs/ops/DEPLOYMENT_HARDENING_MATRIX.md` -- hardening verification matrix +- `deploy/docker-compose.yml` -- compose configuration +- `deploy/.env.example` -- required environment variables diff --git a/docs/ops/rehearsal-scenarios/mcp-server-startup-regression.md b/docs/ops/rehearsal-scenarios/mcp-server-startup-regression.md new file mode 100644 index 000000000..cbfa61c5b --- /dev/null +++ b/docs/ops/rehearsal-scenarios/mcp-server-startup-regression.md @@ -0,0 +1,139 @@ +# Scenario: MCP Server Startup Regression + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a failure in an optional MCP (Model Context Protocol) server used by CLI/IDE agents during development. The MCP server fails at boot due to a configuration or dependency error. Investigate the failure, determine whether it blocks core API functionality, and recover. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. 
+- MCP server configuration present (`.mcp.json` or equivalent in repo root or `.claude/` directory). +- Node.js / npx available (MCP servers are typically Node-based). +- Backend API is not dependent on MCP servers for core operation (MCP is a development tooling layer). + +## Injection Method + +### Option A: Invalid MCP Server Command + +Modify the MCP server configuration to reference a non-existent command or package. + +```bash +# Check current MCP config +cat .mcp.json 2>/dev/null || cat .claude/mcp.json 2>/dev/null || echo "No MCP config found" + +# If .mcp.json exists, back it up and inject a fault +cp .mcp.json .mcp.json.bak +# Edit .mcp.json to change a server command to a non-existent binary +# e.g., change "npx" to "npx-nonexistent" for one server entry +``` + +### Option B: Missing API Key for MCP Server + +Some MCP servers require API keys or tokens. Remove or invalidate the required environment variable. + +```bash +# Start with an invalid key for a server that requires authentication +GITHUB_PERSONAL_ACCESS_TOKEN="" npx @modelcontextprotocol/server-github +``` + +### Option C: Port Conflict + +If an MCP server binds to a specific port, start a conflicting listener first. + +```bash +# Occupy the port before starting the MCP server +python -m http.server 3000 & +# Then attempt to start the MCP server that needs port 3000 +``` + +## Expected Diagnosis Path + +1. **Attempt to start the MCP server and capture the error**: + ```bash + # Try to start the server using the configured command + # Capture stderr for error messages + npx @modelcontextprotocol/server-github 2>&1 | head -20 + ``` + +2. **Check if the core API is affected**: + ```bash + # MCP servers are optional tooling -- verify the API is unaffected + curl -s http://localhost:5000/health/live | jq . + curl -s http://localhost:5000/health/ready | jq . + ``` + Both endpoints should return healthy. MCP server failure must not degrade API health. + +3. 
**Inspect the MCP configuration**: + ```bash + cat .mcp.json | jq . + # Or check the Claude config directory + ls -la .claude/ + ``` + +4. **Check for dependency issues**: + ```bash + # Verify the MCP server package is resolvable + npx --yes @modelcontextprotocol/server-github --version 2>&1 + ``` + +5. **Review MCP tooling guide for known issues**: + Reference `docs/MCP_TOOLING_GUIDE.md` for the current MCP server status and fallback rules. + +## Recovery Steps + +### Invalid Command + +1. Restore the original MCP configuration: + ```bash + mv .mcp.json.bak .mcp.json + ``` +2. Verify the server starts: + ```bash + # Test the corrected command + npx @modelcontextprotocol/server-github --help 2>&1 | head -5 + ``` + +### Missing API Key + +1. Set the required environment variable: + ```bash + export GITHUB_PERSONAL_ACCESS_TOKEN="ghp_..." + ``` +2. Restart the MCP server. + +### Port Conflict + +1. Identify the conflicting process: + ```bash + # Linux/Mac + lsof -i :3000 + # Windows + netstat -ano | findstr :3000 + ``` +2. Stop the conflicting process or reconfigure the MCP server port. + +### Fallback: Work Without MCP + +Per the MCP Tooling Guide, when MCP is unavailable: +- Use shell/CLI as fallback for the same operations. +- Note the MCP unavailability in the work summary. +- Core development and API operations are unaffected. + +## Evidence Checklist + +- [ ] Captured error output from the failed MCP server startup +- [ ] Verification that `/health/live` and `/health/ready` are unaffected by MCP failure +- [ ] MCP configuration file contents (redact any tokens or secrets) +- [ ] Diagnosis steps taken to identify the root cause +- [ ] Recovery commands and verification of restored MCP server operation +- [ ] Confirmation that the failure was isolated to development tooling (no production impact) +- [ ] Any findings about MCP error messages (clear vs. cryptic, actionable vs. 
not) + +## Related Documents + +- `docs/MCP_TOOLING_GUIDE.md` -- MCP tool selection rules and fallback policy +- `docs/tooling/MCP_OPERATIONS_RUNBOOK.md` -- credential setup and verification +- `.mcp.json` -- MCP server configuration (if present) diff --git a/docs/ops/rehearsal-scenarios/missing-telemetry-signal.md b/docs/ops/rehearsal-scenarios/missing-telemetry-signal.md new file mode 100644 index 000000000..f15dae716 --- /dev/null +++ b/docs/ops/rehearsal-scenarios/missing-telemetry-signal.md @@ -0,0 +1,115 @@ +# Scenario: Missing Telemetry Signal + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a condition where the request correlation ID (`X-Request-Id`) is missing from OpenTelemetry trace spans. Diagnose whether the issue is in middleware configuration, telemetry export pipeline, or span attribute propagation. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. +- Backend builds successfully: `dotnet build backend/Taskdeck.sln -c Release` +- OpenTelemetry console exporter enabled for local inspection (no external collector required). +- `curl` or equivalent HTTP client available. + +## Injection Method + +### Option A: Disable Correlation Middleware + +Temporarily comment out or misconfigure the request correlation middleware registration in `Program.cs` to simulate a deployment where correlation IDs stop being propagated to trace spans. + +For a non-code-change rehearsal: start the API with OpenTelemetry enabled and verify correlation attributes are present, then start a second instance with `Observability:EnableOpenTelemetry=false` and observe the absence. 
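For the comparison half of that non-code-change rehearsal, the disabled instance can be started the same way; this is a sketch assuming the same `Observability__*` environment keys used throughout this scenario:

```shell
# Comparison instance: OpenTelemetry disabled -- the console output should
# contain no spans and no taskdeck.correlation_id attributes
Observability__EnableOpenTelemetry=false \
Observability__ServiceName=taskdeck-rehearsal \
  dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj
```

Diffing the console output of the two runs makes the missing-signal condition obvious before moving on to the diagnosis steps.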
+ +```bash +# Start with telemetry enabled and console exporter +Observability__EnableOpenTelemetry=true \ +Observability__EnableConsoleExporter=true \ +Observability__ServiceName=taskdeck-rehearsal \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +### Option B: Missing OTLP Endpoint + +Configure the API to export to a non-existent OTLP collector endpoint. Traces will be generated but silently dropped. + +```bash +Observability__EnableOpenTelemetry=true \ +Observability__OtlpEndpoint=http://localhost:4317 \ +Observability__EnableConsoleExporter=false \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +This simulates a production scenario where the collector is down or misconfigured. + +## Expected Diagnosis Path + +1. **Make a request and check for correlation headers**: + ```bash + # Send a request and inspect response headers + curl -v http://localhost:5000/health/live 2>&1 | grep -i x-request-id + ``` + The API should echo back an `X-Request-Id` header. If missing, correlation middleware is not running. + +2. **Check console exporter output** (if enabled): + Look for trace spans in the API console output. Expected attributes: + - `taskdeck.correlation_id` + - `taskdeck.request_id` + + If these attributes are absent from spans but the `X-Request-Id` header is present in the HTTP response, the issue is in span attribute propagation (middleware runs but does not tag spans). + +3. **Verify OpenTelemetry configuration**: + ```bash + # Check appsettings for telemetry config + cat backend/src/Taskdeck.Api/appsettings.json | grep -A5 Observability + cat backend/src/Taskdeck.Api/appsettings.Development.json | grep -A5 Observability + ``` + +4. **Check for OTLP export errors**: + If the OTLP endpoint is unreachable, the OpenTelemetry SDK may log warnings. Look for messages containing `OTLP`, `export`, or `gRPC` errors in the console output. + +5. 
**Verify worker spans include expected attributes**: + ```bash + # Trigger a worker cycle by submitting a queue item, then check console for worker span attributes + # Look for: taskdeck.worker.name, taskdeck.llm.request_id + ``` + +## Recovery Steps + +### Correlation Middleware Missing + +1. Verify the middleware registration order in `Program.cs`. +2. Confirm the correlation middleware runs before the endpoint routing middleware. +3. Restart the API and verify `X-Request-Id` appears in response headers. + +### OTLP Endpoint Unreachable + +1. Verify the collector is running: + ```bash + curl -s http://localhost:4317 || echo "OTLP endpoint unreachable" + ``` +2. Fix the endpoint URL in configuration or start the collector. +3. Restart the API (or wait for the next export interval). + +### Console Exporter Not Showing Spans + +1. Verify `Observability:EnableConsoleExporter` is `true`. +2. Verify `Observability:EnableOpenTelemetry` is `true`. +3. Check that the `Taskdeck.Api` activity source name matches the configured source in the telemetry setup. 
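When working through step 3 above, the activity source registration can be located with a repository search; a sketch, assuming the instrumentation lives under the `Telemetry/` directory referenced in Related Documents:

```shell
# Locate where the ActivitySource is created and where the telemetry setup
# registers it (e.g., via AddSource) -- the two names must match exactly
grep -rn "ActivitySource" backend/src/Taskdeck.Api/Telemetry/
grep -rn "AddSource" backend/src/Taskdeck.Api/
```

A mismatch between the created source name and the registered source name silently drops every span from that source, which is exactly the failure mode this scenario rehearses.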
+ +## Evidence Checklist + +- [ ] Captured output showing the presence or absence of `X-Request-Id` in response headers +- [ ] Console exporter output showing trace spans with or without `taskdeck.correlation_id` +- [ ] Configuration values for `Observability:*` settings used during the rehearsal +- [ ] If OTLP was tested: evidence of export failure (log lines or connection errors) +- [ ] Commands used to diagnose the telemetry pipeline +- [ ] Recovery steps taken and verification of restored telemetry +- [ ] Any findings about error visibility when telemetry export silently fails + +## Related Documents + +- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry contract and expected attributes +- `backend/src/Taskdeck.Api/Telemetry/` -- custom telemetry instrumentation +- `backend/src/Taskdeck.Api/appsettings.json` -- observability configuration diff --git a/docs/ops/rehearsals/2026-03-29_degraded-api-health.md b/docs/ops/rehearsals/2026-03-29_degraded-api-health.md new file mode 100644 index 000000000..8929bcba0 --- /dev/null +++ b/docs/ops/rehearsals/2026-03-29_degraded-api-health.md @@ -0,0 +1,122 @@ +# Rehearsal Evidence: Degraded API Health + +## Metadata + +| Field | Value | +| --- | --- | +| Date | 2026-03-29 | +| Rehearsal type | Monthly (initial program setup) | +| Scenario | `docs/ops/rehearsal-scenarios/degraded-api-health.md` | +| Lead | @Chris0Jeky | +| Participants | @Chris0Jeky | +| Commit SHA | `440a8c9dbfe63a9d631437ca97069ea34e51c81e` | +| OS / Environment | Windows 10 Pro 10.0.19045, .NET 8, SQLite | +| Duration | ~15 minutes | +| Outcome | Partial | + +## Timeline + +| Timestamp (UTC) | Actor | Action / Observation | +| --- | --- | --- | +| 2026-03-29T03:27:00Z | @Chris0Jeky | Verified backend builds successfully (`dotnet build -c Release`, 0 warnings, 0 errors) | +| 2026-03-29T03:27:30Z | @Chris0Jeky | Ran `HealthApiTests` -- 3/3 passing (Live, Ready, CaptureBacklogExclusion) | +| 2026-03-29T03:28:08Z | @Chris0Jeky | Started API on port 5099, 
confirmed healthy baseline | +| 2026-03-29T03:28:11Z | @Chris0Jeky | Captured `/health/live` response: `{"status":"Healthy"}` | +| 2026-03-29T03:28:11Z | @Chris0Jeky | Captured `/health/ready` response: HTTP 200, all checks Healthy | +| 2026-03-29T03:28:12Z | @Chris0Jeky | Stopped API, attempted injection Option A (invalid DB path) | +| 2026-03-29T03:28:29Z | @Chris0Jeky | Started API with `ConnectionStrings__DefaultConnection="Data Source=/nonexistent/path/taskdeck.db"` | +| 2026-03-29T03:28:29Z | @Chris0Jeky | **Finding**: API started healthy -- SQLite auto-created database at mapped path. Invalid Unix-style path was silently resolved on Windows | +| 2026-03-29T03:28:40Z | @Chris0Jeky | Attempted injection via `Workers__EnableAutoQueueProcessing=false` to disable queue worker | +| 2026-03-29T03:28:42Z | @Chris0Jeky | API started, all checks showed Healthy. Queue worker still heartbeating (launchSettings.json may override env vars) | +| 2026-03-29T03:29:17Z | @Chris0Jeky | After 35s wait (past startup grace), all workers still Healthy. `proposalHousekeeping` staleness=8s (max=180s), `queueToProposal` staleness=2.9s (max=30s) | +| 2026-03-29T03:29:20Z | @Chris0Jeky | Concluded rehearsal. 
Documented findings | + +## Commands Run + +```bash +# Build verification +dotnet build backend/Taskdeck.sln -c Release + +# Run health tests +dotnet test backend/Taskdeck.sln -c Release --filter "FullyQualifiedName~HealthApiTests" + +# Healthy baseline +dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099" +curl -s http://localhost:5099/health/live +curl -s http://localhost:5099/health/ready + +# Injection attempt A: invalid DB path +ConnectionStrings__DefaultConnection="Data Source=/nonexistent/path/taskdeck.db" \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099" +curl -s http://localhost:5099/health/ready + +# Injection attempt B: disable queue processing +Workers__EnableAutoQueueProcessing=false \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099" +curl -s http://localhost:5099/health/ready +# Waited 35s for startup grace to expire, checked again +curl -s http://localhost:5099/health/ready +``` + +## Log Excerpts + +Healthy baseline `/health/ready` response: +```json +{ + "status": "Ready", + "timestamp": "2026-03-29T03:28:11.466Z", + "checks": { + "database": { "status": "Healthy" }, + "queue": { "status": "Healthy", "depth": 0, "totalDepth": 0, "captureDepth": 0, "threshold": 100 }, + "workers": { + "queueToProposal": { "status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:09.177Z", "stalenessSeconds": 2.28, "maxStalenessSeconds": 30 }, + "proposalHousekeeping": { "status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:09.506Z", "stalenessSeconds": 1.96, "maxStalenessSeconds": 180 } + } + } +} +``` + +Invalid DB path response (still healthy): +```json +{ + "status": "Ready", + "timestamp": "2026-03-29T03:28:29.821Z", + "checks": { + "database": { "status": "Healthy" }, + "queue": { "status": "Healthy", "depth": 0, "totalDepth": 0, "captureDepth": 0, "threshold": 100 }, + "workers": { + "queueToProposal": { 
"status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:29.586Z", "stalenessSeconds": 0.25, "maxStalenessSeconds": 30 }, + "proposalHousekeeping": { "status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:09.506Z", "stalenessSeconds": 20.31, "maxStalenessSeconds": 180 } + } + } +} +``` + +## Root Cause / Diagnosis Summary + +The rehearsal targeted Option A (database connectivity fault) and Option B-adjacent (worker heartbeat staleness). Two key observations: + +1. **SQLite auto-creation resilience**: Setting `ConnectionStrings__DefaultConnection` to a non-existent Unix-style path (`/nonexistent/path/taskdeck.db`) did not degrade the health endpoint. On Windows, the path was silently resolved (likely to a relative path), and SQLite's `CanConnectAsync` succeeded because the provider auto-creates the database file. This means the database check is resilient to missing-DB scenarios but may mask genuine connection string misconfiguration. + +2. **Environment variable override vs launchSettings**: The `Workers__EnableAutoQueueProcessing=false` environment variable did not visibly change worker behavior when `launchSettings.json` was in use (via `dotnet run`). The `--no-launch-profile` flag would be needed to ensure environment variables take precedence. + +## Recovery Actions Taken + +No recovery was needed -- the system never entered a degraded state due to the resilience of the SQLite provider and the launchSettings override behavior. + +## Findings + +- [ ] Finding 1: SQLite `CanConnectAsync` succeeds even with a non-existent file path because the provider auto-creates the database. The health check's database connectivity test does not distinguish between "connected to the intended database" and "connected to an auto-created empty database." -- Severity: P3 -- Issue: to be filed +- [ ] Finding 2: Environment variable overrides for worker settings are ineffective when using `dotnet run` with `launchSettings.json`. 
The scenario documentation should specify `--no-launch-profile` for reliable fault injection via environment variables. -- Severity: P3 -- Issue: documented in scenario update +- [ ] Finding 3: The degraded-api-health scenario's Option A (invalid DB path) is not reliably reproducible on Windows due to Unix-to-Windows path resolution. The scenario should include Windows-specific injection guidance. -- Severity: P4 -- Issue: documented in scenario update + +## Sign-Off + +| Role | Name | Date | Approved | +| --- | --- | --- | --- | +| Rehearsal lead | @Chris0Jeky | 2026-03-29 | [x] | + +## Follow-Up Issues + +- Scenario documentation updated with findings (same PR) +- P3 finding about SQLite auto-creation masking connection errors should be tracked in a future hardening issue
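As a concrete follow-up to Finding 2, a re-run of injection attempt B could bypass the launch profile so the shell environment variable actually takes effect; a sketch reusing this rehearsal's own commands:

```shell
# Injection B retry: --no-launch-profile prevents launchSettings.json from
# overriding the Workers__EnableAutoQueueProcessing environment variable
Workers__EnableAutoQueueProcessing=false \
  dotnet run --no-launch-profile --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099"

# After the startup grace period, the queue worker heartbeat should go stale
curl -s http://localhost:5099/health/ready
```

With the profile bypassed, the expected observation is `queueToProposal` staleness exceeding its 30s maximum, which would complete the degraded-state reproduction this rehearsal could not achieve.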