Chris0Jeky · Mar 29, 2026
diff --git a/docs/MANUAL_TEST_CHECKLIST.md b/docs/MANUAL_TEST_CHECKLIST.md
@@ -463,3 +463,14 @@ Summary scope:
 4. True-missing vs cross-user denial indistinguishability (B-90 to B-96)
 5. Error payload contract verification for auth/validation/sandbox paths (B-100 to B-110)
 6. Advanced controller families: ops/logs/users/abuse/llm-quota/agents/knowledge/webhooks/external-imports (B-130 to B-175)
+
+---
+
+## Incident Rehearsals
+
+For operational failure diagnosis and recovery validation beyond functional testing, see the incident rehearsal program:
+
+- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule and rotation
+- `docs/ops/rehearsal-scenarios/` -- scenario templates
+- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
+- `docs/ops/rehearsals/` -- completed rehearsal evidence
diff --git a/docs/TESTING_GUIDE.md b/docs/TESTING_GUIDE.md
@@ -604,6 +604,19 @@ Recommended execution pairing:
 - automated: API + frontend unit + E2E capture loop (`#210` delivered, retained as active regression path)
 - manual: capture friction/trust checks in `docs/MANUAL_TEST_CHECKLIST.md`
 
+## Incident Rehearsals
+
+Manual incident rehearsals complement automated tests by validating diagnosis and recovery workflows against realistic failure conditions. Rehearsals are scheduled monthly (lightweight, ~30 min) and quarterly (deep drill, ~2 hours).
+
+Key resources:
+- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule, rotation, and process
+- `docs/ops/rehearsal-scenarios/` -- scenario templates (health degradation, telemetry gaps, deployment failures)
+- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
+- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how rehearsal findings become tracked issues
+- `docs/ops/rehearsals/` -- completed rehearsal evidence packages
+
+Rehearsals are distinct from the automated failure-injection drill suite (`docs/ops/FAILURE_INJECTION_DRILLS.md`). Drills are scripted and CI-runnable; rehearsals are human-driven and focus on diagnosis speed, tooling gaps, and recovery muscle memory.
+
 ## Development Sandbox Mode
 
 For local development only, authorization bypass can be enabled via:

diff --git a/docs/ops/EVIDENCE_TEMPLATE.md b/docs/ops/EVIDENCE_TEMPLATE.md
@@ -0,0 +1,119 @@
+# Rehearsal Evidence Package Template
+
+Last Updated: 2026-03-29
+Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
+
+## Purpose
+
+Every rehearsal produces an evidence package that records what happened, what was found, and what follow-up is needed. This template defines the required format.
+
+Evidence files are stored in `docs/ops/rehearsals/` with the naming convention:
+
+```
+YYYY-MM-DD_scenario-name.md
+```
+
+Example: `2026-03-29_degraded-api-health.md`
+
+---
+
+## Template
+
+Copy the block below into a new file and fill in each section.
+
+```markdown
+# Rehearsal Evidence: [Scenario Name]
+
+## Metadata
+
+| Field | Value |
+| --- | --- |
+| Date | YYYY-MM-DD |
+| Rehearsal type | Monthly / Quarterly deep drill |
+| Scenario | [scenario filename from docs/ops/rehearsal-scenarios/] |
+| Lead | [GitHub username] |
+| Participants | [comma-separated GitHub usernames] |
+| Commit SHA | [HEAD of main at rehearsal start] |
+| OS / Environment | [e.g., Windows 10 Pro, Docker Desktop 4.x] |
+| Duration | [actual elapsed time] |
+| Outcome | Pass / Partial / Fail |
+
+## Timeline
+
+Use ISO 8601 timestamps (UTC). Record each significant action or observation.
+
+| Timestamp (UTC) | Actor | Action / Observation |
+| --- | --- | --- |
+| 2026-03-29T14:00:00Z | @lead | Started API with injected fault |
+| 2026-03-29T14:02:30Z | @lead | Observed 503 on /health/ready |
+| ... | ... | ... |
+
+## Commands Run
+
+Record every command executed during the rehearsal, in order.
+
+~~~bash
+# Example
+dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj
+curl http://localhost:5000/health/ready
+~~~
+
+## Log Excerpts
+
+Include relevant log output. Redact any secrets or PII.
+
+~~~
+[relevant log lines here]
+~~~
+
+## Root Cause / Diagnosis Summary
+
+Describe what the injected fault was, how it was detected, and what the diagnosis path looked like.
+
+## Recovery Actions Taken
+
+Describe the steps taken to restore the system to a healthy state.
+
+## Findings
+
+List any issues, gaps, or improvements discovered during the rehearsal.
+
+- [ ] Finding 1: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"]
+- [ ] Finding 2: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"]
+
+## Sign-Off
+
+| Role | Name | Date | Approved |
+| --- | --- | --- | --- |
+| Rehearsal lead | @username | YYYY-MM-DD | [ ] |
+| Observer | @username | YYYY-MM-DD | [ ] |
+
+## Follow-Up Issues
+
+Link to any issues filed as a result of this rehearsal:
+
+- #NNN: [title]
+- #NNN: [title]
+```
+
+---
+
+## Required Artifacts Checklist
+
+Every evidence package must include:
+
+- [ ] Completed metadata table with all fields filled
+- [ ] Timeline with at least 3 entries (start, key observation, resolution)
+- [ ] Commands run section with actual commands (not placeholders)
+- [ ] At least one log excerpt or explanation of why logs were unavailable
+- [ ] Root cause / diagnosis summary
+- [ ] Recovery actions taken
+- [ ] Findings list (even if empty -- state "No new findings")
+- [ ] Sign-off from at least the rehearsal lead
+- [ ] Follow-up issues linked (or "None" if no issues were filed)
+
+## Related Documents
+
+- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation
+- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how to file findings as issues
+- `docs/ops/rehearsal-scenarios/` -- scenario templates
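The `YYYY-MM-DD_scenario-name.md` naming convention in `docs/ops/EVIDENCE_TEMPLATE.md` can be checked mechanically. The following is a minimal Python sketch, not part of the PR: the function name `parse_evidence_filename` is hypothetical, and restricting the scenario part to lowercase kebab-case is an assumption drawn from the example filename and the scenario library naming guidance.

```python
import re
from datetime import date

# Assumed pattern: ISO date, underscore, lowercase kebab-case scenario name, ".md".
_EVIDENCE_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})_([a-z0-9]+(?:-[a-z0-9]+)*)\.md")


def parse_evidence_filename(filename: str) -> tuple[date, str]:
    """Return (rehearsal date, scenario name), or raise ValueError."""
    match = _EVIDENCE_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a valid evidence filename: {filename!r}")
    # date.fromisoformat raises ValueError for impossible dates such as 2026-02-30.
    return date.fromisoformat(match.group(1)), match.group(2)


if __name__ == "__main__":
    print(parse_evidence_filename("2026-03-29_degraded-api-health.md"))
```

A check like this could run over `docs/ops/rehearsals/` before an evidence package is merged, so that misnamed files are caught early.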
diff --git a/docs/ops/INCIDENT_REHEARSAL_CADENCE.md b/docs/ops/INCIDENT_REHEARSAL_CADENCE.md
@@ -0,0 +1,90 @@
+# Incident Rehearsal Cadence
+
+Last Updated: 2026-03-29
+Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
+
+## Purpose
+
+Rehearsals validate that the team can diagnose and recover from production-realistic failures using real tooling and documented procedures. They also surface gaps in observability, runbooks, and recovery automation before real incidents expose them.
+
+## Monthly Lightweight Rehearsal
+
+| Field | Detail |
+| --- | --- |
+| Cadence | First working Thursday of each month |
+| Duration | ~30 minutes |
+| Scope | Single scenario from `docs/ops/rehearsal-scenarios/` |
+| Lead | Rotating (see assignment model below) |
+| Participants | Rehearsal lead + one observer minimum |
+| Artifacts | Evidence package filed in `docs/ops/rehearsals/` |
+
+Steps:
+1. Lead selects a scenario from the scenario library (prefer unexercised or recently failed scenarios).
+2. Announce the rehearsal in the team channel at least 24 hours in advance.
+3. Execute the scenario using the template's injection method and diagnosis path.
+4. Record an evidence package using `docs/ops/EVIDENCE_TEMPLATE.md`.
+5. File any discovered issues per `docs/ops/REHEARSAL_BACKOFF_RULES.md`.
+
+## Quarterly Deep Drill
+
+| Field | Detail |
+| --- | --- |
+| Cadence | Second week of the first month of each quarter (January, April, July, October) |
+| Duration | ~2 hours |
+| Scope | Combined or cascading scenario (e.g., degraded health + deployment failure) |
+| Lead | Rotating (same rotation, offset from monthly) |
+| Participants | All active contributors |
+| Artifacts | Evidence package + retrospective summary |
+
+Steps:
+1. Lead designs a combined scenario at least one week before the drill date.
+2. Distribute the scenario brief (pre-conditions, scope, goals) to all participants 48 hours in advance.
+3. Execute the drill with explicit role assignments: incident commander, investigator, communicator.
+4. Record the evidence package and a retrospective summary covering what went well, what was slow, and what tooling or documentation was missing.
+5. File findings and retrospective actions per `docs/ops/REHEARSAL_BACKOFF_RULES.md`.
+
+## Rotation and Assignment Model
+
+The rehearsal lead rotates alphabetically by GitHub username among active contributors.
+
+| Case | Lead selection |
+| --- | --- |
+| Any month | First contributor alphabetically who has not led in the current quarter |
+| Fallback | If the assigned lead is unavailable, the next person in the rotation takes over |
+
+The rotation resets each quarter. Deep drills use the same rotation but are offset (the deep-drill lead should not be the same person who led the preceding monthly rehearsal).
+
+To check the current rotation state, review the evidence files from the current quarter in `docs/ops/rehearsals/` -- each one records its lead in the metadata table.
+
+## Calendar Integration
+
+Add rehearsal dates to the team calendar:
+
+- **Monthly**: recurring event on the first working Thursday of each month, 30 minutes, titled `[Taskdeck] Monthly Incident Rehearsal`
+- **Quarterly**: recurring event in the second week of Jan/Apr/Jul/Oct, 2 hours, titled `[Taskdeck] Quarterly Deep Drill`
+
+Include the following in the calendar event description:
+
+```
+Scenario library: docs/ops/rehearsal-scenarios/
+Evidence template: docs/ops/EVIDENCE_TEMPLATE.md
+Backlog rules: docs/ops/REHEARSAL_BACKOFF_RULES.md
+```
+
+## Scenario Library
+
+Available scenarios in `docs/ops/rehearsal-scenarios/`:
+
+- `degraded-api-health.md` -- API health endpoint returns degraded/unhealthy status
+- `missing-telemetry-signal.md` -- Correlation ID missing from OpenTelemetry traces
+- `mcp-server-startup-regression.md` -- Optional MCP server fails at boot
+- `deployment-readiness-failure.md` -- Docker Compose startup fails readiness checks
+
+New scenarios should follow the same template structure (pre-conditions, injection, diagnosis, recovery, evidence checklist). File them in the `rehearsal-scenarios/` directory with a descriptive kebab-case filename.
+
+## Related Documents
+
+- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
+- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- issue filing and SLA rules for findings
+- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts (complementary to manual rehearsals)
+- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry and dashboard contract
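The monthly cadence and the rotation rule in `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` can be sketched in a few lines of Python. This is an illustration under stated assumptions, not tooling from the PR: the function names are hypothetical, "working Thursday" is read as "a Thursday that is not in a caller-supplied holiday set", and usernames are compared case-insensitively. The document does not say what happens once every contributor has led in the quarter, so the sketch returns `None` in that case instead of guessing.

```python
import calendar
from datetime import date, timedelta
from typing import Collection, Iterable, Optional


def first_working_thursday(year: int, month: int,
                           holidays: Collection[date] = ()) -> date:
    """First Thursday of the month that is not in `holidays` (assumed rule)."""
    day = date(year, month, 1)
    # Move forward to the first Thursday (Monday=0, so Thursday=3).
    day += timedelta(days=(calendar.THURSDAY - day.weekday()) % 7)
    while day in holidays:
        day += timedelta(days=7)
    if day.month != month:
        raise ValueError("no working Thursday left in this month")
    return day


def next_lead(contributors: Iterable[str],
              led_this_quarter: Collection[str],
              unavailable: Collection[str] = ()) -> Optional[str]:
    """First contributor alphabetically who has not led this quarter.

    Unavailable contributors are skipped, matching the fallback rule.
    Returns None when nobody is eligible (behavior not defined by the source).
    """
    for name in sorted(contributors, key=str.lower):
        if name not in led_this_quarter and name not in unavailable:
            return name
    return None


if __name__ == "__main__":
    print(first_working_thursday(2026, 4))
    print(next_lead(["carol", "alice", "Bob"], led_this_quarter={"alice"}))
```

The `led_this_quarter` set would come from the lead recorded in each evidence file for the current quarter.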
diff --git a/docs/ops/REHEARSAL_BACKOFF_RULES.md b/docs/ops/REHEARSAL_BACKOFF_RULES.md
@@ -0,0 +1,118 @@
+# Rehearsal Backlog Handoff Rules
+
+Last Updated: 2026-03-29
+Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
+
+## Purpose
+
+Rehearsals surface real gaps. This document defines how findings from rehearsals become tracked work items with clear ownership and response expectations.
+
+## Filing Issues from Rehearsal Findings
+
+Every finding recorded in the evidence package that requires follow-up must be filed as a GitHub issue within 2 working days of the rehearsal.
+
+### Issue Title Convention
+
+```
+[rehearsal-finding] <short description>
+```
+
+Example: `[rehearsal-finding] Health endpoint does not report queue worker name on stale heartbeat`
+
+### Issue Body Requirements
+
+Each rehearsal-finding issue must include:
+
+1. **Source rehearsal link**: relative path to the evidence file (e.g., `docs/ops/rehearsals/2026-03-29_degraded-api-health.md`)
+2. **Finding description**: what was observed and why it matters
+3. **Reproduction steps**: commands or conditions that trigger the finding
+4. **Suggested fix or investigation path**: concrete next step, not just "look into this"
+5. **Severity label**: one of P1/P2/P3/P4 (see below)
+
+### Template
+
+```markdown
+## Source
+
+Rehearsal: `docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md`
+Finding #N from evidence package.
+
+## Description
+
+[What was observed and why it matters]
+
+## Reproduction
+
+[Commands or conditions]
+
+## Suggested Fix
+
+[Concrete next step]
+```
+
+## Label Conventions
+
+Apply the following labels to every rehearsal-finding issue:
+
+| Label | When to apply |
+| --- | --- |
+| `rehearsal-finding` | Always (primary identifier) |
+| `hardening` | When the finding relates to reliability or operability |
+| `bug` | When the finding is a defect in existing behavior |
+| `docs` | When the finding is a documentation gap |
+| `testing` | When the finding reveals missing test coverage |
+
+Severity labels:
+
+| Label | Meaning |
+| --- | --- |
+| `P1` | Blocks production readiness or causes data loss risk |
+| `P2` | Degrades reliability or operator experience significantly |
+| `P3` | Minor gap, workaround exists |
+| `P4` | Cosmetic or nice-to-have improvement |
+
+If `rehearsal-finding` does not yet exist as a GitHub label, create it with color `#D4C5F9` and description `Finding surfaced during incident rehearsal`.
+
+## SLA Expectations
+
+| Severity | Triage SLA | Resolution target |
+| --- | --- | --- |
+| P1 | Same day | Next release / hotfix |
+| P2 | 2 working days | Within current sprint |
+| P3 | 5 working days | Within current quarter |
+| P4 | Best effort | Backlog; pick up when convenient |
+
+"Triage" means the issue has been reviewed, assigned, and prioritized -- not necessarily started.
+
+## Connecting Findings to Evidence
+
+Every rehearsal-finding issue must link back to its source evidence file. Use the following format in the issue body:
+
+```
+Source rehearsal: docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md
+```
+
+Conversely, the evidence file's "Follow-Up Issues" section must link forward to all filed issues:
+
+```markdown
+## Follow-Up Issues
+
+- #NNN: [title]
+```
+
+This bidirectional linking ensures no finding is orphaned.
+
+## Escalation
+
+If a P1 finding is discovered during a rehearsal:
+
+1. File the issue immediately (do not wait for the 2-working-day filing window).
+2. Tag the issue with `P1` and `rehearsal-finding`.
+3. Notify the team channel with a link to the issue.
+4. The rehearsal lead owns triage until the issue is assigned.
+
+## Related Documents
+
+- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation
+- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
+- `docs/ops/GITHUB_LABEL_TAXONOMY.md` -- canonical label definitions
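The triage SLA table in `docs/ops/REHEARSAL_BACKOFF_RULES.md` translates into a simple due-date calculation. The Python sketch below is illustrative only: the names `TRIAGE_WORKING_DAYS`, `add_working_days`, and `triage_due` are hypothetical, working days are assumed to be Monday through Friday with no holiday calendar, and "same day" for P1 is taken to mean the filing date itself.

```python
from datetime import date, timedelta
from typing import Optional

# Triage SLA in working days, taken from the SLA table; None means "best effort".
TRIAGE_WORKING_DAYS = {"P1": 0, "P2": 2, "P3": 5, "P4": None}


def add_working_days(start: date, days: int) -> date:
    """Advance `start` by `days` working days (assumes a Mon-Fri week)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def triage_due(filed: date, severity: str) -> Optional[date]:
    """Date by which the issue should be triaged, or None for best effort."""
    days = TRIAGE_WORKING_DAYS[severity]  # KeyError for an unknown severity
    if days is None:
        return None
    return add_working_days(filed, days)


if __name__ == "__main__":
    # A P2 finding filed on Friday 2026-03-27 is due for triage the next Tuesday.
    print(triage_due(date(2026, 3, 27), "P2"))
```

Resolution targets (next release, current sprint, current quarter) depend on the team's release calendar and are left out of the sketch.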