
Commit 5cf584a

Merge pull request #503 from Chris0Jeky/docs/incident-rehearsal-recovery-program
OPS-19: Incident rehearsal and recovery evidence program
2 parents 0f38f82 + 147c66b commit 5cf584a

10 files changed

+1060
-0
lines changed

docs/MANUAL_TEST_CHECKLIST.md

Lines changed: 11 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -463,3 +463,14 @@ Summary scope:
463463
4. True-missing vs cross-user denial indistinguishability (B-90 to B-96)
464464
5. Error payload contract verification for auth/validation/sandbox paths (B-100 to B-110)
465465
6. Advanced controller families: ops/logs/users/abuse/llm-quota/agents/knowledge/webhooks/external-imports (B-130 to B-175)
466+
467+
---
468+
469+
## Incident Rehearsals
470+
471+
For operational failure diagnosis and recovery validation beyond functional testing, see the incident rehearsal program:
472+
473+
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule and rotation
474+
- `docs/ops/rehearsal-scenarios/` -- scenario templates
475+
- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
476+
- `docs/ops/rehearsals/` -- completed rehearsal evidence

docs/TESTING_GUIDE.md

Lines changed: 13 additions & 0 deletions
@@ -604,6 +604,19 @@ Recommended execution pairing:
604604
- automated: API + frontend unit + E2E capture loop (`#210` delivered, retained as active regression path)
605605
- manual: capture friction/trust checks in `docs/MANUAL_TEST_CHECKLIST.md`
606606

607+
## Incident Rehearsals
608+
609+
Manual incident rehearsals complement automated tests by validating diagnosis and recovery workflows against realistic failure conditions. Rehearsals are scheduled monthly (lightweight, ~30 min) and quarterly (deep drill, ~2 hours).
610+
611+
Key resources:
612+
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule, rotation, and process
613+
- `docs/ops/rehearsal-scenarios/` -- scenario templates (health degradation, telemetry gaps, deployment failures)
614+
- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
615+
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how rehearsal findings become tracked issues
616+
- `docs/ops/rehearsals/` -- completed rehearsal evidence packages
617+
618+
Rehearsals are distinct from the automated failure-injection drill suite (`docs/ops/FAILURE_INJECTION_DRILLS.md`). Drills are scripted and CI-runnable; rehearsals are human-driven and focus on diagnosis speed, tooling gaps, and recovery muscle memory.
619+
607620
## Development Sandbox Mode
608621

609622
For local development only, authorization bypass can be enabled via:

docs/ops/EVIDENCE_TEMPLATE.md

Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
# Rehearsal Evidence Package Template
2+
3+
Last Updated: 2026-03-29
4+
Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
5+
6+
## Purpose
7+
8+
Every rehearsal produces an evidence package that records what happened, what was found, and what follow-up is needed. This template defines the required format.
9+
10+
Evidence files are stored in `docs/ops/rehearsals/` with the naming convention:
11+
12+
```
13+
YYYY-MM-DD_scenario-name.md
14+
```
15+
16+
Example: `2026-03-29_degraded-api-health.md`
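The naming convention above can be checked mechanically before an evidence file is committed. The following is a minimal sketch, not part of the repository: the function name `parse_evidence_name` and the assumption that scenario names are lowercase kebab-case are illustrative choices inferred from the example filename.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumed shape: ISO date, underscore, lowercase kebab-case scenario name, ".md".
EVIDENCE_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})_([a-z0-9]+(?:-[a-z0-9]+)*)\.md")


def parse_evidence_name(filename: str) -> Optional[Tuple[date, str]]:
    """Return (rehearsal date, scenario name), or None if the name is malformed."""
    match = EVIDENCE_NAME.fullmatch(filename)
    if match is None:
        return None
    try:
        rehearsal_date = date.fromisoformat(match.group(1))
    except ValueError:
        # The digits fit the pattern but do not form a real calendar date.
        return None
    return rehearsal_date, match.group(2)
```

For the example filename, the function returns the date 2026-03-29 and the scenario name `degraded-api-health`; a name without the date prefix returns `None`.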
17+
18+
---
19+
20+
## Template
21+
22+
Copy the block below into a new file and fill in each section.
23+
24+
```markdown
25+
# Rehearsal Evidence: [Scenario Name]
26+
27+
## Metadata
28+
29+
| Field | Value |
30+
| --- | --- |
31+
| Date | YYYY-MM-DD |
32+
| Rehearsal type | Monthly / Quarterly deep drill |
33+
| Scenario | [scenario filename from docs/ops/rehearsal-scenarios/] |
34+
| Lead | [GitHub username] |
35+
| Participants | [comma-separated GitHub usernames] |
36+
| Commit SHA | [HEAD of main at rehearsal start] |
37+
| OS / Environment | [e.g., Windows 10 Pro, Docker Desktop 4.x] |
38+
| Duration | [actual elapsed time] |
39+
| Outcome | Pass / Partial / Fail |
40+
41+
## Timeline
42+
43+
Use ISO 8601 timestamps (UTC). Record each significant action or observation.
44+
45+
| Timestamp (UTC) | Actor | Action / Observation |
46+
| --- | --- | --- |
47+
| 2026-03-29T14:00:00Z | @lead | Started API with injected fault |
48+
| 2026-03-29T14:02:30Z | @lead | Observed 503 on /health/ready |
49+
| ... | ... | ... |
50+
51+
## Commands Run
52+
53+
Record every command executed during the rehearsal, in order.
54+
55+
~~~bash
56+
# Example: `dotnet run` stays in the foreground, so probe readiness from a second terminal
57+
dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj
58+
curl http://localhost:5000/health/ready
59+
~~~
60+
61+
## Log Excerpts
62+
63+
Include relevant log output. Redact any secrets or PII.
64+
65+
~~~
66+
[relevant log lines here]
67+
~~~
68+
69+
## Root Cause / Diagnosis Summary
70+
71+
Describe what the injected fault was, how it was detected, and what the diagnosis path looked like.
72+
73+
## Recovery Actions Taken
74+
75+
Describe the steps taken to restore the system to a healthy state.
76+
77+
## Findings
78+
79+
List any issues, gaps, or improvements discovered during the rehearsal.
80+
81+
- [ ] Finding 1: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"]
82+
- [ ] Finding 2: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"]
83+
84+
## Sign-Off
85+
86+
| Role | Name | Date | Approved |
87+
| --- | --- | --- | --- |
88+
| Rehearsal lead | @username | YYYY-MM-DD | [ ] |
89+
| Observer | @username | YYYY-MM-DD | [ ] |
90+
91+
## Follow-Up Issues
92+
93+
Link to any issues filed as a result of this rehearsal:
94+
95+
- #NNN: [title]
96+
- #NNN: [title]
97+
```
98+
99+
---
100+
101+
## Required Artifacts Checklist
102+
103+
Every evidence package must include:
104+
105+
- [ ] Completed metadata table with all fields filled
106+
- [ ] Timeline with at least 3 entries (start, key observation, resolution)
107+
- [ ] Commands run section with actual commands (not placeholders)
108+
- [ ] At least one log excerpt or explanation of why logs were unavailable
109+
- [ ] Root cause / diagnosis summary
110+
- [ ] Recovery actions taken
111+
- [ ] Findings list (even if empty -- state "No new findings")
112+
- [ ] Sign-off from at least the rehearsal lead
113+
- [ ] Follow-up issues linked (or "None" if no issues were filed)
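Part of this checklist can be pre-screened automatically. The sketch below only confirms that each section heading from the template is present; it cannot tell whether a section was actually filled in, so it complements the manual review rather than replacing it. The helper name `missing_sections` is hypothetical.

```python
import re
from typing import List

# Level-2 headings taken from the evidence template in this document.
REQUIRED_SECTIONS = [
    "Metadata",
    "Timeline",
    "Commands Run",
    "Log Excerpts",
    "Root Cause / Diagnosis Summary",
    "Recovery Actions Taken",
    "Findings",
    "Sign-Off",
    "Follow-Up Issues",
]


def missing_sections(evidence_markdown: str) -> List[str]:
    """Return the required section headings absent from an evidence package."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^## (.+)$", evidence_markdown, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```

An empty result means every required heading exists; any returned names point to sections the lead still needs to add.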
114+
115+
## Related Documents
116+
117+
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation
118+
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how to file findings as issues
119+
- `docs/ops/rehearsal-scenarios/` -- scenario templates
Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,90 @@
1+
# Incident Rehearsal Cadence
2+
3+
Last Updated: 2026-03-29
4+
Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
5+
6+
## Purpose
7+
8+
Rehearsals validate that the team can diagnose and recover from production-realistic failures using real tooling and documented procedures. They also surface gaps in observability, runbooks, and recovery automation before real incidents expose them.
9+
10+
## Monthly Lightweight Rehearsal
11+
12+
| Field | Detail |
13+
| --- | --- |
14+
| Cadence | First working Thursday of each month |
15+
| Duration | ~30 minutes |
16+
| Scope | Single scenario from `docs/ops/rehearsal-scenarios/` |
17+
| Lead | Rotating (see assignment model below) |
18+
| Participants | Rehearsal lead + one observer minimum |
19+
| Artifacts | Evidence package filed in `docs/ops/rehearsals/` |
20+
21+
Steps:
22+
1. Lead selects a scenario from the scenario library (prefer unexercised or recently failed scenarios).
23+
2. Announce the rehearsal in the team channel at least 24 hours in advance.
24+
3. Execute the scenario using the template's injection method and diagnosis path.
25+
4. Record an evidence package using `docs/ops/EVIDENCE_TEMPLATE.md`.
26+
5. File any discovered issues per `docs/ops/REHEARSAL_BACKOFF_RULES.md`.
27+
28+
## Quarterly Deep Drill
29+
30+
| Field | Detail |
31+
| --- | --- |
32+
| Cadence | Second week of the first month of each quarter (January, April, July, October) |
33+
| Duration | ~2 hours |
34+
| Scope | Combined or cascading scenario (e.g., degraded health + deployment failure) |
35+
| Lead | Rotating (same rotation, offset from monthly) |
36+
| Participants | All active contributors |
37+
| Artifacts | Evidence package + retrospective summary |
38+
39+
Steps:
40+
1. Lead designs a combined scenario at least one week before the drill date.
41+
2. Distribute the scenario brief (pre-conditions, scope, goals) to all participants 48 hours in advance.
42+
3. Execute the drill with explicit role assignments: incident commander, investigator, communicator.
43+
4. Record the evidence package and a retrospective summary covering what went well, what was slow, and what tooling or documentation was missing.
44+
5. File findings and retrospective actions per `docs/ops/REHEARSAL_BACKOFF_RULES.md`.
45+
46+
## Rotation and Assignment Model
47+
48+
Rehearsal lead rotates alphabetically by GitHub username among active contributors.
49+
50+
| Case | Lead selection |
51+
| --- | --- |
52+
| Month N | First contributor alphabetically who has not led in the current quarter |
53+
| Fallback | If the assigned lead is unavailable, the next person in rotation picks up |
54+
55+
The rotation resets each quarter. Deep drills use the same rotation but are offset (the deep-drill lead should not be the same person who led the preceding monthly rehearsal).
56+
57+
To check the current rotation state, review the evidence files for the current quarter in `docs/ops/rehearsals/` -- each records its lead in the Metadata table.
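The selection rule in this section can be sketched as a small function. This is an illustration under stated assumptions, not project tooling: `next_lead` is a hypothetical name, usernames are compared case-insensitively, and the function returns `None` when every contributor has already led this quarter, a case the rules above do not cover.

```python
from typing import Iterable, Optional


def next_lead(
    contributors: Iterable[str],
    led_this_quarter: Iterable[str],
    unavailable: Iterable[str] = (),
) -> Optional[str]:
    """Pick the first contributor, alphabetically, who has not led this quarter.

    Unavailable contributors are skipped, which models the fallback rule
    that the next person in the rotation picks up.
    """
    led = {name.lower() for name in led_this_quarter}
    away = {name.lower() for name in unavailable}
    for name in sorted(contributors, key=str.lower):
        if name.lower() not in led and name.lower() not in away:
            return name
    return None
```

The deep-drill offset could be modelled by passing the lead of the preceding monthly rehearsal in `unavailable`.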
58+
59+
## Calendar Integration
60+
61+
Add rehearsal dates to the team calendar:
62+
63+
- **Monthly**: recurring event on the first working Thursday of each month, 30 minutes, titled `[Taskdeck] Monthly Incident Rehearsal`
64+
- **Quarterly**: recurring event in the second week of Jan/Apr/Jul/Oct, 2 hours, titled `[Taskdeck] Quarterly Deep Drill`
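Calendar tools rarely understand "first working Thursday", so the monthly date may need to be computed. The sketch below assumes one possible reading: take the first Thursday of the month and, if it falls on a date in a caller-supplied holiday set, move to the following Thursday. Both the function name and that interpretation are assumptions.

```python
from datetime import date, timedelta
from typing import AbstractSet

THURSDAY = 3  # date.weekday() counts Monday as 0


def first_working_thursday(
    year: int, month: int, holidays: AbstractSet[date] = frozenset()
) -> date:
    """Return the first Thursday of the month that is not a listed holiday."""
    day = date(year, month, 1)
    # Advance from the 1st to the first Thursday (0 to 6 days ahead).
    day += timedelta(days=(THURSDAY - day.weekday()) % 7)
    while day in holidays:
        day += timedelta(days=7)
    return day
```

With no holidays supplied, March 2026 resolves to 2026-03-05; if 2026-01-01 is passed as a holiday, January 2026 resolves to 2026-01-08.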
65+
66+
Include the following in the calendar event description:
67+
68+
```
69+
Scenario library: docs/ops/rehearsal-scenarios/
70+
Evidence template: docs/ops/EVIDENCE_TEMPLATE.md
71+
Backlog rules: docs/ops/REHEARSAL_BACKOFF_RULES.md
72+
```
73+
74+
## Scenario Library
75+
76+
Available scenarios in `docs/ops/rehearsal-scenarios/`:
77+
78+
- `degraded-api-health.md` -- API health endpoint returns degraded/unhealthy status
79+
- `missing-telemetry-signal.md` -- Correlation ID missing from OpenTelemetry traces
80+
- `mcp-server-startup-regression.md` -- Optional MCP server fails at boot
81+
- `deployment-readiness-failure.md` -- Docker Compose startup fails readiness checks
82+
83+
New scenarios should follow the same template structure (pre-conditions, injection, diagnosis, recovery, evidence checklist). File them in the `rehearsal-scenarios/` directory with a descriptive kebab-case filename.
84+
85+
## Related Documents
86+
87+
- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
88+
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- issue filing and SLA rules for findings
89+
- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts (complementary to manual rehearsals)
90+
- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry and dashboard contract
Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
# Rehearsal Backlog Handoff Rules
2+
3+
Last Updated: 2026-03-29
4+
Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
5+
6+
## Purpose
7+
8+
Rehearsals surface real gaps. This document defines how findings from rehearsals become tracked work items with clear ownership and response expectations.
9+
10+
## Filing Issues from Rehearsal Findings
11+
12+
Every finding recorded in the evidence package that requires follow-up must be filed as a GitHub issue within 2 working days of the rehearsal.
13+
14+
### Issue Title Convention
15+
16+
```
17+
[rehearsal-finding] <short description>
18+
```
19+
20+
Example: `[rehearsal-finding] Health endpoint does not report queue worker name on stale heartbeat`
21+
22+
### Issue Body Requirements
23+
24+
Each rehearsal-finding issue must include:
25+
26+
1. **Source rehearsal link**: relative path to the evidence file (e.g., `docs/ops/rehearsals/2026-03-29_degraded-api-health.md`)
27+
2. **Finding description**: what was observed and why it matters
28+
3. **Reproduction steps**: commands or conditions that trigger the finding
29+
4. **Suggested fix or investigation path**: concrete next step, not just "look into this"
30+
5. **Severity label**: one of P1/P2/P3/P4 (see below)
31+
32+
### Template
33+
34+
```markdown
35+
## Source
36+
37+
Rehearsal: `docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md`
38+
Finding #N from evidence package.
39+
40+
## Description
41+
42+
[What was observed and why it matters]
43+
44+
## Reproduction
45+
46+
[Commands or conditions]
47+
48+
## Suggested Fix
49+
50+
[Concrete next step]
51+
```
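Filling in the title convention and body template by hand invites drift, so a lead might generate drafts instead. The following sketch renders both from plain strings; `render_finding_issue` and its parameter names are hypothetical, and the output still has to be pasted into GitHub and labeled manually.

```python
from typing import Tuple

TITLE_PREFIX = "[rehearsal-finding]"


def render_finding_issue(
    evidence_path: str,
    finding_number: int,
    short_description: str,
    description: str,
    reproduction: str,
    suggested_fix: str,
) -> Tuple[str, str]:
    """Return (title, body) for a rehearsal-finding issue draft."""
    title = f"{TITLE_PREFIX} {short_description.strip()}"
    body = "\n".join([
        "## Source",
        "",
        f"Rehearsal: `{evidence_path}`",
        f"Finding #{finding_number} from evidence package.",
        "",
        "## Description",
        "",
        description.strip(),
        "",
        "## Reproduction",
        "",
        reproduction.strip(),
        "",
        "## Suggested Fix",
        "",
        suggested_fix.strip(),
    ])
    return title, body
```

Keeping the title prefix in one constant means a later change to the convention only has to be made in one place.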
52+
53+
## Label Conventions
54+
55+
Apply the following labels to every rehearsal-finding issue:
56+
57+
| Label | When to apply |
58+
| --- | --- |
59+
| `rehearsal-finding` | Always (primary identifier) |
60+
| `hardening` | When the finding relates to reliability or operability |
61+
| `bug` | When the finding is a defect in existing behavior |
62+
| `docs` | When the finding is a documentation gap |
63+
| `testing` | When the finding reveals missing test coverage |
64+
65+
Severity labels:
66+
67+
| Label | Meaning |
68+
| --- | --- |
69+
| `P1` | Blocks production readiness or causes data loss risk |
70+
| `P2` | Degrades reliability or operator experience significantly |
71+
| `P3` | Minor gap, workaround exists |
72+
| `P4` | Cosmetic or nice-to-have improvement |
73+
74+
If `rehearsal-finding` does not yet exist as a GitHub label, create it with color `#D4C5F9` and description `Finding surfaced during incident rehearsal`.
75+
76+
## SLA Expectations
77+
78+
| Severity | Triage SLA | Resolution target |
79+
| --- | --- | --- |
80+
| P1 | Same day | Next release / hotfix |
81+
| P2 | 2 working days | Within current sprint |
82+
| P3 | 5 working days | Within current quarter |
83+
| P4 | Best effort | Backlog; pick up when convenient |
84+
85+
"Triage" means the issue has been reviewed, assigned, and prioritized -- not necessarily started.
86+
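The triage column can be turned into a concrete due date. This sketch assumes working days are Monday to Friday with no holiday calendar, that the filing day itself does not count toward the budget, and that "best effort" means no deadline; none of these details are specified above, and `triage_due` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Optional

# Working-day budgets from the SLA table; None marks "best effort".
TRIAGE_WORKING_DAYS = {"P1": 0, "P2": 2, "P3": 5, "P4": None}


def triage_due(filed_on: date, severity: str) -> Optional[date]:
    """Return the date by which triage is due, or None when there is no deadline."""
    budget = TRIAGE_WORKING_DAYS[severity]
    if budget is None:
        return None
    due = filed_on
    while budget > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday to Friday only
            budget -= 1
    return due
```

A P2 finding filed on Friday 2026-03-27 would be due for triage on Tuesday 2026-03-31 under these assumptions, because the weekend is skipped.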
87+
## Connecting Findings to Evidence
88+
89+
Every rehearsal-finding issue must link back to its source evidence file. Use the following format in the issue body:
90+
91+
```
92+
Source rehearsal: docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md
93+
```
94+
95+
Conversely, the evidence file's "Follow-Up Issues" section must link forward to all filed issues:
96+
97+
```markdown
98+
## Follow-Up Issues
99+
100+
- #NNN: [title]
101+
```
102+
103+
This bidirectional linking ensures no finding is orphaned.
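A quick consistency check can catch links that exist in only one direction. The sketch below parses the "Follow-Up Issues" section of an evidence file and compares it with a set of issue numbers known to cite that file; how that set is gathered (for example, by searching issue bodies) is left open, and both function names are hypothetical.

```python
import re
from typing import Dict, Set


def follow_up_issue_numbers(evidence_markdown: str) -> Set[int]:
    """Return the issue numbers listed under the '## Follow-Up Issues' heading."""
    section = re.search(
        r"^## Follow-Up Issues[ \t]*\n(.*?)(?=^## |\Z)",
        evidence_markdown,
        flags=re.MULTILINE | re.DOTALL,
    )
    if section is None:
        return set()
    return {
        int(number)
        for number in re.findall(r"^- #(\d+)", section.group(1), flags=re.MULTILINE)
    }


def orphaned_links(
    evidence_markdown: str, issues_citing_evidence: Set[int]
) -> Dict[str, Set[int]]:
    """Report issues that are linked in only one direction."""
    listed = follow_up_issue_numbers(evidence_markdown)
    return {
        # The issue cites the evidence file, but the file does not list it.
        "missing_from_evidence": issues_citing_evidence - listed,
        # The evidence file lists the issue, but the issue lacks the source line.
        "missing_source_link": listed - issues_citing_evidence,
    }
```

Two empty sets mean every finding is linked both ways; unfilled `#NNN` placeholders are ignored because they contain no digits.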
104+
105+
## Escalation
106+
107+
If a P1 finding is discovered during a rehearsal:
108+
109+
1. File the issue immediately (do not wait for the 2-working-day filing window).
110+
2. Tag the issue with `P1` and `rehearsal-finding`.
111+
3. Notify the team channel with a link to the issue.
112+
4. The rehearsal lead owns triage until the issue is assigned.
113+
114+
## Related Documents
115+
116+
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation
117+
- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
118+
- `docs/ops/GITHUB_LABEL_TAXONOMY.md` -- canonical label definitions
