Merged
11 changes: 11 additions & 0 deletions docs/MANUAL_TEST_CHECKLIST.md
@@ -463,3 +463,14 @@ Summary scope:
4. True-missing vs cross-user denial indistinguishability (B-90 to B-96)
5. Error payload contract verification for auth/validation/sandbox paths (B-100 to B-110)
6. Advanced controller families: ops/logs/users/abuse/llm-quota/agents/knowledge/webhooks/external-imports (B-130 to B-175)

---

## Incident Rehearsals

For operational failure diagnosis and recovery validation beyond functional testing, see the incident rehearsal program:

- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule and rotation
- `docs/ops/rehearsal-scenarios/` -- scenario templates
- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
- `docs/ops/rehearsals/` -- completed rehearsal evidence
13 changes: 13 additions & 0 deletions docs/TESTING_GUIDE.md
@@ -604,6 +604,19 @@ Recommended execution pairing:
- automated: API + frontend unit + E2E capture loop (`#210` delivered, retained as active regression path)
- manual: capture friction/trust checks in `docs/MANUAL_TEST_CHECKLIST.md`

## Incident Rehearsals

Manual incident rehearsals complement automated tests by validating diagnosis and recovery workflows against realistic failure conditions. Rehearsals are scheduled monthly (lightweight, ~30 min) and quarterly (deep drill, ~2 hours).

Key resources:
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule, rotation, and process
- `docs/ops/rehearsal-scenarios/` -- scenario templates (health degradation, telemetry gaps, deployment failures)
- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how rehearsal findings become tracked issues
- `docs/ops/rehearsals/` -- completed rehearsal evidence packages

Rehearsals are distinct from the automated failure-injection drill suite (`docs/ops/FAILURE_INJECTION_DRILLS.md`). Drills are scripted and CI-runnable; rehearsals are human-driven and focus on diagnosis speed, tooling gaps, and recovery muscle memory.

## Development Sandbox Mode

For local development only, authorization bypass can be enabled via:
119 changes: 119 additions & 0 deletions docs/ops/EVIDENCE_TEMPLATE.md
@@ -0,0 +1,119 @@
# Rehearsal Evidence Package Template

Last Updated: 2026-03-29
Issue: `#150` OPS-19 incident rehearsal and recovery evidence program

## Purpose

Every rehearsal produces an evidence package that records what happened, what was found, and what follow-up is needed. This template defines the required format.

Evidence files are stored in `docs/ops/rehearsals/` with the naming convention:

```
YYYY-MM-DD_scenario-name.md
```

Example: `2026-03-29_degraded-api-health.md`
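The naming convention can be checked mechanically before an evidence file is committed. The following is a minimal sketch, not part of the rehearsal program itself: the function name `parse_evidence_name` and the assumption that scenario names are lowercase kebab-case (as in the example above) are illustrative.

```python
import re
from datetime import date

# Assumed pattern: ISO date, underscore, lowercase kebab-case scenario name, ".md".
EVIDENCE_NAME = re.compile(r"(\d{4})-(\d{2})-(\d{2})_([a-z0-9]+(?:-[a-z0-9]+)*)\.md")


def parse_evidence_name(filename: str) -> tuple[date, str]:
    """Return (rehearsal date, scenario name) or raise ValueError."""
    match = EVIDENCE_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a valid evidence filename: {filename!r}")
    year, month, day, scenario = match.groups()
    # date() itself raises ValueError for impossible dates such as 2026-02-30.
    return date(int(year), int(month), int(day)), scenario
```

For the example filename, `parse_evidence_name("2026-03-29_degraded-api-health.md")` yields the date 2026-03-29 and the scenario name `degraded-api-health`.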

---

## Template

Copy the block below into a new file and fill in each section.

````markdown
# Rehearsal Evidence: [Scenario Name]

## Metadata

| Field | Value |
| --- | --- |
| Date | YYYY-MM-DD |
| Rehearsal type | Monthly / Quarterly deep drill |
| Scenario | [scenario filename from docs/ops/rehearsal-scenarios/] |
| Lead | [GitHub username] |
| Participants | [comma-separated GitHub usernames] |
| Commit SHA | [HEAD of main at rehearsal start] |
| OS / Environment | [e.g., Windows 10 Pro, Docker Desktop 4.x] |
| Duration | [actual elapsed time] |
| Outcome | Pass / Partial / Fail |

## Timeline

Use ISO 8601 timestamps (UTC). Record each significant action or observation.

| Timestamp (UTC) | Actor | Action / Observation |
| --- | --- | --- |
| 2026-03-29T14:00:00Z | @lead | Started API with injected fault |
| 2026-03-29T14:02:30Z | @lead | Observed 503 on /health/ready |
| ... | ... | ... |

## Commands Run

Record every command executed during the rehearsal, in order.

```bash
# Example
dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj
curl http://localhost:5000/health/ready
```

## Log Excerpts

Include relevant log output. Redact any secrets or PII.

```
[relevant log lines here]
```

## Root Cause / Diagnosis Summary

Describe what the injected fault was, how it was detected, and what the diagnosis path looked like.

## Recovery Actions Taken

Describe the steps taken to restore the system to a healthy state.

## Findings

List any issues, gaps, or improvements discovered during the rehearsal.

- [ ] Finding 1: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"]
- [ ] Finding 2: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"]

## Sign-Off

| Role | Name | Date | Approved |
| --- | --- | --- | --- |
| Rehearsal lead | @username | YYYY-MM-DD | [ ] |
| Observer | @username | YYYY-MM-DD | [ ] |

## Follow-Up Issues

Link to any issues filed as a result of this rehearsal:

- #NNN: [title]
- #NNN: [title]
````

---

## Required Artifacts Checklist

Every evidence package must include:

- [ ] Completed metadata table with all fields filled
- [ ] Timeline with at least 3 entries (start, key observation, resolution)
- [ ] Commands run section with actual commands (not placeholders)
- [ ] At least one log excerpt, or an explanation of why logs were unavailable
- [ ] Root cause / diagnosis summary
- [ ] Recovery actions taken
- [ ] Findings list (even if empty -- state "No new findings")
- [ ] Sign-off from at least the rehearsal lead
- [ ] Follow-up issues linked (or "None" if no issues were filed)
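Part of this checklist can be pre-screened automatically. The sketch below only confirms that each required `##` section heading from the template is present; it does not judge whether the sections are filled in, so it complements the manual review rather than replacing it. The names `REQUIRED_SECTIONS` and `missing_sections` are illustrative, not existing tooling.

```python
# Section headings taken from the evidence template in this document.
REQUIRED_SECTIONS = [
    "Metadata",
    "Timeline",
    "Commands Run",
    "Log Excerpts",
    "Root Cause / Diagnosis Summary",
    "Recovery Actions Taken",
    "Findings",
    "Sign-Off",
    "Follow-Up Issues",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return required section names that have no matching '## ' heading."""
    # Simplification: headings are matched line by line, so a '## ' line
    # inside a fenced code block would also count as a heading.
    headings = {
        line[3:].strip()
        for line in markdown_text.splitlines()
        if line.startswith("## ")
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```

An empty result means every required heading exists; the reviewer still has to confirm that the commands are real, the timeline has at least three entries, and the sign-off is complete.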

## Related Documents

- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how to file findings as issues
- `docs/ops/rehearsal-scenarios/` -- scenario templates
90 changes: 90 additions & 0 deletions docs/ops/INCIDENT_REHEARSAL_CADENCE.md
@@ -0,0 +1,90 @@
# Incident Rehearsal Cadence

Last Updated: 2026-03-29
Issue: `#150` OPS-19 incident rehearsal and recovery evidence program

## Purpose

Rehearsals validate that the team can diagnose and recover from production-realistic failures using real tooling and documented procedures. They also surface gaps in observability, runbooks, and recovery automation before real incidents expose them.

## Monthly Lightweight Rehearsal

| Field | Detail |
| --- | --- |
| Cadence | First working Thursday of each month |
| Duration | ~30 minutes |
| Scope | Single scenario from `docs/ops/rehearsal-scenarios/` |
| Lead | Rotating (see assignment model below) |
| Participants | Rehearsal lead + one observer minimum |
| Artifacts | Evidence package filed in `docs/ops/rehearsals/` |

Steps:
1. Lead selects a scenario from the scenario library (prefer unexercised or recently failed scenarios).
2. Announce the rehearsal in the team channel at least 24 hours in advance.
3. Execute the scenario using the template's injection method and diagnosis path.
4. Record an evidence package using `docs/ops/EVIDENCE_TEMPLATE.md`.
5. File any discovered issues per `docs/ops/REHEARSAL_BACKOFF_RULES.md`.

## Quarterly Deep Drill

| Field | Detail |
| --- | --- |
| Cadence | Second week of the first month of each quarter (January, April, July, October) |
| Duration | ~2 hours |
| Scope | Combined or cascading scenario (e.g., degraded health + deployment failure) |
| Lead | Rotating (same rotation, offset from monthly) |
| Participants | All active contributors |
| Artifacts | Evidence package + retrospective summary |

Steps:
1. Lead designs a combined scenario at least one week before the drill date.
2. Distribute the scenario brief (pre-conditions, scope, goals) to all participants 48 hours in advance.
3. Execute the drill with explicit role assignments: incident commander, investigator, communicator.
4. Record the evidence package and a retrospective summary covering what went well, what was slow, and what tooling or documentation was missing.
5. File findings and retrospective actions per `docs/ops/REHEARSAL_BACKOFF_RULES.md`.

## Rotation and Assignment Model

Rehearsal lead rotates alphabetically by GitHub username among active contributors.

| Case | Lead selection |
| --- | --- |
| Any month | First contributor alphabetically who has not led in the current quarter |
| Fallback | If the assigned lead is unavailable, the next person in rotation picks up |

The rotation resets each quarter. Deep drills use the same rotation but are offset (the deep-drill lead should not be the same person who led the preceding monthly rehearsal).

To check the current rotation state, see the evidence files for the current quarter in `docs/ops/rehearsals/` -- each records its lead in the metadata table.
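The selection rule above can be expressed as a small function. This is a sketch under stated assumptions: usernames are sorted case-insensitively, and the function returns `None` when every available contributor has already led this quarter, a case the rotation rules do not define. The name `next_lead` and its parameters are hypothetical.

```python
from typing import Iterable, Optional


def next_lead(
    contributors: Iterable[str],
    led_this_quarter: Iterable[str],
    unavailable: Iterable[str] = (),
) -> Optional[str]:
    """Pick the first contributor alphabetically who has not led this quarter."""
    already_led = set(led_this_quarter)
    skipped = set(unavailable)
    # Assumption: alphabetical order ignores case.
    for name in sorted(contributors, key=str.lower):
        if name in already_led:
            continue
        if name in skipped:
            # Fallback rule: the next person in rotation picks up.
            continue
        return name
    # Not covered by the rules above: everyone available has already led.
    return None
```

With contributors `carol`, `alice`, and `Bob`, and `alice` having led already, the function selects `Bob`; if `Bob` is unavailable it falls through to `carol`.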

## Calendar Integration

Add rehearsal dates to the team calendar:

- **Monthly**: recurring event on the first Thursday of each month (moved to the next working Thursday when that day is not a working day), 30 minutes, titled `[Taskdeck] Monthly Incident Rehearsal`
- **Quarterly**: recurring event in the second week of Jan/Apr/Jul/Oct, 2 hours, titled `[Taskdeck] Quarterly Deep Drill`

Include the following in the calendar event description:

```
Scenario library: docs/ops/rehearsal-scenarios/
Evidence template: docs/ops/EVIDENCE_TEMPLATE.md
Backlog rules: docs/ops/REHEARSAL_BACKOFF_RULES.md
```
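If the calendar tool cannot express the monthly recurrence directly, the dates can be generated ahead of time. The sketch below computes the first Thursday of each month and lists the deep-drill months; it is illustrative only, ignores public holidays (so a "working Thursday" adjustment remains a manual step), and does not pick a specific day within the quarterly drill week.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday() numbers Monday as 0
DEEP_DRILL_MONTHS = (1, 4, 7, 10)  # January, April, July, October


def first_thursday(year: int, month: int) -> date:
    """Return the first Thursday of the given month."""
    first_day = date(year, month, 1)
    # Days to step forward from the 1st to reach a Thursday (0 to 6).
    offset = (THURSDAY - first_day.weekday()) % 7
    return first_day + timedelta(days=offset)


def monthly_rehearsal_dates(year: int) -> list[date]:
    """Return the twelve monthly rehearsal dates for a year."""
    return [first_thursday(year, month) for month in range(1, 13)]
```

For example, `first_thursday(2026, 4)` is 2026-04-02, because 1 April 2026 falls on a Wednesday.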

## Scenario Library

Available scenarios in `docs/ops/rehearsal-scenarios/`:

- `degraded-api-health.md` -- API health endpoint returns degraded/unhealthy status
- `missing-telemetry-signal.md` -- Correlation ID missing from OpenTelemetry traces
- `mcp-server-startup-regression.md` -- Optional MCP server fails at boot
- `deployment-readiness-failure.md` -- Docker Compose startup fails readiness checks

New scenarios should follow the same template structure (pre-conditions, injection, diagnosis, recovery, evidence checklist). File them in the `rehearsal-scenarios/` directory with a descriptive kebab-case filename.

## Related Documents

- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- issue filing and SLA rules for findings
- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts (complementary to manual rehearsals)
- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry and dashboard contract
118 changes: 118 additions & 0 deletions docs/ops/REHEARSAL_BACKOFF_RULES.md
@@ -0,0 +1,118 @@
# Rehearsal Backlog Handoff Rules

Last Updated: 2026-03-29
Issue: `#150` OPS-19 incident rehearsal and recovery evidence program

## Purpose

Rehearsals surface real gaps. This document defines how findings from rehearsals become tracked work items with clear ownership and response expectations.

## Filing Issues from Rehearsal Findings

Every finding recorded in the evidence package that requires follow-up must be filed as a GitHub issue within 2 working days of the rehearsal.

### Issue Title Convention

```
[rehearsal-finding] <short description>
```

Example: `[rehearsal-finding] Health endpoint does not report queue worker name on stale heartbeat`

### Issue Body Requirements

Each rehearsal-finding issue must include:

1. **Source rehearsal link**: relative path to the evidence file (e.g., `docs/ops/rehearsals/2026-03-29_degraded-api-health.md`)
2. **Finding description**: what was observed and why it matters
3. **Reproduction steps**: commands or conditions that trigger the finding
4. **Suggested fix or investigation path**: concrete next step, not just "look into this"
5. **Severity label**: one of P1/P2/P3/P4 (see below)

### Template

```markdown
## Source

Rehearsal: `docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md`
Finding #N from evidence package.

## Description

[What was observed and why it matters]

## Reproduction

[Commands or conditions]

## Suggested Fix

[Concrete next step]
```
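When several findings come out of one rehearsal, filling the template by hand gets repetitive. The following sketch assembles the title and body from the title convention and template above; `build_finding_issue` and its parameter names are hypothetical, and it produces text only, leaving the actual filing and labeling to whatever tool the team uses.

```python
def build_finding_issue(
    evidence_path: str,
    finding_number: int,
    short_description: str,
    description: str,
    reproduction: str,
    suggested_fix: str,
) -> tuple[str, str]:
    """Return (title, body) for a rehearsal-finding issue."""
    title = f"[rehearsal-finding] {short_description}"
    # Mirrors the section layout of the issue template in this document.
    body = "\n".join(
        [
            "## Source",
            "",
            f"Rehearsal: `{evidence_path}`",
            f"Finding #{finding_number} from evidence package.",
            "",
            "## Description",
            "",
            description,
            "",
            "## Reproduction",
            "",
            reproduction,
            "",
            "## Suggested Fix",
            "",
            suggested_fix,
        ]
    )
    return title, body
```

Severity and category are intentionally absent from the body here, since the rules in this document express them as labels on the issue.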

## Label Conventions

Apply the following labels to every rehearsal-finding issue:

| Label | When to apply |
| --- | --- |
| `rehearsal-finding` | Always (primary identifier) |
| `hardening` | When the finding relates to reliability or operability |
| `bug` | When the finding is a defect in existing behavior |
| `docs` | When the finding is a documentation gap |
| `testing` | When the finding reveals missing test coverage |

Severity labels:

| Label | Meaning |
| --- | --- |
| `P1` | Blocks production readiness or causes data loss risk |
| `P2` | Degrades reliability or operator experience significantly |
| `P3` | Minor gap, workaround exists |
| `P4` | Cosmetic or nice-to-have improvement |

If `rehearsal-finding` does not yet exist as a GitHub label, create it with color `#D4C5F9` and description `Finding surfaced during incident rehearsal`.

## SLA Expectations

| Severity | Triage SLA | Resolution target |
| --- | --- | --- |
| P1 | Same day | Next release / hotfix |
| P2 | 2 working days | Within current sprint |
| P3 | 5 working days | Within current quarter |
| P4 | Best effort | Backlog; pick up when convenient |

"Triage" means the issue has been reviewed, assigned, and prioritized -- not necessarily started.
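The triage deadlines in the table can be computed from the filing date. This is a minimal sketch assuming that working days are Monday to Friday, that public holidays are ignored, and that counting starts on the day after filing; none of these details are specified above, and the function names are illustrative.

```python
from datetime import date, timedelta
from typing import Optional

# Triage SLA in working days, from the table above. P4 is best effort.
TRIAGE_WORKING_DAYS = {"P1": 0, "P2": 2, "P3": 5}


def add_working_days(start: date, days: int) -> date:
    """Step forward by the given number of Monday-to-Friday days."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def triage_due(filed_on: date, severity: str) -> Optional[date]:
    """Return the triage deadline, or None for best-effort severities."""
    if severity == "P4":
        return None
    return add_working_days(filed_on, TRIAGE_WORKING_DAYS[severity])
```

For an issue filed on Friday 2026-04-03, a P2 is due for triage on Tuesday 2026-04-07 and a P3 on Friday 2026-04-10; a P1 is due the same day.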

## Connecting Findings to Evidence

Every rehearsal-finding issue must link back to its source evidence file. Use the following format in the issue body:

```
Source rehearsal: docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md
```

Conversely, the evidence file's "Follow-Up Issues" section must link forward to all filed issues:

```markdown
## Follow-Up Issues

- #NNN: [title]
```

This bidirectional linking ensures no finding is orphaned.

## Escalation

If a P1 finding is discovered during a rehearsal:

1. File the issue immediately (do not wait for the 2-working-day filing window).
2. Tag the issue with `P1` and `rehearsal-finding`.
3. Notify the team channel with a link to the issue.
4. The rehearsal lead owns triage until the issue is assigned.

## Related Documents

- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation
- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
- `docs/ops/GITHUB_LABEL_TAXONOMY.md` -- canonical label definitions