diff --git a/docs/MANUAL_TEST_CHECKLIST.md b/docs/MANUAL_TEST_CHECKLIST.md index 92d3add81..52c2cbcfc 100644 --- a/docs/MANUAL_TEST_CHECKLIST.md +++ b/docs/MANUAL_TEST_CHECKLIST.md @@ -463,3 +463,14 @@ Summary scope: 4. True-missing vs cross-user denial indistinguishability (B-90 to B-96) 5. Error payload contract verification for auth/validation/sandbox paths (B-100 to B-110) 6. Advanced controller families: ops/logs/users/abuse/llm-quota/agents/knowledge/webhooks/external-imports (B-130 to B-175) + +--- + +## Incident Rehearsals + +For operational failure diagnosis and recovery validation beyond functional testing, see the incident rehearsal program: + +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule and rotation +- `docs/ops/rehearsal-scenarios/` -- scenario templates +- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format +- `docs/ops/rehearsals/` -- completed rehearsal evidence diff --git a/docs/TESTING_GUIDE.md b/docs/TESTING_GUIDE.md index e2808f681..113eec5db 100644 --- a/docs/TESTING_GUIDE.md +++ b/docs/TESTING_GUIDE.md @@ -604,6 +604,19 @@ Recommended execution pairing: - automated: API + frontend unit + E2E capture loop (`#210` delivered, retained as active regression path) - manual: capture friction/trust checks in `docs/MANUAL_TEST_CHECKLIST.md` +## Incident Rehearsals + +Manual incident rehearsals complement automated tests by validating diagnosis and recovery workflows against realistic failure conditions. Rehearsals are scheduled monthly (lightweight, ~30 min) and quarterly (deep drill, ~2 hours). 
+ +Key resources: +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- schedule, rotation, and process +- `docs/ops/rehearsal-scenarios/` -- scenario templates (health degradation, telemetry gaps, deployment failures) +- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format +- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how rehearsal findings become tracked issues +- `docs/ops/rehearsals/` -- completed rehearsal evidence packages + +Rehearsals are distinct from the automated failure-injection drill suite (`docs/ops/FAILURE_INJECTION_DRILLS.md`). Drills are scripted and CI-runnable; rehearsals are human-driven and focus on diagnosis speed, tooling gaps, and recovery muscle memory. + ## Development Sandbox Mode For local development only, authorization bypass can be enabled via: diff --git a/docs/ops/EVIDENCE_TEMPLATE.md b/docs/ops/EVIDENCE_TEMPLATE.md new file mode 100644 index 000000000..f539c2fa4 --- /dev/null +++ b/docs/ops/EVIDENCE_TEMPLATE.md @@ -0,0 +1,119 @@ +# Rehearsal Evidence Package Template + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Purpose + +Every rehearsal produces an evidence package that records what happened, what was found, and what follow-up is needed. This template defines the required format. + +Evidence files are stored in `docs/ops/rehearsals/` with the naming convention: + +``` +YYYY-MM-DD_scenario-name.md +``` + +Example: `2026-03-29_degraded-api-health.md` + +--- + +## Template + +Copy the block below into a new file and fill in each section. 
+ +```markdown +# Rehearsal Evidence: [Scenario Name] + +## Metadata + +| Field | Value | +| --- | --- | +| Date | YYYY-MM-DD | +| Rehearsal type | Monthly / Quarterly deep drill | +| Scenario | [scenario filename from docs/ops/rehearsal-scenarios/] | +| Lead | [GitHub username] | +| Participants | [comma-separated GitHub usernames] | +| Commit SHA | [HEAD of main at rehearsal start] | +| OS / Environment | [e.g., Windows 10 Pro, Docker Desktop 4.x] | +| Duration | [actual elapsed time] | +| Outcome | Pass / Partial / Fail | + +## Timeline + +Use ISO 8601 timestamps (UTC). Record each significant action or observation. + +| Timestamp (UTC) | Actor | Action / Observation | +| --- | --- | --- | +| 2026-03-29T14:00:00Z | @lead | Started API with injected fault | +| 2026-03-29T14:02:30Z | @lead | Observed 503 on /health/ready | +| ... | ... | ... | + +## Commands Run + +Record every command executed during the rehearsal, in order. + +```bash +# Example +dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +curl http://localhost:5000/health/ready +``` + +## Log Excerpts + +Include relevant log output. Redact any secrets or PII. + +``` +[relevant log lines here] +``` + +## Root Cause / Diagnosis Summary + +Describe what the injected fault was, how it was detected, and what the diagnosis path looked like. + +## Recovery Actions Taken + +Describe the steps taken to restore the system to a healthy state. + +## Findings + +List any issues, gaps, or improvements discovered during the rehearsal. 
+ +- [ ] Finding 1: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"] +- [ ] Finding 2: [description] -- Severity: [P1/P2/P3/P4] -- Issue: [#NNN or "to be filed"] + +## Sign-Off + +| Role | Name | Date | Approved | +| --- | --- | --- | --- | +| Rehearsal lead | @username | YYYY-MM-DD | [ ] | +| Observer | @username | YYYY-MM-DD | [ ] | + +## Follow-Up Issues + +Link to any issues filed as a result of this rehearsal: + +- #NNN: [title] +- #NNN: [title] +``` + +--- + +## Required Artifacts Checklist + +Every evidence package must include: + +- [ ] Completed metadata table with all fields filled +- [ ] Timeline with at least 3 entries (start, key observation, resolution) +- [ ] Commands run section with actual commands (not placeholders) +- [ ] At least one log excerpt or explanation of why logs were unavailable +- [ ] Root cause / diagnosis summary +- [ ] Recovery actions taken +- [ ] Findings list (even if empty -- state "No new findings") +- [ ] Sign-off from at least the rehearsal lead +- [ ] Follow-up issues linked (or "None" if no issues were filed) + +## Related Documents + +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation +- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- how to file findings as issues +- `docs/ops/rehearsal-scenarios/` -- scenario templates diff --git a/docs/ops/INCIDENT_REHEARSAL_CADENCE.md b/docs/ops/INCIDENT_REHEARSAL_CADENCE.md new file mode 100644 index 000000000..aa14fb988 --- /dev/null +++ b/docs/ops/INCIDENT_REHEARSAL_CADENCE.md @@ -0,0 +1,90 @@ +# Incident Rehearsal Cadence + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Purpose + +Rehearsals validate that the team can diagnose and recover from production-realistic failures using real tooling and documented procedures. They also surface gaps in observability, runbooks, and recovery automation before real incidents expose them. 
+ +## Monthly Lightweight Rehearsal + +| Field | Detail | +| --- | --- | +| Cadence | First working Thursday of each month | +| Duration | ~30 minutes | +| Scope | Single scenario from `docs/ops/rehearsal-scenarios/` | +| Lead | Rotating (see assignment model below) | +| Participants | Rehearsal lead + one observer minimum | +| Artifacts | Evidence package filed in `docs/ops/rehearsals/` | + +Steps: +1. Lead selects a scenario from the scenario library (prefer unexercised or recently-failed scenarios). +2. Announce the rehearsal in the team channel at least 24 hours in advance. +3. Execute the scenario using the template's injection method and diagnosis path. +4. Record an evidence package using `docs/ops/EVIDENCE_TEMPLATE.md`. +5. File any discovered issues per `docs/ops/REHEARSAL_BACKOFF_RULES.md`. + +## Quarterly Deep Drill + +| Field | Detail | +| --- | --- | +| Cadence | Second week of Q1/Q2/Q3/Q4 (January, April, July, October) | +| Duration | ~2 hours | +| Scope | Combined or cascading scenario (e.g., degraded health + deployment failure) | +| Lead | Rotating (same rotation, offset from monthly) | +| Participants | All active contributors | +| Artifacts | Evidence package + retrospective summary | + +Steps: +1. Lead designs a combined scenario at least one week before the drill date. +2. Distribute the scenario brief (pre-conditions, scope, goals) to all participants 48 hours in advance. +3. Execute the drill with explicit role assignments: incident commander, investigator, communicator. +4. Record the evidence package and a retrospective summary covering what went well, what was slow, and what tooling or documentation was missing. +5. File findings and retrospective actions per `docs/ops/REHEARSAL_BACKOFF_RULES.md`. + +## Rotation and Assignment Model + +Rehearsal lead rotates alphabetically by GitHub username among active contributors. 
+
+| Month | Lead selection |
+| --- | --- |
+| Month N | First contributor alphabetically who has not led in the current quarter |
+| Fallback | If the assigned lead is unavailable, the next person in rotation picks up |
+
+The rotation resets each quarter. Deep drills use the same rotation but are offset (the deep-drill lead should not be the same person who led the preceding monthly rehearsal).
+
+To check the current rotation state, see the most recent evidence file in `docs/ops/rehearsals/` -- the lead is recorded in the metadata section.
+
+## Calendar Integration
+
+Add rehearsal dates to the team calendar:
+
+- **Monthly**: recurring event on the first working Thursday of each month, 30 minutes, titled `[Taskdeck] Monthly Incident Rehearsal`
+- **Quarterly**: recurring event in the second week of Jan/Apr/Jul/Oct, 2 hours, titled `[Taskdeck] Quarterly Deep Drill`
+
+Include the following in the calendar event description:
+
+```
+Scenario library: docs/ops/rehearsal-scenarios/
+Evidence template: docs/ops/EVIDENCE_TEMPLATE.md
+Backlog rules: docs/ops/REHEARSAL_BACKOFF_RULES.md
+```
+
+## Scenario Library
+
+Available scenarios in `docs/ops/rehearsal-scenarios/`:
+
+- `degraded-api-health.md` -- API health endpoint returns degraded/unhealthy status
+- `missing-telemetry-signal.md` -- Correlation ID missing from OpenTelemetry traces
+- `mcp-server-startup-regression.md` -- Optional MCP server fails at boot
+- `deployment-readiness-failure.md` -- Docker Compose startup fails readiness checks
+
+New scenarios should follow the same template structure (pre-conditions, injection, diagnosis, recovery, evidence checklist). File them in the `rehearsal-scenarios/` directory with a descriptive kebab-case filename.
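To make the expected structure concrete, a new scenario file might start from a skeleton like this (section contents are illustrative placeholders, not a real scenario):

```markdown
# Scenario: [Descriptive Name]

Last Updated: YYYY-MM-DD
Issue: [tracking issue]

## Overview

[One paragraph: the simulated failure and the diagnosis/recovery goal.]

## Pre-Conditions

- [Repo state, running services, and tooling required before injection.]

## Injection Method

[Commands or configuration changes that introduce the fault.]

## Expected Diagnosis Path

1. [First check an operator would run.]
2. [How to interpret the result and narrow the cause.]

## Recovery Steps

[Commands that restore a healthy state, with verification.]

## Evidence Checklist

- [ ] [Artifacts the evidence package must capture.]

## Related Documents

- [Links to relevant runbooks or implementation files.]
```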
+
+## Related Documents
+
+- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format
+- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- issue filing and SLA rules for findings
+- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts (complementary to manual rehearsals)
+- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry and dashboard contract
diff --git a/docs/ops/REHEARSAL_BACKOFF_RULES.md b/docs/ops/REHEARSAL_BACKOFF_RULES.md
new file mode 100644
index 000000000..ec44daed7
--- /dev/null
+++ b/docs/ops/REHEARSAL_BACKOFF_RULES.md
@@ -0,0 +1,118 @@
+# Rehearsal Backlog Handoff Rules
+
+Last Updated: 2026-03-29
+Issue: `#150` OPS-19 incident rehearsal and recovery evidence program
+
+## Purpose
+
+Rehearsals surface real gaps. This document defines how findings from rehearsals become tracked work items with clear ownership and response expectations.
+
+## Filing Issues from Rehearsal Findings
+
+Every finding recorded in the evidence package that requires follow-up must be filed as a GitHub issue within 2 working days of the rehearsal.
+
+### Issue Title Convention
+
+```
+[rehearsal-finding] <short description>
+```
+
+Example: `[rehearsal-finding] Health endpoint does not report queue worker name on stale heartbeat`
+
+### Issue Body Requirements
+
+Each rehearsal-finding issue must include:
+
+1. **Source rehearsal link**: relative path to the evidence file (e.g., `docs/ops/rehearsals/2026-03-29_degraded-api-health.md`)
+2. **Finding description**: what was observed and why it matters
+3. **Reproduction steps**: commands or conditions that trigger the finding
+4. **Suggested fix or investigation path**: concrete next step, not just "look into this"
+5. **Severity label**: one of P1/P2/P3/P4 (see below)
+
+### Template
+
+```markdown
+## Source
+
+Rehearsal: `docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md`
+Finding #N from evidence package.
+ +## Description + +[What was observed and why it matters] + +## Reproduction + +[Commands or conditions] + +## Suggested Fix + +[Concrete next step] +``` + +## Label Conventions + +Apply the following labels to every rehearsal-finding issue: + +| Label | When to apply | +| --- | --- | +| `rehearsal-finding` | Always (primary identifier) | +| `hardening` | When the finding relates to reliability or operability | +| `bug` | When the finding is a defect in existing behavior | +| `docs` | When the finding is a documentation gap | +| `testing` | When the finding reveals missing test coverage | + +Severity labels: + +| Label | Meaning | +| --- | --- | +| `P1` | Blocks production readiness or causes data loss risk | +| `P2` | Degrades reliability or operator experience significantly | +| `P3` | Minor gap, workaround exists | +| `P4` | Cosmetic or nice-to-have improvement | + +If `rehearsal-finding` does not yet exist as a GitHub label, create it with color `#D4C5F9` and description `Finding surfaced during incident rehearsal`. + +## SLA Expectations + +| Severity | Triage SLA | Resolution target | +| --- | --- | --- | +| P1 | Same day | Next release / hotfix | +| P2 | 2 working days | Within current sprint | +| P3 | 5 working days | Within current quarter | +| P4 | Best effort | Backlog; pick up when convenient | + +"Triage" means the issue has been reviewed, assigned, and prioritized -- not necessarily started. + +## Connecting Findings to Evidence + +Every rehearsal-finding issue must link back to its source evidence file. Use the following format in the issue body: + +``` +Source rehearsal: docs/ops/rehearsals/YYYY-MM-DD_scenario-name.md +``` + +Conversely, the evidence file's "Follow-Up Issues" section must link forward to all filed issues: + +```markdown +## Follow-Up Issues + +- #NNN: [title] +``` + +This bidirectional linking ensures no finding is orphaned. + +## Escalation + +If a P1 finding is discovered during a rehearsal: + +1. 
File the issue immediately (do not wait for the 2-day window). +2. Tag the issue with `P1` and `rehearsal-finding`. +3. Notify the team channel with a link to the issue. +4. The rehearsal lead owns triage until the issue is assigned. + +## Related Documents + +- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` -- rehearsal schedule and rotation +- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format +- `docs/ops/GITHUB_LABEL_TAXONOMY.md` -- canonical label definitions diff --git a/docs/ops/rehearsal-scenarios/degraded-api-health.md b/docs/ops/rehearsal-scenarios/degraded-api-health.md new file mode 100644 index 000000000..bbd5f625f --- /dev/null +++ b/docs/ops/rehearsal-scenarios/degraded-api-health.md @@ -0,0 +1,151 @@ +# Scenario: Degraded API Health + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a condition where the `/health/ready` endpoint returns `503 NotReady` due to a degraded subsystem (database unreachable, queue backlog exceeded, or worker heartbeat stale). Diagnose which check failed and recover the system to a healthy state. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. +- Backend builds successfully: `dotnet build backend/Taskdeck.sln -c Release` +- No other Taskdeck API instance running on port 5000. +- SQLite database file is accessible (default: `taskdeck.db` in the API project directory). +- `curl` or equivalent HTTP client available. + +## Injection Method + +Choose one of the following fault injection approaches: + +### Option A: Database Connectivity Fault + +Rename or lock the SQLite database file before starting the API so the database connectivity check fails. 
+ +```bash +# From repo root +cd backend/src/Taskdeck.Api +# Rename the DB file to simulate missing database +mv taskdeck.db taskdeck.db.bak 2>/dev/null || true +# Start the API +dotnet run --project Taskdeck.Api.csproj +``` + +Note: EF Core with SQLite will auto-create a new empty database. To truly break connectivity, set the connection string to a read-only or non-existent directory: + +```bash +ConnectionStrings__DefaultConnection="Data Source=/nonexistent/path/taskdeck.db" \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +### Option B: Worker Heartbeat Staleness + +Start the API with queue processing enabled, then observe the heartbeat staleness window. The `queueToProposal` worker is considered stale if its last heartbeat exceeds `QueuePollIntervalSeconds * 3` (minimum 30 seconds). The `proposalHousekeeping` worker goes stale after 3 minutes. + +To inject staleness without code changes: start the API, wait for workers to begin heartbeating, then suspend the worker thread (not practical without code changes). Instead, inspect the staleness values reported by `/health/ready` and understand the thresholds. + +For a realistic rehearsal: modify `appsettings.Development.json` temporarily to set `Workers:QueuePollIntervalSeconds` to 1, then observe how quickly the worker goes stale if delayed. 
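The staleness rule above reduces to a tiny calculation. The sketch below restates it for quick reference during diagnosis; the function name is illustrative, not code from the repo:

```python
def queue_worker_staleness_threshold(poll_interval_seconds: int) -> int:
    """Staleness threshold for the queueToProposal worker, per the rule
    above: QueuePollIntervalSeconds * 3, with a 30-second floor."""
    return max(poll_interval_seconds * 3, 30)

# With the rehearsal override of Workers:QueuePollIntervalSeconds = 1,
# the 30-second floor still applies, so the worker is only reported
# stale once its heartbeat is more than 30 seconds old.
print(queue_worker_staleness_threshold(1))   # 30
print(queue_worker_staleness_threshold(15))  # 45
print(queue_worker_staleness_threshold(60))  # 180
```

The `proposalHousekeeping` worker uses a fixed 3-minute window instead, so it is unaffected by the poll-interval override.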
+ +### Option C: Queue Backlog Overload + +Flood the LLM queue with pending items to exceed the threshold (`MaxBatchSize * 20`, minimum 100): + +```bash +# Start the API +dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj & + +# Register and authenticate +curl -s -X POST http://localhost:5000/api/auth/register \ + -H "Content-Type: application/json" \ + -d '{"username":"rehearsal-user","password":"Rehearsal123!"}' + +TOKEN=$(curl -s -X POST http://localhost:5000/api/auth/login \ + -H "Content-Type: application/json" \ + -d '{"username":"rehearsal-user","password":"Rehearsal123!"}' | jq -r '.token') + +# Create a board to target +BOARD_ID=$(curl -s -X POST http://localhost:5000/api/boards \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $TOKEN" \ + -d '{"name":"rehearsal-board"}' | jq -r '.id') + +# Submit many LLM queue items (adjust count to exceed threshold) +for i in $(seq 1 120); do + curl -s -X POST http://localhost:5000/api/llm-queue \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $TOKEN" \ + -d "{\"boardId\":\"$BOARD_ID\",\"requestType\":\"Suggest\",\"payload\":\"item $i\"}" > /dev/null +done +``` + +## Expected Diagnosis Path + +1. **Check health endpoints**: + ```bash + curl -s http://localhost:5000/health/live | jq . + curl -s http://localhost:5000/health/ready | jq . + ``` + +2. **Interpret the response**: The `/health/ready` response includes a `checks` object with `database`, `queue`, and `workers` sub-checks. Each has a `status` field (`Healthy`, `Degraded`, `Unhealthy`, `Stale`, `Starting`, `Disabled`). + +3. **Identify the failing check**: Look for non-`Healthy` status values. Examples: + - `checks.database.status: "Unhealthy"` with an `error` field + - `checks.queue.status: "Degraded"` with `depth` exceeding `threshold` + - `checks.workers.queueToProposal.status: "Stale"` with `stalenessSeconds` exceeding `maxStalenessSeconds` + +4. 
**Correlate with logs**: Check the API console output for error messages related to the failing subsystem. + +5. **Verify with telemetry** (if OpenTelemetry is enabled): Check for `taskdeck.automation.queue.backlog` and `taskdeck.worker.heartbeat.staleness` metrics. + +## Recovery Steps + +### Database Fault Recovery + +```bash +# Restore the original database +mv taskdeck.db.bak taskdeck.db +# Or fix the connection string and restart +# Verify recovery +curl -s http://localhost:5000/health/ready | jq .checks.database +``` + +### Queue Backlog Recovery + +The queue will drain naturally as the worker processes items. To accelerate: + +```bash +# Check current queue depth +curl -s http://localhost:5000/health/ready | jq .checks.queue + +# Wait for worker to process items, or restart with higher batch size +# Workers__MaxBatchSize=20 dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +### Worker Staleness Recovery + +```bash +# If the worker is stuck, restart the API process +# Verify heartbeats resume +curl -s http://localhost:5000/health/ready | jq .checks.workers +``` + +## Evidence Checklist + +After completing the rehearsal, verify the evidence package includes: + +- [ ] Screenshot or captured output of the degraded `/health/ready` response (503) +- [ ] Identification of which specific check failed (database, queue, or workers) +- [ ] The exact `status`, `error`, or threshold values from the response +- [ ] Commands used to inject the fault +- [ ] Commands used to diagnose the fault +- [ ] Commands used to recover +- [ ] Captured output of the recovered `/health/ready` response (200) +- [ ] Any log excerpts showing error or warning messages during the degraded state +- [ ] Findings about gaps in the health response (e.g., missing context, unclear error messages) + +## Related Documents + +- `backend/src/Taskdeck.Api/Controllers/HealthController.cs` -- health endpoint implementation +- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry 
contract +- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts diff --git a/docs/ops/rehearsal-scenarios/deployment-readiness-failure.md b/docs/ops/rehearsal-scenarios/deployment-readiness-failure.md new file mode 100644 index 000000000..717656c6c --- /dev/null +++ b/docs/ops/rehearsal-scenarios/deployment-readiness-failure.md @@ -0,0 +1,182 @@ +# Scenario: Deployment Readiness Failure + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a Docker Compose deployment where the API container starts but fails readiness checks. Triage whether the failure is in the container build, runtime configuration, networking, or dependent services. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. +- Docker Engine with `docker compose` support installed and running. +- `deploy/.env` configured per `deploy/.env.example` (at minimum, `TASKDECK_JWT_SECRET` must be set). +- No other services occupying the default proxy port (8080). +- `curl` or equivalent HTTP client available. + +## Injection Method + +### Option A: Missing Required Environment Variable + +Remove or empty the required `TASKDECK_JWT_SECRET` from `deploy/.env`: + +```bash +# Back up the env file +cp deploy/.env deploy/.env.bak + +# Remove the JWT secret (compose will fail at render time) +sed -i 's/^TASKDECK_JWT_SECRET=.*/TASKDECK_JWT_SECRET=/' deploy/.env + +# Attempt to start +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build +``` + +### Option B: Invalid Database Path + +Override the database connection string to point to a non-writable path inside the container. 
The compose file sets `ConnectionStrings__DefaultConnection` directly, so override it at the shell level: + +```bash +# Start with an invalid DB path (overrides the compose-file value) +ConnectionStrings__DefaultConnection="Data Source=/readonly/taskdeck.db" \ +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build + +# Or add the override in a docker-compose.override.yml environment block +``` + +### Option C: Port Conflict on Proxy + +Occupy the proxy port before starting compose: + +```bash +# Block port 8080 +python -m http.server 8080 & +BLOCKER_PID=$! + +# Attempt to start compose +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build + +# Clean up later +kill $BLOCKER_PID +``` + +### Option D: Corrupted Image Build + +Introduce a build failure by temporarily modifying the Dockerfile: + +```bash +# Back up +cp deploy/docker/backend.Dockerfile deploy/docker/backend.Dockerfile.bak + +# Inject a failing step (add a bad RUN command) +echo "RUN exit 1" >> deploy/docker/backend.Dockerfile + +# Attempt to build +docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build +``` + +## Expected Diagnosis Path + +1. **Check container status**: + ```bash + docker compose -f deploy/docker-compose.yml --profile baseline ps + ``` + Look for containers in `Exited`, `Restarting`, or `Created` (not `Running`) state. + +2. **Check container logs**: + ```bash + docker compose -f deploy/docker-compose.yml --profile baseline logs api + docker compose -f deploy/docker-compose.yml --profile baseline logs web + docker compose -f deploy/docker-compose.yml --profile baseline logs proxy + ``` + +3. **Test readiness through the proxy**: + ```bash + curl -s http://localhost:8080/health/live | jq . + curl -s http://localhost:8080/health/ready | jq . + ``` + If the proxy is up but the API is not, expect a `502 Bad Gateway` from nginx. + +4. 
**Test the API container directly** (bypassing proxy): + ```bash + # Get the API container's internal port mapping + docker compose -f deploy/docker-compose.yml --profile baseline port api 8080 + # Or exec into the proxy container + docker compose -f deploy/docker-compose.yml --profile baseline exec proxy curl -s http://api:8080/health/ready + ``` + +5. **Check build output** (if the build failed): + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline build 2>&1 | tail -30 + ``` + +6. **Verify environment variable injection**: + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline config | grep -A5 JWT + ``` + +## Recovery Steps + +### Missing Environment Variable + +1. Restore the env file: + ```bash + mv deploy/.env.bak deploy/.env + ``` +2. Restart: + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d + ``` + +### Invalid Database Path + +1. Fix the connection string in the environment configuration. +2. If the container already created a bad state, remove the volume and restart: + ```bash + docker compose -f deploy/docker-compose.yml --profile baseline down -v + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d + ``` + +### Port Conflict + +1. Identify and stop the conflicting process. +2. Or change the proxy port: + ```bash + TASKDECK_PROXY_PORT=8081 docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d + ``` + +### Corrupted Dockerfile + +1. Restore the original Dockerfile: + ```bash + mv deploy/docker/backend.Dockerfile.bak deploy/docker/backend.Dockerfile + ``` +2. 
Rebuild: + ```bash + docker compose -f deploy/docker-compose.yml --env-file deploy/.env --profile baseline up -d --build + ``` + +### Full Cleanup + +```bash +docker compose -f deploy/docker-compose.yml --profile baseline down -v +``` + +## Evidence Checklist + +- [ ] `docker compose ps` output showing container states +- [ ] Container logs for the failing service(s) +- [ ] Health endpoint responses (or connection errors if unreachable) +- [ ] `docker compose config` output showing effective environment (secrets redacted) +- [ ] Build output if the failure was at build time +- [ ] Commands used to diagnose the root cause +- [ ] Commands used to recover +- [ ] Verification of healthy deployment after recovery (`/health/live` and `/health/ready` both 200) +- [ ] Any findings about error clarity in container logs or compose output + +## Related Documents + +- `docs/ops/DEPLOYMENT_CONTAINERS.md` -- container deployment baseline +- `docs/ops/DEPLOYMENT_HARDENING_MATRIX.md` -- hardening verification matrix +- `deploy/docker-compose.yml` -- compose configuration +- `deploy/.env.example` -- required environment variables diff --git a/docs/ops/rehearsal-scenarios/mcp-server-startup-regression.md b/docs/ops/rehearsal-scenarios/mcp-server-startup-regression.md new file mode 100644 index 000000000..cbfa61c5b --- /dev/null +++ b/docs/ops/rehearsal-scenarios/mcp-server-startup-regression.md @@ -0,0 +1,139 @@ +# Scenario: MCP Server Startup Regression + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a failure in an optional MCP (Model Context Protocol) server used by CLI/IDE agents during development. The MCP server fails at boot due to a configuration or dependency error. Investigate the failure, determine whether it blocks core API functionality, and recover. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. 
+- MCP server configuration present (`.mcp.json` or equivalent in repo root or `.claude/` directory). +- Node.js / npx available (MCP servers are typically Node-based). +- Backend API is not dependent on MCP servers for core operation (MCP is a development tooling layer). + +## Injection Method + +### Option A: Invalid MCP Server Command + +Modify the MCP server configuration to reference a non-existent command or package. + +```bash +# Check current MCP config +cat .mcp.json 2>/dev/null || cat .claude/mcp.json 2>/dev/null || echo "No MCP config found" + +# If .mcp.json exists, back it up and inject a fault +cp .mcp.json .mcp.json.bak +# Edit .mcp.json to change a server command to a non-existent binary +# e.g., change "npx" to "npx-nonexistent" for one server entry +``` + +### Option B: Missing API Key for MCP Server + +Some MCP servers require API keys or tokens. Remove or invalidate the required environment variable. + +```bash +# Start with an invalid key for a server that requires authentication +GITHUB_PERSONAL_ACCESS_TOKEN="" npx @modelcontextprotocol/server-github +``` + +### Option C: Port Conflict + +If an MCP server binds to a specific port, start a conflicting listener first. + +```bash +# Occupy the port before starting the MCP server +python -m http.server 3000 & +# Then attempt to start the MCP server that needs port 3000 +``` + +## Expected Diagnosis Path + +1. **Attempt to start the MCP server and capture the error**: + ```bash + # Try to start the server using the configured command + # Capture stderr for error messages + npx @modelcontextprotocol/server-github 2>&1 | head -20 + ``` + +2. **Check if the core API is affected**: + ```bash + # MCP servers are optional tooling -- verify the API is unaffected + curl -s http://localhost:5000/health/live | jq . + curl -s http://localhost:5000/health/ready | jq . + ``` + Both endpoints should return healthy. MCP server failure must not degrade API health. + +3. 
**Inspect the MCP configuration**: + ```bash + cat .mcp.json | jq . + # Or check the Claude config directory + ls -la .claude/ + ``` + +4. **Check for dependency issues**: + ```bash + # Verify the MCP server package is resolvable + npx --yes @modelcontextprotocol/server-github --version 2>&1 + ``` + +5. **Review MCP tooling guide for known issues**: + Reference `docs/MCP_TOOLING_GUIDE.md` for the current MCP server status and fallback rules. + +## Recovery Steps + +### Invalid Command + +1. Restore the original MCP configuration: + ```bash + mv .mcp.json.bak .mcp.json + ``` +2. Verify the server starts: + ```bash + # Test the corrected command + npx @modelcontextprotocol/server-github --help 2>&1 | head -5 + ``` + +### Missing API Key + +1. Set the required environment variable: + ```bash + export GITHUB_PERSONAL_ACCESS_TOKEN="ghp_..." + ``` +2. Restart the MCP server. + +### Port Conflict + +1. Identify the conflicting process: + ```bash + # Linux/Mac + lsof -i :3000 + # Windows + netstat -ano | findstr :3000 + ``` +2. Stop the conflicting process or reconfigure the MCP server port. + +### Fallback: Work Without MCP + +Per the MCP Tooling Guide, when MCP is unavailable: +- Use shell/CLI as fallback for the same operations. +- Note the MCP unavailability in the work summary. +- Core development and API operations are unaffected. + +## Evidence Checklist + +- [ ] Captured error output from the failed MCP server startup +- [ ] Verification that `/health/live` and `/health/ready` are unaffected by MCP failure +- [ ] MCP configuration file contents (redact any tokens or secrets) +- [ ] Diagnosis steps taken to identify the root cause +- [ ] Recovery commands and verification of restored MCP server operation +- [ ] Confirmation that the failure was isolated to development tooling (no production impact) +- [ ] Any findings about MCP error messages (clear vs. cryptic, actionable vs. 
not) + +## Related Documents + +- `docs/MCP_TOOLING_GUIDE.md` -- MCP tool selection rules and fallback policy +- `docs/tooling/MCP_OPERATIONS_RUNBOOK.md` -- credential setup and verification +- `.mcp.json` -- MCP server configuration (if present) diff --git a/docs/ops/rehearsal-scenarios/missing-telemetry-signal.md b/docs/ops/rehearsal-scenarios/missing-telemetry-signal.md new file mode 100644 index 000000000..f15dae716 --- /dev/null +++ b/docs/ops/rehearsal-scenarios/missing-telemetry-signal.md @@ -0,0 +1,115 @@ +# Scenario: Missing Telemetry Signal + +Last Updated: 2026-03-29 +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program + +## Overview + +Simulate a condition where the request correlation ID (`X-Request-Id`) is missing from OpenTelemetry trace spans. Diagnose whether the issue is in middleware configuration, telemetry export pipeline, or span attribute propagation. + +## Pre-Conditions + +- Repository checked out at a known commit on `main`. +- Backend builds successfully: `dotnet build backend/Taskdeck.sln -c Release` +- OpenTelemetry console exporter enabled for local inspection (no external collector required). +- `curl` or equivalent HTTP client available. + +## Injection Method + +### Option A: Disable Correlation Middleware + +Temporarily comment out or misconfigure the request correlation middleware registration in `Program.cs` to simulate a deployment where correlation IDs stop being propagated to trace spans. + +For a non-code-change rehearsal: start the API with OpenTelemetry enabled and verify correlation attributes are present, then start a second instance with `Observability:EnableOpenTelemetry=false` and observe the absence. 
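For the comparison half of that non-code-change rehearsal, the disabled instance can be started the same way; this is a sketch assuming the same `Observability__*` environment keys used throughout this scenario:

```shell
# Comparison instance: OpenTelemetry disabled -- the console output should
# contain no spans and no taskdeck.correlation_id attributes
Observability__EnableOpenTelemetry=false \
Observability__ServiceName=taskdeck-rehearsal \
  dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj
```

Diffing the console output of the two runs makes the missing-signal condition obvious before moving on to the diagnosis steps.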
+ +```bash +# Start with telemetry enabled and console exporter +Observability__EnableOpenTelemetry=true \ +Observability__EnableConsoleExporter=true \ +Observability__ServiceName=taskdeck-rehearsal \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +### Option B: Missing OTLP Endpoint + +Configure the API to export to a non-existent OTLP collector endpoint. Traces will be generated but silently dropped. + +```bash +Observability__EnableOpenTelemetry=true \ +Observability__OtlpEndpoint=http://localhost:4317 \ +Observability__EnableConsoleExporter=false \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj +``` + +This simulates a production scenario where the collector is down or misconfigured. + +## Expected Diagnosis Path + +1. **Make a request and check for correlation headers**: + ```bash + # Send a request and inspect response headers + curl -v http://localhost:5000/health/live 2>&1 | grep -i x-request-id + ``` + The API should echo back an `X-Request-Id` header. If missing, correlation middleware is not running. + +2. **Check console exporter output** (if enabled): + Look for trace spans in the API console output. Expected attributes: + - `taskdeck.correlation_id` + - `taskdeck.request_id` + + If these attributes are absent from spans but the `X-Request-Id` header is present in the HTTP response, the issue is in span attribute propagation (middleware runs but does not tag spans). + +3. **Verify OpenTelemetry configuration**: + ```bash + # Check appsettings for telemetry config + cat backend/src/Taskdeck.Api/appsettings.json | grep -A5 Observability + cat backend/src/Taskdeck.Api/appsettings.Development.json | grep -A5 Observability + ``` + +4. **Check for OTLP export errors**: + If the OTLP endpoint is unreachable, the OpenTelemetry SDK may log warnings. Look for messages containing `OTLP`, `export`, or `gRPC` errors in the console output. + +5. 
**Verify worker spans include expected attributes**: + ```bash + # Trigger a worker cycle by submitting a queue item, then check console for worker span attributes + # Look for: taskdeck.worker.name, taskdeck.llm.request_id + ``` + +## Recovery Steps + +### Correlation Middleware Missing + +1. Verify the middleware registration order in `Program.cs`. +2. Confirm the correlation middleware runs before the endpoint routing middleware. +3. Restart the API and verify `X-Request-Id` appears in response headers. + +### OTLP Endpoint Unreachable + +1. Verify the collector is running: + ```bash + curl -s http://localhost:4317 || echo "OTLP endpoint unreachable" + ``` +2. Fix the endpoint URL in configuration or start the collector. +3. Restart the API (or wait for the next export interval). + +### Console Exporter Not Showing Spans + +1. Verify `Observability:EnableConsoleExporter` is `true`. +2. Verify `Observability:EnableOpenTelemetry` is `true`. +3. Check that the `Taskdeck.Api` activity source name matches the configured source in the telemetry setup. 
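When working through step 3 above, the activity source registration can be located with a repository search; a sketch, assuming the instrumentation lives under the `Telemetry/` directory referenced in Related Documents:

```shell
# Locate where the ActivitySource is created and where the telemetry setup
# registers it (e.g., via AddSource) -- the two names must match exactly
grep -rn "ActivitySource" backend/src/Taskdeck.Api/Telemetry/
grep -rn "AddSource" backend/src/Taskdeck.Api/
```

A mismatch between the created source name and the registered source name silently drops every span from that source, which is exactly the failure mode this scenario rehearses.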
+ +## Evidence Checklist + +- [ ] Captured output showing the presence or absence of `X-Request-Id` in response headers +- [ ] Console exporter output showing trace spans with or without `taskdeck.correlation_id` +- [ ] Configuration values for `Observability:*` settings used during the rehearsal +- [ ] If OTLP was tested: evidence of export failure (log lines or connection errors) +- [ ] Commands used to diagnose the telemetry pipeline +- [ ] Recovery steps taken and verification of restored telemetry +- [ ] Any findings about error visibility when telemetry export silently fails + +## Related Documents + +- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry contract and expected attributes +- `backend/src/Taskdeck.Api/Telemetry/` -- custom telemetry instrumentation +- `backend/src/Taskdeck.Api/appsettings.json` -- observability configuration diff --git a/docs/ops/rehearsals/2026-03-29_degraded-api-health.md b/docs/ops/rehearsals/2026-03-29_degraded-api-health.md new file mode 100644 index 000000000..8929bcba0 --- /dev/null +++ b/docs/ops/rehearsals/2026-03-29_degraded-api-health.md @@ -0,0 +1,122 @@ +# Rehearsal Evidence: Degraded API Health + +## Metadata + +| Field | Value | +| --- | --- | +| Date | 2026-03-29 | +| Rehearsal type | Monthly (initial program setup) | +| Scenario | `docs/ops/rehearsal-scenarios/degraded-api-health.md` | +| Lead | @Chris0Jeky | +| Participants | @Chris0Jeky | +| Commit SHA | `440a8c9dbfe63a9d631437ca97069ea34e51c81e` | +| OS / Environment | Windows 10 Pro 10.0.19045, .NET 8, SQLite | +| Duration | ~15 minutes | +| Outcome | Partial | + +## Timeline + +| Timestamp (UTC) | Actor | Action / Observation | +| --- | --- | --- | +| 2026-03-29T03:27:00Z | @Chris0Jeky | Verified backend builds successfully (`dotnet build -c Release`, 0 warnings, 0 errors) | +| 2026-03-29T03:27:30Z | @Chris0Jeky | Ran `HealthApiTests` -- 3/3 passing (Live, Ready, CaptureBacklogExclusion) | +| 2026-03-29T03:28:08Z | @Chris0Jeky | Started API on port 5099, 
confirmed healthy baseline | +| 2026-03-29T03:28:11Z | @Chris0Jeky | Captured `/health/live` response: `{"status":"Healthy"}` | +| 2026-03-29T03:28:11Z | @Chris0Jeky | Captured `/health/ready` response: HTTP 200, all checks Healthy | +| 2026-03-29T03:28:12Z | @Chris0Jeky | Stopped API, attempted injection Option A (invalid DB path) | +| 2026-03-29T03:28:29Z | @Chris0Jeky | Started API with `ConnectionStrings__DefaultConnection="Data Source=/nonexistent/path/taskdeck.db"` | +| 2026-03-29T03:28:29Z | @Chris0Jeky | **Finding**: API started healthy -- SQLite auto-created database at mapped path. Invalid Unix-style path was silently resolved on Windows | +| 2026-03-29T03:28:40Z | @Chris0Jeky | Attempted injection via `Workers__EnableAutoQueueProcessing=false` to disable queue worker | +| 2026-03-29T03:28:42Z | @Chris0Jeky | API started, all checks showed Healthy. Queue worker still heartbeating (launchSettings.json may override env vars) | +| 2026-03-29T03:29:17Z | @Chris0Jeky | After 35s wait (past startup grace), all workers still Healthy. `proposalHousekeeping` staleness=8s (max=180s), `queueToProposal` staleness=2.9s (max=30s) | +| 2026-03-29T03:29:20Z | @Chris0Jeky | Concluded rehearsal. 
Documented findings | + +## Commands Run + +```bash +# Build verification +dotnet build backend/Taskdeck.sln -c Release + +# Run health tests +dotnet test backend/Taskdeck.sln -c Release --filter "FullyQualifiedName~HealthApiTests" + +# Healthy baseline +dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099" +curl -s http://localhost:5099/health/live +curl -s http://localhost:5099/health/ready + +# Injection attempt A: invalid DB path +ConnectionStrings__DefaultConnection="Data Source=/nonexistent/path/taskdeck.db" \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099" +curl -s http://localhost:5099/health/ready + +# Injection attempt B: disable queue processing +Workers__EnableAutoQueueProcessing=false \ + dotnet run --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099" +curl -s http://localhost:5099/health/ready +# Waited 35s for startup grace to expire, checked again +curl -s http://localhost:5099/health/ready +``` + +## Log Excerpts + +Healthy baseline `/health/ready` response: +```json +{ + "status": "Ready", + "timestamp": "2026-03-29T03:28:11.466Z", + "checks": { + "database": { "status": "Healthy" }, + "queue": { "status": "Healthy", "depth": 0, "totalDepth": 0, "captureDepth": 0, "threshold": 100 }, + "workers": { + "queueToProposal": { "status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:09.177Z", "stalenessSeconds": 2.28, "maxStalenessSeconds": 30 }, + "proposalHousekeeping": { "status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:09.506Z", "stalenessSeconds": 1.96, "maxStalenessSeconds": 180 } + } + } +} +``` + +Invalid DB path response (still healthy): +```json +{ + "status": "Ready", + "timestamp": "2026-03-29T03:28:29.821Z", + "checks": { + "database": { "status": "Healthy" }, + "queue": { "status": "Healthy", "depth": 0, "totalDepth": 0, "captureDepth": 0, "threshold": 100 }, + "workers": { + "queueToProposal": { 
"status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:29.586Z", "stalenessSeconds": 0.25, "maxStalenessSeconds": 30 }, + "proposalHousekeeping": { "status": "Healthy", "lastHeartbeat": "2026-03-29T03:28:09.506Z", "stalenessSeconds": 20.31, "maxStalenessSeconds": 180 } + } + } +} +``` + +## Root Cause / Diagnosis Summary + +The rehearsal targeted Option A (database connectivity fault) and Option B-adjacent (worker heartbeat staleness). Two key observations: + +1. **SQLite auto-creation resilience**: Setting `ConnectionStrings__DefaultConnection` to a non-existent Unix-style path (`/nonexistent/path/taskdeck.db`) did not degrade the health endpoint. On Windows, the path was silently resolved (likely to a relative path), and SQLite's `CanConnectAsync` succeeded because the provider auto-creates the database file. This means the database check is resilient to missing-DB scenarios but may mask genuine connection string misconfiguration. + +2. **Environment variable override vs launchSettings**: The `Workers__EnableAutoQueueProcessing=false` environment variable did not visibly change worker behavior when `launchSettings.json` was in use (via `dotnet run`). The `--no-launch-profile` flag would be needed to ensure environment variables take precedence. + +## Recovery Actions Taken + +No recovery was needed -- the system never entered a degraded state due to the resilience of the SQLite provider and the launchSettings override behavior. + +## Findings + +- [ ] Finding 1: SQLite `CanConnectAsync` succeeds even with a non-existent file path because the provider auto-creates the database. The health check's database connectivity test does not distinguish between "connected to the intended database" and "connected to an auto-created empty database." -- Severity: P3 -- Issue: to be filed +- [ ] Finding 2: Environment variable overrides for worker settings are ineffective when using `dotnet run` with `launchSettings.json`. 
The scenario documentation should specify `--no-launch-profile` for reliable fault injection via environment variables. -- Severity: P3 -- Issue: documented in scenario update +- [ ] Finding 3: The degraded-api-health scenario's Option A (invalid DB path) is not reliably reproducible on Windows due to Unix-to-Windows path resolution. The scenario should include Windows-specific injection guidance. -- Severity: P4 -- Issue: documented in scenario update + +## Sign-Off + +| Role | Name | Date | Approved | +| --- | --- | --- | --- | +| Rehearsal lead | @Chris0Jeky | 2026-03-29 | [x] | + +## Follow-Up Issues + +- Scenario documentation updated with findings (same PR) +- P3 finding about SQLite auto-creation masking connection errors should be tracked in a future hardening issue
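As a concrete follow-up to Finding 2, a re-run of injection attempt B could bypass the launch profile so the shell environment variable actually takes effect; a sketch reusing this rehearsal's own commands:

```shell
# Injection B retry: --no-launch-profile prevents launchSettings.json from
# overriding the Workers__EnableAutoQueueProcessing environment variable
Workers__EnableAutoQueueProcessing=false \
  dotnet run --no-launch-profile --project backend/src/Taskdeck.Api/Taskdeck.Api.csproj -- --urls "http://localhost:5099"

# After the startup grace period, the queue worker heartbeat should go stale
curl -s http://localhost:5099/health/ready
```

With the profile bypassed, the expected observation is `queueToProposal` staleness exceeding its 30s maximum, which would complete the degraded-state reproduction this rehearsal could not achieve.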