OPS-19: Incident rehearsal and recovery evidence program #503
Chris0Jeky merged 10 commits into main
Define monthly lightweight (~30 min) and quarterly deep drill (~2 hour) rehearsal schedule with rotation model and calendar integration guidance. Part of OPS-19 (#150).
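A minimal sketch of how the cadence above could be turned into a yearly plan. It assumes the quarterly deep drill takes the place of the lightweight rehearsal in the last month of each quarter and that owners rotate round-robin; neither detail is stated in the commit, and `rehearsal_schedule` and the owner names are hypothetical.

```python
from datetime import date
from itertools import cycle


def rehearsal_schedule(year, owners):
    """Return one rehearsal entry per month for the given year."""
    rotation = cycle(owners)  # assumed round-robin rotation model
    plan = []
    for month in range(1, 13):
        # Assumption: the deep drill lands in the last month of each quarter.
        kind = "deep (~2 hours)" if month % 3 == 0 else "lightweight (~30 min)"
        plan.append(
            {"month": date(year, month, 1), "kind": kind, "owner": next(rotation)}
        )
    return plan


if __name__ == "__main__":
    for entry in rehearsal_schedule(2026, ["alice", "bob", "carol"]):
        print(entry["month"].strftime("%Y-%m"), entry["kind"], entry["owner"])
```

The resulting list can be exported to whatever calendar tooling the team uses; the export step is left out because the commit does not name one.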
Define required format for rehearsal evidence: timeline with ISO timestamps, commands run, log excerpts, root cause, recovery actions, findings, and sign-off section. Part of OPS-19 (#150).
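The required fields listed above could be modelled roughly as follows. This is a sketch, not the format defined in the PR: the class names, field names, and the `validate` helper are hypothetical, and only the list of sections comes from the commit message.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    timestamp: str  # ISO 8601, for example "2026-01-15T10:00:00+00:00"
    event: str


@dataclass
class RehearsalEvidence:
    timeline: list
    commands_run: list
    log_excerpts: list
    root_cause: str
    recovery_actions: list
    findings: list
    sign_off: str


def validate(evidence):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for entry in evidence.timeline:
        try:
            datetime.fromisoformat(entry.timestamp)
        except ValueError:
            problems.append(f"non-ISO timestamp: {entry.timestamp!r}")
    if not evidence.root_cause.strip():
        problems.append("root cause is empty")
    if not evidence.sign_off.strip():
        problems.append("sign-off is missing")
    return problems
```

A check like this could run before an evidence document is accepted, so that missing sign-off or free-form timestamps are caught early.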
Define issue filing conventions for rehearsal findings: label taxonomy (rehearsal-finding + severity P1-P4), SLA expectations, bidirectional linking between evidence and issues. Part of OPS-19 (#150).
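As an illustration of the label taxonomy and bidirectional linking, the sketch below builds the label set for a finding and the two cross-reference lines. The helper names, the wording of the reference lines, and the evidence path in the example are hypothetical; the commit only fixes the `rehearsal-finding` label and the P1-P4 severities. SLA values are omitted because the commit message does not state them.

```python
SEVERITIES = ("P1", "P2", "P3", "P4")


def finding_labels(severity):
    """Labels to apply to an issue filed from a rehearsal finding."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return ["rehearsal-finding", severity]


def cross_references(evidence_path, issue_number):
    """Return (line for the issue body, line for the evidence document)."""
    issue_line = f"Rehearsal evidence: {evidence_path}"
    evidence_line = f"Tracked in #{issue_number}"
    return issue_line, evidence_line


if __name__ == "__main__":
    print(finding_labels("P2"))
    # The path below is a made-up example, not a file from the PR.
    print(cross_references("docs/rehearsals/example.md", 150))
```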
Three injection options: database connectivity fault, worker heartbeat staleness, and queue backlog overload. Includes diagnosis path referencing actual HealthController checks and recovery steps. Part of OPS-19 (#150).
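The three injected faults map onto three checks that a drill participant might evaluate during diagnosis. The sketch below is not the HealthController logic; the function name, the 60-second heartbeat threshold, and the queue depth limit of 1000 are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def failing_checks(
    db_reachable,
    last_heartbeat,
    queue_depth,
    now,
    max_heartbeat_age=timedelta(seconds=60),  # hypothetical threshold
    max_queue_depth=1000,  # hypothetical threshold
):
    """Return the names of the checks that currently fail."""
    failures = []
    if not db_reachable:
        failures.append("database")
    if now - last_heartbeat > max_heartbeat_age:
        failures.append("worker-heartbeat")
    if queue_depth > max_queue_depth:
        failures.append("queue-backlog")
    return failures


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(failing_checks(True, now - timedelta(minutes=5), 10, now))
```

Recording which of these checks tripped, and in what order, gives the evidence timeline a concrete diagnosis path.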
Covers correlation ID absence from traces, OTLP endpoint misconfiguration, and console exporter verification. References actual OpenTelemetry attributes from OBSERVABILITY_BASELINE.md. Part of OPS-19 (#150).
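One way to confirm the "correlation ID absent from traces" fault is to scan exported spans for the attribute. In this sketch spans are plain dictionaries, as they might look after parsing exporter output, and the attribute name `correlation.id` is a placeholder; the real attribute names are the ones listed in OBSERVABILITY_BASELINE.md.

```python
def spans_missing_attribute(spans, attribute="correlation.id"):
    """Return the names of spans that do not carry the given attribute."""
    return [
        span["name"]
        for span in spans
        if attribute not in span.get("attributes", {})
    ]


if __name__ == "__main__":
    sample = [
        {"name": "GET /health", "attributes": {"correlation.id": "abc"}},
        {"name": "worker.tick", "attributes": {}},
    ]
    print(spans_missing_attribute(sample))
```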
Covers invalid command, missing API key, and port conflict injection. Verifies MCP failure isolation from core API health. References MCP_TOOLING_GUIDE.md fallback policy. Part of OPS-19 (#150).
Covers missing env vars, invalid DB path, port conflicts, and corrupted Dockerfile. References actual docker-compose.yml services and .env requirements. Part of OPS-19 (#150).
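For the missing-env-var scenario, a pre-flight check along these lines could show which required variables are absent or empty before `docker compose` is started. The parser handles only simple `KEY=value` lines, and the variable names in the example are hypothetical; the actual requirements are whatever the repository's `.env` documentation lists.

```python
def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(env_text, required):
    """Return required names that are absent or set to an empty value."""
    values = parse_env(env_text)
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_text = "# sample\nAPI_KEY=\nPORT=8080\n"
    # EXAMPLE names only; not taken from the project's .env file.
    print(missing_vars(env_text, ["API_KEY", "PORT", "DB_URL"]))
```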
Exercised health endpoints against live codebase. Key findings:
- SQLite auto-creation masks connection string errors in health check
- Environment variable overrides need --no-launch-profile for dotnet run
- Windows path resolution differs from Unix for fault injection

Part of OPS-19 (#150).
Add cross-references to the incident rehearsal cadence, scenario templates, evidence format, and completed rehearsals. Part of OPS-19 (#150).
Adversarial Self-Review
Reviewed all 10 files in the diff (1060 lines). Findings:
Verified OK
Findings to note (not blocking)
Verdict
No blocking issues. The documents are grounded in the actual codebase, cross-references are valid, and the rehearsal evidence is genuine. The "Partial" outcome is a feature, not a bug -- it surfaced real findings about fault injection reliability on this stack.
Adversarial Review of PR #503
What I Verified
Issues Found
1. FACTUAL ERROR:
TASKDECK_DB_PATH does not exist in the codebase. The Docker Compose file uses ConnectionStrings__DefaultConnection directly. Update the scenario to use the correct environment variable override.
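A sketch of how the corrected override might be injected when running the API outside Docker. It relies on two documented .NET behaviours: a double underscore in an environment variable name maps to the `:` separator in configuration keys, and `dotnet run --no-launch-profile` skips `launchSettings.json`. The helper name and the example connection string are hypothetical, and the command is assumed to run from the API project directory.

```python
import os


def launch_spec(connection_string, base_env=None):
    """Build the command and environment for a fault-injected run."""
    env = dict(os.environ if base_env is None else base_env)
    # Maps to ConnectionStrings:DefaultConnection in .NET configuration.
    env["ConnectionStrings__DefaultConnection"] = connection_string
    # --no-launch-profile keeps launchSettings.json from replacing the override.
    command = ["dotnet", "run", "--no-launch-profile"]
    return command, env


if __name__ == "__main__":
    # Hypothetical invalid path used purely as an injected fault.
    command, env = launch_spec("Data Source=/nonexistent/dir/example.db")
    print(command, env["ConnectionStrings__DefaultConnection"])
    # To execute: subprocess.run(command, env=env, check=False)
```

Because SQLite can create a missing database file on its own, a path inside a directory that does not exist is a more dependable fault than a merely missing file, though the exact behaviour on this stack should be confirmed during the rehearsal.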
Summary
Closes #150
rehearsal-finding, P1-P4 severity), SLA expectations, and bidirectional evidence-to-issue linking
TESTING_GUIDE.md and MANUAL_TEST_CHECKLIST.md
Test plan