Context
Real-world systems degrade partially. A database lock, a slow LLM provider, or a full disk shouldn't crash the entire app. This issue covers testing the system's behavior when internal dependencies fail — verifying graceful degradation rather than cascading failures.
Backend Resilience Scenarios
Database Failures
- SQLite database locked (concurrent writer): API returns 503 or queues a retry; it doesn't crash. The capture endpoint specifically should queue locally if the DB is temporarily unavailable.
- Database full (disk space): Write operations return 500 with clear error, read operations still work.
- Database corrupted: Startup fails gracefully with actionable error message (not stack trace). Health endpoint returns unhealthy.
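The capture-endpoint fallback above can be sketched as a write path that diverts to a local queue when the database reports a lock. This is an illustrative TypeScript sketch (the real backend is .NET); `DbWrite`, `localQueue`, and the "locked" error match are assumptions, not the actual API:

```typescript
// Sketch: fall back to a local queue when the DB reports a lock.
// `DbWrite` and the "locked" error shape are assumptions for illustration.
type DbWrite = (item: string) => Promise<void>;

const localQueue: string[] = [];

async function capture(item: string, write: DbWrite): Promise<"stored" | "queued"> {
  try {
    await write(item);
    return "stored";
  } catch (err) {
    if (err instanceof Error && /locked/i.test(err.message)) {
      localQueue.push(item); // retried later, once the DB is free
      return "queued";       // caller maps this to a 202/503, not a crash
    }
    throw err; // unexpected errors still surface
  }
}
```

The key property under test: a lock produces a queued item and a non-crash status, while any other error still propagates.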
Worker Resilience
- Worker exception in main loop: Single batch failure doesn't kill the worker — it logs and continues on next poll.
- Worker database access fails: Worker retries on next cycle, doesn't accumulate zombie items.
- All workers stopped: Queue items accumulate but don't corrupt. When workers restart, items are processed.
- Worker cancellation: `CancellationToken` fires → worker shuts down cleanly, with no items stranded in the `Processing` state.
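The loop shape these scenarios assume can be sketched as follows. This is a hedged TypeScript model of the worker (the real worker is .NET with `CancellationToken`; `AbortSignal` stands in for it, and `processBatch` is a hypothetical name):

```typescript
// Sketch: a poll loop where one failing batch doesn't kill the worker,
// and cancellation (modeled with AbortSignal) exits cleanly.
async function runWorker(
  processBatch: () => Promise<void>,
  signal: AbortSignal,
  pollMs = 0,
): Promise<number> {
  let batches = 0;
  while (!signal.aborted) {
    try {
      await processBatch();
      batches++;
    } catch (err) {
      // Log and continue: the next poll gets a fresh chance.
      console.error("batch failed:", err);
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return batches; // clean shutdown: no batch left half-done
}
```

The tests then assert two things: a thrown batch doesn't stop subsequent batches, and aborting ends the loop rather than killing the process.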
LLM Provider Degradation
- Provider timeout: Chat/triage requests timeout → user gets degraded response, not infinite spinner.
- Provider returns garbage: Invalid JSON → handled as degraded response with hint.
- Provider rate limited (429): Backoff applied, user notified with wait time.
- All providers unavailable: System continues working for non-LLM features (board CRUD, capture storage, etc.).
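The 429 behavior can be sketched as a retry wrapper that honors the provider's wait hint and otherwise backs off exponentially. `RateLimitError`, `retryAfterMs`, and the delay constants are assumptions for illustration, not the real provider client:

```typescript
// Sketch: retry with exponential backoff on rate limiting (429).
class RateLimitError extends Error {
  constructor(public retryAfterMs: number) { super("rate limited (429)"); }
}

async function withBackoff<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!(err instanceof RateLimitError)) throw err; // only back off on 429
      if (attempt >= maxRetries) throw err;            // give up, surface to user
      // Honor the provider's hint if present, otherwise back off exponentially.
      const delay = err.retryAfterMs || baseMs * 2 ** attempt;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

A real implementation would also surface the computed wait time to the user, per the scenario above.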
SignalR Degradation
- SignalR connection drops: Client auto-reconnects, board state refreshed on reconnect.
- SignalR backpressure: 100 events queued → delivered in order when connection resumes.
- Hub exception: One client's error doesn't affect other clients on same board.
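The per-client isolation scenario boils down to a broadcast that catches each connection's failure individually. A minimal sketch, where `Client` is a stand-in for a SignalR connection rather than the real hub API:

```typescript
// Sketch: broadcast where one client's failure doesn't affect the rest.
interface Client { send(event: string): void; }

function broadcast(clients: Client[], event: string): number {
  let delivered = 0;
  for (const client of clients) {
    try {
      client.send(event);
      delivered++;
    } catch {
      // Isolate the faulty connection; remaining clients still get the event.
    }
  }
  return delivered;
}
```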
External Service Failures
- Webhook target unreachable: Delivery retried per policy, dead-lettered after max retries.
- GitHub OAuth endpoint down: Login shows appropriate error, local auth still works.
- OpenTelemetry exporter down: App continues without telemetry, no performance impact.
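The webhook retry/dead-letter policy can be sketched as follows. `deliver`, `maxAttempts`, and the in-memory dead-letter array are illustrative assumptions; a real policy would also delay between attempts:

```typescript
// Sketch: retry webhook delivery, dead-letter after max attempts.
const deadLetter: string[] = [];

async function deliverWithRetry(
  payload: string,
  deliver: (p: string) => Promise<void>,
  maxAttempts = 3,
): Promise<"delivered" | "dead-lettered"> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await deliver(payload);
      return "delivered";
    } catch {
      // fall through to the next attempt (a real policy would back off here)
    }
  }
  deadLetter.push(payload); // kept for inspection/replay, never silently dropped
  return "dead-lettered";
}
```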
Frontend Resilience Scenarios
Network Failures
- API unreachable: All views show appropriate error states, no white screen.
- Slow API (5+ second responses): Loading indicators shown, no duplicate requests from impatient clicks.
- Partial API failure: Board list loads but card fetch fails → board renders with error on card area.
- Network reconnect: After offline period, pending actions retry automatically.
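The reconnect-retry scenario implies a client-side queue that holds actions while offline and replays them in order when the network returns. A minimal sketch, assuming a hypothetical `OfflineQueue` with an injected `send` callback (not the real client code):

```typescript
// Sketch: queue actions while offline, replay them in order on reconnect.
class OfflineQueue {
  private pending: string[] = [];
  constructor(private send: (action: string) => void, private online = true) {}

  dispatch(action: string): void {
    if (this.online) this.send(action);
    else this.pending.push(action); // held until the network returns
  }

  setOnline(online: boolean): void {
    this.online = online;
    if (online) {
      // Replay in order; a real client would also handle replay failures.
      for (const action of this.pending.splice(0)) this.send(action);
    }
  }
}
```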
State Corruption
- Local storage corrupted/cleared: App handles missing tokens gracefully (redirect to login).
- Pinia store receives malformed API response: Store rejects bad data, shows error, doesn't overwrite good data with bad.
- WebSocket message with unexpected shape: SignalR handler ignores malformed events, doesn't crash.
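The last two scenarios share one defense: validate the shape of incoming data before it touches the store. A sketch using a hypothetical `CardMoved` event (not the real message schema):

```typescript
// Sketch: validate an incoming event's shape before applying it to the store.
interface CardMoved { cardId: string; columnId: string; }

function parseCardMoved(raw: unknown): CardMoved | null {
  if (typeof raw !== "object" || raw === null) return null;
  const o = raw as Record<string, unknown>;
  if (typeof o.cardId !== "string" || typeof o.columnId !== "string") return null;
  return { cardId: o.cardId, columnId: o.columnId };
}

function handleEvent(raw: unknown, apply: (e: CardMoved) => void): boolean {
  const event = parseCardMoved(raw);
  if (event === null) return false; // ignore malformed events; don't crash, don't overwrite
  apply(event);
  return true;
}
```

In a real codebase a schema library would likely replace the hand-rolled check, but the invariant is the same: malformed input is dropped before it can displace good state.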
Concurrent User Actions
- Two browser tabs open: Actions in one tab reflected in other (via SignalR), no conflicts.
- Tab left idle for 1 hour: Returning to tab shows stale-data indicator, refresh available.
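The idle-tab check is a simple timestamp comparison; the sketch below uses the one-hour threshold from the scenario above (names are assumptions):

```typescript
// Sketch: decide whether to show a stale-data indicator after idle time.
const STALE_AFTER_MS = 60 * 60 * 1000; // 1 hour, per the scenario above

function isStale(lastSyncMs: number, nowMs: number): boolean {
  return nowMs - lastSyncMs >= STALE_AFTER_MS;
}
```

Taking `now` as a parameter (rather than calling `Date.now()` inside) keeps the check trivially testable.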
Implementation Notes
Backend
- For database failures: use a test SQLite instance and simulate locks via concurrent write transactions
- For worker resilience: inject failing services via DI, verify worker stays alive
- For provider degradation: mock providers to throw specific exceptions
- For cancellation: use `CancellationTokenSource.Cancel()` in test setup
Frontend
- For network failures: use Playwright `page.route()` to intercept and fail specific API calls
- For slow API: use `page.route()` with delayed responses
- For state corruption: directly manipulate `localStorage` or store state before action
Success Criteria
- No unhandled exceptions from any failure scenario
- No data loss from any failure scenario
- Clear error messages visible to user in every failure scenario
- System self-recovers when the failed dependency comes back
- Health endpoint accurately reflects system state during degradation
Considerations
- Many of these test the "boring" paths that never get tested but cause the most production incidents
- Some scenarios (database corruption, disk full) are hard to test in CI — consider running them in a nightly job with Docker
- The failure-injection drill suite (docs/ops/FAILURE_INJECTION_DRILLS.md) covers some of these at the ops level — these tests complement that at the code level
- This issue pairs well with the existing incident rehearsal cadence (docs/ops/INCIDENT_REHEARSAL_CADENCE.md)