
TST-53: Resilience and degraded-mode behavior tests — what happens when things break #720

@Chris0Jeky

Description

Context

Real-world systems degrade partially. A database lock, a slow LLM provider, or a full disk shouldn't crash the entire app. This issue covers testing the system's behavior when internal dependencies fail — verifying graceful degradation rather than cascading failures.

Backend Resilience Scenarios

Database Failures

  1. SQLite database locked (concurrent writer): API returns 503 or queues a retry; it doesn't crash. The capture endpoint specifically should queue locally if the DB is temporarily unavailable.
  2. Database full (disk space): Write operations return 500 with a clear error; read operations still work.
  3. Database corrupted: Startup fails gracefully with actionable error message (not stack trace). Health endpoint returns unhealthy.
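The locked-database expectation for the capture endpoint can be modeled language-neutrally. A minimal TypeScript sketch of the intended behavior (the real backend is .NET; `saveCapture`, `localQueue`, and the lock-message check are all hypothetical illustrations, not the project's API):

```typescript
// Sketch: capture handler that queues locally when the DB reports a lock,
// instead of crashing. All names here are illustrative.
type Capture = { id: string; text: string };

const localQueue: Capture[] = [];

// Fake persistence layer that simulates SQLite's lock error.
function saveCapture(c: Capture, dbLocked: boolean): void {
  if (dbLocked) throw new Error("database is locked"); // SQLite's lock message
  /* write to DB */
}

// Handler: returns an HTTP-ish status instead of propagating the exception.
function handleCapture(c: Capture, dbLocked: boolean): number {
  try {
    saveCapture(c, dbLocked);
    return 201; // created
  } catch (e) {
    if (e instanceof Error && e.message.includes("database is locked")) {
      localQueue.push(c); // queue for replay when the DB frees up
      return 503;         // service unavailable; client may retry later
    }
    throw e; // unexpected failures still surface
  }
}

const status = handleCapture({ id: "1", text: "note" }, true);
// status is 503 and the capture sits in localQueue awaiting replay
```

The test's job is to assert the 503-plus-queue path, and that the happy path is unaffected when the lock clears.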

Worker Resilience

  1. Worker exception in main loop: Single batch failure doesn't kill the worker — it logs and continues on next poll.
  2. Worker database access fails: Worker retries on next cycle, doesn't accumulate zombie items.
  3. All workers stopped: Queue items accumulate but don't corrupt. When workers restart, items are processed.
  4. Worker cancellation: `CancellationToken` fires → worker shuts down cleanly, with no items left stranded in the Processing state.
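Scenario 1 above reduces to a loop invariant: a per-batch try/catch so one failure never escapes the poll loop. A language-neutral TypeScript sketch (the real worker is a .NET background service; `runWorker` and the batch shape are illustrative):

```typescript
// Sketch: worker loop where one failing batch doesn't kill the loop.
type Batch = { id: number };

const workerLog: string[] = [];

function runWorker(batches: Batch[], process: (b: Batch) => void): number {
  let handled = 0;
  for (const batch of batches) {   // stands in for the poll loop
    try {
      process(batch);
      handled++;
    } catch (e) {
      // log and continue on the next poll instead of crashing the worker
      workerLog.push(`batch ${batch.id} failed: ${(e as Error).message}`);
    }
  }
  return handled;
}

const handled = runWorker(
  [{ id: 1 }, { id: 2 }, { id: 3 }],
  (b) => { if (b.id === 2) throw new Error("boom"); },
);
// handled === 2: the failure on batch 2 is logged and batch 3 still runs
```

The corresponding test asserts exactly this: the failure is recorded once and processing reached the batch after it.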

LLM Provider Degradation

  1. Provider timeout: Chat/triage requests time out → user gets a degraded response, not an infinite spinner.
  2. Provider returns garbage: Invalid JSON → handled as a degraded response with a hint.
  3. Provider rate limited (429): Backoff applied, user notified with the expected wait time.
  4. All providers unavailable: System continues working for non-LLM features (board CRUD, capture storage, etc.).
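For the 429 scenario, the backoff schedule is the testable part. A sketch in TypeScript, assuming capped exponential backoff (the base and cap values here are illustrative, not the project's actual retry policy):

```typescript
// Sketch: capped exponential backoff for 429 responses.
// baseMs/capMs are illustrative defaults, not the real configuration.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// What the user-facing notification might derive from the schedule.
function waitMessage(attempt: number): string {
  return `Rate limited; retrying in ${backoffDelayMs(attempt) / 1000}s`;
}

const delays = [0, 1, 2, 3].map((a) => backoffDelayMs(a));
// delays: [500, 1000, 2000, 4000]; large attempts are capped at 30000
```

A test can pin the schedule and the cap, so a refactor that silently drops the cap fails loudly.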

SignalR Degradation

  1. SignalR connection drops: Client auto-reconnects, board state refreshed on reconnect.
  2. SignalR backpressure: 100 events queued → delivered in order when connection resumes.
  3. Hub exception: One client's error doesn't affect other clients on same board.
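The backpressure scenario is an ordering property: events buffered during the outage must come out FIFO on reconnect. A transport-agnostic TypeScript sketch (the real transport is SignalR; `EventBuffer` and the event shape are hypothetical):

```typescript
// Sketch: buffer board events while disconnected, flush in order on reconnect.
type BoardEvent = { seq: number; kind: string };

class EventBuffer {
  private pending: BoardEvent[] = [];
  constructor(private deliver: (e: BoardEvent) => void,
              private connected = false) {}

  publish(e: BoardEvent): void {
    if (this.connected) this.deliver(e);
    else this.pending.push(e); // queue while the connection is down
  }

  reconnect(): void {
    this.connected = true;
    for (const e of this.pending) this.deliver(e); // FIFO flush
    this.pending = [];
  }
}

const delivered: number[] = [];
const buf = new EventBuffer((e) => delivered.push(e.seq));
for (let i = 1; i <= 5; i++) buf.publish({ seq: i, kind: "cardMoved" });
buf.reconnect();
// delivered: [1, 2, 3, 4, 5] — order preserved across the outage
```

The 100-event variant in the scenario is the same assertion with a bigger loop.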

External Service Failures

  1. Webhook target unreachable: Delivery retried per policy, dead-lettered after max retries.
  2. GitHub OAuth endpoint down: Login shows appropriate error, local auth still works.
  3. OpenTelemetry exporter down: App continues without telemetry, no performance impact.
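The webhook scenario has two observable outcomes: bounded attempts, then a dead-letter record. A minimal TypeScript sketch of that contract (`deliverWithRetry`, `deadLetter`, and the retry count are illustrative, not the real delivery policy):

```typescript
// Sketch: webhook delivery with bounded retries and a dead-letter bucket.
type Webhook = { url: string; payload: string };

const deadLetter: Webhook[] = [];

function deliverWithRetry(
  hook: Webhook,
  send: (h: Webhook) => boolean, // true = delivered
  maxRetries = 3,
): boolean {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (send(hook)) return true; // delivered, done
  }
  deadLetter.push(hook); // retries exhausted: park it for operators
  return false;
}

let calls = 0;
const ok = deliverWithRetry(
  { url: "https://unreachable.example", payload: "{}" },
  () => { calls++; return false; }, // target always unreachable
);
// ok === false after 4 attempts (initial + 3 retries); hook is dead-lettered
```

Injecting the `send` function keeps the test deterministic: no network, no timers.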

Frontend Resilience Scenarios

Network Failures

  1. API unreachable: All views show appropriate error states, no white screen.
  2. Slow API (5+ second responses): Loading indicators shown, no duplicate requests from impatient clicks.
  3. Partial API failure: Board list loads but card fetch fails → board renders with error on card area.
  4. Network reconnect: After offline period, pending actions retry automatically.
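The reconnect-retry scenario is, at its core, an offline queue that replays in submission order. A client-side sketch in TypeScript (the real client is Vue/Pinia; `OfflineQueue` and its method names are hypothetical):

```typescript
// Sketch: queue user actions while offline, replay them in order on reconnect.
type Action = () => void;

class OfflineQueue {
  private pending: Action[] = [];
  constructor(public online = true) {}

  dispatch(a: Action): void {
    if (this.online) a();
    else this.pending.push(a); // hold until the network returns
  }

  goOnline(): number {
    this.online = true;
    const n = this.pending.length;
    for (const a of this.pending) a(); // replay in submission order
    this.pending = [];
    return n; // how many actions were retried
  }
}

const applied: string[] = [];
const q = new OfflineQueue(false); // start offline
q.dispatch(() => applied.push("rename card"));
q.dispatch(() => applied.push("move card"));
const retried = q.goOnline();
// retried === 2, both actions applied in the order the user issued them
```

A real implementation would also need idempotency (so a replayed action applied twice is harmless), which this sketch deliberately leaves out.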

State Corruption

  1. Local storage corrupted/cleared: App handles missing tokens gracefully (redirect to login).
  2. Pinia store receives malformed API response: Store rejects bad data, shows error, doesn't overwrite good data with bad.
  3. WebSocket message with unexpected shape: SignalR handler ignores malformed events, doesn't crash.
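Scenario 3 is a shape-validation problem: the handler should drop malformed events, never throw into the hub callback. A TypeScript sketch using a type guard (the `CardMoved` field names are illustrative, not the real event schema):

```typescript
// Sketch: shape guard in front of the store so malformed SignalR events
// are dropped instead of crashing the handler.
type CardMoved = { boardId: string; cardId: string; toColumn: string };

function isCardMoved(msg: unknown): msg is CardMoved {
  return (
    typeof msg === "object" && msg !== null &&
    typeof (msg as CardMoved).boardId === "string" &&
    typeof (msg as CardMoved).cardId === "string" &&
    typeof (msg as CardMoved).toColumn === "string"
  );
}

const ignored: unknown[] = [];
function onCardMoved(msg: unknown, apply: (m: CardMoved) => void): void {
  if (isCardMoved(msg)) apply(msg);
  else ignored.push(msg); // log-and-drop, never throw into the handler
}

const applied: CardMoved[] = [];
onCardMoved({ boardId: "b1", cardId: "c1", toColumn: "Done" },
            (m) => applied.push(m));
onCardMoved({ totally: "wrong" }, (m) => applied.push(m));
// one event applied, one malformed event ignored without an exception
```

The same guard shields the Pinia-store scenario above it: validate before committing, so bad data never overwrites good state.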

Concurrent User Actions

  1. Two browser tabs open: Actions in one tab reflected in other (via SignalR), no conflicts.
  2. Tab left idle for 1 hour: Returning to tab shows stale-data indicator, refresh available.

Implementation Notes

Backend

  • For database failures: use a test SQLite instance and simulate locks via concurrent write transactions
  • For worker resilience: inject failing services via DI, verify worker stays alive
  • For provider degradation: mock providers to throw specific exceptions
  • For cancellation: use `CancellationTokenSource.Cancel()` in test setup

Frontend

  • For network failures: use Playwright `page.route()` to intercept and fail specific API calls
  • For slow API: use `page.route()` with delayed responses
  • For state corruption: directly manipulate `localStorage` or store state before action

Success Criteria

  • No unhandled exceptions from any failure scenario
  • No data loss from any failure scenario
  • Clear error messages visible to user in every failure scenario
  • System self-recovers when the failed dependency comes back
  • Health endpoint accurately reflects system state during degradation

Considerations

  • Many of these test the "boring" paths that never get tested but cause the most production incidents
  • Some scenarios (database corruption, disk full) are hard to test in CI — consider running them in a nightly job with Docker
  • The failure-injection drill suite (docs/ops/FAILURE_INJECTION_DRILLS.md) covers some of these at the ops level — these tests complement that at the code level
  • This issue pairs well with the existing incident rehearsal cadence (docs/ops/INCIDENT_REHEARSAL_CADENCE.md)
