
TST-53: Resilience and degraded-mode behavior tests — what happens when things break #720

@Chris0Jeky

Description

Context

Real-world systems degrade partially. A database lock, a slow LLM provider, or a full disk shouldn't crash the entire app. This issue covers testing the system's behavior when internal dependencies fail — verifying graceful degradation rather than cascading failures.

Backend Resilience Scenarios

Database Failures

  1. SQLite database locked (concurrent writer): API returns 503 or queues a retry; it doesn't crash. The capture endpoint specifically should queue locally if the DB is temporarily unavailable.
  2. Database full (disk space): Write operations return 500 with a clear error; read operations still work.
  3. Database corrupted: Startup fails gracefully with actionable error message (not stack trace). Health endpoint returns unhealthy.
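The locked-database expectation for the capture endpoint can be modeled language-neutrally. A minimal TypeScript sketch of the intended behavior (the real backend is .NET; `saveCapture`, `localQueue`, and the lock-message check are all hypothetical illustrations, not the project's API):

```typescript
// Sketch: capture handler that queues locally when the DB reports a lock,
// instead of crashing. All names here are illustrative.
type Capture = { id: string; text: string };

const localQueue: Capture[] = [];

// Fake persistence layer that simulates SQLite's lock error.
function saveCapture(c: Capture, dbLocked: boolean): void {
  if (dbLocked) throw new Error("database is locked"); // SQLite's lock message
  /* write to DB */
}

// Handler: returns an HTTP-ish status instead of propagating the exception.
function handleCapture(c: Capture, dbLocked: boolean): number {
  try {
    saveCapture(c, dbLocked);
    return 201; // created
  } catch (e) {
    if (e instanceof Error && e.message.includes("database is locked")) {
      localQueue.push(c); // queue for replay when the DB frees up
      return 503;         // service unavailable; client may retry later
    }
    throw e; // unexpected failures still surface
  }
}

const status = handleCapture({ id: "1", text: "note" }, true);
// status is 503 and the capture sits in localQueue awaiting replay
```

The test's job is to assert the 503-plus-queue path, and that the happy path is unaffected when the lock clears.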

Worker Resilience

  1. Worker exception in main loop: Single batch failure doesn't kill the worker — it logs and continues on next poll.
  2. Worker database access fails: Worker retries on next cycle, doesn't accumulate zombie items.
  3. All workers stopped: Queue items accumulate but don't corrupt. When workers restart, items are processed.
  4. Worker cancellation: `CancellationToken` fires → worker shuts down cleanly, with no items left stranded in the Processing state.
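Scenario 1 above reduces to a loop invariant: a per-batch try/catch so one failure never escapes the poll loop. A language-neutral TypeScript sketch (the real worker is a .NET background service; `runWorker` and the batch shape are illustrative):

```typescript
// Sketch: worker loop where one failing batch doesn't kill the loop.
type Batch = { id: number };

const workerLog: string[] = [];

function runWorker(batches: Batch[], process: (b: Batch) => void): number {
  let handled = 0;
  for (const batch of batches) {   // stands in for the poll loop
    try {
      process(batch);
      handled++;
    } catch (e) {
      // log and continue on the next poll instead of crashing the worker
      workerLog.push(`batch ${batch.id} failed: ${(e as Error).message}`);
    }
  }
  return handled;
}

const handled = runWorker(
  [{ id: 1 }, { id: 2 }, { id: 3 }],
  (b) => { if (b.id === 2) throw new Error("boom"); },
);
// handled === 2: the failure on batch 2 is logged and batch 3 still runs
```

The corresponding test asserts exactly this: the failure is recorded once and processing reached the batch after it.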

LLM Provider Degradation

  1. Provider timeout: Chat/triage requests time out → user gets a degraded response, not an infinite spinner.
  2. Provider returns garbage: Invalid JSON → handled as a degraded response with a hint.
  3. Provider rate limited (429): Backoff applied, user notified with the expected wait time.
  4. All providers unavailable: System continues working for non-LLM features (board CRUD, capture storage, etc.).
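For the 429 scenario, the backoff schedule is the testable part. A sketch in TypeScript, assuming capped exponential backoff (the base and cap values here are illustrative, not the project's actual retry policy):

```typescript
// Sketch: capped exponential backoff for 429 responses.
// baseMs/capMs are illustrative defaults, not the real configuration.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// What the user-facing notification might derive from the schedule.
function waitMessage(attempt: number): string {
  return `Rate limited; retrying in ${backoffDelayMs(attempt) / 1000}s`;
}

const delays = [0, 1, 2, 3].map((a) => backoffDelayMs(a));
// delays: [500, 1000, 2000, 4000]; large attempts are capped at 30000
```

A test can pin the schedule and the cap, so a refactor that silently drops the cap fails loudly.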

SignalR Degradation

  1. SignalR connection drops: Client auto-reconnects, board state refreshed on reconnect.
  2. SignalR backpressure: 100 events queued → delivered in order when connection resumes.
  3. Hub exception: One client's error doesn't affect other clients on same board.
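The backpressure scenario is an ordering property: events buffered during the outage must come out FIFO on reconnect. A transport-agnostic TypeScript sketch (the real transport is SignalR; `EventBuffer` and the event shape are hypothetical):

```typescript
// Sketch: buffer board events while disconnected, flush in order on reconnect.
type BoardEvent = { seq: number; kind: string };

class EventBuffer {
  private pending: BoardEvent[] = [];
  constructor(private deliver: (e: BoardEvent) => void,
              private connected = false) {}

  publish(e: BoardEvent): void {
    if (this.connected) this.deliver(e);
    else this.pending.push(e); // queue while the connection is down
  }

  reconnect(): void {
    this.connected = true;
    for (const e of this.pending) this.deliver(e); // FIFO flush
    this.pending = [];
  }
}

const delivered: number[] = [];
const buf = new EventBuffer((e) => delivered.push(e.seq));
for (let i = 1; i <= 5; i++) buf.publish({ seq: i, kind: "cardMoved" });
buf.reconnect();
// delivered: [1, 2, 3, 4, 5] — order preserved across the outage
```

The 100-event variant in the scenario is the same assertion with a bigger loop.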

External Service Failures

  1. Webhook target unreachable: Delivery retried per policy, dead-lettered after max retries.
  2. GitHub OAuth endpoint down: Login shows appropriate error, local auth still works.
  3. OpenTelemetry exporter down: App continues without telemetry, no performance impact.
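The webhook scenario has two observable outcomes: bounded attempts, then a dead-letter record. A minimal TypeScript sketch of that contract (`deliverWithRetry`, `deadLetter`, and the retry count are illustrative, not the real delivery policy):

```typescript
// Sketch: webhook delivery with bounded retries and a dead-letter bucket.
type Webhook = { url: string; payload: string };

const deadLetter: Webhook[] = [];

function deliverWithRetry(
  hook: Webhook,
  send: (h: Webhook) => boolean, // true = delivered
  maxRetries = 3,
): boolean {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (send(hook)) return true; // delivered, done
  }
  deadLetter.push(hook); // retries exhausted: park it for operators
  return false;
}

let calls = 0;
const ok = deliverWithRetry(
  { url: "https://unreachable.example", payload: "{}" },
  () => { calls++; return false; }, // target always unreachable
);
// ok === false after 4 attempts (initial + 3 retries); hook is dead-lettered
```

Injecting the `send` function keeps the test deterministic: no network, no timers.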

Frontend Resilience Scenarios

Network Failures

  1. API unreachable: All views show appropriate error states, no white screen.
  2. Slow API (5+ second responses): Loading indicators shown, no duplicate requests from impatient clicks.
  3. Partial API failure: Board list loads but card fetch fails → board renders with error on card area.
  4. Network reconnect: After offline period, pending actions retry automatically.
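The reconnect-retry scenario is, at its core, an offline queue that replays in submission order. A client-side sketch in TypeScript (the real client is Vue/Pinia; `OfflineQueue` and its method names are hypothetical):

```typescript
// Sketch: queue user actions while offline, replay them in order on reconnect.
type Action = () => void;

class OfflineQueue {
  private pending: Action[] = [];
  constructor(public online = true) {}

  dispatch(a: Action): void {
    if (this.online) a();
    else this.pending.push(a); // hold until the network returns
  }

  goOnline(): number {
    this.online = true;
    const n = this.pending.length;
    for (const a of this.pending) a(); // replay in submission order
    this.pending = [];
    return n; // how many actions were retried
  }
}

const applied: string[] = [];
const q = new OfflineQueue(false); // start offline
q.dispatch(() => applied.push("rename card"));
q.dispatch(() => applied.push("move card"));
const retried = q.goOnline();
// retried === 2, both actions applied in the order the user issued them
```

A real implementation would also need idempotency (so a replayed action applied twice is harmless), which this sketch deliberately leaves out.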

State Corruption

  1. Local storage corrupted/cleared: App handles missing tokens gracefully (redirect to login).
  2. Pinia store receives malformed API response: Store rejects bad data, shows error, doesn't overwrite good data with bad.
  3. WebSocket message with unexpected shape: SignalR handler ignores malformed events, doesn't crash.
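Scenario 3 is a shape-validation problem: the handler should drop malformed events, never throw into the hub callback. A TypeScript sketch using a type guard (the `CardMoved` field names are illustrative, not the real event schema):

```typescript
// Sketch: shape guard in front of the store so malformed SignalR events
// are dropped instead of crashing the handler.
type CardMoved = { boardId: string; cardId: string; toColumn: string };

function isCardMoved(msg: unknown): msg is CardMoved {
  return (
    typeof msg === "object" && msg !== null &&
    typeof (msg as CardMoved).boardId === "string" &&
    typeof (msg as CardMoved).cardId === "string" &&
    typeof (msg as CardMoved).toColumn === "string"
  );
}

const ignored: unknown[] = [];
function onCardMoved(msg: unknown, apply: (m: CardMoved) => void): void {
  if (isCardMoved(msg)) apply(msg);
  else ignored.push(msg); // log-and-drop, never throw into the handler
}

const applied: CardMoved[] = [];
onCardMoved({ boardId: "b1", cardId: "c1", toColumn: "Done" },
            (m) => applied.push(m));
onCardMoved({ totally: "wrong" }, (m) => applied.push(m));
// one event applied, one malformed event ignored without an exception
```

The same guard shields the Pinia-store scenario above it: validate before committing, so bad data never overwrites good state.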

Concurrent User Actions

  1. Two browser tabs open: Actions in one tab reflected in other (via SignalR), no conflicts.
  2. Tab left idle for 1 hour: Returning to tab shows stale-data indicator, refresh available.

Implementation Notes

Backend

  • For database failures: use a test SQLite instance and simulate locks via concurrent write transactions
  • For worker resilience: inject failing services via DI, verify worker stays alive
  • For provider degradation: mock providers to throw specific exceptions
  • For cancellation: use `CancellationTokenSource.Cancel()` in test setup

Frontend

  • For network failures: use Playwright `page.route()` to intercept and fail specific API calls
  • For slow API: use `page.route()` with delayed responses
  • For state corruption: directly manipulate `localStorage` or store state before action

Success Criteria

  • No unhandled exceptions from any failure scenario
  • No data loss from any failure scenario
  • Clear error messages visible to user in every failure scenario
  • System self-recovers when the failed dependency comes back
  • Health endpoint accurately reflects system state during degradation

Considerations

  • Many of these test the "boring" paths that never get tested but cause the most production incidents
  • Some scenarios (database corruption, disk full) are hard to test in CI — consider running them in a nightly job with Docker
  • The failure-injection drill suite (docs/ops/FAILURE_INJECTION_DRILLS.md) covers some of these at the ops level — these tests complement that at the code level
  • This issue pairs well with the existing incident rehearsal cadence (docs/ops/INCIDENT_REHEARSAL_CADENCE.md)
