TST-53: Resilience and degraded-mode behavior tests #820
Conversation
Tests that LlmQueueToProposalWorker and ProposalHousekeepingWorker handle database failures, exceptions in the main loop, cancellation, and disabled processing without crashing or losing heartbeats.
…lability Tests that provider timeouts, exceptions, and total unavailability produce degraded responses or error contracts rather than infinite waits or crashes. Verifies non-LLM features (board CRUD, capture) still work when all providers are down.
…nflicts Tests that health endpoints report database status accurately, non-existent resources return 404 error contracts instead of 500, concurrent writes handle conflicts gracefully, and invalid data returns validation errors.
Tests that HubException on one client does not disconnect other clients, that invalid operations (joining non-existent boards, editing without joining) produce HubException without killing the connection, and that disconnected clients are properly removed from presence tracking.
…overy Tests that webhook delivery failures trigger retry scheduling with backoff, that exceeding max retries leads to dead-lettering, that inactive subscriptions are dead-lettered, and that stuck Processing deliveries are recovered back to Pending status.
Tests that local authentication works regardless of external OAuth state, GitHub OAuth endpoints return proper 404 when not configured, and unauthenticated requests return 401 error contracts.
ExternalServiceFailureTests used fixed usernames for registration and login which could collide across test runs sharing the same database via IClassFixture. Add GUID suffixes to ensure uniqueness.
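The uniqueness fix described in this commit can be sketched in one line (the `resilience-user-` prefix is illustrative, not the actual test code):

```csharp
// A GUID suffix guarantees a unique username across test runs that
// share one database via IClassFixture.
var username = $"resilience-user-{Guid.NewGuid():N}";
```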
Code Review
This pull request introduces a comprehensive suite of resilience tests for the Taskdeck API, covering database operations, external service failures, LLM provider degradation, SignalR hub stability, webhook delivery retries, and background worker robustness. The tests ensure that the system handles errors gracefully without crashing or hanging. I have identified two instances in 'LlmProviderDegradationTests.cs' where '#pragma warning disable CS0162' is used to suppress unreachable code warnings; these should be refactored to avoid the need for suppression.
```csharp
#pragma warning disable CS0162
yield break;
#pragma warning restore CS0162
```

(The same three-line suppression appears at two places in LlmProviderDegradationTests.cs.)
Self-Review Findings

Reviewed all 6 test files for adversarial quality issues. Findings:
- Fixed
- Verified as correct
- No issues found
Pull request overview
Adds a new backend resilience test suite under Taskdeck.Api.Tests/Resilience/ to validate degraded-mode behavior when key dependencies (DB, LLM providers, SignalR clients, webhooks, and external auth) fail, ensuring the API/workers fail gracefully.
Changes:
- Added worker resilience tests covering exception handling, cancellation, heartbeats, and disabled processing.
- Added degradation/resilience tests for LLM provider failures, SignalR hub isolation, webhook delivery retry/dead-letter/recovery, DB error contracts/health, and external auth unavailability.
- Introduced several new integration-style tests using `TestWebApplicationFactory` to validate end-to-end behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| backend/tests/Taskdeck.Api.Tests/Resilience/WorkerResilienceTests.cs | Validates worker loops survive exceptions/cancellation and keep heartbeats. |
| backend/tests/Taskdeck.Api.Tests/Resilience/WebhookDeliveryResilienceTests.cs | Exercises retry scheduling, dead-lettering, and stuck-processing recovery via DB state transitions. |
| backend/tests/Taskdeck.Api.Tests/Resilience/SignalRDegradationTests.cs | Ensures hub errors are isolated per-connection and disconnects update presence. |
| backend/tests/Taskdeck.Api.Tests/Resilience/LlmProviderDegradationTests.cs | Swaps in failing LLM providers to verify degraded/error behavior and non-LLM feature continuity. |
| backend/tests/Taskdeck.Api.Tests/Resilience/ExternalServiceFailureTests.cs | Confirms local auth continues working and GitHub OAuth endpoints fail cleanly when unconfigured. |
| backend/tests/Taskdeck.Api.Tests/Resilience/DatabaseResilienceTests.cs | Confirms health endpoints include DB status and common DB/API error cases return correct contracts. |
```csharp
var settings = new WorkerSettings
{
    QueuePollIntervalSeconds = 0,
    EnableAutoQueueProcessing = true,
    MaxBatchSize = 5,
    MaxConcurrency = 1,
    RetryBackoffSeconds = new[] { 0 }
};
```
```csharp
var settings = new WorkerSettings
{
    QueuePollIntervalSeconds = 0,
    EnableAutoQueueProcessing = false, // Disabled
    MaxBatchSize = 5,
    MaxConcurrency = 1,
    RetryBackoffSeconds = new[] { 0 }
};
```
```csharp
using System.Net;
using System.Net.Http.Json;
```
```csharp
// Give a moment for presence events to propagate.
await Task.Delay(500);
```
```csharp
using FluentAssertions;
using Microsoft.Extensions.DependencyInjection;
using Taskdeck.Application.Interfaces;
using Taskdeck.Application.Services;
using Taskdeck.Domain.Entities;
```
```csharp
public async IAsyncEnumerable<LlmTokenEvent> StreamAsync(
    ChatCompletionRequest request,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    await Task.Delay(TimeSpan.FromSeconds(60), ct);
    yield return new LlmTokenEvent("timeout", true);
}
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d54ab75de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```csharp
QueuePollIntervalSeconds = 0,
EnableAutoQueueProcessing = false, // Disabled
MaxBatchSize = 5,
```
Use nonzero poll interval in disabled worker resilience test
Setting QueuePollIntervalSeconds to 0 here can make LlmQueueToProposalWorker.ExecuteAsync run in a tight synchronous loop when auto-processing is disabled, because the loop body reaches only Task.Delay(0) and never performs an asynchronous yield. Under BackgroundService.StartAsync, that can block StartAsync itself and hang the test before cancellation/cleanup logic runs. Use a positive poll interval (or force an async yield) so the worker loop can be canceled deterministically.
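The failure mode and the suggested fix can be sketched in a self-contained form. The loop shape below is illustrative (names like `RunLoopAsync` are not the real worker code): `Task.Delay(0)` completes synchronously, so a zero interval never yields; clamping to a positive interval restores an asynchronous suspension point so cancellation is observed deterministically.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class WorkerLoopSketch
{
    // Illustrative worker loop: with a zero interval, Task.Delay(0)
    // completes synchronously and the loop pins a thread. Clamping the
    // interval (or awaiting Task.Yield) gives the scheduler a chance to
    // deliver cancellation.
    static async Task RunLoopAsync(int pollIntervalSeconds, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            // ...poll the queue here (skipped when processing is disabled)...
            await Task.Delay(
                TimeSpan.FromSeconds(Math.Max(1, pollIntervalSeconds)), ct);
        }
    }

    static async Task Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(2500));
        try { await RunLoopAsync(1, cts.Token); }
        catch (TaskCanceledException) { Console.WriteLine("loop cancelled cleanly"); }
    }
}
```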
```csharp
callCount.Should().BeGreaterThan(0,
    "worker should have attempted at least one batch despite DB throwing");
```
Assert multiple iterations when testing continue-after-error
This test is named to verify the worker continues polling after an exception, but callCount > 0 only proves one attempt happened; the assertion still passes if the worker crashes immediately after the first failure. Tightening this to require at least two iterations (or equivalent evidence) is needed for the resilience claim this test is meant to protect.
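A minimal version of the tightened assertion, in the same FluentAssertions style as the diff (`BeGreaterThanOrEqualTo` in recent FluentAssertions; older versions name it `BeGreaterOrEqualTo`). A threshold of 2 is the smallest value that proves the loop survived the first failure:

```csharp
// At least two polls: the worker must have continued after the first
// batch threw, not merely attempted one batch and crashed.
callCount.Should().BeGreaterThanOrEqualTo(2,
    "worker should keep polling after the first batch throws");
```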
Review: Quality issues identified

P1 -- Tight-loop CPU spin (WorkerResilienceTests.cs)
Lines 39 and 174: `QueuePollIntervalSeconds = 0` makes the worker loop spin on `Task.Delay(0)`. Fix: Change to `QueuePollIntervalSeconds = 1`.

P2 -- Weak assertion (WorkerResilienceTests.cs line 62)
Test is named "continue after error" but only asserts `callCount > 0`. Fix: Continuation is already asserted via the log and heartbeat checks below. Tighten `callCount` to require at least two iterations.

P2 -- Timing-based wait (SignalRDegradationTests.cs line 99)
A fixed `Task.Delay(500)` can flake on slow CI. Fix: Replace with `SignalRTestHelper.WaitForEventsAsync`.

P3 -- Unused usings
`System.Net` is unused; `System.Net.Http.Json` is still needed for `PostAsJsonAsync`.

P3 -- Unnecessary `#pragma warning disable CS0162`
Replace the suppressions with helper methods that throw, making `yield break` reachable without warnings.

P2 -- TimeoutProviderStub.StreamAsync hangs for 60 seconds
Fix: Use a short delay (e.g., 100ms) with an internal CancellationTokenSource that cancels quickly, matching the pattern already used in

Additional findings from adversarial review
- WorkerResilienceTests.cs line 122: The cancellation test also uses `QueuePollIntervalSeconds = 0`.
- WorkerResilienceTests.cs line 174: The disabled-processing test also uses `QueuePollIntervalSeconds = 0`.

All issues will be fixed in follow-up commits.
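A sketch of what the fixed stub could look like, assuming the signature shown in the diff (the linked-token pattern is an assumption about the fix, not the merged code):

```csharp
public async IAsyncEnumerable<LlmTokenEvent> StreamAsync(
    ChatCompletionRequest request,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    // Link the caller's token to a short internal timeout so the stub
    // still simulates a hung provider, but resolves in ~100ms instead of 60s.
    using var internalCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    internalCts.CancelAfter(TimeSpan.FromMilliseconds(100));
    await Task.Delay(Timeout.InfiniteTimeSpan, internalCts.Token); // throws quickly
    yield return new LlmTokenEvent("timeout", true); // keeps the iterator shape
}
```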
…ker tests QueuePollIntervalSeconds = 0 caused Task.Delay(0) in the worker loop, spinning thousands of iterations during the test window. Change to 1 second and extend delay windows to 1500ms to ensure at least one iteration completes. Also tighten the weak callCount assertion.
…ests Use SignalRTestHelper.WaitForEventsAsync with a presence collector instead of Task.Delay(500) to avoid flaky timing on slow CI. Also remove the unused System.Net using while keeping System.Net.Http.Json which is needed for PostAsJsonAsync.
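The event-based wait this commit introduces can be sketched with the stock SignalR client API (the event name `UserLeft` and the 5-second ceiling are illustrative; the actual helper is `SignalRTestHelper.WaitForEventsAsync`):

```csharp
// Complete a TaskCompletionSource from the hub callback instead of
// sleeping a fixed 500ms, so the test proceeds as soon as the event lands.
var presenceLeft = new TaskCompletionSource<string>(
    TaskCreationOptions.RunContinuationsAsynchronously);
connection.On<string>("UserLeft", user => presenceLeft.TrySetResult(user));

var winner = await Task.WhenAny(presenceLeft.Task, Task.Delay(TimeSpan.FromSeconds(5)));
winner.Should().Be(presenceLeft.Task,
    "presence event should arrive well before the ceiling");
```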
…nc hang Replace #pragma warning disable CS0162 with helper methods that throw, making yield break reachable without suppressing warnings. Fix TimeoutProviderStub.StreamAsync to use a short internal cancellation (50ms) instead of blocking for 60 seconds.
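The helper-method pattern from this commit can be sketched as follows (the helper name and exception type are illustrative). Because the compiler does not reason across method calls, a `yield break` after an always-throwing helper is not flagged as unreachable, so no pragma is needed:

```csharp
private static void ThrowProviderUnavailable() =>
    throw new HttpRequestException("simulated total provider outage");

public async IAsyncEnumerable<LlmTokenEvent> StreamAsync(
    ChatCompletionRequest request,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    await Task.Yield();
    ThrowProviderUnavailable(); // always throws, but CS0162 can't prove it
    yield break;                // reachable as far as the compiler knows
}
```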
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0ac892fa2a
```csharp
finalClaimed.Should().BeTrue();

await dbContext.Entry(delivery).ReloadAsync();
delivery.MarkDeadLetter("HTTP 500 on final attempt", 500);
```
Exercise retry decision path instead of forcing dead-letter
This test bypasses the production retry-threshold logic by calling delivery.MarkDeadLetter(...) directly, so it never validates the branch that decides between ScheduleRetry and dead-lettering (implemented in OutboundWebhookDeliveryWorker.MarkFailure). Because of that, an off-by-one or config regression in max-retry handling would still leave this test green, even though real webhook deliveries would behave incorrectly. Drive the third failure through the worker/retry decision path and then assert the resulting status.
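The decision branch the reviewer wants exercised can be sketched like this; property names other than `MarkDeadLetter` and `ScheduleRetry` are hypothetical, inferred from the comment above. Driving a real failure through this path would catch an off-by-one in the max-retry comparison that a direct `MarkDeadLetter` call skips:

```csharp
// Hypothetical shape of the decision inside
// OutboundWebhookDeliveryWorker.MarkFailure: tests should reach this
// branch via a simulated HTTP failure, not by forcing dead-letter.
if (delivery.AttemptCount >= settings.MaxRetries)
{
    delivery.MarkDeadLetter(reason, statusCode);
}
else
{
    delivery.ScheduleRetry(backoff[Math.Min(delivery.AttemptCount, backoff.Length - 1)]);
}
```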
- Update STATUS.md with post-merge housekeeping entry, recertified test counts (4279 backend + 2245 frontend = ~6500+), and delivered status for distributed caching, SSO/OIDC/MFA, and staged rollout.
- Update TESTING_GUIDE.md with current test counts and new test categories (resilience, MFA/OIDC, telemetry, cache).
- Update IMPLEMENTATION_MASTERPLAN.md marking all expansion wave items as delivered.
- Extend AUTHENTICATION.md with OIDC/SSO login flow, MFA setup/verify/recovery, API key management, and account linking endpoints.
- Update MANUAL_TEST_CHECKLIST.md: mark all PRs as merged, add testing tasks for error tracking (#811), MCP HTTP transport (#819), distributed caching (#805), and resilience tests (#820).
Summary

New resilience tests live under backend/tests/Taskdeck.Api.Tests/Resilience/. All 3,825 tests pass across the full suite with 0 failures.
Closes #720
Test plan