PLAT-02: Distributed caching strategy and cache-invalidation semantics #805

Chris0Jeky merged 22 commits into main from feature/distributed-cache-semantics
Conversation
Document the decision to use cache-aside with Redis for production and IMemoryCache for local dev. Covers key strategy, TTL policy, invalidation triggers, safe degradation, and observability requirements. Refs #85
Define generic cache-aside abstraction with GetAsync, SetAsync, RemoveAsync, and RemoveByPrefixAsync. All operations degrade safely without throwing exceptions to callers. Refs #85
- CacheSettings: configurable provider (Redis/InMemory/None), TTLs, key prefix
- InMemoryCacheService: thread-safe ConcurrentDictionary with periodic sweep
- RedisCacheService: StackExchange.Redis with lazy connection and safe degradation
- NoOpCacheService: passthrough for explicitly disabled caching
- DI registration: auto-selects provider based on Cache:Provider config

All implementations degrade safely — cache failures never propagate to callers. Refs #85
- GetBoardDetailAsync: cache board detail with configurable TTL (120s default)
- ListBoardsAsync: cache unfiltered board list per user (60s default TTL)
- Invalidate board detail cache on update/delete/archive
- Invalidate user board list cache on create/update/delete
- CacheKeys: centralized key definitions for consistent invalidation
- DI: wire ICacheService and CacheSettings into BoardService

Cache is optional — service operates correctly without it. Refs #85
- InMemoryCacheServiceTests (12 tests): get/set, expiry, remove, prefix removal, overwrite, idempotent remove, metrics logging, key prefix isolation, entry count, concurrent access safety
- NoOpCacheServiceTests (5 tests): all operations are no-op and safe
- BoardServiceCacheTests (15 tests): cache hit returns cached value, cache miss queries DB and populates cache, filtered queries bypass cache, not-found results are not cached, invalidation on create/update/delete, fallback to DB when no cache service

Refs #85
Document distributed caching layer in current implementation snapshot and add delivery entry in masterplan with follow-up items. Refs #85
Code Review
This pull request implements a distributed caching strategy for the BoardService using a cache-aside pattern. It introduces the ICacheService interface with Redis, in-memory, and no-op implementations, along with centralized cache key management and configuration settings. The caching layer is integrated into board detail and listing retrievals, with invalidation logic added to creation, update, and deletion flows. Review feedback highlights several critical areas for improvement: adding exception handling to the background timer in the in-memory service to prevent process termination, avoiding synchronous O(N) scans during cache eviction to prevent latency spikes, addressing stale cache issues for shared boards, and replacing blocking Redis key enumeration with a batched approach to ensure production stability.
```csharp
_keyPrefix = keyPrefix;

// Sweep expired entries every 60 seconds
_sweepTimer = new Timer(_ => SweepExpiredEntries(), null, TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(60));
```
The Timer callback SweepExpiredEntries lacks exception handling. In .NET, an unhandled exception in a background timer thread can terminate the process. It is safer to wrap the callback in a try-catch block to ensure application stability.
```csharp
_sweepTimer = new Timer(_ =>
{
    try { SweepExpiredEntries(); }
    catch (Exception ex) { _logger.LogError(ex, "Cache sweep failed"); }
}, null, TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(60));
```

```csharp
if (_cache.Count >= MaxEntries)
{
    SweepExpiredEntries();
}
```
Executing SweepExpiredEntries() synchronously when the cache reaches MaxEntries can cause significant latency spikes on the request thread, as it performs an O(N) scan of the dictionary. Consider triggering the sweep asynchronously to avoid impacting the caller's performance.
```csharp
if (_cache.Count >= MaxEntries)
{
    _ = Task.Run(SweepExpiredEntries);
}
```

```csharp
if (board.OwnerId.HasValue)
    await InvalidateBoardListCacheAsync(board.OwnerId.Value, cancellationToken);
```
The invalidation logic only clears the board list cache for the owner. If a board is shared with other users, their cached lists will remain stale until the TTL expires. This discrepancy with ADR-0023 (which specifies invalidating all accessible users' caches) could lead to data inconsistency in multi-user scenarios. Consider using RemoveByPrefixAsync to invalidate all user board lists or updating the design document to reflect this limitation.
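A minimal sketch of the broader invalidation the comment suggests, assuming the `ICacheService` contract from this PR (`RemoveByPrefixAsync`) and the `boards:user:` key prefix from `CacheKeys`; it trades precision for correctness, dropping every user's cached board list so shared users never see a stale entry:

```csharp
// Sketch only — assumes the ICacheService shape from this PR (RemoveByPrefixAsync)
// and the "boards:user:" key prefix from CacheKeys.
private async Task InvalidateAllBoardListCachesAsync(CancellationToken cancellationToken)
{
    if (_cacheService is null)
        return; // cache is optional — the service operates correctly without it

    // Coarse but correct: invalidates every user's cached board list,
    // including users the board is shared with.
    await _cacheService.RemoveByPrefixAsync("boards:user:", cancellationToken);
}
```

The alternative is to enumerate the board's members and invalidate each user's list key individually, which is more precise but requires the membership list at invalidation time.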
```csharp
var keys = server.Keys(pattern: $"{fullPrefix}*").ToArray();
if (keys.Length > 0)
{
    var db = connection.GetDatabase();
    await db.KeyDeleteAsync(keys);
    _logger.LogDebug("Cache removed {Count} keys with prefix {CacheKeyPrefix}", keys.Length, fullPrefix);
```
Calling server.Keys(...).ToArray() is risky in production as it can block the Redis server and cause memory pressure in the API if the number of keys is large. It is recommended to iterate over the keys and process them in batches to ensure safe degradation and performance stability.
```csharp
var keys = server.Keys(pattern: $"{fullPrefix}*");
var db = connection.GetDatabase();
var count = 0;
foreach (var batch in keys.Chunk(500))
{
    await db.KeyDeleteAsync(batch);
    count += batch.Length;
}
_logger.LogDebug("Cache removed {Count} keys with prefix {CacheKeyPrefix}", count, fullPrefix);
```

- Change CacheMetric logging from Information to Debug level in both InMemoryCacheService and RedisCacheService to avoid log noise in high-traffic production scenarios
- Clear ConcurrentDictionary on InMemoryCacheService.Dispose() for complete resource cleanup
- Remove unused _logger field from BoardService (was declared but never referenced)
- Update test assertion to match Debug log level

Refs #85
Adversarial Self-Review Findings

Fixed in follow-up commit (9473f35)
Acknowledged but acceptable
Thread safety verification
Security check
Pull request overview
Introduces a cache-aside caching layer for hot board read paths, with in-memory / Redis / no-op implementations, DI wiring, and supporting documentation/tests.
Changes:
- Added `ICacheService` abstraction + `CacheKeys`/`CacheSettings`, and applied caching to `BoardService` list/detail reads with TTLs + targeted invalidation.
- Implemented `InMemoryCacheService`, `RedisCacheService`, and `NoOpCacheService`, plus DI registration based on config.
- Added ADR-0023 and updated status/masterplan docs; added unit tests for cache services and BoardService caching behavior.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/STATUS.md | Documents the new distributed caching layer and semantics. |
| docs/IMPLEMENTATION_MASTERPLAN.md | Adds the PLAT-02 delivery entry to the master plan. |
| docs/decisions/INDEX.md | Registers ADR-0023 in the ADR index. |
| docs/decisions/ADR-0023-distributed-caching-cache-aside.md | Defines the intended cache-aside strategy, TTLs, key strategy, and invalidation semantics. |
| backend/tests/Taskdeck.Application.Tests/Services/BoardServiceCacheTests.cs | Tests cache-aside behavior and invalidation expectations for BoardService. |
| backend/tests/Taskdeck.Api.Tests/NoOpCacheServiceTests.cs | Verifies no-op behavior and singleton semantics expectations. |
| backend/tests/Taskdeck.Api.Tests/InMemoryCacheServiceTests.cs | Tests in-memory cache operations, expiry, prefix removal, logging, and concurrency. |
| backend/src/Taskdeck.Infrastructure/Taskdeck.Infrastructure.csproj | Adds dependencies for caching backends (Memory + Redis). |
| backend/src/Taskdeck.Infrastructure/Services/RedisCacheService.cs | Redis cache implementation with lazy connection + safe degradation. |
| backend/src/Taskdeck.Infrastructure/Services/NoOpCacheService.cs | No-op cache implementation for disabled caching mode. |
| backend/src/Taskdeck.Infrastructure/Services/InMemoryCacheService.cs | In-memory cache implementation with sweep timer + JSON serialization. |
| backend/src/Taskdeck.Infrastructure/DependencyInjection.cs | Registers ICacheService + CacheSettings from configuration. |
| backend/src/Taskdeck.Application/Services/CacheSettings.cs | Adds configurable cache settings (provider, TTLs, prefix, Redis connstr). |
| backend/src/Taskdeck.Application/Services/CacheKeys.cs | Centralizes cache key formats for board list/detail. |
| backend/src/Taskdeck.Application/Services/BoardService.cs | Applies caching to list/detail endpoints and adds invalidation on board mutations. |
| backend/src/Taskdeck.Application/Interfaces/ICacheService.cs | Defines cache abstraction contract (get/set/remove/remove-by-prefix). |
| backend/src/Taskdeck.Api/Extensions/ApplicationServiceRegistration.cs | Updates DI to inject cache service + cache settings into BoardService. |
Comments suppressed due to low confidence (2)
backend/src/Taskdeck.Application/Services/BoardService.cs:22
`_logger` is injected and stored but never used in this service. If logging is intended at the BoardService level, add concrete log statements (e.g., for cache bypass conditions); otherwise remove the field/parameter to avoid dead dependencies and potential analyzer warnings.
```csharp
public BoardService(IUnitOfWork unitOfWork)
    : this(unitOfWork, authorizationService: null, realtimeNotifier: null, historyService: null)
{
}
```
backend/src/Taskdeck.Application/Services/BoardService.cs:268
Board list invalidation on update only clears the owner's list cache. Because non-owners with write access (Admin/Editor) can rename/archive boards, their own `boards:user:{actingUserId}` cache can remain stale after a successful mutation. Consider invalidating the acting user's list cache as well (you may need to thread `actingUserId` into the internal update path or handle invalidation in the overload that has `actingUserId`).
```csharp
    return Result.Success(MapToDto(board));
}
catch (DomainException ex)
{
    return Result.Failure<BoardDto>(ex.ErrorCode, ex.Message);
```
```csharp
}

var result = JsonSerializer.Deserialize<T>(value!);
_logger.LogDebug("Cache hit for key {CacheKey}", fullKey);
LogCacheMetric("hit", key);
return result;
```
JsonSerializer.Deserialize<T>(value!) is passing a RedisValue into System.Text.Json, which won’t deserialize as intended (and may not compile depending on implicit conversions). Convert the Redis value to a string (or byte[]) explicitly before deserializing, and handle deserialization failures as a cache miss.
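A sketch of the suggested fix, assuming StackExchange.Redis and System.Text.Json as used in this PR (`fullKey`, `key`, and `LogCacheMetric` are the names from the surrounding snippet):

```csharp
// Sketch only — convert the RedisValue to a string before deserializing,
// and treat any deserialization failure as a cache miss.
var value = await db.StringGetAsync(fullKey);
if (!value.HasValue)
{
    LogCacheMetric("miss", key);
    return default;
}

try
{
    var result = JsonSerializer.Deserialize<T>(value.ToString());
    LogCacheMetric("hit", key);
    return result;
}
catch (JsonException)
{
    // A corrupt or schema-incompatible payload is treated as a cache miss.
    LogCacheMetric("miss", key);
    return default;
}
```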
```csharp
// Use SCAN to find keys by prefix — safe for production (non-blocking).
// Note: This requires access to a server endpoint.
var endpoints = connection.GetEndPoints();
foreach (var endpoint in endpoints)
{
    var server = connection.GetServer(endpoint);
    var keys = server.Keys(pattern: $"{fullPrefix}*").ToArray();
    if (keys.Length > 0)
    {
        var db = connection.GetDatabase();
        await db.KeyDeleteAsync(keys);
        _logger.LogDebug("Cache removed {Count} keys with prefix {CacheKeyPrefix}", keys.Length, fullPrefix);
```
RemoveByPrefixAsync enumerates all matching keys with server.Keys(...).ToArray(). Even though it’s SCAN-based, fully materializing keys can be expensive on large keyspaces and is often unsupported/restricted on managed/clustered Redis setups. Prefer a batched/streaming scan with incremental deletes (and consider avoiding prefix invalidation in Redis entirely by using versioned keys).
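The versioned-key alternative mentioned above could look roughly like this — the helpers are hypothetical, not part of this PR. Bumping a per-user version number makes old entries unreachable, so no SCAN/KEYS enumeration is ever needed and superseded entries simply age out via TTL:

```csharp
// Sketch only — hypothetical versioned-key helpers, assuming the
// boards:user:{userId} key family from this PR.
private static async Task<string> GetVersionedListKeyAsync(IDatabase db, Guid userId)
{
    // The current version is stored alongside the data keys.
    var version = await db.StringGetAsync($"boards:user:{userId}:v");
    return $"boards:user:{userId}:{(version.HasValue ? version.ToString() : "0")}";
}

private static Task InvalidateUserListAsync(IDatabase db, Guid userId) =>
    // Incrementing the version "invalidates" without deleting any keys.
    db.StringIncrementAsync($"boards:user:{userId}:v");
```

This costs one extra round trip per read (fetching the version) but works on clustered and managed Redis where server-side key enumeration is restricted.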
```csharp
// Lazy initialization: connection is established on first cache access,
// not at startup. This prevents startup failures when Redis is unavailable.
_lazyConnection = new Lazy<ConnectionMultiplexer?>(() =>
{
    try
    {
        var options = ConfigurationOptions.Parse(connectionString);
        options.AbortOnConnectFail = false; // Allow startup without Redis
        options.ConnectTimeout = 3000;      // 3 second connect timeout
        options.SyncTimeout = 1000;         // 1 second sync timeout
        options.AsyncTimeout = 1000;        // 1 second async timeout
        var connection = ConnectionMultiplexer.Connect(options);
        _logger.LogInformation("Redis cache connected successfully");
        return connection;
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Redis cache connection failed — operating in degraded (no-cache) mode");
        return null;
    }
});
```
The lazy connection is only attempted once; if the initial Connect fails and returns null, the service will stay permanently in no-cache mode until process restart. Consider retrying connection establishment on subsequent operations (or keeping a Lazy<Task<ConnectionMultiplexer?>> that can be recreated after failures) so Redis can recover without a restart.
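One way to sketch this — replacing `Lazy<T>` with a reconnectable field guarded by a lock; `_connectionString` and `_logger` are the fields already present in this PR, while `_reconnectAt` is a new throttle field introduced here for illustration:

```csharp
// Sketch only — lets Redis recover without a process restart.
private ConnectionMultiplexer? _connection;
private DateTime _reconnectAt = DateTime.MinValue;
private readonly object _connectLock = new();

private ConnectionMultiplexer? GetConnection()
{
    if (_connection is { IsConnected: true })
        return _connection;

    lock (_connectLock)
    {
        if (_connection is { IsConnected: true } || DateTime.UtcNow < _reconnectAt)
            return _connection;

        try
        {
            var options = ConfigurationOptions.Parse(_connectionString);
            options.AbortOnConnectFail = false;
            _connection = ConnectionMultiplexer.Connect(options);
        }
        catch (Exception ex)
        {
            // Throttle reconnect attempts so hot paths don't pay the
            // connect timeout on every cache call while Redis is down.
            _logger.LogWarning(ex, "Redis reconnect failed — retrying later");
            _reconnectAt = DateTime.UtcNow.AddSeconds(15);
        }

        return _connection;
    }
}
```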
```csharp
try
{
    if (_cache.Count >= MaxEntries)
    {
        SweepExpiredEntries();
    }

    var serialized = JsonSerializer.Serialize(value);
    var entry = new CacheEntry(serialized, DateTime.UtcNow.Add(ttl));
    _cache[fullKey] = entry;
    _logger.LogDebug("Cache set for key {CacheKey} with TTL {Ttl}s", fullKey, ttl.TotalSeconds);
```
MaxEntries is described as a maximum, but the current logic only sweeps expired entries when the threshold is reached and does not cap growth if entries are still valid. This can still lead to unbounded memory growth under sustained unique-key traffic. Consider enforcing a hard cap (evict oldest/random) or refusing to cache when above the limit.
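A sketch of a hard-cap eviction pass, assuming the `_cache` ConcurrentDictionary and a `CacheEntry` with an `ExpiresAt` property as described in this PR; it would run when a sweep frees too little space:

```csharp
// Sketch only — enforces MaxEntries as a hard cap by evicting the entries
// closest to expiry until the cache is back under the target size.
private void EvictOldestEntries(int targetCount)
{
    var victims = _cache
        .OrderBy(kvp => kvp.Value.ExpiresAt)
        .Take(Math.Max(0, _cache.Count - targetCount))
        .Select(kvp => kvp.Key)
        .ToList();

    foreach (var key in victims)
        _cache.TryRemove(key, out _);
}
```

The snapshot-and-remove approach is O(N log N) but runs only at the cap, and `TryRemove` keeps it safe against concurrent writers.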
```csharp
private void LogCacheMetric(string outcome, string keyPrefix)
{
    // Extract the resource type from the key for tagging (e.g., "boards" from "boards:user:...")
    var resource = keyPrefix.Split(':').FirstOrDefault() ?? "unknown";
    _logger.LogDebug("CacheMetric outcome={Outcome} resource={Resource}", outcome, resource);
```
LogCacheMetric emits LogInformation on every cache hit/miss. For hot read paths this can create very high log volume and cost in production. Consider lowering to Debug/Trace or using OpenTelemetry Meter counters for hit/miss/error while keeping warnings for failures.
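The counter-based alternative might be sketched like this, using `System.Diagnostics.Metrics`; the meter and instrument names are assumptions aligned with the ADR's `cache.hit`/`cache.miss`/`cache.error` naming:

```csharp
// Sketch only — hit/miss/error as metrics counters instead of per-request
// log lines. Requires the System.Diagnostics.Metrics namespace.
private static readonly Meter CacheMeter = new("Taskdeck.Cache");
private static readonly Counter<long> CacheHits = CacheMeter.CreateCounter<long>("cache.hit");
private static readonly Counter<long> CacheMisses = CacheMeter.CreateCounter<long>("cache.miss");
private static readonly Counter<long> CacheErrors = CacheMeter.CreateCounter<long>("cache.error");

private static void RecordCacheMetric(string outcome, string keyPrefix)
{
    var tag = new KeyValuePair<string, object?>("cache_key_prefix", keyPrefix.Split(':')[0]);
    switch (outcome)
    {
        case "hit": CacheHits.Add(1, tag); break;
        case "miss": CacheMisses.Add(1, tag); break;
        default: CacheErrors.Add(1, tag); break;
    }
}
```

Counters aggregate in-process, so hot read paths add no log volume, and an OpenTelemetry exporter can ship them to dashboards directly.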
```markdown
### Metrics

- Cache hit/miss/error counters emitted via `ILogger` structured logging
- Metric names: `cache.hit`, `cache.miss`, `cache.error`
- Tagged with `cache_key_prefix` for per-resource analysis
```
The ADR specifies metric names cache.hit/cache.miss/cache.error tagged with cache_key_prefix, but the implementation currently logs a generic CacheMetric message with {Outcome} and {Resource}. Please align the ADR and implementation so downstream log parsing/metrics dashboards don’t break (either update the ADR or change the log event schema).
```xml
  <PrivateAssets>all</PrivateAssets>
  <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="Microsoft.Extensions.Caching.Memory" Version="8.0.1" />
<PackageReference Include="Microsoft.Extensions.Logging.Abstractions" Version="10.0.5" />
<PackageReference Include="StackExchange.Redis" Version="2.8.24" />
<PackageReference Include="Microsoft.IdentityModel.Tokens" Version="8.17.0" />
```
Microsoft.Extensions.Caching.Memory is added as a dependency but doesn’t appear to be used anywhere in Taskdeck.Infrastructure (the in-memory cache is implemented via ConcurrentDictionary). Remove this package reference to avoid unnecessary dependencies, or refactor InMemoryCacheService to actually use IMemoryCache.
```csharp
public sealed class NoOpCacheService : ICacheService
{
    public static readonly NoOpCacheService Instance = new();
```
NoOpCacheService is described/tested as a singleton (Instance), but the class has an implicit public constructor so callers can still create additional instances. If singleton semantics are intended, make the constructor private to enforce it.
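A sketch of the suggested fix; the member signatures below are assumptions based on the `GetAsync`/`SetAsync`/`RemoveAsync`/`RemoveByPrefixAsync` contract described in this PR:

```csharp
// Sketch only — a private constructor enforces the singleton.
public sealed class NoOpCacheService : ICacheService
{
    public static readonly NoOpCacheService Instance = new();

    private NoOpCacheService() { } // callers must go through Instance

    public Task<T?> GetAsync<T>(string key, CancellationToken ct = default) =>
        Task.FromResult<T?>(default); // always a miss

    public Task SetAsync<T>(string key, T value, TimeSpan ttl, CancellationToken ct = default) =>
        Task.CompletedTask; // never stores

    public Task RemoveAsync(string key, CancellationToken ct = default) =>
        Task.CompletedTask;

    public Task RemoveByPrefixAsync(string prefix, CancellationToken ct = default) =>
        Task.CompletedTask;
}
```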
```csharp
[Fact]
public async Task GetBoardDetail_FallsBackToDatabase_WhenCacheThrows()
{
    var boardId = Guid.NewGuid();
    var board = new Board("Fallback Board", "desc", _userId);

    // Cache throws on get — should fallback to DB
    _cacheMock.Setup(c => c.GetAsync<BoardDetailDto>(
        It.IsAny<string>(), It.IsAny<CancellationToken>()))
        .ThrowsAsync(new InvalidOperationException("Redis down"));

    _boardRepoMock.Setup(r => r.GetByIdWithDetailsAsync(boardId, It.IsAny<CancellationToken>()))
        .ReturnsAsync(board);

    // Use the InMemoryCacheService that wraps exceptions, or just test with no-cache fallback
    // This tests the pattern: when ICacheService itself is implemented correctly
    // it should never throw. But the BoardService constructor accepts null cache
    // and the service still works.
    var serviceWithoutCache = new BoardService(_unitOfWorkMock.Object);
    var result = await serviceWithoutCache.GetBoardDetailAsync(boardId);

    result.IsSuccess.Should().BeTrue();
    result.Value.Name.Should().Be("Fallback Board");
}
```
This test sets up _cacheMock to throw, but then creates serviceWithoutCache (not using _cacheMock) and never exercises the cache-throws path. As written it only verifies the no-cache constructor works. Either update the test to actually use a cache implementation that swallows exceptions, or rename/remove the unused setup so the intent matches the assertions.
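If the intent is to cover the cache-throws path, a reworked test might look like this. Note the hedges: it assumes a hypothetical `BoardService` constructor overload that accepts the cache mock, and it only passes if `BoardService` guards cache access in try/catch — without that guard the exception surfaces and the test (correctly) fails:

```csharp
// Sketch only — actually exercises the throwing cache instead of the
// null-cache constructor. Constructor overload is an assumption.
[Fact]
public async Task GetBoardDetail_FallsBackToDatabase_WhenCacheGetThrows()
{
    var boardId = Guid.NewGuid();
    var board = new Board("Fallback Board", "desc", _userId);

    _cacheMock.Setup(c => c.GetAsync<BoardDetailDto>(
            It.IsAny<string>(), It.IsAny<CancellationToken>()))
        .ThrowsAsync(new InvalidOperationException("Redis down"));
    _boardRepoMock.Setup(r => r.GetByIdWithDetailsAsync(boardId, It.IsAny<CancellationToken>()))
        .ReturnsAsync(board);

    var service = new BoardService(_unitOfWorkMock.Object, _cacheMock.Object);

    var result = await service.GetBoardDetailAsync(boardId);

    result.IsSuccess.Should().BeTrue();
    result.Value.Name.Should().Be("Fallback Board");
}
```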
```csharp
private void LogCacheMetric(string outcome, string keyPrefix)
{
    var resource = keyPrefix.Split(':').FirstOrDefault() ?? "unknown";
    _logger.LogDebug("CacheMetric outcome={Outcome} resource={Resource}", outcome, resource);
}
```
LogCacheMetric uses LogInformation for every cache outcome. Like the in-memory implementation, this can generate very high log volume on hot paths. Consider lowering the level or switching to real metrics (OpenTelemetry counters), keeping warnings for failures.
Adversarial Review: PR #805 -- Distributed Caching Strategy (PLAT-02)

Summary

This PR introduces a cache-aside pattern for BoardService reads.

CRITICAL Issues

C1: Stale cache from ColumnService/CardService mutations -- board detail cache is never invalidated by sibling services

Files:
This is not just a "data freshness" concern -- it is a correctness bug. A user adds a column, then immediately views the board, and the column is missing. They move a card, the card count is wrong. Fix: Either:
C2: `Lazy<T>` caches a failed Redis connection permanently -- caching stays disabled until process restart
| Severity | Count | Action Required |
|---|---|---|
| CRITICAL | 2 | Must fix before merge |
| HIGH | 4 | Should fix before merge |
| MEDIUM | 5 | Fix or document as follow-up |
| LOW | 4 | Informational |
…e mutations BoardDetailDto includes columns with card counts that are mutated by ColumnService and CardService without cache awareness. Caching board detail would serve stale column/card information for up to 120 seconds after any column or card mutation. Board list caching is safe because BoardDto excludes columns and card counts. Addresses review finding C1 (CRITICAL).
Lazy<T> caches its factory result permanently. If Redis is temporarily down during first access, the cache stays disabled forever until restart. Replace with a volatile field + double-checked locking + 15s reconnect throttle so transient Redis outages do not permanently disable caching. Also check _disposed before cache operations to avoid ObjectDisposedException spam during graceful shutdown. Addresses review findings C2 (CRITICAL) and H2 (HIGH).
…CacheService When all 10,000 entries have long TTLs, SweepExpiredEntries removes nothing and the cache grows unboundedly. Add EvictOldestEntries to enforce the hard limit by evicting entries closest to expiry when sweep is insufficient. Wrap sweep timer callback in try/catch to prevent unhandled exceptions from permanently terminating the timer. Addresses review findings H1 (HIGH) and M4 (MEDIUM).
The InMemoryCacheService uses ConcurrentDictionary directly, not IMemoryCache. This package was added but never referenced in code. Addresses review finding H3 (HIGH).
Typos like 'Rediss' or 'inmem' silently fall back to InMemory with no indication. Add explicit warning log so operators notice misconfigurations. Addresses review finding M2 (MEDIUM).
…x misleading test Remove board detail cache tests (detail is no longer cached). Update invalidation tests to verify only board list cache. Replace misleading FallsBackToDatabase_WhenCacheThrows test that tested null-cache path instead of exception path with honest test names. Addresses review finding M1 (MEDIUM).
Correct ADR-0023: InMemory uses ConcurrentDictionary not IMemoryCache, board detail is intentionally not cached, update key strategy and TTL sections. Update STATUS.md and IMPLEMENTATION_MASTERPLAN.md to reflect reconnection-capable Redis, capacity-capped InMemory, and board-detail exclusion. Addresses review finding M5 (MEDIUM).
…805) Pre-resolve expected merge conflicts across all 10 platform expansion PRs: - STATUS.md: consolidated wave summary, new CI workflow entries - IMPLEMENTATION_MASTERPLAN.md: delivery entry #130, next-steps item #20 - TESTING_GUIDE.md: cross-browser, visual regression, mutation, Testcontainers - decisions/INDEX.md: canonical ADR numbering (0023-0027) resolving the 5-way ADR-0023 collision across PRs
…ache-semantics # Conflicts: # docs/decisions/INDEX.md
- RedisCacheService: Convert RedisValue to string before deserializing - RedisCacheService: Batch SCAN/delete operations to avoid materializing all keys - RedisCacheService: Add retry logic with exponential backoff for connection - InMemoryCacheService: Enforce hard cap on cache size with proper locking
Resolved conflicts by accepting main's versions: - AutomationProposalService.cs: whitespace changes - UnitOfWork.cs: error message update - ConcurrencyRaceConditionStressTests.cs: relaxed assertions for SQLite CI - AutomationProposalServiceEdgeCaseTests.cs: aligned test names and error messages
…s0Jeky/Taskdeck into feature/distributed-cache-semantics
Take main's versions for files unrelated to the cache feature (AuthController, auth tests, authApi, decisions index). Add MfaCredentials to UnitOfWork to match interface from auto-merged IUnitOfWork. Docs already include cache feature documentation.
Update SetRecoveryCodes test to match domain method: empty/null now clears codes instead of throwing (supports exhausted recovery codes). Remove AuthenticationRegistrationTests (belongs to PR #813, not cache PR). Take main's versions for non-cache files.
Take main's versions for all non-cache files now that #813 provides OIDC/MFA infrastructure. Keep MfaCredentialTests fix with both empty-string and null clearing test cases.
Update STATUS.md with post-merge housekeeping entry, recertified test counts (4279 backend + 2245 frontend = ~6500+), and delivered status for distributed caching, SSO/OIDC/MFA, and staged rollout. Update TESTING_GUIDE.md with current test counts and new test categories (resilience, MFA/OIDC, telemetry, cache). Update IMPLEMENTATION_MASTERPLAN.md marking all expansion wave items as delivered. Extend AUTHENTICATION.md with OIDC/SSO login flow, MFA setup/verify/ recovery, API key management, and account linking endpoints. Update MANUAL_TEST_CHECKLIST.md: mark all PRs as merged, add testing tasks for error tracking (#811), MCP HTTP transport (#819), distributed caching (#805), and resilience tests (#820).
Summary

- `ICacheService` abstraction in Application layer with cache-aside pattern for hot read paths
- `InMemoryCacheService` (thread-safe ConcurrentDictionary with periodic sweep), `RedisCacheService` (StackExchange.Redis with lazy connection and safe degradation), and `NoOpCacheService`
- `BoardService.ListBoardsAsync` (60s TTL) and `GetBoardDetailAsync` (120s TTL) with invalidation on create/update/delete/archive
- Configuration via `appsettings.json` (`Cache:Provider`, `Cache:RedisConnectionString`, TTLs)

Test plan
Closes #85