Description
Problem
When a Midaz pod scales horizontally or restarts, its in-memory tenant config cache is empty. If the tenant-manager is unavailable, the new pod cannot serve tenant requests. The current 1h in-memory TTL means only 1h of tolerance for tenant-manager downtime.
Root causes:
- In-memory cache is per-pod (not shared across replicas)
- Cache doesn't survive pod restarts — new pods start cold
- No shared cache layer between pods
Goal
Use Redis/Valkey as the primary shared cache (survives pod restarts, shared across all pods), with the in-memory cache kept as a fast per-pod L1 in front of it. Tenant-manager downtime is then tolerated for the length of the Redis TTL (12-24h).
Architecture
Request → L1: In-Memory (fast, 5min TTL, per-pod)
↓ (miss)
L2: Redis/Valkey (shared, 12h TTL, survives restarts)
↓ (miss)
HTTP to tenant-manager → Secrets Manager
↓ (success)
Write to L2 (Redis) + L1 (in-memory)
Invalidation flow (suspend/purge):
revalidatePoolSettings calls GetTenantConfig(WithSkipCache()) → bypasses L1+L2 → gets 403 → evicts connection + deletes from both caches
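The read path and the invalidation flow above can be sketched as follows. This is a minimal illustration, not the library's code: `configCache` is a simplified stand-in for ConfigCache with the TTL omitted, `mapCache` plays the role of the combined L1+L2 layers, and `fakeTenantManager`, `fetchFunc`, `getTenantConfig`, `revalidate`, the `skipCache` flag and `ErrTenantForbidden` are hypothetical names introduced for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Illustrative sentinel errors; the real package defines its own.
var (
	ErrCacheMiss       = errors.New("cache miss")
	ErrTenantForbidden = errors.New("tenant-manager returned 403")
)

// configCache is a simplified stand-in for ConfigCache (TTL omitted).
type configCache interface {
	Get(ctx context.Context, key string) (string, error)
	Set(ctx context.Context, key, value string) error
	Del(ctx context.Context, key string) error
}

// mapCache is a toy cache that plays the role of the combined L1+L2 layers.
type mapCache map[string]string

func (m mapCache) Get(_ context.Context, key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", ErrCacheMiss
	}
	return v, nil
}

func (m mapCache) Set(_ context.Context, key, value string) error {
	m[key] = value
	return nil
}

func (m mapCache) Del(_ context.Context, key string) error {
	delete(m, key)
	return nil
}

// fakeTenantManager stands in for the HTTP call to tenant-manager.
type fakeTenantManager struct {
	suspended bool
	calls     int
}

func (f *fakeTenantManager) Fetch(_ context.Context, tenantID string) (string, error) {
	f.calls++
	if f.suspended {
		return "", ErrTenantForbidden
	}
	return "config-for-" + tenantID, nil
}

type fetchFunc func(ctx context.Context, tenantID string) (string, error)

// getTenantConfig reads through the cache unless skipCache is set, and
// writes every successful fetch back to the cache.
func getTenantConfig(ctx context.Context, cache configCache, fetch fetchFunc, tenantID string, skipCache bool) (string, error) {
	if !skipCache {
		// Any cache error, including a miss, falls through to the fetch.
		if v, err := cache.Get(ctx, tenantID); err == nil {
			return v, nil
		}
	}
	v, err := fetch(ctx, tenantID)
	if err != nil {
		return "", err
	}
	_ = cache.Set(ctx, tenantID, v)
	return v, nil
}

// revalidate bypasses the cache; on a 403 it drops the cached entry.
// Real code would also evict the tenant's connection here.
func revalidate(ctx context.Context, cache configCache, fetch fetchFunc, tenantID string) error {
	_, err := getTenantConfig(ctx, cache, fetch, tenantID, true)
	if errors.Is(err, ErrTenantForbidden) {
		_ = cache.Del(ctx, tenantID)
	}
	return err
}

func main() {
	ctx := context.Background()
	cache := mapCache{}
	tm := &fakeTenantManager{}

	v, _ := getTenantConfig(ctx, cache, tm.Fetch, "tenant-a", false)
	fmt.Println("cold read:", v, "upstream calls:", tm.calls)

	tm.suspended = true
	v, _ = getTenantConfig(ctx, cache, tm.Fetch, "tenant-a", false)
	fmt.Println("cached read:", v, "upstream calls:", tm.calls)

	fmt.Println("revalidate:", revalidate(ctx, cache, tm.Fetch, "tenant-a"))
	_, err := getTenantConfig(ctx, cache, tm.Fetch, "tenant-a", false)
	fmt.Println("after eviction:", err)
}
```

The second read shows why revalidation has to skip the cache: a suspended tenant would otherwise keep being served from cached config until the TTL expires.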
Implementation Plan
Phase 1: Redis ConfigCache Adapter
New file: commons/tenant-manager/cache/redis.go
Implement existing ConfigCache interface (cache/config_cache.go:19-37) with Redis:
type RedisCache struct {
client redis.UniversalClient
prefix string
}
func NewRedisCache(client redis.UniversalClient, opts ...RedisCacheOption) *RedisCache
func (c *RedisCache) Get(ctx, key) (string, error) // redis GET, returns ErrCacheMiss on nil
func (c *RedisCache) Set(ctx, key, value, ttl) error // redis SET with TTL
func (c *RedisCache) Del(ctx, key) error // redis DEL
Tests: commons/tenant-manager/cache/redis_test.go (use miniredis)
Phase 2: TieredCache Wrapper
New file: commons/tenant-manager/cache/tiered.go
Composes two ConfigCache implementations:
type TieredCache struct {
l1 ConfigCache
l2 ConfigCache
l1TTL time.Duration
l2TTL time.Duration
}
func NewTieredCache(l1, l2 ConfigCache, l1TTL, l2TTL time.Duration) *TieredCache
- Get: L1 → miss → L2 → miss → ErrCacheMiss. On L2 hit, write back to L1 with l1TTL.
- Set: write to L1 (short TTL) + L2 (long TTL).
- Del: delete from both.
Tests: commons/tenant-manager/cache/tiered_test.go
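A minimal, self-contained sketch of the Get/Set/Del behavior described for TieredCache. The ConfigCache method set shown here is inferred from this plan and may differ from the real interface in cache/config_cache.go. `memCache` is a hypothetical toy store that stands in for both InMemoryCache and RedisCache so the example runs without Redis, and ignoring the caller's TTL in Set is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrCacheMiss mirrors the sentinel named in the plan.
var ErrCacheMiss = errors.New("cache miss")

// Defaults taken from the plan: L1=5min, L2=12h.
const (
	defaultL1TTL = 5 * time.Minute
	defaultL2TTL = 12 * time.Hour
)

// ConfigCache is the assumed shape of the existing interface.
type ConfigCache interface {
	Get(ctx context.Context, key string) (string, error)
	Set(ctx context.Context, key, value string, ttl time.Duration) error
	Del(ctx context.Context, key string) error
}

// memCache is a toy TTL map (not concurrency-safe), used for both tiers here.
type entry struct {
	value   string
	expires time.Time
}

type memCache struct{ data map[string]entry }

func newMemCache() *memCache { return &memCache{data: make(map[string]entry)} }

func (m *memCache) Get(_ context.Context, key string) (string, error) {
	e, ok := m.data[key]
	if !ok || time.Now().After(e.expires) {
		delete(m.data, key)
		return "", ErrCacheMiss
	}
	return e.value, nil
}

func (m *memCache) Set(_ context.Context, key, value string, ttl time.Duration) error {
	m.data[key] = entry{value: value, expires: time.Now().Add(ttl)}
	return nil
}

func (m *memCache) Del(_ context.Context, key string) error {
	delete(m.data, key)
	return nil
}

// TieredCache composes a per-pod L1 with a shared L2.
type TieredCache struct {
	l1, l2       ConfigCache
	l1TTL, l2TTL time.Duration
}

func NewTieredCache(l1, l2 ConfigCache, l1TTL, l2TTL time.Duration) *TieredCache {
	return &TieredCache{l1: l1, l2: l2, l1TTL: l1TTL, l2TTL: l2TTL}
}

// Get tries L1, then L2; an L2 hit is written back to L1 with l1TTL.
func (t *TieredCache) Get(ctx context.Context, key string) (string, error) {
	if v, err := t.l1.Get(ctx, key); err == nil {
		return v, nil
	}
	v, err := t.l2.Get(ctx, key)
	if err != nil {
		return "", err // ErrCacheMiss, or an L2 failure passed through
	}
	_ = t.l1.Set(ctx, key, v, t.l1TTL) // best-effort backfill
	return v, nil
}

// Set ignores the caller's TTL and applies the per-tier TTLs (an assumption).
func (t *TieredCache) Set(ctx context.Context, key, value string, _ time.Duration) error {
	return errors.Join(t.l1.Set(ctx, key, value, t.l1TTL), t.l2.Set(ctx, key, value, t.l2TTL))
}

// Del removes the key from both tiers.
func (t *TieredCache) Del(ctx context.Context, key string) error {
	return errors.Join(t.l1.Del(ctx, key), t.l2.Del(ctx, key))
}

func main() {
	ctx := context.Background()
	shared := newMemCache() // plays the role of Redis
	podA := NewTieredCache(newMemCache(), shared, defaultL1TTL, defaultL2TTL)
	_ = podA.Set(ctx, "tenant-a", `{"db":"primary"}`, 0)

	// A freshly started pod has an empty L1 but still finds the entry in L2.
	podB := NewTieredCache(newMemCache(), shared, defaultL1TTL, defaultL2TTL)
	v, err := podB.Get(ctx, "tenant-a")
	fmt.Println(v, err)
}
```

One consequence of this shape: Del only clears the L1 of the pod that calls it. Other pods may keep serving their own L1 copy until l1TTL expires, which is one reason to keep the L1 TTL short.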
Phase 3: WithTieredCache Convenience Option
File: commons/tenant-manager/client/client.go
func WithTieredCache(redisClient redis.UniversalClient, opts ...TieredCacheOption) ClientOption
Options: WithL1TTL(time.Duration), WithL2TTL(time.Duration), WithRedisPrefix(string)
Defaults: L1=5min, L2=12h, prefix=tm-config:
Creates TieredCache(InMemoryCache, RedisCache) and passes via existing WithCache.
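The option handling could look roughly like the sketch below, which resolves the stated defaults and then applies overrides. Only the option names and default values come from this plan. `tieredCacheConfig`, `resolveTieredCacheConfig` and `redisKey` are hypothetical helpers, and the example assumes the default prefix includes the trailing colon.

```go
package main

import (
	"fmt"
	"time"
)

// tieredCacheConfig is a hypothetical holder for the resolved settings.
type tieredCacheConfig struct {
	l1TTL  time.Duration
	l2TTL  time.Duration
	prefix string
}

// TieredCacheOption mutates the config; names follow the plan.
type TieredCacheOption func(*tieredCacheConfig)

func WithL1TTL(d time.Duration) TieredCacheOption {
	return func(c *tieredCacheConfig) { c.l1TTL = d }
}

func WithL2TTL(d time.Duration) TieredCacheOption {
	return func(c *tieredCacheConfig) { c.l2TTL = d }
}

func WithRedisPrefix(p string) TieredCacheOption {
	return func(c *tieredCacheConfig) { c.prefix = p }
}

// resolveTieredCacheConfig starts from the defaults and applies overrides in order.
func resolveTieredCacheConfig(opts ...TieredCacheOption) tieredCacheConfig {
	cfg := tieredCacheConfig{
		l1TTL:  5 * time.Minute,
		l2TTL:  12 * time.Hour,
		prefix: "tm-config:", // assumed to include the colon
	}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

// redisKey shows how the prefix would namespace keys in the shared store.
func (c tieredCacheConfig) redisKey(key string) string {
	return c.prefix + key
}

func main() {
	def := resolveTieredCacheConfig()
	fmt.Println(def.l1TTL, def.l2TTL, def.redisKey("tenant-a"))

	custom := resolveTieredCacheConfig(WithL2TTL(24*time.Hour), WithRedisPrefix("midaz:tm:"))
	fmt.Println(custom.l1TTL, custom.l2TTL, custom.redisKey("tenant-a"))
}
```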
Existing Code to Reuse
| Item | Location |
|---|---|
| ConfigCache interface | cache/config_cache.go:19-37 |
| InMemoryCache | cache/memory.go (used as L1) |
| WithCache(cc ConfigCache) | client/client.go:138-144 |
| ErrCacheMiss | cache/config_cache.go:14 |
| defaultCacheTTL | client/client.go:29 |
Key Files
| File | Action |
|---|---|
| commons/tenant-manager/cache/redis.go | New |
| commons/tenant-manager/cache/redis_test.go | New |
| commons/tenant-manager/cache/tiered.go | New |
| commons/tenant-manager/cache/tiered_test.go | New |
| commons/tenant-manager/client/client.go | Edit: add WithTieredCache option |
Backward Compatibility
- Without WithTieredCache, behavior unchanged (in-memory only)
- WithCache still works for custom implementations
- WithCacheTTL still controls L1 TTL when using WithTieredCache
Downstream Integration (midaz — separate PR)
After this is merged, midaz components will use WithTieredCache:
| File | Change |
|---|---|
| components/ledger/internal/bootstrap/config.go | Add env vars + WithTieredCache |
| components/onboarding/internal/bootstrap/config.go | Add env vars |
| components/transaction/internal/bootstrap/config.go | Add env vars |
| components/crm/internal/bootstrap/config.go | Add env vars |
| components/crm/internal/bootstrap/config.tenant.go | Use WithTieredCache |
Env vars: MULTI_TENANT_CACHE_L1_TTL_SEC (default 300), MULTI_TENANT_CACHE_L2_TTL_SEC (default 43200)
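As a rough illustration of how a component might turn these variables into durations, the sketch below parses a seconds value and falls back to the default when the variable is unset or invalid. `ttlFromEnv` and its injected `lookup` parameter are hypothetical. The real midaz config loading may work differently, and falling back silently on bad input is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// ttlFromEnv reads a TTL in whole seconds via lookup and falls back to
// defaultSec when the variable is unset, non-numeric, or not positive.
// lookup is injected so the function can be exercised without touching
// the process environment; pass os.LookupEnv in real code.
func ttlFromEnv(lookup func(string) (string, bool), name string, defaultSec int) time.Duration {
	fallback := time.Duration(defaultSec) * time.Second
	raw, ok := lookup(name)
	if !ok {
		return fallback
	}
	sec, err := strconv.Atoi(raw)
	if err != nil || sec <= 0 {
		return fallback
	}
	return time.Duration(sec) * time.Second
}

func main() {
	l1 := ttlFromEnv(os.LookupEnv, "MULTI_TENANT_CACHE_L1_TTL_SEC", 300)
	l2 := ttlFromEnv(os.LookupEnv, "MULTI_TENANT_CACHE_L2_TTL_SEC", 43200)
	fmt.Println("L1 TTL:", l1, "L2 TTL:", l2)
}
```

The resulting values would then be handed to WithL1TTL and WithL2TTL when building the client.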
Verification
- Start Midaz with env vars → verify Redis key created on first request
- Restart pod → first request reads from Redis (no HTTP to tenant-manager)
- Scale to 3 pods → all read from shared Redis
- Stop tenant-manager → requests continue from Redis for 12h
- InvalidateConfig → both L1 and L2 cleared
- Suspend tenant → revalidation bypasses cache → detects 403 → evicts both layers
- Run all tests: go test ./commons/tenant-manager/... -count=1 -race