feat: Two-Layer Distributed Cache (Redis L2 + In-Memory L1) for Tenant Config #381

@jeffersonrodrigues92

Problem

When a Midaz pod is added by horizontal scaling or restarts, its in-memory tenant config cache starts empty. If the tenant-manager is unavailable at that moment, the new pod cannot serve tenant requests. For pods that are already warm, the current 1h in-memory TTL caps tolerance for tenant-manager downtime at 1h.

Root causes:

  • In-memory cache is per-pod (not shared across replicas)
  • Cache doesn't survive pod restarts — new pods start cold
  • No shared cache layer between pods

Goal

Use Redis/Valkey as the primary shared cache (L2: survives pod restarts, shared across all pods), with the in-memory cache as a fast per-pod L1 in front of it. Tenant-manager downtime is then tolerated for as long as the Redis TTL (12-24h).

Architecture

Request → L1: In-Memory (fast, 5min TTL, per-pod)
              ↓ (miss)
          L2: Redis/Valkey (shared, 12h TTL, survives restarts)
              ↓ (miss)
          HTTP to tenant-manager → Secrets Manager
              ↓ (success)
          Write to L2 (Redis) + L1 (in-memory)

Invalidation flow (suspend/purge):

  • revalidatePoolSettings calls GetTenantConfig(WithSkipCache()), which bypasses L1 and L2 and goes straight to the tenant-manager. On a 403 response, it evicts the connection and deletes the entry from both caches.
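
The lookup and invalidation flows above can be sketched as follows. This is a simplified model, not the planned implementation: plain maps stand in for both cache layers, TTLs are omitted, and `resolver`, `getTenantConfig`, `revalidate`, and `errForbidden` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden stands in for the 403 returned for a suspended or purged tenant.
var errForbidden = errors.New("tenant-manager: 403 Forbidden")

// resolver is a hypothetical model of the lookup path.
type resolver struct {
	l1, l2  map[string]string                     // stand-ins for in-memory and Redis
	fetch   func(tenantID string) (string, error) // stand-in for the HTTP call
	fetches int                                   // counts calls to fetch
}

// getTenantConfig checks L1, then L2, then the tenant-manager.
// skipCache models WithSkipCache(): both layers are bypassed on read.
func (r *resolver) getTenantConfig(tenantID string, skipCache bool) (string, error) {
	if !skipCache {
		if v, ok := r.l1[tenantID]; ok {
			return v, nil
		}
		if v, ok := r.l2[tenantID]; ok {
			r.l1[tenantID] = v // L2 hit: write back to L1
			return v, nil
		}
	}
	r.fetches++
	v, err := r.fetch(tenantID)
	if err != nil {
		return "", err
	}
	r.l2[tenantID] = v
	r.l1[tenantID] = v
	return v, nil
}

// revalidate models the invalidation flow: bypass the caches and,
// on a 403, delete the entry from both layers.
func (r *resolver) revalidate(tenantID string) error {
	_, err := r.getTenantConfig(tenantID, true)
	if errors.Is(err, errForbidden) {
		delete(r.l1, tenantID)
		delete(r.l2, tenantID)
	}
	return err
}

func main() {
	suspended := false
	r := &resolver{l1: map[string]string{}, l2: map[string]string{}}
	r.fetch = func(id string) (string, error) {
		if suspended {
			return "", errForbidden
		}
		return "config-for-" + id, nil
	}

	r.getTenantConfig("tenant-a", false) // cold: goes to tenant-manager
	r.l1 = map[string]string{}           // simulate a pod restart (L1 lost)
	v, _ := r.getTenantConfig("tenant-a", false)
	fmt.Println(v, "fetches:", r.fetches)

	suspended = true
	fmt.Println(r.revalidate("tenant-a"), len(r.l1), len(r.l2))
}
```

The connection eviction that accompanies a 403 is left out of the sketch, since it depends on pool internals not described here.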

Implementation Plan

Phase 1: Redis ConfigCache Adapter

New file: commons/tenant-manager/cache/redis.go

Implement existing ConfigCache interface (cache/config_cache.go:19-37) with Redis:

type RedisCache struct {
    client redis.UniversalClient
    prefix string
}

func NewRedisCache(client redis.UniversalClient, opts ...RedisCacheOption) *RedisCache
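func (c *RedisCache) Get(ctx context.Context, key string) (string, error)               // redis GET; returns ErrCacheMiss when the key is absent (redis.Nil)
func (c *RedisCache) Set(ctx context.Context, key, value string, ttl time.Duration) error // redis SET with TTL
func (c *RedisCache) Del(ctx context.Context, key string) error                          // redis DEL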

Tests: commons/tenant-manager/cache/redis_test.go — use miniredis

Phase 2: TieredCache Wrapper

New file: commons/tenant-manager/cache/tiered.go

Composes two ConfigCache implementations:

type TieredCache struct {
    l1    ConfigCache
    l2    ConfigCache
    l1TTL time.Duration
    l2TTL time.Duration
}

func NewTieredCache(l1, l2 ConfigCache, l1TTL, l2TTL time.Duration) *TieredCache

  • Get: L1 → miss → L2 → miss → ErrCacheMiss. On L2 hit, write back to L1 with l1TTL.
  • Set: write to L1 (short TTL) + L2 (long TTL).
  • Del: delete from both.

Tests: commons/tenant-manager/cache/tiered_test.go

Phase 3: WithTieredCache Convenience Option

File: commons/tenant-manager/client/client.go

func WithTieredCache(redisClient redis.UniversalClient, opts ...TieredCacheOption) ClientOption

Options: WithL1TTL(time.Duration), WithL2TTL(time.Duration), WithRedisPrefix(string)

Defaults: L1=5min, L2=12h, prefix=tm-config:

Creates TieredCache(InMemoryCache, RedisCache) and passes via existing WithCache.
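
One possible way to resolve the options and defaults listed above, shown as a sketch. The option names come from this plan; the `tieredCacheConfig` struct and `resolveTieredCacheConfig` helper are hypothetical, and the wiring into `WithCache` is left out because it depends on the existing client code.

```go
package main

import (
	"fmt"
	"time"
)

// tieredCacheConfig is a hypothetical holder for the resolved settings.
type tieredCacheConfig struct {
	l1TTL  time.Duration
	l2TTL  time.Duration
	prefix string
}

// TieredCacheOption mutates the config before the caches are built.
type TieredCacheOption func(*tieredCacheConfig)

func WithL1TTL(d time.Duration) TieredCacheOption {
	return func(c *tieredCacheConfig) { c.l1TTL = d }
}

func WithL2TTL(d time.Duration) TieredCacheOption {
	return func(c *tieredCacheConfig) { c.l2TTL = d }
}

func WithRedisPrefix(p string) TieredCacheOption {
	return func(c *tieredCacheConfig) { c.prefix = p }
}

// resolveTieredCacheConfig starts from the documented defaults
// (L1=5min, L2=12h, prefix=tm-config:) and applies options in order.
func resolveTieredCacheConfig(opts ...TieredCacheOption) tieredCacheConfig {
	cfg := tieredCacheConfig{
		l1TTL:  5 * time.Minute,
		l2TTL:  12 * time.Hour,
		prefix: "tm-config:",
	}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := resolveTieredCacheConfig(WithL2TTL(24 * time.Hour))
	fmt.Println(cfg.l1TTL, cfg.l2TTL, cfg.prefix)
}
```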


Existing Code to Reuse

Item                       Location
ConfigCache interface      cache/config_cache.go:19-37
InMemoryCache              cache/memory.go (used as L1)
WithCache(cc ConfigCache)  client/client.go:138-144
ErrCacheMiss               cache/config_cache.go:14
defaultCacheTTL            client/client.go:29

Key Files

File                                         Action
commons/tenant-manager/cache/redis.go        New
commons/tenant-manager/cache/redis_test.go   New
commons/tenant-manager/cache/tiered.go       New
commons/tenant-manager/cache/tiered_test.go  New
commons/tenant-manager/client/client.go      Edit (add WithTieredCache option)

Backward Compatibility

  • Without WithTieredCache, behavior unchanged (in-memory only)
  • WithCache still works for custom implementations
  • WithCacheTTL still controls L1 TTL when using WithTieredCache

Downstream Integration (midaz — separate PR)

After this is merged, midaz components will use WithTieredCache:

File                                                 Change
components/ledger/internal/bootstrap/config.go       Add env vars + WithTieredCache
components/onboarding/internal/bootstrap/config.go   Add env vars
components/transaction/internal/bootstrap/config.go  Add env vars
components/crm/internal/bootstrap/config.go          Add env vars
components/crm/internal/bootstrap/config.tenant.go   Use WithTieredCache

Env vars: MULTI_TENANT_CACHE_L1_TTL_SEC (default 300), MULTI_TENANT_CACHE_L2_TTL_SEC (default 43200)
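
A sketch of how a component might turn these env vars into durations. The variable names and defaults come from the line above; `parseTTLSeconds` and `ttlFromEnv` are hypothetical helpers, and falling back to the default on an empty, non-numeric, or non-positive value is an assumption rather than a stated requirement.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// parseTTLSeconds converts a seconds string into a duration.
// Assumption: empty, non-numeric, or non-positive input yields the default.
func parseTTLSeconds(raw string, defaultSec int) time.Duration {
	sec, err := strconv.Atoi(raw)
	if err != nil || sec <= 0 {
		sec = defaultSec
	}
	return time.Duration(sec) * time.Second
}

// ttlFromEnv reads the named env var and applies parseTTLSeconds.
func ttlFromEnv(name string, defaultSec int) time.Duration {
	return parseTTLSeconds(os.Getenv(name), defaultSec)
}

func main() {
	l1 := ttlFromEnv("MULTI_TENANT_CACHE_L1_TTL_SEC", 300)
	l2 := ttlFromEnv("MULTI_TENANT_CACHE_L2_TTL_SEC", 43200)
	fmt.Println("L1 TTL:", l1, "L2 TTL:", l2)
}
```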

Verification

  1. Start Midaz with env vars → verify Redis key created on first request
  2. Restart pod → first request reads from Redis (no HTTP to tenant-manager)
  3. Scale to 3 pods → all read from shared Redis
  4. Stop tenant-manager → requests for already-cached tenants continue from Redis for up to 12h (the L2 TTL)
  5. InvalidateConfig → both L1 and L2 cleared
  6. Suspend tenant → revalidation bypasses cache → detects 403 → evicts both layers
  7. Run all tests: go test ./commons/tenant-manager/... -count=1 -race
