@aa1ex aa1ex commented Jan 19, 2026

Description

Optimize Redis Operator for managing large numbers of objects (hundreds/thousands of Redis resources).

Problems solved:

  1. High API server load — controllers were polling every 10s even when nothing changed
  2. Slow error recovery — default RateLimiter backoff up to 1000s (~16 min) caused objects to get stuck
  3. Bug: objects got stuck waiting for the StatefulSet because Reconciled() was used instead of RequeueAfter() when the StatefulSet was not ready
  4. No event-driven reconciliation — Redis and RedisReplication controllers relied only on polling, not watching StatefulSet changes
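
To make problem 2 concrete, here is a rough model of per-item exponential backoff. It assumes the client-go default of a 5ms base delay doubling on every consecutive failure up to a 1000s cap, and compares that with the 30s cap this PR proposes. The function and constant names are illustrative only and are not taken from the operator's code. The overall token-bucket limiter that client-go combines with the per-item limiter is ignored.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative constants; names are hypothetical, values come from the
// client-go defaults and from the cap proposed in this PR.
const (
	baseDelay   = 5 * time.Millisecond // client-go default base delay
	defaultMax  = 1000 * time.Second   // default cap (about 16.7 min)
	proposedMax = 30 * time.Second     // cap proposed in this PR
)

// delayAfter models per-item exponential backoff: the delay is
// base * 2^failures, capped at limit. "failures" is the number of
// consecutive failures already recorded for the item.
func delayAfter(base, limit time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= limit {
			return limit
		}
	}
	if d > limit {
		return limit
	}
	return d
}

// failuresUntilCap returns how many consecutive failures it takes
// before the computed delay first reaches the cap.
func failuresUntilCap(base, limit time.Duration) int {
	n := 0
	for delayAfter(base, limit, n) < limit {
		n++
	}
	return n
}

func main() {
	for _, limit := range []time.Duration{defaultMax, proposedMax} {
		n := failuresUntilCap(baseDelay, limit)
		fmt.Printf("cap=%v reached after %d consecutive failures\n", limit, n)
	}
}
```

Under these assumptions the default cap is reached after 18 consecutive failures and the 30s cap after 13. From that point on, an object that keeps failing is retried at most once per cap interval.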

Changes:

  • Add Owns(&appsv1.StatefulSet{}) to Redis and RedisReplication controllers for event-driven reconciliation instead of polling
  • Add custom RateLimiter with max backoff of 30s instead of default 1000s
  • Fix bug where Reconciled() was used instead of RequeueAfter when StatefulSet is not ready, causing objects to get stuck
  • Increase periodic reconcile interval from 10s to 5min for healthy state
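
The requeue fix and the longer healthy interval can be sketched as a single decision function. This is a simplified, standard-library-only model, not the operator's code. `Result` stands in for controller-runtime's `reconcile.Result`, `nextResult` is a hypothetical helper, and the 10s not-ready interval is an assumption because the PR does not state that value. The sketch also assumes that Reconciled() amounts to returning an empty result with no scheduled retry.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a stand-in for controller-runtime's reconcile.Result,
// reduced to the single field this sketch needs.
type Result struct {
	RequeueAfter time.Duration
}

const (
	healthyInterval  = 5 * time.Minute  // from the PR: periodic reconcile in healthy state
	notReadyInterval = 10 * time.Second // assumption: not specified in the PR
)

// nextResult decides when the object should be reconciled again,
// based on how many StatefulSet replicas are ready.
func nextResult(readyReplicas, desiredReplicas int32) Result {
	if readyReplicas < desiredReplicas {
		// An empty Result here would schedule no retry, so the object
		// could stay stuck until some unrelated event arrives.
		return Result{RequeueAfter: notReadyInterval}
	}
	// Healthy: rely on watch events and fall back to a slow periodic resync.
	return Result{RequeueAfter: healthyInterval}
}

func main() {
	fmt.Println("not ready:", nextResult(1, 3).RequeueAfter)
	fmt.Println("healthy:  ", nextResult(3, 3).RequeueAfter)
}
```

With the controllers also watching owned StatefulSets, a readiness change should trigger a reconcile on its own. In this model the timed requeue is only a safety net.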

Expected impact:

| Metric | Before | After |
| --- | --- | --- |
| Max backoff on errors | ~16 min | 30 sec |
| Polling interval (healthy) | 10 sec | 5 min |
| API server load (1000 objects) | ~100 req/s constant | ~3 req/s + events |
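
The API server load figures follow from simple arithmetic, assuming each periodic reconcile costs roughly one API request per object. That assumption is not stated in the PR, and real reconciles may issue more requests. A minimal sketch with hypothetical names:

```go
package main

import (
	"fmt"
	"time"
)

const (
	objects    = 1000             // number of managed Redis resources in the estimate
	pollBefore = 10 * time.Second // polling interval before this PR
	pollAfter  = 5 * time.Minute  // healthy-state interval after this PR
)

// steadyStateRate estimates requests per second when every object is
// reconciled once per interval and each reconcile costs one request.
func steadyStateRate(n int, interval time.Duration) float64 {
	return float64(n) / interval.Seconds()
}

func main() {
	fmt.Printf("before: %.1f req/s\n", steadyStateRate(objects, pollBefore))
	fmt.Printf("after:  %.1f req/s\n", steadyStateRate(objects, pollAfter))
}
```

This gives 100 req/s before and about 3.3 req/s after, in line with the table. Event-driven reconciles add load on top of that figure only when a StatefulSet actually changes.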

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • All existing tests pass (no new tests added).
  • Functionality/bugs have been confirmed to be unchanged or fixed.
  • I have performed a self-review of my own code.
  • Documentation has been updated or added where necessary.

Co-authored-by: Denis Khachyan <khachyanda@gmail.com>
Signed-off-by: Aleksandrov Aleksandr <aaleksandrov.cy@gmail.com>
@aa1ex aa1ex force-pushed the fix/large-cluster-optimization branch from 4851540 to 8715fda Compare January 20, 2026 15:34