fix: optimize reconciliation for large clusters #1

aa1ex · 2026-01-19T14:54:01Z

Description

Optimize Redis Operator for managing large numbers of objects (hundreds/thousands of Redis resources).

Problems solved:

High API server load — controllers were polling every 10s even when nothing changed
Slow error recovery — default RateLimiter backoff up to 1000s (~16 min) caused objects to get stuck
Bug: objects stuck waiting for StatefulSet — Reconciled() was used instead of RequeueAfter() when StatefulSet is not ready
No event-driven reconciliation — Redis and RedisReplication controllers relied only on polling, not watching StatefulSet changes
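
To make the stuck-object bug concrete, here is a minimal sketch of the requeue decision. It uses a local `Result` type as a stand-in for controller-runtime's `reconcile.Result`, so it runs without any Kubernetes dependencies. The helper `nextResult` and the 10s not-ready retry interval are hypothetical and not taken from this PR. The 5min healthy interval matches the change described here. In controller-runtime, a zero `Result` with a nil error means the object is not requeued, which is why returning it while the StatefulSet is still rolling out leaves the object waiting for some unrelated event.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a local stand-in for the RequeueAfter field of
// controller-runtime's reconcile.Result (assumption: only this field matters here).
type Result struct {
	RequeueAfter time.Duration
}

const (
	notReadyRetry = 10 * time.Second // hypothetical retry interval while the StatefulSet is not ready
	healthyResync = 5 * time.Minute  // periodic interval for healthy objects, as stated in this PR
)

// nextResult decides when the object should be reconciled again.
// Returning Result{} in the not-ready branch would reproduce the bug:
// nothing schedules another reconcile, so the object can stay stuck.
func nextResult(readyReplicas, wantReplicas int32) Result {
	if readyReplicas < wantReplicas {
		return Result{RequeueAfter: notReadyRetry}
	}
	return Result{RequeueAfter: healthyResync}
}

func main() {
	fmt.Println("not ready:", nextResult(1, 3).RequeueAfter)
	fmt.Println("ready:    ", nextResult(3, 3).RequeueAfter)
}
```

With `Owns(&appsv1.StatefulSet{})` in place, a StatefulSet status change would also trigger a reconcile, so the timed requeue acts as a safety net rather than the only wake-up path.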

Changes:

Add Owns(&appsv1.StatefulSet{}) to Redis and RedisReplication controllers for event-driven reconciliation instead of polling
Add custom RateLimiter with max backoff of 30s instead of default 1000s
Fix bug where Reconciled() was used instead of RequeueAfter() when the StatefulSet is not ready, which caused objects to get stuck
Increase periodic reconcile interval from 10s to 5min for healthy state
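
For reviewers who want to see what the lower backoff cap means in practice, the sketch below models per-item exponential backoff with only the standard library. It is not the operator's code. The doubling rule and the 5ms base delay are assumptions based on how the default controller-runtime and client-go failure rate limiter behaves. The 1000s and 30s caps are the values from this PR.

```go
package main

import (
	"fmt"
	"time"
)

const (
	base       = 5 * time.Millisecond // assumed base delay (not stated in this PR)
	defaultCap = 1000 * time.Second   // default max backoff mentioned in this PR
	customCap  = 30 * time.Second     // new max backoff from this PR
)

// backoff returns the delay applied after the given number of consecutive
// failures: base for the first failure, doubled for each further failure,
// and clamped to max.
func backoff(base, max time.Duration, failures int) time.Duration {
	if base >= max {
		return max
	}
	d := base
	for i := 1; i < failures; i++ {
		if d > max/2 {
			// Doubling would exceed max; clamping here also avoids overflow.
			return max
		}
		d *= 2
	}
	return d
}

func main() {
	for _, n := range []int{1, 5, 10, 14, 19} {
		fmt.Printf("failures=%2d  default cap: %-12v custom cap: %v\n",
			n, backoff(base, defaultCap, n), backoff(base, customCap, n))
	}
}
```

Under these assumptions the two limiters behave identically for the first failures and only diverge once the delay would exceed 30s, so the change shortens the worst-case wait without making early retries more aggressive.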

Expected impact:

Metric	Before	After
Max backoff on errors	~16 min	30 sec
Polling interval (healthy)	10 sec	5 min
API server load (1000 objects)	~100 req/s constant	~3 req/s + events
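
The load figures in the last row follow from simple arithmetic, sketched below. The sketch assumes each periodic reconcile costs roughly one API request per object, which is a simplification. Real reconciles may issue several requests, and cached reads may issue none. The function name `steadyStateRate` is made up for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	pollBefore = 10 * time.Second // polling interval before this PR
	pollAfter  = 5 * time.Minute  // polling interval for healthy objects after this PR
)

// steadyStateRate returns the average number of periodic reconciles per
// second when every object is requeued once per interval.
func steadyStateRate(objects int, interval time.Duration) float64 {
	return float64(objects) / interval.Seconds()
}

func main() {
	fmt.Printf("before: %.1f reconciles/s\n", steadyStateRate(1000, pollBefore))
	fmt.Printf("after:  %.1f reconciles/s\n", steadyStateRate(1000, pollAfter))
}
```

The "+ events" part of the table is not modeled here, because it depends on how often StatefulSets actually change in the cluster.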

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist

All existing tests pass (no new tests added).
Functionality/bugs have been confirmed to be unchanged or fixed.
I have performed a self-review of my own code.
Documentation has been updated or added where necessary.

- Add Owns(&appsv1.StatefulSet{}) to Redis and RedisReplication controllers for event-driven reconciliation instead of polling
- Add custom RateLimiter with max backoff of 30s instead of default 1000s
- Fix bug where Reconciled() was used instead of RequeueAfter when StatefulSet is not ready, causing objects to get stuck
- Increase periodic reconcile interval from 10s to 5min for healthy state

Co-authored-by: Denis Khachyan <khachyanda@gmail.com>
Signed-off-by: Aleksandrov Aleksandr <aaleksandrov.cy@gmail.com>

aa1ex assigned dkhachyan Jan 19, 2026

aa1ex force-pushed the fix/large-cluster-optimization branch from 4851540 to 8715fda Compare January 20, 2026 15:34
