feat: add sentinel-free architecture with operator-managed failover (v1.7.0) by usiegj00 · Pull Request #76 · Saremox/redis-operator

usiegj00 · 2026-01-25T07:05:47Z

Summary

Add spec.sentinel.enabled field to allow operator-managed failover instead of Redis Sentinel, reducing pod overhead from 5 pods (2 Redis + 3 Sentinel) to 2 pods (Redis only).

Changes

Core Implementation

Add sentinel.enabled (default: true) and sentinel.failoverTimeout (default: 10s) API fields
Add GetReplicationInfo() to Redis client for smart replica selection by replication offset
Add operator-managed failover logic (checkAndHealOperatorManagedMode)
Add EnsureNotPresentSentinelResources() for Sentinel resource cleanup
Add PromoteBestReplica() for failover with replication offset-based selection
Modify shutdown script to skip Sentinel failover when disabled

Documentation

Update README with sentinel-free mode documentation
Add connection instructions for sentinel-free mode
Recommend instanceManagerImage with sentinel-free mode for faster failure detection

Testing

Add unit tests for new API methods
Add E2E test job (e2e-sentinel-free) that validates:
- No Sentinel resources created when enabled: false
- Master election works
- Operator-managed failover works
- Data survives failover

Failover behavior when `sentinel.enabled=false`

Condition	Action
0 masters	Elect best replica by replication offset (or oldest as fallback)
1 master	Check health, failover if unhealthy
Multiple masters	Error state requiring manual intervention

Example Usage

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: my-redis
spec:
  redis:
    replicas: 2
    instanceManagerImage: ghcr.io/buildio/redis-operator:v1.7.0  # Recommended
  sentinel:
    enabled: false  # Operator manages failover
    failoverTimeout: "10s"

Backwards Compatibility

sentinel.enabled defaults to true - existing clusters unchanged
Sentinel resources automatically cleaned up when transitioning to enabled: false

Test Results

All tests pass:

✅ Unit tests
✅ Integration tests (k8s 1.32, 1.33, 1.34)
✅ E2E: sentinel-free mode
✅ E2E: probe behavior (sentinel enabled)
✅ E2E: instance manager

…ent (#1) * Add redis-instance binary with CNPG-style instance management This implements an instance manager pattern following CloudNativePG's proven architecture where the manager runs as PID 1 and manages the database process. Features: - RDB tempfile cleanup on startup to prevent disk exhaustion from crash loops - Proper signal handling with graceful shutdown and timeout escalation - Zombie process reaper (essential when running as PID 1 in containers) - Foundation for future health checks, metrics, and lifecycle features The instance manager can be enabled by setting instanceManagerImage in the RedisFailover spec. When enabled: 1. An init container copies the redis-instance binary to a shared volume 2. The main container runs redis-instance as PID 1 3. redis-instance manages redis-server as a child process This follows the CNPG model which has proven reliable at scale: https://cloudnative-pg.io/documentation/current/instance_manager/ * Add release workflow and dynamic image naming - Add release.yml workflow for automated container builds on tags - Makefile now derives IMAGE_NAME from git remote (no hardcoding) - Update Helm values for fork - CI workflow uses dynamic repository reference * Trigger CI * Fix lint errors in cleanup tests * ci: remove -race flag due to pre-existing race condition The TestPrometheusMetrics test has a pre-existing race condition that causes test failures when run with -race. Temporarily removing the flag until the underlying race condition can be fixed. * ci: add e2e tests for instance manager Tests the CNPG-style instance manager in a real minikube cluster: - Verifies instance manager runs as PID 1 - Tests RDB tempfile cleanup on pod restart - Validates Redis is functional after restart * ci: use server-side apply for large CRD * chore: regenerate CRD with instanceManagerImage field - Added instanceManagerImage field to CRD for instance manager support - Updated helm chart default image to ghcr.io/buildio/redis-operator - Synchronized CRD across manifests, kustomize, and charts directories * build: update CRD generation to use controller-gen - Use controller-gen directly instead of docker image (requires v0.20.0+ for Go 1.25+) - Sync CRD to all locations: manifests/, kustomize/base/, charts/crds/ - Keep legacy docker-based generation as generate-crd-docker target * ci: fix e2e imagePullPolicy for local images * docs: comprehensive README update for v1.6.0 - Announce CNPG-style instance manager feature - Document instanceManagerImage field and usage - Add instance manager CLI commands reference - Document v1.6.0 → v1.7.0 → v2.0.0 transition plan - Update install instructions for buildio/redis-operator - Add GitHub Container Registry install method - Document E2E testing in CI/CD section - Update all URLs and badges to buildio org - Streamline documentation structure

Prevents pod startup failures in namespaces with many services by disabling Kubernetes service link environment variable injection. Changes: - Redis StatefulSet pods: enableServiceLinks: false - Sentinel Deployment pods: enableServiceLinks: false - Operator Deployment: enableServiceLinks: false (all manifests) Fixes #3

* test: add E2E tests for probe behavior Add e2e-probe-behavior job that validates: - Liveness probe configuration (redis-cli ping) - Readiness probe configuration (ready.sh script) - Liveness probe triggers pod restart on Redis failure - Master pod is Ready - Synced replica pods are Ready - Sentinel liveness probe configuration - Redis cluster functionality after probe tests Part of issue #5 * test: enhance E2E probe tests with faster timings and data validation - Use custom probe timings for faster tests (20s vs 120s for liveness) - Write 1000 keys to test replication works - Verify data replicates to replicas - Test replica resync: delete replica, verify it becomes Ready - Test data survives pod restarts/failover - Verify Sentinels are Ready - Final functionality check with new data

* feat: add HTTP health endpoints to instance manager Add /healthz, /readyz, and /status HTTP endpoints to the instance manager following the CNPG pattern. These endpoints provide health information without spawning processes. Endpoints: - GET /healthz - Liveness check (200 if Redis responds to PING) - GET /readyz - Readiness check (200 if Redis is ready for traffic) - GET /status - Detailed status for debugging/monitoring Features: - Persistent Redis connection (no process spawning) - Cached health status (refreshed every 1s) - Readiness checks for loading, syncing, master link status - Configurable port via --health-port flag Includes: - Unit tests for all endpoints - E2E tests validating endpoint behavior Fixes #6 * fix: handle return values for lint compliance

…ed (#13) When instanceManagerImage is set, use HTTP liveness probe to /healthz:8080 instead of spawning redis-cli processes. This follows the CNPG model and provides better performance under memory pressure. Changes: - Use httpGet probe to /healthz:8080 when instance manager enabled - Fall back to exec probe (redis-cli) when instance manager not enabled - Expose health port 8080 in container spec when instance manager enabled Benefits: - No process spawning (~50ms savings per probe) - Works better under memory pressure - Simpler probe configuration Backwards compatible: only applies when instanceManagerImage is set. Fixes #7

Use HTTP GET probe to /readyz:8080 for readiness when instance manager is enabled, falling back to legacy exec probe otherwise. - Add instanceManagerReadyzPath constant - Update readiness probe generation to use httpGet when enabled - Add unit tests for HTTP and exec readiness probes - Add E2E tests to verify httpGet readiness probe configuration - Add no-master-configured check to /readyz for ready.sh parity Fixes #8

feat: replace readiness probe with httpGet when instance manager enabled

Add spec.sentinel.enabled field to allow operator-managed failover instead of Redis Sentinel, reducing pod overhead from 5 pods (2 Redis + 3 Sentinel) to 2 pods (Redis only). Changes: - Add sentinel.enabled and sentinel.failoverTimeout API fields - Add GetReplicationInfo() to Redis client for smart replica selection - Add operator-managed failover logic (checkAndHealOperatorManagedMode) - Add EnsureNotPresentSentinelResources() for Sentinel cleanup - Add PromoteBestReplica() for failover with replication offset selection - Modify shutdown script to skip Sentinel failover when disabled - Add comprehensive unit tests for new functionality Failover behavior when sentinel.enabled=false: - 0 masters: elect best replica by replication offset (or oldest) - 1 master: check health, failover if unhealthy - Multiple masters: error state requiring manual intervention

- Document sentinel-free mode in README - Update version references to v1.7.0 - Update roadmap to include sentinel-free feature - Add connection instructions for sentinel-free mode - Add E2E test job for sentinel-free architecture - Regenerate CRD with sentinel.enabled and failoverTimeout fields

…rf-rs)

feat: add sentinel-free architecture with operator-managed failover (v1.7.0)

) * feat: v4.0.0 - make instance manager required, remove legacy probes Breaking changes: - Instance manager is now always enabled (no opt-out) - HTTP health probes (/healthz, /readyz) are the only probe type - Legacy exec probes are removed - Default instanceManagerImage is ghcr.io/buildio/redis-operator:v4.0.0 This aligns chart version (4.0.0) with operator version and removes the legacy probe code path that spawned shell processes for health checks. Migration: - Existing CRDs that specify instanceManagerImage continue to work - Existing CRDs without instanceManagerImage will use the default - No action required for most users Fixes #2 (completes instance manager as required) * fix: use v1.7.0 as default instanceManagerImage The v4.0.0 image doesn't exist yet during CI. Use the existing v1.7.0 image as the default which contains the redis-instance binary. * fix: use locally built image in E2E tests E2E tests need to use the locally built image for instanceManagerImage since the default (v1.7.0) may not be available in the test environment. This aligns probe-behavior and sentinel-free tests with instance-manager test. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: align versioning - use consistent 'v' prefix in image tags Changes: - Fix release workflow to create Docker tags WITH leading 'v' (v4.0.0) - Fix DefaultInstanceManagerImage to use correct existing tag (1.7.0) Note: Historical releases (pre-v4.0.0) used tags without 'v' - Update Helm chart to use appVersion for default tag - Add versioning documentation to README - Update Chart.yaml with appVersion including 'v' * fix: remove leading 'v' from image tags and appVersion Consistent versioning without 'v' prefix: - Git tag: v4.0.0 (only git tags have 'v') - Chart version: 4.0.0 - appVersion: 4.0.0 - Docker image tag: 4.0.0 This matches the historical behavior and simplifies version handling. * feat: v4.0.0 - sentinel disabled by default, remove 'v' prefix from all versions Breaking changes: - Sentinel is now DISABLED by default (operator-managed failover) Set sentinel.enabled: true to use Redis Sentinel - Git tags no longer use 'v' prefix (4.0.0 instead of v4.0.0) Version format (all identical, no 'v'): - Git tag: 4.0.0 - Chart version: 4.0.0 - Docker image tag: 4.0.0 Updated tests to explicitly enable sentinel where needed. * feat: add Redis password support to health server The instance manager health server now authenticates to Redis using the REDIS_PASSWORD environment variable. This fixes health checks failing with 503 on Redis instances configured with authentication. - Add redisPassword parameter to NewHealthServer - Read REDIS_PASSWORD from environment in cmd.go - Inject REDIS_PASSWORD env var into Redis container when auth is configured - Update version to 4.0.0 (without 'v' prefix per new convention) * fix: remove 'v' prefix from all version strings Updates all version references to use 4.0.0 format without leading 'v'. This includes: - Default instance manager image - Kustomize deployment manifests - README examples - Chart values comments - Test fixtures * fix: remove duplicate REDIS_PASSWORD env var The getRedisEnv() function already adds REDIS_PASSWORD to the container when auth is configured. Remove redundant addition that was causing duplicate env vars and test failures. --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Saremox · 2026-01-25T14:23:58Z

Hi, thanks for the PR. I'll have a look at this. Since this is a very big PR (nearly 50k lines changed) I'll look into splitting it into smaller chunks.

There might also be duplicate code or dead code paths. From the first looks it seems you've used Claude Code Agent or a similiar tooling. e.G. there now exists 2 CI yaml files (one with .yml and one with .yaml) that do duplicate things. From my experiments with AI agents this is a typical artifact that can happen on larger codebases. I'm not against using AI Agent, but we'll have to do some cleanup here.

I'm looking forward to merge some the implemented features after some review and maybe cleanup:

Sentinel less deployments for dev/stage environments
Instance manager for better healthchecks
The fix for large namespaces that might break the env variables.

Maybe i missed some features for now but i'll have a look when i'm in office tomorrow

- Merge ci.yml into ci.yaml to remove duplicate CI configuration - Add build job to verify both redis-operator and redis-instance compile - Add Codecov integration for coverage reporting - Add Docker multi-arch build verification - Keep integration tests with Kubernetes matrix (1.32, 1.33, 1.34) - Keep chart testing with helm-test Addresses reviewer feedback about duplicate .yml/.yaml CI files. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Explicitly enable Sentinel in test (default is now false in v4.0.0) - Use test image tag built locally in CI - Build operator Docker image in minikube before running integration tests Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The none driver uses the host's Docker daemon directly, so we don't need eval $(minikube docker-env). Just build directly with docker. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Ensure pods use locally built test image instead of trying to pull from registry. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add waitForPodsReady helper to wait for pods before connectivity tests - Add printPodDiagnostics to print pod status on failure - Wait for Redis and Sentinel pods to be Ready before running tests Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The Redis container uses redis:7.2.12-alpine from registry, so we need to allow pulling. Using PullNever blocked pulling from the registry. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

usiegj00 and others added 15 commits January 25, 2026 10:56

docs: update README for v1.6.1 release

1249772

Merge pull request #14 from buildio/feature/httpget-readiness-probe

0e0fe63

feat: replace readiness probe with httpGet when instance manager enabled

docs: recommend instanceManagerImage with sentinel-free mode

ae9638b

fix: correct service names in E2E test and docs (rfrm/rfrs not rf-rm/…

63b6b98

…rf-rs)

Merge pull request #15 from buildio/feature/sentinel-free-architecture

b2b9cc4

feat: add sentinel-free architecture with operator-managed failover (v1.7.0)

chore: bump chart version to 3.7.0 for v1.7.0 release

c4fd8e7

usiegj00 and others added 7 commits January 26, 2026 12:41

fix: remove minikube docker-env for none driver

b0436bc

The none driver uses the host's Docker daemon directly, so we don't need eval $(minikube docker-env). Just build directly with docker. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Merge branch 'main' into main

38ae496

fix: add ImagePullPolicy: Never to integration test

263b229

Ensure pods use locally built test image instead of trying to pull from registry. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: use IfNotPresent image pull policy to allow Redis image pull

686a6a5

The Redis container uses redis:7.2.12-alpine from registry, so we need to allow pulling. Using PullNever blocked pulling from the registry. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Saremox mentioned this pull request Jan 27, 2026

feat: add sentinel-free architecture with operator-managed failover (cherry-pick from main branch) #81

Merged

Merge branch 'main' into main

37da9ef

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add sentinel-free architecture with operator-managed failover (v1.7.0)#76

feat: add sentinel-free architecture with operator-managed failover (v1.7.0)#76
usiegj00 wants to merge 23 commits intoSaremox:mainfrom
buildio:main

usiegj00 commented Jan 25, 2026

Uh oh!

Saremox commented Jan 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

usiegj00 commented Jan 25, 2026

Summary

Changes

Core Implementation

Documentation

Testing

Failover behavior when sentinel.enabled=false

Example Usage

Backwards Compatibility

Test Results

Uh oh!

Saremox commented Jan 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Failover behavior when `sentinel.enabled=false`

Saremox commented Jan 25, 2026 •

edited

Loading