
Disable server-side cluster health override by default #247

Open
mhotan wants to merge 1 commit into main from mike/disable-server-side-cluster-health-override

Conversation

@mhotan (Contributor) commented Feb 21, 2026

Summary

  • Defaults requiredDependentServices to an empty map ({}) in the cluster service config
  • This disables the server-side health override in determineClusterHealth() (cloud/cluster/service/cluster.go:602-621) that re-evaluates operator-reported monitoring_info and overrides cluster health to UNHEALTHY
  • Deployments that want server-side health gating can explicitly configure the dependent services they require

Context

In self-hosted deployments, propeller can crashloop briefly at startup. The operator reports health:HEALTHY (since requiredForHealth defaults to false), but the cluster service sees propeller in monitoring_info with consecutive_failures >= 2 and overrides the cluster to UNHEALTHY. The workflow service then filters it out, returning "no clusters found" to end users.

The cluster health determination is split between two places with two separate configs — this change removes the server-side override so the operator is the single source of truth for cluster health.

Status: Needs rescoping — the proper fix is to enable requiredForHealth=true on the operator (DP) and remove the redundant CP-side override together. See comment.

Related: FAB-109 — longer-term consolidation of cluster health into the dataplane operator.

Test plan

  • Verify cluster service starts without errors with empty requiredDependentServices
  • Verify cluster health reflects operator-reported health (no server-side override)
  • Verify existing deployments that explicitly set requiredDependentServices are unaffected

🤖 Generated with Claude Code


mhotan force-pushed the mike/disable-server-side-cluster-health-override branch from 654f8eb to 6e70f1c on February 21, 2026 at 01:51
@mhotan (Contributor, Author) commented Feb 27, 2026

ref FAB-119 — this change mitigates the "no clusters" issue by disabling the server-side cluster health override that was incorrectly marking dataplanes as unhealthy when propeller crashlooped at startup.

The cluster service has a `requiredDependentServices` config that re-evaluates operator-reported monitoring_info and overrides cluster health to UNHEALTHY server-side. This causes "no clusters found" errors when dependent services (e.g. propeller) are temporarily unavailable at startup, even though the operator reports the cluster as healthy.

Default this to an empty map so the cluster service trusts the operator's self-reported health. Deployments that want server-side health gating can explicitly configure the services they need.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@katrogan (Contributor) commented Mar 4, 2026

> The workflow service then filters it out, returning "no clusters found" to end users.

Hey @mhotan, sorry if I'm not following, but isn't this behavior desirable? If propeller is crashlooping, the cluster isn't healthy and is therefore ineligible for work. Are we not correctly resetting the consecutive failures and eventually updating the cluster status to healthy? That seems like an issue then

@mhotan (Contributor, Author) commented Mar 4, 2026

Good point — you're right that if propeller is crashlooping, the cluster shouldn't be accepting work that depends on it (V1 executions).

The real issue is that this health check lives on the control plane (cluster service) rather than the data plane (operator). The operator already has requiredForHealth config that can gate cluster health on dependent services, but it defaults to false. The fix should be:

  1. Enable requiredForHealth = true on the data plane operator config
  2. Let the operator be the single source of truth for cluster health
  3. Remove the redundant server-side re-evaluation in the cluster service

That's a bigger change than what this PR does — I'll pull this out of the review stack and scope it properly. The other PRs in the stack (#226, #269) don't depend on this.

