Disable server-side cluster health override by default #247
Conversation
Force-pushed from 654f8eb to 6e70f1c
ref FAB-119 — this change mitigates the "no clusters" issue by disabling the server-side cluster health override that was incorrectly marking dataplanes as unhealthy when propeller crashlooped at startup.
The cluster service has a `requiredDependentServices` config that re-evaluates operator-reported `monitoring_info` and overrides cluster health to `UNHEALTHY` server-side. This causes "no clusters found" errors when dependent services (e.g. propeller) are temporarily unavailable at startup, even though the operator reports the cluster as healthy. Default this to an empty map so the cluster service trusts the operator's self-reported health. Deployments that want server-side health gating can explicitly configure the services they need.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 6e70f1c to 6d229e0
Hey @mhotan sorry if I'm not following, but isn't this behavior desirable? If propeller is crashlooping, the cluster isn't healthy and therefore ineligible for work. Are we not correctly resetting the consecutive failures and eventually updating the cluster status to healthy? That seems like an issue then.
Good point @katrogan — you're right that if propeller is crashlooping, the cluster shouldn't be accepting work that depends on it (V1 executions). The real issue is that this health check lives on the control plane (cluster service) rather than the data plane (operator). The operator already has a `requiredForHealth` setting (currently defaulting to `false`) that could gate its own reported health on propeller.

That's a bigger change than what this PR does — I'll pull this out of the review stack and scope it properly. The other PRs in the stack (#226, #269) don't depend on this.
Summary
- Default `requiredDependentServices` to an empty map (`{}`) in the cluster service config
- This disables the check in `determineClusterHealth()` (cloud/cluster/service/cluster.go:602-621) that re-evaluates operator-reported `monitoring_info` and overrides cluster health to `UNHEALTHY`

Context
In self-hosted deployments, propeller can crashloop briefly at startup. The operator reports `health: HEALTHY` (since `requiredForHealth` defaults to `false`), but the cluster service sees propeller in `monitoring_info` with `consecutive_failures >= 2` and overrides the cluster to `UNHEALTHY`. The workflow service then filters it out, returning "no clusters found" to end users.

The cluster health determination is split between two places with two separate configs — this change removes the server-side override so the operator is the single source of truth for cluster health.
Status: Needs rescoping — the proper fix is to enable `requiredForHealth=true` on the operator (DP) and remove the redundant CP-side override together. See comment.

Related: FAB-109 — longer-term consolidation of cluster health into the dataplane operator.
Test plan
- Default config leaves `requiredDependentServices` empty
- Deployments that explicitly configure `requiredDependentServices` are unaffected

🤖 Generated with Claude Code