
Disable server-side cluster health override by default #247

Open
mhotan wants to merge 1 commit into main from mike/disable-server-side-cluster-health-override

Conversation

@mhotan (Contributor) commented Feb 21, 2026

Summary

  • Defaults requiredDependentServices to an empty map ({}) in the cluster service config
  • This disables the server-side health override in determineClusterHealth() (cloud/cluster/service/cluster.go:602-621) that re-evaluates operator-reported monitoring_info and overrides cluster health to UNHEALTHY
  • Deployments that want server-side health gating can explicitly configure the dependent services they require

Context

In self-hosted deployments, propeller can crashloop briefly at startup. The operator reports health:HEALTHY (since requiredForHealth defaults to false), but the cluster service sees propeller in monitoring_info with consecutive_failures >= 2 and overrides the cluster to UNHEALTHY. The workflow service then filters it out, returning "no clusters found" to end users.

The cluster health determination is split between two places with two separate configs — this change removes the server-side override so the operator is the single source of truth for cluster health.

Status: Needs rescoping — the proper fix is to enable requiredForHealth=true on the operator (DP) and remove the redundant CP-side override together. See comment.

Related: FAB-109 — longer-term consolidation of cluster health into the dataplane operator.

Test plan

  • Verify cluster service starts without errors with empty requiredDependentServices
  • Verify cluster health reflects operator-reported health (no server-side override)
  • Verify existing deployments that explicitly set requiredDependentServices are unaffected

🤖 Generated with Claude Code


mhotan force-pushed the mike/disable-server-side-cluster-health-override branch from 654f8eb to 6e70f1c on February 21, 2026 at 01:51
@mhotan (Contributor, Author) commented Feb 27, 2026

ref FAB-119 — this change mitigates the "no clusters" issue by disabling the server-side cluster health override that was incorrectly marking dataplanes as unhealthy when propeller crashlooped at startup.

The cluster service has a `requiredDependentServices` config that re-evaluates operator-reported monitoring_info and overrides cluster health to UNHEALTHY server-side. This causes "no clusters found" errors when dependent services (e.g. propeller) are temporarily unavailable at startup, even though the operator reports the cluster as healthy.

Default this to an empty map so the cluster service trusts the operator's self-reported health. Deployments that want server-side health gating can explicitly configure the services they need.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@katrogan (Contributor) commented Mar 4, 2026

> The workflow service then filters it out, returning "no clusters found" to end users.

Hey @mhotan, sorry if I'm not following, but isn't this behavior desirable? If propeller is crashlooping, the cluster isn't healthy and is therefore ineligible for work. Are we not correctly resetting the consecutive failures and eventually updating the cluster status to healthy? That seems like an issue then

@mhotan (Contributor, Author) commented Mar 4, 2026

Good point — you're right that if propeller is crashlooping, the cluster shouldn't be accepting work that depends on it (V1 executions).

The real issue is that this health check lives on the control plane (cluster service) rather than the data plane (operator). The operator already has requiredForHealth config that can gate cluster health on dependent services, but it defaults to false. The fix should be:

  1. Enable requiredForHealth = true on the data plane operator config
  2. Let the operator be the single source of truth for cluster health
  3. Remove the redundant server-side re-evaluation in the cluster service

That's a bigger change than what this PR does — I'll pull this out of the review stack and scope it properly. The other PRs in the stack (#226, #269) don't depend on this.

