new approach to featuregates for coordination in the cluster #1373

deads2k · 2023-03-29T19:52:51Z

This enhancement aims to reduce the effort required to add a feature gate to TechPreviewNoUpgrade and to promote
that feature gate to Default.
Feature gates in OpenShift are enabled and disabled in a particular FeatureSet; mixing and matching is not allowed.
Prior to this enhancement, it is necessary to vendor openshift/api into every impacted repository.
After this enhancement, it will only be necessary to vendor into cluster-config-operator.

cc @openshift/openshift-staff-engineers @JoelSpeed

JoelSpeed

Looks great, just got one question about how we are going to test feature promotion

dev-guide/feature-gate.md

bparees · 2023-03-30T23:11:15Z

dev-guide/feature-gate.md

+
+1. Change thresholds for requiring TechPreviewNoUpgrade.
+   This is a possible future goal, but an update here would be required.
+2. Change thresholds for promoting from TechPreviewNoUpgrade to Default.


do we feel we have a well defined existing threshold? if not, this non-goal seems to conflict w/ the user story above of "i have sufficient evidence to promote this to on by default" (I don't know how we meet that user story w/o defining those thresholds, so if they aren't well defined then by implication they need to be changed)

It's been informal so far. I'm willing to describe in more detail, but to my knowledge, the informal, existing mechanism has been working well and I wasn't inclined to change it.

bparees · 2023-03-30T23:15:47Z

dev-guide/feature-gate.md

+To do this we will
+1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed
+   in the CVO history.
+2. Update cluster-config-operator to have a control loop owned by api-approvers to set the FeatureGateStatus


again not really sure why this would be owned by api-approvers.

I see api review/approval as a distinct need/skillset/requirement that is unrelated to "definition of the maturity of a feature"

is it just the lack of another obvious team/set of people to own this code?

> is it just the lack of another obvious team/set of people to own this code?

Yes. And avoiding the dumping ground of unowned cruft that was developing six months back or so. Small quantity of code to bootstrap a configuration API owned by the people approving those configuration APIs.
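For readers following the FeatureGateStatus change quoted at the top of this thread, here is a minimal sketch of a per-version status and a lookup against it. The type names FeatureGateStatus, FeatureGateDetails, and FeatureGateAttributes come from the proposal text; the `Enabled`, `Disabled`, and `FeatureGates` field names, the helper functions, and the gate names are assumptions made for illustration, not the merged API.

```go
package main

import "fmt"

// FeatureGateAttributes names a single feature gate.
type FeatureGateAttributes struct {
	Name string
}

// FeatureGateDetails lists the gates for one payload version.
// The Enabled/Disabled field names are assumed for this sketch.
type FeatureGateDetails struct {
	Version  string
	Enabled  []FeatureGateAttributes
	Disabled []FeatureGateAttributes
}

// FeatureGateStatus holds one entry per version known to the cluster.
type FeatureGateStatus struct {
	FeatureGates []FeatureGateDetails
}

// gatesForVersion (hypothetical helper) returns the entry for a version.
// ok is false when that version has not been published yet.
func gatesForVersion(status FeatureGateStatus, version string) (FeatureGateDetails, bool) {
	for _, d := range status.FeatureGates {
		if d.Version == version {
			return d, true
		}
	}
	return FeatureGateDetails{}, false
}

// isEnabled (hypothetical helper) reports whether a gate is in the enabled list.
func isEnabled(details FeatureGateDetails, name string) bool {
	for _, g := range details.Enabled {
		if g.Name == name {
			return true
		}
	}
	return false
}

func main() {
	// ExampleGateA and ExampleGateB are made-up gate names.
	status := FeatureGateStatus{FeatureGates: []FeatureGateDetails{
		{
			Version:  "4.14.0",
			Enabled:  []FeatureGateAttributes{{Name: "ExampleGateA"}},
			Disabled: []FeatureGateAttributes{{Name: "ExampleGateB"}},
		},
		{
			Version:  "4.13.5",
			Disabled: []FeatureGateAttributes{{Name: "ExampleGateA"}, {Name: "ExampleGateB"}},
		},
	}}

	if details, ok := gatesForVersion(status, "4.13.5"); ok {
		fmt.Println("ExampleGateA enabled on 4.13.5:", isEnabled(details, "ExampleGateA"))
	}
}
```

Keeping one entry per version is what lets an older operator keep reading its own gate list while a newer version's entry is already present.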

bparees · 2023-03-30T23:19:34Z

dev-guide/feature-gate.md

+   1. wait for FeatureGates to be read from the cluster for this version
+   2. make it easy to function with a development build on a patched cluster
+   3. make it easy to read the current state of FeatureGates
+   4. by default, exit when feature gates change (most processes don't react cleanly to changes at runtime)


this is going to result in extra+stampeding restarts (of processes, not nodes and not pods, i realize) on upgrades, right?

i know we have a semi-common pattern of self-terminating processes to pick up changes, but this seems particularly widespread and likely to happen in close temporal proximity. Maybe not a concern, but felt worthy of note/discussion.

wondering if there are optimizations to avoid this.

e.g. there's no reason for the process to restart if the only change in the featuregates is to versions they are ignoring anyway

so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.

Same section describes how it could be done more gracefully. But the default works. Going from default to upgradable resulting in a mass blip of operators isn't disruptive to the cluster as a whole. Remember, this is the operator, not the operand.

> so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.

The config operator today is run level 30, which for context is the same as MAPI and comes after most of the control plane components. If we were to promote the run level, then the new version feature gate information should be available early enough in the upgrade to avoid the duplicate restarts of all the operators.

I agree the default works so maybe this is something we want to trial and then optimise later?

> Remember, this is the operator, not the operand.

What are the odds that various components bake this directly into the operand? I could see it in machine API for example, if there's an AWS specific feature, it may be more obvious to put the feature gate processing into the AWS implementation rather than using a flag controlled by the operator.

Maybe we want to clarify in the documentation for this process that operands should use feature flags controlled by operators to avoid the exits/need to gracefully handle changes
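The restart optimization discussed in this thread (do not exit when the only change is to versions the process ignores) could look roughly like the sketch below. It is an illustration under assumptions, not the library-go implementation: `versionGates`, `gatesFor`, `sameSet`, and `shouldExit` are hypothetical names, and the gate names are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// versionGates is a simplified, hypothetical view of one per-version entry.
type versionGates struct {
	Version  string
	Enabled  []string
	Disabled []string
}

// gatesFor returns the entry for the given version, if present.
func gatesFor(entries []versionGates, version string) (versionGates, bool) {
	for _, e := range entries {
		if e.Version == version {
			return e, true
		}
	}
	return versionGates{}, false
}

// sameSet compares two gate lists while ignoring order.
func sameSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

// shouldExit reports whether the gates for this process's own version changed.
// Entries added or changed for other versions are ignored.
func shouldExit(before, after []versionGates, myVersion string) bool {
	b, hadBefore := gatesFor(before, myVersion)
	a, hasAfter := gatesFor(after, myVersion)
	if hadBefore != hasAfter {
		// Conservative choice: treat appearance or disappearance as a change.
		return true
	}
	if !hadBefore {
		return false
	}
	return !sameSet(b.Enabled, a.Enabled) || !sameSet(b.Disabled, a.Disabled)
}

func main() {
	before := []versionGates{{Version: "4.13.5", Enabled: []string{"ExampleGateA"}}}
	// An upgrade publishes a new entry for 4.14.0; the 4.13.5 entry is untouched.
	after := append([]versionGates{{Version: "4.14.0", Enabled: []string{"ExampleGateA", "ExampleGateB"}}}, before...)

	fmt.Println("4.13.5 operator should exit:", shouldExit(before, after, "4.13.5"))
}
```

With a check like this, an operator still running the old version would keep running when the new version's entry is published, and the only restart would be the one caused by its own image change.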

dev-guide/feature-gate.md

deads2k · 2023-03-31T18:58:46Z

Removed contentious bits and repushed.

rphillips · 2023-04-04T15:13:02Z

dev-guide/feature-gate.md

+#### Tech Preview -> GA
+
+Since this is actually controlling the gates, it is not practical to pre-test it.
+Once unit tests pass and clusters successfully install equivalently using this mechanism, PRs will be opened against 


I am interested how the MCO and conversely the Kubelet would digest the configured FeatureGates at bootstrap.

MCO Reference for Features Gates: openshift/machine-config-operator#2668

JoelSpeed · 2023-04-05T10:36:28Z

dev-guide/feature-gate.md

+
+### Upgrade / Downgrade Strategy
+
+On upgrade to the first level with this change, the cluster-config-operator goes first, so the FeatureGateStatus will


Is this true? I just double checked and the CCO deployment manifest is at run level 30, so I wouldn't expect it to be upgraded until after all of the core control plane components

JoelSpeed · 2023-04-24T11:16:32Z

dev-guide/feature-gate.md

+For developers using this, the first phase will look like
+1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
+   The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
+2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.


Is it worth noting that nagging API approvers should be in #forum-api-review to balance the load rather than going directly to <insert your favourite API approver here> in a DM?

Should we have a slack alias to use to allow the group to be summoned to a particular thread?

We have precedent elsewhere for delegating decisions like this to team leads so we can maintain velocity and team autonomy. Let's do that here, and put appropriate training in place, so we don't have everyone queued up for 2 people to review important changes.

We talked about this in the staff engineer call today and agreed to move ahead with the technical aspects of this to unblock some other work, and have a deeper conversation about delegation of the process portion in the future.

Given these comments are a few months old, we've now had an opportunity to see how people are holding the feature gate tooling and how they're expecting it to work. There have been occasions where I have noticed teams incorrectly holding the feature gate tooling and have caught them early enough to stop them either breaking payloads or shooting themselves in the foot.

We've been updating the FAQs along the way but some hadn't even read those. I think we need to either do a very good job of socialising the expectations and how to hold the tools, or have a limited number of knowledgeable (staff?) engineers who can be the SMEs for the tooling and help others in their pursuit of feature gated features.

It still seems, from what I've seen, that more often than not, the PRs to promote features to stable are not coming with proof by default, and that pushing back is required to ask for proof that the promotion of certain features isn't going to break payload. An education point or documentation point maybe?

JoelSpeed · 2023-04-24T11:21:51Z

dev-guide/feature-gate.md

+type FeatureGateDetails struct {
+	// version matches the version provided by the ClusterVersion and in the ClusterOperator.Status.Versions field.
+	// +kubebuilder:validation:Required
+	// +required


Which tools actually still use this? I thought we had a preference to just use +kubebuilder:validation:Required?

JoelSpeed · 2023-04-24T11:23:51Z

dev-guide/feature-gate.md

+
+type FeatureGateAttributes struct {
+	// name is the name of the FeatureGate
+	// +kubebuilder:validation:Pattern=`^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$`


Do we have any validation at the point that feature gates are added to the API that matches this? I don't see it in the featuregates API but I know there's an admission plugin for that which may be doing the validation
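To make the question concrete, here is a small sketch of what the quoted `+kubebuilder:validation:Pattern` accepts and rejects. The regular expression is copied from the diff above; `validFeatureGateName` and the sample names are hypothetical and only illustrate the pattern, not any existing admission plugin.

```go
package main

import (
	"fmt"
	"regexp"
)

// featureGateNamePattern is the pattern quoted from the proposed API type.
var featureGateNamePattern = regexp.MustCompile(`^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$`)

// validFeatureGateName (hypothetical helper) reports whether name matches the pattern.
func validFeatureGateName(name string) bool {
	return featureGateNamePattern.MatchString(name)
}

func main() {
	// Sample names are made up for illustration.
	for _, name := range []string{
		"ExampleGate",      // plain alphanumeric name
		"example.com.Gate", // dot-separated segments
		"Example_Gate",     // underscore is not in the character class
		"a..b",             // empty segment between dots
		"",                 // empty string
	} {
		fmt.Printf("%q valid=%t\n", name, validFeatureGateName(name))
	}
}
```

Note that the pattern rejects underscores and slashes, so any gate name containing those characters would fail this check if it were enforced at the point gates are added.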

dhellmann · 2023-04-24T12:56:48Z

dev-guide/feature-gate.md

+### User Stories
+
+As a staff engineer or release manager, I want to have no-developer-action result in features of unproven reliability
+inaccessible-by-default in upgradeable clusters.


I still don't understand this sentence. Is there some way to say it without the jargon or hyphenated words?

@deads2k Maybe this should've been "accessible-by-default" ? You want all features to go through a tech-preview phase where they're expected to show that their tests pass at historic product wide baseline before they're available without TechPreviewNoUpgrade featureset.

wking · 2023-05-08T23:10:53Z

dev-guide/feature-gate.md

+This will ensure that older versions do not enable feature gates that were not GA until the later version.
+To do this we will
+1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed
+   in the CVO history.


ClusterVersion status.history is not exhaustive: openshift/cluster-version-operator#791, openshift/cluster-version-operator#805, #1153. It's possible that the fraction of history maintained by #1153 is sufficient for this enhancement. And it's possible that implicit in this enhancement is that status.featureGates will be pruned to keep up with ClusterVersion status.history pruning.

I'm not all that clear on why you'd want such deep history in featureGates though. Wouldn't the most recent Completed version, and any subsequent Partial versions be sufficient? Or possibly the useful version horizon extends back to the previous minor version, to accommodate possible eventual support for rollbacks within a z stream? But once you've Completed a 4.y.z, I don't understand why any 4.(y-1) or older feature gate information would matter.
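The narrower horizon suggested here (the most recent Completed version plus any newer Partial versions) can be sketched as follows. This assumes the ClusterVersion history is ordered newest first and uses the `Completed` and `Partial` state names from the comment; `historyEntry` and `relevantVersions` are hypothetical names for illustration.

```go
package main

import "fmt"

// updateState mirrors the history states named in the comment above.
type updateState string

const (
	completed updateState = "Completed"
	partial   updateState = "Partial"
)

// historyEntry is a simplified, hypothetical view of one history item.
type historyEntry struct {
	Version string
	State   updateState
}

// relevantVersions walks a newest-first history and keeps every version
// up to and including the most recent Completed one.
func relevantVersions(history []historyEntry) []string {
	var versions []string
	for _, h := range history {
		versions = append(versions, h.Version)
		if h.State == completed {
			break
		}
	}
	return versions
}

func main() {
	history := []historyEntry{
		{Version: "4.14.1", State: partial},
		{Version: "4.14.0", State: completed},
		{Version: "4.13.5", State: completed},
	}
	// Only 4.14.1 and 4.14.0 would need feature gate entries under this rule.
	fmt.Println(relevantVersions(history))
}
```

Extending the horizon back to the previous minor version, as the comment also floats, would only change the stopping condition of the loop.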

openshift-bot · 2023-06-06T01:15:59Z

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

JoelSpeed · 2023-06-06T09:35:09Z

/remove-lifecycle stale.

petr-muller · 2023-06-06T13:28:54Z

/remove-lifecycle stale

openshift-bot · 2023-07-05T01:15:55Z

/lifecycle stale

openshift-bot · 2023-07-12T08:45:22Z

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

sdodson · 2023-07-17T17:18:36Z

/remove-lifecycle stale
My understanding is that this is actually essentially implemented at this point. Perhaps a quick audit for accuracy with current implementation and then merge?

openshift-bot · 2023-07-25T00:16:04Z

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2023-07-25T00:16:15Z

@openshift-bot: Closed this PR.

JoelSpeed · 2023-07-25T08:37:14Z

/reopen
/remove-lifecycle rotten

Yeah this is basically complete, we should review and merge for future us to have a record of what we did

openshift-ci · 2023-07-25T08:37:39Z

@JoelSpeed: Reopened this PR.

openshift-ci · 2023-07-25T08:37:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from deads2k. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2023-07-25T08:45:51Z

@deads2k: all tests passed!

JoelSpeed · 2023-07-31T16:36:55Z

dev-guide/feature-gate.md

+1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
+   The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
+2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.
+3. Automation vendors openshift/api into cluster-config-operator and opens a PR in a few hours.


We haven't done this bit yet, is anyone prioritising it?

openshift-bot · 2023-08-29T01:15:12Z

/lifecycle stale

openshift-bot · 2023-09-05T08:46:05Z

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2023-09-13T00:15:29Z

/close

openshift-ci · 2023-09-13T00:16:04Z

@openshift-bot: Closed this PR.

dhellmann · 2023-09-15T13:55:19Z

(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

new approach to featuregates for coordination in the cluster

7d2648d

openshift-ci bot requested review from JoelSpeed and bparees March 29, 2023 19:53

JoelSpeed reviewed Mar 30, 2023

bparees reviewed Mar 30, 2023

dhellmann reviewed Mar 31, 2023

deads2k added 3 commits March 31, 2023 14:53

to squash: comments round one

ee41c7f

to squash: comments round two

18e077a

to squash: comments round three

1f307f2

rphillips reviewed Apr 4, 2023

JoelSpeed reviewed Apr 5, 2023

JoelSpeed reviewed Apr 24, 2023

dhellmann reviewed Apr 24, 2023

JoelSpeed mentioned this pull request Apr 27, 2023

OCPBUGS-13547: [OCPCLOUD-2034] Update Library-go and API for new featuregate changes openshift/machine-config-operator#3688

Merged

wking reviewed May 8, 2023

deads2k mentioned this pull request May 11, 2023

OCPBUGS-13547: Remove featureset flag and use only the manifest openshift/cluster-kube-controller-manager-operator#735

Merged

openshift-ci bot added the lifecycle/stale label Jun 6, 2023

openshift-ci bot removed the lifecycle/stale label Jun 6, 2023

openshift-ci bot added the lifecycle/stale label Jul 5, 2023

openshift-ci bot added lifecycle/rotten and removed lifecycle/stale labels Jul 12, 2023

openshift-ci bot closed this Jul 25, 2023

openshift-ci bot reopened this Jul 25, 2023

openshift-ci bot removed the lifecycle/rotten label Jul 25, 2023

JoelSpeed reviewed Jul 31, 2023

openshift-ci bot added the lifecycle/stale label Aug 29, 2023

openshift-ci bot added lifecycle/rotten and removed lifecycle/stale labels Sep 5, 2023

openshift-ci bot closed this Sep 13, 2023

dhellmann removed the lifecycle/rotten label Oct 13, 2023

