-
Notifications
You must be signed in to change notification settings - Fork 537
new approach to featuregates for coordination in the cluster #1373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
JoelSpeed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just got one question about how we are going to test feature promotion
|
|
||
| 1. Change thresholds for requiring TechPreviewNoUpgrade. | ||
| This is a possible future goal, but an update here would be required. | ||
| 2. Change thresholds for promoting from TechPreviewNoUpgrade to Default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we feel we have a well defined existing threshold? if not, this non-goal seems to conflict w/ the user story above of "i have sufficient evidence to promote this to on by default" (I don't know how we meet that user story w/o defining those thresholds, so if they aren't well defined then by implication they need to be changed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been informal so far. I'm willing to describe in more detail, but to my knowledge, the informal, existing mechanism has been working well and I wasn't inclined to change it.
| To do this we will | ||
| 1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed | ||
| in the CVO history. | ||
| 2. Update cluster-config-operator to have a control loop owned by api-approvers to set the FeatureGateStatus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again not really sure why would be owned by api-approvers.
I see api review/approval as a distinct need/skillset/requirement that is unrelated to "definition of the maturity of a feature"
is it just the lack of another obvious team/set of people to own this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it just the lack of another obvious team/set of people to own this code?
Yes. And avoiding the dumping ground of unowned cruft that was developing six months back or so. Small quantity of code to bootstrap a configuration API owned by the people approving those configuration APIs.
| 1. wait for FeatureGates to be read from the cluster for this version | ||
| 2. make it easy to function with a development build on a patched cluster | ||
| 3. make it easy to read the current state of FeatureGates | ||
| 4. by default, exit when feature gates change (most processes don't react cleanly to changes at runtime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to result in extra+stampeding restarts (of processes, not nodes and not pods, i realize) on upgrades, right?
i know we have a semi-common pattern of self-terminating processes to pick up changes, but this seems particularly widespread and likely to happen in close temporal proximity. Maybe not a concern, but felt worthy of note/discussion.
wondering if there are optimizations to avoid this.
e.g. there's no reason for the process to restart if the only hange in the featuregates is to versions they are ignoring anyway
so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same section describes how it could be more gracefully. But the default works. Going from default to upgradable resulting a mass blip of operators isn't disruptive to the cluster as a whole. Remember, this is the operator, not the operand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.
The config operator today is run level 30, which for context is the same as MAPI and comes after most of the control plane components. If we were to promote the run level, then the new version feature gate information should be available early enough in the upgrade to avoid the duplicate restarts of all the operators.
I agree the default works so maybe this is something we want to trial and then optimise later?
Remember, this is the operator, not the operand.
What are the odds that various components bake this directly into the operand? I could see it in machine API for example, if there's an AWS specific feature, it may be more obvious to put the feature gate processing into the AWS implementation rather than using a flag controlled by the operator.
Maybe we want to clarify in the documentation for this process that operands should use feature flags controlled by operators to avoid the exits/need to gracefully handle changes
|
Removed contentious bits and repushed. |
| #### Tech Preview -> GA | ||
|
|
||
| Since this is actually controlling the gates, it is not practical to pre-test it. | ||
| Once unit tests pass and clusters successfully install equivalently using this mechanism, PRs will be opened against |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am interested how the MCO and conversely the Kubelet would digest the configured FeatureGates at bootstrap.
MCO Reference for Features Gates: openshift/machine-config-operator#2668
| 1. wait for FeatureGates to be read from the cluster for this version | ||
| 2. make it easy to function with a development build on a patched cluster | ||
| 3. make it easy to read the current state of FeatureGates | ||
| 4. by default, exit when feature gates change (most processes don't react cleanly to changes at runtime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.
The config operator today is run level 30, which for context is the same as MAPI and comes after most of the control plane components. If we were to promote the run level, then the new version feature gate information should be available early enough in the upgrade to avoid the duplicate restarts of all the operators.
I agree the default works so maybe this is something we want to trial and then optimise later?
Remember, this is the operator, not the operand.
What are the odds that various components bake this directly into the operand? I could see it in machine API for example, if there's an AWS specific feature, it may be more obvious to put the feature gate processing into the AWS implementation rather than using a flag controlled by the operator.
Maybe we want to clarify in the documentation for this process that operands should use feature flags controlled by operators to avoid the exits/need to gracefully handle changes
|
|
||
| ### Upgrade / Downgrade Strategy | ||
|
|
||
| On upgrade to the first level with this change, the cluster-config-operator goes first, so the FeatureGateStatus will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true? I just double checked and the CCO deployment manifest is at run level 30, so I wouldn't expect it to be upgraded until after all of the core control plane components
| For developers using this, the first phase will look like | ||
| 1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117). | ||
| The PR should be confined to just the feature gate change and should include a link to a merged enhancement. | ||
| 2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth noting that nagging API approvers should be in #forum-api-review to balance the load rather than going directly to <insert your favourite API approver here> in a DM?
Should we have a slack alias to use to allow the group to be summoned to a particular thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have precedent elsewhere for delegating decisions like this to team leads so we can maintain velocity and team autonomy. Let's do that here, and put appropriate training in place, so we don't have everyone queued up for 2 people to review important changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked about this in the staff engineer call today and agreed to move ahead with the technical aspects of this to unblock some other work, and have a deeper conversation about delegation of the process portion in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given these comments are a few months old, we've had an opportunity now to see how people are holding the feature gate tooling and how they're expecting them to work. There have been occasions where I have noticed teams incorrectly holding the feature gate tooling and have caught them early enough to stop themselves either breaking payloads or shooting themselves in the foot.
We've been updating the FAQs along the way but some hadn't even read those. I think we need to either do a very good job of socialising the expectations and how to hold the tools, or, have a limited number of knowledgable (staff?) engineers who can be the SMEs for the tooling and help others in their pursuit of feature gated features.
It still seems, from what I've seen, that more often than not, the PRs to promote features to stable are not coming with proof by default, and that pushing back is required to ask for proof that the promotion of certain features isn't going to break payload. An education point or documentation point maybe?
| type FeatureGateDetails struct { | ||
| // version matches the version provided by the ClusterVersion and in the ClusterOperator.Status.Versions field. | ||
| // +kubebuilder:validation:Required | ||
| // +required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which tools actually still use this? I thought we had a preference to just use +kubebuilder:validation:Required?
|
|
||
| type FeatureGateAttributes struct { | ||
| // name is the name of the FeatureGate | ||
| // +kubebuilder:validation:Pattern=`^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any validation at the point that feature gates are added to the API that matches this? I don't see it in the featuregates API but I know there's an admission plugin for that which may be doing the validation
| For developers using this, the first phase will look like | ||
| 1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117). | ||
| The PR should be confined to just the feature gate change and should include a link to a merged enhancement. | ||
| 2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have precedent elsewhere for delegating decisions like this to team leads so we can maintain velocity and team autonomy. Let's do that here, and put appropriate training in place, so we don't have everyone queued up for 2 people to review important changes.
| ### User Stories | ||
|
|
||
| As a staff engineer or release manager, I want to have no-developer-action result in features of unproven reliability | ||
| inaccessible-by-default in upgradeable clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand this sentence. Is there some way to say it without the jargon or hyphenated words?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deads2k Maybe this should've been "accessible-by-default" ? You want all features to go through a tech-preview phase where they're expected to show that their tests pass at historic product wide baseline before they're available without TechPreviewNoUpgrade featureset.
| This will ensure that older versions do not enable feature gates that were not GA until the later version. | ||
| To do this we will | ||
| 1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed | ||
| in the CVO history. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClusterVersion status.history is not exhaustive: openshift/cluster-version-operator#791, openshift/cluster-version-operator#805, #1153. It's possible that the fraction of history maintained by #1153 is sufficient for this enhancement. And it's possible that implicit in this enhancement is that status.featureGates will be pruned to keep up with ClusterVersion status.history pruning.
I'm not all that clear on why you'd want such deep history in featureGates though. Wouldn't the most recent Completed version, and any subsequent Partial versions be sufficient? Or possibly the useful version horizon extends back to the previous minor version, to accommodate possible eventual support for rollbacks within a z stream? But once you've Completed a 4.y.z, I don't understand why any 4.(y-1) or older feature gate information would matter.
|
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale |
|
/remove-lifecycle stale. |
|
/remove-lifecycle stale |
|
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale |
|
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle rotten |
|
/remove-lifecycle stale |
|
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting /close |
|
@openshift-bot: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/reopen Yeah this is basically complete, we should review and merge for future us to have a record of what we did |
|
@JoelSpeed: Reopened this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@deads2k: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
| For developers using this, the first phase will look like | ||
| 1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117). | ||
| The PR should be confined to just the feature gate change and should include a link to a merged enhancement. | ||
| 2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given these comments are a few months old, we've had an opportunity now to see how people are holding the feature gate tooling and how they're expecting them to work. There have been occasions where I have noticed teams incorrectly holding the feature gate tooling and have caught them early enough to stop themselves either breaking payloads or shooting themselves in the foot.
We've been updating the FAQs along the way but some hadn't even read those. I think we need to either do a very good job of socialising the expectations and how to hold the tools, or, have a limited number of knowledgable (staff?) engineers who can be the SMEs for the tooling and help others in their pursuit of feature gated features.
It still seems, from what I've seen, that more often than not, the PRs to promote features to stable are not coming with proof by default, and that pushing back is required to ask for proof that the promotion of certain features isn't going to break payload. An education point or documentation point maybe?
| 1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117). | ||
| The PR should be confined to just the feature gate change and should include a link to a merged enhancement. | ||
| 2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it. | ||
| 3. Automation vendors openshift/api into cluster-config-operator and opens a PR in a few hours. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haven't done this bit yet, is anyone prioritising it?
|
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale |
|
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle rotten |
|
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting /close |
|
@openshift-bot: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future. |
3 similar comments
|
(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future. |
|
(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future. |
|
(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future. |
This enhancement aims to reduce the effort required to add a feature gate to TechPreviewNoUpgrade and to promote
that feature gate to Default.
Feature gates in OpenShift are enabled and disabled in a particular FeatureSet, mixing and matching is not allowed.
Prior to this enhancement, it is necessary to vendor openshift/api into every impacted repository.
After this enhancement, it will only be necessary to vendor into cluster-config-operator.
cc @openshift/openshift-staff-engineers @JoelSpeed