Skip to content

Conversation

@deads2k
Copy link
Contributor

@deads2k deads2k commented Mar 29, 2023

This enhancement aims to reduce the effort required to add a feature gate to TechPreviewNoUpgrade and to promote
that feature gate to Default.
Feature gates in OpenShift are enabled and disabled in a particular FeatureSet, mixing and matching is not allowed.
Prior to this enhancement, it is necessary to vendor openshift/api into every impacted repository.
After this enhancement, it will only be necessary to vendor into cluster-config-operator.

cc @openshift/openshift-staff-engineers @JoelSpeed

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just got one question about how we are going to test feature promotion


1. Change thresholds for requiring TechPreviewNoUpgrade.
This is a possible future goal, but an update here would be required.
2. Change thresholds for promoting from TechPreviewNoUpgrade to Default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we feel we have a well defined existing threshold? if not, this non-goal seems to conflict w/ the user story above of "i have sufficient evidence to promote this to on by default" (I don't know how we meet that user story w/o defining those thresholds, so if they aren't well defined then by implication they need to be changed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been informal so far. I'm willing to describe in more detail, but to my knowledge, the informal, existing mechanism has been working well and I wasn't inclined to change it.

To do this we will
1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed
in the CVO history.
2. Update cluster-config-operator to have a control loop owned by api-approvers to set the FeatureGateStatus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again not really sure why would be owned by api-approvers.

I see api review/approval as a distinct need/skillset/requirement that is unrelated to "definition of the maturity of a feature"

is it just the lack of another obvious team/set of people to own this code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it just the lack of another obvious team/set of people to own this code?

Yes. And avoiding the dumping ground of unowned cruft that was developing six months back or so. Small quantity of code to bootstrap a configuration API owned by the people approving those configuration APIs.

1. wait for FeatureGates to be read from the cluster for this version
2. make it easy to function with a development build on a patched cluster
3. make it easy to read the current state of FeatureGates
4. by default, exit when feature gates change (most processes don't react cleanly to changes at runtime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to result in extra+stampeding restarts (of processes, not nodes and not pods, i realize) on upgrades, right?

i know we have a semi-common pattern of self-terminating processes to pick up changes, but this seems particularly widespread and likely to happen in close temporal proximity. Maybe not a concern, but felt worthy of note/discussion.

wondering if there are optimizations to avoid this.

e.g. there's no reason for the process to restart if the only hange in the featuregates is to versions they are ignoring anyway

so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same section describes how it could be more gracefully. But the default works. Going from default to upgradable resulting a mass blip of operators isn't disruptive to the cluster as a whole. Remember, this is the operator, not the operand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.

The config operator today is run level 30, which for context is the same as MAPI and comes after most of the control plane components. If we were to promote the run level, then the new version feature gate information should be available early enough in the upgrade to avoid the duplicate restarts of all the operators.

I agree the default works so maybe this is something we want to trial and then optimise later?

Remember, this is the operator, not the operand.

What are the odds that various components bake this directly into the operand? I could see it in machine API for example, if there's an AWS specific feature, it may be more obvious to put the feature gate processing into the AWS implementation rather than using a flag controlled by the operator.

Maybe we want to clarify in the documentation for this process that operands should use feature flags controlled by operators to avoid the exits/need to gracefully handle changes

@deads2k
Copy link
Contributor Author

deads2k commented Mar 31, 2023

Removed contentious bits and repushed.

#### Tech Preview -> GA

Since this is actually controlling the gates, it is not practical to pre-test it.
Once unit tests pass and clusters successfully install equivalently using this mechanism, PRs will be opened against
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am interested how the MCO and conversely the Kubelet would digest the configured FeatureGates at bootstrap.

MCO Reference for Features Gates: openshift/machine-config-operator#2668

1. wait for FeatureGates to be read from the cluster for this version
2. make it easy to function with a development build on a patched cluster
3. make it easy to read the current state of FeatureGates
4. by default, exit when feature gates change (most processes don't react cleanly to changes at runtime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.

The config operator today is run level 30, which for context is the same as MAPI and comes after most of the control plane components. If we were to promote the run level, then the new version feature gate information should be available early enough in the upgrade to avoid the duplicate restarts of all the operators.

I agree the default works so maybe this is something we want to trial and then optimise later?

Remember, this is the operator, not the operand.

What are the odds that various components bake this directly into the operand? I could see it in machine API for example, if there's an AWS specific feature, it may be more obvious to put the feature gate processing into the AWS implementation rather than using a flag controlled by the operator.

Maybe we want to clarify in the documentation for this process that operands should use feature flags controlled by operators to avoid the exits/need to gracefully handle changes


### Upgrade / Downgrade Strategy

On upgrade to the first level with this change, the cluster-config-operator goes first, so the FeatureGateStatus will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I just double checked and the CCO deployment manifest is at run level 30, so I wouldn't expect it to be upgraded until after all of the core control plane components

For developers using this, the first phase will look like
1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth noting that nagging API approvers should be in #forum-api-review to balance the load rather than going directly to <insert your favourite API approver here> in a DM?

Should we have a slack alias to use to allow the group to be summoned to a particular thread?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have precedent elsewhere for delegating decisions like this to team leads so we can maintain velocity and team autonomy. Let's do that here, and put appropriate training in place, so we don't have everyone queued up for 2 people to review important changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this in the staff engineer call today and agreed to move ahead with the technical aspects of this to unblock some other work, and have a deeper conversation about delegation of the process portion in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given these comments are a few months old, we've had an opportunity now to see how people are holding the feature gate tooling and how they're expecting them to work. There have been occasions where I have noticed teams incorrectly holding the feature gate tooling and have caught them early enough to stop themselves either breaking payloads or shooting themselves in the foot.

We've been updating the FAQs along the way but some hadn't even read those. I think we need to either do a very good job of socialising the expectations and how to hold the tools, or, have a limited number of knowledgable (staff?) engineers who can be the SMEs for the tooling and help others in their pursuit of feature gated features.

It still seems, from what I've seen, that more often than not, the PRs to promote features to stable are not coming with proof by default, and that pushing back is required to ask for proof that the promotion of certain features isn't going to break payload. An education point or documentation point maybe?

type FeatureGateDetails struct {
// version matches the version provided by the ClusterVersion and in the ClusterOperator.Status.Versions field.
// +kubebuilder:validation:Required
// +required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which tools actually still use this? I thought we had a preference to just use +kubebuilder:validation:Required?


type FeatureGateAttributes struct {
// name is the name of the FeatureGate
// +kubebuilder:validation:Pattern=`^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any validation at the point that feature gates are added to the API that matches this? I don't see it in the featuregates API but I know there's an admission plugin for that which may be doing the validation

For developers using this, the first phase will look like
1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have precedent elsewhere for delegating decisions like this to team leads so we can maintain velocity and team autonomy. Let's do that here, and put appropriate training in place, so we don't have everyone queued up for 2 people to review important changes.

### User Stories

As a staff engineer or release manager, I want to have no-developer-action result in features of unproven reliability
inaccessible-by-default in upgradeable clusters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand this sentence. Is there some way to say it without the jargon or hyphenated words?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deads2k Maybe this should've been "accessible-by-default" ? You want all features to go through a tech-preview phase where they're expected to show that their tests pass at historic product wide baseline before they're available without TechPreviewNoUpgrade featureset.

This will ensure that older versions do not enable feature gates that were not GA until the later version.
To do this we will
1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed
in the CVO history.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterVersion status.history is not exhaustive: openshift/cluster-version-operator#791, openshift/cluster-version-operator#805, #1153. It's possible that the fraction of history maintained by #1153 is sufficient for this enhancement. And it's possible that implicit in this enhancement is that status.featureGates will be pruned to keep up with ClusterVersion status.history pruning.

I'm not all that clear on why you'd want such deep history in featureGates though. Wouldn't the most recent Completed version, and any subsequent Partial versions be sufficient? Or possibly the useful version horizon extends back to the previous minor version, to accommodate possible eventual support for rollbacks within a z stream? But once you've Completed a 4.y.z, I don't understand why any 4.(y-1) or older feature gate information would matter.

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 6, 2023
@JoelSpeed
Copy link
Contributor

/remove-lifecycle stale.

@petr-muller
Copy link
Member

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 6, 2023
@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2023
@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 12, 2023
@sdodson
Copy link
Member

sdodson commented Jul 17, 2023

/remove-lifecycle stale
My understanding is that this is actually essentially implemented at this point. Perhaps a quick audit for accuracy with current implementation and then merge?

@openshift-bot
Copy link

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jul 25, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 25, 2023

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JoelSpeed
Copy link
Contributor

/reopen
/remove-lifecycle rotten

Yeah this is basically complete, we should review and merge for future us to have a record of what we did

@openshift-ci openshift-ci bot reopened this Jul 25, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 25, 2023

@JoelSpeed: Reopened this PR.

Details

In response to this:

/reopen
/remove-lifecycle rotten

Yeah this is basically complete, we should review and merge for future us to have a record of what we did

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 25, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 25, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from deads2k. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 25, 2023

@deads2k: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

For developers using this, the first phase will look like
1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given these comments are a few months old, we've had an opportunity now to see how people are holding the feature gate tooling and how they're expecting them to work. There have been occasions where I have noticed teams incorrectly holding the feature gate tooling and have caught them early enough to stop themselves either breaking payloads or shooting themselves in the foot.

We've been updating the FAQs along the way but some hadn't even read those. I think we need to either do a very good job of socialising the expectations and how to hold the tools, or, have a limited number of knowledgable (staff?) engineers who can be the SMEs for the tooling and help others in their pursuit of feature gated features.

It still seems, from what I've seen, that more often than not, the PRs to promote features to stable are not coming with proof by default, and that pushing back is required to ask for proof that the promotion of certain features isn't going to break payload. An education point or documentation point maybe?

1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.
3. Automation vendors openshift/api into cluster-config-operator and opens a PR in a few hours.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't done this bit yet, is anyone prioritising it?

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2023
@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 5, 2023
@openshift-bot
Copy link

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Sep 13, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 13, 2023

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann
Copy link
Contributor

(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

3 similar comments
@dhellmann
Copy link
Contributor

(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

@dhellmann
Copy link
Contributor

(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

@dhellmann
Copy link
Contributor

(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

@dhellmann dhellmann removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants