-
Notifications
You must be signed in to change notification settings - Fork 217
OTA-1159: [2/3] Maintain ReconciliationIssues condition
#1032
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTA-1159: [2/3] Maintain ReconciliationIssues condition
#1032
Conversation
|
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a96e7bd to
3fd6099
Compare
4e58482 to
33e06ec
Compare
|
/retest |
33e06ec to
fc5033e
Compare
pkg/start/start.go
Outdated
| var enabledGates cvo.FeatureGates | ||
| for _, g := range gate.Status.FeatureGates { | ||
| if g.Version != version.Version.String() { | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this approach discussed briefly in openshift/api#1405, but it feels a bit weird. If a gate claims enabled in 4.y.z and we happen to be running a different version (e.g. because we're updating into a new CVO) it looks like we'll disble all the gates until whatever manages FeatureGate catches up with us. But we don't want to blip these off early in the update, do we? We might be unique vs. other existing consumers in going first in the update, while most other consumers trail behind the Kube API server operator (except etcd, which is in parallel). Anyhow, I'm not entirely sure how this plays out, and it would be nice to have the commit message talk me through the behavior as we understand it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I think I misunderstood how exactly the FeatureGate works during upgrades - I for some reason believed the resource is managed by the config-operator for some reason I lived in the world where CVO is upgraded after config-operator, but that does not seem to be true (0000_00_cluster-version-operator_03_deployment goes before 0000_10_config-operator_08_clusteroperator. I'm clearly the ideal person to write these PRs :-P
I started following o/enhancements dev-guide/featuresets.md that lead me to FeatureGateAccess where the controller errors and requeues until the gates are set, and the dev-guide setup code makes the controller startup block until that happens. That felt scary for CVO so I thought running briefly without gates enalbed (which is fine for the current state of gates in CVO where there is one which is off by default) is acceptable (aggravated by my false belief that featuregatestopper would kill us once status finally updates).
I guess our only two real options is to assume some sane default or block? And I guess we cannot really block because of our role in the upgrades because we upgrade before config-operator so we would never see the new version gates?
You are totally correct that whatever corners we cut, we need to document them better. I'll spend more time looking into this and at the very least write much better commit messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by adding the UnknownVersion meta-flag that informs flag consumer code in CVO that we were not able to read enabled/disabled features for our version, so we should follow some sane default behavior, but can tolerate evidence of previously enabled features if we find them in the cluster.
It brings some complexity but I think it is the only solution for CVO to tolerate FeatureGate status not having a matching version while not blocking startup on that.
pkg/start/start.go
Outdated
| return false, nil | ||
| default: | ||
| startingFeatureSet = string(gate.Spec.FeatureSet) | ||
| enabledGates = cvoGatesFrom(gate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that we're going to start depending on cluster FeatureGate status here, we'll need something more granular than our current feature-set watcher to notice those changes and keep us up to date. Maybe pkg/featurechangestopper (already watching/polling) should grow an API to call a callback on changes, and we have a lock on Operator and a goroutine that updates Operator.enabledGates on changes? Or rolls the CVO container? Or...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mistakenly believed that featurechangestopper would kill us on any change but that's really not what happens. library-go's FeatureGateAccess terminates when the set of enabled/disabled gates changes. Our featurechangestopper terminates when featureset changes.
I think for our case it could be appropriate to do something like what FeatureGateAccess does and simply terminate on observed gate changes, probably check just the ones that CVO cares about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c1b2767 by making featurechangestopper terminate CVO when CVO-relevant flags change.
pkg/cvo/status_test.go
Outdated
| expectedRriCondition: nil, | ||
| }, | ||
| { | ||
| name: "RRI disabled, version unknown, failure => condition present", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe add "existing condition" to the name? And a test case for RRI disabled, version unknown, no existing condition, failure => condition not present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 added in 3da7615
|
/test all |
|
/test e2e-agnostic-operator |
|
/test lint |
|
Huh for some reason this PR is not picking up recent commits like petr-muller@24b845e |
|
/close |
|
@petr-muller: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
9929ed6 to
24b845e
Compare
|
/reopen |
|
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@petr-muller Please take a look above test, in step 6, when ResourceReconciliationIssues is true but the message is not a json: |
|
Upgrade cluster from normal 4.15 to latest with this PR:
7 Check featuregate again: |
Thanks for testing, I made a mistake in my notes - the JSON part is actually done in a followup PR #1033, not this one. This PR only needs to have a correct True/False value and natural language message. Sorry, it mixed up in my head. |
|
@petr-muller ok, so I think the change is not break install and upgrade and also can set ResourceReconciliationIssues correctly. |
|
@JianLi-RH no worries about time, it's not extra urgent. If we can get it tested this week then we're fine 👍 |
pkg/cvo/cvo.go
Outdated
| // This is analogical to VersionForOperatorFromEnv() from o/library-go but the import is pretty | ||
| // heavy for a single, simple os.Getenv wrapper, so we just inline the logic here. | ||
| operatorVersion := os.Getenv("OPERATOR_IMAGE_VERSION") | ||
| klog.Infof("Looking up feature gates for version %s", operatorVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should not be printing this log always after we move this out of feature-gate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later commit removed this specific line because logging this would be excessive (we'd do it in a loop):
cluster-version-operator/pkg/featuregates/featuregates.go
Lines 79 to 84 in 7bd3096
| // CvoGatesFromFeatureGate finds feature gates for a given version in a FeatureGate resource and returns | |
| // CvoGates that reflects them, or the default gates if given version was not found in the FeatureGate | |
| func CvoGatesFromFeatureGate(gate *configv1.FeatureGate, version string) CvoGates { | |
| enabledGates := DefaultCvoGates(version) | |
| for _, g := range gate.Status.FeatureGates { |
But one-time log of how feature gates were detected (what version do we see, if we found matching item in FeatureGate .status, etc) is definitely useful and we should have it. Right now we have a one-time log like this:
$ /usr/bin/grep -e start.go -e featuregates.go -e stopper.go openshift-cluster-version_cluster-version-operator-77b9674d7-58k2s_cluster-version-operator.log
I0325 15:16:00.615836 1 start.go:23] ClusterVersionOperator v1.0.0-1183-gb6dbd748-dirty
I0325 15:16:00.747858 1 start.go:281] Waiting on 1 outstanding goroutines.
I0325 15:20:52.964166 1 start.go:549] FeatureGate found in cluster, using its feature set "" at startup
I0325 15:20:54.090127 1 start.go:574] CVO features for version 4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest enabled at startup: {desiredVersion:4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest unknownVersion:false resourceReconciliationIssuesCondition:false}
I0325 15:20:54.090240 1 featurechangestopper.go:123] Starting stop-on-features-change controller with startingRequiredFeatureSet="" startingCvoGates={desiredVersion:4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest unknownVersion:false resourceReconciliationIssuesCondition:false}| resourceReconciliationIssuesConditionType v1.ClusterStatusConditionType = "ResourceReconciliationIssues" | ||
|
|
||
| noResourceReconciliationIssuesReason string = "NoIssues" | ||
| noResourceReconciliationIssuesMessage string = "No issues found during resource reconciliation" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't resource reconciliation and reconciliation is same thing in this case? If yes, we should drop resource word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let me clean up the code 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed all instances of "resource reconciliation" to just "reconciliation" in cc78232
|
The PR overall looks good to me. I have some small nitpicks which I mentioned in my review. |
ReconciliationIssues condition
LalatenduMohanty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
cc78232 to
1604448
Compare
LalatenduMohanty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LalatenduMohanty, petr-muller The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
Finished below regression: enable non-TechPreviewNoUpgrade featureset in featuregate Wait some minutes, check co again: |
|
Setup a 4.15 cluster without TP: Create an image: Upgrade to above payload: When the upgrading finished, check above ns and co again: |
|
Continue the test with the cluster in case 3. Check co: Check if upgradable: Check manifest: Unset featureset to disable TechPreviewNoUpgrade |
|
hi @petr-muller I have finished 4 feature gate related regression cases, they all work normal. |
|
/label qe-approved |
|
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@JianLi-RH Thank you, great work! 👍 |
|
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-version-operator-container-v4.16.0-202403281444.p0.ge3677e0.assembly.stream.el9 for distgit cluster-version-operator. |
cvo: set ResourceReconciliationIssues condition
When an appropriate feature flag is set, maintain a
ResourceReconciliationIssuescondition on the CV status. This condition is False when no issues were
encountered (signalled by the
Failurefield on theSyncWorkerStatusparameter) and True otherwise.
cvo: read enabled feature flags from cluster
Read CVO-related Feature Flags from the cluster resource and propagate
them to CVO controllers. Multiplex the coarse cluster feature flag into
smaller, CVO-specific flags for easier maintenance in the future.
Builds on top of #1031
Because of OCPBUGS-30080, we cannot easily determine running CVO version by a single
os.Getenv(), like other operators can. CVO can determine its version from the initial payload it loads from disk though, but this happens a bit later in the code flow, after leadership lease is acquired and all informers are started. At that point we can provide the feature gate / featureset knowledge to the structures that need it: actual CVO controller and the feature changestopper, but these structures also need to be initialized earlier (they require informers which are already started). This leads to a slightly awkard delayed initialization scheme, where the controller structures are initialized early and populated with early content like informers etc. Later, when informers are started and CVO loads its initial payload, we can extract the version from it and use it to populate the feature gate in the controller structures. Because enabled feature gates are avaiable later in the flow, it also means part of the CVO code cannot be gated by a feature gate (like controller initialization, or initial payload loading). We do not need that now but it may cause issues later.The high-level sequence after this commit looks like this:
enabledFeatureGatechecker with one panics when used (no code can check for gates before we know them)