
Conversation

@tchap (Contributor) commented Jan 27, 2026:

Currently the controller degrades on any error, including an apply conflict. This can happen during upgrades and should not be the case.

The cause is probably that we delete and create the guard pod in the very same sync call. Instead of handling that case specifically, the apply call now simply ignores conflict errors; sync is invoked again automatically as the changes propagate.

This removes Degraded conditions like:

GuardControllerDegraded: Unable to apply pod openshift-kube-scheduler-guard-ip-10-0-106-117.us-east-2.compute.internal changes: Operation cannot be fulfilled on pods "openshift-kube-scheduler-guard-ip-10-0-106-117.us-east-2.compute.internal": the object has been modified; please apply your changes to the latest version and try again
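
For context, the error handling after this change looks roughly like the sketch below. applyPod, client, and the errs slice that feeds the Degraded condition are placeholders, not names taken from this PR; apierrors is k8s.io/apimachinery/pkg/api/errors and klog is k8s.io/klog/v2.

if _, err := applyPod(ctx, client, pod); err != nil {
	// Keep the log line, but only let non-conflict errors feed the
	// GuardControllerDegraded condition; conflicts resolve on the next
	// sync once the cache holds the up-to-date object.
	klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
	if !apierrors.IsConflict(err) {
		errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
	}
}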

openshift-ci-robot added the jira/valid-reference (this PR references a valid Jira ticket of any type) and jira/valid-bug (the referenced Jira bug is valid for the branch this PR is targeting) labels on Jan 27, 2026.
@openshift-ci-robot commented:

@tchap: This pull request references Jira Issue OCPBUGS-38662, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


openshift-ci bot requested review from dgrisonnet and tkashem on January 27, 2026 at 11:30.
openshift-ci bot commented Jan 27, 2026:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap
Once this PR has been reviewed and has the lgtm label, please assign dgrisonnet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Jan 27, 2026:

@tchap: all tests passed!

Full PR test history. Your PR dashboard.


-	errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+	if !apierrors.IsConflict(err) {
+		errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+	}
tchap (Contributor, Author) commented:

I am not sure logging such an error is useful, but I left the log statement there.

 if err != nil {
 	klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
-	errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+	if !apierrors.IsConflict(err) {
A reviewer (Contributor) commented:

How is this PR handling conflicts gracefully? 🙂

In my opinion, we should not hide errors.

A conflict error is still an error.
For example, it might indicate a conflict with a different actor in the system.

What we could do instead (if we aren’t already) is consider retrying certain types of requests when a conflict error occurs.
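
The retry suggested here would typically go through client-go's k8s.io/client-go/util/retry helper, which re-runs the mutation only when the server returns a conflict. A minimal sketch; applyGuardPod is a hypothetical helper that re-reads the latest pod and applies the desired changes, not code from this PR:

// retry is k8s.io/client-go/util/retry.
// RetryOnConflict re-invokes the function with backoff as long as the
// returned error is a conflict; any other error aborts immediately.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	return applyGuardPod(ctx) // hypothetical: fetch the latest pod, then apply changes
})
if err != nil {
	errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
}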

tchap (Contributor, Author) replied:

I think that ignoring is very graceful 🙂

Or we can simply return when we delete the pod and wait for another round. I think that's the cause. Because the deployment uses the Recreate strategy, there should not be multiple pods at once.
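
That alternative would look roughly like the sketch below; deleteRequired, client, and the early return are assumptions about how the sync loop could be restructured, not code from this PR (metav1 is k8s.io/apimachinery/pkg/apis/meta/v1):

// When the existing guard pod no longer matches the desired one, delete it
// and end this sync early. The deletion event triggers another sync, which
// creates the replacement pod from freshly observed state instead of doing
// delete and create in one pass.
if deleteRequired {
	if err := client.Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil // wait for the next sync round
}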

tchap (Contributor, Author) replied:

I will investigate further how to handle this in a better way and what the root cause actually is...

tchap (Contributor, Author) replied:

From what I can tell, it's actually racing with the kubelet, which updates the pod status right after creation. This seems to be normal.

What do you mean by retrying requests? This already retries the request, just once an up-to-date object has been fetched.

