OCPBUGS-38662: guard controller: Handle conflict gracefully #2093
Conversation
Currently the controller degrades on any error, including an apply conflict. This can happen during upgrades and should not be the case. The cause is probably that we delete and create the guard pod in the very same sync call. Rather than handling that case specifically, the apply call now simply ignores conflict errors; sync is automatically invoked again as the changes propagate.
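To make "sync is automatically invoked again" concrete, here is a generic sketch of informer-driven requeueing using standard client-go wiring. It is not the actual guard controller code, and the `wireRequeue` name is illustrative: the same pod update that invalidates the controller's cached copy also re-enqueues the key, so work skipped on a conflict is redone against a fresh object on the next sync.

```go
package guard

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// wireRequeue is a generic sketch: the pod update that made the controller's
// cached copy stale (and caused the apply conflict) also fires an informer
// event, which re-enqueues the key, so sync runs again with the fresh object.
func wireRequeue(podInformer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key) // the next sync will process this pod again
			}
		},
	})
}
```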
@tchap: This pull request references Jira Issue OCPBUGS-38662, which is valid. 3 validation(s) were run on this bug. The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tchap. Needs approval from an approver in each of these files.
@tchap: all tests passed!
```diff
-	errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+	if !apierrors.IsConflict(err) {
+		errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+	}
```
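Taken together, the resulting error handling has roughly the following shape. This is only a sketch reconstructed from the hunks above; the surrounding loop and the `applyGuardPod` helper are assumptions, not the operator's actual code.

```go
package guard

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/klog/v2"
)

// applyGuardPod is a hypothetical stand-in for the real apply call.
func applyGuardPod(ctx context.Context, pod *corev1.Pod) error { return nil }

// syncGuardPods mirrors the error handling after this change: conflicts are
// still logged but no longer appended to errs, so they do not degrade the
// operator; the next sync retries against an up-to-date object.
func syncGuardPods(ctx context.Context, pods []*corev1.Pod) []error {
	var errs []error
	for _, pod := range pods {
		if err := applyGuardPod(ctx, pod); err != nil {
			klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
			if !apierrors.IsConflict(err) {
				errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
			}
		}
	}
	return errs
}
```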
I am not sure logging such an error is useful, but I left the log statement there.
```diff
 if err != nil {
 	klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
-	errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+	if !apierrors.IsConflict(err) {
```
How is this PR handling conflicts gracefully? 🙂
In my opinion, we should not hide errors.
A conflict error is still an error.
For example, it might indicate a conflict with a different actor in the system.
What we could do instead (if we aren’t already) is consider retrying certain types of requests when a conflict error occurs.
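If explicit retry-on-conflict were preferred over skipping the error, client-go already ships a helper for that pattern. A minimal sketch, where the `updateGuardPod` wrapper and the Get/mutate/Update flow are illustrative rather than the operator's actual code:

```go
package guard

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateGuardPod refetches the pod and reapplies the mutation on every
// conflict, so a stale resourceVersion never surfaces as an error.
func updateGuardPod(ctx context.Context, client kubernetes.Interface, ns, name string, mutate func(*corev1.Pod)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(pod)
		_, err = client.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```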
I think that ignoring is very graceful 🙂
Or we can simply return when we delete the pod and wait for another round. I think that's the cause. Because the deployment uses the Recreate strategy, there should not be multiple pods at once.
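A rough sketch of that "return after delete and wait for another round" idea; the controller type and its helper fields are entirely hypothetical:

```go
package guard

import "context"

// guardController is a minimal stand-in; the real controller has more state.
type guardController struct {
	needsRecreate func(node string) bool
	deletePod     func(ctx context.Context, node string) error
	ensurePod     func(ctx context.Context, node string) error
}

// syncNode sketches the idea: if the guard pod must be replaced, delete it
// and end this sync; the next sync creates the new pod, so delete and create
// never happen in the same call.
func (c *guardController) syncNode(ctx context.Context, node string) error {
	if c.needsRecreate(node) {
		if err := c.deletePod(ctx, node); err != nil {
			return err
		}
		return nil // wait for another round
	}
	return c.ensurePod(ctx, node)
}
```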
I will investigate further how to handle this in a better way and what the root cause actually is...
It's actually racing with the Kubelet, which updates the pod status right after creation, from what I can tell. This seems to be normal.
What do you mean by retrying requests? This effectively is retrying the request, but only once an up-to-date object is fetched.
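A small illustration of why dropping the conflict acts as a retry in practice; the timeline paraphrases this thread and the `isTransientApplyError` helper is hypothetical:

```go
package guard

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// Paraphrased timeline of the race described above (illustrative only):
//
//   1. sync N creates the guard pod; the controller caches it at resourceVersion "1".
//   2. the Kubelet immediately writes the pod status; the live object moves to "2".
//   3. still in sync N, the controller applies a change based on its stale "1"
//      copy and the API server answers with 409 Conflict.
//   4. sync N+1 sees the fresh object from the informer and the apply succeeds.
//
// isTransientApplyError captures the policy: such conflicts are expected and
// are retried by the next sync, so they should not degrade the operator.
func isTransientApplyError(err error) bool {
	return apierrors.IsConflict(err)
}
```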
This is to remove degrading like