Incorrectly written BodyExecutionCallback implementations lead to zombie executions#60
Incorrectly written BodyExecutionCallback implementations lead to zombie executions#60mrginglymus wants to merge 1 commit intojenkinsci:masterfrom
Conversation
|
It's also worth noting that I first saw this issue from an unhandled in |
dwnusbaum
left a comment
There was a problem hiding this comment.
@mrginglymus Thanks for the PR!
If someone can quickly identify the quick fix for this then that would be greatly appreciated, otherwise I will put on my spelunking gear.
I think the fix would be over in https://github.com/jenkinsci/workflow-cps-plugin (and I would probably go ahead and move this PR over there as well and add the new test to CpsBodyExecutionTest).
I am not sure exactly what a fix would look like. I would probably start by adding some try/catch blocks in CpsBodyExecution.start, CpsBodyExecution$FailureAdapter.receive, and CpsBodyExecution$SuccessAdapter.receive that just log exceptions thrown by the callbacks, and then I would add a test that fails in each method to see what other changes would be needed to prevent the execution's state from being corrupted if an exception is thrown by a BodyExecutionCallback. I am not sure if we can handle this case in general though: I think that for some steps (maybe parallel), the callbacks may need to execute for the step to be able to be cleaned up successfully. I think we should be able to handle failures in onStart though by just immediately aborting the step.
|
Great, thanks for the pointers - I realised after raising this that this probably wasn't the right place. I'll move it over and have a play. |
|
Finally got round to raising jenkinsci/workflow-cps-plugin#422 |
I'm raising this PR as a minimal reproduction of the issue I noticed from a bug in the
withChecksstep of thechecks-apiplugin: jenkinsci/checks-api-plugin#83The
withChecksstep does not handle all possible exceptions inonStart,onSuccess, andonFailure. If an exception gets thrown by one of these methods, while the job appears to fail correctly, the executor on which the step is running does not, and you end up with a execution that must be killed manually.Whilst it is obviously the responsibility of the plugin to behave correctly, this failure mode is particularly pernicious. If you run a jenkins instance with on-demand cloud agents then any zombie executions will prevent that cloud agent from being shut down until you manually kill it. I have seen three day old zombies on Monday morning from an executor that was unnecessarily up all weekend.
This PR does not (yet) contain a fix as I haven't looked into the plumbing of how this actually works. If someone can quickly identify the quick fix for this then that would be greatly appreciated, otherwise I will put on my spelunking gear.