Skip to content

Handle failure in publish invocation#83

Closed
mrginglymus wants to merge 3 commits intojenkinsci:masterfrom
mrginglymus:less-explodey
Closed

Handle failure in publish invocation#83
mrginglymus wants to merge 3 commits intojenkinsci:masterfrom
mrginglymus:less-explodey

Conversation

@mrginglymus
Copy link
Contributor

From what I think I can glean from the stack trace/reading similar code, an unhandled exception in any of the BodyExecutionCallback might lead to zombie jobs on executors. This manifested itself when the publish at the start of a withChecks step failed because of a merge conflict. The overall job aborted succesfully, but the three outstanding parallel stages continued to block their executors and required manual cleanup.

@codecov
Copy link

codecov bot commented Feb 13, 2021

Codecov Report

Merging #83 (912bb6a) into master (65907be) will decrease coverage by 0.82%.
The diff coverage is 31.57%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master      #83      +/-   ##
============================================
- Coverage     86.94%   86.11%   -0.83%     
  Complexity      139      139              
============================================
  Files            16       16              
  Lines           628      634       +6     
  Branches         48       50       +2     
============================================
  Hits            546      546              
- Misses           65       69       +4     
- Partials         17       19       +2     
Impacted Files Coverage Δ Complexity Δ
...o/jenkins/plugins/checks/steps/WithChecksStep.java 70.65% <31.57%> (-4.93%) 2.00 <0.00> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65907be...f88c550. Read the comment docs.

@KalleOlaviNiemitalo
Copy link

KalleOlaviNiemitalo commented Feb 13, 2021

WithChecksCallBack.onFailure calls both WithChecksStepExecution.publish and context.onFailure(t):

publish(context, builder);
context.onFailure(t);

If WithChecksStepExecution.publish has already called context.onFailure (and returned false), then the second call violates the documentation of StepContext: https://github.com/jenkinsci/workflow-step-api-plugin/blob/4b67af12a6f81af147ac660c54f875a7b2c824c7/src/main/java/org/jenkinsci/plugins/workflow/steps/StepContext.java#L48-L49

as the onFailure method is part of the FutureCallback interface: https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/util/concurrent/FutureCallback.html#onFailure-java.lang.Throwable-

In pipelines, I think the second onFailure call will trigger a warning from CpsStepContext.completed: https://github.com/jenkinsci/workflow-cps-plugin/blob/de5e69b5e85bd5a89960e65f3a634256b24b7af5/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java#L339

That problem is not introduced by this pull request, but perhaps it is made more frequent. I think what should happen is that StepContext.onFailure is called only once, with the Throwable t that was originally passed to WithChecksCallBack.onFailure, and t.addSuppressed has been called to register the exception that prevented the failure from being published to checks.

A similar violation in onStart looks more difficult to fix, as CpsBodyExecution.launch does not seem to expect BodyExecutionCallback.onStart to invalidate the StepContext already: https://github.com/jenkinsci/workflow-cps-plugin/blob/261df120d1c2f228b2137c3f880a7741293ea5f2/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsBodyExecution.java#L133-L136

@KalleOlaviNiemitalo
Copy link

CpsBodyExecution.FailureAdapter.receive has a loop that calls BodyExecutionCallback.onFailure of each callback: https://github.com/jenkinsci/workflow-cps-plugin/blob/261df120d1c2f228b2137c3f880a7741293ea5f2/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsBodyExecution.java#L360-L362

Before that, it has already called setOutcome. Does that mean WithChecksCallBack.onFailure should not call context.onFailure(t) at all?

@mrginglymus
Copy link
Contributor Author

There we go; we now only call onFailure potentially once in each of onStart, onFailure and stop. Hopefully that should now be up to spec?

@timja
Copy link
Member

timja commented Feb 13, 2021

checkstyle is failing

@Override
public void stop(final Throwable cause) {
if (publish(getContext(), new ChecksDetails.ChecksDetailsBuilder()
Exception publishFailure = publish(getContext(), new ChecksDetails.ChecksDetailsBuilder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit odd for java, normally you would throw the exception from the publish method and catch it here, possibly a known checked exception

Copy link
Contributor

@XiongKezhi XiongKezhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have your investigated how other enclosing steps handle such zombie problems

.build());
return null;
}
catch (RuntimeException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this catch needed? if u are catching the exception outside this whole function (like in line 112) instead of returning the exception, this is not needed?

@mrginglymus
Copy link
Contributor Author

mrginglymus commented Feb 14, 2021

Let me just take a step back so we can work out what the problem is here.

This is the stack trace that I believe to be the cause of my zombie jobs:

00:32:24.679  hudson.AbortException: Pull request 5852 : Not mergeable at xxxxxxxx+xxxxxxxxxxxx (NOT_MERGEABLE)
00:32:24.679      at org.jenkinsci.plugins.github_branch_source.PullRequestSCMRevision.validateMergeHash(PullRequestSCMRevision.java:106)
00:32:24.679      at org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.retrieve(GitHubSCMSource.java:1599)
00:32:24.679      at jenkins.scm.api.SCMSource.fetch(SCMSource.java:582)
00:32:24.679      at io.jenkins.plugins.checks.github.SCMFacade.findRevision(SCMFacade.java:156)
00:32:24.679  Caused: java.lang.IllegalStateException: Could not fetch revision from repository: xxxxxxxxxxxx
00:32:24.679      at io.jenkins.plugins.checks.github.SCMFacade.findRevision(SCMFacade.java:159)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.resolveHeadSha(GitHubSCMSourceChecksContext.java:131)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.<init>(GitHubSCMSourceChecksContext.java:46)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.fromRun(GitHubSCMSourceChecksContext.java:24)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubChecksPublisherFactory.createPublisher(GitHubChecksPublisherFactory.java:42)
00:32:24.679      at io.jenkins.plugins.checks.api.ChecksPublisherFactory.lambda$fromRun$0(ChecksPublisherFactory.java:89)
...
00:32:24.679      at io.jenkins.plugins.checks.api.ChecksPublisherFactory.fromRun(ChecksPublisherFactory.java:92)
00:32:24.679      at io.jenkins.plugins.checks.api.ChecksPublisherFactory.fromRun(ChecksPublisherFactory.java:69)
00:32:24.679      at io.jenkins.plugins.checks.steps.WithChecksStep$WithChecksStepExecution.publish(WithChecksStep.java:150)
00:32:24.679      at io.jenkins.plugins.checks.steps.WithChecksStep$WithChecksStepExecution.access$100(WithChecksStep.java:83)
00:32:24.679      at io.jenkins.plugins.checks.steps.WithChecksStep$WithChecksStepExecution$WithChecksCallBack.onStart(WithChecksStep.java:169)

Whilst apparently innocuous, these zombie jobs are actually quite a bad bug as they can cost a lot of money - I've had elastic agents sitting for three days over the weekend doing nothing because they have been blocked by a zombie job.

So:

  1. Whilst implementers should make every effort to conform to the specification of BodyExecutionCallback, having the failure mode of an unexpected exception being zombie jobs is bad - is that something that can be fixed? edit: Incorrectly written BodyExecutionCallback implementations lead to zombie executions workflow-step-api-plugin#60 has a test-case for this issue

  2. Looking at the stack trace, SCMFacade in github-checks downgrades a nice CheckedException from github-branch-source into a runtime exception:
    https://github.com/jenkinsci/github-checks-plugin/blob/d376cd617d6060c000e9e3b5e680770014ec1968/src/main/java/io/jenkins/plugins/checks/github/SCMFacade.java#L158-L160 - I think this should remain a checked exception, as the merge conflict I've encountered is something that can reasonably be expected to occur frequently. We can then swallow that exception here: https://github.com/jenkinsci/github-checks-plugin/blob/d376cd617d6060c000e9e3b5e680770014ec1968/src/main/java/io/jenkins/plugins/checks/github/GitHubSCMSourceChecksContext.java#L131 and not have to worry about catching a runtime exception in this plugin for this specific issue.

  3. Until (1.) is addressed, do we need some way of hardening our WithChecksCallBack against runtime exceptions? Perhaps not, if we address (2.)

@timja
Copy link
Member

timja commented Feb 14, 2021

sounds reasonable depending on what the code looks like after

@XiongKezhi
Copy link
Contributor

I think we should at least some determined steps to reproduce the zombie jobs, and the problem may be that our previous implementation deviates from the docs.

There we go; we now only call onFailure potentially once in each of onStart, onFailure and stop. Hopefully that should now be up to spec?

Yes, we should make sure of that. In our previous implementation, our stop method calls the publish method which calls onFailure of the context before itself calls it directly. If stop method is used as what the docs say: "May be called if someone asks a running step to abort.", maybe we should just adding a variable to express that cancelation and delegate the actual publishing logic to onFailure in the callback.

Before that, it has already called setOutcome. Does that mean WithChecksCallBack.onFailure should not call context.onFailure(t) at all?

The withGradle step doesn't call it, although it does call the onFailure method of its parent. However, the implementation of stop in the abstract class calls it, so I believe it does no harm to inherit it in the last of ours?

While I'm scrolling the implementations of body execution to look for a normal pattern (due to the lack of docs specifying the call chain of an interrupted/failed body execution), I notice that many other implementations just implement the tail call: https://github.com/jenkinsci/kubernetes-plugin/blob/a253fba5085eabbb6e0f026f6f5576c7667c31a3/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerStepExecution.java#L101. It's cleaner, but since our implementation behaves differently on a failure and a cancelation, not sure if this is useful.

@mrginglymus
Copy link
Contributor Author

@XiongKezhi jenkinsci/workflow-step-api-plugin#60 has a reproduction, although as per the latest comment I need to move it to another repo where the ‘bug’ actually is.

otherwise, I think the usage we have here is fine, it’s just that upstream is lacking error handling. Obviously we should do our best not to generate errors too...

@timja
Copy link
Member

timja commented Feb 22, 2021

there's a checkstyle error in this PR btw in case you missed it

@mrginglymus
Copy link
Contributor Author

@timja re the checkstyle error - can I just suppress it? The point is I need to patch around the vulnerability of the cps plugin to zombie executions, the only sure-fire way of doing so is ensuring no runtime exceptions make it out of onStart.

@timja
Copy link
Member

timja commented Mar 3, 2021

@timja re the checkstyle error - can I just suppress it? The point is I need to patch around the vulnerability of the cps plugin to zombie executions, the only sure-fire way of doing so is ensuring no runtime exceptions make it out of onStart.

yeah sure

@mrginglymus
Copy link
Contributor Author

Closing in favour of #91

@timja timja closed this Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants