Handle failure in publish invocation by mrginglymus · Pull Request #83 · jenkinsci/checks-api-plugin

mrginglymus · 2021-02-13T10:45:23Z

From what I think I can glean from the stack trace/reading similar code, an unhandled exception in any of the BodyExecutionCallback might lead to zombie jobs on executors. This manifested itself when the publish at the start of a withChecks step failed because of a merge conflict. The overall job aborted succesfully, but the three outstanding parallel stages continued to block their executors and required manual cleanup.

codecov · 2021-02-13T10:47:40Z

Codecov Report

Merging #83 (912bb6a) into master (65907be) will decrease coverage by 0.82%.
The diff coverage is 31.57%.

@@             Coverage Diff              @@
##             master      #83      +/-   ##
============================================
- Coverage     86.94%   86.11%   -0.83%     
  Complexity      139      139              
============================================
  Files            16       16              
  Lines           628      634       +6     
  Branches         48       50       +2     
============================================
  Hits            546      546              
- Misses           65       69       +4     
- Partials         17       19       +2

Impacted Files	Coverage Δ	Complexity Δ
...o/jenkins/plugins/checks/steps/WithChecksStep.java	`70.65% <31.57%> (-4.93%)`	`2.00 <0.00> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65907be...f88c550. Read the comment docs.

KalleOlaviNiemitalo · 2021-02-13T12:23:38Z

WithChecksCallBack.onFailure calls both WithChecksStepExecution.publish and context.onFailure(t):

checks-api-plugin/src/main/java/io/jenkins/plugins/checks/steps/WithChecksStep.java

Lines 206 to 207 in d73ab46

    
           publish(context, builder); 
        
           context.onFailure(t);

If WithChecksStepExecution.publish has already called context.onFailure (and returned false), then the second call violates the documentation of StepContext: https://github.com/jenkinsci/workflow-step-api-plugin/blob/4b67af12a6f81af147ac660c54f875a7b2c824c7/src/main/java/org/jenkinsci/plugins/workflow/steps/StepContext.java#L48-L49

as the onFailure method is part of the FutureCallback interface: https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/util/concurrent/FutureCallback.html#onFailure-java.lang.Throwable-

In pipelines, I think the second onFailure call will trigger a warning from CpsStepContext.completed: https://github.com/jenkinsci/workflow-cps-plugin/blob/de5e69b5e85bd5a89960e65f3a634256b24b7af5/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java#L339

That problem is not introduced by this pull request, but perhaps it is made more frequent. I think what should happen is that StepContext.onFailure is called only once, with the Throwable t that was originally passed to WithChecksCallBack.onFailure, and t.addSuppressed has been called to register the exception that prevented the failure from being published to checks.

A similar violation in onStart looks more difficult to fix, as CpsBodyExecution.launch does not seem to expect BodyExecutionCallback.onStart to invalidate the StepContext already: https://github.com/jenkinsci/workflow-cps-plugin/blob/261df120d1c2f228b2137c3f880a7741293ea5f2/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsBodyExecution.java#L133-L136

KalleOlaviNiemitalo · 2021-02-13T16:49:30Z

CpsBodyExecution.FailureAdapter.receive has a loop that calls BodyExecutionCallback.onFailure of each callback: https://github.com/jenkinsci/workflow-cps-plugin/blob/261df120d1c2f228b2137c3f880a7741293ea5f2/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsBodyExecution.java#L360-L362

Before that, it has already called setOutcome. Does that mean WithChecksCallBack.onFailure should not call context.onFailure(t) at all?

mrginglymus · 2021-02-13T18:34:11Z

There we go; we now only call onFailure potentially once in each of onStart, onFailure and stop. Hopefully that should now be up to spec?

timja · 2021-02-13T19:54:30Z

checkstyle is failing

timja · 2021-02-13T20:00:28Z

src/main/java/io/jenkins/plugins/checks/steps/WithChecksStep.java

        @Override
        public void stop(final Throwable cause) {
-            if (publish(getContext(), new ChecksDetails.ChecksDetailsBuilder()
+            Exception publishFailure = publish(getContext(), new ChecksDetails.ChecksDetailsBuilder()


this is a bit odd for java, normally you would throw the exception from the publish method and catch it here, possibly a known checked exception

XiongKezhi

have your investigated how other enclosing steps handle such zombie problems

XiongKezhi · 2021-02-14T07:27:32Z

src/main/java/io/jenkins/plugins/checks/steps/WithChecksStep.java

+                                .build());
+                return null;
+            }
+            catch (RuntimeException e) {


is this catch needed? if u are catching the exception outside this whole function (like in line 112) instead of returning the exception, this is not needed?

mrginglymus · 2021-02-14T09:02:57Z

Let me just take a step back so we can work out what the problem is here.

This is the stack trace that I believe to be the cause of my zombie jobs:

00:32:24.679  hudson.AbortException: Pull request 5852 : Not mergeable at xxxxxxxx+xxxxxxxxxxxx (NOT_MERGEABLE)
00:32:24.679      at org.jenkinsci.plugins.github_branch_source.PullRequestSCMRevision.validateMergeHash(PullRequestSCMRevision.java:106)
00:32:24.679      at org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.retrieve(GitHubSCMSource.java:1599)
00:32:24.679      at jenkins.scm.api.SCMSource.fetch(SCMSource.java:582)
00:32:24.679      at io.jenkins.plugins.checks.github.SCMFacade.findRevision(SCMFacade.java:156)
00:32:24.679  Caused: java.lang.IllegalStateException: Could not fetch revision from repository: xxxxxxxxxxxx
00:32:24.679      at io.jenkins.plugins.checks.github.SCMFacade.findRevision(SCMFacade.java:159)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.resolveHeadSha(GitHubSCMSourceChecksContext.java:131)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.<init>(GitHubSCMSourceChecksContext.java:46)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.fromRun(GitHubSCMSourceChecksContext.java:24)
00:32:24.679      at io.jenkins.plugins.checks.github.GitHubChecksPublisherFactory.createPublisher(GitHubChecksPublisherFactory.java:42)
00:32:24.679      at io.jenkins.plugins.checks.api.ChecksPublisherFactory.lambda$fromRun$0(ChecksPublisherFactory.java:89)
...
00:32:24.679      at io.jenkins.plugins.checks.api.ChecksPublisherFactory.fromRun(ChecksPublisherFactory.java:92)
00:32:24.679      at io.jenkins.plugins.checks.api.ChecksPublisherFactory.fromRun(ChecksPublisherFactory.java:69)
00:32:24.679      at io.jenkins.plugins.checks.steps.WithChecksStep$WithChecksStepExecution.publish(WithChecksStep.java:150)
00:32:24.679      at io.jenkins.plugins.checks.steps.WithChecksStep$WithChecksStepExecution.access$100(WithChecksStep.java:83)
00:32:24.679      at io.jenkins.plugins.checks.steps.WithChecksStep$WithChecksStepExecution$WithChecksCallBack.onStart(WithChecksStep.java:169)

Whilst apparently innocuous, these zombie jobs are actually quite a bad bug as they can cost a lot of money - I've had elastic agents sitting for three days over the weekend doing nothing because they have been blocked by a zombie job.

So:

Whilst implementers should make every effort to conform to the specification of BodyExecutionCallback, having the failure mode of an unexpected exception being zombie jobs is bad - is that something that can be fixed? edit: Incorrectly written BodyExecutionCallback implementations lead to zombie executions workflow-step-api-plugin#60 has a test-case for this issue
Looking at the stack trace, SCMFacade in github-checks downgrades a nice CheckedException from github-branch-source into a runtime exception:
https://github.com/jenkinsci/github-checks-plugin/blob/d376cd617d6060c000e9e3b5e680770014ec1968/src/main/java/io/jenkins/plugins/checks/github/SCMFacade.java#L158-L160 - I think this should remain a checked exception, as the merge conflict I've encountered is something that can reasonably be expected to occur frequently. We can then swallow that exception here: https://github.com/jenkinsci/github-checks-plugin/blob/d376cd617d6060c000e9e3b5e680770014ec1968/src/main/java/io/jenkins/plugins/checks/github/GitHubSCMSourceChecksContext.java#L131 and not have to worry about catching a runtime exception in this plugin for this specific issue.
Until (1.) is addressed, do we need some way of hardening our WithChecksCallBack against runtime exceptions? Perhaps not, if we address (2.)

timja · 2021-02-14T09:05:39Z

sounds reasonable depending on what the code looks like after

mrginglymus · 2021-02-14T09:21:25Z

Another option - this could well be an issue with GithubBranchSourcePlugin, where

https://github.com/jenkinsci/github-branch-source-plugin/blob/ea3b723b4f12fbc3f06d985c9cc81dccea1ae2ae/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java#L1414-L1422

handles unmergables, but

https://github.com/jenkinsci/github-branch-source-plugin/blob/ea3b723b4f12fbc3f06d985c9cc81dccea1ae2ae/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java#L1612

doesn't.

XiongKezhi · 2021-02-15T17:14:15Z

I think we should at least some determined steps to reproduce the zombie jobs, and the problem may be that our previous implementation deviates from the docs.

There we go; we now only call onFailure potentially once in each of onStart, onFailure and stop. Hopefully that should now be up to spec?

Yes, we should make sure of that. In our previous implementation, our stop method calls the publish method which calls onFailure of the context before itself calls it directly. If stop method is used as what the docs say: "May be called if someone asks a running step to abort.", maybe we should just adding a variable to express that cancelation and delegate the actual publishing logic to onFailure in the callback.

Before that, it has already called setOutcome. Does that mean WithChecksCallBack.onFailure should not call context.onFailure(t) at all?

The withGradle step doesn't call it, although it does call the onFailure method of its parent. However, the implementation of stop in the abstract class calls it, so I believe it does no harm to inherit it in the last of ours?

While I'm scrolling the implementations of body execution to look for a normal pattern (due to the lack of docs specifying the call chain of an interrupted/failed body execution), I notice that many other implementations just implement the tail call: https://github.com/jenkinsci/kubernetes-plugin/blob/a253fba5085eabbb6e0f026f6f5576c7667c31a3/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerStepExecution.java#L101. It's cleaner, but since our implementation behaves differently on a failure and a cancelation, not sure if this is useful.

mrginglymus · 2021-02-22T08:34:30Z

@XiongKezhi jenkinsci/workflow-step-api-plugin#60 has a reproduction, although as per the latest comment I need to move it to another repo where the ‘bug’ actually is.

otherwise, I think the usage we have here is fine, it’s just that upstream is lacking error handling. Obviously we should do our best not to generate errors too...

timja · 2021-02-22T08:42:30Z

there's a checkstyle error in this PR btw in case you missed it

mrginglymus · 2021-03-03T15:04:52Z

@timja re the checkstyle error - can I just suppress it? The point is I need to patch around the vulnerability of the cps plugin to zombie executions, the only sure-fire way of doing so is ensuring no runtime exceptions make it out of onStart.

timja · 2021-03-03T15:14:40Z

@timja re the checkstyle error - can I just suppress it? The point is I need to patch around the vulnerability of the cps plugin to zombie executions, the only sure-fire way of doing so is ensuring no runtime exceptions make it out of onStart.

yeah sure

mrginglymus · 2021-03-03T15:15:25Z

Closing in favour of #91

Handle failure in publish invocation

69d14a5

mrginglymus added 2 commits February 13, 2021 17:39

Only call onFailure if we know we haven't already

55790f5

Alternatively, this

f88c550

mrginglymus force-pushed the less-explodey branch from 32aef2e to f88c550 Compare February 13, 2021 18:27

timja reviewed Feb 13, 2021

View reviewed changes

XiongKezhi reviewed Feb 14, 2021

View reviewed changes

This was referenced Feb 14, 2021

Return null for unmergable heads rather than excepting jenkinsci/github-branch-source-plugin#386

Open

Incorrectly written BodyExecutionCallback implementations lead to zombie executions jenkinsci/workflow-step-api-plugin#60

Closed

mrginglymus mentioned this pull request Mar 3, 2021

Handle failures in publish invocation #91

Merged

timja closed this Mar 3, 2021

Conversation

mrginglymus commented Feb 13, 2021

Uh oh!

codecov bot commented Feb 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

KalleOlaviNiemitalo commented Feb 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KalleOlaviNiemitalo commented Feb 13, 2021

Uh oh!

mrginglymus commented Feb 13, 2021

Uh oh!

timja commented Feb 13, 2021

Uh oh!

timja Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

XiongKezhi left a comment

Choose a reason for hiding this comment

Uh oh!

XiongKezhi Feb 14, 2021

Choose a reason for hiding this comment

Uh oh!

mrginglymus commented Feb 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timja commented Feb 14, 2021

Uh oh!

mrginglymus commented Feb 14, 2021

Uh oh!

XiongKezhi commented Feb 15, 2021

Uh oh!

mrginglymus commented Feb 22, 2021

Uh oh!

timja commented Feb 22, 2021

Uh oh!

mrginglymus commented Mar 3, 2021

Uh oh!

timja commented Mar 3, 2021

Uh oh!

mrginglymus commented Mar 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Feb 13, 2021 •

edited

Loading

KalleOlaviNiemitalo commented Feb 13, 2021 •

edited

Loading

mrginglymus commented Feb 14, 2021 •

edited

Loading