[JENKINS-76327] EC2 Plugin blocks Jenkins queue lock for up to 25 seconds for each orphan instance during startup

<p>When the controller starts after an outage or crash while orphaned agents are still registered, the build queue is completely blocked. <br/>
 <br/>
The impact scales linearly with the number of orphaned agents: with 10 agents the queue can be blocked for up to 250 seconds (25 s × 10), a little over 4 minutes.</p>
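<p>The linear scaling can be made concrete with a small worst-case calculation. This is only a sketch: the class and method names are hypothetical, and the constants (5 retries, 5000 ms sleep) are taken from the call chain in this report rather than read from the plugin at runtime.</p>

```java
import java.time.Duration;

// Worst-case estimate only; the constants mirror the figures quoted in this report.
final class QueueBlockEstimate {
    static final int RETRIES = 5;            // retries per missing instance (from the report)
    static final long SLEEP_MILLIS = 5_000L; // sleep between retries (from the report)

    /** Upper bound on how long the queue lock is held for the given number of orphaned agents. */
    static Duration worstCase(int orphanedAgents) {
        if (orphanedAgents < 0) {
            throw new IllegalArgumentException("orphanedAgents must be >= 0");
        }
        long perAgentMillis = RETRIES * SLEEP_MILLIS; // 25 000 ms per orphaned agent
        return Duration.ofMillis(Math.multiplyExact(perAgentMillis, (long) orphanedAgents));
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 50}) {
            Duration d = worstCase(n);
            System.out.printf("%d orphaned agents -> up to %d min %d s%n",
                    n, d.toMinutes(), d.toSecondsPart());
        }
    }
}
```

<p>For 10 orphaned agents this gives 250 seconds, i.e. 4 minutes 10 seconds of blocked queue operations in the worst case.</p>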
<h2><a name="StepstoReproduce"></a>Steps to Reproduce</h2>

<p>1. Create many EC2 agents via the EC2 plugin<br/>
2. Stop the Jenkins controller<br/>
3. Manually delete the EC2 instances from the AWS console (so the instances no longer exist in AWS)<br/>
4. Start the Jenkins controller</p>
<h3><a name="ActualResult"></a>Actual Result</h3>
<ul class="alternate" type="square">
	<li>All Queue operations are blocked for up to 25 seconds × the number of orphaned agents</li>
	<li>Some liveness probes check the queue status and time out because every queue operation is blocked. The controller is then treated as unhealthy, which can cause automation to restart it</li>
</ul>


<h3><a name="ExpectedResult"></a>Expected Result</h3>
<ul class="alternate" type="square">
	<li>Queue operations continue normally (the queue lock is held for a negligible time)</li>
	<li>Orphaned agents are handled asynchronously or skipped</li>
</ul>
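<p>One possible shape for the asynchronous handling is sketched below. It is an illustration under assumptions, not the plugin's code: <code>AsyncOrphanCheck</code>, <code>existsInCloud</code> and <code>markOrphan</code> are hypothetical names, and a plain <code>ReentrantLock</code> stands in for the Jenkins queue lock. The point is that the slow cloud lookup runs without the lock, and the lock is taken only for the short bookkeeping step.</p>

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Predicate;

final class AsyncOrphanCheck {
    // Stand-in for the Jenkins queue lock; this is NOT the real Queue.withLock() API.
    private final ReentrantLock queueLock = new ReentrantLock();
    private final Set<String> orphans = ConcurrentHashMap.newKeySet();
    // Hypothetical slow lookup: returns true if the instance still exists in the cloud.
    private final Predicate<String> existsInCloud;

    AsyncOrphanCheck(Predicate<String> existsInCloud) {
        this.existsInCloud = existsInCloud;
    }

    /** Runs the slow lookups on the executor; only the final bookkeeping takes the lock. */
    CompletableFuture<Void> checkAll(List<String> instanceIds, ExecutorService executor) {
        CompletableFuture<?>[] checks = instanceIds.stream()
                .map(id -> CompletableFuture
                        .supplyAsync(() -> existsInCloud.test(id), executor) // slow part, lock not held
                        .thenAccept(exists -> {
                            if (!exists) {
                                markOrphan(id);
                            }
                        }))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(checks);
    }

    private void markOrphan(String id) {
        queueLock.lock(); // short critical section: no network calls, no sleeps
        try {
            orphans.add(id);
        } finally {
            queueLock.unlock();
        }
    }

    Set<String> orphans() {
        return Set.copyOf(orphans);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            // Pretend every id starting with "gone-" was deleted from the cloud.
            AsyncOrphanCheck check = new AsyncOrphanCheck(id -> !id.startsWith("gone-"));
            check.checkAll(List.of("i-1", "gone-2"), executor).join();
            System.out.println("orphans: " + check.orphans());
        } finally {
            executor.shutdown();
        }
    }
}
```

<p>With this structure, other callers waiting for the lock are delayed only by the brief <code>markOrphan</code> step, regardless of how long each lookup takes.</p>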


<h2><a name="RootCause"></a>Root Cause</h2>

<p>EC2RetentionStrategy.internalCheck() calls <a href="https://github.com/jenkinsci/ec2-plugin/blob/483cc2854116da9079d47e95e31e9b7a2e609372/src/main/java/hudson/plugins/ec2/CloudHelper.java#L23" class="external-link" target="_blank" rel="nofollow noopener">CloudHelper.getInstance<b>WithRetry</b>()</a>, which blocks for up to 25 seconds when the instance is not found (orphan). Because this happens while the queue lock is held, every other queue operation waits.</p>

<p>Call chain:</p>

<pre>

ComputerRetentionWork.doAperiodicRun  runs every 1 minute
Queue.withLock()                      [LOCK]
EC2RetentionStrategy.check()
EC2RetentionStrategy.internalCheck()
EC2Computer.getState()                <span class="code-keyword">for</span> each instance
CloudHelper.getInstanceWithRetry()
<span class="code-object">Thread</span>.sleep(5000)                    × 5 retries = 25 seconds MAX</pre>
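<p>The effect of this call chain can be imitated with a simplified model. The sketch below is not the plugin's implementation: <code>lookupWithRetry</code> is a hypothetical stand-in for the retry helper, a plain monitor stands in for the queue lock, and the sleep is recorded rather than performed so the example finishes instantly. The retry count and sleep interval are the figures quoted in this report.</p>

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.LongConsumer;

final class RetryUnderLockSketch {
    static final int RETRIES = 5;            // from the report
    static final long SLEEP_MILLIS = 5_000L; // from the report

    /**
     * Simplified stand-in for a retrying lookup: tries up to RETRIES times and
     * "sleeps" after every failed attempt. The sleeper is injected so the sketch
     * can count the time instead of really calling Thread.sleep.
     */
    static <T> Optional<T> lookupWithRetry(String id,
                                           Function<String, Optional<T>> lookup,
                                           LongConsumer sleeper) {
        for (int attempt = 0; attempt < RETRIES; attempt++) {
            Optional<T> found = lookup.apply(id);
            if (found.isPresent()) {
                return found; // live instances return immediately
            }
            sleeper.accept(SLEEP_MILLIS);
        }
        return Optional.empty(); // orphan: every attempt failed
    }

    /** Total time spent sleeping while one caller holds the lock and checks every id in turn. */
    static long millisSleptUnderLock(List<String> ids,
                                     Function<String, Optional<String>> lookup) {
        long[] slept = {0};
        Object lock = new Object(); // stand-in for the queue lock
        synchronized (lock) {
            for (String id : ids) {
                lookupWithRetry(id, lookup, ms -> slept[0] += ms);
            }
        }
        return slept[0];
    }

    public static void main(String[] args) {
        // Ids containing "gone" model instances that no longer exist in the cloud.
        Function<String, Optional<String>> lookup =
                id -> id.contains("gone") ? Optional.<String>empty() : Optional.of("running");
        long slept = millisSleptUnderLock(List.of("i-live", "i-gone-1", "i-gone-2"), lookup);
        System.out.println("slept under lock (ms): " + slept);
    }
}
```

<p>In this model the live instance costs nothing, while each missing instance adds 25 000 ms of sleep inside the locked section, which matches the 25 seconds per orphan described in this report.</p>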
<p> </p>

<p>The issue was probably introduced by <a href="https://issue-redirect.jenkins.io/issue/54071" title="EC2-plugin not spooling up stopped nodes" class="issue-link" data-issue-key="JENKINS-54071"><del>JENKINS-54071</del></a> (<a href="http://github.com/jenkinsci/ec2-plugin/pull/311" class="external-link" target="_blank" rel="nofollow noopener">PR</a>), which started using retries in retention checks.</p>

<p> </p>
<h2><a name="Workarround"></a>Workaround</h2>

<p>1. Manually delete `$JENKINS_HOME/nodes/` before starting the controller after an outage<br/>
2. Enable `cleanUpOrphanedNodes: true` in the EC2 cloud config (only available in the latest versions); this reduces how often the blocking occurs but does not prevent it</p>

---
<details><summary><i>Originally reported by <img align="left" width="20" src="https://raw.githubusercontent.com/jenkinsci/artifacts-from-jira-issues/refs/heads/main/avatars/apuig.png" title="apuig's avatar" /> <a href="https://issues.jenkins.io/secure/ViewProfile.jspa?name=apuig">apuig</a>, imported from: <a class="original-jira-link" href="https://issues.jenkins.io/browse/JENKINS-76327" target="_blank">EC2 Plugin blocks Jenkins queue lock for up to 25 seconds for each orphan instance during startup</a></i></summary>
<i><ul>
<li><b>assignee</b>: <a href="https://issues.jenkins.io/secure/ViewProfile.jspa?name=thoulen">thoulen</a>
<li><b>status</b>: Open
<li><b>priority</b>: Major
<li><b>component(s)</b>: ec2-plugin
<li><b>resolution</b>: Unresolved
<li><b>votes</b>: 0
<li><b>watchers</b>: 2
<li><b>imported</b>: 2025-12-06
</ul></i>
</details>
<ul><li><i>environment</i>: <code>EC2 plugin 2043.v483cc2854116</code></li></ul>
