Fix deploy script rolling deployment timeout issue #1927

rileyseaburg · 2025-12-22T17:09:15Z

Summary

Fix infinite loop in deploy script where rolling deployments kept getting canceled before all instances could be replaced
Add --no-wait flag to cf push so it succeeds after first instance is healthy
Add explicit wait_for_deployment_complete function to wait up to 15 minutes for all instances to be replaced
Improve retry logic to detect and wait for active deployments instead of canceling them

Problem

The previous deploy script had a race condition:

cf push --strategy rolling -t 180 starts a rolling deployment
For 18 instances at ~20s each, the deployment takes ~6 minutes
cf push times out (180s) before all instances are replaced
Retry logic kicks in, canceling the in-progress deployment
This creates an infinite loop where old instances (#0, Add GTM container #1, Update gems #2) never get replaced

Solution

Use --no-wait flag so cf push succeeds quickly, then explicitly wait for the deployment to complete with the new wait_for_deployment_complete function.

- Add --no-wait flag to cf push so it succeeds after first instance is healthy - Add wait_for_deployment_complete function to explicitly wait for all instances - Improve retry logic to detect and wait for active deployments instead of canceling them with a new push - Increases effective deployment timeout from 180s to 15 minutes This fixes the infinite loop where rolling deployments kept getting canceled before all 18 instances could be replaced, leaving old instances running.

Copilot

Pull request overview

This PR fixes a critical race condition in the deploy script where rolling deployments were being prematurely canceled due to timeout constraints. The solution introduces the --no-wait flag to allow cf push to return immediately after initiating deployment, then explicitly waits for complete deployment with proper timeout handling.

Key Changes:

Added wait_for_deployment_complete function to poll deployment status until all instances are replaced (up to 15 minutes)
Modified cf push commands to use --no-wait flag, preventing premature timeout
Enhanced retry logic to detect and wait for active deployments rather than canceling them

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-22T17:13:13Z

.circleci/deploy.sh

+      else
+        echo "Deployment did not complete successfully"
+        # Continue to retry logic below
+      fi


The else block at line 245 handles the case where wait_for_deployment_complete returns failure, but there's no similar else block for when exit_code is non-zero at line 238. This means if cf push fails (exit_code != 0), the code immediately jumps to the retry check at line 251 without any specific error handling or logging for the push failure itself. While the error message at line 252 mentions "Push failed", it might be helpful to add an explicit else block to log the specific push failure and exit code before entering retry logic.

Suggested change

fi

fi

else

echo "cf push command failed for $app_name with exit code: $exit_code"

# Continue to retry logic below

Copilot · 2025-12-22T17:13:14Z

.circleci/deploy.sh

+    local status=$(echo "$deployment_info" | grep -o '"value":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")
+    local reason=$(echo "$deployment_info" | grep -o '"reason":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")


The grep pattern '"value":"[^"]*"' is attempting to extract the status value from JSON, but this pattern will match the first occurrence of any field named "value" in the entire JSON response, not necessarily the status value. If there are multiple fields named "value" in the JSON structure, this could extract the wrong field. The same issue exists for the "reason" field extraction. Consider using a JSON parser like jq for more reliable extraction, or use a more specific grep pattern that includes the full path to the status field.

Suggested change

local status=$(echo "$deployment_info" | grep -o '"value":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")

local reason=$(echo "$deployment_info" | grep -o '"reason":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")

local status=$(echo "$deployment_info" | jq -r '.resources[0].status.value // ""' 2>/dev/null || echo "")

local reason=$(echo "$deployment_info" | jq -r '.resources[0].status.reason // ""' 2>/dev/null || echo "")

Copilot · 2025-12-22T17:13:14Z

.circleci/deploy.sh

+      # Check if there's an active deployment that we should wait for instead of retrying
+      local app_guid=$(cf app "$app_name" --guid 2>/dev/null || echo "")
+      if [ -n "$app_guid" ]; then
+        local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null | grep -c '"ACTIVE"' || echo "0")


Using grep -c '"ACTIVE"' counts occurrences of the literal string "ACTIVE" anywhere in the JSON response, which could match fields other than the deployment status (e.g., in comments, descriptions, or other unrelated fields). This could lead to false positives where the script thinks there's an active deployment when there isn't one. Consider using a more precise check, such as checking for the status field specifically with a JSON parser, or using a more specific grep pattern that includes the field context.

Suggested change

local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null | grep -c '"ACTIVE"' || echo "0")

local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null | jq -r '.pagination.total_results // 0' 2>/dev/null || echo "0")

Copilot · 2025-12-22T17:13:14Z

.circleci/deploy.sh

+
+  echo "Waiting for deployment of $app_name to complete..."
+
+  local app_guid=$(cf app "$app_name" --guid)


If cf app "$app_name" --guid fails, the function will continue with an empty app_guid. While the empty string check on line 256 handles this case for the retry logic, in the wait_for_deployment_complete function (line 99), failure to get the app GUID will result in an API call with an empty GUID parameter, which may return unexpected results or errors. Consider adding error handling to fail fast if the app GUID cannot be retrieved.

Suggested change

local app_guid=$(cf app "$app_name" --guid)

local app_guid

if ! app_guid=$(cf app "$app_name" --guid 2>/dev/null) || [ -z "$app_guid" ]; then

echo "✗ Failed to retrieve GUID for app '$app_name'. Aborting deployment wait."

return 1

fi

Copilot · 2025-12-22T17:13:14Z

.circleci/deploy.sh

+    waited=$((waited + wait_interval))
+  done
+
+  echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s"


The timeout message indicates a warning, but the function returns 1 (failure), which will cause the deployment to be retried. However, a timeout after 15 minutes might indicate a stuck deployment rather than a transient failure. Consider whether the deployment should be explicitly canceled before returning failure, to prevent accumulating stuck deployments. Alternatively, document that stuck deployments will remain active and may need manual cleanup.

Suggested change

echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s"

echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s; deployment may still be active or stuck and may require manual cleanup."

Copilot · 2025-12-22T17:13:15Z

.circleci/deploy.sh

+        echo "✗ Deployment was canceled"
+        return 1
+      else
+        echo "✗ Deployment finalized with reason: $reason"


When the deployment completes with an unexpected reason (line 115), the error message shows the reason but doesn't provide guidance on what to do next. Consider logging more diagnostic information such as the full deployment status or suggesting manual investigation steps, to help operators understand what went wrong.

Suggested change

echo "✗ Deployment finalized with reason: $reason"

echo "✗ Deployment finalized with unexpected reason: $reason"

echo "Full deployment response from Cloud Foundry:"

echo "$deployment_info"

echo "Please investigate this deployment manually, for example by running:"

echo " cf curl \"/v3/deployments?app_guids=${app_guid}&order_by=-created_at&per_page=1\""

Copilot AI review requested due to automatic review settings December 22, 2025 17:09

Copilot started reviewing on behalf of rileyseaburg December 22, 2025 17:09 View session

rileyseaburg merged commit 574de74 into production Dec 22, 2025
18 of 19 checks passed

Copilot AI reviewed Dec 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deploy script rolling deployment timeout issue #1927

Fix deploy script rolling deployment timeout issue #1927

Uh oh!

rileyseaburg commented Dec 22, 2025

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		local status=$(echo "$deployment_info" \| grep -o '"value":"[^"]*"' \| head -1 \| cut -d'"' -f4 \|\| echo "")
		local reason=$(echo "$deployment_info" \| grep -o '"reason":"[^"]*"' \| head -1 \| cut -d'"' -f4 \|\| echo "")

	local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null \| grep -c '"ACTIVE"' \|\| echo "0")
	local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null \| jq -r '.pagination.total_results // 0' 2>/dev/null \|\| echo "0")


		echo "Waiting for deployment of $app_name to complete..."

		local app_guid=$(cf app "$app_name" --guid)

	echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s"
	echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s; deployment may still be active or stuck and may require manual cleanup."

-        echo "✗ Deployment finalized with reason: $reason"
+        echo "✗ Deployment finalized with unexpected reason: $reason"
+        echo "Full deployment response from Cloud Foundry:"
+        echo "$deployment_info"
+        echo "Please investigate this deployment manually, for example by running:"
+        echo "  cf curl \"/v3/deployments?app_guids=${app_guid}&order_by=-created_at&per_page=1\""

Fix deploy script rolling deployment timeout issue #1927

Fix deploy script rolling deployment timeout issue #1927

Uh oh!

Conversation

rileyseaburg commented Dec 22, 2025

Summary

Problem

Solution

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants