Skip to content

Conversation

@rileyseaburg
Copy link
Contributor

Summary

  • Fix infinite loop in deploy script where rolling deployments kept getting canceled before all instances could be replaced
  • Add --no-wait flag to cf push so it succeeds after first instance is healthy
  • Add explicit wait_for_deployment_complete function to wait up to 15 minutes for all instances to be replaced
  • Improve retry logic to detect and wait for active deployments instead of canceling them

Problem

The previous deploy script had a race condition:

  1. cf push --strategy rolling -t 180 starts a rolling deployment
  2. For 18 instances at ~20s each, the deployment takes ~6 minutes
  3. cf push times out (180s) before all instances are replaced
  4. Retry logic kicks in, canceling the in-progress deployment
  5. This creates an infinite loop where old instances (#0, Add GTM container #1, Update gems #2) never get replaced

Solution

Use --no-wait flag so cf push succeeds quickly, then explicitly wait for the deployment to complete with the new wait_for_deployment_complete function.

- Add --no-wait flag to cf push so it succeeds after first instance is healthy
- Add wait_for_deployment_complete function to explicitly wait for all instances
- Improve retry logic to detect and wait for active deployments instead of
  canceling them with a new push
- Increases effective deployment timeout from 180s to 15 minutes

This fixes the infinite loop where rolling deployments kept getting canceled
before all 18 instances could be replaced, leaving old instances running.
Copilot AI review requested due to automatic review settings December 22, 2025 17:09
@rileyseaburg rileyseaburg merged commit 574de74 into production Dec 22, 2025
18 of 19 checks passed
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical race condition in the deploy script where rolling deployments were being prematurely canceled due to timeout constraints. The solution introduces the --no-wait flag to allow cf push to return immediately after initiating deployment, then explicitly waits for complete deployment with proper timeout handling.

Key Changes:

  • Added wait_for_deployment_complete function to poll deployment status until all instances are replaced (up to 15 minutes)
  • Modified cf push commands to use --no-wait flag, preventing premature timeout
  • Enhanced retry logic to detect and wait for active deployments rather than canceling them

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

else
echo "Deployment did not complete successfully"
# Continue to retry logic below
fi
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The else block at line 245 handles the case where wait_for_deployment_complete returns failure, but there's no similar else block for when exit_code is non-zero at line 238. This means if cf push fails (exit_code != 0), the code immediately jumps to the retry check at line 251 without any specific error handling or logging for the push failure itself. While the error message at line 252 mentions "Push failed", it might be helpful to add an explicit else block to log the specific push failure and exit code before entering retry logic.

Suggested change
fi
fi
else
echo "cf push command failed for $app_name with exit code: $exit_code"
# Continue to retry logic below

Copilot uses AI. Check for mistakes.
Comment on lines +104 to +105
local status=$(echo "$deployment_info" | grep -o '"value":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")
local reason=$(echo "$deployment_info" | grep -o '"reason":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grep pattern '"value":"[^"]*"' is attempting to extract the status value from JSON, but this pattern will match the first occurrence of any field named "value" in the entire JSON response, not necessarily the status value. If there are multiple fields named "value" in the JSON structure, this could extract the wrong field. The same issue exists for the "reason" field extraction. Consider using a JSON parser like jq for more reliable extraction, or use a more specific grep pattern that includes the full path to the status field.

Suggested change
local status=$(echo "$deployment_info" | grep -o '"value":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")
local reason=$(echo "$deployment_info" | grep -o '"reason":"[^"]*"' | head -1 | cut -d'"' -f4 || echo "")
local status=$(echo "$deployment_info" | jq -r '.resources[0].status.value // ""' 2>/dev/null || echo "")
local reason=$(echo "$deployment_info" | jq -r '.resources[0].status.reason // ""' 2>/dev/null || echo "")

Copilot uses AI. Check for mistakes.
# Check if there's an active deployment that we should wait for instead of retrying
local app_guid=$(cf app "$app_name" --guid 2>/dev/null || echo "")
if [ -n "$app_guid" ]; then
local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null | grep -c '"ACTIVE"' || echo "0")
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using grep -c '"ACTIVE"' counts occurrences of the literal string "ACTIVE" anywhere in the JSON response, which could match fields other than the deployment status (e.g., in comments, descriptions, or other unrelated fields). This could lead to false positives where the script thinks there's an active deployment when there isn't one. Consider using a more precise check, such as checking for the status field specifically with a JSON parser, or using a more specific grep pattern that includes the field context.

Suggested change
local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null | grep -c '"ACTIVE"' || echo "0")
local active_deployment=$(cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE" 2>/dev/null | jq -r '.pagination.total_results // 0' 2>/dev/null || echo "0")

Copilot uses AI. Check for mistakes.

echo "Waiting for deployment of $app_name to complete..."

local app_guid=$(cf app "$app_name" --guid)
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cf app "$app_name" --guid fails, the function will continue with an empty app_guid. While the empty string check on line 256 handles this case for the retry logic, in the wait_for_deployment_complete function (line 99), failure to get the app GUID will result in an API call with an empty GUID parameter, which may return unexpected results or errors. Consider adding error handling to fail fast if the app GUID cannot be retrieved.

Suggested change
local app_guid=$(cf app "$app_name" --guid)
local app_guid
if ! app_guid=$(cf app "$app_name" --guid 2>/dev/null) || [ -z "$app_guid" ]; then
echo "✗ Failed to retrieve GUID for app '$app_name'. Aborting deployment wait."
return 1
fi

Copilot uses AI. Check for mistakes.
waited=$((waited + wait_interval))
done

echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s"
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout message indicates a warning, but the function returns 1 (failure), which will cause the deployment to be retried. However, a timeout after 15 minutes might indicate a stuck deployment rather than a transient failure. Consider whether the deployment should be explicitly canceled before returning failure, to prevent accumulating stuck deployments. Alternatively, document that stuck deployments will remain active and may need manual cleanup.

Suggested change
echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s"
echo "Warning: Timed out waiting for deployment to complete after ${max_wait}s; deployment may still be active or stuck and may require manual cleanup."

Copilot uses AI. Check for mistakes.
echo "✗ Deployment was canceled"
return 1
else
echo "✗ Deployment finalized with reason: $reason"
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the deployment completes with an unexpected reason (line 115), the error message shows the reason but doesn't provide guidance on what to do next. Consider logging more diagnostic information such as the full deployment status or suggesting manual investigation steps, to help operators understand what went wrong.

Suggested change
echo "✗ Deployment finalized with reason: $reason"
echo "✗ Deployment finalized with unexpected reason: $reason"
echo "Full deployment response from Cloud Foundry:"
echo "$deployment_info"
echo "Please investigate this deployment manually, for example by running:"
echo " cf curl \"/v3/deployments?app_guids=${app_guid}&order_by=-created_at&per_page=1\""

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants