
feat: strict zero-downtime deploy with readiness probes and rollback#91

Open
bnema wants to merge 25 commits into main from feat/strict-zero-downtime-deploy

Conversation


@bnema bnema commented Mar 3, 2026

Summary

Fixes 500 errors during production deploys by ensuring the old container never stops until the new one is verified ready and stable.

Root cause: multiple failure modes — stale in-memory state causing orphan cleanup to kill the active container, readiness check using only a fixed delay (no actual connection verification), and no rollback if the new container crashes after the traffic switch.

Changes

Core fixes

  • Resolve existing container from runtime — Deploy now queries the Docker runtime when in-memory state is empty, preventing stale-memory kills after Gordon restarts
  • Conservative orphan cleanup — never kills running canonical containers, only stopped canonical or temp (-new/-next) containers
  • Recover missing ImageID — inspect existing container to fill ImageID for redundant deploy skip check

Readiness probes

  • TCP probe — retries TCP connect every 500ms until success or timeout (universal fallback). Uses host-mapped port (127.0.0.1:<host_port>) instead of unreachable container internal IPs in rootless podman.
  • HTTP probe — retries GET every 1s until 2xx/3xx or timeout (used when gordon.health label is set); does not follow redirects
  • Readiness cascade (deploy.readiness_mode = "auto", the default):
    1. Docker healthcheck (if present)
    2. HTTP probe (if gordon.health label set)
    3. TCP probe (if port info available)
    4. Delay fallback (last resort)

Post-switch stabilization & rollback

  • After traffic switch, monitors the new container for a configurable window (deploy.stabilization_delay, default 2s)
  • If the new container crashes, restores the old container, re-invalidates proxy cache, and cleans up the failed container
  • If both old and new containers are dead, returns an error instead of silently pointing at a dead container

Deploy intent — fix event/CLI race condition

  • Problem: gordon push triggered two parallel deploys — one via image.pushed eventbus event (30s timeout, causing orphaned containers) and one via explicit /admin/deploy/ CLI call.
  • Fix: CLI calls POST /admin/deploy-intent/<image> before docker push. The registry service suppresses image.pushed events for that image (2min TTL auto-expiry). Deploy happens only via the explicit /admin/deploy/ call. Direct docker push (without CLI) still triggers event-based deploys.

Config

New options under [deploy]:

  • stabilization_delay (default "2s")
  • tcp_probe_timeout (default "30s")
  • http_probe_timeout (default "60s")
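
Assuming the config file is TOML (the [deploy] table syntax used here suggests it), the new knobs might look like this; keys are from the PR, comments are illustrative:

```toml
[deploy]
# How long to monitor the new container after the traffic switch
stabilization_delay = "2s"
# Per-probe timeouts for the readiness cascade
tcp_probe_timeout = "30s"
http_probe_timeout = "60s"
# "auto" cascades: healthcheck > HTTP probe > TCP probe > delay fallback
readiness_mode = "auto"
```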

Cleanup

  • Removed dead network_isolation.dns_suffix config field (was set but never read; Docker native DNS already handles hostname resolution on user-defined networks)
  • Removed unused containerConfig parameter from resolveContainerEndpoint

Test coverage

  • ~800 new lines of tests including strict ordering verification (.NotBefore()), readiness failure leaving old untouched, rollback on post-switch crash, stabilization success, first deploy sanity, concurrent deploy serialization
  • All existing tests updated for new mock expectations
  • go test ./internal/usecase/container -count=1 passes (114s, includes intentionally slow 30s/60s probe timeout tests)

Commits (16)

f5d1226 feat(admin): clear deploy event suppression after successful deploy
5ba38cf feat(cli): call deploy-intent before push to prevent event-based double deploy
5c9a313 feat(admin): add deploy-intent endpoint to suppress event-based deploys
2eb86f5 feat(registry): add deploy event suppression for CLI-managed pushes
8147c67 refactor: address CodeRabbit review and fix lint gate
5c94f33 chore: remove dead dns_suffix config field
0fd48ee refactor: address CodeRabbit review nitpicks
5d857a7 fix: address code review findings for zero-downtime deploy
2a109aa test: add strict zero-downtime ordering tests
e6981de feat: add deploy config for stabilization delay and probe timeouts
b3c4051 feat: post-switch stabilization window with automatic rollback
43c7dfb feat: readiness cascade — healthcheck > HTTP probe > TCP probe > delay
1c38208 feat: add TCP and HTTP readiness probes
dca4ad5 fix: orphan cleanup never kills running canonical container
5f1ff6a fix: resolve existing container from runtime when memory is stale
c7a2b36 fix: recover missing ImageID via inspect for redundant deploy skip

Summary by CodeRabbit

  • New Features

    • Configurable health probes (TCP/HTTP) with per-probe timeouts.
    • Post-deploy stabilization delay to monitor new containers before finalizing.
    • Deploy-intent control to suppress automatic deployments during image pushes.
    • New image labels for routing and readiness (domains, port, health, env-file).
  • Removed

    • DNS suffix option removed from network isolation config.
  • Documentation

    • Updated docs and examples to reflect DNS suffix removal and new image-label guidance.

bnema added 12 commits March 1, 2026 15:01
When tracked container has no ImageID in memory (stale state after
restart/sync), Deploy now inspects the running container to recover
its ImageID before the redundancy check. This prevents unnecessary
full redeploys when the same image is already running.

Deploy now queries the runtime for a running container with matching
labels when in-memory state is empty. This prevents orphan cleanup
from killing the active container after Gordon restarts.

Running containers with the canonical name are now preserved during
orphan cleanup even if not tracked in memory. Only stopped canonical
containers and temp (-new/-next) containers are cleaned up.

TCP probe retries every 500ms until port accepts connections.
HTTP probe retries GET every 1s, expects 2xx/3xx.
Both are universal readiness checks for the deploy pipeline.

waitForReady now auto-detects the strongest available readiness
signal in auto mode. Docker healthcheck is preferred, then HTTP
probe via gordon.health label, then TCP connect probe to exposed
port, then delay fallback as last resort.

After switching traffic to the new container, Deploy monitors it briefly.
If the new container crashes during stabilization, traffic is switched
back to the old container and the failed new container is cleaned up.

New config options: deploy.stabilization_delay (2s),
deploy.tcp_probe_timeout (30s), deploy.http_probe_timeout (60s).

Verify old container is never stopped before new readiness completes.
Verify readiness failure leaves old container untouched.
Verify first deploy works without existing container.

- httpProbe: add CheckRedirect to prevent following redirects, treating
  3xx responses as successful readiness signals instead of chasing them
- stabilizeNewContainer: verify old container is still running before
  rollback; return error when both containers are dead instead of
  silently returning a dead container reference
- resolveExistingContainer: guard managedCount increment to prevent
  double-counting when domain is already tracked
- Remove unused containerConfig parameter from resolveContainerEndpoint
  and update both call sites in readinessCascade
- Add t.Logf in TestTCPProbe_DelayedListener to surface re-listen
  failures instead of silently swallowing them
- Times(5) for IsContainerRunning in concurrent test is correct:
  2x per deploy (pollContainerRunning + waitForReadyByDelay) + 1x
  stabilization for the second deploy = 5 total; left as-is

DNSSuffix was set from network_isolation.dns_suffix but never read
anywhere in the codebase. Remove the struct field, config wiring,
viper default, and all documentation references across docs/ and wiki/.

coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR removes the DNS suffix docs/config, adds TCP/HTTP readiness probes with configurable timeouts and a stabilization delay for zero-downtime deploys, implements suppressible deploy-event coordination, and wires CLI/admin HTTP and registry changes to support deploy-intent suppression.

Changes

  • Docs & Examples (docs/config/index.md, docs/config/network-isolation.md, docs/config/reference.md, wiki/examples/production.md) — Removed dns_suffix config and DNS-resolution examples/documentation.
  • Config Wiring (internal/app/run.go) — Removed DNSSuffix; added StabilizationDelay, TCPProbeTimeout, HTTPProbeTimeout to container config and defaults.
  • Readiness Probes (internal/usecase/container/readiness.go, internal/usecase/container/readiness_test.go) — New tcpProbe and httpProbe implementations with retries, timeouts, context cancellation and comprehensive tests.
  • Container Service & Deploy Flow (internal/usecase/container/service.go, internal/usecase/container/service_readiness_test.go, internal/usecase/container/service_test.go) — Refactored deploy flow: added readiness cascade, probe/timeouts, stabilization/rollback logic, endpoint resolution, redundancy checks, new helper methods, updated waitForReady signature, and extensive tests.
  • Registry: Deploy-Event Suppression (internal/usecase/registry/service.go, internal/usecase/registry/service_test.go) — Added suppression API (Suppress/Clear/IsDeployEventSuppressed), image-name normalization, 2-min suppression timers, and tests; PutManifest checks suppression before publishing events.
  • DeployCoordinator Interface & Mocks (internal/boundaries/in/deploy.go, internal/boundaries/in/mocks/mock_deploy_coordinator.go, internal/boundaries/in/mocks/mock_registry_service.go) — New DeployCoordinator interface and generated mocks; expanded registry mock with Suppress/Clear methods.
  • CLI Control Plane & Push (internal/adapters/in/cli/controlplane.go, internal/adapters/in/cli/controlplane_local.go, internal/adapters/in/cli/controlplane_remote.go, internal/adapters/in/cli/push.go) — Added DeployIntent to control plane API; local and remote implementations wired to DeployCoordinator; push calls DeployIntent pre-push (non-fatal).
  • Remote Client & Admin API (internal/adapters/in/cli/remote/client.go, internal/adapters/in/http/admin/handler.go) — Remote client POSTs /deploy-intent; admin handler gains /deploy-intent route, enforces permissions, calls SuppressDeployEvent, and clears suppression after explicit admin deploy.
  • Misc Tests (internal/adapters/in/cli/proxy/service_test.go) — Refactored a test to use a buffered bool channel for the DrainRegistryInFlight outcome.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/Push
    participant CP as ControlPlane
    participant Reg as Registry Service
    participant Repo as Image/Deploy logic
    participant Container as Container Runtime

    CLI->>CP: DeployIntent(imageName)
    CP->>Reg: SuppressDeployEvent(imageName)
    Reg->>Reg: store suppression (2m timer)
    CLI->>Repo: Push image / PutManifest
    Repo->>Reg: PutManifest(manifest)
    Reg->>Reg: IsDeployEventSuppressed? -> true
    Reg-->>Repo: skip publish event
    CLI->>Repo: explicit Deploy (admin)
    Repo->>Reg: ClearDeployEventSuppression(imageName)
    Reg->>Reg: remove suppression
    Repo->>Container: Create & Start container
    Repo->>Container: resolveContainerEndpoint -> host:port / internalIP
    Repo->>Container: run readinessCascade (TCP/HTTP/healthcheck)
    Container-->>Repo: readiness result (success/fail)
    alt success
        Repo->>Repo: stabilizeNewContainer (delay / monitor)
    else failure
        Repo->>Container: rollback to old container
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I poked the ports and tapped each URL,

timers set to hum and probes that swell.
DNS hoppity—now gone from the trail,
deploys held gently so rollback won't wail.
A bunny's cheer for safer deploy tales! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 36.51%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'feat: strict zero-downtime deploy with readiness probes and rollback' clearly and specifically summarizes the main feature addition.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/usecase/container/readiness_test.go`:
- Around line 40-43: The test should stop assuming port 1 is unused; instead
open a temporary listener on ":0" to obtain an ephemeral port, read the actual
port number, close that listener to ensure no process is listening, then call
tcpProbe(ctx, "127.0.0.1:<ephemeralPort>", 200*time.Millisecond) and assert an
error. Update the readiness test to create and close the net.Listener (or use
net.Listen in the test setup) around testContext() so the reserved ephemeral
port is deterministic for tcpProbe.

In `@internal/usecase/container/readiness.go`:
- Around line 24-34: The TCP probe currently uses net.DialTimeout which is not
context-aware and can exceed the caller's deadline; replace net.DialTimeout with
net.Dialer{}.DialContext using the incoming ctx and compute a per-attempt
timeout by clamping the desired attempt duration (e.g., 1s) to the remaining
time from ctx.Deadline(), then call DialContext(ctxWithTimeout, "tcp", addr).
Ensure the connection is closed on success (conn.Close) and preserve the
existing select that respects ctx.Done(); apply the same pattern in the HTTP
probe by creating requests with http.NewRequestWithContext and clamping
per-attempt timeouts to the remaining deadline so neither net.Dial nor HTTP GET
can overrun the caller's context.

In `@internal/usecase/container/service_readiness_test.go`:
- Around line 55-56: The test currently mocks
runtime.EXPECT().GetContainerNetworkInfo(mock.Anything,
containerID).Return("172.17.0.5", 8080, nil).Once() which can point at a
reachable Docker host address; replace the hardcoded address/port with a
loopback-based unreachable endpoint by reserving an ephemeral port on 127.0.0.1
(listen on "127.0.0.1:0"), capturing its assigned port, closing the listener to
make the port closed/reserved-then-closed, and then return "127.0.0.1" and that
port from GetContainerNetworkInfo; apply the same change for the other
occurrences around the test (the spots noted at the later expectation lines).
Ensure you create the ephemeral-reserve helper or inline the reserve/close logic
before setting the mock expectations so the port is deterministic and
unreachable during the timeout assertions.

In `@internal/usecase/container/service_test.go`:
- Around line 2677-2708: The test is invalid because Docker disallows creating a
container with the same name as an existing running canonical container; update
the test to reflect real behavior by making createStartedContainer create with a
non-canonical/unique name (or rename/stop the existing one) when ListContainers
returns an active canonical container. Concretely, change the CreateContainer
expectation in the test (the mock for CreateContainer and the matched cfg.Name)
to expect a suffix/unique name (e.g., "gordon-test.example.com-new" or whatever
the production name-generator produces) or adjust the test to simulate the code
path that renames/stops the existing container; reference CreateContainer,
createStartedContainer and cleanupOrphanedContainers to locate and update the
expectation and matching predicate.

In `@internal/usecase/container/service.go`:
- Around line 2269-2324: readinessCascade is above the cyclomatic limit; extract
the HTTP and TCP probe blocks into helper methods to simplify flow. Create
Service methods like attemptHTTPProbe(ctx, containerID string, containerConfig
*domain.ContainerConfig, cfg Config) (bool, error) and attemptTCPProbe(ctx,
containerID string, containerConfig *domain.ContainerConfig, cfg Config) (bool,
error) that encapsulate endpoint resolution (resolveContainerEndpoint), timeout
defaulting, logging, and calling httpProbe/tcpProbe; they should return (true,
err) when a probe was used (returning the probe error) and (false, nil) when
skipped so readinessCascade can simply call them and either return the probe
error or continue to the next step, preserving existing calls to waitForHealthy
and waitForReadyByDelay.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a363ed1 and 5c94f33.

📒 Files selected for processing (11)
  • docs/config/index.md
  • docs/config/network-isolation.md
  • docs/config/reference.md
  • internal/app/run.go
  • internal/usecase/container/readiness.go
  • internal/usecase/container/readiness_test.go
  • internal/usecase/container/service.go
  • internal/usecase/container/service_readiness_test.go
  • internal/usecase/container/service_test.go
  • internal/usecase/proxy/service_test.go
  • wiki/examples/production.md
💤 Files with no reviewable changes (4)
  • docs/config/network-isolation.md
  • docs/config/reference.md
  • docs/config/index.md
  • wiki/examples/production.md

Comment on lines +24 to +34
conn, err := net.DialTimeout("tcp", addr, time.Second)
if err == nil {
conn.Close()
return nil
}

select {
case <-time.After(500 * time.Millisecond):
case <-ctx.Done():
return ctx.Err()
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "readiness.go" | head -20

Repository: bnema/gordon

Length of output: 97


🏁 Script executed:

cat -n internal/usecase/container/readiness.go

Repository: bnema/gordon

Length of output: 2357


Bind probe attempts to context and remaining deadline.

Network calls in both probes can exceed the configured timeout because net.DialTimeout and client.Get are not context-bound. A 1-second dial or 2-second HTTP timeout can proceed even when the caller's deadline is seconds away, stalling readiness checks longer than intended.

Use net.Dialer.DialContext() for TCP probes and http.NewRequestWithContext() for HTTP probes, with per-attempt timeouts clamped to the remaining deadline.

🔧 Proposed fix
 func tcpProbe(ctx context.Context, addr string, timeout time.Duration) error {
 	deadline := time.Now().Add(timeout)
+	dialer := &net.Dialer{}
 	for {
 		if err := ctx.Err(); err != nil {
 			return err
 		}
 		if time.Now().After(deadline) {
 			return fmt.Errorf("TCP probe timeout after %s: %s not reachable", timeout, addr)
 		}
 
-		conn, err := net.DialTimeout("tcp", addr, time.Second)
+		remaining := time.Until(deadline)
+		attemptTimeout := time.Second
+		if remaining < attemptTimeout {
+			attemptTimeout = remaining
+		}
+		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
+		conn, err := dialer.DialContext(attemptCtx, "tcp", addr)
+		cancel()
 		if err == nil {
 			conn.Close()
 			return nil
 		}
 
 		select {
 		case <-time.After(500 * time.Millisecond):
 		case <-ctx.Done():
 			return ctx.Err()
 		}
 	}
 }
@@
 func httpProbe(ctx context.Context, url string, timeout time.Duration) error {
 	client := &http.Client{
 		Timeout: 2 * time.Second,
@@
 	for {
@@
-		resp, err := client.Get(url)
+		remaining := time.Until(deadline)
+		attemptTimeout := 2 * time.Second
+		if remaining < attemptTimeout {
+			attemptTimeout = remaining
+		}
+		req, reqErr := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
+		if reqErr != nil {
+			return reqErr
+		}
+		attemptCtx, cancel := context.WithTimeout(req.Context(), attemptTimeout)
+		req = req.WithContext(attemptCtx)
+		resp, err := client.Do(req)
+		cancel()
 		if err == nil {
 			lastStatus = resp.StatusCode
 			resp.Body.Close()
 			if lastStatus >= 200 && lastStatus < 400 {
 				return nil
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/readiness.go` around lines 24 - 34, The TCP probe
currently uses net.DialTimeout which is not context-aware and can exceed the
caller's deadline; replace net.DialTimeout with net.Dialer{}.DialContext using
the incoming ctx and compute a per-attempt timeout by clamping the desired
attempt duration (e.g., 1s) to the remaining time from ctx.Deadline(), then call
DialContext(ctxWithTimeout, "tcp", addr). Ensure the connection is closed on
success (conn.Close) and preserve the existing select that respects ctx.Done();
apply the same pattern in the HTTP probe by creating requests with
http.NewRequestWithContext and clamping per-attempt timeouts to the remaining
deadline so neither net.Dial nor HTTP GET can overrun the caller's context.

- Extract tryHTTPProbe/tryTCPProbe helpers from readinessCascade to reduce
  cyclomatic complexity (16 → ≤15, gocyclo)
- Extract recordDeployMetrics helper from Deploy defer closure to reduce
  cyclomatic complexity (16 → ≤15, gocyclo)
- Remove unused getTrackedContainer method (unused linter)
- readiness.go: replace net.DialTimeout with net.Dialer.DialContext and
  client.Get with http.NewRequestWithContext; per-attempt timeouts clamped
  to caller deadline so neither probe can overrun the context
- readiness_test.go: use ephemeral port (net.Listen :0 + close) instead of
  hardcoded port 1 for TestTCPProbe_Timeout
- service_readiness_test.go: use loopback ephemeral ports for TCP/HTTP probe
  timeout mocks instead of hardcoded 172.17.0.5
- service_test.go: fix TestService_Deploy_OrphanCleanup_NeverKillsRunningCanonical
  to reflect real Docker behavior — existing canonical container is resolved
  via managed label so new container gets the -new suffix, avoiding a name
  collision that Docker would reject at runtime

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/usecase/container/readiness_test.go`:
- Around line 60-88: The test can race if another process binds the port after
ln.Close(); modify the goroutine that re-binds (the net.Listen call in
TestTCPProbe_DelayedListener) to retry binding for a short window and surface
failures to the test instead of only logging: implement a retry loop around
newLn, sleeping (e.g., 10–50ms) between attempts for up to ~200–500ms, and
replace t.Logf on bind error with require.NoError (or send the error on a
channel and fail the test) so that a failed rebind fails the test immediately;
this keeps tcpProbe(addr) semantics intact while removing the silent port-rebind
race.

In `@internal/usecase/container/service_readiness_test.go`:
- Around line 76-78: The test currently mutates svc.config.TCPProbeTimeout under
svc.mu after construction; instead set TCPProbeTimeout directly in the initial
Config passed when creating the service (the Config struct used to build svc),
e.g., include TCPProbeTimeout: 50 * time.Millisecond in that Config literal so
you can remove the svc.mu.Lock()/Unlock() and the post-construction assignment;
update references to svc.config only if needed and keep the Config field name
TCPProbeTimeout and the svc variable (created by the service constructor) as the
targets to change.

In `@internal/usecase/container/service.go`:
- Around line 610-632: The cleanup calls to s.runtime.StopContainer(ctx,
newContainer.ID) and s.runtime.RemoveContainer(ctx, newContainer.ID, true)
silently ignore returned errors; capture their errors and log them at Debug or
Warn level (including context like newContainer.ID and domainName and the error)
so failed best-effort cleanup is visible; do this for both cleanup sites in the
rollback path (the early cleanup branch and the later cleanup branch) using the
service's logger (e.g., s.logger or the package logger) and include the error
details in the log message.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c94f33 and 8147c67.

📒 Files selected for processing (5)
  • internal/usecase/container/readiness.go
  • internal/usecase/container/readiness_test.go
  • internal/usecase/container/service.go
  • internal/usecase/container/service_readiness_test.go
  • internal/usecase/container/service_test.go

Comment on lines +76 to +78
svc.mu.Lock()
svc.config.TCPProbeTimeout = 50 * time.Millisecond
svc.mu.Unlock()

🧹 Nitpick | 🔵 Trivial

Consider setting timeout in initial Config.

The test modifies svc.config.TCPProbeTimeout via mutex after construction. This could be simplified by including it in the initial Config{} on lines 52-54, making the test setup more declarative.

♻️ Suggested simplification
 func TestService_WaitForReady_AutoCascadeUsesTCPWhenNoHealthcheckAndNoHealthLabel(t *testing.T) {
 	runtime := mocks.NewMockContainerRuntime(t)
 	svc := NewService(runtime, nil, nil, nil, Config{
-		ReadinessMode: "auto",
+		ReadinessMode:   "auto",
+		TCPProbeTimeout: 50 * time.Millisecond,
 	})
 	// ...
-	// TCP probe will try to connect — which will fail since the port is closed.
-	// Use a short health timeout so the test doesn't hang.
-	svc.mu.Lock()
-	svc.config.TCPProbeTimeout = 50 * time.Millisecond
-	svc.mu.Unlock()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
svc.mu.Lock()
svc.config.TCPProbeTimeout = 50 * time.Millisecond
svc.mu.Unlock()
func TestService_WaitForReady_AutoCascadeUsesTCPWhenNoHealthcheckAndNoHealthLabel(t *testing.T) {
runtime := mocks.NewMockContainerRuntime(t)
svc := NewService(runtime, nil, nil, nil, Config{
ReadinessMode: "auto",
TCPProbeTimeout: 50 * time.Millisecond,
})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/service_readiness_test.go` around lines 76 - 78,
The test currently mutates svc.config.TCPProbeTimeout under svc.mu after
construction; instead set TCPProbeTimeout directly in the initial Config passed
when creating the service (the Config struct used to build svc), e.g., include
TCPProbeTimeout: 50 * time.Millisecond in that Config literal so you can remove
the svc.mu.Lock()/Unlock() and the post-construction assignment; update
references to svc.config only if needed and keep the Config field name
TCPProbeTimeout and the svc variable (created by the service constructor) as the
targets to change.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (2)
internal/usecase/container/service_readiness_test.go (1)

52-54: 🧹 Nitpick | 🔵 Trivial

Keep timeout configuration in constructor setup.

Lines 78-80 mutate config after construction. Prefer setting TCPProbeTimeout in the initial Config literal on Lines 52-54 to keep setup declarative.

Also applies to: 78-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/service_readiness_test.go` around lines 52 - 54,
The test mutates Config.TCPProbeTimeout after constructing the service; instead,
pass TCPProbeTimeout in the initial Config literal when calling NewService in
service_readiness_test.go (use the Config struct used in the NewService(...)
call) so that Config (including TCPProbeTimeout) is fully defined at
construction time rather than modified post-construction; update the
NewService(...) call to include TCPProbeTimeout alongside ReadinessMode to keep
setup declarative.
internal/usecase/container/service.go (1)

629-631: 🧹 Nitpick | 🔵 Trivial

Rollback cleanup still drops stop/remove errors.

Lines 630-631 and Lines 650-651 ignore cleanup failures. Logging these errors would materially improve rollback diagnostics.

Also applies to: 650-651

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/service.go` around lines 629 - 631, The rollback
cleanup currently swallows errors from s.runtime.StopContainer(ctx,
newContainer.ID) and s.runtime.RemoveContainer(ctx, newContainer.ID, true);
update the rollback path that handles the failed newContainer to capture
returned errors and log them (e.g., using the existing logger or processLogger)
with context including newContainer.ID and the operation name ("StopContainer" /
"RemoveContainer"); do the same for the other occurrence around lines where
remove/stop are called during rollback so both stop and remove failures are
logged rather than ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/adapters/in/cli/push.go`:
- Around line 159-164: The current logic skips calling handle.plane.DeployIntent
when noDeploy is true, which leaves the registry image.pushed events free to
trigger deployments; always record a suppression intent for imageName instead of
skipping registration: modify the flow around noDeploy to either call a new
method (e.g., handle.plane.SuppressDeploy(ctx, imageName)) or extend
handle.plane.DeployIntent to accept a noDeploy flag so an intent is stored for
imageName even when noDeploy is true, and ensure the event handler checks that
stored intent before performing an actual deploy.

In `@internal/adapters/in/http/admin/handler.go`:
- Around line 234-250: The handler handleDeployIntent lacks authorization and a
nil-check for the registry service: first call h.HasAccess(r.Context(), ...) (or
the appropriate HasAccess signature) and if it denies, respond with
h.sendError(w, http.StatusForbidden, "forbidden") and return; second, verify
h.registrySvc != nil before calling h.registrySvc.SuppressDeployEvent(imageName)
and if nil respond with h.sendError(w, http.StatusServiceUnavailable, "registry
service unavailable") and return; keep existing method check, imageName
validation, and logging but add these two early guards to prevent unauthorized
use and panics.
- Around line 1109-1121: The current manual string-splitting around route.Image
can incorrectly handle digest-form and other valid container references and also
fails to check h.registrySvc before use; replace the ad-hoc parsing in the block
that reads route.Image with a canonical image/name parser (e.g., use your
project’s image parsing util or a library like distribution/reference's
ParseNormalizedNamed) to obtain the repository name without registry, tag or
digest, then call h.registrySvc.ClearDeployEventSuppression with that canonical
repo name; also add a nil guard for h.registrySvc before calling
ClearDeployEventSuppression to avoid panics (refer to route.Image and
h.registrySvc.ClearDeployEventSuppression in this handler).
- Around line 240-249: The handler extracts imageName from the request path but
never URL-decodes it, so escaped names like "team%2Fmy-app" are used as
suppression keys; call url.PathUnescape (or net/url's QueryUnescape if
preferred) on imageName after strings.TrimPrefix and before logging and calling
h.registrySvc.SuppressDeployEvent(imageName) to ensure the suppression key
matches manifest events, handle/unwrap the returned error and return a
BadRequest via h.sendError if unescaping fails.

In `@internal/usecase/container/service.go`:
- Around line 1765-1783: The orphan cleanup currently only protects running
canonical containers (isCanonical) and will remove running temp containers
(-new/-next), which can kill active traffic; update the loop in the orphan
cleanup (the block using expectedName, expectedNewName, expectedNextName,
skipContainerID and the isCanonical variable) to first detect and skip any
container with c.Status == "running" (log and continue) regardless of name, and
then proceed with orphan removal logic for non-running containers; ensure you
still honor skipContainerID when deciding which containers are orphan
candidates.
- Line 262: The deferred recordDeployMetrics call captures the named error
variable &err but the function returns stabilizeErr directly (leaving err
unchanged), so update the stabilize error return path in the function containing
defer s.recordDeployMetrics(ctx, span, route, deployStart, &err) to assign the
stabilization error into the named return error (e.g., set err = stabilizeErr)
before returning; ensure you reference the stabilizeErr variable and the named
err used by the defer so the deferred recordDeployMetrics and span status see
the stabilization failure.

In `@internal/usecase/registry/service.go`:
- Around line 57-63: The suppression expiry callback created with time.AfterFunc
currently deletes by imageName unconditionally, allowing an old timer's callback
to remove a newer timer; update the callback in SuppressDeployEvent so it first
loads s.suppressedImages for imageName, compares the stored value identity to
the local timer instance (type-assert to *time.Timer), and only call
Delete(imageName) if they are the same instance; keep the existing
LoadOrStore/Stop/Store flow but add this identity check in the callback to avoid
removing a newer suppression.

---

Duplicate comments:
In `@internal/usecase/container/service_readiness_test.go`:
- Around line 52-54: The test mutates Config.TCPProbeTimeout after constructing
the service; instead, pass TCPProbeTimeout in the initial Config literal when
calling NewService in service_readiness_test.go (use the Config struct used in
the NewService(...) call) so that Config (including TCPProbeTimeout) is fully
defined at construction time rather than modified post-construction; update the
NewService(...) call to include TCPProbeTimeout alongside ReadinessMode to keep
setup declarative.

In `@internal/usecase/container/service.go`:
- Around line 629-631: The rollback cleanup currently swallows errors from
s.runtime.StopContainer(ctx, newContainer.ID) and s.runtime.RemoveContainer(ctx,
newContainer.ID, true); update the rollback path that handles the failed
newContainer to capture returned errors and log them (e.g., using the existing
logger or processLogger) with context including newContainer.ID and the
operation name ("StopContainer" / "RemoveContainer"); do the same for the other
occurrence around lines where remove/stop are called during rollback so both
stop and remove failures are logged rather than ignored.

ℹ️ Review info
Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ac87aa16-47d9-43e5-b65d-5202634b74b6

📥 Commits

Reviewing files that changed from the base of the PR and between 8147c67 and f5d1226.

📒 Files selected for processing (13)
  • internal/adapters/in/cli/controlplane.go
  • internal/adapters/in/cli/controlplane_local.go
  • internal/adapters/in/cli/controlplane_remote.go
  • internal/adapters/in/cli/push.go
  • internal/adapters/in/cli/remote/client.go
  • internal/adapters/in/http/admin/handler.go
  • internal/boundaries/in/mocks/mock_registry_service.go
  • internal/boundaries/in/registry.go
  • internal/usecase/container/readiness.go
  • internal/usecase/container/service.go
  • internal/usecase/container/service_readiness_test.go
  • internal/usecase/registry/service.go
  • internal/usecase/registry/service_test.go

Comment on lines +159 to +164
if !noDeploy {
	if err := handle.plane.DeployIntent(ctx, imageName); err != nil {
		// Non-fatal: worst case we get a redundant deploy via event
		fmt.Fprintf(os.Stderr, "warning: failed to register deploy intent: %v\n", err)
	}
}

⚠️ Potential issue | 🔴 Critical

--no-deploy can still trigger a deployment.

At Line 159, suppression is skipped when noDeploy is true. That allows registry image.pushed events to trigger deploys, which breaks the “push only” contract for --no-deploy.

Suggested fix
-	if !noDeploy {
-		if err := handle.plane.DeployIntent(ctx, imageName); err != nil {
-			// Non-fatal: worst case we get a redundant deploy via event
-			fmt.Fprintf(os.Stderr, "warning: failed to register deploy intent: %v\n", err)
-		}
-	}
+	if err := handle.plane.DeployIntent(ctx, imageName); err != nil {
+		// Non-fatal: worst case event-based deploy behavior is unchanged
+		fmt.Fprintf(os.Stderr, "warning: failed to register deploy intent: %v\n", err)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if !noDeploy {
	if err := handle.plane.DeployIntent(ctx, imageName); err != nil {
		// Non-fatal: worst case we get a redundant deploy via event
		fmt.Fprintf(os.Stderr, "warning: failed to register deploy intent: %v\n", err)
	}
}
if err := handle.plane.DeployIntent(ctx, imageName); err != nil {
	// Non-fatal: worst case event-based deploy behavior is unchanged
	fmt.Fprintf(os.Stderr, "warning: failed to register deploy intent: %v\n", err)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapters/in/cli/push.go` around lines 159 - 164, The current logic
skips calling handle.plane.DeployIntent when noDeploy is true, which leaves the
registry image.pushed events free to trigger deployments; always record a
suppression intent for imageName instead of skipping registration: modify the
flow around noDeploy to either call a new method (e.g.,
handle.plane.SuppressDeploy(ctx, imageName)) or extend handle.plane.DeployIntent
to accept a noDeploy flag so an intent is stored for imageName even when
noDeploy is true, and ensure the event handler checks that stored intent before
performing an actual deploy.

Comment on lines 1765 to +1783
if (c.Name == expectedName || c.Name == expectedNewName || c.Name == expectedNextName) && c.ID != skipContainerID {
	log.Info().Str(zerowrap.FieldEntityID, c.ID).Str(zerowrap.FieldStatus, c.Status).Msg("found orphaned container, removing")
	// The skipContainerID is the container we resolved as the active
	// one serving traffic. Any other container with a matching name is
	// an orphan from a previous (possibly interrupted) deploy.
	//
	// Running containers with the canonical name are protected — they
	// could be legitimate if resolveExistingContainer found a temp
	// container instead. Running temp containers (-new/-next) that are
	// NOT the skip container are safe to remove: they are leftovers
	// from failed deploys.
	isCanonical := c.Name == expectedName
	if isCanonical && c.Status == "running" {
		log.Debug().
			Str(zerowrap.FieldEntityID, c.ID).
			Str("container_name", c.Name).
			Str(zerowrap.FieldStatus, c.Status).
			Msg("skipping running canonical container during orphan cleanup")
		continue
	}

⚠️ Potential issue | 🔴 Critical

Orphan cleanup can stop a live -new/-next container.

Line 1776 protects only running canonical containers. Running temp containers (-new/-next) are still removed, which can kill active traffic if skipContainerID is stale or unresolved.

🔧 Proposed fix
-			isCanonical := c.Name == expectedName
-			if isCanonical && c.Status == "running" {
+			if c.Status == "running" {
 				log.Debug().
 					Str(zerowrap.FieldEntityID, c.ID).
 					Str("container_name", c.Name).
 					Str(zerowrap.FieldStatus, c.Status).
-					Msg("skipping running canonical container during orphan cleanup")
+					Msg("skipping running container during orphan cleanup")
 				continue
 			}
📝 Committable suggestion


Suggested change
if (c.Name == expectedName || c.Name == expectedNewName || c.Name == expectedNextName) && c.ID != skipContainerID {
	log.Info().Str(zerowrap.FieldEntityID, c.ID).Str(zerowrap.FieldStatus, c.Status).Msg("found orphaned container, removing")
	// The skipContainerID is the container we resolved as the active
	// one serving traffic. Any other container with a matching name is
	// an orphan from a previous (possibly interrupted) deploy.
	//
	// Running containers with the canonical name are protected — they
	// could be legitimate if resolveExistingContainer found a temp
	// container instead. Running temp containers (-new/-next) that are
	// NOT the skip container are safe to remove: they are leftovers
	// from failed deploys.
	isCanonical := c.Name == expectedName
	if isCanonical && c.Status == "running" {
		log.Debug().
			Str(zerowrap.FieldEntityID, c.ID).
			Str("container_name", c.Name).
			Str(zerowrap.FieldStatus, c.Status).
			Msg("skipping running canonical container during orphan cleanup")
		continue
	}
if (c.Name == expectedName || c.Name == expectedNewName || c.Name == expectedNextName) && c.ID != skipContainerID {
	// The skipContainerID is the container we resolved as the active
	// one serving traffic. Any other container with a matching name is
	// an orphan from a previous (possibly interrupted) deploy.
	//
	// Running containers with the canonical name are protected — they
	// could be legitimate if resolveExistingContainer found a temp
	// container instead. Running temp containers (-new/-next) that are
	// NOT the skip container are safe to remove: they are leftovers
	// from failed deploys.
	if c.Status == "running" {
		log.Debug().
			Str(zerowrap.FieldEntityID, c.ID).
			Str("container_name", c.Name).
			Str(zerowrap.FieldStatus, c.Status).
			Msg("skipping running container during orphan cleanup")
		continue
	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/service.go` around lines 1765 - 1783, The orphan
cleanup currently only protects running canonical containers (isCanonical) and
will remove running temp containers (-new/-next), which can kill active traffic;
update the loop in the orphan cleanup (the block using expectedName,
expectedNewName, expectedNextName, skipContainerID and the isCanonical variable)
to first detect and skip any container with c.Status == "running" (log and
continue) regardless of name, and then proceed with orphan removal logic for
non-running containers; ensure you still honor skipContainerID when deciding
which containers are orphan candidates.

bnema added 7 commits March 4, 2026 09:42
- registry: fix timer identity check in SuppressDeployEvent to prevent
  stale AfterFunc callbacks from evicting a newer suppression entry
- registry: add ExtractImageName helper for canonical image ref parsing
- admin/handler: add auth check, registrySvc nil-guard, and URL-decode
  for imageName in handleDeployIntent
- admin/handler: use ExtractImageName and registrySvc nil-guard when
  clearing suppression after explicit deploy
- container: assign stabilizeErr to named return err so deferred
  recordDeployMetrics and span see stabilization failures
- container: skip all running containers (not just canonical) in
  cleanupOrphanedContainers to protect active temp containers
- container: log StopContainer/RemoveContainer errors in rollback paths
  instead of silently discarding them
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/usecase/container/service.go (1)

223-228: ⚠️ Potential issue | 🟠 Major

HTTP readiness probe is unreachable in the current deploy path.

tryHTTPProbe requires containerConfig.Labels[domain.LabelHealth] (Line 2337), but buildContainerConfig only sets domain/image/managed/route labels (Lines 223-228). In practice, HTTP probing is always skipped and auto-mode falls through to TCP/delay.

Also applies to: 2337-2340

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/service.go` around lines 223 - 228,
buildContainerConfig doesn't set domain.LabelHealth so tryHTTPProbe never runs;
add the health label to the Labels map in buildContainerConfig (the same place
where domain.LabelDomain, domain.LabelImage, domain.LabelManaged,
domain.LabelRoute are set) so tryHTTPProbe can read
containerConfig.Labels[domain.LabelHealth] and enable HTTP probing; ensure the
label value follows the expected format used by tryHTTPProbe and update any
callers constructing health settings if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/adapters/in/cli/remote/client.go`:
- Around line 729-733: The DeployIntent method should validate the imageName
before building/sending the request: in Client.DeployIntent check that imageName
is not empty (and optionally trim whitespace) and return a clear validation
error (e.g. fmt.Errorf or errors.New) when it is invalid, rather than calling
requestWithRetry; update the function (referencing DeployIntent and the
imageName parameter) to perform this early-return validation so the CLI fails
fast with a user-facing message.

In `@internal/usecase/container/readiness.go`:
- Around line 49-57: The retry sleepInterval computation currently only clamps
to ctx.Deadline(), which can still oversleep past the probe's own deadline;
update both places where sleepInterval is set (the block using sleepInterval,
ctx.Deadline(), and rem around the top of the probe loop and the similar block
at lines ~109-117) to take the minimum of the context remaining time and the
probe-specific deadline remaining (e.g., compute remCtx :=
time.Until(ctxDeadline) if ctx.Deadline() ok and remProbe :=
time.Until(deadline) for the probe's deadline, then set sleepInterval =
min(500*time.Millisecond, remCtx, remProbe)), and if the computed sleepInterval
<= 0 return the appropriate error (ctx.Err() if the context expired, otherwise
the probe timeout error) so the probe cannot sleep past its own deadline.

In `@internal/usecase/container/service_test.go`:
- Around line 30-35: testMinDelayConfig returns a Config with DrainDelay set to
1ms but doesn't set DrainDelayConfigured, so waitForDrain ignores it and uses
the default; update testMinDelayConfig to also set DrainDelayConfigured true
(alongside ReadinessDelay/DrainDelay/StabilizationDelay) so waitForDrain will
honor the 1ms drain delay.

---

Outside diff comments:
In `@internal/usecase/container/service.go`:
- Around line 223-228: buildContainerConfig doesn't set domain.LabelHealth so
tryHTTPProbe never runs; add the health label to the Labels map in
buildContainerConfig (the same place where domain.LabelDomain,
domain.LabelImage, domain.LabelManaged, domain.LabelRoute are set) so
tryHTTPProbe can read containerConfig.Labels[domain.LabelHealth] and enable HTTP
probing; ensure the label value follows the expected format used by tryHTTPProbe
and update any callers constructing health settings if necessary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c9526c0b-9fbf-4c3d-a72d-67ab4dcc4c9a

📥 Commits

Reviewing files that changed from the base of the PR and between f5d1226 and bf45a07.

📒 Files selected for processing (10)
  • internal/adapters/in/cli/controlplane_local.go
  • internal/adapters/in/cli/remote/client.go
  • internal/adapters/in/http/admin/handler.go
  • internal/boundaries/in/deploy.go
  • internal/boundaries/in/mocks/mock_deploy_coordinator.go
  • internal/usecase/container/readiness.go
  • internal/usecase/container/readiness_test.go
  • internal/usecase/container/service.go
  • internal/usecase/container/service_test.go
  • internal/usecase/registry/service.go

Comment on lines +49 to +57
sleepInterval := 500 * time.Millisecond
if deadline, ok := ctx.Deadline(); ok {
	if rem := time.Until(deadline); rem < sleepInterval {
		sleepInterval = rem
	}
}
if sleepInterval <= 0 {
	return ctx.Err()
}

⚠️ Potential issue | 🟠 Major

Probe timeout can be exceeded by retry sleep.

Line 49 and Line 109 clamp sleep only to ctx.Deadline(), not the probe’s own deadline. With no context deadline, short probe timeouts can still oversleep (up to 500ms/1s) before failing.

🔧 Suggested fix
-		sleepInterval := 500 * time.Millisecond
-		if deadline, ok := ctx.Deadline(); ok {
-			if rem := time.Until(deadline); rem < sleepInterval {
-				sleepInterval = rem
-			}
-		}
+		sleepInterval := 500 * time.Millisecond
+		if rem := time.Until(deadline); rem < sleepInterval {
+			sleepInterval = rem
+		}
+		if ctxDeadline, ok := ctx.Deadline(); ok {
+			if rem := time.Until(ctxDeadline); rem < sleepInterval {
+				sleepInterval = rem
+			}
+		}
@@
-		sleepInterval := time.Second
-		if deadline, ok := ctx.Deadline(); ok {
-			if rem := time.Until(deadline); rem < sleepInterval {
-				sleepInterval = rem
-			}
-		}
+		sleepInterval := time.Second
+		if rem := time.Until(deadline); rem < sleepInterval {
+			sleepInterval = rem
+		}
+		if ctxDeadline, ok := ctx.Deadline(); ok {
+			if rem := time.Until(ctxDeadline); rem < sleepInterval {
+				sleepInterval = rem
+			}
+		}

Also applies to: 109-117

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/readiness.go` around lines 49 - 57, The retry
sleepInterval computation currently only clamps to ctx.Deadline(), which can
still oversleep past the probe's own deadline; update both places where
sleepInterval is set (the block using sleepInterval, ctx.Deadline(), and rem
around the top of the probe loop and the similar block at lines ~109-117) to
take the minimum of the context remaining time and the probe-specific deadline
remaining (e.g., compute remCtx := time.Until(ctxDeadline) if ctx.Deadline() ok
and remProbe := time.Until(deadline) for the probe's deadline, then set
sleepInterval = min(500*time.Millisecond, remCtx, remProbe)), and if the
computed sleepInterval <= 0 return the appropriate error (ctx.Err() if the
context expired, otherwise the probe timeout error) so the probe cannot sleep
past its own deadline.

… delay, and docs

- DeployIntent: validate imageName is non-empty before sending request
- tcpProbe/httpProbe: clamp sleepInterval to probe-specific deadline (not just
  ctx deadline) so probes cannot oversleep past their own timeout; return the
  probe timeout error (not ctx.Err()) when the probe deadline expires
- testMinDelayConfig: set DrainDelayConfigured: true so waitForDrain honours
  the 1ms drain delay in tests instead of falling back to the 2s default
- docs/reference/docker-labels.md: document gordon.health, gordon.domains,
  gordon.port, and gordon.env-file image labels alongside gordon.proxy.port
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/adapters/in/cli/remote/client.go`:
- Around line 729-734: Trim the incoming imageName before using it to build the
deploy-intent path: validate and assign a trimmed value (e.g., overwrite
imageName = strings.TrimSpace(imageName)) in Client.DeployIntent, then pass
url.PathEscape(imageName) to requestWithRetry so encoded spaces are never sent;
update the code around the DeployIntent function where imageName is checked and
where requestWithRetry is called to use the trimmed variable.

In `@internal/usecase/container/readiness.go`:
- Around line 76-138: httpProbe currently exceeds cyclomatic complexity; extract
the retry/deadline/sleep orchestration into small shared helpers so httpProbe
focuses only on request logic. Create helpers such as probeLoop(ctx, deadline,
timeout, work func(attemptTimeout time.Duration) (success bool, info int, err
error)) that computes attemptTimeout, enforces overall deadline, handles
per-attempt context with cancel, computes sleepInterval (shared with tcpProbe),
manages timer/select and returns final result (including lastStatus/info and
context errors). Replace the loop in httpProbe with a call to probeLoop that
performs the HTTP request (using client.Do, sets lastStatus from resp.StatusCode
and closes body) and let probeLoop handle retries, timeouts, sleeps and
propagation of ctx.Err()/timeout errors so both httpProbe and tcpProbe can reuse
the same orchestration helpers.

In `@internal/usecase/container/service_test.go`:
- Around line 429-432: The test config is setting DrainDelay: time.Millisecond
without marking DrainDelayConfigured, which can trigger default drain behavior
and slow tests; update the test setup that constructs the config (the instances
where ReadinessDelay/DrainDelay/StabilizationDelay are set) to either call
testMinDelayConfig() or explicitly set DrainDelayConfigured: true alongside
DrainDelay so the code uses the provided minimal drain value; apply the same
change to the other similar test config block that sets minimal delays (the
block around the other deploy tests).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 86a7643c-e65e-41fc-bf42-55d9263eea86

📥 Commits

Reviewing files that changed from the base of the PR and between bf45a07 and 08d30e7.

📒 Files selected for processing (4)
  • docs/reference/docker-labels.md
  • internal/adapters/in/cli/remote/client.go
  • internal/usecase/container/readiness.go
  • internal/usecase/container/service_test.go

Comment on lines +729 to +734
func (c *Client) DeployIntent(ctx context.Context, imageName string) error {
	if strings.TrimSpace(imageName) == "" {
		return fmt.Errorf("image name cannot be empty")
	}
	resp, err := c.requestWithRetry(ctx, http.MethodPost, "/deploy-intent/"+url.PathEscape(imageName), nil)
	if err != nil {

⚠️ Potential issue | 🟡 Minor

Use the trimmed image name when building the deploy-intent path.

Line 730 validates a trimmed value, but Line 733 still sends the original imageName. An input like " myrepo/app:tag " passes validation yet sends encoded spaces to the server.

🔧 Proposed fix
func (c *Client) DeployIntent(ctx context.Context, imageName string) error {
-	if strings.TrimSpace(imageName) == "" {
+	imageName = strings.TrimSpace(imageName)
+	if imageName == "" {
		return fmt.Errorf("image name cannot be empty")
	}
	resp, err := c.requestWithRetry(ctx, http.MethodPost, "/deploy-intent/"+url.PathEscape(imageName), nil)
	if err != nil {
		return err
	}
	return parseResponse(resp, nil)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapters/in/cli/remote/client.go` around lines 729 - 734, Trim the
incoming imageName before using it to build the deploy-intent path: validate and
assign a trimmed value (e.g., overwrite imageName =
strings.TrimSpace(imageName)) in Client.DeployIntent, then pass
url.PathEscape(imageName) to requestWithRetry so encoded spaces are never sent;
update the code around the DeployIntent function where imageName is checked and
where requestWithRetry is called to use the trimmed variable.

Comment on lines +76 to +138
func httpProbe(ctx context.Context, url string, timeout time.Duration) error {
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// Don't follow redirects — treat 3xx as a successful response.
			return http.ErrUseLastResponse
		},
	}
	deadline := time.Now().Add(timeout)
	var lastStatus int

	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("HTTP probe timeout after %s: last status %d from %s", timeout, lastStatus, url)
		}

		remaining := time.Until(deadline)
		attemptTimeout := 2 * time.Second
		if remaining < attemptTimeout {
			attemptTimeout = remaining
		}
		req, reqErr := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if reqErr != nil {
			return reqErr
		}
		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
		req = req.WithContext(attemptCtx)
		resp, err := client.Do(req)
		cancel()
		if err == nil {
			lastStatus = resp.StatusCode
			resp.Body.Close()
			if lastStatus >= 200 && lastStatus < 400 {
				return nil
			}
		}

		sleepInterval := time.Second
		if remProbe := time.Until(deadline); remProbe < sleepInterval {
			sleepInterval = remProbe
		}
		if ctxDeadline, ok := ctx.Deadline(); ok {
			if remCtx := time.Until(ctxDeadline); remCtx < sleepInterval {
				sleepInterval = remCtx
			}
		}
		if sleepInterval <= 0 {
			if ctx.Err() != nil {
				return ctx.Err()
			}
			return fmt.Errorf("HTTP probe timeout after %s: last status %d from %s", timeout, lastStatus, url)
		}
		t := time.NewTimer(sleepInterval)
		select {
		case <-t.C:
		case <-ctx.Done():
			t.Stop()
			return ctx.Err()
		}
	}
}

⚠️ Potential issue | 🟠 Major

Refactor httpProbe to pass gocyclo and unblock CI.

Line 76 currently exceeds the configured cyclomatic complexity limit (16 > 15), and CI is failing on this function. Please split retry/deadline/sleep orchestration into helpers (shared with tcpProbe is ideal) so httpProbe stays below the threshold.

🔧 Proposed refactor sketch
+func probeSleepInterval(ctx context.Context, probeDeadline time.Time, base time.Duration) (time.Duration, error) {
+	sleep := base
+	if remProbe := time.Until(probeDeadline); remProbe < sleep {
+		sleep = remProbe
+	}
+	if ctxDeadline, ok := ctx.Deadline(); ok {
+		if remCtx := time.Until(ctxDeadline); remCtx < sleep {
+			sleep = remCtx
+		}
+	}
+	if sleep <= 0 {
+		if err := ctx.Err(); err != nil {
+			return 0, err
+		}
+		return 0, context.DeadlineExceeded
+	}
+	return sleep, nil
+}
+
+func doHTTPProbeAttempt(ctx context.Context, client *http.Client, url string, probeDeadline time.Time) (int, error) {
+	attemptTimeout := 2 * time.Second
+	if rem := time.Until(probeDeadline); rem < attemptTimeout {
+		attemptTimeout = rem
+	}
+	attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
+	defer cancel()
+	req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
+	if err != nil {
+		return 0, err
+	}
+	resp, err := client.Do(req)
+	if err != nil {
+		return 0, err
+	}
+	defer resp.Body.Close()
+	return resp.StatusCode, nil
+}
+
 func httpProbe(ctx context.Context, url string, timeout time.Duration) error {
@@
-		remaining := time.Until(deadline)
-		attemptTimeout := 2 * time.Second
-		if remaining < attemptTimeout {
-			attemptTimeout = remaining
-		}
-		req, reqErr := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
-		if reqErr != nil {
-			return reqErr
-		}
-		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
-		req = req.WithContext(attemptCtx)
-		resp, err := client.Do(req)
-		cancel()
-		if err == nil {
-			lastStatus = resp.StatusCode
-			resp.Body.Close()
+		status, err := doHTTPProbeAttempt(ctx, client, url, deadline)
+		if err == nil {
+			lastStatus = status
 			if lastStatus >= 200 && lastStatus < 400 {
 				return nil
 			}
 		}
 
-		sleepInterval := time.Second
-		if remProbe := time.Until(deadline); remProbe < sleepInterval {
-			sleepInterval = remProbe
-		}
-		if ctxDeadline, ok := ctx.Deadline(); ok {
-			if remCtx := time.Until(ctxDeadline); remCtx < sleepInterval {
-				sleepInterval = remCtx
-			}
-		}
-		if sleepInterval <= 0 {
-			if ctx.Err() != nil {
-				return ctx.Err()
-			}
-			return fmt.Errorf("HTTP probe timeout after %s: last status %d from %s", timeout, lastStatus, url)
-		}
+		sleepInterval, sleepErr := probeSleepInterval(ctx, deadline, time.Second)
+		if sleepErr != nil {
+			if sleepErr == context.DeadlineExceeded {
+				return fmt.Errorf("HTTP probe timeout after %s: last status %d from %s", timeout, lastStatus, url)
+			}
+			return sleepErr
+		}
🧰 Tools
🪛 GitHub Actions: CI

[error] 76-76: golangci-lint: cyclomatic complexity 16 of func httpProbe is high (> 15) (gocyclo)

🪛 GitHub Check: Lint

[failure] 76-76:
cyclomatic complexity 16 of func httpProbe is high (> 15) (gocyclo)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/readiness.go` around lines 76 - 138, httpProbe
currently exceeds cyclomatic complexity; extract the retry/deadline/sleep
orchestration into small shared helpers so httpProbe focuses only on request
logic. Create helpers such as probeLoop(ctx, deadline, timeout, work
func(attemptTimeout time.Duration) (success bool, info int, err error)) that
computes attemptTimeout, enforces overall deadline, handles per-attempt context
with cancel, computes sleepInterval (shared with tcpProbe), manages timer/select
and returns final result (including lastStatus/info and context errors). Replace
the loop in httpProbe with a call to probeLoop that performs the HTTP request
(using client.Do, sets lastStatus from resp.StatusCode and closes body) and let
probeLoop handle retries, timeouts, sleeps and propagation of ctx.Err()/timeout
errors so both httpProbe and tcpProbe can reuse the same orchestration helpers.

Comment on lines +429 to 432
	ReadinessDelay:     time.Millisecond, // Minimal delay for tests
	DrainDelay:         time.Millisecond,
	StabilizationDelay: time.Millisecond,
}

🧹 Nitpick | 🔵 Trivial

Prefer testMinDelayConfig() (or set DrainDelayConfigured: true) in these deploy tests.

These blocks set DrainDelay: time.Millisecond but not DrainDelayConfigured. In old-container finalize paths, that can silently use the default drain behavior and add avoidable delay to test runs.

♻️ Suggested change
-	config := Config{
-		ReadinessDelay:     time.Millisecond, // Minimal delay for tests
-		DrainDelay:         time.Millisecond,
-		StabilizationDelay: time.Millisecond,
-	}
+	config := testMinDelayConfig()

Also applies to: 1677-1680

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/usecase/container/service_test.go` around lines 429 - 432, The test
config is setting DrainDelay: time.Millisecond without marking
DrainDelayConfigured, which can trigger default drain behavior and slow tests;
update the test setup that constructs the config (the instances where
ReadinessDelay/DrainDelay/StabilizationDelay are set) to either call
testMinDelayConfig() or explicitly set DrainDelayConfigured: true alongside
DrainDelay so the code uses the provided minimal drain value; apply the same
change to the other similar test config block that sets minimal delays (the
block around the other deploy tests).
