
fix: registry push resilience during Gordon restart#90

Merged
bnema merged 4 commits into main from fix/registry-shutdown-resilience
Mar 1, 2026

Conversation

@bnema
Owner

@bnema bnema commented Mar 1, 2026

Problem

When gordon push was running and Gordon restarted due to a config file change, the final manifest push failed with 503 Service Unavailable. All 12 image layers had uploaded successfully, but the manifest — the last step that commits the image to the registry — was lost because the registry backend was shutting down while the proxy frontend was still alive and forwarding traffic.

Two root causes identified from production logs:

1. Wrong shutdown order: gracefulShutdown stopped registrySrv first, then proxySrv/tlsSrv. While the backend was draining, the proxy continued forwarding registry traffic to a dead backend → 503.

2. No in-flight tracking for registry requests — unlike container proxy requests (which have trackInFlight), proxyToRegistry had no drain mechanism. Shutdown had no way to wait for active manifest pushes to complete.

Changes

internal/app/run.go

  • Reversed shutdown order: tlsSrv → proxySrv → registrySrv (frontends before backend)
  • Split the single shutdown loop into 3 explicit phases with a drain step between stopping frontends and stopping the registry backend
  • gracefulShutdown now accepts proxySvc *proxy.Service to call DrainRegistryInFlight
  • Added explanatory comment documenting why order matters

internal/usecase/proxy/service.go

  • Added registryInFlight atomic.Int64 field to Service
  • Instrumented proxyToRegistry with Add(1) / defer Add(-1)
  • Added DrainRegistryInFlight(timeout time.Duration) bool — polls until counter is 0 or timeout, returns true if clean drain
  • Added RegistryInFlight() int64 for observability (used in the timeout warning log)

internal/usecase/proxy/service_test.go

  • TestRegistryInFlightTracking — verifies counter increments/decrements correctly
  • TestDrainRegistryInFlight — verifies drain returns true after requests complete
  • TestDrainRegistryInFlightTimeout — verifies drain returns false on timeout

internal/app/migrate_env_test.go (boyscout)

  • Fixed TestMigrateAttachmentEnvFile: test was calling GetAllAttachment with "gitea-postgres-<ts>" but extractContainerNameFromAttachmentFile preserves the "gordon-" prefix, so the lookup always returned empty. Now derives storedContainerName from the filename to match what production code actually stores.
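
The fix amounts to deriving the stored name from the env file name rather than rebuilding it. A hedged sketch of the idea; the path, the "gordon-" prefix, the ".env" extension, and the helper name are assumptions for illustration, not the actual test code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// storedNameFromEnvFile derives the container name the way the migration
// code stores it: the file's base name with the extension stripped and the
// "gordon-" prefix kept intact. Hypothetical helper for illustration.
func storedNameFromEnvFile(envFileName string) string {
	base := filepath.Base(envFileName)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	// The prefix survives, so lookups against stored attachments match.
	fmt.Println(storedNameFromEnvFile("/data/env/gordon-gitea-postgres-1709290000.env"))
}
```

Deriving the lookup key from the same source the production code uses keeps the test from silently asserting against an empty result.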

Result

  • All tests pass (previously TestMigrateAttachmentEnvFile was failing)
  • On restart, Gordon now stops accepting new registry traffic first, waits up to 25s for in-flight pushes to complete, then stops the registry backend — manifest pushes survive a restart

Summary by CodeRabbit

Release Notes

  • New Features
    • Enhanced graceful shutdown mechanism to properly drain in-flight requests before service termination, improving reliability during application restarts and updates.

bnema added 4 commits March 1, 2026 12:03

  • Prevents 503 on manifest push during Gordon restart. Previously registrySrv was shut down first, leaving the proxy alive and forwarding registry traffic to a dying backend. Now TLS and proxy frontends are stopped first, then the registry backend drains.

  • Adds registryInFlight atomic.Int64 counter to proxy.Service and instruments proxyToRegistry to increment/decrement it. Adds DrainRegistryInFlight method for use during shutdown. This allows gracefulShutdown to wait for active manifest pushes before stopping the registry backend.

  • gracefulShutdown now stops TLS and proxy frontends first, then calls DrainRegistryInFlight (25s timeout) to wait for active manifest pushes to complete before stopping the registry backend. Fixes 503 errors on gordon push during Gordon restarts.

  • extractContainerNameFromAttachmentFile preserves the 'gordon-' prefix from the filename, but the test was calling GetAllAttachment with the bare containerName (without prefix), so the lookup returned empty. Fix by deriving storedContainerName from the filename, matching what migrateAttachmentEnvFile actually stores in pass.
@coderabbitai

coderabbitai bot commented Mar 1, 2026

📝 Walkthrough

Walkthrough

These changes implement graceful shutdown coordination by introducing in-flight request tracking to the proxy service. The gracefulShutdown function is extended to accept a proxy service and orchestrate a three-phase shutdown sequence: stopping TLS/proxy frontends, draining in-flight registry requests, and shutting down the registry server.

Changes

Graceful Shutdown Orchestration (internal/app/run.go)
Extended gracefulShutdown function signature to accept proxySvc *proxy.Service parameter. Updated runServers to pass svc.proxySvc when calling gracefulShutdown. Shutdown sequence expanded to three phases: Phase 1 stops TLS/proxy frontends, Phase 2 drains in-flight registry requests via proxy service, Phase 3 shuts down registry server.

In-Flight Request Tracking (internal/usecase/proxy/service.go, internal/usecase/proxy/service_test.go)
Added registryInFlight atomic counter field to Service struct. Introduced RegistryInFlight() method to read current in-flight count and DrainRegistryInFlight(timeout time.Duration) method to block until requests drain or timeout. Instrumented proxyToRegistry to increment/decrement counter around request handling. Test suite added with TestRegistryInFlightTracking, TestDrainRegistryInFlight, and TestDrainRegistryInFlightTimeout.

Container Name Handling (internal/app/migrate_env_test.go)
Updated TestMigrateAttachmentEnvFile to derive storedContainerName from envFileName base name. Cleanup and retrieval operations now use storedContainerName instead of containerName for consistency with storage operations.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant TLS as TLS/Proxy<br/>Server
    participant Proxy as Proxy<br/>Service
    participant Registry as Registry<br/>Server
    
    App->>App: Receive shutdown signal
    
    Note over App,Registry: Phase 1: Stop frontends
    App->>TLS: Shutdown TLS/Proxy servers
    TLS-->>App: Shutdown complete
    
    Note over App,Registry: Phase 2: Drain in-flight
    App->>Proxy: DrainRegistryInFlight(timeout)
    Proxy->>Proxy: Poll registryInFlight counter
    Proxy-->>App: Drain complete or timeout
    
    Note over App,Registry: Phase 3: Shutdown registry
    App->>Registry: Shutdown registry server
    Registry-->>App: Shutdown complete

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A shutdown graceful, oh so fine,
In-flight requests drain line by line,
Three phases choreographed with care,
No request left stranded in the air!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 28.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'fix: registry push resilience during Gordon restart' clearly summarizes the main change: improving resilience of registry push operations during application restart by fixing shutdown ordering and in-flight request tracking.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/usecase/proxy/service_test.go`:
- Around line 535-562: In TestDrainRegistryInFlight the goroutine leaves the
done channel open on failure causing a long timeout; change it to signal
failures immediately by using a result channel (e.g., result := make(chan bool))
or by calling t.Errorf from inside the goroutine, then always send a
boolean/close the result channel when DrainRegistryInFlight returns; update the
select to check the result and call t.Fatalf/t.Errorf on false so failures in
the goroutine surface immediately — adjust the TestDrainRegistryInFlight
function and the anonymous goroutine that calls svc.DrainRegistryInFlight
(referenced symbols: TestDrainRegistryInFlight, DrainRegistryInFlight, svc,
registryInFlight).

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8ea1b0 and 15ba724.

📒 Files selected for processing (4)
  • internal/app/migrate_env_test.go
  • internal/app/run.go
  • internal/usecase/proxy/service.go
  • internal/usecase/proxy/service_test.go

Comment on lines +535 to +562
func TestDrainRegistryInFlight(t *testing.T) {
	svc := &Service{
		inFlight: make(map[string]int),
	}

	svc.registryInFlight.Add(2)

	done := make(chan struct{})
	go func() {
		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
		if !drained {
			// Signal failure via done channel by leaving it open — test will time out
			return
		}
		close(done)
	}()

	time.Sleep(5 * time.Millisecond)
	svc.registryInFlight.Add(-1)
	svc.registryInFlight.Add(-1)

	select {
	case <-done:
		// good — drained cleanly
	case <-time.After(500 * time.Millisecond):
		t.Fatal("DrainRegistryInFlight did not return true after requests completed")
	}
}

🧹 Nitpick | 🔵 Trivial

Consider using t.Errorf in goroutine for faster failure feedback.

The current approach leaves the done channel open on failure, causing the test to wait for the full 500ms timeout. Using a result channel or t.Errorf would provide faster feedback on failures.

♻️ Optional: faster failure signaling
 	done := make(chan struct{})
+	result := make(chan bool, 1)
 	go func() {
 		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
-		if !drained {
-			// Signal failure via done channel by leaving it open — test will time out
-			return
-		}
-		close(done)
+		result <- drained
+		if drained {
+			close(done)
+		}
 	}()
 
 	time.Sleep(5 * time.Millisecond)
 	svc.registryInFlight.Add(-1)
 	svc.registryInFlight.Add(-1)
 
 	select {
 	case <-done:
 		// good — drained cleanly
 	case <-time.After(500 * time.Millisecond):
-		t.Fatal("DrainRegistryInFlight did not return true after requests completed")
+		if r := <-result; !r {
+			t.Fatal("DrainRegistryInFlight returned false unexpectedly")
+		} else {
+			t.Fatal("DrainRegistryInFlight did not return in time")
+		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

func TestDrainRegistryInFlight(t *testing.T) {
	svc := &Service{
		inFlight: make(map[string]int),
	}

	svc.registryInFlight.Add(2)

	done := make(chan struct{})
	result := make(chan bool, 1)
	go func() {
		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
		result <- drained
		if drained {
			close(done)
		}
	}()

	time.Sleep(5 * time.Millisecond)
	svc.registryInFlight.Add(-1)
	svc.registryInFlight.Add(-1)

	select {
	case <-done:
		// good — drained cleanly
	case <-time.After(500 * time.Millisecond):
		if r := <-result; !r {
			t.Fatal("DrainRegistryInFlight returned false unexpectedly")
		} else {
			t.Fatal("DrainRegistryInFlight did not return in time")
		}
	}
}

@bnema bnema closed this Mar 1, 2026
@bnema bnema reopened this Mar 1, 2026
@bnema bnema merged commit a363ed1 into main Mar 1, 2026
5 checks passed
@bnema bnema deleted the fix/registry-shutdown-resilience branch March 1, 2026 14:02