
fix: registry push resilience during Gordon restart#90

Merged
bnema merged 4 commits into main from fix/registry-shutdown-resilience
Mar 1, 2026

Conversation

@bnema
Owner

@bnema bnema commented Mar 1, 2026

Problem

When gordon push was running and Gordon restarted due to a config file change, the final manifest push failed with 503 Service Unavailable. All 12 image layers had uploaded successfully, but the manifest — the last step that commits the image to the registry — was lost because the registry backend was shutting down while the proxy frontend was still alive and forwarding traffic.

Two root causes identified from production logs:

1. Wrong shutdown order: gracefulShutdown stopped registrySrv first, then proxySrv/tlsSrv. While the backend was draining, the proxy continued forwarding registry traffic to a dead backend → 503.

2. No in-flight tracking for registry requests — unlike container proxy requests (which have trackInFlight), proxyToRegistry had no drain mechanism. Shutdown had no way to wait for active manifest pushes to complete.

Changes

internal/app/run.go

  • Reversed shutdown order: tlsSrv → proxySrv → registrySrv (frontends before backend)
  • Split the single shutdown loop into 3 explicit phases with a drain step between stopping frontends and stopping the registry backend
  • gracefulShutdown now accepts proxySvc *proxy.Service to call DrainRegistryInFlight
  • Added explanatory comment documenting why order matters

internal/usecase/proxy/service.go

  • Added registryInFlight atomic.Int64 field to Service
  • Instrumented proxyToRegistry with Add(1) / defer Add(-1)
  • Added DrainRegistryInFlight(timeout time.Duration) bool — polls until counter is 0 or timeout, returns true if clean drain
  • Added RegistryInFlight() int64 for observability (used in the timeout warning log)

internal/usecase/proxy/service_test.go

  • TestRegistryInFlightTracking — verifies counter increments/decrements correctly
  • TestDrainRegistryInFlight — verifies drain returns true after requests complete
  • TestDrainRegistryInFlightTimeout — verifies drain returns false on timeout

internal/app/migrate_env_test.go (boyscout)

  • Fixed TestMigrateAttachmentEnvFile: test was calling GetAllAttachment with "gitea-postgres-<ts>" but extractContainerNameFromAttachmentFile preserves the "gordon-" prefix, so the lookup always returned empty. Now derives storedContainerName from the filename to match what production code actually stores.
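
The fix amounts to deriving the stored name from the env file name rather than rebuilding it. A hedged sketch of the idea; the path, the "gordon-" prefix, the ".env" extension, and the helper name are assumptions for illustration, not the actual test code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// storedNameFromEnvFile derives the container name the way the migration
// code stores it: the file's base name with the extension stripped and the
// "gordon-" prefix kept intact. Hypothetical helper for illustration.
func storedNameFromEnvFile(envFileName string) string {
	base := filepath.Base(envFileName)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	// The prefix survives, so lookups against stored attachments match.
	fmt.Println(storedNameFromEnvFile("/data/env/gordon-gitea-postgres-1709290000.env"))
}
```

Deriving the lookup key from the same source the production code uses keeps the test from silently asserting against an empty result.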

Result

  • All tests pass (previously TestMigrateAttachmentEnvFile was failing)
  • On restart, Gordon now stops accepting new registry traffic first, waits up to 25s for in-flight pushes to complete, then stops the registry backend — manifest pushes survive a restart

Summary by CodeRabbit

Release Notes

  • New Features
    • Enhanced graceful shutdown mechanism to properly drain in-flight requests before service termination, improving reliability during application restarts and updates.

bnema added 4 commits March 1, 2026 12:03

  • Prevents 503 on manifest push during Gordon restart. Previously registrySrv was shut down first, leaving the proxy alive and forwarding registry traffic to a dying backend. Now TLS and proxy frontends are stopped first, then the registry backend drains.

  • Adds registryInFlight atomic.Int64 counter to proxy.Service and instruments proxyToRegistry to increment/decrement it. Adds DrainRegistryInFlight method for use during shutdown. This allows gracefulShutdown to wait for active manifest pushes before stopping the registry backend.

  • gracefulShutdown now stops TLS and proxy frontends first, then calls DrainRegistryInFlight (25s timeout) to wait for active manifest pushes to complete before stopping the registry backend. Fixes 503 errors on gordon push during Gordon restarts.

  • extractContainerNameFromAttachmentFile preserves the 'gordon-' prefix from the filename, but the test was calling GetAllAttachment with the bare containerName (without prefix), so the lookup returned empty. Fix by deriving storedContainerName from the filename, matching what migrateAttachmentEnvFile actually stores in pass.
@coderabbitai

coderabbitai bot commented Mar 1, 2026

📝 Walkthrough

Walkthrough

These changes implement graceful shutdown coordination by introducing in-flight request tracking to the proxy service. The gracefulShutdown function is extended to accept a proxy service and orchestrate a three-phase shutdown sequence: stopping TLS/proxy frontends, draining in-flight registry requests, and shutting down the registry server.

Changes

Graceful Shutdown Orchestration (internal/app/run.go)
Extended gracefulShutdown function signature to accept proxySvc *proxy.Service parameter. Updated runServers to pass svc.proxySvc when calling gracefulShutdown. Shutdown sequence expanded to three phases: Phase 1 stops TLS/proxy frontends, Phase 2 drains in-flight registry requests via proxy service, Phase 3 shuts down registry server.

In-Flight Request Tracking (internal/usecase/proxy/service.go, internal/usecase/proxy/service_test.go)
Added registryInFlight atomic counter field to Service struct. Introduced RegistryInFlight() method to read current in-flight count and DrainRegistryInFlight(timeout time.Duration) method to block until requests drain or timeout. Instrumented proxyToRegistry to increment/decrement counter around request handling. Test suite added with TestRegistryInFlightTracking, TestDrainRegistryInFlight, and TestDrainRegistryInFlightTimeout.

Container Name Handling (internal/app/migrate_env_test.go)
Updated TestMigrateAttachmentEnvFile to derive storedContainerName from envFileName base name. Cleanup and retrieval operations now use storedContainerName instead of containerName for consistency with storage operations.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant TLS as TLS/Proxy<br/>Server
    participant Proxy as Proxy<br/>Service
    participant Registry as Registry<br/>Server
    
    App->>App: Receive shutdown signal
    
    Note over App,Registry: Phase 1: Stop frontends
    App->>TLS: Shutdown TLS/Proxy servers
    TLS-->>App: Shutdown complete
    
    Note over App,Registry: Phase 2: Drain in-flight
    App->>Proxy: DrainRegistryInFlight(timeout)
    Proxy->>Proxy: Poll registryInFlight counter
    Proxy-->>App: Drain complete or timeout
    
    Note over App,Registry: Phase 3: Shutdown registry
    App->>Registry: Shutdown registry server
    Registry-->>App: Shutdown complete

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A shutdown graceful, oh so fine,
In-flight requests drain line by line,
Three phases choreographed with care,
No request left stranded in the air!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 28.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'fix: registry push resilience during Gordon restart' clearly summarizes the main change: improving resilience of registry push operations during application restart by fixing shutdown ordering and in-flight request tracking.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/usecase/proxy/service_test.go`:
- Around line 535-562: In TestDrainRegistryInFlight the goroutine leaves the
done channel open on failure causing a long timeout; change it to signal
failures immediately by using a result channel (e.g., result := make(chan bool))
or by calling t.Errorf from inside the goroutine, then always send a
boolean/close the result channel when DrainRegistryInFlight returns; update the
select to check the result and call t.Fatalf/t.Errorf on false so failures in
the goroutine surface immediately — adjust the TestDrainRegistryInFlight
function and the anonymous goroutine that calls svc.DrainRegistryInFlight
(referenced symbols: TestDrainRegistryInFlight, DrainRegistryInFlight, svc,
registryInFlight).

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8ea1b0 and 15ba724.

📒 Files selected for processing (4)
  • internal/app/migrate_env_test.go
  • internal/app/run.go
  • internal/usecase/proxy/service.go
  • internal/usecase/proxy/service_test.go

Comment on lines +535 to +562
func TestDrainRegistryInFlight(t *testing.T) {
	svc := &Service{
		inFlight: make(map[string]int),
	}

	svc.registryInFlight.Add(2)

	done := make(chan struct{})
	go func() {
		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
		if !drained {
			// Signal failure via done channel by leaving it open — test will time out
			return
		}
		close(done)
	}()

	time.Sleep(5 * time.Millisecond)
	svc.registryInFlight.Add(-1)
	svc.registryInFlight.Add(-1)

	select {
	case <-done:
		// good — drained cleanly
	case <-time.After(500 * time.Millisecond):
		t.Fatal("DrainRegistryInFlight did not return true after requests completed")
	}
}

🧹 Nitpick | 🔵 Trivial

Consider using t.Errorf in goroutine for faster failure feedback.

The current approach leaves the done channel open on failure, causing the test to wait for the full 500ms timeout. Using a result channel or t.Errorf would provide faster feedback on failures.

♻️ Optional: faster failure signaling
 	done := make(chan struct{})
+	result := make(chan bool, 1)
 	go func() {
 		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
-		if !drained {
-			// Signal failure via done channel by leaving it open — test will time out
-			return
-		}
-		close(done)
+		result <- drained
+		if drained {
+			close(done)
+		}
 	}()
 
 	time.Sleep(5 * time.Millisecond)
 	svc.registryInFlight.Add(-1)
 	svc.registryInFlight.Add(-1)
 
 	select {
 	case <-done:
 		// good — drained cleanly
 	case <-time.After(500 * time.Millisecond):
-		t.Fatal("DrainRegistryInFlight did not return true after requests completed")
+		if r := <-result; !r {
+			t.Fatal("DrainRegistryInFlight returned false unexpectedly")
+		} else {
+			t.Fatal("DrainRegistryInFlight did not return in time")
+		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

func TestDrainRegistryInFlight(t *testing.T) {
	svc := &Service{
		inFlight: make(map[string]int),
	}

	svc.registryInFlight.Add(2)

	done := make(chan struct{})
	result := make(chan bool, 1)
	go func() {
		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
		result <- drained
		if drained {
			close(done)
		}
	}()

	time.Sleep(5 * time.Millisecond)
	svc.registryInFlight.Add(-1)
	svc.registryInFlight.Add(-1)

	select {
	case <-done:
		// good — drained cleanly
	case <-time.After(500 * time.Millisecond):
		if r := <-result; !r {
			t.Fatal("DrainRegistryInFlight returned false unexpectedly")
		} else {
			t.Fatal("DrainRegistryInFlight did not return in time")
		}
	}
}

@bnema bnema closed this Mar 1, 2026
@bnema bnema reopened this Mar 1, 2026
@bnema bnema merged commit a363ed1 into main Mar 1, 2026
5 checks passed
@bnema bnema deleted the fix/registry-shutdown-resilience branch March 1, 2026 14:02