fix: registry push resilience during Gordon restart #90
Prevents 503 errors on manifest push during Gordon restart. Previously `registrySrv` was shut down first, leaving the proxy alive and forwarding registry traffic to a dying backend. Now the TLS and proxy frontends are stopped first, then the registry backend drains.

Adds a `registryInFlight atomic.Int64` counter to `proxy.Service` and instruments `proxyToRegistry` to increment and decrement it. A new `DrainRegistryInFlight` method lets `gracefulShutdown` wait (with a 25s timeout) for active manifest pushes to complete before stopping the registry backend.
`extractContainerNameFromAttachmentFile` preserves the `gordon-` prefix from the filename, but the test was calling `GetAllAttachment` with the bare `containerName` (without the prefix), so the lookup returned empty. Fixed by deriving `storedContainerName` from the filename, matching what `migrateAttachmentEnvFile` actually stores.
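The test fix can be illustrated with a tiny helper. This is a sketch only: the real logic lives in `extractContainerNameFromAttachmentFile` in `internal/app`, and the `gordon-<name>.env` filename shape is an assumption. The point is that the stored key keeps the `gordon-` prefix, so the test must derive its lookup key from the filename rather than from the bare container name.

```go
// Hypothetical helper showing why the lookup key must keep the "gordon-"
// prefix: only the directory and extension are stripped from the filename.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

func storedContainerNameFromFile(path string) string {
	base := filepath.Base(path)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	name := storedContainerNameFromFile("/data/env/gordon-gitea-postgres-1700000000.env")
	fmt.Println(name) // prints gordon-gitea-postgres-1700000000
}
```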
📝 Walkthrough

These changes implement graceful shutdown coordination by introducing in-flight request tracking to the proxy service.
Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Application
    participant TLS as TLS/Proxy<br/>Server
    participant Proxy as Proxy<br/>Service
    participant Registry as Registry<br/>Server
    App->>App: Receive shutdown signal
    Note over App,Registry: Phase 1: Stop frontends
    App->>TLS: Shutdown TLS/Proxy servers
    TLS-->>App: Shutdown complete
    Note over App,Registry: Phase 2: Drain in-flight
    App->>Proxy: DrainRegistryInFlight(timeout)
    Proxy->>Proxy: Poll registryInFlight counter
    Proxy-->>App: Drain complete or timeout
    Note over App,Registry: Phase 3: Shutdown registry
    App->>Registry: Shutdown registry server
    Registry-->>App: Shutdown complete
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/usecase/proxy/service_test.go`:
- Around line 535-562: In TestDrainRegistryInFlight the goroutine leaves the
done channel open on failure causing a long timeout; change it to signal
failures immediately by using a result channel (e.g., result := make(chan bool))
or by calling t.Errorf from inside the goroutine, then always send a
boolean/close the result channel when DrainRegistryInFlight returns; update the
select to check the result and call t.Fatalf/t.Errorf on false so failures in
the goroutine surface immediately — adjust the TestDrainRegistryInFlight
function and the anonymous goroutine that calls svc.DrainRegistryInFlight
(referenced symbols: TestDrainRegistryInFlight, DrainRegistryInFlight, svc,
registryInFlight).
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)

- internal/app/migrate_env_test.go
- internal/app/run.go
- internal/usecase/proxy/service.go
- internal/usecase/proxy/service_test.go
```go
func TestDrainRegistryInFlight(t *testing.T) {
	svc := &Service{
		inFlight: make(map[string]int),
	}

	svc.registryInFlight.Add(2)

	done := make(chan struct{})
	go func() {
		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
		if !drained {
			// Signal failure via done channel by leaving it open — test will time out
			return
		}
		close(done)
	}()

	time.Sleep(5 * time.Millisecond)
	svc.registryInFlight.Add(-1)
	svc.registryInFlight.Add(-1)

	select {
	case <-done:
		// good — drained cleanly
	case <-time.After(500 * time.Millisecond):
		t.Fatal("DrainRegistryInFlight did not return true after requests completed")
	}
}
```
🧹 Nitpick | 🔵 Trivial
Consider using t.Errorf in goroutine for faster failure feedback.
The current approach leaves the done channel open on failure, causing the test to wait for the full 500ms timeout. Using a result channel or t.Errorf would provide faster feedback on failures.
♻️ Optional: faster failure signaling
```diff
 done := make(chan struct{})
+result := make(chan bool, 1)
 go func() {
 	drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
-	if !drained {
-		// Signal failure via done channel by leaving it open — test will time out
-		return
-	}
-	close(done)
+	result <- drained
+	if drained {
+		close(done)
+	}
 }()
 time.Sleep(5 * time.Millisecond)
 svc.registryInFlight.Add(-1)
 svc.registryInFlight.Add(-1)
 select {
 case <-done:
 	// good — drained cleanly
 case <-time.After(500 * time.Millisecond):
-	t.Fatal("DrainRegistryInFlight did not return true after requests completed")
+	if r := <-result; !r {
+		t.Fatal("DrainRegistryInFlight returned false unexpectedly")
+	} else {
+		t.Fatal("DrainRegistryInFlight did not return in time")
+	}
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func TestDrainRegistryInFlight(t *testing.T) {
	svc := &Service{
		inFlight: make(map[string]int),
	}
	svc.registryInFlight.Add(2)
	done := make(chan struct{})
	result := make(chan bool, 1)
	go func() {
		drained := svc.DrainRegistryInFlight(50 * time.Millisecond)
		result <- drained
		if drained {
			close(done)
		}
	}()
	time.Sleep(5 * time.Millisecond)
	svc.registryInFlight.Add(-1)
	svc.registryInFlight.Add(-1)
	select {
	case <-done:
		// good — drained cleanly
	case <-time.After(500 * time.Millisecond):
		if r := <-result; !r {
			t.Fatal("DrainRegistryInFlight returned false unexpectedly")
		} else {
			t.Fatal("DrainRegistryInFlight did not return in time")
		}
	}
}
```
## Problem

When `gordon push` was running and Gordon restarted due to a config file change, the final manifest push failed with `503 Service Unavailable`. All 12 image layers had uploaded successfully, but the manifest — the last step that commits the image to the registry — was lost because the registry backend was shutting down while the proxy frontend was still alive and forwarding traffic.

Two root causes identified from production logs:

1. **Wrong shutdown order** — `gracefulShutdown` stopped `registrySrv` first, then `proxySrv`/`tlsSrv`. While the backend was draining, the proxy continued forwarding registry traffic to a dead backend → 503.
2. **No in-flight tracking for registry requests** — unlike container proxy requests (which have `trackInFlight`), `proxyToRegistry` had no drain mechanism. Shutdown had no way to wait for active manifest pushes to complete.

## Changes

**internal/app/run.go**
- Shutdown order is now `tlsSrv → proxySrv → registrySrv` (frontends before backend)
- `gracefulShutdown` now accepts `proxySvc *proxy.Service` to call `DrainRegistryInFlight`

**internal/usecase/proxy/service.go**
- Adds a `registryInFlight atomic.Int64` field to `Service`
- Instruments `proxyToRegistry` with `Add(1)` / `defer Add(-1)`
- `DrainRegistryInFlight(timeout time.Duration) bool` — polls until the counter is 0 or the timeout elapses, returns `true` on a clean drain
- `RegistryInFlight() int64` for observability (used in the timeout warning log)

**internal/usecase/proxy/service_test.go**
- `TestRegistryInFlightTracking` — verifies the counter increments/decrements correctly
- `TestDrainRegistryInFlight` — verifies drain returns `true` after requests complete
- `TestDrainRegistryInFlightTimeout` — verifies drain returns `false` on timeout

**internal/app/migrate_env_test.go** (boyscout)
- `TestMigrateAttachmentEnvFile`: the test was calling `GetAllAttachment` with `"gitea-postgres-<ts>"` but `extractContainerNameFromAttachmentFile` preserves the `"gordon-"` prefix, so the lookup always returned empty. Now derives `storedContainerName` from the filename to match what production code actually stores.
## Result

- Manifest pushes during a Gordon restart now drain before the registry backend stops, instead of failing with 503
- `TestMigrateAttachmentEnvFile` passes (it was failing)