
fix: recovery paths for PodSucceeded and PodFailed GameServer lifecycle #4480

Open
markmandel wants to merge 1 commit into agones-dev:main from markmandel:flake/TestGameServerPodCompletedAfterCleanExit

@markmandel
Collaborator

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking

/kind bug

/kind cleanup
/kind documentation
/kind feature
/kind hotfix
/kind release

What this PR does / Why we need it:

Both SucceededController and MissingPodController were purely event-driven with no fallback if a pod phase transition event was missed (e.g. during controller restart). This could cause GameServers to get stuck when their pod had already completed or failed and the Pod records weren't being garbage collected.

  • SucceededController: add GameServer informer UpdateFunc that re-checks pod phase on every ~30s resync, enqueuing any non-terminal GameServer whose pod is already in PodSucceeded state (mirrors the recovery pattern used by MissingPodController).

  • MissingPodController: extend the GS resync condition to also enqueue when the pod is in PodFailed state. HealthController already owns the primary (event-driven) path for PodFailed; this adds the recovery path for when that event is missed. Updates syncGameServer to pass through for failed pods and emit a distinct "Pod has failed" log and event.
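
For reviewers who want the two recovery conditions in one place, here is a minimal, self-contained sketch of the decision logic described above. It is not the code in this PR: `GameServer`, its `Terminal` flag, `PodLookup`, and `resync` are hypothetical stand-ins for the real Agones types, the pod lister, and the informer resync callback, and the phase constants only mirror the string values of `corev1.PodPhase`. The assumption that both checks skip terminal GameServers is mine; the real controllers may filter differently.

```go
package main

// PodPhase mirrors the string values of corev1.PodPhase (k8s.io/api/core/v1).
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// GameServer is a hypothetical, trimmed-down stand-in for the Agones type.
type GameServer struct {
	Name string
	// Terminal is an assumption: true once the GameServer is already
	// shutting down, marked unhealthy, or being deleted.
	Terminal bool
}

// PodLookup is a hypothetical stand-in for a pod lister. It returns the
// pod's phase and whether a pod record exists for the GameServer.
type PodLookup func(name string) (PodPhase, bool)

// needsSucceededRecovery reports whether a resync should enqueue gs because
// its pod has already reached PodSucceeded and the event may have been missed.
func needsSucceededRecovery(gs GameServer, pods PodLookup) bool {
	if gs.Terminal {
		return false
	}
	phase, found := pods(gs.Name)
	return found && phase == PodSucceeded
}

// needsMissingPodRecovery reports whether a resync should enqueue gs because
// its pod is gone or is sitting in PodFailed.
func needsMissingPodRecovery(gs GameServer, pods PodLookup) bool {
	if gs.Terminal {
		return false
	}
	phase, found := pods(gs.Name)
	return !found || phase == PodFailed
}

// resync simulates one periodic informer resync: every known GameServer is
// re-examined against current pod state, independent of any pod event.
func resync(servers []GameServer, pods PodLookup, enqueue func(name string)) {
	for _, gs := range servers {
		if needsSucceededRecovery(gs, pods) || needsMissingPodRecovery(gs, pods) {
			enqueue(gs.Name)
		}
	}
}
```

The point of the sketch is that the check depends only on current pod state, so a GameServer whose phase transition event was dropped is still picked up on the next resync.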

Which issue(s) this PR fixes:

Fixes 🤞🏻 TestGameServerPodCompletedAfterCleanExit

Special notes for your reviewer:

N/A

Both SucceededController and MissingPodController were purely
event-driven with no fallback if a pod phase transition event was missed
(e.g. during controller restart). This could cause GameServers to get
stuck when their pod had already completed or failed and the Pod records
weren't being garbage collected.

- SucceededController: add GameServer informer UpdateFunc that re-checks
  pod phase on every ~30s resync, enqueuing any non-terminal GameServer
  whose pod is already in PodSucceeded state (mirrors the recovery
  pattern used by MissingPodController).

- MissingPodController: extend the GS resync condition to also enqueue
  when the pod is in PodFailed state. HealthController already owns the
  primary (event-driven) path for PodFailed; this adds the recovery path
  for when that event is missed. Updates syncGameServer to pass through
  for failed pods and emit a distinct "Pod has failed" log and event.

Signed-off-by: Mark Mandel <mark@compoundtheory.com>
@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: e6916a7e-9c75-45a9-92fa-fac87b8434e3

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4480/head:pr_4480 && git checkout pr_4480
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.57.0-dev-67f2f3e


Labels

kind/bug These are bugs. size/M
