feat: improve status on team operator resources#94
Conversation
|
Claude finished @ian-flores's task —— View job Review: feat: improve status on team operator resourcesOverall this is a well-structured change that follows Kubernetes conventions for conditions-based status reporting. The Critical:
|
Tested on ganso01-stagingDeployed adhoc image
|
|
Nice! It looks like the comment |
…ut data loss Extend the enable/disable/teardown pattern from Connect (PR #93) to Workbench, Package Manager, and Chronicle. - Add spec.{workbench,packageManager,chronicle}.enabled and .teardown fields to the Site CRD - Add Suspended field to Workbench, PackageManager, and Chronicle CRDs - Site controller uses three-way branching: enabled → reconcile, disabled → suspend (preserve data), teardown → delete CR - Product controllers skip serving resources when Suspended=true - Network policies cleaned up when products are disabled - Fix Workbench cleanupDeployedService (was a no-op) - Fix Chronicle CleanupChronicle (was a TODO no-op) - Add 9 tests (3 per product: disable-never-enabled, suspend, teardown) Closes #95
- Remove unused getChronicle test helper - Correct "CRD" to "CR" in cleanup function godoc comments - Fix double blank line in product-team-site-management.md
|
Let me wait to finish #99 and I'll address comprehensively for all |
# Conflicts: # internal/controller/core/connect.go
…into improve-status-fields # Conflicts: # internal/controller/core/chronicle_controller.go # internal/controller/core/package_manager.go # internal/controller/core/workbench.go
A disabled product (enabled: false) has no child CR, which caused aggregateChildStatus to set *Ready = false and block the site from reaching Ready. Check Enabled first for each product and treat disabled products as ready so they don't factor into the aggregate. Adds TestSiteReadyWithDisabledProducts to cover this case.
Changes: - Add `Enabled` check for Flightdeck in `aggregateChildStatus` (mirrors the existing Chronicle/Connect/Workbench/PackageManager pattern) so a disabled Flightdeck is treated as ready and doesn't permanently block `allReady` - Change `aggregateChildStatus` signature to return `error`; on non-`IsNotFound` API errors, log and return the error instead of silently setting the ready flag to false - Update `Reconcile` to capture the returned error from `aggregateChildStatus` and propagate it as the return value after status patch, triggering reconciler requeue with backoff on transient API/RBAC failures
Build succeeds. The test failures are pre-existing environment issues (kubebuilder's etcd binary not present in this sandbox), confirmed by the same failures occurring before my changes. Changes: - Fix `allReady` permanently false for Sites with optional Chronicle/Flightdeck: change guard condition from `Enabled != nil && !*Enabled` (only explicit false) to `Enabled == nil || !*Enabled` (nil or false) so these optional components don't block site readiness unless explicitly opted in via `Enabled: true` - Normalize `jsonPath` filter quoting from single to double quotes in Helm chart CRDs for connects, postgresdatabases, and sites - Normalize `jsonPath` filter quoting from single to double quotes in base CRDs for connects, postgresdatabases, and sites
All changes are complete. Build passes and tests pass (the workbench test failure is a pre-existing infrastructure issue - missing kubebuilder etcd binary). Changes: - Set Ready=False and Progressing=False with reason "Suspended" when Workbench, PackageManager, or Chronicle are suspended, so status accurately reflects the suspended state - Added `ReasonSuspended` constant to the status package - Added documentation comment to `aggregateChildStatus` explaining the two product tiers: required (Connect, Workbench, PackageManager) vs optional (Chronicle, Flightdeck) and their different "missing CR" readiness semantics - Improved `PatchErrorStatus` comment to clarify its best-effort semantics and behavior on patch failure - Updated suspended tests for Chronicle and PackageManager to assert the new status conditions
Changes: - Extract `PatchSuspendedStatus` helper in `internal/status/status.go` to deduplicate the 12-line suspend-status block from all three controllers - Move `patchBase` snapshot (`client.MergeFrom(obj.DeepCopy())`) to after `suspendDeployedService` returns, so any hypothetical status mutations from that method are excluded from the diff - Update Chronicle, PackageManager, and Workbench controllers to use the new shared helper - Add suspended status condition assertions to `TestWorkbenchReconciler_Suspended` to match Chronicle and PackageManager test coverage
Build passes and status tests pass. The controller test failures are pre-existing infrastructure issues (missing etcd/kubebuilder binaries), not related to my changes. Changes: - Move `ObservedGeneration` assignment into `PatchSuspendedStatus` helper via a new `observedGeneration *int64` parameter, completing the deduplication across Chronicle, PackageManager, and Workbench controllers
Changes: - Guard Flightdeck reconciliation with `isProductEnabled` check in `reconcileResources`, adding `disableFlightdeck` and `cleanupFlightdeck` methods to match the pattern used by other products - Guard Flightdeck network policy with enabled check, adding `cleanupFlightdeckNetworkPolicies` for cleanup when disabled - Fix `aggregateChildStatus` comment and logic for Chronicle: it's enabled by default (like Connect/Workbench/PM), not optional — changed to use `Enabled != nil && !*Enabled` for missing-CR readiness - Return `PatchSuspendedStatus` errors in Chronicle, Workbench, and PackageManager controllers to ensure consistent requeue behavior - Remove duplicate `.gitignore` entry for `.claude/tsc-cache/`
Clean. Build passes, imports are correct. Changes: - Simplified `disableFlightdeck` to delegate directly to `BasicDelete`, removing the redundant GET-then-DELETE pattern (TOCTOU race) since `BasicDelete` already handles NotFound gracefully - Removed `cleanupFlightdeck` wrapper method as it only added indirection around a single `BasicDelete` call with one caller - Removed unused `apierrors` import
All tests pass. The only failures are integration tests requiring `etcd` which isn't available in this environment. Changes: - Add `PatchSuspendedStatus` call to Connect's suspend path, matching Workbench/PackageManager/Chronicle pattern so suspended Connect reports `Ready=False/Reason=Suspended` - Fix Flightdeck `aggregateChildStatus` logic to treat it as default-enabled (matching `isProductEnabled` behavior), preventing false-ready when CR is missing but expected - Move `patchBase` capture before `suspendDeployedService` in Chronicle, PackageManager, and Workbench for defensive correctness against future mutations
The test failures are all due to missing `etcd` binary (`/usr/local/kubebuilder/bin/etcd: no such file or directory`) — these are integration tests that require a local Kubernetes control plane and are unrelated to my changes. The build succeeds and the unit tests pass. Changes: - Moved `patchBase` capture back to after `suspendDeployedService` in Chronicle, PackageManager, and Workbench controllers so the patch base captures post-suspension state, producing a minimal status-only patch (matches Connect's ordering)
All tests pass (the only failure is `TestSiteReconcileWithExperimental` which is pre-existing on the base commit). Build succeeds. Changes: - Fix test `TestSiteOptionalComponentsNilEnabledNoCR` expectations: when `Enabled=nil` and no CR exists, products are not ready (CR is expected but missing), matching the `aggregateChildStatus` implementation - Add Chronicle CR deletion to `cleanupResources` for consistent explicit cleanup of all products during Site deletion - Use `isProductEnabled` helper for Connect's enabled check instead of inline expression, matching the pattern used by other products
Build succeeds and all relevant tests pass. The integration test failures are pre-existing (missing `etcd` binary in this environment). Changes: - Log `aggregateErr` when non-nil before returning `reconcileErr` in Site reconciler, so aggregate status errors are not silently dropped - Remove unnecessary `if pm.Spec.Image != ""` guard around `ExtractVersion` in PackageManager for consistency with Connect, Workbench, Chronicle, and Flightdeck controllers
Build and tests pass. `ExtractVersion("")` correctly returns `""`.
Changes:
- Add test case for `ExtractVersion` with empty string input to confirm it handles empty input gracefully (as flagged by review)
All changes are complete. Build passes, `go vet` passes, and all non-envtest tests pass (envtest failures are pre-existing due to missing etcd binary). Changes: - Fix `aggregateChildStatus` to treat disabled products as ready even when CR exists (e.g. suspended), preventing Site from being stuck as not-Ready - Move `DeepCopy` before `suspendDeployedService` call in Connect, Workbench, PackageManager, and Chronicle controllers for safer status patch bases - Extract duplicated deployment/statefulset health check into `status.SetDeploymentHealth` and `status.SetStatefulSetHealth` helpers, reducing ~25 lines of copy-paste per controller - Add test for `aggregateChildStatus` with disabled products that have existing CRs (issue #7) - Add test for Flightdeck disable/re-enable cycle verifying CR deletion and recreation (issue #6)
All relevant tests pass. Build succeeds. The pre-existing test failures (etcd not found, nil client panics) are infrastructure issues unrelated to these changes. Changes: - Add clarifying comments on `patchBase` capture before `suspendDeployedService` in connect, workbench, package_manager, and chronicle controllers to document intent per review finding
Changes: - Call `PatchErrorStatus` before returning when `suspendDeployedService` fails in all four product controllers (chronicle, connect, workbench, package_manager) so suspension failures are reflected in status conditions - Simplify unnecessary return wrapping in `cleanupChronicle` and `cleanupWorkbench` to directly return the result of `BasicDelete`
No test files reference `PatchErrorStatus` directly. The build compiles cleanly, and test failures are pre-existing infrastructure issues (missing etcd). Changes: - `PatchErrorStatus` now returns `error` instead of silently discarding the patch failure - All four suspend-failure paths (chronicle, connect, package_manager, workbench) now log `PatchErrorStatus` failures via `l.Error`, consistent with the existing `PatchSuspendedStatus` error handling pattern
Build is clean. Here's a summary of the changes: Changes: - `PatchSuspendedStatus` now clears the `Version` field when a product is suspended, preventing stale version data from appearing in status while the product is not running - Updated all four callers (Connect, Workbench, PackageManager, Chronicle) to pass `&status.Version` to the updated `PatchSuspendedStatus` signature - Added unit tests for `SetDeploymentHealth` and `SetStatefulSetHealth` in the status package
All 18 tests pass, including the new `TestPatchSuspendedStatus`. Changes: - Add unit test for `PatchSuspendedStatus` verifying version is cleared, ready is set to false, observedGeneration is updated, and conditions are set with `ReasonSuspended` - Add `require.NotNil` guards before dereferencing `findCondition` results in `TestSetDeploymentHealth` and `TestSetStatefulSetHealth` to produce clear assertion failures instead of nil pointer panics
Build and tests pass. Changes: - Added `patchCalled` field to `fakeStatusWriter` to track whether `Patch` was invoked - Changed `fakeStatusWriter` methods to pointer receivers so the field can be set - Added `assert.True(t, sw.patchCalled)` in `TestPatchSuspendedStatus` to verify the status patch is actually called
All failures are `fork/exec /usr/local/kubebuilder/bin/etcd: no such file or directory` — these are envtest tests that require a local Kubernetes control plane binary. This is a sandbox environment limitation, not caused by my changes. The `internal/status` tests and all other unit tests pass fine. Changes: - Log `PatchErrorStatus` errors consistently across all controllers (connect, workbench, chronicle, package_manager, flightdeck) — previously the return value was silently ignored in non-suspended error paths - Simplify `cleanupPackageManager` to return `BasicDelete` directly instead of redundant if/return pattern
All test failures are due to the missing `etcd` binary (envtest dependency), not related to my changes. The unit tests that don't require a control plane all pass. The build compiles cleanly. Changes: - Add `isProductDisabled` helper to complement existing `isProductEnabled`, replacing 10 inline `Enabled != nil && !*Enabled` checks in `aggregateChildStatus` for improved readability - Add comment on `InternalFlightdeckSpec` documenting the intentional design choice that Flightdeck has no `Teardown` field because it is stateless
All clean. Here's a summary of the changes: Changes: - `aggregateChildStatus` now checks `IsSuspended()` instead of unconditionally marking disabled-but-existing products as ready, preventing false Ready status during teardown - `PatchErrorStatus` truncates error messages to 256 characters to avoid leaking sensitive internal details in status conditions - Added `IsSuspended()` helper to the status package that checks for the `Suspended` reason on the Ready condition - Added `TestConnectReconciler_Suspended` test for Connect suspension parity with Workbench, PackageManager, and Chronicle - Added tests for `IsSuspended` and error message truncation in the status package
Build and tests pass. Let me also quickly verify the medium finding — the transient `IsSuspended` returning `false` behavior. The reviewer flagged it as "arguably correct" and asked to confirm it's acceptable. Looking at the logic: when a product is disabled but not yet suspended, the site will temporarily show that product as not ready. This is the safer/correct behavior — don't claim ready until suspension is confirmed. No code change needed for that finding. Changes: - Make `truncateMessage` UTF-8 aware to avoid splitting multi-byte runes when truncating condition messages - Use `strings.Repeat` instead of loop for building long test string
Build and tests pass. Changes: - Replace `utf8.Valid([]byte(truncated))` with `utf8.ValidString(truncated)` to avoid unnecessary string-to-byte-slice allocation on each loop iteration
All tests pass. Here's a summary of the changes: Changes: - Truncate error messages in Site controller using `status.TruncateMessage()` to prevent leaking verbose internal details in status conditions - Extract `DesiredReplicas()` helper into `internal/status` package, replacing duplicated nil-check logic across 5 product controllers (connect, workbench, package_manager, chronicle, flightdeck) - Remove unused `logr.Logger` parameter from `aggregateChildStatus` method - Harden `IsSuspended()` to also check `Status == ConditionFalse`, not just the reason string - Export `TruncateMessage()` from status package for use by the Site controller - Update test to set suspended conditions on CRs to match the stricter `IsSuspended` check
Everything looks clean. The build succeeds and status tests pass. The controller test failures are pre-existing infrastructure issues (missing etcd). The first finding about negative replicas is noted as "not a regression" and matches previous inline behavior. Adding validation there would be scope creep — Kubernetes admission controllers handle this, and it's explicitly called out as not a regression. Changes: - Removed trivial `TruncateMessage` wrapper by exporting `truncateMessage` directly (renamed to `TruncateMessage`)
All relevant tests pass. The envtest failure (`TestSiteReconcileWithExperimental`) is pre-existing infrastructure (missing etcd binary) and unrelated to these changes. Changes: - Truncate PostgresDatabase error messages in status conditions using `status.TruncateMessage()` to prevent leaking verbose internal details - Fix Flightdeck disabled+CR-exists race condition in `aggregateChildStatus` to unconditionally report ready (since disable = delete for stateless products) - Add `TestTruncateMessage` with multi-byte UTF-8 boundary test to validate rune-safe truncation logic
Build and tests pass. Changes: - Expanded comment on `FlightdeckReady = true` (site_controller.go:633) to clarify that stuck-delete scenarios surface as reconcile errors from `disableFlightdeck`, not from the status field, addressing the reviewer's concern about masked failures
# Conflicts: # client-go/applyconfiguration/core/v1beta1/internalpackagemanagerspec.go # internal/controller/core/chronicle_controller.go # internal/controller/core/package_manager.go # internal/controller/core/site_controller.go # internal/controller/core/site_controller_chronicle.go # internal/controller/core/site_controller_networkpolicies.go # internal/controller/core/site_controller_package_manager.go # internal/controller/core/site_controller_workbench.go # internal/controller/core/site_test.go # internal/controller/core/workbench.go # internal/controller/core/workbench_test.go
|
@timtalbot now it is ready! 😄 |
Summary
CommonProductStatustype withConditions,ObservedGeneration, andVersionfields, embedded in all status structs including PostgresDatabaseinternal/statushelper package withSetReady(),SetProgressing(),IsReady(),ExtractVersion()and reason constants (ReasonReconcileComplete,ReasonDeploymentReady/NotReady,ReasonStatefulSetReady/NotReady,ReasonAllComponentsReady,ReasonComponentsNotReady,ReasonDatabaseReady)Progressingcondition correctly reflects rollout state:Trueduring active rollouts,FalsewithReasonReconcileCompletewhen doneConnectReady,WorkbenchReady, etc.) andOwns()watches on child CRsReady,Version, andAgeprintcolumns to all CRDs forkubectl getoutputStatus().Update()toStatus().Patch()withMergeFromto avoid conflictsReadybool field as+optionalfor patch compatibility; value now derived from theReadyconditionWithStatusSubresourcefor all CRD typesExtractVersionhandles registry ports (e.g.,localhost:5000/image:v1) and digest references (image:tag@sha256:...)Closes #25
Test plan
just buildpassesjust mgenerateregenerates CRDs and deepcopy without errorsjust helm-generatekeeps Helm chart in syncgo fmtproduces no changeskubectl get connectsshows READY/VERSION/AGE columns