missing resources after space retargeting #707
rsoaresd merged 5 commits into codeready-toolchain:master from
Conversation
Walkthrough
Cluster resources are now deleted before namespace deletions in deleteNSTemplateSet; a new statusManager method clearStatusClusterResources was added to nil out NSTemplateSet.Status.ClusterResources and persist the change; tests were added to cover successful clearing and status-update failures.
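The clearStatusClusterResources helper described in the walkthrough can be sketched roughly as below. This is a simplified stand-in, not the real controller code: the actual method persists the status through the controller-runtime client, which is modeled here as a plain update callback, and the struct fields beyond those named in the walkthrough are assumptions.

```go
package main

import "fmt"

// Simplified stand-ins for the real NSTemplateSet types.
type ClusterResources struct{ TemplateRef string }

type NSTemplateSetStatus struct{ ClusterResources *ClusterResources }

type NSTemplateSet struct{ Status NSTemplateSetStatus }

// clearStatusClusterResources nils out Status.ClusterResources and "persists"
// the change via the supplied update function, skipping the update entirely
// when the field is already nil.
func clearStatusClusterResources(nsTmplSet *NSTemplateSet, update func(*NSTemplateSet) error) error {
	if nsTmplSet.Status.ClusterResources == nil {
		return nil // nothing to clear - avoid a needless status update
	}
	nsTmplSet.Status.ClusterResources = nil
	return update(nsTmplSet)
}

func main() {
	updates := 0
	update := func(*NSTemplateSet) error { updates++; return nil }

	set := &NSTemplateSet{Status: NSTemplateSetStatus{ClusterResources: &ClusterResources{TemplateRef: "base-clusterresources-1"}}}
	_ = clearStatusClusterResources(set, update) // first call clears and persists
	_ = clearStatusClusterResources(set, update) // second call is a no-op
	fmt.Println(updates)                         // 1
}
```

The guard on the nil field is what keeps repeated reconciliations cheap: only the first call after deletion issues a status update.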
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller as deleteNSTemplateSet
    participant Cluster as clusterResources
    participant Status as statusManager
    participant K8s as Kubernetes API
    rect rgb(220,230,255)
        Note over Controller,Cluster: Step 1 — delete cluster-scoped resources
        Controller->>Cluster: delete()
        Cluster->>K8s: delete cluster resources
        K8s-->>Cluster: success / error
        Cluster-->>Controller: result
    end
    rect rgb(220,230,255)
        Note over Controller,Status: Step 2 — clear ClusterResources status
        Controller->>Status: clearStatusClusterResources()
        alt ClusterResources non-nil
            Status->>K8s: update status (set ClusterResources = nil)
            K8s-->>Status: ACK / error
            Status-->>Controller: nil / error
        else already nil
            Status-->>Controller: nil (no update)
        end
    end
    rect rgb(230,220,230)
        Note over Controller: Step 3 — proceed with namespace deletions
        Controller->>Controller: delete namespaces
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
+ Coverage   82.50%   82.53%   +0.03%
==========================================
  Files          48       48
  Lines        3589     3596       +7
==========================================
+ Hits         2961     2968       +7
  Misses        477      477
  Partials      151      151
```
This is quite an elegant solution to the problem of retargeting spaces between members within the same cluster. I'd just add more comments explaining what the owner label means and what it tries to solve. It's not very obvious when just looking at the code.
```go
toolchainv1alpha1.TemplateRefLabelKey: tierTemplate.templateRef,
toolchainv1alpha1.TierLabelKey:        tierTemplate.tierName,
toolchainv1alpha1.ProviderLabelKey:    toolchainv1alpha1.ProviderLabelValue,
toolchainv1alpha1.OwnerLabelKey:       nsTmplSet.Namespace,
```
It took me a good while to realize that nsTmplSet.Namespace is used here to distinguish the member operators in the cluster. Maybe this could be worth a comment.
But more importantly, I'm not sure this is correct.
The assumption that we make about the cluster resources managed by NSTemplateSet is that they are specific to the Space for which they have been deployed (which implies the ${SPACE_NAME} in their metadata.name, btw), and so this kind of conflict should never actually happen. If it does, then I think the bug is actually in the tests, not in this code.
The assumption is mentioned here: https://github.com/codeready-toolchain/member-operator/blob/master/controllers/nstemplateset/cluster_resources.go#L25
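To illustrate the assumption above: because the rendered object names embed the space name, two Spaces cannot collide on metadata.name. A toy substitution of the ${SPACE_NAME} template parameter (the resource name used here is made up for the example):

```go
package main

import (
	"fmt"
	"strings"
)

// render performs a toy substitution of the ${SPACE_NAME} parameter, the way
// an OpenShift-style template parameter is expanded into an object name.
func render(template, spaceName string) string {
	return strings.ReplaceAll(template, "${SPACE_NAME}", spaceName)
}

func main() {
	tmpl := "${SPACE_NAME}-tekton-view" // hypothetical ClusterRole name from a tier template
	fmt.Println(render(tmpl, "alice"))  // alice-tekton-view
	fmt.Println(render(tmpl, "bob"))    // bob-tekton-view
}
```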
Hmm... I'm stupid... I completely forgot this tries to solve a race condition while retargeting the same space 🤦🏼 I updated the review comment and crossed out my incorrect ideas in the line comment above.
So sorry!! I should have explained better. I will address better comments!
mfrancisc left a comment
Thanks for looking into it 🙏
I have one small comment related to the label used.
```go
toolchainv1alpha1.TemplateRefLabelKey: tierTemplate.templateRef,
toolchainv1alpha1.TierLabelKey:        tierTemplate.tierName,
toolchainv1alpha1.ProviderLabelKey:    toolchainv1alpha1.ProviderLabelValue,
toolchainv1alpha1.OwnerLabelKey:       nsTmplSet.Namespace,
```
Should we use a different label? OwnerLabelKey is usually used in a different way, by assigning it the value of the usersignup name.
Maybe we could create a new label like NamespaceOwnerLabelKey or ClusterOwnerLabelKey and pass the cluster name?
My bad!! I was trying to reuse some existing label and I thought OwnerLabelKey would fit! I agree with you! I will create a new one.
go.mod (outdated)

```
toolchain go1.23.12

replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
```
temporary until we merge codeready-toolchain/api#489
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
- controllers/nstemplateset/cluster_resources.go (3 hunks)
- go.mod (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: In the member-operator codebase, MEMBER_OPERATOR_NAMESPACE is only used in OpenShift templates (with ${MEMBER_OPERATOR_NAMESPACE} syntax), not in go templates (which would use {{.MEMBER_OPERATOR_NAMESPACE}} syntax). Go templates in TierTemplateRevision resources use other parameters like SPACE_NAME, NAMESPACE, CONFIG_VALUE, etc., but not MEMBER_OPERATOR_NAMESPACE.
Applied to files:
controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: The MEMBER_OPERATOR_NAMESPACE parameter is only used in OpenShift template processing, not in go templates in the member-operator codebase. Go templates in TierTemplateRevision resources do not reference this parameter.
Applied to files:
controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-11T08:28:57.736Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:143-149
Timestamp: 2025-09-11T08:28:57.736Z
Learning: The processGoTemplate function in nstemplatetier.go only processes Go templates from TierTemplateRevision resources (using {{...}} syntax), not OpenShift templates (using ${...} syntax). Files like member-sa.yaml are OpenShift templates and are not processed by this function.
Applied to files:
controllers/nstemplateset/cluster_resources.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: GolangCI Lint
- GitHub Check: Build & push operator bundles for e2e tests
- GitHub Check: Test with Coverage
```go
// Filter out cluster resources
objectsToDelete := []runtimeclient.Object{}
for _, obj := range oa.currentObjects {
	if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
		objectsToDelete = append(objectsToDelete, obj)
	}
}

if err := deleteObsoleteObjects(ctx, oa.r.Client, objectsToDelete, nil); err != nil {
	return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
```
Ownership filter still deletes foreign resources
oa.currentObjects are the templated stubs we rendered from Status.ClusterResources; they have no runtime labels. Because of that, obj.GetLabels()[ClusterOwnerLabelKey] is empty, the migration path returns true, and we still pass every object to deleteObsoleteObjects. In the race you’re fixing, the real cluster object has already been relabeled by the other member namespace, but we never fetch those live labels here, so we still delete their resource.
We have to Get the actual cluster object before running shouldDeleteClusterResource, and only enqueue deletions for objects whose live labels match our namespace. Propagating the Get error keeps status handling consistent with the rest of the controller.
Apply this diff so the ownership check inspects the real object:
```diff
-// Filter out cluster resources
-objectsToDelete := []runtimeclient.Object{}
-for _, obj := range oa.currentObjects {
-	if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
-		objectsToDelete = append(objectsToDelete, obj)
-	}
-}
+// Filter out cluster resources
+objectsToDelete := []runtimeclient.Object{}
+for _, obj := range oa.currentObjects {
+	if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
+		if errors.IsNotFound(err) {
+			continue
+		}
+		return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
+	}
+	if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
+		objectsToDelete = append(objectsToDelete, obj)
+	}
+}
```
go.mod (outdated)

```
toolchain go1.23.12

replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
```
There was a problem hiding this comment.
Drop the personal fork replace directive
The new replace github.com/codeready-toolchain/api => github.com/rsoaresd/api ... forces everyone to build against a personal fork. That breaks our supply-chain guarantees, prevents dependable reproducible builds, and will make CI consume unreviewed code. Please revert the replace and instead bump github.com/codeready-toolchain/api to an upstream commit that contains the needed changes.
Apply this diff to undo the replace:
```diff
-replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
```
Actionable comments posted: 1
♻️ Duplicate comments (1)
controllers/nstemplateset/cluster_resources.go (1)
253-262: Critical: Ownership check runs on template stubs, not live objects.
oa.currentObjects are freshly rendered template stubs from Status.ClusterResources (line 45), not live objects fetched from the cluster. They have no runtime labels. When shouldDeleteClusterResource checks obj.GetLabels()[ClusterOwnerLabelKey] on line 272, it finds an empty string, triggers the migration path (lines 275-279), and returns true for ALL objects, defeating the entire ownership mechanism.
The race condition is NOT fixed: member1 will still delete member2's cluster resources because every stub appears unowned and gets deleted.
You must Get each live object from the cluster before checking ownership, just as the delete path does on line 137.
Apply this diff to fetch live objects before the ownership check:

```diff
 // Filter out cluster resources
 objectsToDelete := []runtimeclient.Object{}
 for _, obj := range oa.currentObjects {
+	if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
+		if errors.IsNotFound(err) {
+			continue
+		}
+		return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
+	}
 	if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
 		objectsToDelete = append(objectsToDelete, obj)
 	}
 }
```
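The stub-vs-live distinction behind this review comment can be reduced to a tiny self-contained sketch: an ownership check over plain label maps. The label key string below is an assumption for illustration; the point is only that a rendered stub carries no runtime labels, so it always falls into the migration path, while the live object's labels reveal the real owner.

```go
package main

import "fmt"

const clusterOwnerLabel = "toolchain.dev.openshift.com/cluster-owner" // assumed label key

// shouldDelete mirrors the ownership check: resources without the owner label
// are assumed owned (migration path), otherwise the label must match our namespace.
func shouldDelete(labels map[string]string, ourNamespace string) bool {
	owner := labels[clusterOwnerLabel]
	if owner == "" {
		return true // migration: legacy resource without an owner label
	}
	return owner == ourNamespace
}

func main() {
	stub := map[string]string{}                                // rendered template stub: no runtime labels
	live := map[string]string{clusterOwnerLabel: "member2-ns"} // live object relabeled by member2

	fmt.Println(shouldDelete(stub, "member1-ns")) // true - a stub always looks unowned
	fmt.Println(shouldDelete(live, "member1-ns")) // false - the live labels reveal the real owner
}
```

Running the check on the stub returns true even when the live object is owned by someone else, which is exactly why the suggested diff fetches the live object first.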
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- controllers/nstemplateset/cluster_resources.go (3 hunks)
🔇 Additional comments (2)
controllers/nstemplateset/cluster_resources.go (2)
106-108: Good addition with clear documentation. The ClusterOwnerLabelKey label properly identifies the owning member operator namespace, and the comments clearly explain its purpose.
146-149: Delete path correctly checks ownership on live objects. The ownership check is properly implemented here because the live object is fetched on line 137 before the ownership check on line 147.
```go
// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
func shouldDeleteClusterResource(ctx context.Context, obj runtimeclient.Object, nsTmplSet *toolchainv1alpha1.NSTemplateSet) bool {
	logger := log.FromContext(ctx)
	ownerNamespace := obj.GetLabels()[toolchainv1alpha1.ClusterOwnerLabelKey]

	// MIGRATION: Old resources without owner label - assume ownership for backward compatibility
	if ownerNamespace == "" {
		logger.Info("cluster resource has no owner label - assuming ownership for migration",
			"kind", obj.GetObjectKind().GroupVersionKind().Kind,
			"name", obj.GetName())
		return true
	}

	// Check if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
	if ownerNamespace == nsTmplSet.Namespace {
		return true
	}

	// Owned by different namespace - skip deletion
	logger.Info("skipping deletion of cluster resource - not owned by the same namespace as the NSTemplateSet",
		"kind", obj.GetObjectKind().GroupVersionKind().Kind,
		"name", obj.GetName(),
		"owner_namespace", ownerNamespace,
		"current_namespace", nsTmplSet.Namespace)
	return false
}
```
🛠️ Refactor suggestion | 🟠 Major
Document that this function requires live cluster objects.
The ownership logic is sound, but the function assumes obj has been fetched from the cluster with runtime labels. Add a comment documenting this requirement to prevent misuse.
Apply this diff to document the requirement:

```diff
-// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
+// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace).
+// IMPORTANT: obj must be a live object fetched from the cluster via Client.Get, not a template stub,
+// because this function checks runtime labels (ClusterOwnerLabelKey) that are only present on live objects.
 func shouldDeleteClusterResource(ctx context.Context, obj runtimeclient.Object, nsTmplSet *toolchainv1alpha1.NSTemplateSet) bool {
```
func shouldDeleteClusterResource(ctx context.Context, obj runtimeclient.Object, nsTmplSet *toolchainv1alpha1.NSTemplateSet) bool {🤖 Prompt for AI Agents
In controllers/nstemplateset/cluster_resources.go around lines 269 to 294, add a
clear doc comment above shouldDeleteClusterResource stating that the function
requires a live cluster object with labels populated (i.e. the object must have
been fetched from the API server via client.Get/list so GetLabels() returns
runtime labels), and note the migration behavior for legacy resources without
owner labels; this prevents callers from passing in local/template objects that
lack runtime labels and misinterpreting ownership.
TBH, I'm a bit confused. How is this possible when we delete the NSTemplateSet in the "original cluster" first, before provisioning the NSTemplateSet in the new one? At the time when we retarget the Space to the new cluster everything should be gone; the "old NSTemplateSet" should be deleted as well, thus the member-operator shouldn't touch the resources, correct? In other words, it shouldn't happen that two controllers would try to touch the same resource. If this happens then there is a different bug to fix.
Ok, I went through the logs and it's apparent that the NSTemplateSet was created in member2 while the NSTemplateSet member1 controller was reconciling the deletion operation. The last member1 log related to this user: but at that time, the NSTemplateSet was already marked for deletion (and maybe also "almost" gone), because the removal of the finalizer happened here: also, from the host-operator point of view the NSTemplateSet was already gone, so it created the NSTemplateSet in member2: in other words, there is a problem with an outdated cache in member1.
The label solution should work fine, but I'm not happy about exposing an e2e-specific workaround in the CRs. Personally, I would do the following:
This should fix the bug with later reconciliations caused by the outdated cache.
This makes sense; I'm not sure why we currently delete the namespaces first and only later the cluster-scoped resources. The other way around makes more sense. Even if I'm not 100% sure it will solve the cache issue (we cannot know when the cache is updated, apparently), it's still worth doing this change as it makes more sense to delete resources in this order 👍
If we also update the status of the NSTemplateSet by setting Status.ClusterResources to nil, then later reconciliations will have nothing left to delete (see member-operator/controllers/nstemplateset/cluster_resources.go, lines 120 to 124 in 0568ed2).
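The idea in this comment can be modeled in a few lines: deletion is driven by what the status records, so once the status is cleared, a re-reconcile (even one fed by a stale cache) finds nothing to act on. The types and resource names below are simplified stand-ins, not the real controller code:

```go
package main

import "fmt"

// nsTemplateSet stands in for the real NSTemplateSet; the slice models
// Status.ClusterResources as a list of cluster-scoped object names.
type nsTemplateSet struct {
	clusterResources []string
}

// deleteClusterResources deletes whatever the status records, then clears the
// status so that later reconciles become no-ops. Returns the number deleted.
func deleteClusterResources(set *nsTemplateSet, deleteObj func(string)) int {
	if set.clusterResources == nil {
		return 0 // status already cleared - nothing to act on
	}
	deleted := 0
	for _, name := range set.clusterResources {
		deleteObj(name)
		deleted++
	}
	set.clusterResources = nil // clear the status so re-reconciles are no-ops
	return deleted
}

func main() {
	// Hypothetical resource names for illustration only.
	set := &nsTemplateSet{clusterResources: []string{"alice-tekton-view", "alice-networking"}}
	noop := func(string) {}
	fmt.Println(deleteClusterResources(set, noop)) // 2: first reconcile deletes
	fmt.Println(deleteClusterResources(set, noop)) // 0: second reconcile is a no-op
}
```

This is why clearing the status helps even with an outdated cache: the second reconcile never reaches the deletion loop, regardless of what objects the cache still reports.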
Ok, I see now, thanks for explaining it. That should definitely help indeed, even if the cache is out of date 👍
Thank you so much @MatousJobanek for the explanation and suggestion! Going to address the changes.
Actionable comments posted: 0
🧹 Nitpick comments (1)
controllers/nstemplateset/nstemplateset_controller.go (1)
203-207: Consider wrapping the error with a status update for consistency.
The cluster resource deletion error is propagated directly without updating the NSTemplateSet status to TerminatingFailed, unlike the namespace deletion error handling at line 218. This inconsistency means users may not see a clear failure status if cluster resource deletion fails.
Consider applying this pattern for consistency:

```diff
 // delete cluster resources first
 err := r.clusterResources.delete(ctx, nsTmplSet)
 if err != nil {
-	return reconcile.Result{}, err
+	return reconcile.Result{}, r.status.wrapErrorWithStatusUpdate(ctx, nsTmplSet, r.status.setStatusTerminatingFailed, err, "failed to delete cluster resources")
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- controllers/nstemplateset/nstemplateset_controller.go (1 hunks)
- controllers/nstemplateset/status.go (1 hunks)
🪛 GitHub Actions: govulncheck
controllers/nstemplateset/nstemplateset_controller.go
[error] 116-116: GO-2025-4012: Lack of limit when parsing cookies can cause memory exhaustion in net/http. Found in: net/http@go1.23.12. Fixed in: net/http@go1.24.8
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build & push operator bundles for e2e tests
🔇 Additional comments (2)
controllers/nstemplateset/status.go (1)
294-301: LGTM! Well-designed helper method. The implementation correctly guards against unnecessary API calls when the status is already nil, and follows the established patterns in this file for status updates.
controllers/nstemplateset/nstemplateset_controller.go (1)
203-212: Excellent fix addressing the race condition! The reordered deletion flow (cluster resources first, then status clearing, then namespaces) effectively addresses the race condition described in the PR objectives. Clearing the status immediately after cluster resource deletion prevents redundant deletion attempts during subsequent reconciliations with stale cache, as discussed in the PR comments.
MatousJobanek left a comment
Nice, can you add/update unit tests please?
Sure! Addressed.
MatousJobanek left a comment
Looks good, just small comments regarding the unit tests
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fbm3307, MatousJobanek, rsoaresd
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing



Description
The test TestRetargetUserWithSBRByChangingSpaceTargetClusterWhenSpaceIsShared fails intermittently when verifying resources after space retargeting. I suspect that, in the test environment, where we use the same cluster for member1 and member2, a race condition could cause member1 to delete member2's resources.
Why could this be happening?
Job example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-codeready-toolchain-toolchain-e2e-master-ci-daily/1983821289508311040
Issue ticket number and link
SANDBOX-1472
Assisted by Cursor
Summary by CodeRabbit
Bug Fixes
Tests