missing resources after space retargeting#707

Merged
rsoaresd merged 5 commits into codeready-toolchain:master from rsoaresd:fix_delete
Nov 5, 2025

Conversation

Contributor

@rsoaresd rsoaresd commented Oct 30, 2025

Description

The test TestRetargetUserWithSBRByChangingSpaceTargetClusterWhenSpaceIsShared fails intermittently when verifying resources after space retargeting. I suspect that in the test environment, where member1 and member2 share the same cluster, a race condition can cause member1 to delete member2's resources.

Why could this be happening?

  • Cluster-scoped resources (ClusterResourceQuota, Idler) have no namespace
  • They exist once per cluster with a given name
  • Both operators can see and manage the same resource
  • When member1 deletes the resource (idler, for example), it deletes the cluster's only copy
  • Member2 loses its resource even though it just created it!
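The ownership check that the fix in this PR relies on (an owner label on each cluster-scoped resource, discussed in the review below) can be sketched as a plain function over the resource's labels. This is a minimal, self-contained sketch: the label key string and the namespace values are illustrative, and the real constant lives in the codeready-toolchain API.

```go
package main

import "fmt"

// ClusterOwnerLabelKey marks which member-operator namespace "owns" a
// cluster-scoped resource. The key name here is assumed for illustration;
// the actual constant is defined in github.com/codeready-toolchain/api.
const ClusterOwnerLabelKey = "toolchain.dev.openshift.com/cluster-owner"

// shouldDelete reports whether an operator running in operatorNamespace may
// delete a cluster-scoped resource carrying the given labels.
func shouldDelete(labels map[string]string, operatorNamespace string) bool {
	owner := labels[ClusterOwnerLabelKey]
	if owner == "" {
		// migration path: legacy resources without the label are assumed owned
		return true
	}
	return owner == operatorNamespace
}

func main() {
	// member2 has just (re)created the Idler and labeled it as its own,
	// so member1's late deletion attempt must be skipped.
	idlerLabels := map[string]string{ClusterOwnerLabelKey: "toolchain-member2"}
	fmt.Println(shouldDelete(idlerLabels, "toolchain-member"))  // false: skip deletion
	fmt.Println(shouldDelete(idlerLabels, "toolchain-member2")) // true: owner may delete
}
```

With this check in place, only the operator whose namespace matches the label deletes the cluster's single copy of the resource.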

Job example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-codeready-toolchain-toolchain-e2e-master-ci-daily/1983821289508311040

=== NAME  TestRetargetUserWithSBRByChangingSpaceTargetClusterWhenSpaceIsShared
    member.go:1296: failed to find Idler with matching criteria:
        ----
        actual:
        metadata:
          creationTimestamp: null
        spec:
          timeoutSeconds: 0
        status: {}
        
        ----
        diffs:
        expected Idler 'toolchain.dev.openshift.com/tier' label to be 'base1ns' but it was ''
        expected Idler timeoutSeconds to be '43200' but it was '0'
        expected Idler  label to be 'testuser-0f54ed31-8f' but it was ''
        expected Idler  label to be 'codeready-toolchain' but it was ''
        
    checks.go:1086: 
        	Error Trace:	/go/src/github.com/codeready-toolchain/toolchain-e2e/testsupport/tiers/checks.go:1086
        	            				/go/src/github.com/codeready-toolchain/toolchain-e2e/testsupport/tiers/nstemplateset.go:75
        	            				/usr/local/go/src/runtime/asm_amd64.s:1700
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestRetargetUserWithSBRByChangingSpaceTargetClusterWhenSpaceIsShared

Issue ticket number and link

SANDBOX-1472

Assisted by Cursor

Summary by CodeRabbit

  • Bug Fixes

    • NSTemplateSet deletion now cleans up cluster resources before namespace deletion and removes redundant deletion steps.
    • ClusterResources status is cleared promptly after cluster cleanup; failures to clear status are reported as errors.
  • Tests

    • Added tests for successful and failed ClusterResources status clearing and for deletion flows with and without namespaces.

@coderabbitai

coderabbitai bot commented Oct 30, 2025

Walkthrough

Cluster resources are now deleted before namespace deletions in deleteNSTemplateSet; a new statusManager method clearStatusClusterResources was added to nil out NSTemplateSet.Status.ClusterResources and persist the change; tests were added to cover successful clearing and status-update failures.

Changes

Cohort / File(s) Summary
Controller changes
controllers/nstemplateset/nstemplateset_controller.go
deleteNSTemplateSet now calls clusterResources.delete first and propagates errors, then invokes clearStatusClusterResources; removed redundant re-deletion of cluster resources after namespace deletion.
Status helper
controllers/nstemplateset/status.go
Added clearStatusClusterResources on statusManager to set Status.ClusterResources = nil when present and update the NSTemplateSet status via the Kubernetes API; no-op if already nil.
Tests
controllers/nstemplateset/nstemplateset_controller_test.go
Added assertions verifying ClusterResources is cleared after the first reconcile and a new test simulating a failure to clear ClusterResources status expecting an error; minor formatting alignment change.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as deleteNSTemplateSet
    participant Cluster as clusterResources
    participant Status as statusManager
    participant K8s as Kubernetes API

    rect rgb(220,230,255)
    Note over Controller,Cluster: Step 1 — delete cluster-scoped resources
    Controller->>Cluster: delete()
    Cluster->>K8s: delete cluster resources
    K8s-->>Cluster: success / error
    Cluster-->>Controller: result
    end

    rect rgb(220,230,255)
    Note over Controller,Status: Step 2 — clear ClusterResources status
    Controller->>Status: clearStatusClusterResources()
    alt ClusterResources non-nil
        Status->>K8s: update status (set ClusterResources = nil)
        K8s-->>Status: ACK / error
        Status-->>Controller: nil / error
    else already nil
        Status-->>Controller: nil (no update)
    end
    end

    rect rgb(230,220,230)
    Note over Controller: Step 3 — proceed with namespace deletions
    Controller->>Controller: delete namespaces
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review focus:
    • Error wrapping/propagation around cluster resource deletion and status clearing.
    • Status update call behavior and reconciliation implications when it fails.
    • Test mocks that simulate status update failures.

Possibly related PRs

Suggested reviewers

  • MatousJobanek
  • xcoulon

Poem

🐰 I nibble order, tidy up the mess,
First cluster gone, then status I address.
No double deletes, just one clean sweep,
A hop, a patch — the repo sleeps. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title check ⚠️ Warning The title 'missing resources after space retargeting' is vague and does not clearly describe the actual changes made. The PR implements deletion order changes and status clearing in NSTemplateSet deletion, not resource visibility issues. Revise the title to reflect the core changes, such as 'Clear cluster resources status immediately after deletion in NSTemplateSet' or 'Fix cluster resource deletion order in NSTemplateSet cleanup flow'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60f0271 and 502d7aa.

📒 Files selected for processing (1)
  • controllers/nstemplateset/nstemplateset_controller_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/nstemplateset/nstemplateset_controller_test.go


@codecov

codecov bot commented Oct 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.53%. Comparing base (610d5f0) to head (502d7aa).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
+ Coverage   82.50%   82.53%   +0.03%     
==========================================
  Files          48       48              
  Lines        3589     3596       +7     
==========================================
+ Hits         2961     2968       +7     
  Misses        477      477              
  Partials      151      151              
Files with missing lines Coverage Δ
...trollers/nstemplateset/nstemplateset_controller.go 73.01% <100.00%> (+0.43%) ⬆️
controllers/nstemplateset/status.go 93.33% <100.00%> (+0.17%) ⬆️

Contributor

@metlos metlos left a comment


This is quite an elegant solution to the problem of retargeting spaces between members within the same cluster. I'd just add more comments explaining what the owner label means and what it tries to solve. It's not very obvious when just looking at the code.

toolchainv1alpha1.TemplateRefLabelKey: tierTemplate.templateRef,
toolchainv1alpha1.TierLabelKey: tierTemplate.tierName,
toolchainv1alpha1.ProviderLabelKey: toolchainv1alpha1.ProviderLabelValue,
toolchainv1alpha1.OwnerLabelKey: nsTmplSet.Namespace,
Contributor

@metlos metlos Oct 31, 2025


It took me a good while to realize that nsTmplSet.Namespace is used here to distinguish the member operators in the cluster. Maybe this could be worth a comment.

But more importantly, I'm not sure this correct.

The assumption that we make about the cluster resources managed by NSTemplateSet is that they are specific to the Space for which they have been deployed (which implies the ${SPACE_NAME} in their metadata.name btw.) and so this kind of conflict should never actually happen. If it does, then I think the bug is actually in the tests, not in this code.

The assumption is mentioned here: https://github.com/codeready-toolchain/member-operator/blob/master/controllers/nstemplateset/cluster_resources.go#L25

Contributor

@metlos metlos Oct 31, 2025


Hmm... I'm stupid... I completely forgot this tries to solve a race condition while retargeting the same space 🤦🏼 I updated the review comment and crossed out my incorrect ideas in the line comment above.

Contributor Author


So sorry!! I should have explained it better. I will add better comments!

Contributor

@mfrancisc mfrancisc left a comment


Thanks for looking into it 🙏

I have one small comment related to the label used.

toolchainv1alpha1.TemplateRefLabelKey: tierTemplate.templateRef,
toolchainv1alpha1.TierLabelKey: tierTemplate.tierName,
toolchainv1alpha1.ProviderLabelKey: toolchainv1alpha1.ProviderLabelValue,
toolchainv1alpha1.OwnerLabelKey: nsTmplSet.Namespace,
Contributor


Should we use a different label? OwnerLabelKey is usually used in a different way, by assigning it the value of the UserSignup name.

Maybe create a new label like NamespaceOwnerLabelKey or ClusterOwnerLabelKey and pass the cluster name?

Contributor Author

@rsoaresd rsoaresd Oct 31, 2025


My bad!! I was trying to reuse an existing label and thought OwnerLabelKey would fit! I agree with you! I will create a new one.

go.mod Outdated

toolchain go1.23.12

replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
Contributor Author


Temporary until we merge codeready-toolchain/api#489.

@rsoaresd rsoaresd requested review from metlos and mfrancisc October 31, 2025 11:53

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe23cf3 and 5646082.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • controllers/nstemplateset/cluster_resources.go (3 hunks)
  • go.mod (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.
📚 Learning: 2025-09-24T11:19:39.229Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-24T10:52:01.009Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: In the member-operator codebase, MEMBER_OPERATOR_NAMESPACE is only used in OpenShift templates (with ${MEMBER_OPERATOR_NAMESPACE} syntax), not in go templates (which would use {{.MEMBER_OPERATOR_NAMESPACE}} syntax). Go templates in TierTemplateRevision resources use other parameters like SPACE_NAME, NAMESPACE, CONFIG_VALUE, etc., but not MEMBER_OPERATOR_NAMESPACE.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: The MEMBER_OPERATOR_NAMESPACE parameter is only used in OpenShift template processing, not in go templates in the member-operator codebase. Go templates in TierTemplateRevision resources do not reference this parameter.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-11T08:28:57.736Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:143-149
Timestamp: 2025-09-11T08:28:57.736Z
Learning: The processGoTemplate function in nstemplatetier.go only processes Go templates from TierTemplateRevision resources (using {{...}} syntax), not OpenShift templates (using ${...} syntax). Files like member-sa.yaml are OpenShift templates and are not processed by this function.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: GolangCI Lint
  • GitHub Check: Build & push operator bundles for e2e tests
  • GitHub Check: Test with Coverage

Comment on lines 253 to 262
// Filter out cluster resources
objectsToDelete := []runtimeclient.Object{}
for _, obj := range oa.currentObjects {
if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
objectsToDelete = append(objectsToDelete, obj)
}
}

if err := deleteObsoleteObjects(ctx, oa.r.Client, objectsToDelete, nil); err != nil {
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")

⚠️ Potential issue | 🔴 Critical

Ownership filter still deletes foreign resources

oa.currentObjects are the templated stubs we rendered from Status.ClusterResources; they have no runtime labels. Because of that, obj.GetLabels()[ClusterOwnerLabelKey] is empty, the migration path returns true, and we still pass every object to deleteObsoleteObjects. In the race you’re fixing, the real cluster object has already been relabeled by the other member namespace, but we never fetch those live labels here, so we still delete their resource.

We have to Get the actual cluster object before running shouldDeleteClusterResource, and only enqueue deletions for objects whose live labels match our namespace. Propagating the Get error keeps status handling consistent with the rest of the controller.

Apply this diff so the ownership check inspects the real object:

-	// Filter out cluster resources
-	objectsToDelete := []runtimeclient.Object{}
-	for _, obj := range oa.currentObjects {
-		if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
-			objectsToDelete = append(objectsToDelete, obj)
-		}
-	}
+	// Filter out cluster resources
+	objectsToDelete := []runtimeclient.Object{}
+	for _, obj := range oa.currentObjects {
+		if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
+			if errors.IsNotFound(err) {
+				continue
+			}
+			return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
+		}
+		if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
+			objectsToDelete = append(objectsToDelete, obj)
+		}
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Filter out cluster resources
objectsToDelete := []runtimeclient.Object{}
for _, obj := range oa.currentObjects {
if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
objectsToDelete = append(objectsToDelete, obj)
}
}
if err := deleteObsoleteObjects(ctx, oa.r.Client, objectsToDelete, nil); err != nil {
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
// Filter out cluster resources
objectsToDelete := []runtimeclient.Object{}
for _, obj := range oa.currentObjects {
if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
if errors.IsNotFound(err) {
continue
}
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
}
if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
objectsToDelete = append(objectsToDelete, obj)
}
}
if err := deleteObsoleteObjects(ctx, oa.r.Client, objectsToDelete, nil); err != nil {
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")

go.mod Outdated

toolchain go1.23.12

replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571

⚠️ Potential issue | 🟠 Major

Drop the personal fork replace directive

The new replace github.com/codeready-toolchain/api => github.com/rsoaresd/api ... forces everyone to build against a personal fork. That breaks our supply-chain guarantees, prevents dependable reproducible builds, and will make CI consume unreviewed code. Please revert the replace and instead bump github.com/codeready-toolchain/api to an upstream commit that contains the needed changes.

Apply this diff to undo the replace:

- replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
🤖 Prompt for AI Agents
In go.mod around line 132, there is a replace directive pointing to a personal
fork (github.com/rsoaresd/api) which forces all builds to use untrusted code;
remove that replace directive entirely and instead update the require for
github.com/codeready-toolchain/api to the appropriate upstream pseudo-version or
commit that contains the needed changes (use the upstream module path and commit
hash), run go mod tidy to resolve and vendor/update dependencies, and verify CI
passes to ensure no forked code remains.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
controllers/nstemplateset/cluster_resources.go (1)

253-262: Critical: Ownership check runs on template stubs, not live objects.

oa.currentObjects are freshly rendered template stubs from Status.ClusterResources (line 45), not live objects fetched from the cluster. They have no runtime labels. When shouldDeleteClusterResource checks obj.GetLabels()[ClusterOwnerLabelKey] on line 272, it finds an empty string, triggers the migration path (line 275-279), and returns true for ALL objects—defeating the entire ownership mechanism.

The race condition is NOT fixed: member1 will still delete member2's cluster resources because every stub appears unowned and gets deleted.

You must Get each live object from the cluster before checking ownership, just as the delete path does on line 137.

Apply this diff to fetch live objects before the ownership check:

 	// Filter out cluster resources
 	objectsToDelete := []runtimeclient.Object{}
 	for _, obj := range oa.currentObjects {
+		if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
+			if errors.IsNotFound(err) {
+				continue
+			}
+			return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
+		}
 		if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
 			objectsToDelete = append(objectsToDelete, obj)
 		}
 	}
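The pitfall the review describes can be reproduced outside the controller: a freshly rendered template stub carries no runtime labels, so any ownership check run on the stub falls into the migration path and claims ownership, regardless of who owns the live object. The types below are simplified, hypothetical stand-ins.

```go
package main

import "fmt"

// Label key assumed for illustration; the real constant lives in the
// codeready-toolchain API.
const ClusterOwnerLabelKey = "toolchain.dev.openshift.com/cluster-owner"

// resource is a stand-in for a cluster-scoped object; only labels matter here.
type resource struct{ labels map[string]string }

// ownedBy mimics shouldDeleteClusterResource's label logic: an empty owner
// label falls into the migration path and counts as owned.
func ownedBy(r resource, operatorNamespace string) bool {
	owner := r.labels[ClusterOwnerLabelKey]
	return owner == "" || owner == operatorNamespace
}

func main() {
	// Rendered from the template: no runtime labels at all.
	stub := resource{labels: map[string]string{}}
	// The live object in the cluster, already relabeled by member2.
	live := resource{labels: map[string]string{ClusterOwnerLabelKey: "toolchain-member2"}}

	// Checking the stub always claims ownership -- the wrong basis for a
	// deletion decision. Checking the live object's labels protects member2.
	fmt.Println(ownedBy(stub, "toolchain-member")) // true
	fmt.Println(ownedBy(live, "toolchain-member")) // false
}
```

This is why the suggested diff fetches each live object with Client.Get before running the ownership check.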
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5646082 and 4478f05.

📒 Files selected for processing (1)
  • controllers/nstemplateset/cluster_resources.go (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.
📚 Learning: 2025-09-24T11:19:39.229Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-24T10:52:01.009Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: In the member-operator codebase, MEMBER_OPERATOR_NAMESPACE is only used in OpenShift templates (with ${MEMBER_OPERATOR_NAMESPACE} syntax), not in go templates (which would use {{.MEMBER_OPERATOR_NAMESPACE}} syntax). Go templates in TierTemplateRevision resources use other parameters like SPACE_NAME, NAMESPACE, CONFIG_VALUE, etc., but not MEMBER_OPERATOR_NAMESPACE.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: The MEMBER_OPERATOR_NAMESPACE parameter is only used in OpenShift template processing, not in go templates in the member-operator codebase. Go templates in TierTemplateRevision resources do not reference this parameter.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-11T08:28:57.736Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:143-149
Timestamp: 2025-09-11T08:28:57.736Z
Learning: The processGoTemplate function in nstemplatetier.go only processes Go templates from TierTemplateRevision resources (using {{...}} syntax), not OpenShift templates (using ${...} syntax). Files like member-sa.yaml are OpenShift templates and are not processed by this function.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
🔇 Additional comments (2)
controllers/nstemplateset/cluster_resources.go (2)

106-108: Good addition with clear documentation.

The ClusterOwnerLabelKey label properly identifies the owning member operator namespace, and the comments clearly explain its purpose.


146-149: Delete path correctly checks ownership on live objects.

The ownership check is properly implemented here because the live object is fetched on line 137 before the ownership check on line 147.

Comment on lines 269 to 294
// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
func shouldDeleteClusterResource(ctx context.Context, obj runtimeclient.Object, nsTmplSet *toolchainv1alpha1.NSTemplateSet) bool {
logger := log.FromContext(ctx)
ownerNamespace := obj.GetLabels()[toolchainv1alpha1.ClusterOwnerLabelKey]

// MIGRATION: Old resources without owner label - assume ownership for backward compatibility
if ownerNamespace == "" {
logger.Info("cluster resource has no owner label - assuming ownership for migration",
"kind", obj.GetObjectKind().GroupVersionKind().Kind,
"name", obj.GetName())
return true
}

// Check if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
if ownerNamespace == nsTmplSet.Namespace {
return true
}

// Owned by different namespace - skip deletion
logger.Info("skipping deletion of cluster resource - not owned by the same namespace as the NSTemplateSet",
"kind", obj.GetObjectKind().GroupVersionKind().Kind,
"name", obj.GetName(),
"owner_namespace", ownerNamespace,
"current_namespace", nsTmplSet.Namespace)
return false
}

🛠️ Refactor suggestion | 🟠 Major

Document that this function requires live cluster objects.

The ownership logic is sound, but the function assumes obj has been fetched from the cluster with runtime labels. Add a comment documenting this requirement to prevent misuse.

Apply this diff to document the requirement:

-// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
+// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace).
+// IMPORTANT: obj must be a live object fetched from the cluster via Client.Get, not a template stub,
+// because this function checks runtime labels (ClusterOwnerLabelKey) that are only present on live objects.
 func shouldDeleteClusterResource(ctx context.Context, obj runtimeclient.Object, nsTmplSet *toolchainv1alpha1.NSTemplateSet) bool {
🤖 Prompt for AI Agents
In controllers/nstemplateset/cluster_resources.go around lines 269 to 294, add a
clear doc comment above shouldDeleteClusterResource stating that the function
requires a live cluster object with labels populated (i.e. the object must have
been fetched from the API server via client.Get/list so GetLabels() returns
runtime labels), and note the migration behavior for legacy resources without
owner labels; this prevents callers from passing in local/template objects that
lack runtime labels and misinterpreting ownership.

@MatousJobanek
Contributor

TBH, I'm a bit confused. How is this possible when we delete the NSTemplateSet in the "original cluster" first, before provisioning the NSTemplateSet in the new one? At the time when we retarget the Space to the new cluster, everything should be gone; the "old NSTemplateSet" should be deleted as well, so the member-operator shouldn't touch the resources, correct?
https://github.com/codeready-toolchain/host-operator/blob/2fc7ca618e8cfb8b33b1b9c537ea8a0e92c917ea/controllers/space/space_controller.go#L149-L165

In other words, it shouldn't happen that two controllers would try to touch the same resource. If this happens then there is different bug to fix.

@MatousJobanek
Contributor

Ok, I went through the logs and it's apparent that the NSTemplateSet was created in member2 while the member1 NSTemplateSet controller was still reconciling the deletion.
member2 log:

{"level":"info","ts":"2025-10-30T10:06:58.335Z","msg":"reconciling NSTemplateSet","controller":"nstemplateset","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"NSTemplateSet","NSTemplateSet":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-member2-30095445"},"namespace":"toolchain-member2-30095445","name":"testuser-0f54ed31-8f","reconcileID":"e4715aea-1f60-4ea9-a8fd-aee740009fa7"}

last member1 log related to this user:

{"level":"info","ts":"2025-10-30T10:06:59.059Z","msg":"NSTemplateSet resource is ready to be terminated: all related user namespaces have been marked for deletion","controller":"nstemplateset","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"NSTemplateSet","NSTemplateSet":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-member-30095445"},"namespace":"toolchain-member-30095445","name":"testuser-0f54ed31-8f","reconcileID":"3c5a6aba-924b-479d-8203-567773b8d9ce"}

but at that time, the NSTemplateSet was already marked for deletion (and maybe also "almost" gone), because the removal of the finalizer happened here:

{"level":"info","ts":"2025-10-30T10:06:58.308Z","msg":"NSTemplateSet resource is ready to be terminated: all related user namespaces have been marked for deletion","controller":"nstemplateset","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"NSTemplateSet","NSTemplateSet":{"name":"testuser-0f54ed31-8f"},"namespace":"","name":"testuser-0f54ed31-8f","reconcileID":"998c67fa-e921-4c78-b00c-9d9aa40b4100"}

also, from the host-operator point of view the NSTemplateSet was already gone so it created the NSTemplateSet in member2:

{"level":"info","ts":"2025-10-30T10:06:58.318Z","msg":"the NSTemplateSet resource is already deleted","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-host-30095445"},"namespace":"toolchain-host-30095445","name":"testuser-0f54ed31-8f","reconcileID":"969396c4-85ac-4e84-976f-7324b0e9710e"}
{"level":"info","ts":"2025-10-30T10:06:58.318Z","msg":"resetting 'space.Status.TargetCluster' field","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-host-30095445"},"namespace":"toolchain-host-30095445","name":"testuser-0f54ed31-8f","reconcileID":"969396c4-85ac-4e84-976f-7324b0e9710e"}

in other words, there is a problem with an outdated cache in member1

@MatousJobanek
Contributor

The label solution should work fine, but I'm not happy about exposing an e2e-specific workaround in the CRs.

Personally, I would do the following:

  1. swap the deletion of the namespaces & cluster-scoped resources here:

     // delete all namespaces one by one
     allDeleted, err := r.namespaces.ensureDeleted(ctx, nsTmplSet)
     // when err, the status Update will not trigger a reconcile, so return the error
     if err != nil {
         return reconcile.Result{}, r.status.wrapErrorWithStatusUpdate(ctx, nsTmplSet, r.status.setStatusTerminatingFailed, err, "failed to ensure namespace deletion")
     }
     if !allDeleted {
         if time.Since(nsTmplSet.DeletionTimestamp.Time) > 60*time.Second {
             return reconcile.Result{}, fmt.Errorf("NSTemplateSet deletion has not completed in over 1 minute")
         }
         // One or more namespaces may not yet be deleted. We can stop here.
         return reconcile.Result{
             RequeueAfter: time.Second,
         }, nil
     }
     // if no namespace was to be deleted, then we can proceed with the cluster resources associated with the user
     err = r.clusterResources.delete(ctx, nsTmplSet)
     if err != nil {
         return reconcile.Result{}, err
     }

     so that we delete the cluster-scoped resources first and then the namespaces
  2. set NSTemplateSet.Status.ClusterResources to nil as soon as the deletion of the cluster-scoped resources is performed

This should fix the bug with later reconciliations caused by the outdated cache

@mfrancisc
Contributor

swap the deletion of the namespace & cluster-scoped resources here:

This makes sense; I'm not sure why we currently delete the namespaces first and the cluster-scoped resources only later. The other way around makes more sense.

Even if I'm not 100% sure it will solve the cache issue, since we apparently cannot know when the cache gets updated, it's still worth making this change as it's the more sensible deletion order 👍

@MatousJobanek
Contributor

if we also update the status of the NSTemplateSet by setting NSTemplateSet.Status.ClusterResources to nil (as I stated above), then it should definitely help - if the controller gets reconciled again for any reason, it won't try to re-delete the cluster-scoped resources, see:

// delete deletes all cluster-scoped resources referenced by the nstemplateset.
func (r *clusterResourcesManager) delete(ctx context.Context, nsTmplSet *toolchainv1alpha1.NSTemplateSet) error {
	if nsTmplSet.Status.ClusterResources == nil {
		return nil
	}

@mfrancisc
Contributor

if we update also the status of NSTemplateSet by setting NSTemplateSet.Status.ClusterResources to nil (as I stated above)

Ok I see now, thanks for explaining it. That should definitely help indeed, even if the cache is out of date 👍

@rsoaresd
Contributor Author

rsoaresd commented Nov 3, 2025

Thank you so much @MatousJobanek, for the explanation and suggestion! Going to address the changes

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
controllers/nstemplateset/nstemplateset_controller.go (1)

203-207: Consider wrapping the error with status update for consistency.

The cluster resource deletion error is propagated directly without updating the NSTemplateSet status to TerminatingFailed, unlike the namespace deletion error handling at line 218. This inconsistency means users may not see a clear failure status if cluster resource deletion fails.

Consider applying this pattern for consistency:

 	// delete cluster resources first
 	err := r.clusterResources.delete(ctx, nsTmplSet)
 	if err != nil {
-		return reconcile.Result{}, err
+		return reconcile.Result{}, r.status.wrapErrorWithStatusUpdate(ctx, nsTmplSet, r.status.setStatusTerminatingFailed, err, "failed to delete cluster resources")
 	}
📜 Review details

Reviewing files that changed from the base of the PR and between 4478f05 and 7ac1a8e.

📒 Files selected for processing (2)
  • controllers/nstemplateset/nstemplateset_controller.go (1 hunks)
  • controllers/nstemplateset/status.go (1 hunks)

🪛 GitHub Actions: govulncheck
controllers/nstemplateset/nstemplateset_controller.go

[error] 116-116: GO-2025-4012: Lack of limit when parsing cookies can cause memory exhaustion in net/http. Found in: net/http@go1.23.12. Fixed in: net/http@go1.24.8
🔇 Additional comments (2)
controllers/nstemplateset/status.go (1)

294-301: LGTM! Well-designed helper method.

The implementation correctly guards against unnecessary API calls when the status is already nil, and follows the established patterns in this file for status updates.

controllers/nstemplateset/nstemplateset_controller.go (1)

203-212: Excellent fix addressing the race condition!

The reordered deletion flow (cluster resources first, then status clearing, then namespaces) effectively addresses the race condition described in the PR objectives. Clearing the status immediately after cluster resource deletion prevents redundant deletion attempts during subsequent reconciliations with stale cache, as discussed in the PR comments.

Contributor

@MatousJobanek MatousJobanek left a comment

Nice, can you add/update unit tests please?

@rsoaresd
Contributor Author

rsoaresd commented Nov 3, 2025

Nice, can you add/update unit tests please?

Sure! Addressed

Contributor

@MatousJobanek MatousJobanek left a comment

Looks good, just small comments regarding the unit tests

@sonarqubecloud

sonarqubecloud bot commented Nov 4, 2025

@openshift-ci

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fbm3307, MatousJobanek, rsoaresd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [MatousJobanek,fbm3307,rsoaresd]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rsoaresd rsoaresd merged commit 660fbf4 into codeready-toolchain:master Nov 5, 2025
14 of 15 checks passed