missing resources after space retargeting#707

Merged
rsoaresd merged 5 commits into codeready-toolchain:master from rsoaresd:fix_delete
Nov 5, 2025

Conversation

Contributor

@rsoaresd rsoaresd commented Oct 30, 2025

Description

The test TestRetargetUserWithSBRByChangingSpaceTargetClusterWhenSpaceIsShared fails intermittently when verifying resources after space retargeting. I suspect that in the test environment, where member1 and member2 share the same cluster, a race condition can cause member1 to delete member2's resources.

Why could this be happening?

  • Cluster-scoped resources (ClusterResourceQuota, Idler) have no namespace
  • They exist once per cluster with a given name
  • Both operators can see and manage the same resource
  • When member1 deletes the resource (idler, for example), it deletes the cluster's only copy
  • Member2 loses its resource even though it just created it!
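The ownership check that the fix in this PR relies on (an owner label on each cluster-scoped resource, discussed in the review below) can be sketched as a plain function over the resource's labels. This is a minimal, self-contained sketch: the label key string and the namespace values are illustrative, and the real constant lives in the codeready-toolchain API.

```go
package main

import "fmt"

// ClusterOwnerLabelKey marks which member-operator namespace "owns" a
// cluster-scoped resource. The key name here is assumed for illustration;
// the actual constant is defined in github.com/codeready-toolchain/api.
const ClusterOwnerLabelKey = "toolchain.dev.openshift.com/cluster-owner"

// shouldDelete reports whether an operator running in operatorNamespace may
// delete a cluster-scoped resource carrying the given labels.
func shouldDelete(labels map[string]string, operatorNamespace string) bool {
	owner := labels[ClusterOwnerLabelKey]
	if owner == "" {
		// migration path: legacy resources without the label are assumed owned
		return true
	}
	return owner == operatorNamespace
}

func main() {
	// member2 has just (re)created the Idler and labeled it as its own,
	// so member1's late deletion attempt must be skipped.
	idlerLabels := map[string]string{ClusterOwnerLabelKey: "toolchain-member2"}
	fmt.Println(shouldDelete(idlerLabels, "toolchain-member"))  // false: skip deletion
	fmt.Println(shouldDelete(idlerLabels, "toolchain-member2")) // true: owner may delete
}
```

With this check in place, only the operator whose namespace matches the label deletes the cluster's single copy of the resource.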

Job example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-codeready-toolchain-toolchain-e2e-master-ci-daily/1983821289508311040

=== NAME  TestRetargetUserWithSBRByChangingSpaceTargetClusterWhenSpaceIsShared
    member.go:1296: failed to find Idler with matching criteria:
        ----
        actual:
        metadata:
          creationTimestamp: null
        spec:
          timeoutSeconds: 0
        status: {}
        
        ----
        diffs:
        expected Idler 'toolchain.dev.openshift.com/tier' label to be 'base1ns' but it was ''
        expected Idler timeoutSeconds to be '43200' but it was '0'
        expected Idler  label to be 'testuser-0f54ed31-8f' but it was ''
        expected Idler  label to be 'codeready-toolchain' but it was ''
        
    checks.go:1086: 
        	Error Trace:	/go/src/github.com/codeready-toolchain/toolchain-e2e/testsupport/tiers/checks.go:1086
        	            				/go/src/github.com/codeready-toolchain/toolchain-e2e/testsupport/tiers/nstemplateset.go:75
        	            				/usr/local/go/src/runtime/asm_amd64.s:1700
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestRetargetUserWithSBRByChangingSpaceTargetClusterWhenSpaceIsShared

Issue ticket number and link

SANDBOX-1472

Assisted by Cursor

Summary by CodeRabbit

  • Bug Fixes

    • NSTemplateSet deletion now cleans up cluster resources before namespace deletion and removes redundant deletion steps.
    • ClusterResources status is cleared promptly after cluster cleanup; failures to clear status are reported as errors.
  • Tests

    • Added tests for successful and failed ClusterResources status clearing and for deletion flows with and without namespaces.

@coderabbitai

coderabbitai bot commented Oct 30, 2025

Walkthrough

Cluster resources are now deleted before namespace deletions in deleteNSTemplateSet; a new statusManager method clearStatusClusterResources was added to nil out NSTemplateSet.Status.ClusterResources and persist the change; tests were added to cover successful clearing and status-update failures.

Changes

Cohort / File(s) Summary
Controller changes
controllers/nstemplateset/nstemplateset_controller.go
deleteNSTemplateSet now calls clusterResources.delete first and propagates errors, then invokes clearStatusClusterResources; removed redundant re-deletion of cluster resources after namespace deletion.
Status helper
controllers/nstemplateset/status.go
Added clearStatusClusterResources on statusManager to set Status.ClusterResources = nil when present and update the NSTemplateSet status via the Kubernetes API; no-op if already nil.
Tests
controllers/nstemplateset/nstemplateset_controller_test.go
Added assertions verifying ClusterResources is cleared after the first reconcile and a new test simulating a failure to clear ClusterResources status expecting an error; minor formatting alignment change.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as deleteNSTemplateSet
    participant Cluster as clusterResources
    participant Status as statusManager
    participant K8s as Kubernetes API

    rect rgb(220,230,255)
    Note over Controller,Cluster: Step 1 — delete cluster-scoped resources
    Controller->>Cluster: delete()
    Cluster->>K8s: delete cluster resources
    K8s-->>Cluster: success / error
    Cluster-->>Controller: result
    end

    rect rgb(220,230,255)
    Note over Controller,Status: Step 2 — clear ClusterResources status
    Controller->>Status: clearStatusClusterResources()
    alt ClusterResources non-nil
        Status->>K8s: update status (set ClusterResources = nil)
        K8s-->>Status: ACK / error
        Status-->>Controller: nil / error
    else already nil
        Status-->>Controller: nil (no update)
    end
    end

    rect rgb(230,220,230)
    Note over Controller: Step 3 — proceed with namespace deletions
    Controller->>Controller: delete namespaces
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review focus:
    • Error wrapping/propagation around cluster resource deletion and status clearing.
    • Status update call behavior and reconciliation implications when it fails.
    • Test mocks that simulate status update failures.

Possibly related PRs

Suggested reviewers

  • MatousJobanek
  • xcoulon

Poem

🐰 I nibble order, tidy up the mess,
First cluster gone, then status I address.
No double deletes, just one clean sweep,
A hop, a patch — the repo sleeps. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title check ⚠️ Warning The title 'missing resources after space retargeting' is vague and does not clearly describe the actual changes made. The PR implements deletion order changes and status clearing in NSTemplateSet deletion, not resource visibility issues. Revise the title to reflect the core changes, such as 'Clear cluster resources status immediately after deletion in NSTemplateSet' or 'Fix cluster resource deletion order in NSTemplateSet cleanup flow'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60f0271 and 502d7aa.

📒 Files selected for processing (1)
  • controllers/nstemplateset/nstemplateset_controller_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/nstemplateset/nstemplateset_controller_test.go


@codecov

codecov bot commented Oct 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.53%. Comparing base (610d5f0) to head (502d7aa).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
+ Coverage   82.50%   82.53%   +0.03%     
==========================================
  Files          48       48              
  Lines        3589     3596       +7     
==========================================
+ Hits         2961     2968       +7     
  Misses        477      477              
  Partials      151      151              
Files with missing lines Coverage Δ
...trollers/nstemplateset/nstemplateset_controller.go 73.01% <100.00%> (+0.43%) ⬆️
controllers/nstemplateset/status.go 93.33% <100.00%> (+0.17%) ⬆️

Contributor

@metlos metlos left a comment


This is quite an elegant solution to the problem of retargeting spaces between members within the same cluster. I'd just add more comments explaining what the owner label means and what it tries to solve. It's not very obvious when just looking at the code.

toolchainv1alpha1.TemplateRefLabelKey: tierTemplate.templateRef,
toolchainv1alpha1.TierLabelKey: tierTemplate.tierName,
toolchainv1alpha1.ProviderLabelKey: toolchainv1alpha1.ProviderLabelValue,
toolchainv1alpha1.OwnerLabelKey: nsTmplSet.Namespace,
Contributor

@metlos metlos Oct 31, 2025


It took me a good while to realize that nsTmplSet.Namespace is used here to distinguish the member operators in the cluster. Maybe this could be worth a comment.

But more importantly, I'm not sure this correct.

The assumption that we make about the cluster resources managed by NSTemplateSet is that they are specific to the Space for which they have been deployed (which implies the ${SPACE_NAME} in their metadata.name btw.) and so this kind of conflict should never actually happen. If it does, then I think the bug is actually in the tests, not in this code.

The assumption is mentioned here: https://github.com/codeready-toolchain/member-operator/blob/master/controllers/nstemplateset/cluster_resources.go#L25

Contributor

@metlos metlos Oct 31, 2025


Hmm... I'm stupid... I completely forgot this tries to solve a race condition while retargeting the same space 🤦🏼 I updated the review comment and crossed out my incorrect ideas in the line comment above.

Contributor Author


So sorry!! I should have explained it better. I will add better comments!

Contributor

@mfrancisc mfrancisc left a comment


Thanks for looking into it 🙏

I have one small comment related to the label used.

toolchainv1alpha1.TemplateRefLabelKey: tierTemplate.templateRef,
toolchainv1alpha1.TierLabelKey: tierTemplate.tierName,
toolchainv1alpha1.ProviderLabelKey: toolchainv1alpha1.ProviderLabelValue,
toolchainv1alpha1.OwnerLabelKey: nsTmplSet.Namespace,
Contributor


Should we use a different label? OwnerLabelKey is usually used in a different way, by assigning it the value of the UserSignup name.

Maybe create a new label like NamespaceOwnerLabelKey or ClusterOwnerLabelKey and pass the cluster name?

Contributor Author

@rsoaresd rsoaresd Oct 31, 2025


My bad!! I was trying to reuse an existing label and thought OwnerLabelKey would fit! I agree with you! I will create a new one.

go.mod Outdated

toolchain go1.23.12

replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
Contributor Author


Temporary until we merge codeready-toolchain/api#489.

@rsoaresd rsoaresd requested review from metlos and mfrancisc October 31, 2025 11:53

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe23cf3 and 5646082.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • controllers/nstemplateset/cluster_resources.go (3 hunks)
  • go.mod (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.
📚 Learning: 2025-09-24T11:19:39.229Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-24T10:52:01.009Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: In the member-operator codebase, MEMBER_OPERATOR_NAMESPACE is only used in OpenShift templates (with ${MEMBER_OPERATOR_NAMESPACE} syntax), not in go templates (which would use {{.MEMBER_OPERATOR_NAMESPACE}} syntax). Go templates in TierTemplateRevision resources use other parameters like SPACE_NAME, NAMESPACE, CONFIG_VALUE, etc., but not MEMBER_OPERATOR_NAMESPACE.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: The MEMBER_OPERATOR_NAMESPACE parameter is only used in OpenShift template processing, not in go templates in the member-operator codebase. Go templates in TierTemplateRevision resources do not reference this parameter.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-11T08:28:57.736Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:143-149
Timestamp: 2025-09-11T08:28:57.736Z
Learning: The processGoTemplate function in nstemplatetier.go only processes Go templates from TierTemplateRevision resources (using {{...}} syntax), not OpenShift templates (using ${...} syntax). Files like member-sa.yaml are OpenShift templates and are not processed by this function.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: GolangCI Lint
  • GitHub Check: Build & push operator bundles for e2e tests
  • GitHub Check: Test with Coverage

Comment on lines 253 to 262
// Filter out cluster resources
objectsToDelete := []runtimeclient.Object{}
for _, obj := range oa.currentObjects {
if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
objectsToDelete = append(objectsToDelete, obj)
}
}

if err := deleteObsoleteObjects(ctx, oa.r.Client, objectsToDelete, nil); err != nil {
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")

⚠️ Potential issue | 🔴 Critical

Ownership filter still deletes foreign resources

oa.currentObjects are the templated stubs we rendered from Status.ClusterResources; they have no runtime labels. Because of that, obj.GetLabels()[ClusterOwnerLabelKey] is empty, the migration path returns true, and we still pass every object to deleteObsoleteObjects. In the race you’re fixing, the real cluster object has already been relabeled by the other member namespace, but we never fetch those live labels here, so we still delete their resource.

We have to Get the actual cluster object before running shouldDeleteClusterResource, and only enqueue deletions for objects whose live labels match our namespace. Propagating the Get error keeps status handling consistent with the rest of the controller.

Apply this diff so the ownership check inspects the real object:

-	// Filter out cluster resources
-	objectsToDelete := []runtimeclient.Object{}
-	for _, obj := range oa.currentObjects {
-		if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
-			objectsToDelete = append(objectsToDelete, obj)
-		}
-	}
+	// Filter out cluster resources
+	objectsToDelete := []runtimeclient.Object{}
+	for _, obj := range oa.currentObjects {
+		if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
+			if errors.IsNotFound(err) {
+				continue
+			}
+			return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
+		}
+		if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
+			objectsToDelete = append(objectsToDelete, obj)
+		}
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Filter out cluster resources
objectsToDelete := []runtimeclient.Object{}
for _, obj := range oa.currentObjects {
if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
objectsToDelete = append(objectsToDelete, obj)
}
}
if err := deleteObsoleteObjects(ctx, oa.r.Client, objectsToDelete, nil); err != nil {
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
// Filter out cluster resources
objectsToDelete := []runtimeclient.Object{}
for _, obj := range oa.currentObjects {
if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
if errors.IsNotFound(err) {
continue
}
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
}
if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
objectsToDelete = append(objectsToDelete, obj)
}
}
if err := deleteObsoleteObjects(ctx, oa.r.Client, objectsToDelete, nil); err != nil {
return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")

go.mod Outdated

toolchain go1.23.12

replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571

⚠️ Potential issue | 🟠 Major

Drop the personal fork replace directive

The new replace github.com/codeready-toolchain/api => github.com/rsoaresd/api ... forces everyone to build against a personal fork. That breaks our supply-chain guarantees, prevents dependable reproducible builds, and will make CI consume unreviewed code. Please revert the replace and instead bump github.com/codeready-toolchain/api to an upstream commit that contains the needed changes.

Apply this diff to undo the replace:

- replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
replace github.com/codeready-toolchain/api => github.com/rsoaresd/api v0.0.0-20251031113148-f990a908d571
🤖 Prompt for AI Agents
In go.mod around line 132, there is a replace directive pointing to a personal
fork (github.com/rsoaresd/api) which forces all builds to use untrusted code;
remove that replace directive entirely and instead update the require for
github.com/codeready-toolchain/api to the appropriate upstream pseudo-version or
commit that contains the needed changes (use the upstream module path and commit
hash), run go mod tidy to resolve and vendor/update dependencies, and verify CI
passes to ensure no forked code remains.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
controllers/nstemplateset/cluster_resources.go (1)

253-262: Critical: Ownership check runs on template stubs, not live objects.

oa.currentObjects are freshly rendered template stubs from Status.ClusterResources (line 45), not live objects fetched from the cluster. They have no runtime labels. When shouldDeleteClusterResource checks obj.GetLabels()[ClusterOwnerLabelKey] on line 272, it finds an empty string, triggers the migration path (line 275-279), and returns true for ALL objects—defeating the entire ownership mechanism.

The race condition is NOT fixed: member1 will still delete member2's cluster resources because every stub appears unowned and gets deleted.

You must Get each live object from the cluster before checking ownership, just as the delete path does on line 137.

Apply this diff to fetch live objects before the ownership check:

 	// Filter out cluster resources
 	objectsToDelete := []runtimeclient.Object{}
 	for _, obj := range oa.currentObjects {
+		if err := oa.r.Client.Get(ctx, runtimeclient.ObjectKeyFromObject(obj), obj); err != nil {
+			if errors.IsNotFound(err) {
+				continue
+			}
+			return oa.r.wrapErrorWithStatusUpdate(ctx, oa.nstt, oa.failureStatusReason, err, "failure while syncing cluster resources")
+		}
 		if shouldDeleteClusterResource(ctx, obj, oa.nstt) {
 			objectsToDelete = append(objectsToDelete, obj)
 		}
 	}
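The pitfall the review describes can be reproduced outside the controller: a freshly rendered template stub carries no runtime labels, so any ownership check run on the stub falls into the migration path and claims ownership, regardless of who owns the live object. The types below are simplified, hypothetical stand-ins.

```go
package main

import "fmt"

// Label key assumed for illustration; the real constant lives in the
// codeready-toolchain API.
const ClusterOwnerLabelKey = "toolchain.dev.openshift.com/cluster-owner"

// resource is a stand-in for a cluster-scoped object; only labels matter here.
type resource struct{ labels map[string]string }

// ownedBy mimics shouldDeleteClusterResource's label logic: an empty owner
// label falls into the migration path and counts as owned.
func ownedBy(r resource, operatorNamespace string) bool {
	owner := r.labels[ClusterOwnerLabelKey]
	return owner == "" || owner == operatorNamespace
}

func main() {
	// Rendered from the template: no runtime labels at all.
	stub := resource{labels: map[string]string{}}
	// The live object in the cluster, already relabeled by member2.
	live := resource{labels: map[string]string{ClusterOwnerLabelKey: "toolchain-member2"}}

	// Checking the stub always claims ownership -- the wrong basis for a
	// deletion decision. Checking the live object's labels protects member2.
	fmt.Println(ownedBy(stub, "toolchain-member")) // true
	fmt.Println(ownedBy(live, "toolchain-member")) // false
}
```

This is why the suggested diff fetches each live object with Client.Get before running the ownership check.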
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5646082 and 4478f05.

📒 Files selected for processing (1)
  • controllers/nstemplateset/cluster_resources.go (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.
📚 Learning: 2025-09-24T11:19:39.229Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:187-196
Timestamp: 2025-09-24T11:19:39.229Z
Learning: In controllers/nstemplateset/cluster_resources.go, the delete method specifically handles cluster-scoped resources only (not namespaced resources), so using types.NamespacedName{Name: toDelete.GetName()} in Client.Get calls is correct since cluster resources don't have namespaces. However, using runtimeclient.ObjectKeyFromObject() is preferred for future-proofing.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-24T10:52:01.009Z
Learnt from: metlos
Repo: codeready-toolchain/member-operator PR: 700
File: controllers/nstemplateset/cluster_resources.go:46-49
Timestamp: 2025-09-24T10:52:01.009Z
Learning: In the NSTemplateSet cluster resources controller, the current approach compares template references (oldTemplateRef vs newTemplateRef) rather than comparing actual cluster resource contents. This assumes TierTemplates are immutable - any content changes result in new TierTemplate versions with new references. The short-circuit when template refs are equal is intentional based on this design, as it cannot detect changes within the same TierTemplate.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: In the member-operator codebase, MEMBER_OPERATOR_NAMESPACE is only used in OpenShift templates (with ${MEMBER_OPERATOR_NAMESPACE} syntax), not in go templates (which would use {{.MEMBER_OPERATOR_NAMESPACE}} syntax). Go templates in TierTemplateRevision resources use other parameters like SPACE_NAME, NAMESPACE, CONFIG_VALUE, etc., but not MEMBER_OPERATOR_NAMESPACE.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-02T10:28:33.280Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:137-141
Timestamp: 2025-09-02T10:28:33.280Z
Learning: The MEMBER_OPERATOR_NAMESPACE parameter is only used in OpenShift template processing, not in go templates in the member-operator codebase. Go templates in TierTemplateRevision resources do not reference this parameter.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
📚 Learning: 2025-09-11T08:28:57.736Z
Learnt from: fbm3307
Repo: codeready-toolchain/member-operator PR: 692
File: controllers/nstemplateset/nstemplatetier.go:143-149
Timestamp: 2025-09-11T08:28:57.736Z
Learning: The processGoTemplate function in nstemplatetier.go only processes Go templates from TierTemplateRevision resources (using {{...}} syntax), not OpenShift templates (using ${...} syntax). Files like member-sa.yaml are OpenShift templates and are not processed by this function.

Applied to files:

  • controllers/nstemplateset/cluster_resources.go
🔇 Additional comments (2)
controllers/nstemplateset/cluster_resources.go (2)

106-108: Good addition with clear documentation.

The ClusterOwnerLabelKey label properly identifies the owning member operator namespace, and the comments clearly explain its purpose.


146-149: Delete path correctly checks ownership on live objects.

The ownership check is properly implemented here because the live object is fetched on line 137 before the ownership check on line 147.

Comment on lines 269 to 294
// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
func shouldDeleteClusterResource(ctx context.Context, obj runtimeclient.Object, nsTmplSet *toolchainv1alpha1.NSTemplateSet) bool {
logger := log.FromContext(ctx)
ownerNamespace := obj.GetLabels()[toolchainv1alpha1.ClusterOwnerLabelKey]

// MIGRATION: Old resources without owner label - assume ownership for backward compatibility
if ownerNamespace == "" {
logger.Info("cluster resource has no owner label - assuming ownership for migration",
"kind", obj.GetObjectKind().GroupVersionKind().Kind,
"name", obj.GetName())
return true
}

// Check if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
if ownerNamespace == nsTmplSet.Namespace {
return true
}

// Owned by different namespace - skip deletion
logger.Info("skipping deletion of cluster resource - not owned by the same namespace as the NSTemplateSet",
"kind", obj.GetObjectKind().GroupVersionKind().Kind,
"name", obj.GetName(),
"owner_namespace", ownerNamespace,
"current_namespace", nsTmplSet.Namespace)
return false
}

🛠️ Refactor suggestion | 🟠 Major

Document that this function requires live cluster objects.

The ownership logic is sound, but the function assumes obj has been fetched from the cluster with runtime labels. Add a comment documenting this requirement to prevent misuse.

Apply this diff to document the requirement:

-// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace)
+// shouldDeleteClusterResource checks if cluster resource is owned by the target cluster owner namespace (NSTemplateSet namespace).
+// IMPORTANT: obj must be a live object fetched from the cluster via Client.Get, not a template stub,
+// because this function checks runtime labels (ClusterOwnerLabelKey) that are only present on live objects.
 func shouldDeleteClusterResource(ctx context.Context, obj runtimeclient.Object, nsTmplSet *toolchainv1alpha1.NSTemplateSet) bool {
🤖 Prompt for AI Agents
In controllers/nstemplateset/cluster_resources.go around lines 269 to 294, add a
clear doc comment above shouldDeleteClusterResource stating that the function
requires a live cluster object with labels populated (i.e. the object must have
been fetched from the API server via client.Get/list so GetLabels() returns
runtime labels), and note the migration behavior for legacy resources without
owner labels; this prevents callers from passing in local/template objects that
lack runtime labels and misinterpreting ownership.

@MatousJobanek
Contributor

TBH, I'm a bit confused. How is this possible when we delete the NSTemplateSet in the "original cluster" first, before provisioning the NSTemplateSet in the new one? At the time when we retarget the Space to the new cluster, everything should be gone; the "old NSTemplateSet" should be deleted as well, so the member-operator shouldn't touch the resources, correct?
https://github.com/codeready-toolchain/host-operator/blob/2fc7ca618e8cfb8b33b1b9c537ea8a0e92c917ea/controllers/space/space_controller.go#L149-L165

In other words, it shouldn't happen that two controllers would try to touch the same resource. If this happens then there is different bug to fix.

@MatousJobanek
Contributor

Ok, I went through the logs and it's apparent that the NSTemplateSet was created in member2 while the member1 NSTemplateSet controller was still reconciling the deletion.
member2 log:

{"level":"info","ts":"2025-10-30T10:06:58.335Z","msg":"reconciling NSTemplateSet","controller":"nstemplateset","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"NSTemplateSet","NSTemplateSet":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-member2-30095445"},"namespace":"toolchain-member2-30095445","name":"testuser-0f54ed31-8f","reconcileID":"e4715aea-1f60-4ea9-a8fd-aee740009fa7"}

last member1 log related to this user:

{"level":"info","ts":"2025-10-30T10:06:59.059Z","msg":"NSTemplateSet resource is ready to be terminated: all related user namespaces have been marked for deletion","controller":"nstemplateset","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"NSTemplateSet","NSTemplateSet":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-member-30095445"},"namespace":"toolchain-member-30095445","name":"testuser-0f54ed31-8f","reconcileID":"3c5a6aba-924b-479d-8203-567773b8d9ce"}

but at that time, the NSTemplateSet was already marked for deletion (and maybe also "almost" gone), because the removal of the finalizer happened here:

{"level":"info","ts":"2025-10-30T10:06:58.308Z","msg":"NSTemplateSet resource is ready to be terminated: all related user namespaces have been marked for deletion","controller":"nstemplateset","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"NSTemplateSet","NSTemplateSet":{"name":"testuser-0f54ed31-8f"},"namespace":"","name":"testuser-0f54ed31-8f","reconcileID":"998c67fa-e921-4c78-b00c-9d9aa40b4100"}

also, from the host-operator point of view the NSTemplateSet was already gone so it created the NSTemplateSet in member2:

{"level":"info","ts":"2025-10-30T10:06:58.318Z","msg":"the NSTemplateSet resource is already deleted","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-host-30095445"},"namespace":"toolchain-host-30095445","name":"testuser-0f54ed31-8f","reconcileID":"969396c4-85ac-4e84-976f-7324b0e9710e"}
{"level":"info","ts":"2025-10-30T10:06:58.318Z","msg":"resetting 'space.Status.TargetCluster' field","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testuser-0f54ed31-8f","namespace":"toolchain-host-30095445"},"namespace":"toolchain-host-30095445","name":"testuser-0f54ed31-8f","reconcileID":"969396c4-85ac-4e84-976f-7324b0e9710e"}

in other words, there is a problem with an outdated cache in member1

@MatousJobanek
Contributor

The label solution should work fine, but I'm not happy about exposing an e2e-specific workaround in the CRs.

Personally, I would do the following:

  1. swap the deletion of the namespaces & cluster-scoped resources here:

     // delete all namespaces one by one
     allDeleted, err := r.namespaces.ensureDeleted(ctx, nsTmplSet)
     // when err, the status Update will not trigger a reconcile, so return the error
     if err != nil {
         return reconcile.Result{}, r.status.wrapErrorWithStatusUpdate(ctx, nsTmplSet, r.status.setStatusTerminatingFailed, err, "failed to ensure namespace deletion")
     }
     if !allDeleted {
         if time.Since(nsTmplSet.DeletionTimestamp.Time) > 60*time.Second {
             return reconcile.Result{}, fmt.Errorf("NSTemplateSet deletion has not completed in over 1 minute")
         }
         // One or more namespaces may not yet be deleted. We can stop here.
         return reconcile.Result{
             RequeueAfter: time.Second,
         }, nil
     }
     // if no namespace was to be deleted, then we can proceed with the cluster resources associated with the user
     err = r.clusterResources.delete(ctx, nsTmplSet)
     if err != nil {
         return reconcile.Result{}, err
     }

     so that we delete the cluster-scoped resources first and then the namespaces
  2. set NSTemplateSet.Status.ClusterResources to nil as soon as the deletion of the cluster-scoped resources is performed

This should fix the bug with later reconciliations caused by the outdated cache

@mfrancisc
Contributor

swap the deletion of the namespace & cluster-scoped resources here:

This makes sense; I'm not sure why we currently delete the namespaces first and the cluster-scoped resources only later. The other way around makes more sense.

Even if I'm not 100% sure it will solve the cache issue, since we apparently cannot know when the cache gets updated, it's still worth making this change as it's the more sensible deletion order 👍

@MatousJobanek
Contributor

if we also update the status of the NSTemplateSet by setting NSTemplateSet.Status.ClusterResources to nil (as I stated above), then it should definitely help - if the controller gets reconciled again for any reason, it won't try to re-delete the cluster-scoped resources, see:

// delete deletes all cluster-scoped resources referenced by the nstemplateset.
func (r *clusterResourcesManager) delete(ctx context.Context, nsTmplSet *toolchainv1alpha1.NSTemplateSet) error {
	if nsTmplSet.Status.ClusterResources == nil {
		return nil
	}

@mfrancisc
Contributor

if we update also the status of NSTemplateSet by setting NSTemplateSet.Status.ClusterResources to nil (as I stated above)

Ok I see now, thanks for explaining it. That should definitely help indeed, even if the cache is out of date 👍

@rsoaresd
Contributor Author

rsoaresd commented Nov 3, 2025

Thank you so much @MatousJobanek, for the explanation and suggestion! Going to address the changes

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
controllers/nstemplateset/nstemplateset_controller.go (1)

203-207: Consider wrapping the error with status update for consistency.

The cluster resource deletion error is propagated directly without updating the NSTemplateSet status to TerminatingFailed, unlike the namespace deletion error handling at line 218. This inconsistency means users may not see a clear failure status if cluster resource deletion fails.

Consider applying this pattern for consistency:

 	// delete cluster resources first
 	err := r.clusterResources.delete(ctx, nsTmplSet)
 	if err != nil {
-		return reconcile.Result{}, err
+		return reconcile.Result{}, r.status.wrapErrorWithStatusUpdate(ctx, nsTmplSet, r.status.setStatusTerminatingFailed, err, "failed to delete cluster resources")
 	}
📜 Review details

Reviewing files that changed from the base of the PR and between 4478f05 and 7ac1a8e.

📒 Files selected for processing (2)
  • controllers/nstemplateset/nstemplateset_controller.go (1 hunks)
  • controllers/nstemplateset/status.go (1 hunks)

🪛 GitHub Actions: govulncheck
controllers/nstemplateset/nstemplateset_controller.go

[error] 116-116: GO-2025-4012: Lack of limit when parsing cookies can cause memory exhaustion in net/http. Found in: net/http@go1.23.12. Fixed in: net/http@go1.24.8
🔇 Additional comments (2)
controllers/nstemplateset/status.go (1)

294-301: LGTM! Well-designed helper method.

The implementation correctly guards against unnecessary API calls when the status is already nil, and follows the established patterns in this file for status updates.

controllers/nstemplateset/nstemplateset_controller.go (1)

203-212: Excellent fix addressing the race condition!

The reordered deletion flow (cluster resources first, then status clearing, then namespaces) effectively addresses the race condition described in the PR objectives. Clearing the status immediately after cluster resource deletion prevents redundant deletion attempts during subsequent reconciliations with stale cache, as discussed in the PR comments.

Contributor

@MatousJobanek MatousJobanek left a comment

Nice, can you add/update unit tests please?

@rsoaresd
Contributor Author

rsoaresd commented Nov 3, 2025

Nice, can you add/update unit tests please?

Sure! Addressed

Contributor

@MatousJobanek MatousJobanek left a comment

Looks good, just small comments regarding the unit tests

@sonarqubecloud

sonarqubecloud bot commented Nov 4, 2025

@openshift-ci

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fbm3307, MatousJobanek, rsoaresd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [MatousJobanek,fbm3307,rsoaresd]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rsoaresd rsoaresd merged commit 660fbf4 into codeready-toolchain:master Nov 5, 2025
14 of 15 checks passed