# Race condition in taint patching can clobber concurrent taint modifications #179
## Summary
The `addTaintBySpec` and `removeTaintBySpec` functions in `node_controller.go` use `client.StrategicMergeFrom()` without optimistic locking. Because the patch is computed from a potentially stale copy of the node, concurrent taint modifications made by other controllers can be silently overwritten.
## Background
When NRC adds or removes a taint, it:
- Creates a deep copy of the node at time T1
- Modifies the taints list
- Patches the node
If another controller (e.g., Karpenter, CSI drivers, cloud-controller-manager) modifies the node's taints between T1 and the patch, NRC's patch will overwrite those changes with stale state.
## Current Code
```go
func (r *RuleReadinessController) removeTaintBySpec(ctx context.Context, node *corev1.Node, taintSpec corev1.Taint, ruleName string) error {
	patch := client.StrategicMergeFrom(node.DeepCopy()) // No optimistic locking
	var newTaints []corev1.Taint
	for _, taint := range node.Spec.Taints {
		if taint.Key != taintSpec.Key || taint.Effect != taintSpec.Effect {
			newTaints = append(newTaints, taint)
		}
	}
	node.Spec.Taints = newTaints
	if err := r.Patch(ctx, node, patch); err != nil {
		return err
	}
	// ...
}
```

## Race Condition Example
```
T1: NRC caches node taints: [karpenter.sh/unregistered, xyz.co/taint]
T2: Karpenter removes karpenter.sh/unregistered → [xyz.co/taint]
T3: NRC removes xyz.co/taint using stale T1 state
    → patches with [karpenter.sh/unregistered]  // Karpenter's removal reverted!
```
This affects any taint modified by other controllers during the window between NRC's read and write.
## Proposed Fix
Use `client.MergeFromWithOptions` with optimistic locking, as Karpenter does for both add and remove taint operations:

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/state/statenode.go#L483-L533

```go
// Karpenter's RequireNoScheduleTaint (handles both add AND remove):
stored := node.DeepCopy()
// ... taint modifications (add or remove) ...
if !equality.Semantic.DeepEqual(stored, node) {
	// We use client.MergeFromWithOptimisticLock because patching
	// a list with a JSON merge patch can cause races due to the
	// fact that it fully replaces the list on a change
	return kubeClient.Patch(ctx, node,
		client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{}))
}
```

The fix for NRC would be:

```go
// Fails with a conflict if the node changed, allowing a retry
patch := client.MergeFromWithOptions(node.DeepCopy(), client.MergeFromWithOptimisticLock{})
```

With optimistic locking, if the node's `resourceVersion` changed between read and write, the patch fails with a conflict error, allowing the controller to retry with fresh state.
## Impact
Nodes can get stuck with startup taints that should have been removed, and taint changes made by other controllers can be silently reverted.