
Add taints on health events #983

Merged
guptaNswati merged 4 commits into kubernetes-sigs:main from guptaNswati:device-taint-toleration
Apr 10, 2026

Conversation

@guptaNswati
Contributor

@guptaNswati guptaNswati commented Apr 3, 2026

Implement Option A from #905

This also introduces informational taints for non-fatal XIDs. Previously, skipped XID errors (e.g., non-fatal, application-level faults) were silently dropped by the monitor. Now they are propagated as informational taints (gpu.nvidia.com/xid with Effect: None). This improves observability by surfacing non-fatal errors on the device without affecting pod scheduling or triggering evictions.

@k8s-ci-robot k8s-ci-robot requested review from dims and klueska April 3, 2026 00:20
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 3, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch 2 times, most recently from 60de594 to fe32560 Compare April 6, 2026 23:48
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2026
@guptaNswati
Contributor Author

guptaNswati commented Apr 6, 2026

Need to double-check the logging, but it's ready for review:

test logs

I0406 23:38:10.564972       1 device_health.go:248] Processing event XID=43 event
I0406 23:38:10.565071       1 device_health.go:264] Sending XID=43 health event for device GPU-f50641a9-38d7-6df1-0bec-e661b4ca1847
W0406 23:38:10.565153       1 driver.go:456] Received xid health event for device GPU-f50641a9-38d7-6df1-0bec-e661b4ca1847


kubectl get resourceslice dra-taint-test-worker-gpu.nvidia.com-qdz7b -o yaml | grep taints: -A 5
    taints:
    - effect: None
      key: gpu.nvidia.com/xid
      timeAdded: "2026-04-06T23:38:10Z"
      value: "43"

@guptaNswati guptaNswati changed the title Draft: add taints on health events Add taints on health events Apr 6, 2026
Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
// sendHealthEventsForDevices sends a DeviceHealthEvent for every device under
// the given parent-level map. Uses non-blocking sends to protect the monitor
// goroutine from deadlocks when the channel is full.
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {
Contributor

nit:

Suggested change
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {
func (m *nvmlDeviceHealthMonitor) sendHealthEventForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
case HealthEventGPULost:
return resourceapi.DeviceTaint{
Key: TaintKeyGPULost,
Effect: resourceapi.DeviceTaintEffectNoExecute,
Contributor

Can you make this effect NoSchedule? Not sure how likely, but there may be a scenario where a GPU may temporarily fall off the bus and come back. If a workload wants to be resilient to it, we'd be able to support it with the NoSchedule effect.

Contributor

Should we make these effects configurable so admins can override the default behavior for specific error types? for e.g. choosing between evicting existing workloads vs. only preventing new pods from being scheduled?


Good idea Shiva. Configure through a configmap?

Contributor Author

@varunrsekar Hmm, we should have some real-world data here. In the case of a GPULost event, can it be recovered without any intervention? Let me ask some folks internally.

@shivamerla No, we should not make it configurable; this should be done via a DeviceTaintRule published by admins.

Contributor Author

As per the docs, the GPU-lost error is not recoverable without intervention:

NVML_ERROR_GPU_IS_LOST = 15
The GPU has fallen off the bus or has otherwise become inaccessible.

Contributor Author

Hmm, so in any error case we should not apply NoExecute, because we should not handle eviction from the dra-driver?

Contributor Author

Let me summarize the DeviceTaintRule from the KEP that does exactly what we want to override :)

A DeviceTaintRule acts like a dynamic patch. Instead of the DRA driver explicitly writing a taint into a ResourceSlice, a cluster administrator creates a DeviceTaintRule manifest.
The Kubernetes control plane reads this rule, looks for any devices in the cluster that match the rule's criteria, and automatically overlays the specified taint onto those devices.

It has two main components:
- deviceSelector: A query that defines which devices this rule applies to (e.g., "All devices from the NVIDIA driver", or "All devices currently reporting an XID 43").
- taint: The exact taint (Key, Value, Effect) to apply to the matched devices.
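A hedged sketch of what such a manifest might look like, assuming the shape described in KEP-5055. The exact API group/version, selector fields, and rule name here are illustrative and may differ, and the DRADeviceTaints feature gate must be enabled:

```yaml
# Hypothetical DeviceTaintRule: an admin escalates the driver's informational
# XID 43 taint to NoExecute for all matching NVIDIA devices.
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: evict-on-xid-43
spec:
  deviceSelector:
    driver: gpu.nvidia.com
  taint:
    key: gpu.nvidia.com/xid
    value: "43"
    effect: NoExecute
```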

These user stories are the cases we are discussing.

And I can test the override with it. The only thing to note is that DeviceTaintRule is behind an additional feature gate and is not enabled by default even in 1.36.

Contributor

Do you mean that, we'll create a static set of DeviceTaintRule CRs for each error we detect and then dynamically add devices to the appropriate rule?

Contributor

@shivamerla shivamerla Apr 7, 2026

@guptaNswati IIUC, These are two distinct mechanisms for applying device taints:

  • The DRA driver can detect device health status and apply taints directly within the ResourceSlice.
  • An admin can define a DeviceTaintRule, specifying a device selector (driver, device, or pool) along with the taints to be applied to all matching devices in the ResourceSlice.

It appears that taints defined in a DeviceTaintRule override those set by the DRA driver. Based on this, we should avoid applying a NoExecute taint at the driver level and instead allow administrators to enforce it via DeviceTaintRule if needed. We can validate this behavior by testing.

Contributor Author

yes correct.

Comment thread cmd/gpu-kubelet-plugin/allocatable.go Outdated
Comment on lines +285 to +287
// key. Within the same key, the effect can only escalate
// (None < NoSchedule < NoExecute), and the value is always updated to the
// latest. Returns true if the taint set was modified.
Contributor

Can you help me understand this comment? Why is it only allowed for the effect to escalate? Also in this PR, I don't see anywhere an existing taint being escalated. So I'm not sure why we need that validation.

Contributor Author

Good catch: I added it for handling XID events, but the logic is still off. Right now all XIDs have None, but I need to update it to differentiate between fatal and non-fatal XIDs.

Contributor Author

I removed it and refactored the code to handle non-fatal XIDs as None.

return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
}
// only proceed with config mapping if device is healthy.
if featuregates.Enabled(featuregates.NVMLDeviceHealthCheck) {
Contributor

@shivamerla shivamerla Apr 7, 2026

This assumes that NVMLDeviceHealthCheck depends on the DRADeviceTaints feature gate being enabled (which is not the default in v1.35 and earlier). We should call out this limitation explicitly.

Contributor

@shivamerla shivamerla left a comment

Looks great! Left a few comments.

Comment thread cmd/gpu-kubelet-plugin/driver.go Outdated
Pools: map[string]resourceslice.Pool{
nodeName: {Slices: []resourceslice.Slice{resourceSlice}},
},
taints := dev.Taints()
Contributor

Question: For pre-created and dynamic MIG devices, do we need to update this for every profile advertised?

Contributor Author

Yes, we would need to. This won't be very efficient for a large number of devices, though?

Contributor

dynamic MIG is btw mutually-exclusive with nvml health check. So only static MIG needs to be considered, which is a smaller number of devices.

Contributor Author

At some point they should converge and not be mutually exclusive. We should be health-checking dynamically created MIG devices as well. It can still be a separate path, though.

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
return resourceapi.DeviceTaint{
Key: TaintKeyXID,
Value: strconv.FormatUint(event.EventData, 10),
Effect: DeviceTaintEffectNone,
Contributor Author

We should differentiate between fatal and non-fatal XID errors.

For non-fatal ones, the effect can be None, but what should the effect be for fatal XIDs?

NoSchedule: Kubernetes will prevent new pods from being assigned to this broken GPU, but it will leave the existing pods alone.

NoExecute: Kubernetes will actively and immediately evict any running pods that are currently using that GPU.

Based on offline discussion,

  • we should not evict the workload as it could still have local state which will be lost if evicted
  • it might disrupt the workload negatively
  • It puts a burden on the DRA driver to handle Unprepare for the devices safely
  • It gives the DRA driver power over workload lifecycle

cc @varunrsekar

Contributor

Configurable effects will make sense here. We should consider NoExecute to be a nuke and only allow the admin to request it if they know what they're doing. So for non-fatal errors we can add a default None effect, and for fatal errors a default NoSchedule taint.

It'd be easy to say gpu-lost is a fatal error. But I'd imagine that trying to classify the xids is not easy. So I'm not sure how you'd want to classify them.

Contributor Author

We have a hardcoded list of non-fatal XIDs and an env var where the user can pass an additional list to skip, so I was thinking of making all of those None and the rest NoSchedule.
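The merge of the hardcoded list with the env var could look something like the sketch below. The variable name NVIDIA_SKIPPED_XIDS and the default list here are illustrative assumptions, not the plugin's actual names:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// buildNonFatalXids merges a built-in list of non-fatal XIDs with an
// optional comma-separated list from an env var. XIDs in the resulting set
// would get an informational None taint; all others NoSchedule.
func buildNonFatalXids(defaults []uint64, envVal string) map[uint64]bool {
	set := make(map[uint64]bool, len(defaults))
	for _, x := range defaults {
		set[x] = true
	}
	for _, f := range strings.Split(envVal, ",") {
		f = strings.TrimSpace(f)
		if f == "" {
			continue
		}
		if x, err := strconv.ParseUint(f, 10, 64); err == nil {
			set[x] = true
		}
	}
	return set
}

func main() {
	// NVIDIA_SKIPPED_XIDS is a hypothetical env var name for this sketch.
	set := buildNonFatalXids([]uint64{13, 43}, os.Getenv("NVIDIA_SKIPPED_XIDS"))
	fmt.Println(set[43]) // true: XID 43 maps to an informational None taint
}
```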

@guptaNswati
Contributor Author

With the refactor, I don't need to comment out XID 43 for testing now. It's getting added with the None effect.

taints:
    - effect: None
      key: gpu.nvidia.com/xid
      timeAdded: "2026-04-07T21:57:44Z"
      value: "43"
      
"ResourceSlice update" logger="ResourceSlice controller" slice="dra-taint-test-worker-gpu.nvidia.com-l2c2h" diff=<
	@@ -3,8 +3,8 @@
	   "name": "dra-taint-test-worker-gpu.nvidia.com-l2c2h",
	   "generateName": "dra-taint-test-worker-gpu.nvidia.com-",
	   "uid": "55b55eba-f4d0-4e32-9544-7368a57ca98d",
	-  "resourceVersion": "650",
	-  "generation": 1,
	+  "resourceVersion": "765",
	+  "generation": 2,
	   "creationTimestamp": "2026-04-07T21:56:30Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1",
	-    "time": "2026-04-07T21:56:30Z",
	+    "time": "2026-04-07T21:57:44Z",
	     "fieldsType": "FieldsV1",
	     "fieldsV1": {
	      "f:metadata": {
	@@ -94,7 +94,15 @@
	      "memory": {
	       "value": "24Gi"
	      }
	-    }
	+    },
	+    "taints": [
	+     {
	+      "key": "gpu.nvidia.com/xid",
	+      "value": "43",
	+      "effect": "None",
	+      "timeAdded": "2026-04-07T21:57:44Z"
	+     }
	+    ]
	    },
	    {
	     "name": "gpu-1",

@guptaNswati
Contributor Author

I updated XID handling and addressed the nits. PTAL.

And regarding the configurable taint effect: we are making a best-effort decision and not handling evictions, so if an admin wants to override taints, it should be done with a DeviceTaintRule. I want to hear more on why we should do it on our own with a configmap. I am not convinced right now.

Comment thread cmd/gpu-kubelet-plugin/driver.go Outdated
Pools: map[string]resourceslice.Pool{
nodeName: {Slices: []resourceslice.Slice{resourceSlice}},
},
taints := dev.Taints()
Contributor

dynamic MIG is btw mutually-exclusive with nvml health check. So only static MIG needs to be considered, which is a smaller number of devices.

Comment thread cmd/gpu-kubelet-plugin/allocatable.go Outdated
}

if changed {
d.taints[i].TimeAdded = nil // reset timestamp for the API server
Contributor

I find it difficult to wrap my head around the TimeAdded field. For correctness, should this be when the plugin determined the device is to be tainted or is it when the APIServer actually applies the taint?

In your change, you've determined the API server is the one that needs to be setting it. But why not always apply it on our end?

Contributor Author

TimeAdded is for the API server, not for the dra-driver, as the scheduler uses it for state changes and clock-drift handling for tolerations.

It's not very well explained, but see https://pkg.go.dev/k8s.io/dynamic-resource-allocation/resourceslice#DevicesDeepEqual

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
}

func (m *nvmlDeviceHealthMonitor) markAllDevicesUnhealthy() {
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForAllDevices(eventType DeviceHealthEventType) {
Contributor

As this method is consumed by the driver.deviceHealthEvents monitoring thread, we will be individually updating the ResourceSlice for every device of the GPU (e.g., full GPU + all its pre-created MIGs).

Can we instead rework it to aggregate all the devices into a single event, so that we make a single resourceslice update?

Contributor

+1

Contributor Author

@guptaNswati guptaNswati Apr 8, 2026

Hmm, need to think about it a bit, as we act on a single event notification, which is per device.

Contributor Author

@guptaNswati guptaNswati Apr 8, 2026

NVML inherently processes and fires events on a per-device basis, so instead of aggregating them before sending down the channel, we can handle the aggregation on the receiving side in the driver.deviceHealthEvents loop, before we call publishResources. We can either:

Option 1: Batch processing or full channel drain
When the main loop receives the first event, we immediately enter a non-blocking default select loop to pull any other pending events off the channel until it's empty and then publish once.

  • Pro: no latency, it publishes the moment the channel is empty.

  • Con: If the NVML monitor lags in sending events, the drain loop might finish too early, resulting in 2 API calls instead of 1.

Option 2: Timer based Batching
When the first event arrives, we start a small timer (e.g 50ms). Any subsequent events that arrive within that 50ms window update the local state but do not trigger a publish. Once the timer expires, we do a single publishResources call for the whole batch.

  • Pro: more standard Kubernetes pattern for protecting the API server.

  • Con: Introduces a tiny 50ms delay before the Kubernetes API is notified
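Option 1 above boils down to a non-blocking drain of the channel. A minimal sketch, with the Event type and channel wiring simplified stand-ins for the driver's actual health-event plumbing:

```go
package main

import "fmt"

// Event is a simplified stand-in for the driver's health-event type.
type Event struct{ Device string }

// drainPending implements the Option 1 pattern: after receiving the first
// event, pull any queued events off the channel non-blockingly so that a
// single publish covers the whole burst.
func drainPending(first Event, ch <-chan Event) []Event {
	batch := []Event{first}
	for {
		select {
		case ev := <-ch:
			batch = append(batch, ev)
		default:
			return batch // channel empty: publish once for the batch
		}
	}
}

func main() {
	ch := make(chan Event, 8)
	ch <- Event{"gpu-0-mig-0"}
	ch <- Event{"gpu-0-mig-1"}
	batch := drainPending(Event{"gpu-0"}, ch)
	fmt.Println(len(batch)) // 3: one publishResources call would follow
}
```

As noted in the con, events that arrive after the `default` case fires are picked up by the next loop iteration, costing an extra publish.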

Contributor Author

@guptaNswati guptaNswati Apr 8, 2026

@lalitadithya showed me that a burst of events can happen in real cluster environments.

Contributor

> Hmm, need to think about it a bit, as we act on a single event notification, which is per device.

Sorry, maybe I didn't word it correctly. For this review, I was only talking about the codepath where you invoke sendHealthEventForAllDevices for the GPU_LOST event, where you're sending individual events for all devices belonging to that GPU. So instead of sending one GPU_LOST event each for gpu-0, gpu-0-mig-0, gpu-0-mig-1, etc., we can handle it with a single event which includes all those devices.

Your comment on batching makes sense to me and would be a good improvement for the future.

Contributor Author

Oh I see, makes sense. Yes, I can fix that. And yes, this should be a follow-up improvement.

Contributor Author

@varunrsekar I refactored to batch multiple devices on GPU_LOST and unmonitored events. PTAL.

Test logs in mig-mode with xid 43.

I0409 23:33:27.290825       1 device_health.go:255] Processing event XID=43 event
I0409 23:33:27.290936       1 device_health.go:271] Sending XID=43 health event for device MIG-f19df5ab-c572-5712-9b1b-6abbdd67a285
W0409 23:33:27.291027       1 driver.go:477] Received xid health event for device MIG-f19df5ab-c572-5712-9b1b-6abbdd67a285

 name: gpu-0-mig-1g6gb-14-0
    taints:
    - effect: None
      key: gpu.nvidia.com/xid
      timeAdded: "2026-04-09T23:33:27Z"
      value: "43"

Contributor

Thanks for addressing this! It's a meaningful change.

@shivamerla
Contributor

> I updated XID handling and addressed the nits. PTAL.
>
> And regarding the configurable taint effect: we are making a best-effort decision and not handling evictions, so if an admin wants to override taints, it should be done with a DeviceTaintRule. I want to hear more on why we should do it on our own with a configmap. I am not convinced right now.

Yes, makes sense. We can drop that thought for now as it is clear that admins can override the taint to escalate as desired using a DeviceTaintRule.

@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 46c02e0 to 7048d97 Compare April 8, 2026 20:47
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 7048d97 to ddd1200 Compare April 8, 2026 20:51
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from ddd1200 to 46c02e0 Compare April 8, 2026 21:13
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2026
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Review: Add taints on health events

The design is sound and well-aligned with KEP-5055 Option A. Taint key schema (gpu.nvidia.com/xid, gpu.nvidia.com/gpu-lost, gpu.nvidia.com/unmonitored) follows KEP guidance of separate keys per health dimension. Keeping tainted devices in the ResourceSlice instead of removing them is the correct approach.

However, there are a few issues that need to be addressed:

Blocker

  • Merge conflicts: The PR is not mergeable against current main. The device_health.go file, AllocatableDevice struct, and health monitoring infrastructure have been significantly refactored on main. A rebase is needed.

Must-fix

  • Race condition on Taints() read — see inline comment on driver.go
  • Missing tests: AddOrUpdateTaint, healthEventToTaint, and IsEventNonFatal are all unit-testable and should have coverage

Worth discussing

  • Skipped XIDs behavioral change (now produce informational taints vs. being silently dropped)
  • No taint removal path (recovery requires driver restart)
  • Single XID taint per device — XID 48 then XID 63 overwrites the value, losing the first XID info

)

// TODO: remove later, as its not in vendored api
// DeviceTaintEffectNone is an informational effect that does not affect
Contributor

This hardcoded constant is fragile. If the upstream DeviceTaintEffectNone lands with a different string value, this silently breaks at runtime with no compile-time safety net.

Consider pinning the vendored k8s.io/api dependency to a version that includes it, or at minimum add a build-time or init-time assertion that validates the value matches upstream once the dependency is bumped. The TODO is good but should also reference a tracking issue.

Contributor Author

@guptaNswati guptaNswati Apr 9, 2026

Hmm, it's going to be the same: https://pkg.go.dev/k8s.io/kubernetes/pkg/apis/resource#DeviceTaintEffect

I created the issue: #1001

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
klog.Warningf("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
m.markAllDevicesUnhealthy()
klog.Warningf("Failed to determine uuid for event %v: %v; Tainting all devices with %s", event, ret, TaintKeyGPULost)
m.sendHealthEventsForAllDevices(HealthEventGPULost)
Contributor

Behavioral change worth calling out: The removed m.skippedXids[xid] early-return means skipped XIDs are no longer silently dropped — they now flow through as HealthEventXID events and get mapped to Effect: None taints via IsEventNonFatal.

This is a reasonable design choice (informational taints > invisible events), but it changes observable behavior: a device that previously had zero taints for skipped XIDs will now carry a gpu.nvidia.com/xid taint with Effect: None. Worth noting in the PR description or commit message so reviewers and users are aware.

Contributor Author

Good point. I will add this.

Comment thread cmd/gpu-kubelet-plugin/driver.go Outdated
if len(taints) > 0 {
d.Taints = taints
}
resourceSlice.Devices = append(resourceSlice.Devices, d)
Contributor

Race condition: dev.Taints() reads d.taints without holding the DeviceState lock, but AddDeviceTaint (called a few lines above) writes under lock. If another health event fires concurrently and modifies the slice while this loop is iterating, you get a data race.

Either:

  1. Hold the DeviceState lock for the entire resource slice construction (read-lock would suffice), or
  2. Have Taints() return a copy of the slice (return slices.Clone(d.taints)), or
  3. Build the resource slice inside AddDeviceTaint under the existing lock

Option 2 is the least invasive fix.

Contributor Author

done

if existing.Value != taint.Value {
d.taints[i].Value = taint.Value
changed = true
}
Contributor

AddOrUpdateTaint deduplicates by key, so if a device gets XID 48 then XID 63, the value overwrites to "63" and the first XID info is lost. This is consistent with Option A's one-key-per-dimension design, but should be documented — e.g., a comment here noting that only the most recent XID is retained per device.
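The overwrite semantics described here can be illustrated with a simplified version of the dedup-by-key logic; the types and helper below are illustrative stand-ins for the plugin's AllocatableDevice method and resourceapi.DeviceTaint:

```go
package main

import "fmt"

// Simplified stand-in for resourceapi.DeviceTaint.
type DeviceTaint struct{ Key, Value, Effect string }

// addOrUpdateTaint deduplicates by key: within the same key, the value and
// effect are updated in place, so only the most recent XID is retained per
// device (one key per health dimension). Returns true if the set changed.
func addOrUpdateTaint(taints []DeviceTaint, t DeviceTaint) ([]DeviceTaint, bool) {
	for i, existing := range taints {
		if existing.Key != t.Key {
			continue
		}
		changed := false
		if existing.Value != t.Value {
			taints[i].Value = t.Value
			changed = true
		}
		if existing.Effect != t.Effect {
			taints[i].Effect = t.Effect
			changed = true
		}
		return taints, changed
	}
	return append(taints, t), true
}

func main() {
	taints, _ := addOrUpdateTaint(nil, DeviceTaint{"gpu.nvidia.com/xid", "48", "None"})
	taints, changed := addOrUpdateTaint(taints, DeviceTaint{"gpu.nvidia.com/xid", "63", "None"})
	// One taint remains and the XID 48 value has been overwritten by 63.
	fmt.Println(len(taints), changed, taints[0].Value) // 1 true 63
}
```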

Contributor Author

done

// Returns true if the taint set was modified.
func (d *AllocatableDevice) AddOrUpdateTaint(taint resourceapi.DeviceTaint) bool {
for i, existing := range d.taints {
if existing.Key == taint.Key {
Contributor

Taints() returns the internal slice directly. This is read from driver.go:deviceHealthEvents without holding the lock while AddDeviceTaint writes under lock — see the race condition comment on driver.go. At minimum, return a copy:

func (d *AllocatableDevice) Taints() []resourceapi.DeviceTaint {
	return slices.Clone(d.taints)
}

IsEventNonFatal(event *DeviceHealthEvent) bool
}

type driver struct {
Contributor

Adding IsEventNonFatal to the deviceHealthMonitor interface gives it dual responsibility: monitoring events AND defining taint policy. This couples taint severity decisions to the monitor implementation.

Consider whether this policy check should live in healthEventToTaint directly (e.g., accepting the skipped XIDs map as a parameter) rather than going through the monitor interface. Not a blocker, but worth thinking about for testability — mocking the monitor just to test taint mapping is more friction than passing a simple map.

Contributor Author

I actually went back and forth on this exact decoupling when I was refactoring the skipped-XID logic. My reasoning for keeping IsEventNonFatal tied to the monitor was that all logic related to monitoring should stay in the monitor. It can, or should, do both.

If we move the policy up into healthEventToTaint, we end up leaking hardware-specific error definitions into the Kubernetes taint-mapping layer.

I agree it adds a bit of friction to mocking the interface for unit tests, but I felt the encapsulation of all health logic should be in the monitor. Not strongly opinionated about it, though.

Contributor

One alternative I can think of is that you can make that determination at the time of DeviceHealthEvent creation and make it a part of the object. We would be making the determination at the time of receiving the nvml events which makes sense to me.
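That alternative would amount to classifying fatality once, when the monitor creates the event. A minimal sketch, with illustrative field and function names (not the plugin's actual ones):

```go
package main

import "fmt"

// DeviceHealthEvent carries its fatality classification, decided at creation
// time from the monitor's skipped-XID set, so the downstream taint-mapping
// layer needs no callback into the monitor.
type DeviceHealthEvent struct {
	Type     string
	Xid      uint64
	NonFatal bool
}

// newXidEvent is a hypothetical constructor the monitor would use when it
// receives an NVML XID event.
func newXidEvent(xid uint64, skippedXids map[uint64]bool) DeviceHealthEvent {
	return DeviceHealthEvent{Type: "xid", Xid: xid, NonFatal: skippedXids[xid]}
}

func main() {
	ev := newXidEvent(43, map[uint64]bool{43: true})
	fmt.Println(ev.NonFatal) // true
}
```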

Contributor Author

Hmm, like how it was initially done. I still prefer the way it is right now: IsEventNonFatal as an explicit method on the monitor interface, which combines the two actions, whereas DeviceHealthEvent encapsulates only the event data.

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from e99bfb8 to c6bc6ed Compare April 9, 2026 21:54
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from c63ff01 to 31dfda9 Compare April 9, 2026 22:29
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2026
limitations under the License.
*/

package main
Contributor Author

Test run @ArangoGutierrez

=== RUN   TestAddOrUpdateTaint_NewTaint
--- PASS: TestAddOrUpdateTaint_NewTaint (0.00s)
=== RUN   TestAddOrUpdateTaint_DuplicateNoChange
--- PASS: TestAddOrUpdateTaint_DuplicateNoChange (0.00s)
=== RUN   TestAddOrUpdateTaint_UpdateValue
--- PASS: TestAddOrUpdateTaint_UpdateValue (0.00s)
=== RUN   TestAddOrUpdateTaint_UpdateEffect
--- PASS: TestAddOrUpdateTaint_UpdateEffect (0.00s)
=== RUN   TestAddOrUpdateTaint_DifferentKeysAppended
--- PASS: TestAddOrUpdateTaint_DifferentKeysAppended (0.00s)
=== RUN   TestAddOrUpdateTaint_TimeAddedResetOnChange
--- PASS: TestAddOrUpdateTaint_TimeAddedResetOnChange (0.00s)
=== RUN   TestHealthEventToTaint
=== RUN   TestHealthEventToTaint/fatal_XID
=== RUN   TestHealthEventToTaint/non-fatal_XID_(skipped)
=== RUN   TestHealthEventToTaint/XID_with_nil_monitor_defaults_to_fatal
=== RUN   TestHealthEventToTaint/GPU_lost
=== RUN   TestHealthEventToTaint/unmonitored
=== RUN   TestHealthEventToTaint/unknown_event_type_defaults_to_unmonitored
E0410 00:22:38.797546  141863 device_health.go:98] Unknown health event type "bogus", defaulting to unmonitored taint
--- PASS: TestHealthEventToTaint (0.00s)
    --- PASS: TestHealthEventToTaint/fatal_XID (0.00s)
    --- PASS: TestHealthEventToTaint/non-fatal_XID_(skipped) (0.00s)
    --- PASS: TestHealthEventToTaint/XID_with_nil_monitor_defaults_to_fatal (0.00s)
    --- PASS: TestHealthEventToTaint/GPU_lost (0.00s)
    --- PASS: TestHealthEventToTaint/unmonitored (0.00s)
    --- PASS: TestHealthEventToTaint/unknown_event_type_defaults_to_unmonitored (0.00s)
=== RUN   TestIsEventNonFatal
=== RUN   TestIsEventNonFatal/skipped_XID_is_non-fatal
=== RUN   TestIsEventNonFatal/non-skipped_XID_is_fatal
=== RUN   TestIsEventNonFatal/GPU_LOST_is_always_fatal
=== RUN   TestIsEventNonFatal/unmonitored_is_not_an_XID_event
--- PASS: TestIsEventNonFatal (0.00s)
    --- PASS: TestIsEventNonFatal/skipped_XID_is_non-fatal (0.00s)
    --- PASS: TestIsEventNonFatal/non-skipped_XID_is_fatal (0.00s)
    --- PASS: TestIsEventNonFatal/GPU_LOST_is_always_fatal (0.00s)
    --- PASS: TestIsEventNonFatal/unmonitored_is_not_an_XID_event (0.00s)

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
// healthEventToTaint maps a DeviceHealthEvent to the corresponding DRA
// DeviceTaint using the Option A taint key schema: one key per health
// dimension under the gpu.nvidia.com domain.
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) resourceapi.DeviceTaint {
Contributor

nit: Stylistically, I'd prefer if returned objects are always pointers to avoid unnecessary copies

Suggested change
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) resourceapi.DeviceTaint {
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) *resourceapi.DeviceTaint {

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
Comment on lines 196 to 197
}
if ret != nvml.SUCCESS {
Contributor

old nit: what's the expectation here? It seems like we log 2 warnings if ret is nvml.ERROR_NOT_SUPPORTED

Contributor Author

fixed.

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
}

// collectDevices flattens a GI→CI device map into a slice.
func collectDevices(giMap map[uint32]map[uint32]*AllocatableDevice) []*AllocatableDevice {
Contributor

nit: Can you pick a more descriptive name for this function? Something like: flattenMIGDeviceMap()

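The flattening helper under discussion is small enough to sketch in full; here it is under the name suggested in the nit, with AllocatableDevice reduced to a minimal stand-in for the driver's real type:

```go
package main

import "fmt"

// AllocatableDevice is a minimal stand-in for the driver's device type.
type AllocatableDevice struct {
	UUID string
}

// flattenMIGDeviceMap collects every device from a GI→CI nested map
// into a flat slice (the more descriptive name suggested in review).
func flattenMIGDeviceMap(giMap map[uint32]map[uint32]*AllocatableDevice) []*AllocatableDevice {
	var devices []*AllocatableDevice
	for _, ciMap := range giMap {
		for _, d := range ciMap {
			devices = append(devices, d)
		}
	}
	return devices
}

func main() {
	giMap := map[uint32]map[uint32]*AllocatableDevice{
		0: {0: {UUID: "MIG-a"}, 1: {UUID: "MIG-b"}},
		1: {0: {UUID: "MIG-c"}},
	}
	fmt.Println(len(flattenMIGDeviceMap(giMap)))
}
```

Note that map iteration order in Go is unspecified, so callers should not rely on the ordering of the returned slice.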
Comment thread cmd/gpu-kubelet-plugin/deviceinfo.go Outdated
giProfileInfo *nvml.GpuInstanceProfileInfo
ciProfileInfo *nvml.ComputeInstanceProfileInfo
health HealthStatus
pcieBusID string
Contributor

The errors here seem to be from a botched rebase. Can you fix this?

IsEventNonFatal(event *DeviceHealthEvent) bool
}

type driver struct {
Contributor

One alternative I can think of is that you can make that determination at the time of DeviceHealthEvent creation and make it a part of the object. We would be making the determination at the time of receiving the nvml events which makes sense to me.

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
}

func (m *nvmlDeviceHealthMonitor) markAllDevicesUnhealthy() {
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForAllDevices(eventType DeviceHealthEventType) {
Contributor

Thanks for addressing this! Its a meaningful change.

@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 314bc2e to 9598cd4 Compare April 10, 2026 17:16
Contributor

@varunrsekar varunrsekar left a comment

Overall LGTM. Some minor comments..

Comment thread cmd/gpu-kubelet-plugin/allocatable.go Outdated
}

// Key doesn't exist yet, append the new taint
d.taints = append(d.taints, t)
Contributor

nit: you can directly append *taint here. No reason to create a copy beforehand.

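The append-or-update logic discussed here can be sketched as below. This is an illustrative reconstruction, not the PR's exact code: DeviceTaint is a local stand-in, and the behavior shown (one taint per key, TimeAdded reset only on change, direct append of a new key) follows the test names and review nits in this thread. It also makes the "worth discussing" overwrite behavior concrete: XID 48 followed by XID 63 leaves only 63.

```go
package main

import (
	"fmt"
	"time"
)

// DeviceTaint is a local stand-in for the resource.k8s.io type.
type DeviceTaint struct {
	Key, Value, Effect string
	TimeAdded          time.Time
}

// addOrUpdateTaint keeps at most one taint per key: an existing key is
// updated in place (resetting TimeAdded only when value or effect
// changed), and a new key is appended directly, with no copy beforehand.
func addOrUpdateTaint(taints []DeviceTaint, taint DeviceTaint) []DeviceTaint {
	for i := range taints {
		if taints[i].Key == taint.Key {
			if taints[i].Value != taint.Value || taints[i].Effect != taint.Effect {
				taint.TimeAdded = time.Now()
				taints[i] = taint
			}
			return taints
		}
	}
	return append(taints, taint)
}

func main() {
	var ts []DeviceTaint
	ts = addOrUpdateTaint(ts, DeviceTaint{Key: "gpu.nvidia.com/xid", Value: "48", Effect: "NoSchedule"})
	ts = addOrUpdateTaint(ts, DeviceTaint{Key: "gpu.nvidia.com/xid", Value: "63", Effect: "NoSchedule"})
	fmt.Println(len(ts), ts[0].Value) // the second XID overwrites the first
}
```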
Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
// healthEventToTaint maps a DeviceHealthEvent to the corresponding DRA
// DeviceTaint using the Option A taint key schema: one key per health
// dimension under the gpu.nvidia.com domain.
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) *resourceapi.DeviceTaint {
Contributor

nit: can you swap ordering of the args here? Helpers need to be before inputs.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: guptaNswati, varunrsekar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>
…updates

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 8dfe9cf to 7e8685e Compare April 10, 2026 22:19
@guptaNswati
Contributor Author

Retested and vetted. Merging.

@guptaNswati guptaNswati merged commit 47888a0 into kubernetes-sigs:main Apr 10, 2026
9 of 10 checks passed
@ArangoGutierrez
Contributor

@guptaNswati This is no longer an NVIDIA-owned repo, so why are you manually merging PRs instead of using the Prow LGTM flow? I haven't reviewed the fixes for my comments. Please don't do this again.

@guptaNswati
Contributor Author

@guptaNswati This is no longer an NVIDIA-owned repo, why are you manually merging PR's. and not using PROW lgtm? I haven't reviewed the fixes of my comments. Please don't do this again

Yes, I realized I should not have done that, whether it's NVIDIA-owned or not. I should have waited for your review and approval, as well as an LGTM label from one of the maintainers. It won't happen again.

Please review, and if there are any unaddressed concerns, we can either revert the PR or do a follow-up fix.

In the repo, we need to add PR merge safeguards like mandatory approvals and LGTM labels.

@klueska
Contributor

klueska commented Apr 11, 2026

I haven't looked deeply at this but the high level comments I have on the direction this should take:

  1. All new taint code should be feature gated

  2. A (separate) helm option should be added to turn off adding taints to devices so that external components can take full control of managing the health state of devices if they want (e.g. NVSentinel).

Maybe this is already being done, but just wanted to make sure these points are taken into consideration.

@guptaNswati
Contributor Author

I haven't looked deeply at this but the high level comments I have on the direction this should take:

  1. All new taint code should be feature gated
  2. A (separate) helm option should be added to turn off adding taints to devices so that external components can take full control of managing the health state of devices if they want (e.g. NVSentinel).

Maybe this is already being done, but just wanted to make sure these points are taken into consideration.

Oh, why a new feature gate for tainting? Right now, I replaced the device removal logic with device tainting in the ResourceSlice, and it's all behind the NVMLDeviceHealthCheck feature gate.

For 2, external components can override the existing taints with a DeviceTaintRule. The current implementation is more of a built-in or default health check; when there are external agents, they can overwrite the default. I will do that later as part of the device-api integration.

You also shared a similar concern on the design, and this is what I had in mind: #905 (comment)

@klueska
Contributor

klueska commented Apr 14, 2026

  1. That's fine if you are repurposing the existing feature gate for this -- so long as there is a feature gate protecting it.
  2. At some point the feature gate from (1) will be removed when this feature is stable, and we will still want a way to completely shut off any device health checking in this component and leave it up to an external component.

@guptaNswati
Contributor Author

Review: Add taints on health events

The design is sound and well-aligned with KEP-5055 Option A. Taint key schema (gpu.nvidia.com/xid, gpu.nvidia.com/gpu-lost, gpu.nvidia.com/unmonitored) follows KEP guidance of separate keys per health dimension. Keeping tainted devices in the ResourceSlice instead of removing them is the correct approach.

However, there are a few issues that need to be addressed:

Blocker

  • Merge conflicts: The PR is not mergeable against current main. The device_health.go file, AllocatableDevice struct, and health monitoring infrastructure have been significantly refactored on main. A rebase is needed.

Must-fix

  • Race condition on Taints() read — see inline comment on driver.go
  • Missing tests: AddOrUpdateTaint, healthEventToTaint, and IsEventNonFatal are all unit-testable and should have coverage

Worth discussing

  • Skipped XIDs behavioral change (now produce informational taints vs. being silently dropped)
  • No taint removal path (recovery requires driver restart)
  • Single XID taint per device — XID 48 then XID 63 overwrites the value, losing the first XID info

Keeping tainted devices in the ResourceSlice instead of removing them is the correct approach.
Yes, in the current implementation most of the previous logic is the same: device removal was replaced with device tainting, and the earlier skipped XIDs are now propagated as informational taints.

  • PR was rebased

  • Race condition was fixed with a slice as you suggested

  • added tests

  • Yes we still need the recovery path as noted here DeviceHealthCheck: Alpha to GA #904 (comment)

  • Whether this style of overwriting XIDs is valid needs more research, as I did not find much in the docs on how NVML handles multiple XID events. Is the last or the first XID the more meaningful one, or should we not overwrite at all? I will reach out to the NVML team.

@ArangoGutierrez If you still find something unaddressed, please let me know.

@guptaNswati
Contributor Author

  1. Thats fine if you are repurposing the existing feature gate for this -- so long as there is a feature gate protecting it.
  2. At some point the feature gate from (1) will be removed when this feature is stable, and we will still want a way to completely shut off doing any sort of device health checking from this component and leave it up to an external component

Yes, correct. That's option 3 from the link above: a fully decoupled monitoring option where we don't use the default dra-driver health check. I will do this as part of the Beta promotion and integration with device-api, where we may auto-discover or add a flag for an external health-check endpoint.


Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

DeviceHealthCheck: Design Proposal for Taint Key/Value Schema

7 participants