
Add taints on health events #983

Merged
guptaNswati merged 4 commits into kubernetes-sigs:main from guptaNswati:device-taint-toleration
Apr 10, 2026

Conversation

@guptaNswati
Contributor

@guptaNswati guptaNswati commented Apr 3, 2026

Implement Option A from #905

This also introduces informational taints for non-fatal XIDs. Previously, skipped XID errors (e.g., non-fatal, application-level faults) were silently dropped by the monitor. Now they are propagated as informational taints (gpu.nvidia.com/xid with Effect: None). This improves observability by surfacing non-fatal errors on the device without affecting pod scheduling or triggering evictions.

@k8s-ci-robot k8s-ci-robot requested review from dims and klueska April 3, 2026 00:20
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 3, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch 2 times, most recently from 60de594 to fe32560 Compare April 6, 2026 23:48
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2026
@guptaNswati
Contributor Author

guptaNswati commented Apr 6, 2026

Need to double-check the logging, but it's ready for review:

test logs

I0406 23:38:10.564972       1 device_health.go:248] Processing event XID=43 event
I0406 23:38:10.565071       1 device_health.go:264] Sending XID=43 health event for device GPU-f50641a9-38d7-6df1-0bec-e661b4ca1847
W0406 23:38:10.565153       1 driver.go:456] Received xid health event for device GPU-f50641a9-38d7-6df1-0bec-e661b4ca1847


kubectl get resourceslice dra-taint-test-worker-gpu.nvidia.com-qdz7b -o yaml | grep taints: -A 5
    taints:
    - effect: None
      key: gpu.nvidia.com/xid
      timeAdded: "2026-04-06T23:38:10Z"
      value: "43"

@guptaNswati guptaNswati changed the title Draft: add taints on health events Add taints on health events Apr 6, 2026
Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
// sendHealthEventsForDevices sends a DeviceHealthEvent for every device under
// the given parent-level map. Uses non-blocking sends to protect the monitor
// goroutine from deadlocks when the channel is full.
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {
Contributor

nit:

Suggested change
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {
func (m *nvmlDeviceHealthMonitor) sendHealthEventForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
case HealthEventGPULost:
return resourceapi.DeviceTaint{
Key: TaintKeyGPULost,
Effect: resourceapi.DeviceTaintEffectNoExecute,
Contributor

Can you make this effect NoSchedule? Not sure how likely, but there may be a scenario where a GPU may temporarily fall off the bus and come back. If a workload wants to be resilient to it, we'd be able to support it with the NoSchedule effect.

Contributor

Should we make these effects configurable so admins can override the default behavior for specific error types? for e.g. choosing between evicting existing workloads vs. only preventing new pods from being scheduled?


Good idea Shiva. Configure through a configmap?

Contributor Author

@varunrsekar Hmm, we should have some real-world data here. In the case of a GPULost event, can it be recovered without any intervention? Let me ask some folks internally.

@shivamerla No, we should not make it configurable; this should be done via a DeviceTaintRule published by admins.

Contributor Author

As per the docs, the GPU-lost error is not recoverable without intervention:

NVML_ERROR_GPU_IS_LOST = 15
The GPU has fallen off the bus or has otherwise become inaccessible.

Contributor Author

Hmm, so in any error case we should not apply NoExecute, because we should not handle eviction from the dra-driver?

Contributor Author

Let me summarize the DeviceTaintRule from the KEP that does exactly what we want to override :)

A DeviceTaintRule acts like a dynamic patch. Instead of the DRA driver explicitly writing a taint into a ResourceSlice, a cluster administrator creates a DeviceTaintRule manifest.
The Kubernetes control plane reads this rule, looks for any devices in the cluster that match the rule's criteria, and automatically overlays the specified taint onto those devices.

It has two main components:
- deviceSelector: A query that defines which devices this rule applies to (e.g., "All devices from the NVIDIA driver", or "All devices currently reporting an XID 43").
- taint: The exact taint (Key, Value, Effect) to apply to the matched devices.
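A hedged sketch of what such a manifest might look like, assuming the shape described in KEP-5055. The exact API group/version, selector fields, and rule name here are illustrative and may differ, and the DRADeviceTaints feature gate must be enabled:

```yaml
# Hypothetical DeviceTaintRule: an admin escalates the driver's informational
# XID 43 taint to NoExecute for all matching NVIDIA devices.
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: evict-on-xid-43
spec:
  deviceSelector:
    driver: gpu.nvidia.com
  taint:
    key: gpu.nvidia.com/xid
    value: "43"
    effect: NoExecute
```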

These user stories are the cases we are discussing.

And I can test the override with it. The only thing to note is that DeviceTaintRule is behind an additional feature gate and is not enabled by default even in 1.36.

Contributor

Do you mean that, we'll create a static set of DeviceTaintRule CRs for each error we detect and then dynamically add devices to the appropriate rule?

Contributor

@shivamerla shivamerla Apr 7, 2026

@guptaNswati IIUC, These are two distinct mechanisms for applying device taints:

  • The DRA driver can detect device health status and apply taints directly within the ResourceSlice.
  • An admin can define a DeviceTaintRule, specifying a device selector (driver, device, or pool) along with the taints to be applied to all matching devices in the ResourceSlice.

It appears that taints defined in a DeviceTaintRule override those set by the DRA driver. Based on this, we should avoid applying a NoExecute taint at the driver level and instead allow administrators to enforce it via DeviceTaintRule if needed. We can validate this behavior by testing.

Contributor Author

yes correct.

Comment thread cmd/gpu-kubelet-plugin/allocatable.go Outdated
Comment on lines +285 to +287
// key. Within the same key, the effect can only escalate
// (None < NoSchedule < NoExecute), and the value is always updated to the
// latest. Returns true if the taint set was modified.
Contributor

Can you help me understand this comment? Why is it only allowed for the effect to escalate? Also in this PR, I don't see anywhere an existing taint being escalated. So I'm not sure why we need that validation.

Contributor Author

Good catch: I added it for handling XID events, but the logic is still off. Right now all XIDs have None, but I need to update it to differentiate between fatal and non-fatal XIDs.

Contributor Author

I removed it and refactored the code to handle non-fatal XIDs as None.

return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
}
// only proceed with config mapping if device is healthy.
if featuregates.Enabled(featuregates.NVMLDeviceHealthCheck) {
Contributor

@shivamerla shivamerla Apr 7, 2026

This assumes that NVMLDeviceHealthCheck depends on the DRADeviceTaints feature gate being enabled (which is not the default in v1.35 and earlier). We should call out this limitation explicitly.

Contributor

@shivamerla shivamerla left a comment

Looks great! Left a few comments.

Comment thread cmd/gpu-kubelet-plugin/driver.go Outdated
Pools: map[string]resourceslice.Pool{
nodeName: {Slices: []resourceslice.Slice{resourceSlice}},
},
taints := dev.Taints()
Contributor

Question: For pre-created and dynamic MIG devices, do we need to update this for every profile advertised?

Contributor Author

Yes, we would need to. This won't be very efficient for a large number of devices, though?

Contributor

dynamic MIG is btw mutually-exclusive with nvml health check. So only static MIG needs to be considered, which is a smaller number of devices.

Contributor Author

At some point they should converge and not be mutually exclusive. We should be health-checking dynamically created MIG devices as well. It can still be a separate path, though.

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
return resourceapi.DeviceTaint{
Key: TaintKeyXID,
Value: strconv.FormatUint(event.EventData, 10),
Effect: DeviceTaintEffectNone,
Contributor Author

We should differentiate between fatal and non-fatal XID errors.

For non-fatal ones, the effect can be None, but what should the effect be for fatal XIDs?

NoSchedule: Kubernetes will prevent new pods from being assigned to this broken GPU, but it will leave the existing pods alone.

NoExecute: Kubernetes will actively and immediately evict any running pods that are currently using that GPU.

Based on offline discussion,

  • we should not evict the workload as it could still have local state which will be lost if evicted
  • it might disrupt the workload negatively
  • It puts a burden on the DRA driver to handle Unprepare for the devices safely
  • It gives the DRA driver power over workload lifecycle

cc @varunrsekar

Contributor

Configurable effects will make sense here. We should consider NoExecute to be a nuke and only allow the admin to request it if they know what they're doing. So for non-fatal errors we can add a default None effect, and for fatal errors a default NoSchedule taint.

It'd be easy to say gpu-lost is a fatal error. But I'd imagine that trying to classify the xids is not easy. So I'm not sure how you'd want to classify them.

Contributor Author

We have a hardcoded list of non-fatal XIDs and an env var where the user can pass an additional list to skip, so I was thinking of making all of those None and the rest NoSchedule.
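The merge of the hardcoded list with the env var could look something like the sketch below. The variable name NVIDIA_SKIPPED_XIDS and the default list here are illustrative assumptions, not the plugin's actual names:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// buildNonFatalXids merges a built-in list of non-fatal XIDs with an
// optional comma-separated list from an env var. XIDs in the resulting set
// would get an informational None taint; all others NoSchedule.
func buildNonFatalXids(defaults []uint64, envVal string) map[uint64]bool {
	set := make(map[uint64]bool, len(defaults))
	for _, x := range defaults {
		set[x] = true
	}
	for _, f := range strings.Split(envVal, ",") {
		f = strings.TrimSpace(f)
		if f == "" {
			continue
		}
		if x, err := strconv.ParseUint(f, 10, 64); err == nil {
			set[x] = true
		}
	}
	return set
}

func main() {
	// NVIDIA_SKIPPED_XIDS is a hypothetical env var name for this sketch.
	set := buildNonFatalXids([]uint64{13, 43}, os.Getenv("NVIDIA_SKIPPED_XIDS"))
	fmt.Println(set[43]) // true: XID 43 maps to an informational None taint
}
```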

@guptaNswati
Contributor Author

With the refactor, I don't need to comment out XID 43 for testing now. It's getting added with the None effect.

taints:
    - effect: None
      key: gpu.nvidia.com/xid
      timeAdded: "2026-04-07T21:57:44Z"
      value: "43"
      
"ResourceSlice update" logger="ResourceSlice controller" slice="dra-taint-test-worker-gpu.nvidia.com-l2c2h" diff=<
	@@ -3,8 +3,8 @@
	   "name": "dra-taint-test-worker-gpu.nvidia.com-l2c2h",
	   "generateName": "dra-taint-test-worker-gpu.nvidia.com-",
	   "uid": "55b55eba-f4d0-4e32-9544-7368a57ca98d",
	-  "resourceVersion": "650",
	-  "generation": 1,
	+  "resourceVersion": "765",
	+  "generation": 2,
	   "creationTimestamp": "2026-04-07T21:56:30Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1",
	-    "time": "2026-04-07T21:56:30Z",
	+    "time": "2026-04-07T21:57:44Z",
	     "fieldsType": "FieldsV1",
	     "fieldsV1": {
	      "f:metadata": {
	@@ -94,7 +94,15 @@
	      "memory": {
	       "value": "24Gi"
	      }
	-    }
	+    },
	+    "taints": [
	+     {
	+      "key": "gpu.nvidia.com/xid",
	+      "value": "43",
	+      "effect": "None",
	+      "timeAdded": "2026-04-07T21:57:44Z"
	+     }
	+    ]
	    },
	    {
	     "name": "gpu-1",

@guptaNswati
Contributor Author

I updated XID handling and addressed the nits. PTAL.

And regarding the configurable taint effect: we are making a best-effort decision and not handling evictions, so if an admin wants to override taints, it should be done with a DeviceTaintRule. I want to hear more on why we should do it on our own with a configmap. I am not convinced right now.

Comment thread cmd/gpu-kubelet-plugin/driver.go Outdated
Pools: map[string]resourceslice.Pool{
nodeName: {Slices: []resourceslice.Slice{resourceSlice}},
},
taints := dev.Taints()
Contributor

dynamic MIG is btw mutually-exclusive with nvml health check. So only static MIG needs to be considered, which is a smaller number of devices.

Comment thread cmd/gpu-kubelet-plugin/allocatable.go Outdated
}

if changed {
d.taints[i].TimeAdded = nil // reset timestamp for the API server
Contributor

I find it difficult to wrap my head around the TimeAdded field. For correctness, should this be when the plugin determined the device is to be tainted or is it when the APIServer actually applies the taint?

In your change, you've determined the API server is the one that needs to be setting it. But why not always apply it on our end?

Contributor Author

TimeAdded is for the API server, not for the dra-driver, as the scheduler uses it for state changes and clock-drift handling for tolerations.

It's not very well explained, but see https://pkg.go.dev/k8s.io/dynamic-resource-allocation/resourceslice#DevicesDeepEqual

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
}

func (m *nvmlDeviceHealthMonitor) markAllDevicesUnhealthy() {
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForAllDevices(eventType DeviceHealthEventType) {
Contributor

As this method is consumed by the driver.deviceHealthEvents monitoring thread, we will be individually updating the ResourceSlice for every device of the GPU (e.g., full GPU + all its pre-created MIGs).

Can we instead rework it to aggregate all the devices into a single event, so that we make a single resourceslice update?

Contributor

+1

Contributor Author

@guptaNswati guptaNswati Apr 8, 2026

Hmm, need to think about it a bit, as we act on a single event notification, which is per device.

Contributor Author

@guptaNswati guptaNswati Apr 8, 2026

NVML inherently processes and fires events on a per-device basis, so instead of aggregating them before sending down the channel, we can handle the aggregation on the receiving side in the driver.deviceHealthEvents loop, before we call publishResources. We can either:

Option 1: Batch processing or full channel drain
When the main loop receives the first event, we immediately enter a non-blocking default select loop to pull any other pending events off the channel until it's empty and then publish once.

  • Pro: no latency, it publishes the moment the channel is empty.

  • Con: If the NVML monitor lags in sending events, the drain loop might finish too early, resulting in 2 API calls instead of 1.

Option 2: Timer based Batching
When the first event arrives, we start a small timer (e.g 50ms). Any subsequent events that arrive within that 50ms window update the local state but do not trigger a publish. Once the timer expires, we do a single publishResources call for the whole batch.

  • Pro: more standard Kubernetes pattern for protecting the API server.

  • Con: Introduces a tiny 50ms delay before the Kubernetes API is notified
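Option 1 above boils down to a non-blocking drain of the channel. A minimal sketch, with the Event type and channel wiring simplified stand-ins for the driver's actual health-event plumbing:

```go
package main

import "fmt"

// Event is a simplified stand-in for the driver's health-event type.
type Event struct{ Device string }

// drainPending implements the Option 1 pattern: after receiving the first
// event, pull any queued events off the channel non-blockingly so that a
// single publish covers the whole burst.
func drainPending(first Event, ch <-chan Event) []Event {
	batch := []Event{first}
	for {
		select {
		case ev := <-ch:
			batch = append(batch, ev)
		default:
			return batch // channel empty: publish once for the batch
		}
	}
}

func main() {
	ch := make(chan Event, 8)
	ch <- Event{"gpu-0-mig-0"}
	ch <- Event{"gpu-0-mig-1"}
	batch := drainPending(Event{"gpu-0"}, ch)
	fmt.Println(len(batch)) // 3: one publishResources call would follow
}
```

As noted in the con, events that arrive after the `default` case fires are picked up by the next loop iteration, costing an extra publish.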

Contributor Author

@guptaNswati guptaNswati Apr 8, 2026

@lalitadithya showed me that a burst of events can happen in real cluster environments.

Contributor

> Hmm, need to think about it a bit, as we act on a single event notification, which is per device.

Sorry, maybe I didn't word it correctly. For this review, I was only talking about the codepath where you invoke sendHealthEventForAllDevices for the GPU_LOST event, where you're sending individual events for all devices belonging to that GPU. So instead of sending one GPU_LOST event each for gpu-0, gpu-0-mig-0, gpu-0-mig-1, etc., we can handle it with a single event which includes all those devices.

Your comment on batching makes sense to me and would be a good improvement for the future.

Contributor Author

Oh I see, makes sense. Yes, I can fix that. And yes, this should be a follow-up improvement.

Contributor Author

@varunrsekar I refactored to batch multiple devices on GPU_LOST and unmonitored events. PTAL.

Test logs in mig-mode with xid 43.

I0409 23:33:27.290825       1 device_health.go:255] Processing event XID=43 event
I0409 23:33:27.290936       1 device_health.go:271] Sending XID=43 health event for device MIG-f19df5ab-c572-5712-9b1b-6abbdd67a285
W0409 23:33:27.291027       1 driver.go:477] Received xid health event for device MIG-f19df5ab-c572-5712-9b1b-6abbdd67a285

 name: gpu-0-mig-1g6gb-14-0
    taints:
    - effect: None
      key: gpu.nvidia.com/xid
      timeAdded: "2026-04-09T23:33:27Z"
      value: "43"

Contributor

Thanks for addressing this! It's a meaningful change.

@shivamerla
Contributor

> I updated XID handling and addressed the nits. PTAL.
>
> And regarding the configurable taint effect: we are making a best-effort decision and not handling evictions, so if an admin wants to override taints, it should be done with a DeviceTaintRule. I want to hear more on why we should do it on our own with a configmap. I am not convinced right now.

Yes, makes sense. We can drop that thought for now as it is clear that admins can override the taint to escalate as desired using a DeviceTaintRule.

@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 46c02e0 to 7048d97 Compare April 8, 2026 20:47
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 7048d97 to ddd1200 Compare April 8, 2026 20:51
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from ddd1200 to 46c02e0 Compare April 8, 2026 21:13
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2026
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Review: Add taints on health events

The design is sound and well-aligned with KEP-5055 Option A. Taint key schema (gpu.nvidia.com/xid, gpu.nvidia.com/gpu-lost, gpu.nvidia.com/unmonitored) follows KEP guidance of separate keys per health dimension. Keeping tainted devices in the ResourceSlice instead of removing them is the correct approach.

However, there are a few issues that need to be addressed:

Blocker

  • Merge conflicts: The PR is not mergeable against current main. The device_health.go file, AllocatableDevice struct, and health monitoring infrastructure have been significantly refactored on main. A rebase is needed.

Must-fix

  • Race condition on Taints() read — see inline comment on driver.go
  • Missing tests: AddOrUpdateTaint, healthEventToTaint, and IsEventNonFatal are all unit-testable and should have coverage

Worth discussing

  • Skipped XIDs behavioral change (now produce informational taints vs. being silently dropped)
  • No taint removal path (recovery requires driver restart)
  • Single XID taint per device — XID 48 then XID 63 overwrites the value, losing the first XID info

)

// TODO: remove later, as its not in vendored api
// DeviceTaintEffectNone is an informational effect that does not affect
Contributor

This hardcoded constant is fragile. If the upstream DeviceTaintEffectNone lands with a different string value, this silently breaks at runtime with no compile-time safety net.

Consider pinning the vendored k8s.io/api dependency to a version that includes it, or at minimum add a build-time or init-time assertion that validates the value matches upstream once the dependency is bumped. The TODO is good but should also reference a tracking issue.

Contributor Author

@guptaNswati guptaNswati Apr 9, 2026

Hmm, it's going to be the same: https://pkg.go.dev/k8s.io/kubernetes/pkg/apis/resource#DeviceTaintEffect

I created the issue: #1001

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
klog.Warningf("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
m.markAllDevicesUnhealthy()
klog.Warningf("Failed to determine uuid for event %v: %v; Tainting all devices with %s", event, ret, TaintKeyGPULost)
m.sendHealthEventsForAllDevices(HealthEventGPULost)
Contributor

Behavioral change worth calling out: The removed m.skippedXids[xid] early-return means skipped XIDs are no longer silently dropped — they now flow through as HealthEventXID events and get mapped to Effect: None taints via IsEventNonFatal.

This is a reasonable design choice (informational taints > invisible events), but it changes observable behavior: a device that previously had zero taints for skipped XIDs will now carry a gpu.nvidia.com/xid taint with Effect: None. Worth noting in the PR description or commit message so reviewers and users are aware.

Contributor Author

Good point. I will add this.

Comment thread cmd/gpu-kubelet-plugin/driver.go Outdated
if len(taints) > 0 {
d.Taints = taints
}
resourceSlice.Devices = append(resourceSlice.Devices, d)
Contributor

Race condition: dev.Taints() reads d.taints without holding the DeviceState lock, but AddDeviceTaint (called a few lines above) writes under lock. If another health event fires concurrently and modifies the slice while this loop is iterating, you get a data race.

Either:

  1. Hold the DeviceState lock for the entire resource slice construction (read-lock would suffice), or
  2. Have Taints() return a copy of the slice (return slices.Clone(d.taints)), or
  3. Build the resource slice inside AddDeviceTaint under the existing lock

Option 2 is the least invasive fix.

Contributor Author

done

if existing.Value != taint.Value {
d.taints[i].Value = taint.Value
changed = true
}
Contributor

AddOrUpdateTaint deduplicates by key, so if a device gets XID 48 then XID 63, the value overwrites to "63" and the first XID info is lost. This is consistent with Option A's one-key-per-dimension design, but should be documented — e.g., a comment here noting that only the most recent XID is retained per device.
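The overwrite semantics described here can be illustrated with a simplified version of the dedup-by-key logic; the types and helper below are illustrative stand-ins for the plugin's AllocatableDevice method and resourceapi.DeviceTaint:

```go
package main

import "fmt"

// Simplified stand-in for resourceapi.DeviceTaint.
type DeviceTaint struct{ Key, Value, Effect string }

// addOrUpdateTaint deduplicates by key: within the same key, the value and
// effect are updated in place, so only the most recent XID is retained per
// device (one key per health dimension). Returns true if the set changed.
func addOrUpdateTaint(taints []DeviceTaint, t DeviceTaint) ([]DeviceTaint, bool) {
	for i, existing := range taints {
		if existing.Key != t.Key {
			continue
		}
		changed := false
		if existing.Value != t.Value {
			taints[i].Value = t.Value
			changed = true
		}
		if existing.Effect != t.Effect {
			taints[i].Effect = t.Effect
			changed = true
		}
		return taints, changed
	}
	return append(taints, t), true
}

func main() {
	taints, _ := addOrUpdateTaint(nil, DeviceTaint{"gpu.nvidia.com/xid", "48", "None"})
	taints, changed := addOrUpdateTaint(taints, DeviceTaint{"gpu.nvidia.com/xid", "63", "None"})
	// One taint remains and the XID 48 value has been overwritten by 63.
	fmt.Println(len(taints), changed, taints[0].Value) // 1 true 63
}
```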

Contributor Author

done

// Returns true if the taint set was modified.
func (d *AllocatableDevice) AddOrUpdateTaint(taint resourceapi.DeviceTaint) bool {
for i, existing := range d.taints {
if existing.Key == taint.Key {
Contributor

Taints() returns the internal slice directly. This is read from driver.go:deviceHealthEvents without holding the lock while AddDeviceTaint writes under lock — see the race condition comment on driver.go. At minimum, return a copy:

func (d *AllocatableDevice) Taints() []resourceapi.DeviceTaint {
	return slices.Clone(d.taints)
}

IsEventNonFatal(event *DeviceHealthEvent) bool
}

type driver struct {
Contributor

Adding IsEventNonFatal to the deviceHealthMonitor interface gives it dual responsibility: monitoring events AND defining taint policy. This couples taint severity decisions to the monitor implementation.

Consider whether this policy check should live in healthEventToTaint directly (e.g., accepting the skipped XIDs map as a parameter) rather than going through the monitor interface. Not a blocker, but worth thinking about for testability — mocking the monitor just to test taint mapping is more friction than passing a simple map.

Contributor Author

I actually went back and forth on this exact decoupling when I was refactoring the skipped-XID logic. My reasoning for keeping IsEventNonFatal tied to the monitor was that all logic related to monitoring should stay in the monitor. It can, or should, do both.

If we move the policy up into healthEventToTaint, we end up leaking hardware-specific error definitions into the Kubernetes taint-mapping layer.

I agree it adds a bit of friction to mocking the interface for unit tests, but I felt the encapsulation of all health logic should be in the monitor. Not strongly opinionated about it, though.

Contributor

One alternative I can think of is that you can make that determination at the time of DeviceHealthEvent creation and make it a part of the object. We would be making the determination at the time of receiving the nvml events which makes sense to me.
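That alternative would amount to classifying fatality once, when the monitor creates the event. A minimal sketch, with illustrative field and function names (not the plugin's actual ones):

```go
package main

import "fmt"

// DeviceHealthEvent carries its fatality classification, decided at creation
// time from the monitor's skipped-XID set, so the downstream taint-mapping
// layer needs no callback into the monitor.
type DeviceHealthEvent struct {
	Type     string
	Xid      uint64
	NonFatal bool
}

// newXidEvent is a hypothetical constructor the monitor would use when it
// receives an NVML XID event.
func newXidEvent(xid uint64, skippedXids map[uint64]bool) DeviceHealthEvent {
	return DeviceHealthEvent{Type: "xid", Xid: xid, NonFatal: skippedXids[xid]}
}

func main() {
	ev := newXidEvent(43, map[uint64]bool{43: true})
	fmt.Println(ev.NonFatal) // true
}
```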

Contributor Author

Hmm, like how it was initially done. I still prefer the way it is right now: IsEventNonFatal as an explicit method on the monitor interface, which combines the two actions, whereas DeviceHealthEvent encapsulates only the event data.

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from e99bfb8 to c6bc6ed Compare April 9, 2026 21:54
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2026
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from c63ff01 to 31dfda9 Compare April 9, 2026 22:29
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2026
limitations under the License.
*/

package main
Contributor Author

Test run @ArangoGutierrez

=== RUN   TestAddOrUpdateTaint_NewTaint
--- PASS: TestAddOrUpdateTaint_NewTaint (0.00s)
=== RUN   TestAddOrUpdateTaint_DuplicateNoChange
--- PASS: TestAddOrUpdateTaint_DuplicateNoChange (0.00s)
=== RUN   TestAddOrUpdateTaint_UpdateValue
--- PASS: TestAddOrUpdateTaint_UpdateValue (0.00s)
=== RUN   TestAddOrUpdateTaint_UpdateEffect
--- PASS: TestAddOrUpdateTaint_UpdateEffect (0.00s)
=== RUN   TestAddOrUpdateTaint_DifferentKeysAppended
--- PASS: TestAddOrUpdateTaint_DifferentKeysAppended (0.00s)
=== RUN   TestAddOrUpdateTaint_TimeAddedResetOnChange
--- PASS: TestAddOrUpdateTaint_TimeAddedResetOnChange (0.00s)
=== RUN   TestHealthEventToTaint
=== RUN   TestHealthEventToTaint/fatal_XID
=== RUN   TestHealthEventToTaint/non-fatal_XID_(skipped)
=== RUN   TestHealthEventToTaint/XID_with_nil_monitor_defaults_to_fatal
=== RUN   TestHealthEventToTaint/GPU_lost
=== RUN   TestHealthEventToTaint/unmonitored
=== RUN   TestHealthEventToTaint/unknown_event_type_defaults_to_unmonitored
E0410 00:22:38.797546  141863 device_health.go:98] Unknown health event type "bogus", defaulting to unmonitored taint
--- PASS: TestHealthEventToTaint (0.00s)
    --- PASS: TestHealthEventToTaint/fatal_XID (0.00s)
    --- PASS: TestHealthEventToTaint/non-fatal_XID_(skipped) (0.00s)
    --- PASS: TestHealthEventToTaint/XID_with_nil_monitor_defaults_to_fatal (0.00s)
    --- PASS: TestHealthEventToTaint/GPU_lost (0.00s)
    --- PASS: TestHealthEventToTaint/unmonitored (0.00s)
    --- PASS: TestHealthEventToTaint/unknown_event_type_defaults_to_unmonitored (0.00s)
=== RUN   TestIsEventNonFatal
=== RUN   TestIsEventNonFatal/skipped_XID_is_non-fatal
=== RUN   TestIsEventNonFatal/non-skipped_XID_is_fatal
=== RUN   TestIsEventNonFatal/GPU_LOST_is_always_fatal
=== RUN   TestIsEventNonFatal/unmonitored_is_not_an_XID_event
--- PASS: TestIsEventNonFatal (0.00s)
    --- PASS: TestIsEventNonFatal/skipped_XID_is_non-fatal (0.00s)
    --- PASS: TestIsEventNonFatal/non-skipped_XID_is_fatal (0.00s)
    --- PASS: TestIsEventNonFatal/GPU_LOST_is_always_fatal (0.00s)
    --- PASS: TestIsEventNonFatal/unmonitored_is_not_an_XID_event (0.00s)

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
// healthEventToTaint maps a DeviceHealthEvent to the corresponding DRA
// DeviceTaint using the Option A taint key schema: one key per health
// dimension under the gpu.nvidia.com domain.
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) resourceapi.DeviceTaint {
Contributor

nit: Stylistically, I'd prefer if returned objects are always pointers to avoid unnecessary copies

Suggested change
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) resourceapi.DeviceTaint {
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) *resourceapi.DeviceTaint {

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
Comment on lines 196 to 197
}
if ret != nvml.SUCCESS {
Contributor

old nit: what's the expectation here? It seems like we log 2 warnings if ret is nvml.ERROR_NOT_SUPPORTED

Contributor Author

fixed.

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
}

// collectDevices flattens a GI→CI device map into a slice.
func collectDevices(giMap map[uint32]map[uint32]*AllocatableDevice) []*AllocatableDevice {
Contributor

nit: Can you pick a more descriptive name for this function? Something like: flattenMIGDeviceMap()

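The flattening helper under discussion is small enough to sketch in full; here it is under the name suggested in the nit, with AllocatableDevice reduced to a minimal stand-in for the driver's real type:

```go
package main

import "fmt"

// AllocatableDevice is a minimal stand-in for the driver's device type.
type AllocatableDevice struct {
	UUID string
}

// flattenMIGDeviceMap collects every device from a GI→CI nested map
// into a flat slice (the more descriptive name suggested in review).
func flattenMIGDeviceMap(giMap map[uint32]map[uint32]*AllocatableDevice) []*AllocatableDevice {
	var devices []*AllocatableDevice
	for _, ciMap := range giMap {
		for _, d := range ciMap {
			devices = append(devices, d)
		}
	}
	return devices
}

func main() {
	giMap := map[uint32]map[uint32]*AllocatableDevice{
		0: {0: {UUID: "MIG-a"}, 1: {UUID: "MIG-b"}},
		1: {0: {UUID: "MIG-c"}},
	}
	fmt.Println(len(flattenMIGDeviceMap(giMap)))
}
```

Note that map iteration order in Go is unspecified, so callers should not rely on the ordering of the returned slice.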
Comment thread cmd/gpu-kubelet-plugin/deviceinfo.go Outdated
giProfileInfo *nvml.GpuInstanceProfileInfo
ciProfileInfo *nvml.ComputeInstanceProfileInfo
health HealthStatus
pcieBusID string
Contributor

The errors here seem to be from a botched rebase. Can you fix this?

IsEventNonFatal(event *DeviceHealthEvent) bool
}

type driver struct {
Contributor

One alternative I can think of is that you can make that determination at the time of DeviceHealthEvent creation and make it a part of the object. We would be making the determination at the time of receiving the nvml events which makes sense to me.

Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
}

func (m *nvmlDeviceHealthMonitor) markAllDevicesUnhealthy() {
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForAllDevices(eventType DeviceHealthEventType) {
Contributor

Thanks for addressing this! Its a meaningful change.

@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 314bc2e to 9598cd4 Compare April 10, 2026 17:16
Contributor

@varunrsekar varunrsekar left a comment

Overall LGTM. Some minor comments..

Comment thread cmd/gpu-kubelet-plugin/allocatable.go Outdated
}

// Key doesn't exist yet, append the new taint
d.taints = append(d.taints, t)
Contributor

nit: you can directly append *taint here. No reason to create a copy beforehand.

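The append-or-update logic discussed here can be sketched as below. This is an illustrative reconstruction, not the PR's exact code: DeviceTaint is a local stand-in, and the behavior shown (one taint per key, TimeAdded reset only on change, direct append of a new key) follows the test names and review nits in this thread. It also makes the "worth discussing" overwrite behavior concrete: XID 48 followed by XID 63 leaves only 63.

```go
package main

import (
	"fmt"
	"time"
)

// DeviceTaint is a local stand-in for the resource.k8s.io type.
type DeviceTaint struct {
	Key, Value, Effect string
	TimeAdded          time.Time
}

// addOrUpdateTaint keeps at most one taint per key: an existing key is
// updated in place (resetting TimeAdded only when value or effect
// changed), and a new key is appended directly, with no copy beforehand.
func addOrUpdateTaint(taints []DeviceTaint, taint DeviceTaint) []DeviceTaint {
	for i := range taints {
		if taints[i].Key == taint.Key {
			if taints[i].Value != taint.Value || taints[i].Effect != taint.Effect {
				taint.TimeAdded = time.Now()
				taints[i] = taint
			}
			return taints
		}
	}
	return append(taints, taint)
}

func main() {
	var ts []DeviceTaint
	ts = addOrUpdateTaint(ts, DeviceTaint{Key: "gpu.nvidia.com/xid", Value: "48", Effect: "NoSchedule"})
	ts = addOrUpdateTaint(ts, DeviceTaint{Key: "gpu.nvidia.com/xid", Value: "63", Effect: "NoSchedule"})
	fmt.Println(len(ts), ts[0].Value) // the second XID overwrites the first
}
```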
Comment thread cmd/gpu-kubelet-plugin/device_health.go Outdated
// healthEventToTaint maps a DeviceHealthEvent to the corresponding DRA
// DeviceTaint using the Option A taint key schema: one key per health
// dimension under the gpu.nvidia.com domain.
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) *resourceapi.DeviceTaint {
Contributor

nit: can you swap ordering of the args here? Helpers need to be before inputs.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: guptaNswati, varunrsekar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>
…updates

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@guptaNswati guptaNswati force-pushed the device-taint-toleration branch from 8dfe9cf to 7e8685e Compare April 10, 2026 22:19
@guptaNswati
Contributor Author

Retested and vetted. Merging.

@guptaNswati guptaNswati merged commit 47888a0 into kubernetes-sigs:main Apr 10, 2026
9 of 10 checks passed
@ArangoGutierrez
Contributor

@guptaNswati This is no longer an NVIDIA-owned repo, so why are you manually merging PRs instead of using the Prow LGTM flow? I haven't reviewed the fixes for my comments. Please don't do this again.

@guptaNswati
Contributor Author

@guptaNswati This is no longer an NVIDIA-owned repo, why are you manually merging PR's. and not using PROW lgtm? I haven't reviewed the fixes of my comments. Please don't do this again

Yes, I realized I should not have done that, whether it's NVIDIA-owned or not. I should have waited for your review and approval, as well as an LGTM label from one of the maintainers. It won't happen again.

Please review, and if there are any unaddressed concerns, we can either revert the PR or do a follow-up fix.

In the repo, we need to add PR merge safeguards like mandatory approvals and LGTM labels.

@klueska
Contributor

klueska commented Apr 11, 2026

I haven't looked deeply at this but the high level comments I have on the direction this should take:

  1. All new taint code should be feature gated

  2. A (separate) helm option should be added to turn off adding taints to devices so that external components can take full control of managing the health state of devices if they want (e.g. NVSentinel).

Maybe this is already being done, but just wanted to make sure these points are taken into consideration.

@guptaNswati
Contributor Author

I haven't looked deeply at this but the high level comments I have on the direction this should take:

  1. All new taint code should be feature gated
  2. A (separate) helm option should be added to turn off adding taints to devices so that external components can take full control of managing the health state of devices if they want (e.g. NVSentinel).

Maybe this is already being done, but just wanted to make sure these points are taken into consideration.

Oh, why a new feature gate for tainting? Right now, I replaced the device removal logic with device tainting in the ResourceSlice, and it's all behind the NVMLDeviceHealthCheck feature gate.

For 2, external components can override the existing taints with a DeviceTaintRule. The current implementation is more of a built-in or default health check; when there are external agents, they can overwrite the default. I will do that later as part of the device-api integration.

You also shared a similar concern on the design, and this is what I had in mind: #905 (comment)

@klueska
Contributor

klueska commented Apr 14, 2026

  1. That's fine if you are repurposing the existing feature gate for this -- so long as there is a feature gate protecting it.
  2. At some point the feature gate from (1) will be removed when this feature is stable, and we will still want a way to completely shut off any device health checking in this component and leave it up to an external component.

@guptaNswati
Contributor Author

Review: Add taints on health events

The design is sound and well-aligned with KEP-5055 Option A. Taint key schema (gpu.nvidia.com/xid, gpu.nvidia.com/gpu-lost, gpu.nvidia.com/unmonitored) follows KEP guidance of separate keys per health dimension. Keeping tainted devices in the ResourceSlice instead of removing them is the correct approach.

However, there are a few issues that need to be addressed:

Blocker

  • Merge conflicts: The PR is not mergeable against current main. The device_health.go file, AllocatableDevice struct, and health monitoring infrastructure have been significantly refactored on main. A rebase is needed.

Must-fix

  • Race condition on Taints() read — see inline comment on driver.go
  • Missing tests: AddOrUpdateTaint, healthEventToTaint, and IsEventNonFatal are all unit-testable and should have coverage

Worth discussing

  • Skipped XIDs behavioral change (now produce informational taints vs. being silently dropped)
  • No taint removal path (recovery requires driver restart)
  • Single XID taint per device — XID 48 then XID 63 overwrites the value, losing the first XID info

Keeping tainted devices in the ResourceSlice instead of removing them is the correct approach.
Yes, in the current implementation most of the previous logic is the same: device removal was replaced with device tainting, and the earlier skipped XIDs are now propagated as informational taints.

  • PR was rebased

  • Race condition was fixed with a slice as you suggested

  • added tests

  • Yes we still need the recovery path as noted here DeviceHealthCheck: Alpha to GA #904 (comment)

  • Whether this style of overwriting XIDs is valid needs more research, as I did not find much in the docs on how NVML handles multiple XID events. Is the last or the first XID the more meaningful one, or should we not overwrite at all? I will reach out to the NVML team.

@ArangoGutierrez If you still find something unaddressed, please let me know.

@guptaNswati
Contributor Author

  1. Thats fine if you are repurposing the existing feature gate for this -- so long as there is a feature gate protecting it.
  2. At some point the feature gate from (1) will be removed when this feature is stable, and we will still want a way to completely shut off doing any sort of device health checking from this component and leave it up to an external component

Yes, correct. That's option 3 from the link above: a fully decoupled monitoring option where we don't use the default dra-driver health check. I will do this as part of the Beta promotion and integration with device-api, where we may auto-discover or add a flag for an external health-check endpoint.


Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

DeviceHealthCheck: Design Proposal for Taint Key/Value Schema

7 participants