Add taints on health events #983
Conversation
Force-pushed 60de594 to fe32560 (compare)
Need to double-check the logging, but it's ready to review: test logs
// sendHealthEventsForDevices sends a DeviceHealthEvent for every device under
// the given parent-level map. Uses non-blocking sends to protect the monitor
// goroutine from deadlocks when the channel is full.
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {
nit:
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {
func (m *nvmlDeviceHealthMonitor) sendHealthEventForDevices(giMap map[uint32]map[uint32]*AllocatableDevice, eventType DeviceHealthEventType) {
case HealthEventGPULost:
	return resourceapi.DeviceTaint{
		Key:    TaintKeyGPULost,
		Effect: resourceapi.DeviceTaintEffectNoExecute,
Can you make this effect NoSchedule? Not sure how likely, but there may be a scenario where a GPU may temporarily fall off the bus and come back. If a workload wants to be resilient to it, we'd be able to support it with the NoSchedule effect.
Should we make these effects configurable so admins can override the default behavior for specific error types? E.g., choosing between evicting existing workloads vs. only preventing new pods from being scheduled?
Good idea Shiva. Configure through a configmap?
@varunrsekar Mmm, we should have some real-world data here. In the case of a GPULost event, can it be recovered without any intervention? Let me ask some folks internally.
@shivamerla No, we should not make it configurable; this should be done via a DeviceTaintRule published by admins.
As per the docs, the GPU-lost error is not recoverable without intervention:
NVML_ERROR_GPU_IS_LOST = 15
The GPU has fallen off the bus or has otherwise become inaccessible.
Mmm, so in any error case, we should not apply NoExecute, because we should not handle eviction from the dra-driver?
Let me summarize the DeviceTaintRule from the KEP that does exactly what we want to override :)
A DeviceTaintRule acts like a dynamic patch. Instead of the DRA driver explicitly writing a taint into a ResourceSlice, a cluster administrator creates a DeviceTaintRule manifest.
The Kubernetes control plane reads this rule, looks for any devices in the cluster that match the rule's criteria, and automatically overlays the specified taint onto those devices.
It has two main components:
- deviceSelector: A query that defines which devices this rule applies to (e.g., "All devices from the NVIDIA driver", or "All devices currently reporting an XID 43").
- taint: The exact taint (Key, Value, Effect) to apply to the matched devices.
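For concreteness, a DeviceTaintRule manifest along those lines might look like the sketch below. Caveat: the apiVersion and the exact deviceSelector field names follow my reading of the KEP-5055 alpha API and may differ from what a given cluster version actually serves.

```yaml
# Sketch only: field names per the KEP-5055 alpha API, subject to change.
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: nvidia-gpu-lost-evict
spec:
  deviceSelector:
    driver: gpu.nvidia.com      # all devices from the NVIDIA DRA driver
  taint:
    key: gpu.nvidia.com/gpu-lost
    value: "true"
    effect: NoExecute           # admin explicitly opts in to eviction
```

With a rule like this, the admin opts in to eviction for matching devices cluster-wide, without the DRA driver ever publishing NoExecute itself.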
These user stories are the cases we're discussing.
And I can test the override with it. The only thing to note is that DeviceTaintRule is behind an additional feature gate that is not enabled by default, even in 1.36.
Do you mean that, we'll create a static set of DeviceTaintRule CRs for each error we detect and then dynamically add devices to the appropriate rule?
@guptaNswati IIUC, These are two distinct mechanisms for applying device taints:
- The DRA driver can detect device health status and apply taints directly within the ResourceSlice.
- An admin can define a DeviceTaintRule, specifying a device selector (driver, device, or pool) along with the taints to be applied to all matching devices in the ResourceSlice.
It appears that taints defined in a DeviceTaintRule override those set by the DRA driver. Based on this, we should avoid applying a NoExecute taint at the driver level and instead allow administrators to enforce it via DeviceTaintRule if needed. We can validate this behavior by testing.
// key. Within the same key, the effect can only escalate
// (None < NoSchedule < NoExecute), and the value is always updated to the
// latest. Returns true if the taint set was modified.
Can you help me understand this comment? Why is it only allowed for the effect to escalate? Also in this PR, I don't see anywhere an existing taint being escalated. So I'm not sure why we need that validation.
Good catch: I added it for handling XID events, but the logic is still off. Right now, all XIDs have None, but I need to update it to differentiate between fatal and non-fatal XIDs.
I removed it and refactored the code to handle non-fatal XIDs as None.
	return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
}
// only proceed with config mapping if device is healthy.
if featuregates.Enabled(featuregates.NVMLDeviceHealthCheck) {
This assumes that NVMLDeviceHealthCheck depends on the DRADeviceTaints feature gate being enabled (which is not the default in v1.35 and earlier). We should call out this limitation explicitly.
shivamerla
left a comment
Looks great! Left a few comments.
Pools: map[string]resourceslice.Pool{
	nodeName: {Slices: []resourceslice.Slice{resourceSlice}},
},
taints := dev.Taints()
Question: For pre-created and dynamic MIG devices, do we need to update this for every profile advertised?
Yes, we would need to. This won't be very efficient for a large number, though?
Dynamic MIG is, by the way, mutually exclusive with the NVML health check. So only static MIG needs to be considered, which is a smaller number of devices.
At some point they should converge and not be mutually exclusive. We should be health-checking dynamically created MIG devices as well. It can still be a separate path, though.
return resourceapi.DeviceTaint{
	Key:    TaintKeyXID,
	Value:  strconv.FormatUint(event.EventData, 10),
	Effect: DeviceTaintEffectNone,
We should differentiate between fatal and non-fatal XID errors.
For non-fatal ones, the effect can be None, but what should the effect be for fatal XIDs?
NoSchedule: Kubernetes will prevent new pods from being assigned to this broken GPU, but it will leave the existing pods alone.
NoExecute: Kubernetes will actively and immediately evict any running pods that are currently using that GPU.
Based on offline discussion,
- we should not evict the workload as it could still have local state which will be lost if evicted
- it might disrupt the workload negatively
- It puts a burden on the DRA driver to handle Unprepare for the devices safely
- It gives the DRA driver power over workload lifecycle
cc @varunrsekar
Configurable effects will make sense here. We should consider NoExecute to be a nuke and only allow the admin to request it if they know what they're doing. So for non-fatal errors, we can add a default None effect, and for fatal errors, we can add a default NoSchedule taint.
It'd be easy to say gpu-lost is a fatal error. But I'd imagine that trying to classify the xids is not easy. So I'm not sure how you'd want to classify them.
We have a hardcoded list of non-fatal XIDs and an env var where the user can pass an additional list to skip, so I was thinking of making all of those None and the rest NoSchedule.
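That classification logic can be sketched as below. The names mirror the PR's IsEventNonFatal and DeviceHealthEvent, but the struct fields and the XID skip list here are illustrative assumptions, not the driver's actual code; in the real driver the hardcoded list is merged with XIDs from an env var.

```go
package main

import "fmt"

// DeviceHealthEventType mirrors the event types discussed in this PR.
type DeviceHealthEventType string

const (
	HealthEventXID     DeviceHealthEventType = "XID"
	HealthEventGPULost DeviceHealthEventType = "GPULost"
)

// DeviceHealthEvent is a simplified stand-in for the driver's event struct.
type DeviceHealthEvent struct {
	Type DeviceHealthEventType
	XID  uint64
}

// defaultNonFatalXids is a hypothetical skip list; XID 43 is included
// because the test logs in this thread show it mapped to Effect: None.
var defaultNonFatalXids = map[uint64]bool{
	13: true,
	31: true,
	43: true,
	45: true,
	68: true,
}

// isEventNonFatal returns true only for XID events whose code is in the
// skip list; GPU-lost and unmonitored events are always treated as fatal.
func isEventNonFatal(skipped map[uint64]bool, event *DeviceHealthEvent) bool {
	if event.Type != HealthEventXID {
		return false
	}
	return skipped[event.XID]
}

func main() {
	fmt.Println(isEventNonFatal(defaultNonFatalXids, &DeviceHealthEvent{Type: HealthEventXID, XID: 43}))
	fmt.Println(isEventNonFatal(defaultNonFatalXids, &DeviceHealthEvent{Type: HealthEventGPULost}))
}
```

The taint mapper then picks Effect: None when this returns true, and NoSchedule otherwise.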
With the refactor, I don't need to comment out XID handling.

I updated XID handling and addressed the nits. PTAL. Regarding the configurable taint effect: we are making a best-effort decision and not handling evictions, so if an admin wants to override taints, it should be done with DeviceTaintRule.
}

if changed {
	d.taints[i].TimeAdded = nil // reset timestamp for the API server
I find it difficult to wrap my head around the TimeAdded field. For correctness, should this be when the plugin determined the device is to be tainted or is it when the APIServer actually applies the taint?
In your change, you've determined the API server is the one that needs to be setting it. But why not always apply it on our end?
TimeAdded is for the API server and not for the dra-driver, as the scheduler uses it for state changes and for clock drift when evaluating tolerations.
Not very well explained, but see https://pkg.go.dev/k8s.io/dynamic-resource-allocation/resourceslice#DevicesDeepEqual
}

func (m *nvmlDeviceHealthMonitor) markAllDevicesUnhealthy() {
func (m *nvmlDeviceHealthMonitor) sendHealthEventsForAllDevices(eventType DeviceHealthEventType) {
Given how this method is consumed in the driver.deviceHealthEvents monitoring thread, we will be individually updating the ResourceSlice for every device of the GPU (e.g., the full GPU plus all its pre-created MIGs).
Can we instead rework it to aggregate all the devices into a single event, so that we make a single ResourceSlice update?
Mmm, I need to think about it a bit, as we act on a single event notification, which is per device.
NVML inherently processes and fires events on a per-device basis, so instead of aggregating them before sending down the channel, we can handle the aggregation on the receiving side in the driver.deviceHealthEvents loop, before we call publishResources. We can either:
Option 1: Batch processing (full channel drain)
When the main loop receives the first event, we immediately enter a non-blocking select loop (default case) to pull any other pending events off the channel until it's empty, then publish once.
- Pro: no latency; it publishes the moment the channel is empty.
- Con: if the NVML monitor lags in sending events, the drain loop might finish too early, resulting in 2 API calls instead of 1.

Option 2: Timer-based batching
When the first event arrives, we start a small timer (e.g., 50ms). Any subsequent events that arrive within that 50ms window update the local state but do not trigger a publish. Once the timer expires, we do a single publishResources call for the whole batch.
- Pro: more standard Kubernetes pattern for protecting the API server.
- Con: introduces a tiny 50ms delay before the Kubernetes API is notified.
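Option 1 can be sketched as a small drain loop. This is illustrative only: the event type and the publish callback are stand-ins for the driver's actual types, not its real API.

```go
package main

import "fmt"

// drainPending applies the first event, then non-blockingly drains any
// events already queued on the channel, so a burst of per-device events
// results in a single publish instead of one API call per event.
func drainPending(events <-chan string, first string, publish func(batch []string)) {
	batch := []string{first}
	for {
		select {
		case ev := <-events:
			batch = append(batch, ev)
		default:
			// Channel empty: publish the whole batch once.
			publish(batch)
			return
		}
	}
}

func main() {
	ch := make(chan string, 8)
	// Simulate a burst: one GPU-lost event per device of the same GPU.
	ch <- "gpu-0-mig-0"
	ch <- "gpu-0-mig-1"
	drainPending(ch, "gpu-0", func(batch []string) {
		fmt.Println("single publish for:", batch)
	})
}
```

The stated con is visible here: events that arrive after the default case fires still trigger a second publish.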
@lalitadithya showed me a burst of events can happen in real cluster environments.
> mmm need to think it a bit as we act on a single event notification which is per device.
Sorry, maybe I didn't word it correctly. For this review, I was only talking about the codepath where you invoke sendHealthEventForAllDevices for the GPU_LOST event, where you're sending individual events for all devices belonging to that GPU. So instead of sending one GPU_LOST event each for gpu-0, gpu-0-mig-0, gpu-0-mig-1, etc., we can handle it with a single event which includes all those devices.
Your comment on batching make sense to me and would be a good improvement for the future.
Oh, I see. Makes sense. Yes, I can fix that, and yes, this should be a follow-up improvement.
@varunrsekar I refactored to batch multiple devices on GPU_LOST and unmonitored events. PTAL.
Test logs in MIG mode with XID 43:
I0409 23:33:27.290825 1 device_health.go:255] Processing event XID=43 event
I0409 23:33:27.290936 1 device_health.go:271] Sending XID=43 health event for device MIG-f19df5ab-c572-5712-9b1b-6abbdd67a285
W0409 23:33:27.291027 1 driver.go:477] Received xid health event for device MIG-f19df5ab-c572-5712-9b1b-6abbdd67a285
name: gpu-0-mig-1g6gb-14-0
taints:
- effect: None
key: gpu.nvidia.com/xid
timeAdded: "2026-04-09T23:33:27Z"
value: "43"
Thanks for addressing this! It's a meaningful change.
Yes, makes sense. We can drop that thought for now, as it is clear that admins can override the taint to escalate as desired using a DeviceTaintRule.
Force-pushed 46c02e0 to 7048d97 (compare)
Force-pushed 7048d97 to ddd1200 (compare)
Force-pushed ddd1200 to 46c02e0 (compare)
ArangoGutierrez
left a comment
Review: Add taints on health events
The design is sound and well-aligned with KEP-5055 Option A. Taint key schema (gpu.nvidia.com/xid, gpu.nvidia.com/gpu-lost, gpu.nvidia.com/unmonitored) follows KEP guidance of separate keys per health dimension. Keeping tainted devices in the ResourceSlice instead of removing them is the correct approach.
However, there are a few issues that need to be addressed:
Blocker
- Merge conflicts: The PR is not mergeable against current main. The device_health.go file, AllocatableDevice struct, and health monitoring infrastructure have been significantly refactored on main. A rebase is needed.
Must-fix
- Race condition on Taints() read — see inline comment on driver.go
- Missing tests — AddOrUpdateTaint, healthEventToTaint, and IsEventNonFatal are all unit-testable and should have coverage
Worth discussing
- Skipped XIDs behavioral change (now produce informational taints vs. being silently dropped)
- No taint removal path (recovery requires driver restart)
- Single XID taint per device — XID 48 then XID 63 overwrites the value, losing the first XID info
)

// TODO: remove later, as its not in vendored api
// DeviceTaintEffectNone is an informational effect that does not affect
This hardcoded constant is fragile. If the upstream DeviceTaintEffectNone lands with a different string value, this silently breaks at runtime with no compile-time safety net.
Consider pinning the vendored k8s.io/api dependency to a version that includes it, or at minimum add a build-time or init-time assertion that validates the value matches upstream once the dependency is bumped. The TODO is good but should also reference a tracking issue.
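The suggested init-time assertion could be sketched like this. It is a hypothetical guard: the vendored resourceapi does not yet export the constant (which is the point of the TODO), so the "upstream" value is passed in as a placeholder.

```go
package main

import "fmt"

// DeviceTaintEffectNone is hardcoded locally until the vendored
// resourceapi exposes it (see the tracking issue in this thread).
const DeviceTaintEffectNone = "None"

// assertMatchesUpstream is the init-time guard the reviewer suggests:
// once the dependency is bumped, pass resourceapi's constant in here and
// the program fails fast at startup if the strings have diverged.
func assertMatchesUpstream(upstream string) {
	if DeviceTaintEffectNone != upstream {
		panic(fmt.Sprintf("DeviceTaintEffectNone %q diverged from upstream %q",
			DeviceTaintEffectNone, upstream))
	}
}

func main() {
	assertMatchesUpstream("None") // placeholder for resourceapi.DeviceTaintEffectNone
	fmt.Println("constant matches upstream")
}
```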
Mmm, it's going to be the same: https://pkg.go.dev/k8s.io/kubernetes/pkg/apis/resource#DeviceTaintEffect
I created the issue: #1001
klog.Warningf("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
m.markAllDevicesUnhealthy()
klog.Warningf("Failed to determine uuid for event %v: %v; Tainting all devices with %s", event, ret, TaintKeyGPULost)
m.sendHealthEventsForAllDevices(HealthEventGPULost)
Behavioral change worth calling out: The removed m.skippedXids[xid] early-return means skipped XIDs are no longer silently dropped — they now flow through as HealthEventXID events and get mapped to Effect: None taints via IsEventNonFatal.
This is a reasonable design choice (informational taints > invisible events), but it changes observable behavior: a device that previously had zero taints for skipped XIDs will now carry a gpu.nvidia.com/xid taint with Effect: None. Worth noting in the PR description or commit message so reviewers and users are aware.
Good point. I will add this.
if len(taints) > 0 {
	d.Taints = taints
}
resourceSlice.Devices = append(resourceSlice.Devices, d)
Race condition: dev.Taints() reads d.taints without holding the DeviceState lock, but AddDeviceTaint (called a few lines above) writes under lock. If another health event fires concurrently and modifies the slice while this loop is iterating, you get a data race.
Either:
- Hold the DeviceState lock for the entire resource slice construction (a read-lock would suffice), or
- Have Taints() return a copy of the slice (return slices.Clone(d.taints)), or
- Build the resource slice inside AddDeviceTaint under the existing lock

Option 2 is the least invasive fix.
if existing.Value != taint.Value {
	d.taints[i].Value = taint.Value
	changed = true
}
AddOrUpdateTaint deduplicates by key, so if a device gets XID 48 then XID 63, the value overwrites to "63" and the first XID info is lost. This is consistent with Option A's one-key-per-dimension design, but should be documented — e.g., a comment here noting that only the most recent XID is retained per device.
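The dedupe-by-key behavior being described can be sketched as follows. Types are simplified stand-ins: the real AddOrUpdateTaint operates on resourceapi.DeviceTaint, and the locking here only mirrors what the race-condition comments above ask for.

```go
package main

import (
	"fmt"
	"sync"
)

// DeviceTaint is a simplified stand-in for resourceapi.DeviceTaint.
type DeviceTaint struct {
	Key, Value, Effect string
}

// AllocatableDevice holds the taints advertised for one device.
type AllocatableDevice struct {
	mu     sync.Mutex
	taints []DeviceTaint
}

// AddOrUpdateTaint deduplicates by key: a later XID overwrites the value
// of the existing gpu.nvidia.com/xid taint, so only the most recent XID
// is retained per device. Returns true if the taint set was modified.
func (d *AllocatableDevice) AddOrUpdateTaint(t DeviceTaint) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	for i, existing := range d.taints {
		if existing.Key == t.Key {
			changed := false
			if existing.Value != t.Value {
				d.taints[i].Value = t.Value
				changed = true
			}
			if existing.Effect != t.Effect {
				d.taints[i].Effect = t.Effect
				changed = true
			}
			return changed
		}
	}
	// Key doesn't exist yet: append the new taint.
	d.taints = append(d.taints, t)
	return true
}

func main() {
	var d AllocatableDevice
	d.AddOrUpdateTaint(DeviceTaint{Key: "gpu.nvidia.com/xid", Value: "48", Effect: "NoSchedule"})
	// XID 63 arrives for the same device: same key, so "48" is overwritten.
	d.AddOrUpdateTaint(DeviceTaint{Key: "gpu.nvidia.com/xid", Value: "63", Effect: "NoSchedule"})
	fmt.Println(d.taints)
}
```

This is the "one key per health dimension" trade-off: the first XID's information is lost once a second one arrives.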
// Returns true if the taint set was modified.
func (d *AllocatableDevice) AddOrUpdateTaint(taint resourceapi.DeviceTaint) bool {
	for i, existing := range d.taints {
		if existing.Key == taint.Key {
Taints() returns the internal slice directly. This is read from driver.go:deviceHealthEvents without holding the lock while AddDeviceTaint writes under lock — see the race condition comment on driver.go. At minimum, return a copy:
func (d *AllocatableDevice) Taints() []resourceapi.DeviceTaint {
	return slices.Clone(d.taints)
}

IsEventNonFatal(event *DeviceHealthEvent) bool
}

type driver struct {
Adding IsEventNonFatal to the deviceHealthMonitor interface gives it dual responsibility: monitoring events AND defining taint policy. This couples taint severity decisions to the monitor implementation.
Consider whether this policy check should live in healthEventToTaint directly (e.g., accepting the skipped XIDs map as a parameter) rather than going through the monitor interface. Not a blocker, but worth thinking about for testability — mocking the monitor just to test taint mapping is more friction than passing a simple map.
I actually went back and forth on this exact decoupling when I was refactoring the skipped-XID logic. My reasoning for keeping IsEventNonFatal tied to the monitor was that all logic related to monitoring should stay in the monitor. It can, or should, do both.
If we move the policy up into healthEventToTaint, we end up leaking hardware-specific error definitions into the Kubernetes taint-mapping layer.
I agree it adds a bit of friction to mocking the interface for unit tests, but I felt the encapsulation of all health logic should be in the monitor. Not strongly opinionated on it, though.
One alternative I can think of is that you can make that determination at the time of DeviceHealthEvent creation and make it a part of the object. We would be making the determination at the time of receiving the nvml events which makes sense to me.
Mmm, like how it was initially done. I still like the way it is right now: IsEventNonFatal as an explicit method on the monitor interface, which combines the two actions, whereas DeviceHealthEvent encapsulates only the event data.
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Force-pushed e99bfb8 to c6bc6ed (compare)
Force-pushed c63ff01 to 31dfda9 (compare)
limitations under the License.
*/

package main
Test run @ArangoGutierrez
=== RUN TestAddOrUpdateTaint_NewTaint
--- PASS: TestAddOrUpdateTaint_NewTaint (0.00s)
=== RUN TestAddOrUpdateTaint_DuplicateNoChange
--- PASS: TestAddOrUpdateTaint_DuplicateNoChange (0.00s)
=== RUN TestAddOrUpdateTaint_UpdateValue
--- PASS: TestAddOrUpdateTaint_UpdateValue (0.00s)
=== RUN TestAddOrUpdateTaint_UpdateEffect
--- PASS: TestAddOrUpdateTaint_UpdateEffect (0.00s)
=== RUN TestAddOrUpdateTaint_DifferentKeysAppended
--- PASS: TestAddOrUpdateTaint_DifferentKeysAppended (0.00s)
=== RUN TestAddOrUpdateTaint_TimeAddedResetOnChange
--- PASS: TestAddOrUpdateTaint_TimeAddedResetOnChange (0.00s)
=== RUN TestHealthEventToTaint
=== RUN TestHealthEventToTaint/fatal_XID
=== RUN TestHealthEventToTaint/non-fatal_XID_(skipped)
=== RUN TestHealthEventToTaint/XID_with_nil_monitor_defaults_to_fatal
=== RUN TestHealthEventToTaint/GPU_lost
=== RUN TestHealthEventToTaint/unmonitored
=== RUN TestHealthEventToTaint/unknown_event_type_defaults_to_unmonitored
E0410 00:22:38.797546 141863 device_health.go:98] Unknown health event type "bogus", defaulting to unmonitored taint
--- PASS: TestHealthEventToTaint (0.00s)
--- PASS: TestHealthEventToTaint/fatal_XID (0.00s)
--- PASS: TestHealthEventToTaint/non-fatal_XID_(skipped) (0.00s)
--- PASS: TestHealthEventToTaint/XID_with_nil_monitor_defaults_to_fatal (0.00s)
--- PASS: TestHealthEventToTaint/GPU_lost (0.00s)
--- PASS: TestHealthEventToTaint/unmonitored (0.00s)
--- PASS: TestHealthEventToTaint/unknown_event_type_defaults_to_unmonitored (0.00s)
=== RUN TestIsEventNonFatal
=== RUN TestIsEventNonFatal/skipped_XID_is_non-fatal
=== RUN TestIsEventNonFatal/non-skipped_XID_is_fatal
=== RUN TestIsEventNonFatal/GPU_LOST_is_always_fatal
=== RUN TestIsEventNonFatal/unmonitored_is_not_an_XID_event
--- PASS: TestIsEventNonFatal (0.00s)
--- PASS: TestIsEventNonFatal/skipped_XID_is_non-fatal (0.00s)
--- PASS: TestIsEventNonFatal/non-skipped_XID_is_fatal (0.00s)
--- PASS: TestIsEventNonFatal/GPU_LOST_is_always_fatal (0.00s)
--- PASS: TestIsEventNonFatal/unmonitored_is_not_an_XID_event (0.00s)
// healthEventToTaint maps a DeviceHealthEvent to the corresponding DRA
// DeviceTaint using the Option A taint key schema: one key per health
// dimension under the gpu.nvidia.com domain.
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) resourceapi.DeviceTaint {
nit: Stylistically, I'd prefer if returned objects are always pointers to avoid unnecessary copies.
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) resourceapi.DeviceTaint {
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) *resourceapi.DeviceTaint {
}
if ret != nvml.SUCCESS {
old nit: what's the expectation here? It seems like we log 2 warnings if ret is nvml.ERROR_NOT_SUPPORTED
}

// collectDevices flattens a GI→CI device map into a slice.
func collectDevices(giMap map[uint32]map[uint32]*AllocatableDevice) []*AllocatableDevice {
nit: Can you pick a more descriptive name for this function? Something like: flattenMIGDeviceMap()
giProfileInfo *nvml.GpuInstanceProfileInfo
ciProfileInfo *nvml.ComputeInstanceProfileInfo
health        HealthStatus
pcieBusID     string
The errors here seem to be from a botched rebase. Can you fix this?
Force-pushed 314bc2e to 9598cd4 (compare)
varunrsekar
left a comment
Overall LGTM. Some minor comments.
}

// Key doesn't exist yet, append the new taint
d.taints = append(d.taints, t)
nit: you can directly append *taint here. No reason to create a copy beforehand.
// healthEventToTaint maps a DeviceHealthEvent to the corresponding DRA
// DeviceTaint using the Option A taint key schema: one key per health
// dimension under the gpu.nvidia.com domain.
func healthEventToTaint(event *DeviceHealthEvent, monitor deviceHealthMonitor) *resourceapi.DeviceTaint {
nit: can you swap ordering of the args here? Helpers need to be before inputs.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: guptaNswati, varunrsekar. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: Swati Gupta <swatig@nvidia.com>
…updates
Force-pushed 8dfe9cf to 7e8685e (compare)
Retested and vetted. Merging.
@guptaNswati This is no longer an NVIDIA-owned repo. Why are you manually merging PRs instead of using the Prow lgtm flow? I haven't reviewed the fixes for my comments. Please don't do this again.
Yes, I realized I should not have done that, whether it's NVIDIA-owned or not. I should have waited for your review and approval, as well as an LGTM label from one of the maintainers. It won't happen again. Please review, and if there are any unaddressed concerns, we can revert the PR or do a follow-up fix. In the repo, we need to add PR merge safeguards like mandatory approvals and LGTM labels.
I haven't looked deeply at this, but the high-level comments I have on the direction this should take:
Maybe this is already being done, but I just wanted to make sure these points are taken into consideration.
Oh, why a new feature gate for tainting? Right now, I replaced the device removal logic with device tainting in the ResourceSlice, and it's all behind the existing NVMLDeviceHealthCheck feature gate. For 2, external components can override the existing taints with DeviceTaintRule. You also shared a similar concern on the design, and this is what I had in mind: #905 (comment)
@ArangoGutierrez If you still find something unaddressed, please let me know.
Yes, correct. That's option 3 from the link above: a fully decoupled monitoring option where we don't use the default dra-driver health check. I will do this as part of Beta promotion and integration with device-api, where we may auto-discover or add a flag for an external health-check endpoint.
Implements Option A from #905.
It also introduces informational taints for non-fatal XIDs. Earlier, a list of skipped XID errors (e.g., non-fatal, application-level faults) was silently dropped by the monitor. Now they are propagated as informational taints (gpu.nvidia.com/xid with Effect: None). This improves observability by surfacing non-fatal errors on the device without affecting pod scheduling or triggering evictions.