Add docs folder and device health check docs by a-mccarthy · Pull Request #878 · kubernetes-sigs/dra-driver-nvidia-gpu

a-mccarthy · 2026-02-12T16:27:55Z

A draft attempt at documenting the device health check feature.

This also adds a new /docs folder to the base of the repo as a proposed, more permanent home for docs instead of the current wiki.

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

a-mccarthy · 2026-02-12T20:04:19Z

+Add the NVIDIA Helm repo if you have not already, then install or upgrade with the feature gate enabled:
+
+```
+helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update


@guptaNswati I'd also like some feedback on including this install command here. Is it generic enough that folks can use it by just copy/pasting? Or is the install too configurable, so that we should show only the --set featureGates.NVMLDeviceHealthCheck=true flag?

k8s-triage-robot · 2026-04-02T19:43:18Z

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

ArangoGutierrez

Hey, thanks for starting on this -- having docs in the repo instead of just the wiki is a great move.

The main issue is that the content describes a state of the code that no longer exists on main. The NVMLDeviceHealthCheck feature gate was removed, device_health.go is gone, and the healthy/unhealthy binary model was dropped. PR #983 is reworking health monitoring to use DRA device taints (KEP-5055) instead of removing devices from ResourceSlices.

I'd suggest holding this PR until #983 (or its successor) lands, then rewriting to match the actual implementation. The structure and kubectl examples are solid -- they just need to describe taints rather than device removal.

See inline comments for specifics.

ArangoGutierrez · 2026-04-09T12:00:47Z

+
+The NVIDIA DRA driver supports GPU health monitoring using the [NVIDIA Management Library (NVML)](https://developer.nvidia.com/management-library-nvml) to check for [GPU XID errors](https://docs.nvidia.com/deploy/xid-errors/introduction.html) and determines if a GPU or MIG device is functioning properly.
+
+GPU health checking is managed by the ``NVMLDeviceHealthCheck`` feature gate. This is currently an alpha feature and is disabled by default.


This feature gate no longer exists on main. Current gates are TimeSlicingSettings, MPSSupport, IMEXDaemonsWithDNSNames, and PassthroughSupport. The health monitoring infra was removed in a recent refactor -- will need to update once it's re-landed (likely through #983 or a follow-up).

ArangoGutierrez · 2026-04-09T12:00:47Z

+* Healthy - GPU is functioning normally. The GPU may have a non-critical XID error but is still available for workloads.
+* Unhealthy - GPU has a critical XID error and is not suitable for workloads.
+
+The DRA Driver removes  `unhealthy` devices from the available [ResourceSlices](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceslice). 


Two things here:

Double space before unhealthy (minor)

More importantly, PR #983 (Add taints on health events) changes this behavior -- devices stay in the ResourceSlice with taints attached rather than being removed. This section will need a rewrite to describe the taint-based model once that lands.
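
For readers unfamiliar with the taint-based model, here is a minimal sketch of how a tainted device might appear in a ResourceSlice. It assumes the upstream `resource.k8s.io/v1` schema with the KEP-5055 `taints` field enabled. The slice, node, and device names and the taint key and value are hypothetical placeholders, and the driver name is an assumption; the taints that #983 actually publishes may differ.

```yaml
# Illustrative ResourceSlice only; not output captured from the driver.
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: example-node-gpu-slice     # hypothetical name
spec:
  driver: gpu.nvidia.com           # assumed driver name
  nodeName: example-node           # hypothetical node
  pool:
    name: example-node
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    # The device stays listed; the taint signals the health problem.
    taints:
    - key: example.com/unhealthy   # hypothetical key
      value: xid-error             # hypothetical value
      effect: NoSchedule           # KEP-5055 effects include NoSchedule and NoExecute
  - name: gpu-1                    # healthy device, no taints
```

Under this model the rewritten docs would tell users to look for a `taints` entry on the device rather than checking whether the device is missing from the slice.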

ArangoGutierrez · 2026-04-09T12:00:47Z

+  --create-namespace \
+  --set featureGates.NVMLDeviceHealthCheck=true
+```
+


The wiki link still points to the old NVIDIA/k8s-dra-driver-gpu org. Should be kubernetes-sigs/nvidia-dra-driver-gpu.

ArangoGutierrez · 2026-04-09T12:00:47Z

+    kubectl get resourceslice <resourceslice-name> -o yaml
+    ```
+
+Unhealthy GPUs will not appear in the resource slice list. After the device recovers and is marked healthy again, you must restart the DRA Driver for the device to be added back into the available resources pool.


This repeats the restart requirement from line 13. Once the taint-based model from #983 lands, the restart story might change too -- worth consolidating into a single place.

k8s-ci-robot · 2026-04-09T12:00:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: a-mccarthy
Once this PR has been reviewed and has the lgtm label, please assign shivamerla for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Add docs folder and device health check docs

e9f5041

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

a-mccarthy requested review from guptaNswati and shivamerla February 12, 2026 16:28

guptaNswati reviewed Feb 12, 2026

Comment thread docs/device-healthchecks.md Outdated

guptaNswati reviewed Feb 12, 2026

Comment thread docs/device-healthchecks.md

guptaNswati reviewed Feb 12, 2026

Comment thread docs/device-healthchecks.md Outdated

a-mccarthy commented Feb 12, 2026

Comment thread docs/device-healthchecks.md

a-mccarthy added 2 commits February 12, 2026 13:46

updates from review

fb32a2b

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

updates from reviews

05996d4

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

a-mccarthy requested a review from guptaNswati February 12, 2026 18:51

a-mccarthy commented Feb 12, 2026

k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) Apr 2, 2026

ArangoGutierrez suggested changes Apr 9, 2026

a-mccarthy wants to merge 3 commits into kubernetes-sigs:main from a-mccarthy:update-docs
