Add docs folder and device health check docs #878
a-mccarthy wants to merge 3 commits into kubernetes-sigs:main
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
| Add the NVIDIA Helm repo if you have not already, then install or upgrade with the feature gate enabled: | ||
| ``` | ||
| helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update |
@guptaNswati I'd also like some feedback on including this install command here. Is it generic enough that folks can copy and paste it as-is? Or is the install configurable enough that we should show only `--set featureGates.NVMLDeviceHealthCheck=true`?
ArangoGutierrez left a comment:
Hey, thanks for starting on this -- having docs in the repo instead of just the wiki is a great move.
The main issue is that the content describes a state of the code that no longer exists on main. The NVMLDeviceHealthCheck feature gate was removed, device_health.go is gone, and the healthy/unhealthy binary model was dropped. PR #983 is reworking health monitoring to use DRA device taints (KEP-5055) instead of removing devices from ResourceSlices.
I'd suggest holding this PR until #983 (or its successor) lands, then rewriting to match the actual implementation. The structure and kubectl examples are solid -- they just need to describe taints rather than device removal.
See inline comments for specifics.
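To make the difference between the two models concrete, here is a minimal sketch of how a reader might list tainted devices from a ResourceSlice once the taint-based approach lands. It assumes the KEP-5055 field layout (a `taints` list per device, with `key`, `value`, and `effect`), and it checks both the flattened layout and the older `basic` nesting. The sample object and the taint key `example.com/unhealthy` are hypothetical and are not taken from #983.

```python
import json


def tainted_devices(resource_slice: dict) -> list[tuple[str, str, str]]:
    """Return (device name, taint key, taint effect) for every tainted device."""
    found = []
    for device in resource_slice.get("spec", {}).get("devices", []):
        # Assumption: older API versions nest device fields under "basic";
        # newer ones put "taints" directly on the device entry.
        body = device.get("basic") or device
        for taint in body.get("taints") or []:
            found.append(
                (device.get("name", ""), taint.get("key", ""), taint.get("effect", ""))
            )
    return found


# Hypothetical ResourceSlice, trimmed to the fields used above.
SAMPLE = json.loads("""
{
  "kind": "ResourceSlice",
  "spec": {
    "devices": [
      {"name": "gpu-0"},
      {"name": "gpu-1",
       "taints": [{"key": "example.com/unhealthy", "value": "xid", "effect": "NoSchedule"}]}
    ]
  }
}
""")

for name, key, effect in tainted_devices(SAMPLE):
    print(f"{name}: {key} ({effect})")
```

In practice the input would come from `kubectl get resourceslice <resourceslice-name> -o json` rather than an inline sample. Under the taint model both devices remain in the slice, and only the taint distinguishes the unhealthy one.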
| The NVIDIA DRA driver supports GPU health monitoring, using the [NVIDIA Management Library (NVML)](https://developer.nvidia.com/management-library-nvml) to check for [GPU XID errors](https://docs.nvidia.com/deploy/xid-errors/introduction.html) and determine whether a GPU or MIG device is functioning properly. | ||
| GPU health checking is managed by the ``NVMLDeviceHealthCheck`` feature gate. This is currently an alpha feature and is disabled by default. |
This feature gate no longer exists on main. Current gates are TimeSlicingSettings, MPSSupport, IMEXDaemonsWithDNSNames, and PassthroughSupport. The health monitoring infra was removed in a recent refactor -- will need to update once it's re-landed (likely through #983 or a follow-up).
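One way to catch this kind of drift is a small check that scans doc text for `featureGates.<Name>` references and flags any name that is not a current gate. The sketch below is illustrative only: the gate list is copied from the comment above and will go stale as gates change, and `unknown_gates` is a hypothetical helper, not something in the repo.

```python
import re

# Assumption: gate names copied from the review comment; not read from the code.
CURRENT_GATES = {
    "TimeSlicingSettings",
    "MPSSupport",
    "IMEXDaemonsWithDNSNames",
    "PassthroughSupport",
}


def unknown_gates(doc_text: str) -> set[str]:
    """Return gate names referenced as featureGates.<Name> that are not in CURRENT_GATES."""
    referenced = set(re.findall(r"featureGates\.([A-Za-z0-9]+)", doc_text))
    return referenced - CURRENT_GATES


# The flag used in the proposed docs.
DOC_LINE = "--set featureGates.NVMLDeviceHealthCheck=true"
print(sorted(unknown_gates(DOC_LINE)))
```

A more robust version would read the gate names from the driver's own feature gate definitions instead of a hard-coded set.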
| * Healthy - GPU is functioning normally. The GPU may have a non-critical XID error but is still available for workloads. | ||
| * Unhealthy - GPU has a critical XID error and is not suitable for workloads. | ||
| The DRA Driver removes `unhealthy` devices from the available [ResourceSlices](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceslice). |
Two things here:
- Double space before `unhealthy` (minor).
- More importantly, PR #983 (Add taints on health events) changes this behavior -- devices stay in the ResourceSlice with taints attached rather than being removed. This section will need a rewrite to describe the taint-based model once that lands.
| --create-namespace \ | ||
| --set featureGates.NVMLDeviceHealthCheck=true | ||
| ``` | ||
The wiki link still points to the old NVIDIA/k8s-dra-driver-gpu org. Should be kubernetes-sigs/nvidia-dra-driver-gpu.
| kubectl get resourceslice <resourceslice-name> -o yaml | ||
| ``` | ||
| Unhealthy GPUs will not appear in the resource slice list. After the device recovers and is marked healthy again, you must restart the DRA Driver for the device to be added back into the available resources pool. |
This repeats the restart requirement from line 13. Once the taint-based model from #983 lands, the restart story might change too -- worth consolidating into a single place.
A draft attempt at documenting the device health check feature.
This also adds a new /docs folder at the root of the repo as a proposed, more permanent home for docs, replacing the current wiki.