From 310a52fa2f9d7fc127343d2e75f65a6415bf148e Mon Sep 17 00:00:00 2001
From: Harshal Patil <12152047+harche@users.noreply.github.com>
Date: Fri, 27 Mar 2026 10:24:21 -0400
Subject: [PATCH] Document ResourceHealthStatus beta graduation for v1.36

Signed-off-by: Harshal Patil <12152047+harche@users.noreply.github.com>
---
 .../compute-storage-net/device-plugins.md     | 12 +++++++-----
 .../dynamic-resource-allocation.md            | 19 ++++++++++---------
 .../feature-gates/ResourceHealthStatus.md     |  9 ++++++++-
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/content/en/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins.md b/content/en/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins.md
index 2c126d370cd8e..aff8354adccf8 100644
--- a/content/en/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins.md
+++ b/content/en/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins.md
@@ -189,14 +189,16 @@ failed device is to use the [PodResources API](#monitoring-device-plugin-resourc
 
 {{< feature-state feature_gate_name="ResourceHealthStatus" >}}
 
-By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
-will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
-field
-reports health information for each device assigned to the container.
+When the feature gate `ResourceHealthStatus` is enabled (beta and enabled by default since v1.36),
+the field `allocatedResourcesStatus`
+is added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
+field reports health information for each device assigned to the container.
+Each resource health entry can include an optional `message` field with additional
+human-readable context about the health status, such as error details or failure reasons.
 
 For a failed Pod, or where you suspect a fault, you can use this status to understand whether
 the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
-an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
+an over-temperature event, the `allocatedResourcesStatus` field may report this.
 
 
 ## Device plugin deployment
diff --git a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
index e96b2a291e340..1599a4cb6e616 100644
--- a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
+++ b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
@@ -398,21 +398,22 @@ For details about the `status.devices` field, see the
 
 {{< feature-state feature_gate_name="ResourceHealthStatus" >}}
 
-As an alpha feature, Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
-For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy.
-It is also helpful to find out if the device recovers.
+Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
+For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy. It is also helpful to find out if the device recovers.
 
-To enable this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus/)
-must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.
+To use this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled (beta and enabled by default since v1.36), and the DRA driver must implement the `DRAResourceHealth` gRPC service.
 
-When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet.
-This health information is then exposed directly in the Pod's status.
-The kubelet populates the `allocatedResourcesStatus` field in the status of each container,
-detailing the health of each device assigned to that container.
+When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container. Each resource health entry can include an optional `message` field with additional human-readable context about the health status, such as error details or failure reasons.
+
+If the kubelet does not receive a health update from a DRA driver within a timeout period, the device's health status is marked as "Unknown". DRA drivers can configure this timeout on a per-device basis by setting the `health_check_timeout_seconds` field in the `DeviceHealth` gRPC message. If not specified, the kubelet uses a default timeout of 30 seconds. This allows different hardware types (for example, GPUs, FPGAs, or storage devices) to use appropriate timeout values based on their health-reporting characteristics.
 
 This provides crucial visibility for users and controllers to react to hardware failures.
 For a Pod that is failing, you can inspect this status to determine if the failure was related to an unhealthy device.
 
+{{< note >}}
+Device health status is not updated in the Pod status after a Pod has terminated (for example, in Failed state).
+{{< /note >}}
+
 ## Pre-scheduled Pods
 
 When you - or another API client - create a Pod with `spec.nodeName` already set, the scheduler gets bypassed.
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md
index a91cfb69069ee..9ffe7de4a2f4c 100644
--- a/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md
@@ -6,12 +6,19 @@ _build:
   render: false
 
 stages:
-  - stage: alpha 
+  - stage: alpha
     defaultValue: false
     fromVersion: "1.31"
+  - stage: beta
+    defaultValue: true
+    fromVersion: "1.36"
 ---
 Enable the `allocatedResourcesStatus` field within the `.status` for a Pod. The field
 reports additional details for each container in the Pod,
 with the health information for each device assigned to the Pod.
 
+Starting in v1.36 (beta), the health report includes an optional `message` field that
+provides additional human-readable context about the health status, such as error details
+or failure reasons.
+
 This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.
\ No newline at end of file