@@ -189,14 +189,16 @@ failed device is to use the [PodResources API](#monitoring-device-plugin-resourc

{{< feature-state feature_gate_name="ResourceHealthStatus" >}}

-By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
-will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
-field
-reports health information for each device assigned to the container.
+When the feature gate `ResourceHealthStatus` is enabled (beta and enabled by default since v1.36),
+the field `allocatedResourcesStatus`
+is added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
+field reports health information for each device assigned to the container.
+Each resource health entry can include an optional `message` field with additional
+human-readable context about the health status, such as error details or failure reasons.

For a failed Pod, or where you suspect a fault, you can use this status to understand whether
the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
-an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
+an over-temperature event, the `allocatedResourcesStatus` field may report this.
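
As a rough illustration of the inspection described above, the following Go sketch parses a Pod's JSON (for example, the output of `kubectl get pod <name> -o json`) and lists every assigned device whose health is not `Healthy`. Only `allocatedResourcesStatus` and the optional `message` field come from this change; the other JSON key names (`name`, `resources`, `resourceID`, `health`), the health value strings, and the sample Pod data are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceHealth mirrors one device entry. Key names other than "message"
// are assumptions about the Pod status JSON, not taken from this change.
type resourceHealth struct {
	ResourceID string `json:"resourceID"`
	Health     string `json:"health"`
	Message    string `json:"message,omitempty"`
}

// resourceStatus groups device entries under a resource name (assumed shape).
type resourceStatus struct {
	Name      string           `json:"name"`
	Resources []resourceHealth `json:"resources"`
}

type containerStatus struct {
	Name                     string           `json:"name"`
	AllocatedResourcesStatus []resourceStatus `json:"allocatedResourcesStatus"`
}

type pod struct {
	Status struct {
		ContainerStatuses []containerStatus `json:"containerStatuses"`
	} `json:"status"`
}

// unhealthyDevices returns one line per device whose health is not "Healthy".
func unhealthyDevices(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, err
	}
	var out []string
	for _, c := range p.Status.ContainerStatuses {
		for _, rs := range c.AllocatedResourcesStatus {
			for _, r := range rs.Resources {
				if r.Health == "Healthy" {
					continue
				}
				line := fmt.Sprintf("%s: %s/%s is %s", c.Name, rs.Name, r.ResourceID, r.Health)
				if r.Message != "" {
					line += " (" + r.Message + ")"
				}
				out = append(out, line)
			}
		}
	}
	return out, nil
}

// samplePod is made-up data for illustration only.
const samplePod = `{"status":{"containerStatuses":[{"name":"trainer",
  "allocatedResourcesStatus":[{"name":"example.com/gpu","resources":[
    {"resourceID":"gpu-0","health":"Healthy"},
    {"resourceID":"gpu-1","health":"Unhealthy","message":"over-temperature"}]}]}]}}`

func main() {
	lines, err := unhealthyDevices([]byte(samplePod))
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Treating everything other than `Healthy` as worth reporting is a design choice in this sketch, so that devices with an unknown status are surfaced as well.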


## Device plugin deployment
@@ -398,21 +398,22 @@ For details about the `status.devices` field, see the

{{< feature-state feature_gate_name="ResourceHealthStatus" >}}

-As an alpha feature, Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
-For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy.
-It is also helpful to find out if the device recovers.
+Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
+For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy. It is also helpful to find out if the device recovers.

-To enable this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus/)
-must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.
+To use this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled (beta and enabled by default since v1.36), and the DRA driver must implement the `DRAResourceHealth` gRPC service.

-When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet.
-This health information is then exposed directly in the Pod's status.
-The kubelet populates the `allocatedResourcesStatus` field in the status of each container,
-detailing the health of each device assigned to that container.
+When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container. Each resource health entry can include an optional `message` field with additional human-readable context about the health status, such as error details or failure reasons.

+If the kubelet does not receive a health update from a DRA driver within a timeout period, the device's health status is marked as "Unknown". DRA drivers can configure this timeout on a per-device basis by setting the `health_check_timeout_seconds` field in the `DeviceHealth` gRPC message. If not specified, the kubelet uses a default timeout of 30 seconds. This allows different hardware types (for example, GPUs, FPGAs, or storage devices) to use appropriate timeout values based on their health-reporting characteristics.

This provides crucial visibility for users and controllers to react to hardware failures.
For a Pod that is failing, you can inspect this status to determine if the failure was related to an unhealthy device.

+{{< note >}}
+Device health status is not updated in the Pod status after a Pod has terminated (for example, in Failed state).
+{{< /note >}}

## Pre-scheduled Pods

When you - or another API client - create a Pod with `spec.nodeName` already set, the scheduler gets bypassed.
@@ -6,12 +6,19 @@ _build:
render: false

stages:
-  - stage: alpha
+  - stage: alpha
     defaultValue: false
     fromVersion: "1.31"
+  - stage: beta
+    defaultValue: true
+    fromVersion: "1.36"
---
Enable the `allocatedResourcesStatus` field within the `.status` for a Pod. The field
reports additional details for each container in the Pod,
with the health information for each device assigned to the Pod.

+Starting in v1.36 (beta), the health report includes an optional `message` field that
+provides additional human-readable context about the health status, such as error details
+or failure reasons.

This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.