|
/area blog |
|
This PR should target main (all PRs that add blog articles should target main) |
|
/cc @nmn3m |
|
Hi @mortent, we're planning to fold our Resource Health Status feature (KEP-4680) into this umbrella blog post instead of maintaining a separate one (#54534). KEP-4680 is reaching Beta in v1.36. It exposes device health information from Device Plugin and DRA in Pod Status. Let us know if you'd like us to contribute a section or provide any input for the post. |
|
/wg device-management |
|
| author: > | ||
| The DRA team | ||
| --- | ||
|
|
It would be great to include some information on adoption and on the gaps that remain compared to Device Plugin, and maybe a few words on available DRA drivers, so that end users can make sense of this blog post.
There was a problem hiding this comment.
Added a little section about the availability of drivers. I'm a little worried that by mentioning some drivers here, we might be forgetting others that also should be included. But I can ask in the device management chat if someone knows about other drivers that should be included.
I need to think a bit more about the gaps vs Device Plugin.
| more optimal scheduling decisions. To support this capability, the ResourceSlice | ||
| controller toolkit now automatically generates names that reflect the exact device | ||
| ordering specified by the driver author. | ||
|
|
I want to include kubernetes/enhancements#5491 if it's worth putting in the feature blog.
ref: docs PR is #54561
| **List Types for Attributes** | |
| With | |
| [List Types for Attributes](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes), | |
| DRA can represent device attributes as typed lists (int, bool, string, and | |
| version), not just scalar values. This helps model real hardware topology, such | |
| as devices that belong to multiple PCIe roots or NUMA domains. | |
| This feature also extends `ResourceClaim` constraint behavior to work naturally | |
| with both scalar and list values: `matchAttribute` now checks for a non-empty | |
| intersection, and `distinctAttribute` checks for pairwise disjoint values. | |
| It also introduces the `includes()` function in DRA CEL, which lets device selectors keep working | |
| more easily when an attribute changes between scalar and list representations. |
There was a problem hiding this comment.
Sorry I forgot this one, it is definitely worth including. Added your suggestion.
There was a problem hiding this comment.
Similar to my comment on #54567 (comment), do you think we could make it a bit more focused on just the benefits of the feature and leave some of the details to the DRA documentation? And see if we can keep it to a single paragraph?
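For anyone following this thread, the constraint semantics in the suggested text ("non-empty intersection" for `matchAttribute`, "pairwise disjoint" for `distinctAttribute`) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `as_set`, `match_attribute`, and `distinct_attribute` are hypothetical, the intersection is assumed to be taken across all selected devices, and none of this is the scheduler's actual implementation.

```python
from itertools import combinations


def as_set(value):
    """Treat a scalar attribute as a one-element set and a list as a set of its items."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def match_attribute(values):
    """True if every device shares at least one common value (non-empty intersection).

    Assumption: an empty selection is treated as trivially satisfied.
    """
    sets = [as_set(v) for v in values]
    if not sets:
        return True
    return bool(set.intersection(*sets))


def distinct_attribute(values):
    """True if no two devices share any value (pairwise disjoint)."""
    sets = [as_set(v) for v in values]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))
```

Because scalars are wrapped as one-element sets, the same check works whether a driver publishes an attribute as a scalar or as a list, which appears to be the compatibility benefit the suggested text is describing.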
| @@ -0,0 +1,134 @@ | |||
| --- | |||
| layout: blog | |||
| title: "Kubernetes v1.36: DRA has graduated to GA" | |||
Yeah, just forgot to update this when I used the template from a previous post. I've updated the title now, but open to better alternatives.
| devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this | ||
| prevents premature assignments that can lead to Pod failures, ensuring a much more robust | ||
| and predictable deployment process. | ||
|
|
I want to include kubernetes/enhancements#4680 in the feature blog.
ref: docs PR is #54420
| **Resource Health Status (Beta)** | |
| Knowing when a device has failed or become unhealthy is critical for | |
| workloads running on specialized hardware. With | |
| [Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring), | |
| Kubernetes now exposes device health information directly in the Pod | |
| Status through the `allocatedResourcesStatus` field. When a DRA driver | |
| detects that an allocated device has become unhealthy, it reports this | |
| back to the kubelet, which surfaces it in each container's status. | |
| In 1.36, the feature graduates to beta (enabled by default) and adds | |
| an optional `message` field providing human-readable context about the | |
| health status, such as error details or failure reasons. DRA drivers | |
| can also configure per-device health check timeouts, allowing different | |
| hardware types to use appropriate timeout values based on their | |
| health reporting characteristics. This gives users and controllers | |
| crucial visibility to quickly identify and react to hardware failures. |
There was a problem hiding this comment.
So I've added your proposal for now, but do you think we can shorten it a bit and make it just one paragraph? There are a large number of features and we don't want the blog post to be too long. Focus just on the benefits of this feature and what it enables, and leave the details to the DRA docs, which we link to. Also, the fact that it is graduating to beta in 1.36 is already clear from the context.
There was a problem hiding this comment.
Sorry I forgot to add this in the first draft, it is of course something we should include in the blog.
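To make the benefit concrete for readers of this thread, here is a minimal sketch of how a controller or script might scan a Pod object for unhealthy devices. It assumes the Pod has been loaded as a plain dictionary (for example from JSON). Only `allocatedResourcesStatus` and `message` come from the text above; the nested keys `resources`, `resourceID`, and `health`, the `"Healthy"` value, and the function name `unhealthy_devices` are assumptions made for illustration and should be checked against the linked DRA documentation.

```python
def unhealthy_devices(pod):
    """Return (container, resource name, resource ID, message) for each device
    whose reported health is anything other than "Healthy".

    Assumed shape (hypothetical beyond `allocatedResourcesStatus` and `message`):
    status.containerStatuses[].allocatedResourcesStatus[].resources[]
    """
    findings = []
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        for resource_status in container.get("allocatedResourcesStatus", []):
            for resource in resource_status.get("resources", []):
                # Anything not explicitly healthy (including unknown) is reported.
                if resource.get("health") != "Healthy":
                    findings.append((
                        container.get("name"),
                        resource_status.get("name"),
                        resource.get("resourceID"),
                        resource.get("message", ""),
                    ))
    return findings
```

A Pod with no `allocatedResourcesStatus` entries simply yields an empty list, so the helper is safe to run against workloads that do not use devices at all.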
|
|
||
| **DRA Resource Availability Visibility** | ||
|
|
||
| One of the most requested features from cluster administrators has been better visibility |
I'd like to improve this section to mention the actual API name (ResourcePoolStatusRequest), the feature gate, and the alpha status — consistent with how other features in this post reference their API objects and maturity level.
ref: docs PR is #54456
| One of the most requested features from cluster administrators has been better visibility | |
| into hardware capacity. The new | |
| [ResourcePoolStatusRequest](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resource-pool-status) | |
| API (alpha, behind the `DRAResourcePoolStatus` feature gate) allows you to query | |
| the availability of devices in DRA resource pools. By creating a | |
| ResourcePoolStatusRequest object, you get a point-in-time snapshot of device counts | |
| — total, allocated, available, and unavailable — for each pool managed by a given | |
| driver. This enables better integration with dashboards and capacity planning tools. |
There was a problem hiding this comment.
I added your suggestion, but made some small adjustments to make it similar to the other features mentioned. Let me know if you want to make some changes to it.
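As a rough illustration of the dashboard and capacity-planning use case mentioned in the suggestion, the sketch below models a per-pool snapshot with the four counts named in the text (total, allocated, available, unavailable) and flags pools that are close to full. The `PoolSnapshot` class, its field names, and the `nearly_full` helper are hypothetical; they do not mirror the real ResourcePoolStatusRequest schema, which is described in the linked docs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSnapshot:
    """Hypothetical point-in-time device counts for one resource pool."""
    pool: str
    total: int
    allocated: int
    available: int
    unavailable: int


def utilization(snapshot):
    """Fraction of the pool's devices that are currently allocated."""
    if snapshot.total == 0:
        return 0.0
    return snapshot.allocated / snapshot.total


def nearly_full(snapshots, threshold=0.75):
    """Names of pools whose utilization is at or above the threshold."""
    return [s.pool for s in snapshots if utilization(s) >= threshold]
```

Since the snapshot is point-in-time, a tool built this way would need to re-query periodically rather than treat the counts as live values.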
| **Device Binding Conditions (Beta)** | ||
|
|
||
| To improve scheduling reliability, the Kubernetes scheduler can now use the | ||
| [Binding Conditions](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations) | ||
| feature to delay committing a Pod to a Node until its required external resources—such as attachable | ||
| devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this | ||
| prevents premature assignments that can lead to Pod failures, ensuring a much more robust | ||
| and predictable deployment process. |
The Device Binding Conditions part looks good to me.
|
/remove-area localization |
| **ResourceClaim Support for Workloads** | ||
|
|
||
| To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the | ||
| [ResourceClaim Support for Workloads](add_link_here) |
This section looks good, thanks!
Right now I'm anticipating the link to be /docs/concepts/scheduling-eviction/dynamic-resource-allocation/#workload-resource-claims. I'll follow up once #54596 merges.
|
|
||
| Why should DRA only be for external accelerators? In v1.36, we are introducing the first | ||
| iterations of using the DRA API to manage Kubernetes native resources (like CPU and | ||
| Memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA |
nit: I don't think this needs to be capitalized.
| Memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA | |
| memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA |
|
|
||
| With | ||
| [List Types for Attributes](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes), | ||
| DRA can represent device attributes as typed lists (int, bool, string, and version), not |
nit to match the names of the new fields:
| DRA can represent device attributes as typed lists (int, bool, string, and version), not | |
| DRA can represent device attributes as typed lists (`ints`, `bools`, `strings`, and `versions`), not |
| just scalar values. This helps model real hardware topology, such as devices that belong | ||
| to multiple PCIe roots or NUMA domains. | ||
|
|
||
| This feature also extends `ResourceClaim` constraint behavior to work naturally |
nit: This shouldn't be formatted as code:
| This feature also extends `ResourceClaim` constraint behavior to work naturally | |
| This feature also extends ResourceClaim constraint behavior to work naturally |
| It also introduces the `includes()` function in DRA CEL, which lets device selectors keep working | ||
| more easily when an attribute changes between scalar and list representations. | ||
|
|
||
| **Device Allocation Ordering through Lexicographical Ordering** |
The section title feels a bit repetitive. Would something like 'Lexicographical Device Allocation' or 'Lexicographical Ordering for Device Allocation' be better?
The content looks good to me. Thank you!
Description
This is a PR for the blog post covering DRA updates for 1.36. We plan a single blog post covering all DRA updates rather than individual blog posts for each feature.