Blog post for DRA updates in 1.36 #54567

Open
mortent wants to merge 2 commits into kubernetes:main from mortent:DRABlog136

Conversation

@mortent
Member

@mortent mortent commented Feb 20, 2026

Description

This is a PR for the blog post covering DRA updates for 1.36. We plan a single blog post covering all DRA updates rather than individual blog posts for each feature.

Issue

@k8s-ci-robot k8s-ci-robot added this to the 1.36 milestone Feb 20, 2026
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 20, 2026
@netlify

netlify bot commented Feb 20, 2026

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit b0eea65
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-main-staging/deploys/69cc6945194c660007b864a8
😎 Deploy Preview https://deploy-preview-54567--kubernetes-io-main-staging.netlify.app

@lmktfy
Member

lmktfy commented Feb 21, 2026

/area blog

@k8s-ci-robot k8s-ci-robot added the area/blog Issues or PRs related to the Kubernetes Blog subproject label Feb 21, 2026
@lmktfy
Member

lmktfy commented Feb 21, 2026

This PR should target main (all PRs that add blog articles should target main)

@nmn3m
Member

nmn3m commented Feb 25, 2026

/cc @nmn3m

@harche
Contributor

harche commented Mar 9, 2026

Hi @mortent, we're planning to fold our Resource Health Status feature (KEP-4680) into this umbrella blog post instead of maintaining a separate one (#54534).

KEP-4680 is reaching Beta in v1.36. It exposes device health information from Device Plugin and DRA in Pod Status. Let us know if you'd like us to contribute a section or provide any input for the post.

@k8s-ci-robot k8s-ci-robot added area/localization General issues or PRs related to localization language/en Issues or PRs related to English language language/ja Issues or PRs related to Japanese language language/ko Issues or PRs related to Korean language language/pl Issues or PRs related to Polish language size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. language/zh Issues or PRs related to Chinese language sig/docs Categorizes an issue or PR as relevant to SIG Docs. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 17, 2026
@mortent mortent changed the base branch from dev-1.36 to main March 17, 2026 22:26
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 17, 2026
@mortent
Member Author

mortent commented Mar 17, 2026

/wg device-management

@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Mar 18, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lmktfy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 26, 2026
@mortent mortent changed the title [WIP] Blog post for DRA updates in 1.36 Blog post for DRA updates in 1.36 Mar 26, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 26, 2026
author: >
The DRA team
---

Member

It would be great to include some information on adoption and on the gaps that remain compared to Device Plugin, and maybe a few words on available DRA drivers, so that end users can make sense of this blog post.

Member Author

Added a little section about the availability of drivers. I'm a little worried that by mentioning some drivers here, we might be forgetting others that also should be included. But I can ask in the device management chat if someone knows about other drivers that should be included.

I need to think a bit more about the gaps vs Device Plugin.

more optimal scheduling decisions. To support this capability, the ResourceSlice
controller toolkit now automatically generates names that reflect the exact device
ordering specified by the driver author.
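
To illustrate why generated names matter for lexicographical ordering, here is a minimal sketch. It is not the ResourceSlice controller toolkit's actual naming scheme; the `ordered_names` helper and the `gpu-pool` prefix are hypothetical. It only shows that plain numeric suffixes do not sort in the intended order, while zero-padded suffixes do.

```python
def ordered_names(prefix, count):
    """Return names whose lexicographic order matches their index order.

    Hypothetical helper: the zero-padding scheme is an assumption made for
    illustration, not the naming used by the real toolkit.
    """
    # Pad every index to the width of the largest index.
    width = len(str(max(count - 1, 0)))
    return [f"{prefix}-{i:0{width}d}" for i in range(count)]


# Unpadded suffixes: "gpu-pool-10" sorts before "gpu-pool-2".
naive = [f"gpu-pool-{i}" for i in range(12)]

# Padded suffixes keep the order the driver author intended.
padded = ordered_names("gpu-pool", 12)

print(sorted(naive) == naive)
print(sorted(padded) == padded)
```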

Contributor

@everpeace everpeace Mar 26, 2026

I want to include kubernetes/enhancements#5491 if it's worth putting in the feature blog.

ref: docs PR is #54561

Suggested change
**List Types for Attributes**
With
[List Types for Attributes](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes),
DRA can represent device attributes as typed lists (int, bool, string, and
version), not just scalar values. This helps model real hardware topology, such
as devices that belong to multiple PCIe roots or NUMA domains.
This feature also extends `ResourceClaim` constraint behavior to work naturally
with both scalar and list values: `matchAttribute` now checks for a non-empty
intersection, and `distinctAttribute` checks for pairwise disjoint values.
It also introduces the `includes()` function in DRA CEL, which makes it easier for device
selectors to keep working when an attribute changes between scalar and list representations.
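
The constraint semantics described in this suggestion can be sketched in a few lines of Python. This is an illustration only, not the scheduler's implementation: the helper names are hypothetical, a scalar is assumed to behave like a one-element list, and the intersection is assumed to be taken across all devices in the constraint.

```python
from itertools import combinations


def as_set(value):
    """Normalize an attribute value: a scalar is treated as a one-element list."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def match_attribute(values):
    """True when the devices share at least one value (non-empty intersection)."""
    sets = [as_set(v) for v in values]
    if not sets:
        return True
    return bool(set.intersection(*sets))


def distinct_attribute(values):
    """True when no two devices share a value (pairwise disjoint)."""
    sets = [as_set(v) for v in values]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


def includes(attribute, wanted):
    """Hypothetical stand-in for the CEL helper: works for scalars and lists alike."""
    return wanted in as_set(attribute)
```

For example, `match_attribute(["pci0", ["pci0", "pci1"]])` holds because both devices share `pci0`, while `distinct_attribute(["numa0", ["numa0", "numa1"]])` does not because `numa0` appears on both devices.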

Member Author

Sorry I forgot this one, it is definitely worth including. Added your suggestion.

Member Author

Similar to my comment on #54567 (comment), do you think we could make it a bit more focused on just the benefits of the feature and leave some of the details to the DRA documentation? And see if we can keep it to a single paragraph?

@@ -0,0 +1,134 @@
---
layout: blog
title: "Kubernetes v1.36: DRA has graduated to GA"
Member

Isn't it already GA?

Member Author

Yeah, just forgot to update this when I used the template from a previous post. I've updated the title now, but open to better alternatives.

devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this
prevents premature assignments that can lead to Pod failures, ensuring a much more robust
and predictable deployment process.

Contributor

I want to include kubernetes/enhancements#4680 in the feature blog.

ref: docs PR is #54420

Suggested change
**Resource Health Status (Beta)**
Knowing when a device has failed or become unhealthy is critical for
workloads running on specialized hardware. With
[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring),
Kubernetes now exposes device health information directly in the Pod
Status through the `allocatedResourcesStatus` field. When a DRA driver
detects that an allocated device has become unhealthy, it reports this
back to the kubelet, which surfaces it in each container's status.
In 1.36, the feature graduates to beta (enabled by default) and adds
an optional `message` field providing human-readable context about the
health status, such as error details or failure reasons. DRA drivers
can also configure per-device health check timeouts, allowing different
hardware types to use appropriate timeout values based on their
health reporting characteristics. This gives users and controllers
crucial visibility to quickly identify and react to hardware failures.
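
As a rough sketch of how a controller or script might consume this, the snippet below walks a Pod object (as a plain dictionary) and reports devices that are not healthy. The `allocatedResourcesStatus` and `message` fields come from the text above; the nested field names (`resources`, `resourceID`, `health`) and the health values are assumptions to verify against the Pod API reference, and the sample data is entirely made up.

```python
def devices_needing_attention(pod):
    """Return (container, resource, device, health, message) for non-healthy devices."""
    findings = []
    containers = pod.get("status", {}).get("containerStatuses", [])
    for container in containers:
        for resource in container.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                # Assumed health values: "Healthy", "Unhealthy", "Unknown".
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name"),
                        resource.get("name"),
                        device.get("resourceID"),
                        health,
                        device.get("message", ""),
                    ))
    return findings


# Made-up example data; names and IDs are placeholders, not real output.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "trainer",
                "allocatedResourcesStatus": [
                    {
                        "name": "example.com/gpu",
                        "resources": [
                            {"resourceID": "gpu-0", "health": "Healthy"},
                            {
                                "resourceID": "gpu-1",
                                "health": "Unhealthy",
                                "message": "example failure reason",
                            },
                        ],
                    }
                ],
            }
        ]
    }
}

for finding in devices_needing_attention(sample_pod):
    print(finding)
```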

Member Author

So I've added your proposal for now, but do you think we can shorten it a bit and make it just one paragraph? There is a large number of features and we don't want the blog post to be too long. Focus just on the benefits of this feature and what it enables and leave the details to the DRA docs which we link to. Also, including that it is graduating to beta in 1.36 is already given from the context.

Member Author

Sorry I forgot to add this in the first draft, it is of course something we should include in the blog.


**DRA Resource Availability Visibility**

One of the most requested features from cluster administrators has been better visibility
Member

I'd like to improve this section to mention the actual API name (ResourcePoolStatusRequest), the feature gate, and the alpha status — consistent with how other features in this post reference their API objects and maturity level.

ref: docs PR is #54456

Suggested change
One of the most requested features from cluster administrators has been better visibility
One of the most requested features from cluster administrators has been better visibility
into hardware capacity. The new
[ResourcePoolStatusRequest](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resource-pool-status)
API (alpha, behind the `DRAResourcePoolStatus` feature gate) allows you to query
the availability of devices in DRA resource pools. By creating a
ResourcePoolStatusRequest object, you get a point-in-time snapshot of device counts
— total, allocated, available, and unavailable — for each pool managed by a given
driver. This enables better integration with dashboards and capacity planning tools.

Member Author

I added your suggestion, but made some small adjustments to make it similar to the other features mentioned. Let me know if you want to make some changes to it.

Comment on lines +65 to +72
**Device Binding Conditions (Beta)**

To improve scheduling reliability, the Kubernetes scheduler can now use the
[Binding Conditions](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)
feature to delay committing a Pod to a Node until its required external resources—such as attachable
devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this
prevents premature assignments that can lead to Pod failures, ensuring a much more robust
and predictable deployment process.
Contributor

The Device Binding Conditions part looks good to me.

@lmktfy
Member

lmktfy commented Mar 31, 2026

/remove-area localization
/remove-language ja
/remove-language ko
/remove-language pl
/remove-language zh

@k8s-ci-robot k8s-ci-robot removed area/localization General issues or PRs related to localization language/ja Issues or PRs related to Japanese language language/ko Issues or PRs related to Korean language language/pl Issues or PRs related to Polish language language/zh Issues or PRs related to Chinese language labels Mar 31, 2026
**ResourceClaim Support for Workloads**

To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the
[ResourceClaim Support for Workloads](add_link_here)
Contributor

This section looks good, thanks!

Right now I'm anticipating the link to be /docs/concepts/scheduling-eviction/dynamic-resource-allocation/#workload-resource-claims. I'll follow up once #54596 merges.


Why should DRA only be for external accelerators? In v1.36, we are introducing the first
iterations of using the DRA API to manage Kubernetes native resources (like CPU and
Memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA
Contributor

nit: I don't think this needs to be capitalized.

Suggested change
Memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA
memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA


With
[List Types for Attributes](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes),
DRA can represent device attributes as typed lists (int, bool, string, and version), not
Contributor

nit to match the names of the new fields:

Suggested change
DRA can represent device attributes as typed lists (int, bool, string, and version), not
DRA can represent device attributes as typed lists (`ints`, `bools`, `strings`, and `versions`), not

just scalar values. This helps model real hardware topology, such as devices that belong
to multiple PCIe roots or NUMA domains.

This feature also extends `ResourceClaim` constraint behavior to work naturally
Contributor

nit: This shouldn't be formatted as code:

Suggested change
This feature also extends `ResourceClaim` constraint behavior to work naturally
This feature also extends ResourceClaim constraint behavior to work naturally

It also introduces `includes()` function in DRA CEL, which lets device selectors keep working
more easily when an attribute changes between scalar and list representations.

**Device Allocation Ordering through Lexicographical Ordering**

The section title feels a bit repetitive. Would something like 'Lexicographical Device Allocation' or 'Lexicographical Ordering for Device Allocation' be better?

The content looks good to me. Thank you!

Labels

area/blog Issues or PRs related to the Kubernetes Blog subproject cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status: 🏗 In progress
