diff --git a/content/en/blog/_posts/2026/dra-136-update.md b/content/en/blog/_posts/2026/dra-136-update.md new file mode 100644 index 0000000000000..03b8e2891a002 --- /dev/null +++ b/content/en/blog/_posts/2026/dra-136-update.md @@ -0,0 +1,172 @@ +--- +layout: blog +title: "Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA" +slug: dra-136-updates +draft: true +date: XXXX-XX-XX +author: > + The DRA team +--- + +Dynamic Resource Allocation (DRA) has fundamentally changed how we handle hardware +accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA +continues to mature, bringing a wave of feature graduations, critical usability +improvements, and new capabilities that extends the flexibility of DRA to native +resources like memory and CPU, and support for ResourceClaims in PodGroups. + +We have also seen significant momentum in driver availability. Both the +[NVIDIA GPU](https://github.com/NVIDIA/k8s-dra-driver-gpu) +and Google TPU DRA drivers are being transferred to the Kubernetes project, joining the +[DRANET](https://github.com/kubernetes-sigs/dranet) +driver that was added last year. + +Whether you are managing massive fleets of GPUs, need better handling of failures, +or simply looking for better ways to define resource fallback options, the upgrades +to DRA in 1.36 have something for you. Let's dive into the new features and graduations! + +## Feature graduations + +The community has been hard at work stabilizing core DRA concepts. In Kubernetes 1.36, +several highly anticipated features have graduated to Beta and Stable. + +**Prioritized List (Stable)** + +Hardware heterogeneity is a reality in most clusters. With the +[Prioritized List](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#prioritized-list) +feature, you can confidently define fallback preferences when requesting +devices. Instead of hardcoding a request for a specific device model, you can specify an +ordered list of preferences (e.g., "Give me an H100, but if none are available, fall back +to an A100"). The scheduler will evaluate these requests in order, drastically improving +scheduling flexibility and cluster utilization. + +**Extended Resource Support (Beta)** + +As DRA becomes the standard for resource allocation, bridging the gap with legacy systems +is crucial. The DRA +[Extended Resource](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#extended-resource) +feature allows users to request resources via traditional extended resources on a Pod. +This allows for a gradual transition to DRA, meaning application developers and +operators are not forced to immediately migrate their workloads to the ResourceClaim +API. + +**Partitionable Devices (Beta)** + +Hardware accelerators are powerful, and sometimes a single workload doesn't need an +entire device. The +[Partitionable Devices](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#partitionable-devices) +feature, provides native DRA support for carving physical hardware into smaller, +logical instances (such as Multi-Instance GPUs). This allows administrators to +safely and efficiently share expensive accelerators across multiple Pods. + +**Device Taints (Beta)** + +Just as you can taint a Kubernetes Node, you can now apply taints directly to specific DRA +devices. +[Device Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations) +empower cluster administrators to manage hardware more effectively. You can taint faulty +devices to prevent them from being allocated to standard claims, or reserve specific hardware +for dedicated teams, specialized workloads, and experiments. Ultimately, only Pods with +matching tolerations are permitted to claim these tainted devices. + +**Device Binding Conditions (Beta)** + +To improve scheduling reliability, the Kubernetes scheduler can now use the +[Binding Conditions](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations) +feature to delay committing a Pod to a Node until its required external resources—such as attachable +devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this +prevents premature assignments that can lead to Pod failures, ensuring a much more robust +and predictable deployment process. + +**Resource Health Status (Beta)** + +Knowing when a device has failed or become unhealthy is critical for workloads running on +specialized hardware. With +[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring), +Kubernetes now exposes device health information directly in the Pod Status through the +`allocatedResourcesStatus` field. When a DRA driver detects that an allocated device +has become unhealthy, it reports this back to the kubelet, which surfaces it in each +container's status. + +In 1.36, the feature graduates to beta (enabled by default) and adds an optional `message` +field providing human readable context about the health status, such as error details or +failure reasons. DRA drivers can also configure per device health check timeouts, +allowing different hardware types to use appropriate timeout values based on their +health reporting characteristics. This gives users and controllers crucial visibility +to quickly identify and react to hardware failures. + +## New Features + +Beyond stabilizing existing capabilities, v1.36 introduces foundational new features +that expand what DRA can do. + +**ResourceClaim Support for Workloads** + +To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the +[ResourceClaim Support for Workloads](add_link_here) +feature enables Kubernetes to seamlessly manage shared resources across massive sets +of Pods. By associating ResourceClaims or ResourceClaimTemplates with PodGroups, +this feature eliminates previous scaling bottlenecks, such as the limit on the +number of pods that can share a claim, and removes the burden of manual claim +management from specialized orchestrators. + +**DRA for Native Resources** + +Why should DRA only be for external accelerators? In v1.36, we are introducing the first +iterations of using the DRA API to manage Kubernetes native resources (like CPU and +Memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA +[Native Resources](add_link_here) +feature, users can leverage DRA's advanced placement, NUMA-awareness, and prioritization +semantics for standard compute resources, paving the way for incredibly fine-grained +performance tuning. + +**DRA Resource Availability Visibility** + +One of the most requested features from cluster administrators has been better visibility +into hardware capacity. The new +[DRAResourcePoolStatus](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resource-pool-status) +feature allows you to query the availability of devices in DRA resource pools. By creating a +`ResourcePoolStatusRequest` object, you get a point-in-time snapshot of device counts +— total, allocated, available, and unavailable — for each pool managed by a given +driver. This enables better integration with dashboards and capacity planning tools. + +**List Types for Attributes** + +With +[List Types for Attributes](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes), +DRA can represent device attributes as typed lists (int, bool, string, and version), not +just scalar values. This helps model real hardware topology, such as devices that belong +to multiple PCIe roots or NUMA domains. + +This feature also extends `ResourceClaim` constraint behavior to work naturally +with both scalar and list values: `matchAttribute` now checks for a non-empty +intersection, and `distinctAttribute` checks for pairwise disjoint values. +It also introduces `includes()` function in DRA CEL, which lets device selectors keep working +more easily when an attribute changes between scalar and list representations. + +**Device Allocation Ordering through Lexicographical Ordering** + +The Kubernetes scheduler has been updated to evaluate devices using lexicographical +ordering based on resource pool and ResourceSlice names. This change empowers drivers +to proactively influence the scheduling process, leading to improved throughput and +more optimal scheduling decisions. To support this capability, the ResourceSlice +controller toolkit now automatically generates names that reflect the exact device +ordering specified by the driver author. + +## What’s next? + +This cycle introduced a wealth of new DRA features, and the momentum continues. +Our focus remains on progressing existing features toward beta and stable releases +while enhancing DRA's performance, scalability, and reliability. Additionally, +integrating DRA with Workload-Aware and Topology-Aware Scheduling will be a key +priority over the coming releases. + + +## Getting involved + +A good starting point is joining the WG Device Management +[Slack channel](https://kubernetes.slack.com/archives/C0409NGC1TK) and +[meetings](https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?tab=t.0#heading=h.tgg8gganowxq), +which happen at US/EU and EU/APAC friendly time slots. + +Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! +We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers. \ No newline at end of file