From a46086c76f41b2ab90ff01061d27efc95df2b499 Mon Sep 17 00:00:00 2001
From: Jon Huhn
Date: Sun, 22 Feb 2026 20:42:14 -0600
Subject: [PATCH] Docs update for KEP-5729: DRA: ResourceClaim Support for
 Workloads

---
 .../dynamic-resource-allocation.md            | 136 +++++++++++++++++-
 .../concepts/workloads/workload-api/_index.md |  33 +++++
 .../DRAWorkloadResourceClaims.md              |  18 +++
 .../en/docs/reference/glossary/podgroup.md    |  17 +++
 .../glossary/resourceclaimtemplate.md         |  13 +-
 5 files changed, 209 insertions(+), 8 deletions(-)
 create mode 100644 content/en/docs/reference/command-line-tools-reference/feature-gates/DRAWorkloadResourceClaims.md
 create mode 100644 content/en/docs/reference/glossary/podgroup.md

diff --git a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
index 3166a86c7816b..ad13b7bf9216c 100644
--- a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
+++ b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
@@ -164,6 +164,15 @@ The method that you use depends on your requirements, as follows:
   separate, similarly-configured devices. Kubernetes generates ResourceClaims
   from the specification in the ResourceClaimTemplate. The lifetime of each
   generated ResourceClaim is bound to the lifetime of the corresponding Pod.
+* [**PodGroup ResourceClaimTemplate**](#workload-resource-claims): you want
+  {{< glossary_tooltip text="PodGroups" term_id="podgroup" >}} to have
+  independent access to separate, similarly-configured devices that can be
+  shared by their Pods. Kubernetes generates one ResourceClaim for the PodGroup
+  from the specification in the ResourceClaimTemplate. The lifetime of each
+  generated ResourceClaim is bound to the lifetime of the corresponding
+  PodGroup.
+  This requires the
+  [`DRAWorkloadResourceClaims`](/docs/reference/command-line-tools-reference/feature-gates/#DRAWorkloadResourceClaims)
+  feature gate to be enabled.

When you define a workload, you can use
{{< glossary_tooltip term_id="cel" text="Common Expression Language (CEL)" >}}
@@ -178,7 +187,7 @@ references it.

You can reference an auto-generated ResourceClaim in a Pod, but this isn't
recommended because auto-generated ResourceClaims are bound to the lifetime of
-the Pod that triggered the generation.
+the Pod or PodGroup that triggered the generation.

To learn how to claim resources using one of these methods, see
[Allocate Devices to Workloads with DRA](/docs/tasks/configure-pod-container/assign-resources/allocate-devices-dra/).

@@ -237,6 +246,128 @@ The decision is made on a per-Pod basis, so if the Pod is a member of a ReplicaS
similar grouping, you cannot rely on all the members of the group having the
same subrequest chosen. Your workload must be able to accommodate this.

+#### Workload ResourceClaims {#workload-resource-claims}
+
+{{< feature-state feature_gate_name="DRAWorkloadResourceClaims" >}}
+
+When you organize Pods with the
+[Workload API](/docs/concepts/workloads/workload-api/),
+you can reserve ResourceClaims for entire
+{{< glossary_tooltip text="PodGroups" term_id="podgroup" >}}
+instead of individual Pods, and generate ResourceClaims from
+ResourceClaimTemplates for a PodGroup instead of a single Pod. This allows the
+Pods within a PodGroup to share access to the devices allocated to the
+generated ResourceClaim.
+
+This feature targets two problems:
+
+- The ResourceClaim API's `status.reservedFor` list can only contain 256 items.
+  Since kube-scheduler only records individual Pods in that list, only 256 Pods
+  can share a ResourceClaim. By allowing PodGroups to be recorded in
+  `status.reservedFor`, many more than 256 Pods can share a ResourceClaim.
+- Pods can only share a ResourceClaim when its exact name is known.
+  For complex
+  workloads that replicate _groups_ of Pods, ResourceClaims shared by the Pods
+  in each group need to be created and deleted explicitly when the set of
+  groups scales up and down. By generating ResourceClaims for each PodGroup, a
+  single ResourceClaimTemplate can form the basis for ResourceClaims that are
+  both replicated automatically and shareable among the Pods in a PodGroup.
+
+The PodGroup API defines a `spec.resourceClaims` field with the same structure
+as, and a similar meaning to, the `spec.resourceClaims` field in the Pod API:

```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-group
  namespace: some-ns
spec:
  ...
  resourceClaims:
  - name: pg-claim
    resourceClaimName: my-pg-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-pg-template
```

Like claims made by Pods, claims for PodGroups defining a `resourceClaimName`
refer to a ResourceClaim by name. Claims defining a `resourceClaimTemplateName`
refer to a ResourceClaimTemplate, from which Kubernetes generates one
ResourceClaim for the entire PodGroup that can be shared among its Pods.

When a Pod defines a claim whose `name`, `resourceClaimName`, and
`resourceClaimTemplateName` all match one of its PodGroup's
`spec.resourceClaims`, kube-scheduler reserves the ResourceClaim for the
PodGroup instead of the Pod. If the Pod's claim does not match one made by its
PodGroup, then kube-scheduler reserves the ResourceClaim for the Pod. In either
case, the reservation is recorded in the ResourceClaim's `status.reservedFor`.
PodGroup reservations persist in the ResourceClaim until the PodGroup is
deleted, even if the group no longer has any Pods.

When a Pod claim matching a PodGroup claim defines a
`resourceClaimTemplateName`, then one ResourceClaim is generated for the
PodGroup.
Other Pods in the group that define the same claim share that
generated ResourceClaim instead of prompting a new ResourceClaim to be generated
for each Pod. Whether or not a `resourceClaimTemplateName` claim matches a
PodGroup claim, the name of the generated ResourceClaim is recorded in the Pod's
`status.resourceClaimStatuses`.

ResourceClaims generated from a ResourceClaimTemplate for a
PodGroup follow the lifecycle of the PodGroup. The ResourceClaim is first
created when both the PodGroup and its ResourceClaimTemplate exist. The
ResourceClaim is deleted after the PodGroup has been deleted and the
ResourceClaim is no longer reserved.

Consider the following example:

```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-group
  namespace: some-ns
spec:
  ...
  resourceClaims:
  - name: pg-claim
    resourceClaimName: my-pg-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-pg-template
---
apiVersion: v1
kind: Pod
metadata:
  name: training-group-pod-1
  namespace: some-ns
spec:
  ...
  schedulingGroup:
    podGroupName: training-group
  resourceClaims:
  - name: pod-claim
    resourceClaimName: my-pod-claim
  - name: pod-claim-template
    resourceClaimTemplateName: my-pod-template
  - name: pg-claim
    resourceClaimName: my-pg-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-pg-template
```

In this example, the `training-group` PodGroup has one Pod named `training-group-pod-1`.
The Pod's `pod-claim` and `pod-claim-template` claims do not match
any claim made by the PodGroup, so those claims are not affected by the
PodGroup: ResourceClaim `my-pod-claim` becomes reserved for the Pod, and a
ResourceClaim is generated from ResourceClaimTemplate `my-pod-template` that
also becomes reserved for the Pod. The `pg-claim` and `pg-claim-template` claims
do match claims made by the PodGroup.
ResourceClaim `my-pg-claim` becomes reserved for
the PodGroup, and a ResourceClaim is generated from ResourceClaimTemplate
`my-pg-template` that also becomes reserved for the PodGroup.

Associating ResourceClaims with Workload API resources is an *alpha feature*
that is only available when the `DRAWorkloadResourceClaims`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
is enabled in the kube-apiserver, kube-controller-manager, kube-scheduler, and kubelet.
+
 ### ResourceSlice {#resourceslice}

Each ResourceSlice represents one or more
@@ -333,8 +464,7 @@ dynamic resource allocation.
  references to ResourceClaimTemplates or to specific ResourceClaims.

* If the workload uses a ResourceClaimTemplate, a controller named the
-  `resourceclaim-controller` generates ResourceClaims for every Pod in the
-  workload.
+  `resourceclaim-controller` generates ResourceClaims for the workload.
* If the workload uses a specific ResourceClaim, Kubernetes checks whether
  that ResourceClaim exists in the cluster. If the ResourceClaim doesn't exist,
  the Pods won't deploy.
diff --git a/content/en/docs/concepts/workloads/workload-api/_index.md b/content/en/docs/concepts/workloads/workload-api/_index.md
index 80d8075d87dc8..1dcc0a2fda5c9 100644
--- a/content/en/docs/concepts/workloads/workload-api/_index.md
+++ b/content/en/docs/concepts/workloads/workload-api/_index.md
@@ -64,6 +64,39 @@ The `controllerRef` field links the Workload back to the specific high-level obj
such as a [Job](/docs/concepts/workloads/controllers/job/) or a custom CRD. This
is useful for observability and tooling. This data is not used to schedule or manage the Workload.
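As an illustration of the `controllerRef` linkage described above, a Workload owned by a Job might look like the following sketch. The `controllerRef` sub-field names shown here are assumptions for illustration only; consult the Workload API reference for the exact schema.

```yaml
# Illustrative sketch only; field shapes are assumed, not authoritative.
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-workload
  namespace: some-ns
spec:
  # Points back at the high-level object that owns this Workload.
  # Used for observability and tooling, not for scheduling.
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
```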
+### Requesting DRA devices for a PodGroup
+
+{{< feature-state feature_gate_name="DRAWorkloadResourceClaims" >}}
+
+{{< glossary_tooltip text="Devices" term_id="device" >}} available through
+{{< glossary_tooltip text="Dynamic Resource Allocation (DRA)" term_id="dra" >}}
+can be requested by a PodGroup through its `spec.resourceClaims` field:

```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-group
  namespace: some-ns
spec:
  ...
  resourceClaims:
  - name: pg-claim
    resourceClaimName: my-pg-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-pg-template
```

+{{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}}
+associated with PodGroups can be shared by more than 256 Pods.
+ResourceClaims can also be generated from
+{{< glossary_tooltip text="ResourceClaimTemplates" term_id="resourceclaimtemplate" >}}
+for each PodGroup, allowing the devices allocated to each generated
+ResourceClaim to be shared by the Pods in each PodGroup.
+
+For more details and a more complete example, see the
+[DRA documentation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#workload-resource-claims).
+
 ## {{% heading "whatsnext" %}}

* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/DRAWorkloadResourceClaims.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/DRAWorkloadResourceClaims.md
new file mode 100644
index 0000000000000..3266f6f86e7af
--- /dev/null
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates/DRAWorkloadResourceClaims.md
@@ -0,0 +1,18 @@
---
title: DRAWorkloadResourceClaims
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.36"
---

Enables PodGroup resources from the
[Workload API](/docs/concepts/workloads/workload-api/) to make requests for
devices through
[Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
that can be shared by their member Pods.
diff --git a/content/en/docs/reference/glossary/podgroup.md b/content/en/docs/reference/glossary/podgroup.md
new file mode 100644
index 0000000000000..1c5d4a29f79e3
--- /dev/null
+++ b/content/en/docs/reference/glossary/podgroup.md
@@ -0,0 +1,17 @@
---
title: PodGroup
id: podgroup
full_link: /docs/concepts/workloads/workload-api/#pod-groups
short_description: >
  A PodGroup represents a set of Pods with common policy and configuration.

aka:
tags:
- core-object
- workload
---
A PodGroup is a runtime object that represents a group of Pods scheduled
together as a single unit. While the
[Workload API](/docs/concepts/workloads/workload-api/) defines scheduling policy
templates, PodGroups are the runtime counterparts that carry both the policy and
the scheduling status for a specific instance of that group.
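Because `DRAWorkloadResourceClaims` is an alpha gate, it must be switched on explicitly in every relevant component before the PodGroup behavior above takes effect. As one hedged example, a local test cluster built with kind can enable it for all components through the cluster configuration (the gate name comes from this patch; everything else is standard kind configuration):

```yaml
# Sketch of a kind cluster config that turns the gate on cluster-wide.
# kind propagates entries in featureGates to the control-plane components
# and kubelets it starts.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  DRAWorkloadResourceClaims: true
nodes:
- role: control-plane
- role: worker
```

On a manually managed cluster, the equivalent is passing `--feature-gates=DRAWorkloadResourceClaims=true` to kube-apiserver, kube-controller-manager, kube-scheduler, and the kubelet.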
diff --git a/content/en/docs/reference/glossary/resourceclaimtemplate.md b/content/en/docs/reference/glossary/resourceclaimtemplate.md
index abca60b16102d..065313b1530aa 100644
--- a/content/en/docs/reference/glossary/resourceclaimtemplate.md
+++ b/content/en/docs/reference/glossary/resourceclaimtemplate.md
@@ -4,20 +4,23 @@ id: resourceclaimtemplate
 full_link: /docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceclaims-templates
 short_description: >
   Defines a template for Kubernetes to create ResourceClaims. Used to provide
-  per-Pod access to separate, similar resources.
+  per-Pod or per-PodGroup access to separate, similar resources.

 tags:
 - workload
---
 Defines a template that Kubernetes uses to create
-{{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}}.
+{{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}}.

 ResourceClaimTemplates are used in
 [dynamic resource allocation (DRA)](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
-to provide _per-Pod access to separate, similar resources_.
+to provide _per-Pod or per-{{< glossary_tooltip text="PodGroup" term_id="podgroup" >}} access to separate, similar resources_.

 When a ResourceClaimTemplate is referenced in a workload specification,
 Kubernetes automatically creates ResourceClaim objects based on the template.
-Each ResourceClaim is bound to a specific Pod. When the Pod terminates,
-Kubernetes deletes the corresponding ResourceClaim.
+Each ResourceClaim is bound to a specific Pod or PodGroup. When the Pod
+terminates or the PodGroup is deleted, Kubernetes deletes the corresponding
+ResourceClaim. PodGroup ResourceClaimTemplates require the
+[`DRAWorkloadResourceClaims`](/docs/reference/command-line-tools-reference/feature-gates/#DRAWorkloadResourceClaims)
+feature gate to be enabled.
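For reference, a minimal ResourceClaimTemplate of the kind referenced in the examples above (`my-pg-template`) might look like the following sketch. The names, namespace, and device class are illustrative, and the API version shown is the `resource.k8s.io/v1` shape; check the DRA API reference for the version served by your cluster.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: my-pg-template
  namespace: some-ns
spec:
  # The inner spec is copied into each generated ResourceClaim.
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com
```

Each ResourceClaim generated from this template gets its own allocation of the requested devices, which all Pods in the owning PodGroup can then share.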