diff --git a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md index 3166a86c7816b..d5499760e0778 100644 --- a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md +++ b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md @@ -530,6 +530,117 @@ Extended resource allocation by DRA is a *beta feature* and is enabled by defaul `DRAExtendedResource` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAExtendedResource) in the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet. +### Device binding conditions + +{{< feature-state feature_gate_name="DRADeviceBindingConditions" >}} + +Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until +external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed +to be ready. + +This waiting behavior is implemented in the +[PreBind phase](/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind) +of the scheduling framework. +During this phase, the scheduler checks whether all required device conditions are +satisfied before proceeding with binding. + +This improves scheduling reliability by avoiding premature binding and enables coordination +with external device controllers. + +To use this feature, device drivers (typically managed by driver owners) must publish the +following fields in the `Device` section of a `ResourceSlice`. Cluster administrators +must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature +gates for the scheduler to honor these fields. + +`bindingConditions` +: A list of _condition types_ that must be set to True (in the `.status.conditions` field of the associated ResourceClaim) before the Pod can be bound. These conditions typically represent readiness signals, such as DeviceAttached or DeviceInitialized. 
+ +`bindingFailureConditions` +: A list of condition types that, if set to True in the + `status.conditions` field of the associated ResourceClaim, indicate a failure state. + If any of these conditions are True, the scheduler will abort binding and reschedule the Pod. + +`bindsToNode` +: If set to `true`, the scheduler records the selected node name in the + `status.allocation.nodeSelector` field of the ResourceClaim. + This does not affect the Pod's `spec.nodeSelector`. Instead, it sets a node selector + inside the ResourceClaim, which external controllers can use to perform node-specific + operations such as device attachment or preparation. + +All condition types listed in `bindingConditions` and `bindingFailureConditions` are evaluated +from the `status.conditions` field of the ResourceClaim. +External controllers are responsible for updating these conditions using standard Kubernetes +condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`). + +The scheduler waits up to **600 seconds** (default) for all `bindingConditions` to become `True`. +If the timeout is reached or any `bindingFailureConditions` are `True`, the scheduler +clears the allocation and reschedules the Pod. +A cluster administrator can configure this timeout duration by editing the kube-scheduler configuration file. 
+ +An example of configuring this timeout in `KubeSchedulerConfiguration` is given below: + +```yaml +apiVersion: kubescheduler.config.k8s.io/v1 +kind: KubeSchedulerConfiguration +profiles: +- schedulerName: default-scheduler + pluginConfig: + - name: DynamicResources + args: + apiVersion: kubescheduler.config.k8s.io/v1 + kind: DynamicResourcesArgs + bindingTimeout: 60s +``` + +#### Example {#device-binding-conditions-example} + +Here is an example of a ResourceSlice that you might see in a cluster where there's a DRA driver in use, and that driver supports binding conditions: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: gpu-slice-1 +spec: + driver: dra.example.com + nodeSelector: + nodeSelectorTerms: + - matchExpressions: + - key: accelerator-type + operator: In + values: + - "high-performance" + pool: + name: gpu-pool + generation: 1 + resourceSliceCount: 1 + devices: + - name: gpu-1 + attributes: + vendor: + string: "example" + model: + string: "example-gpu" + bindsToNode: true + bindingConditions: + - dra.example.com/is-prepared + bindingFailureConditions: + - dra.example.com/preparing-failed +``` +This example ResourceSlice has the following properties: + +- The ResourceSlice targets nodes labeled with `accelerator-type=high-performance`, +so that the scheduler uses only a specific set of eligible nodes. +- The scheduler selects one node from the selected group (for example, `node-3`) and sets +the `status.allocation.nodeSelector` field in the ResourceClaim to that node name. +- The `dra.example.com/is-prepared` binding condition indicates that the device `gpu-1` +must be prepared (the `is-prepared` condition has a status of `True`) before binding. +- If the `gpu-1` device preparation fails (the `preparing-failed` condition has a status of `True`), the scheduler aborts binding. +- The scheduler waits up to 600 seconds (default) for the device to become ready. 
+- External controllers can use the node selector in the ResourceClaim to perform +node-specific setup on the selected node. + + ## DRA alpha features {#alpha-features} The following sections describe DRA features that are available in the Alpha @@ -939,109 +1050,6 @@ Resource pool status is an *alpha feature* and only enabled when the [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled in the kube-apiserver and kube-controller-manager. -### Device Binding Conditions {#device-binding-conditions} - -{{< feature-state feature_gate_name="DRADeviceBindingConditions" >}} - -Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until -external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed -to be ready. - -This waiting behavior is implemented in the -[PreBind phase](/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind) -of the scheduling framework. -During this phase, the scheduler checks whether all required device conditions are -satisfied before proceeding with binding. - -This improves scheduling reliability by avoiding premature binding and enables coordination -with external device controllers. - -To use this feature, device drivers (typically managed by driver owners) must publish the -following fields in the `Device` section of a `ResourceSlice`. Cluster administrators -must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature -gates for the scheduler to honor these fields. - -- `bindingConditions`: A list of condition types that must be set to True in the - status.conditions field of the associated ResourceClaim before the Pod can be bound. - These typically represent readiness signals such as "DeviceAttached" or "DeviceInitialized". -- `bindingFailureConditions`: A list of condition types that, if set to True in - status.conditions field of the associated ResourceClaim, indicate a failure state. 
- If any of these conditions are True, the scheduler will abort binding and reschedule the Pod. -- `bindsToNode`: if set to `true`, the scheduler records the selected node name in the - `status.allocation.nodeSelector` field of the ResourceClaim. - This does not affect the Pod's `spec.nodeSelector`. Instead, it sets a node selector - inside the ResourceClaim, which external controllers can use to perform node-specific - operations such as device attachment or preparation. - -All condition types listed in bindingConditions and bindingFailureConditions are evaluated -from the `status.conditions` field of the ResourceClaim. -External controllers are responsible for updating these conditions using standard Kubernetes -condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`). - -The scheduler waits up to **600 seconds** (default) for all `bindingConditions` to become `True`. -If the timeout is reached or any `bindingFailureConditions` are `True`, the scheduler -clears the allocation and reschedules the Pod. -This timeout duration is configurable by the user through `KubeSchedulerConfiguration`. - -```yaml -apiVersion: resource.k8s.io/v1 -kind: ResourceSlice -metadata: - name: gpu-slice -spec: - driver: dra.example.com - nodeSelector: - nodeSelectorTerms: - - matchExpressions: - - key: accelerator-type - operator: In - values: - - "high-performance" - pool: - name: gpu-pool - generation: 1 - resourceSliceCount: 1 - devices: - - name: gpu-1 - attributes: - vendor: - string: "example" - model: - string: "example-gpu" - bindsToNode: true - bindingConditions: - - dra.example.com/is-prepared - bindingFailureConditions: - - dra.example.com/preparing-failed -``` -This example ResourceSlice has the following properties: - -- The ResourceSlice targets nodes labeled with `accelerator-type=high-performance`, -so that the scheduler uses only a specific set of eligible nodes. 
-- The scheduler selects one node from the selected group (for example, `node-3`) and sets -the `status.allocation.nodeSelector` field in the ResourceClaim to that node name. -- The `dra.example.com/is-prepared` binding condition indicates that the device `gpu-1` -must be prepared (the `is-prepared` condition has a status of `True`) before binding. -- If the `gpu-1` device preparation fails (the `preparing-failed` condition has a status of `True`), the scheduler aborts binding. -- The scheduler waits up to 600 seconds (default) for the device to become ready. -- External controllers can use the node selector in the ResourceClaim to perform -node-specific setup on the selected node. - -An example of configuring this timeout in `KubeSchedulerConfiguration` is given below: - -```yaml -apiVersion: kubescheduler.config.k8s.io/v1 -kind: KubeSchedulerConfiguration -profiles: -- schedulerName: default-scheduler - pluginConfig: - - name: DynamicResources - args: - apiVersion: kubescheduler.config.k8s.io/v1 - kind: DynamicResourcesArgs - bindingTimeout: 60s -``` - ## {{% heading "whatsnext" %}} - [Set Up DRA in a Cluster](/docs/tasks/configure-pod-container/assign-resources/set-up-dra-cluster/) diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/DRADeviceBindingConditions.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/DRADeviceBindingConditions.md index cb405f2e4f776..9505d576c88da 100644 --- a/content/en/docs/reference/command-line-tools-reference/feature-gates/DRADeviceBindingConditions.md +++ b/content/en/docs/reference/command-line-tools-reference/feature-gates/DRADeviceBindingConditions.md @@ -9,6 +9,10 @@ stages: - stage: alpha defaultValue: false fromVersion: "1.34" + toVersion: "1.35" + - stage: beta + defaultValue: true + fromVersion: "1.36" --- Enables support for DeviceBindingConditions in the DRA related fields. This allows for thorough device readiness checks and attachment processes before Bind phase. 
\ No newline at end of file