Document DRA Device Binding Conditions in v1.36 #54541
base: dev-1.36
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -530,6 +530,117 @@ Extended resource allocation by DRA is a *beta feature* and is enabled by defaul | |||||||||||
| `DRAExtendedResource` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAExtendedResource) | ||||||||||||
| in the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet. | ||||||||||||
|
|
||||||||||||
| ### Device binding conditions | ||||||||||||
|
|
||||||||||||
| {{< feature-state feature_gate_name="DRADeviceBindingConditions" >}} | ||||||||||||
|
|
||||||||||||
| Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until | ||||||||||||
|
Member
Suggested change
Contributor
Author
This sentence is now rewritten for DRA driver developers, and I’d like to discuss whether we should proceed this way. |
||||||||||||
| external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed | ||||||||||||
| to be ready. | ||||||||||||
|
|
||||||||||||
| This waiting behavior is implemented in the | ||||||||||||
| [PreBind phase](/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind) | ||||||||||||
| of the scheduling framework. | ||||||||||||
| During this phase, the scheduler checks whether all required device conditions are | ||||||||||||
| satisfied before proceeding with binding. | ||||||||||||
|
|
||||||||||||
| This improves scheduling reliability by avoiding premature binding and enables coordination | ||||||||||||
| with external device controllers. | ||||||||||||
|
|
||||||||||||
| To use this feature, device drivers (typically managed by driver owners) must publish the | ||||||||||||
|
Member
Suggested change
Contributor
Author
In this sentence, “you” refers to DRA driver developers, which means this is also written with driver authors in mind. |
||||||||||||
| following fields in the `Device` section of a `ResourceSlice`. Cluster administrators | ||||||||||||
|
Member
Suggested change
Contributor
Author
In this sentence, “you” refers to DRA driver developers, which means this is also written with driver authors in mind. |
||||||||||||
| must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature | ||||||||||||
| gates for the scheduler to honor these fields. | ||||||||||||
|
|
||||||||||||
| `bindingConditions` | ||||||||||||
| : A list of _condition types_ that must be set to True (in the `.status.conditions` field of the associated ResourceClaim) before the Pod can be bound. These conditions typically represent readiness signals, such as DeviceAttached or DeviceInitialized. | ||||||||||||
|
|
||||||||||||
| `bindingFailureConditions` | ||||||||||||
| : A list of condition types that, if set to `True` in the | ||||||||||||
| `.status.conditions` field of the associated ResourceClaim, indicate a failure state. | ||||||||||||
| If any of these conditions is `True`, the scheduler aborts binding and reschedules the Pod. | ||||||||||||
|
|
||||||||||||
| `bindsToNode` | ||||||||||||
| : If set to `true`, the scheduler records the selected node name in the | ||||||||||||
| `status.allocation.nodeSelector` field of the ResourceClaim. | ||||||||||||
| This does not affect the Pod's `spec.nodeSelector`. Instead, it sets a node selector | ||||||||||||
| inside the ResourceClaim, which external controllers can use to perform node-specific | ||||||||||||
| operations such as device attachment or preparation. | ||||||||||||
|
|
||||||||||||
| All condition types listed in `bindingConditions` and `bindingFailureConditions` are evaluated | ||||||||||||
|
Member
Suggested change
Contributor
Author
Are you assuming that binding conditions in the ResourceSlice are compared against and evaluated together with the binding conditions in the ResourceClaim? In practice, I believe an external controller would evaluate only the binding conditions in the ResourceClaim. In addition, based on our experience, the controller that sets binding conditions is not necessarily limited to the control plane. There are also designs where such controllers are distributed and run on each node. For this reason, rather than explicitly referring to the control plane, I thought it might be better to use a more general term such as an external controller. |
||||||||||||
| from the `status.conditions` field of the ResourceClaim. | ||||||||||||
| External controllers are responsible for updating these conditions using standard Kubernetes | ||||||||||||
| condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`). | ||||||||||||
|
Member
Suggested change
Contributor
Author
Driver-author–focused text has also been added here, and I would like to discuss whether this is necessary. |
||||||||||||
|
|
||||||||||||
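As a rough illustration of the condition fields named above (`type`, `status`, `reason`, `message`, `lastTransitionTime`), the following Python sketch builds one condition entry as a plain dictionary. The helper name `make_condition` and the sample reason and message are hypothetical; how an external controller actually writes the entry to the ResourceClaim through the API is out of scope here.

```python
from datetime import datetime, timezone

# Condition status is a string in Kubernetes, not a boolean.
VALID_STATUSES = ("True", "False", "Unknown")


def make_condition(cond_type, status, reason, message, now=None):
    """Build a condition entry with the standard Kubernetes condition fields.

    This is an illustrative helper, not part of any Kubernetes client library.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    # Use the supplied timestamp if given, otherwise the current UTC time.
    now = now or datetime.now(timezone.utc)
    return {
        "type": cond_type,
        "status": status,
        "reason": reason,
        "message": message,
        # RFC 3339 timestamp in UTC, second precision.
        "lastTransitionTime": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


if __name__ == "__main__":
    print(make_condition(
        "dra.example.com/is-prepared",
        "True",
        "DeviceAttached",            # hypothetical reason
        "gpu-1 attached to node",    # hypothetical message
    ))
```

Passing `now` explicitly keeps the helper deterministic, which is convenient when testing a controller's condition updates.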
| The scheduler waits up to **600 seconds** (default) for all `bindingConditions` to become `True`. | ||||||||||||
| If the timeout is reached or any `bindingFailureConditions` are `True`, the scheduler | ||||||||||||
| clears the allocation and reschedules the Pod. | ||||||||||||
| A cluster administrator can configure this timeout duration by editing the kube-scheduler configuration file. | ||||||||||||
|
|
||||||||||||
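The wait, timeout, and failure rules described above can be sketched as a small decision function. This is a minimal Python model under stated assumptions, not the scheduler's actual implementation: the names `Decision` and `evaluate_binding` are hypothetical, conditions are passed in as plain dictionaries, and the order of checks (failure first, then readiness, then timeout) is an assumption made for the sketch.

```python
from enum import Enum


class Decision(Enum):
    BIND = "bind"    # all binding conditions are True
    WAIT = "wait"    # keep waiting for readiness
    ABORT = "abort"  # clear the allocation and reschedule the Pod


def evaluate_binding(conditions, binding_conditions, binding_failure_conditions,
                     elapsed_seconds, timeout_seconds=600.0):
    """Decide what to do given the conditions currently reported for a claim.

    conditions: list of dicts with at least "type" and "status" keys.
    binding_conditions / binding_failure_conditions: lists of condition types,
    as published by the driver for the device.
    """
    # Collect the condition types whose status is the string "True".
    true_types = {c["type"] for c in conditions if c.get("status") == "True"}

    # Any failure condition that is True aborts binding.
    if any(t in true_types for t in binding_failure_conditions):
        return Decision.ABORT
    # Binding proceeds only when every binding condition is True.
    if all(t in true_types for t in binding_conditions):
        return Decision.BIND
    # Not ready yet: give up once the timeout (600 seconds by default) is reached.
    if elapsed_seconds >= timeout_seconds:
        return Decision.ABORT
    return Decision.WAIT


if __name__ == "__main__":
    reported = [{"type": "dra.example.com/is-prepared", "status": "True"}]
    print(evaluate_binding(
        reported,
        ["dra.example.com/is-prepared"],
        ["dra.example.com/preparing-failed"],
        elapsed_seconds=12.0,
    ))
```

Passing a smaller `timeout_seconds` models a cluster where the administrator has shortened the timeout in the scheduler configuration.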
| An example of configuring this timeout in `KubeSchedulerConfiguration` is given below: | ||||||||||||
|
|
||||||||||||
| ```yaml | ||||||||||||
| apiVersion: kubescheduler.config.k8s.io/v1 | ||||||||||||
| kind: KubeSchedulerConfiguration | ||||||||||||
| profiles: | ||||||||||||
| - schedulerName: default-scheduler | ||||||||||||
| pluginConfig: | ||||||||||||
| - name: DynamicResources | ||||||||||||
| args: | ||||||||||||
| apiVersion: kubescheduler.config.k8s.io/v1 | ||||||||||||
| kind: DynamicResourcesArgs | ||||||||||||
| bindingTimeout: 60s | ||||||||||||
| ``` | ||||||||||||
|
|
||||||||||||
| #### Example {#device-binding-conditions-example} | ||||||||||||
|
|
||||||||||||
| Here is an example of a ResourceSlice that you might see in a cluster where there's a DRA driver in use, and that driver supports binding conditions: | ||||||||||||
|
|
||||||||||||
| ```yaml | ||||||||||||
| apiVersion: resource.k8s.io/v1 | ||||||||||||
| kind: ResourceSlice | ||||||||||||
| metadata: | ||||||||||||
| name: gpu-slice-1 | ||||||||||||
| spec: | ||||||||||||
| driver: dra.example.com | ||||||||||||
| nodeSelector: | ||||||||||||
| nodeSelectorTerms: | ||||||||||||
| - matchExpressions: | ||||||||||||
| - key: accelerator-type | ||||||||||||
| operator: In | ||||||||||||
| values: | ||||||||||||
| - "high-performance" | ||||||||||||
| pool: | ||||||||||||
| name: gpu-pool | ||||||||||||
| generation: 1 | ||||||||||||
| resourceSliceCount: 1 | ||||||||||||
| devices: | ||||||||||||
| - name: gpu-1 | ||||||||||||
| attributes: | ||||||||||||
| vendor: | ||||||||||||
| string: "example" | ||||||||||||
| model: | ||||||||||||
| string: "example-gpu" | ||||||||||||
| bindsToNode: true | ||||||||||||
| bindingConditions: | ||||||||||||
| - dra.example.com/is-prepared | ||||||||||||
| bindingFailureConditions: | ||||||||||||
| - dra.example.com/preparing-failed | ||||||||||||
| ``` | ||||||||||||
| This example ResourceSlice has the following properties: | ||||||||||||
|
|
||||||||||||
| - The ResourceSlice targets nodes labeled with `accelerator-type=high-performance`, | ||||||||||||
| so that the scheduler uses only a specific set of eligible nodes. | ||||||||||||
| - The scheduler selects one node from that group (for example, `node-3`) and sets | ||||||||||||
| the `status.allocation.nodeSelector` field in the ResourceClaim to that node name. | ||||||||||||
| - The `dra.example.com/is-prepared` binding condition indicates that the device `gpu-1` | ||||||||||||
| must be prepared (the `is-prepared` condition has a status of `True`) before binding. | ||||||||||||
| - If the `gpu-1` device preparation fails (the `preparing-failed` condition has a status of `True`), the scheduler aborts binding. | ||||||||||||
| - The scheduler waits up to 600 seconds (default) for the device to become ready. | ||||||||||||
| - External controllers can use the node selector in the ResourceClaim to perform | ||||||||||||
| node-specific setup on the selected node. | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| ## DRA alpha features {#alpha-features} | ||||||||||||
|
|
||||||||||||
| The following sections describe DRA features that are available in the Alpha | ||||||||||||
|
|
@@ -939,109 +1050,6 @@ Resource pool status is an *alpha feature* and only enabled when the | |||||||||||
| [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) | ||||||||||||
| is enabled in the kube-apiserver and kube-controller-manager. | ||||||||||||
|
|
||||||||||||
| ### Device Binding Conditions {#device-binding-conditions} | ||||||||||||
|
|
||||||||||||
| {{< feature-state feature_gate_name="DRADeviceBindingConditions" >}} | ||||||||||||
|
|
||||||||||||
| Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until | ||||||||||||
| external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed | ||||||||||||
| to be ready. | ||||||||||||
|
|
||||||||||||
| This waiting behavior is implemented in the | ||||||||||||
| [PreBind phase](/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind) | ||||||||||||
| of the scheduling framework. | ||||||||||||
| During this phase, the scheduler checks whether all required device conditions are | ||||||||||||
| satisfied before proceeding with binding. | ||||||||||||
|
|
||||||||||||
| This improves scheduling reliability by avoiding premature binding and enables coordination | ||||||||||||
| with external device controllers. | ||||||||||||
|
|
||||||||||||
| To use this feature, device drivers (typically managed by driver owners) must publish the | ||||||||||||
| following fields in the `Device` section of a `ResourceSlice`. Cluster administrators | ||||||||||||
| must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature | ||||||||||||
| gates for the scheduler to honor these fields. | ||||||||||||
|
|
||||||||||||
| - `bindingConditions`: A list of condition types that must be set to True in the | ||||||||||||
| status.conditions field of the associated ResourceClaim before the Pod can be bound. | ||||||||||||
| These typically represent readiness signals such as "DeviceAttached" or "DeviceInitialized". | ||||||||||||
| - `bindingFailureConditions`: A list of condition types that, if set to True in | ||||||||||||
| status.conditions field of the associated ResourceClaim, indicate a failure state. | ||||||||||||
| If any of these conditions are True, the scheduler will abort binding and reschedule the Pod. | ||||||||||||
| - `bindsToNode`: if set to `true`, the scheduler records the selected node name in the | ||||||||||||
| `status.allocation.nodeSelector` field of the ResourceClaim. | ||||||||||||
| This does not affect the Pod's `spec.nodeSelector`. Instead, it sets a node selector | ||||||||||||
| inside the ResourceClaim, which external controllers can use to perform node-specific | ||||||||||||
| operations such as device attachment or preparation. | ||||||||||||
|
|
||||||||||||
| All condition types listed in bindingConditions and bindingFailureConditions are evaluated | ||||||||||||
| from the `status.conditions` field of the ResourceClaim. | ||||||||||||
| External controllers are responsible for updating these conditions using standard Kubernetes | ||||||||||||
| condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`). | ||||||||||||
|
|
||||||||||||
| The scheduler waits up to **600 seconds** (default) for all `bindingConditions` to become `True`. | ||||||||||||
| If the timeout is reached or any `bindingFailureConditions` are `True`, the scheduler | ||||||||||||
| clears the allocation and reschedules the Pod. | ||||||||||||
| This timeout duration is configurable by the user through `KubeSchedulerConfiguration`. | ||||||||||||
|
|
||||||||||||
| ```yaml | ||||||||||||
| apiVersion: resource.k8s.io/v1 | ||||||||||||
| kind: ResourceSlice | ||||||||||||
| metadata: | ||||||||||||
| name: gpu-slice | ||||||||||||
| spec: | ||||||||||||
| driver: dra.example.com | ||||||||||||
| nodeSelector: | ||||||||||||
| nodeSelectorTerms: | ||||||||||||
| - matchExpressions: | ||||||||||||
| - key: accelerator-type | ||||||||||||
| operator: In | ||||||||||||
| values: | ||||||||||||
| - "high-performance" | ||||||||||||
| pool: | ||||||||||||
| name: gpu-pool | ||||||||||||
| generation: 1 | ||||||||||||
| resourceSliceCount: 1 | ||||||||||||
| devices: | ||||||||||||
| - name: gpu-1 | ||||||||||||
| attributes: | ||||||||||||
| vendor: | ||||||||||||
| string: "example" | ||||||||||||
| model: | ||||||||||||
| string: "example-gpu" | ||||||||||||
| bindsToNode: true | ||||||||||||
| bindingConditions: | ||||||||||||
| - dra.example.com/is-prepared | ||||||||||||
| bindingFailureConditions: | ||||||||||||
| - dra.example.com/preparing-failed | ||||||||||||
| ``` | ||||||||||||
| This example ResourceSlice has the following properties: | ||||||||||||
|
|
||||||||||||
| - The ResourceSlice targets nodes labeled with `accelerator-type=high-performance`, | ||||||||||||
| so that the scheduler uses only a specific set of eligible nodes. | ||||||||||||
| - The scheduler selects one node from the selected group (for example, `node-3`) and sets | ||||||||||||
| the `status.allocation.nodeSelector` field in the ResourceClaim to that node name. | ||||||||||||
| - The `dra.example.com/is-prepared` binding condition indicates that the device `gpu-1` | ||||||||||||
| must be prepared (the `is-prepared` condition has a status of `True`) before binding. | ||||||||||||
| - If the `gpu-1` device preparation fails (the `preparing-failed` condition has a status of `True`), the scheduler aborts binding. | ||||||||||||
| - The scheduler waits up to 600 seconds (default) for the device to become ready. | ||||||||||||
| - External controllers can use the node selector in the ResourceClaim to perform | ||||||||||||
| node-specific setup on the selected node. | ||||||||||||
|
|
||||||||||||
| An example of configuring this timeout in `KubeSchedulerConfiguration` is given below: | ||||||||||||
|
|
||||||||||||
| ```yaml | ||||||||||||
| apiVersion: kubescheduler.config.k8s.io/v1 | ||||||||||||
| kind: KubeSchedulerConfiguration | ||||||||||||
| profiles: | ||||||||||||
| - schedulerName: default-scheduler | ||||||||||||
| pluginConfig: | ||||||||||||
| - name: DynamicResources | ||||||||||||
| args: | ||||||||||||
| apiVersion: kubescheduler.config.k8s.io/v1 | ||||||||||||
| kind: DynamicResourcesArgs | ||||||||||||
| bindingTimeout: 60s | ||||||||||||
| ``` | ||||||||||||
|
|
||||||||||||
| ## {{% heading "whatsnext" %}} | ||||||||||||
|
|
||||||||||||
| - [Set Up DRA in a Cluster](/docs/tasks/configure-pod-container/assign-resources/set-up-dra-cluster/) | ||||||||||||
|
|
||||||||||||
nit: when we document (sub)features for DRA, we should place them where they would belong if they were stable.
If we do that, then when features graduate, the docs remain easy to find and use.
If we were to pursue this, I feel that the current sections such as “DRA beta features” and “DRA alpha features” would no longer be appropriate, and that we would need to reconsider the overall structure of this chapter.