Member

nit: when we document (sub)features for DRA, we should place them where they would belong if they were stable.

If we do that, then when features graduate, the docs remain easy to find and use.

Contributor Author

If we were to pursue this, I feel that the current sections such as “DRA beta features” and “DRA alpha features” would no longer be appropriate, and that we would need to reconsider the overall structure of this chapter.

@@ -530,6 +530,117 @@ Extended resource allocation by DRA is a *beta feature* and is enabled by defaul
`DRAExtendedResource` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAExtendedResource)
in the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet.

### Device binding conditions

{{< feature-state feature_gate_name="DRADeviceBindingConditions" >}}

Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until
Member

Suggested change
Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until
As the author of a DRA driver, you can use
_device binding conditions_ to defer Pod binding
until

Contributor Author

This sentence is now rewritten for DRA driver developers, and I’d like to discuss whether we should proceed this way.

external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed
to be ready.

This waiting behavior is implemented in the
[PreBind phase](/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind)
of the scheduling framework.
During this phase, the scheduler checks whether all required device conditions are
satisfied before proceeding with binding.

This improves scheduling reliability by avoiding premature binding and enables coordination
with external device controllers.

To use this feature, device drivers (typically managed by driver owners) must publish the
Member

Suggested change
To use this feature, device drivers (typically managed by driver owners) must publish the
To use this ability to delay binding, the DRA driver that
you are writing needs to publish all of the

Contributor Author

In this sentence, “you” refers to DRA driver developers, which means this is also written with driver authors in mind.

following fields in the `Device` section of a `ResourceSlice`. Cluster administrators
Member

@lmktfy lmktfy Mar 24, 2026

Suggested change
following fields in the `Device` section of a `ResourceSlice`. Cluster administrators
following fields in the `device` section of a ResourceSlice. Because this relies on a beta feature, you should also clearly document that cluster administrators

Contributor Author

In this sentence, “you” refers to DRA driver developers, which means this is also written with driver authors in mind.

must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature
gates for the scheduler to honor these fields.

`bindingConditions`
: A list of _condition types_ that must be set to True (in the `.status.conditions` field of the associated ResourceClaim) before the Pod can be bound. These conditions typically represent readiness signals, such as DeviceAttached or DeviceInitialized.

`bindingFailureConditions`
: A list of condition types that, if set to True in the
`.status.conditions` field of the associated ResourceClaim, indicate a failure state.
If any of these conditions is True, the scheduler aborts binding and reschedules the Pod.

`bindsToNode`
: If set to `true`, the scheduler records the selected node name in the
`status.allocation.nodeSelector` field of the ResourceClaim.
This does not affect the Pod's `spec.nodeSelector`. Instead, it sets a node selector
inside the ResourceClaim, which external controllers can use to perform node-specific
operations such as device attachment or preparation.

All condition types listed in bindingConditions and bindingFailureConditions are evaluated
Member

Suggested change
All condition types listed in bindingConditions and bindingFailureConditions are evaluated
The control plane discovers all the binding conditions (from `bindingConditions` and `bindingFailureConditions`) and evaluates those against the list of observed conditions, taken

Contributor Author

Are you assuming that binding conditions in the ResourceSlice are compared against and evaluated together with the binding conditions in the ResourceClaim? In practice, I believe an external controller would evaluate only the binding conditions in the ResourceClaim.

In addition, based on our experience, the controller that sets binding conditions is not necessarily limited to the control plane. There are also designs where such controllers are distributed and run on each node. For this reason, rather than explicitly referring to the control plane, I thought it might be better to use a more general term such as an external controller.

from the `status.conditions` field of the ResourceClaim.
External controllers are responsible for updating these conditions using standard Kubernetes
condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`).
Member

Suggested change
condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`).
condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`).
If you are the driver author, you may prefer to
provide your own controller, that is custom to the
hardware or other dynamic resource that the driver works with.

Contributor Author

Driver-author–focused text has also been added here, and I would like to discuss whether this is necessary.


The scheduler waits up to **600 seconds** (default) for all `bindingConditions` to become `True`.
If the timeout is reached or any `bindingFailureConditions` are `True`, the scheduler
clears the allocation and reschedules the Pod.
A cluster administrator can configure this timeout duration by editing the kube-scheduler configuration file.

An example of configuring this timeout in `KubeSchedulerConfiguration` is given below:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: DynamicResources
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: DynamicResourcesArgs
      bindingTimeout: 60s
```
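
The bind, wait, or abort decision described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the scheduler's actual implementation. The names `Decision` and `evaluate_binding` are hypothetical, and the order of the checks when a condition becomes `True` exactly at the timeout is an assumption.

```python
from enum import Enum


class Decision(Enum):
    BIND = "bind"    # every binding condition is True
    WAIT = "wait"    # keep waiting before binding
    ABORT = "abort"  # clear the allocation and reschedule the Pod


def evaluate_binding(binding_conditions, binding_failure_conditions,
                     observed, elapsed_seconds, timeout_seconds=600):
    """Decide what to do with a Pod that is waiting on device conditions.

    observed maps a condition type to its status string ("True", "False",
    or "Unknown"), as read from the ResourceClaim.
    """
    # Any failure condition that is True aborts binding.
    if any(observed.get(t) == "True" for t in binding_failure_conditions):
        return Decision.ABORT
    # Binding proceeds only once every listed condition is True.
    if all(observed.get(t) == "True" for t in binding_conditions):
        return Decision.BIND
    # Otherwise keep waiting, unless the timeout has been reached.
    if elapsed_seconds >= timeout_seconds:
        return Decision.ABORT
    return Decision.WAIT
```

With the `bindingTimeout: 60s` setting shown above, the equivalent call would pass `timeout_seconds=60`; the default of 600 mirrors the documented default.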

#### Example {#device-binding-conditions-example}

Here is an example of a ResourceSlice that you might see in a cluster where there's a DRA driver in use, and that driver supports binding conditions:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-slice-1
spec:
  driver: dra.example.com
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: accelerator-type
        operator: In
        values:
        - "high-performance"
  pool:
    name: gpu-pool
    generation: 1
    resourceSliceCount: 1
  devices:
    - name: gpu-1
      attributes:
        vendor:
          string: "example"
        model:
          string: "example-gpu"
      bindsToNode: true
      bindingConditions:
        - dra.example.com/is-prepared
      bindingFailureConditions:
        - dra.example.com/preparing-failed
```
This example ResourceSlice has the following properties:

- The ResourceSlice targets nodes labeled with `accelerator-type=high-performance`,
so that the scheduler uses only a specific set of eligible nodes.
- The scheduler selects one node from the selected group (for example, `node-3`) and sets
the `status.allocation.nodeSelector` field in the ResourceClaim to that node name.
- The `dra.example.com/is-prepared` binding condition indicates that the device `gpu-1`
must be prepared (the `is-prepared` condition has a status of `True`) before binding.
- If the `gpu-1` device preparation fails (the `preparing-failed` condition has a status of `True`), the scheduler aborts binding.
- The scheduler waits up to 600 seconds (default) for the device to become ready.
- External controllers can use the node selector in the ResourceClaim to perform
node-specific setup on the selected node.
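
As a rough sketch of the external controller's side of this example, the Python below builds one condition entry using the standard condition fields (`type`, `status`, `reason`, `message`, `lastTransitionTime`). The helper name `make_condition` and the `reason` and `message` values are hypothetical. How the entry is written to the ResourceClaim through the Kubernetes API is deliberately left out.

```python
from datetime import datetime, timezone


def make_condition(cond_type, status, reason, message, now=None):
    """Build one condition entry with the standard Kubernetes fields."""
    if status not in ("True", "False", "Unknown"):
        raise ValueError(f"invalid condition status: {status!r}")
    # Normalize to UTC so the trailing "Z" in the timestamp is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "type": cond_type,
        "status": status,
        "reason": reason,
        "message": message,
        # RFC 3339 timestamp in UTC
        "lastTransitionTime": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


# Condition a controller might report once gpu-1 has been prepared.
prepared = make_condition(
    "dra.example.com/is-prepared", "True",
    reason="DevicePrepared",                      # hypothetical reason
    message="gpu-1 is attached and initialized",  # hypothetical message
)
```

If preparation failed instead, the same helper could report `dra.example.com/preparing-failed` with a status of `"True"`, which would cause the scheduler to abort binding.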


## DRA alpha features {#alpha-features}

The following sections describe DRA features that are available in the Alpha
@@ -939,109 +1050,6 @@ Resource pool status is an *alpha feature* and only enabled when the
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
is enabled in the kube-apiserver and kube-controller-manager.

### Device Binding Conditions {#device-binding-conditions}

{{< feature-state feature_gate_name="DRADeviceBindingConditions" >}}

Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until
external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed
to be ready.

This waiting behavior is implemented in the
[PreBind phase](/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind)
of the scheduling framework.
During this phase, the scheduler checks whether all required device conditions are
satisfied before proceeding with binding.

This improves scheduling reliability by avoiding premature binding and enables coordination
with external device controllers.

To use this feature, device drivers (typically managed by driver owners) must publish the
following fields in the `Device` section of a `ResourceSlice`. Cluster administrators
must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature
gates for the scheduler to honor these fields.

- `bindingConditions`: A list of condition types that must be set to True in the
status.conditions field of the associated ResourceClaim before the Pod can be bound.
These typically represent readiness signals such as "DeviceAttached" or "DeviceInitialized".
- `bindingFailureConditions`: A list of condition types that, if set to True in
status.conditions field of the associated ResourceClaim, indicate a failure state.
If any of these conditions are True, the scheduler will abort binding and reschedule the Pod.
- `bindsToNode`: if set to `true`, the scheduler records the selected node name in the
`status.allocation.nodeSelector` field of the ResourceClaim.
This does not affect the Pod's `spec.nodeSelector`. Instead, it sets a node selector
inside the ResourceClaim, which external controllers can use to perform node-specific
operations such as device attachment or preparation.

All condition types listed in bindingConditions and bindingFailureConditions are evaluated
from the `status.conditions` field of the ResourceClaim.
External controllers are responsible for updating these conditions using standard Kubernetes
condition semantics (`type`, `status`, `reason`, `message`, `lastTransitionTime`).

The scheduler waits up to **600 seconds** (default) for all `bindingConditions` to become `True`.
If the timeout is reached or any `bindingFailureConditions` are `True`, the scheduler
clears the allocation and reschedules the Pod.
This timeout duration is configurable by the user through `KubeSchedulerConfiguration`.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-slice
spec:
  driver: dra.example.com
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: accelerator-type
        operator: In
        values:
        - "high-performance"
  pool:
    name: gpu-pool
    generation: 1
    resourceSliceCount: 1
  devices:
    - name: gpu-1
      attributes:
        vendor:
          string: "example"
        model:
          string: "example-gpu"
      bindsToNode: true
      bindingConditions:
        - dra.example.com/is-prepared
      bindingFailureConditions:
        - dra.example.com/preparing-failed
```
This example ResourceSlice has the following properties:

- The ResourceSlice targets nodes labeled with `accelerator-type=high-performance`,
so that the scheduler uses only a specific set of eligible nodes.
- The scheduler selects one node from the selected group (for example, `node-3`) and sets
the `status.allocation.nodeSelector` field in the ResourceClaim to that node name.
- The `dra.example.com/is-prepared` binding condition indicates that the device `gpu-1`
must be prepared (the `is-prepared` condition has a status of `True`) before binding.
- If the `gpu-1` device preparation fails (the `preparing-failed` condition has a status of `True`), the scheduler aborts binding.
- The scheduler waits up to 600 seconds (default) for the device to become ready.
- External controllers can use the node selector in the ResourceClaim to perform
node-specific setup on the selected node.

An example of configuring this timeout in `KubeSchedulerConfiguration` is given below:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: DynamicResources
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: DynamicResourcesArgs
      bindingTimeout: 60s
```

## {{% heading "whatsnext" %}}

- [Set Up DRA in a Cluster](/docs/tasks/configure-pod-container/assign-resources/set-up-dra-cluster/)
@@ -9,6 +9,10 @@ stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.34"
    toVersion: "1.35"
  - stage: beta
    defaultValue: true
    fromVersion: "1.36"
---
Enables support for DeviceBindingConditions in the DRA-related fields.
This allows for thorough device readiness checks and attachment processes before the Bind phase.