Expand Up @@ -18,6 +18,8 @@ api_metadata:
  kind: "DeviceClass"
- apiVersion: "resource.k8s.io/v1beta1"
  kind: "ResourceSlice"
- apiVersion: "resource.k8s.io/v1beta2"
  kind: "DeviceTaintRule"
- apiVersion: "resource.k8s.io/v1beta2"
  kind: "ResourceClaim"
- apiVersion: "resource.k8s.io/v1beta2"
Expand Down Expand Up @@ -386,8 +388,8 @@ The accuracy of the information that a driver adds to a ResourceClaim
`status.devices` field depends on the driver. Evaluate drivers to decide whether
you can rely on this field as the only source of device information.

If you disable the `DRAResourceClaimDeviceStatus`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/), the
If you disable the
[`DRAResourceClaimDeviceStatus` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAResourceClaimDeviceStatus), the
`status.devices` field automatically gets cleared when storing the ResourceClaim.
A ResourceClaim device status is supported when it is possible, from a DRA
driver, to update an existing ResourceClaim where the `status.devices` field is
Expand All @@ -404,7 +406,7 @@ As an alpha feature, Kubernetes provides a mechanism for monitoring and reportin
For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy.
It is also helpful to find out if the device recovers.

To enable this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus/)
Contributor Author

This link was broken.

To enable this functionality, the [`ResourceHealthStatus` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#ResourceHealthStatus)
must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.

When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet.
Expand Down Expand Up @@ -449,10 +451,16 @@ spec:
You may also be able to mutate the incoming Pod, at admission time, to unset
the `.spec.nodeName` field and to use a node selector instead.

## DRA beta features {#beta-features}
Contributor Author

I think this separation into "GA/beta/alpha" features is not useful and leads to unnecessary churn when features graduate from alpha to beta. It's also misleading because not all beta features are on by default, or on-by-default features could be turned off. "Optional DRA features" looks like a better description.

We still need to move features out of this section when they graduate to GA, though. So perhaps we should instead use "Additional DRA features", which then can include GA features?

/cc @ritazh

#54599

Member

Thanks for the suggestion @pohly.

  1. I personally +1 and appreciated the Alpha/Beta sections because for a team/user that can only use GA things, that section makes it easy to skip. And for a team/user who is an early adopter, it's easier to find and pay more attention to those to ensure these alpha/beta features work in their environment before the features are GAed.
  2. A generic "Optional and Additional DRA features" section is hard to gate and scale in the future. e.g. which features are considered optional and additional? who makes that decision?
  3. I'm also +1 if we want each feature to just have their own section as long as their graduation status is right below it, it's easier to understand how to use it and easier for the feature owners to go back and update.

Thoughts?

Contributor Author

If we go with "additional", we can add an introduction like this:

Additional features add advanced functionality to core DRA; usage of them is optional and/or may only be relevant with certain DRA drivers.

Some of the features are in the Alpha or Beta
feature stage.
...

I still find that better than categorizing them by their state. An explicit "this is an alpha feature" in the section is clearer than having to remember how far down the page one has scrolled.

Member

Are you suggesting something like the following?

...

## Additional DRA features
Additional features add advanced functionality to core DRA; usage of them is optional and/or may only be relevant with certain DRA drivers. Some of the features are in the Alpha or Beta feature stage.

### Feature one (GA)

### Feature two (alpha)

### Feature three (beta)
...

Contributor

How about a tabular form?

DRA Features Status

| Feature | Status | Kubernetes Version | FG Default | Notes |
| --- | --- | --- | --- | --- |
| Structured Parameters | Alpha | v1.30+ | Off | Moves parameter logic from external drivers into the scheduler. |
| Device Taints | Beta | v1.36+ | On | Allows nodes with specific devices to be tainted dynamically. |

Contributor Author

@pohly, Apr 1, 2026

Looking at current content makes it clear that "Beta Features" and "Alpha Features" are not useful sub-sections because not all alpha/beta features are described there. For example, prioritized list is described further up in the section about requesting devices. I think we should follow that pattern and describe features where it makes sense, not based on their status.

Contributor Author

> Feature one (GA)

That can work, as long as we explicitly set the anchor to not include the GA part.

We don't need to include the (GA/Alpha/Beta) part because it gets rendered for us automatically:

https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access

I don't think we need to be even more explicit.

Contributor Author

I switched to "Additional features" and did a pass over how the other features are described. Linking to feature gates was inconsistent or even broken. I'm now linking to the specific feature gate anchor.

Member

I agree with the change. As features graduate to stable, we should try to find a way to include information about it into the regular flow of the doc, so we don't end up with a long list of "additional features" separate from the overall structure.

Contributor Author

If there is a better place than "Additional features", then a new feature should be described there immediately. We did that for prioritized list.

"Additional features" then are the things whose usage is less common (the "advanced use cases").

## Additional DRA features

The following sections describe DRA features that are available in the Beta
The following sections describe DRA features that support advanced use
cases. Usage of them is optional and may only be relevant with DRA
drivers that support them.

Some of them are available in the Alpha or Beta
[feature stage](/docs/reference/command-line-tools-reference/feature-gates/#feature-stages).
Those depend on feature gates and may depend on additional
{{< glossary_tooltip text="API groups" term_id="api-group" >}}.
For more information, see
[Set up DRA in the cluster](/docs/tasks/configure-pod-container/assign-resources/set-up-dra-cluster/).

Expand Down Expand Up @@ -491,6 +499,10 @@ create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with
This ensures that non-admin users cannot misuse the feature.
Starting with Kubernetes v1.34, this label has been updated to `resource.kubernetes.io/admin-access: "true"`.

Admin access is a *beta feature* and is enabled by default with the
[`DRAAdminAccess` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAAdminAccess)
in the kube-apiserver, kube-scheduler, and kubelet.
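
As a minimal sketch of the namespace labeling described above, the following manifest marks a namespace as allowed to hold ResourceClaims and ResourceClaimTemplates that request admin access. The namespace name `dra-admin` is a hypothetical placeholder; only the label key and value come from the text above.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  # Hypothetical name, chosen only for illustration.
  name: dra-admin
  labels:
    # Label used starting with Kubernetes v1.34, as stated above.
    resource.kubernetes.io/admin-access: "true"
```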

### Extended resource allocation by DRA {#extended-resource}

{{< feature-state feature_gate_name="DRAExtendedResource" >}}
Expand Down Expand Up @@ -527,18 +539,9 @@ The resulting ResourceClaim will contain a request for an `ExactCount` of the
specified number of devices of that DeviceClass.

Extended resource allocation by DRA is a *beta feature* and is enabled by default with the
`DRAExtendedResource` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAExtendedResource)
[`DRAExtendedResource` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAExtendedResource)
in the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet.

## DRA alpha features {#alpha-features}

The following sections describe DRA features that are available in the Alpha
[feature stage](/docs/reference/command-line-tools-reference/feature-gates/#feature-stages).
They depend on enabling feature gates and may depend on additional
{{< glossary_tooltip text="API groups" term_id="api-group" >}}.
For more information, see
[Set up DRA in the cluster](/docs/tasks/configure-pod-container/assign-resources/set-up-dra-cluster/).

### Partitionable devices {#partitionable-devices}

{{< feature-state feature_gate_name="DRAPartitionableDevices" >}}
Expand Down Expand Up @@ -607,8 +610,8 @@ spec:
value: 6Gi
```

Partitionable devices is an *alpha feature* and only enabled when the `DRAPartitionableDevices`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
Partitionable devices is an *alpha feature* and only enabled when the
[`DRAPartitionableDevices` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAPartitionableDevices)
is enabled in the kube-apiserver and kube-scheduler.

## Consumable capacity
Expand Down Expand Up @@ -736,10 +739,13 @@ Allocating a device with admin access (described [above](#admin-access))
is not exempt either. An admin using that mode must explicitly tolerate all taints
to access tainted devices.

Device taints and tolerations is an *alpha feature* and only enabled when the
`DRADeviceTaints` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
is enabled in the kube-apiserver, kube-controller-manager and kube-scheduler.
To use DeviceTaintRules, the `resource.k8s.io/v1alpha3` API version must be enabled.
Device taints and tolerations is a *beta feature* and enabled when the
[`DRADeviceTaints` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRADeviceTaints)
is kept enabled in the kube-apiserver, kube-controller-manager and kube-scheduler.
To use DeviceTaintRules, the `resource.k8s.io/v1beta2` API version must be
enabled together with the [`DRADeviceTaintRules` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRADeviceTaintRules).
In contrast to `DRADeviceTaints`, `DRADeviceTaintRules` is off by default because of this dependency
on the beta API group, which has to be off by default.
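
For a test cluster, the two opt-in steps described above (the feature gate and the API version) could be combined as in the following sketch. It assumes a cluster created with [kind](https://kind.sigs.k8s.io/), which is not something this page prescribes; other installers expose the same settings through the `--feature-gates` and `--runtime-config` flags of the control plane components.

```yaml
# Sketch of a kind cluster configuration (assumption: kind is used for testing).
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # Off by default, unlike DRADeviceTaints.
  DRADeviceTaintRules: true
runtimeConfig:
  # Enables the API version that serves DeviceTaintRule.
  "resource.k8s.io/v1beta2": "true"
```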

You can add taints to devices in the following ways, by using the DeviceTaintRule API kind.

Expand Down Expand Up @@ -779,7 +785,7 @@ It can be modified and removed at any time.
Here is one example of a DeviceTaintRule for a fictional DRA driver:

```yaml
apiVersion: resource.k8s.io/v1alpha3
apiVersion: resource.k8s.io/v1beta2
kind: DeviceTaintRule
metadata:
  name: example
Expand All @@ -795,8 +801,14 @@ spec:
effect: NoExecute
```

The apiserver automatically tracks when this taint was created and the eviction
controller adds a condition with some information:
The kube-apiserver automatically tracks when this taint was created by setting the
`timeAdded` field in the `spec`. The toleration period starts at that time
stamp. During updates which change the effect (see simulated eviction flow
below), the kube-apiserver automatically updates the time stamp. Users can control
the time stamp explicitly by setting the field when creating a DeviceTaintRule and
by changing it to some different value when updating.
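
To illustrate setting the time stamp explicitly, here is a sketch that reuses the fictional driver from the example above. The placement of `timeAdded` inside the `taint` block, the selector and taint values, the object name, and the time stamp itself are assumptions made for illustration; check the API reference for the exact schema.

```yaml
apiVersion: resource.k8s.io/v1beta2
kind: DeviceTaintRule
metadata:
  # Hypothetical name.
  name: example-explicit-time
spec:
  deviceSelector:
    # Fictional driver name.
    driver: dra.example.com
  taint:
    # Illustrative key and value.
    key: dra.example.com/unhealthy
    value: Broken
    effect: NoExecute
    # Assumed field location; the toleration period starts at this time stamp.
    timeAdded: "2026-04-01T12:00:00Z"
```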

The status contains a condition added by the eviction controller:

```
kubectl describe devicetaintrules
Expand Down Expand Up @@ -877,7 +889,7 @@ To check resource pool status:
optionally a limit on the number of pools returned. You can also limit it to a single pool by specifying a pool name:

```yaml
apiVersion: resource.k8s.io/v1alpha3
apiVersion: resource.k8s.io/v1beta2
kind: ResourcePoolStatusRequest
metadata:
  name: check-gpus
Expand Down Expand Up @@ -935,8 +947,7 @@ This feature requires explicit RBAC permissions on the ResourcePoolStatusRequest
resource. No default ClusterRoles include this permission.

Resource pool status is an *alpha feature* and only enabled when the
`DRAResourcePoolStatus`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
[`DRAResourcePoolStatus` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRAResourcePoolStatus)
is enabled in the kube-apiserver and kube-controller-manager.

### Device Binding Conditions {#device-binding-conditions}
Expand Down Expand Up @@ -1042,6 +1053,10 @@ profiles:
bindingTimeout: 60s
```

Device binding conditions is an *alpha feature* and only enabled when the
[`DRADeviceBindingConditions` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#DRADeviceBindingConditions)
is enabled in the kube-apiserver and kube-scheduler.

## {{% heading "whatsnext" %}}

- [Set Up DRA in a Cluster](/docs/tasks/configure-pod-container/assign-resources/set-up-dra-cluster/)
Expand Up @@ -9,6 +9,10 @@ stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
    toVersion: "1.35"
  - stage: beta
    defaultValue: false
    fromVersion: "1.36"
---
Enables support for
[tainting devices through DeviceTaintRule objects](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)
Expand Up @@ -9,6 +9,10 @@ stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.33"
    toVersion: "1.35"
  - stage: beta
    defaultValue: true
    fromVersion: "1.36"
---
Enables support for
[tainting devices and selectively tolerating those taints](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)