KEP-5526: Pod Level Resource Managers blog 1.36 #53001
content/en/blog/_posts/2026/pod-level-resource-managers.md
---
layout: blog
title: "Kubernetes 1.36: Pod-Level Resource Managers (Alpha)"
date: 2026-03-31
slug: kubernetes-1-36-feature-pod-level-resource-managers-alpha
author: Kevin Torres (Google)
---

This blog post describes Pod-Level Resource Managers, a new alpha feature
introduced in Kubernetes v1.36. This enhancement extends the kubelet's
Topology, CPU, and Memory Managers to support pod-level resource
specifications.

This feature evolves the resource managers from a strictly per-container
allocation model to a pod-centric one. It enables them to use
`pod.spec.resources` to perform NUMA alignment for the pod as a whole, and it
introduces a partitioning scheme to manage resources for containers within
that pod-level grouping. The result is a more flexible and powerful resource
management model, particularly for performance-sensitive workloads: you can
define hybrid allocation models in which some containers receive exclusive
resources while others share the remainder from a pod-level shared pool.
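The manifest below sketches that hybrid model (container names, images, and
sizes are illustrative): a pod-level budget of 4 CPUs and 8 GiB, one container
that qualifies for an exclusive slice, and a sidecar left in the pod shared
pool.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-aligned-pod        # illustrative name
spec:
  resources:                    # pod-level budget (requires PodLevelResources)
    requests:
      cpu: "4"
      memory: 8Gi
    limits:
      cpu: "4"
      memory: 8Gi
  containers:
  - name: main                  # Guaranteed: requests == limits, integer CPU,
    image: registry.k8s.io/pause:3.9   # so eligible for an exclusive slice
    resources:
      requests:
        cpu: "3"
        memory: 6Gi
      limits:
        cpu: "3"
        memory: 6Gi
  - name: sidecar               # no container-level resources: runs in the
    image: registry.k8s.io/pause:3.9   # pod shared pool (1 CPU, 2Gi left over)
```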
This blog post covers:

1. [Why do we need Pod-Level Resource Managers?](#why-do-we-need-pod-level-resource-managers)
2. [Glossary](#glossary)
3. [How do Pod-Level Resource Managers work?](#how-do-pod-level-resource-managers-work)
4. [Current limitations and caveats](#current-limitations-and-caveats)
5. [Getting started and providing feedback](#getting-started-and-providing-feedback)

## Why do we need Pod-Level Resource Managers?

When working with performance-critical workloads (such as AI/ML or
high-performance computing), you often need exclusive, NUMA-aligned resources
for your primary application containers. However, modern Kubernetes pods
frequently include sidecar containers (for example, for logging, monitoring,
or data ingestion).

Historically, you either had to allocate exclusive, NUMA-aligned resources to
every container in a Guaranteed pod (which is wasteful for lightweight
sidecars), or forfeit the pod-level Guaranteed QoS class entirely.

By enabling the `PodLevelResourceManagers` feature gate (which also requires
the `PodLevelResources` feature gate), the kubelet can create hybrid resource
allocation models, bringing flexibility and efficiency to high-performance
workloads without sacrificing NUMA alignment.
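Both feature gates are alpha in v1.36 and therefore off by default; on a test
node you might enable them in the kubelet configuration along these lines:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  PodLevelResources: true          # pod-level resource specifications
  PodLevelResourceManagers: true   # pod-level resource manager behavior
```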
## Glossary

To fully understand this new feature, it helps to define a few key terms:

- **Pod-level resources**: The resource budget defined at the pod level in
  `pod.spec.resources`, which specifies the collective requests and limits
  for the entire pod.
- **Guaranteed container**: Within the context of this feature, a container
  is considered Guaranteed if it specifies resource requests equal to its
  limits for both CPU and memory (exclusive CPU allocation additionally
  requires a positive integer CPU value). This status makes it eligible for
  exclusive resource allocation from the resource managers.
- **Pod shared pool**: The subset of a pod's allocated resources that remains
  after all exclusive slices have been reserved. These resources are shared
  by all containers in the pod that do not receive an exclusive allocation.
  While containers in this pool share resources with each other, they are
  strictly isolated from the exclusive slices and from the general node-wide
  shared pool.
- **Exclusive slice**: A dedicated portion of resources (for example,
  specific CPUs or memory pages) allocated solely to a single container,
  ensuring isolation from other containers.
## How do Pod-Level Resource Managers work?

The resource managers operate differently depending on the configured
Topology Manager scope:

### Pod Scope

When the Topology Manager scope is set to `pod`, the kubelet performs a
single NUMA alignment for the entire pod, based on the resource budget
defined in `pod.spec.resources`.
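This mode builds on kubelet settings covered elsewhere in the documentation:
the `static`/`Static` manager policies (the only ones this alpha supports; see
the caveats below) and the `pod` Topology Manager scope. A minimal sketch of
the relevant configuration (a real node also needs additional setup, such as
reserved memory for the `Static` Memory Manager policy, omitted here):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static          # required by this alpha feature
memoryManagerPolicy: Static       # likewise, for the Memory Manager
topologyManagerScope: pod         # align the pod as a whole
topologyManagerPolicy: single-numa-node   # illustrative alignment policy
```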
The resulting NUMA-aligned resource pool is then partitioned:

1. **Exclusive slices:** Containers that specify Guaranteed resources are
   allocated exclusive slices from the pod's total allocation.
2. **Pod shared pool:** The remaining resources form a pool shared among all
   other, non-Guaranteed containers in the pod. While containers in this pool
   share resources with each other, they are strictly isolated from the
   exclusive slices and from the general node-wide shared pool.

Note that when standard init containers run to completion, their resources
are added to a per-pod reusable set rather than being returned to the node's
resource pool. Because init containers run sequentially, these resources can
be reused by subsequent app containers (either for their own exclusive slices
or for the shared pool).

This allows you to co-locate containers that require exclusive resources with
those that do not, all within a single NUMA-aligned pod.

**Important Pod Scope considerations:**

- Empty shared pool rejection: if the sum of all exclusive container requests
  exactly matches the pod's total budget, but another container still needs
  the shared pool, the pod is rejected at admission. For example, consider a
  pod with a pod-level budget of 4 CPUs, where `container-1` requests an
  exclusive 1 CPU and `container-2` requests an exclusive 3 CPUs. Because
  0 CPUs are left in the shared pool for `container-3`, the pod is rejected.
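As a manifest, that rejection example looks roughly like this (images and
memory sizes are illustrative); `container-3` has no CPUs left to run on, so
admission fails:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rejected-pod              # illustrative
spec:
  resources:
    requests: {cpu: "4", memory: 4Gi}
    limits:   {cpu: "4", memory: 4Gi}
  containers:
  - name: container-1             # exclusive slice: 1 CPU
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: "1", memory: 1Gi}
      limits:   {cpu: "1", memory: 1Gi}
  - name: container-2             # exclusive slice: 3 CPUs
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: "3", memory: 1Gi}
      limits:   {cpu: "3", memory: 1Gi}
  - name: container-3             # needs the shared pool, but
    image: registry.k8s.io/pause:3.9   # 4 - (1 + 3) = 0 CPUs remain
```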
### Container Scope

When the Topology Manager scope is set to `container`, the kubelet evaluates
each container individually for exclusive allocation.

If the pod as a whole achieves the Guaranteed QoS class via
`pod.spec.resources`, you can mix and match containers:

- Containers with their own Guaranteed requests receive exclusive
  NUMA-aligned resources.
- Other, non-Guaranteed containers in the pod run in the node's general
  shared pool.
- The collective resource consumption of all containers is still bounded by
  the pod's `pod.spec.resources` limits.

This scope is particularly useful when an infrastructure sidecar needs to be
aligned to a specific NUMA node for device access, while the main workload
can run in the general node shared pool.
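A sketch of that sidecar pattern (names, images, and sizes are illustrative):
only the sidecar is Guaranteed at the container level, so only it receives an
exclusive slice, while the main workload uses the node's shared pool under
the pod-level cap.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: container-scope-pod       # illustrative
spec:
  resources:                      # pod-level budget: makes the pod Guaranteed
    requests: {cpu: "6", memory: 12Gi}
    limits:   {cpu: "6", memory: 12Gi}
  containers:
  - name: ingest-sidecar          # Guaranteed (requests == limits, integer
    image: registry.k8s.io/pause:3.9   # CPU): gets an exclusive slice
    resources:
      requests: {cpu: "2", memory: 4Gi}
      limits:   {cpu: "2", memory: 4Gi}
  - name: main                    # no container-level resources: runs in the
    image: registry.k8s.io/pause:3.9   # node's shared pool, capped by the
                                       # pod-level limits above
```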
### Under-the-hood: CPU Quotas (CFS)

When running mixed workloads within a pod, isolation is enforced differently
depending on the allocation:
- **Exclusive containers:** Containers granted exclusive CPU slices have CPU
  CFS quota enforcement disabled, allowing them to run on their dedicated
  CPUs without being throttled by the Linux scheduler.
- **Pod shared pool containers:** Containers that fall into the pod shared
  pool have CPU CFS quotas enabled, ensuring they do not consume more than
  the leftover pod budget.
## Current limitations and caveats

- The functionality is currently implemented only for the `static` CPU
  Manager policy and the `Static` Memory Manager policy.
- This feature is only supported on Linux nodes. On Windows nodes, the
  resource managers act as a no-op for pod-level allocations.
- As a fundamental requirement of using `pod.spec.resources`, the sum of all
  container-level resource requests must not exceed the pod-level resource
  budget.
- If you downgrade the kubelet to a version that does not support this
  feature, the older kubelet may fail to read the newer resource-manager
  checkpoint files. The newer schema adds top-level fields to store pod-level
  allocations, so the checkpoint checksum no longer matches what an older
  kubelet computes.
|
## Getting started and providing feedback

To get started, read the
[Assign Pod-level CPU and memory resources](/docs/tasks/configure-pod-container/assign-pod-level-resources/)
task to understand the overall pod-level resources feature, and the
[Use Pod-level Resources with Resource Managers](/docs/tasks/administer-cluster/pod-level-resource-managers/)
documentation to learn how to use this feature.

As this feature moves through alpha, your feedback is invaluable. Please
report any issues or share your experiences via the standard Kubernetes
communication channels:

* Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
* [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
* [Open community issues and PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)