
Fix CPU sets reconcile for in place vertical scaling with exclusive CPUs#11

Open
lukaszwojciechowski wants to merge 15 commits into esotsal:policy_static from lukaszwojciechowski:fix-cpuset-reconcile

Conversation


@lukaszwojciechowski lukaszwojciechowski commented Feb 11, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR modifies the algorithm that applies CPU sets to the container runtime in the CPU manager's reconciliation loop.
It resolves CPU-usage conflicts that can appear when exclusive CPU resources are resized without restart for pods with the Guaranteed QoS class, the static CPU policy, and InPlacePodVerticalScaling enabled.

For details about conflicting scenarios see Which issue(s) this PR fixes section below.

PR structure

The PR consists of 5 commits:

  • enhance TestReconcileState:
    -- verify that lastUpdateState contains the expected values
    -- enable verification of how the reconcile process completed for multiple containers, not just a single one
    -- extend test cases to cover the above changes and add a simple multi-container test
  • verify CPUSets set in mock runtime:
    -- extend mockRuntimeService in the CPU manager tests to store the CPUSets that were applied to the runtime
    -- verify that exclusive CPUs are assigned to a single container only
    -- verify that the runtime state matches expectations
  • add resize test cases for reconcileState: (NOTE: the added test cases fail because of issues in the algorithm)
    -- add test cases verifying CPUSet reconciliation for containers using exclusive CPUs
  • rework the CPUSet reconciliation algorithm:
    -- modify the algorithm (NOTE: the test cases pass again)
  • add test cases to cover multi-pass reconcile failures:
    -- add test cases covering UpdateContainerResources failures at different passes
Algorithm:
  1. Modifies the loop iterating over all containers in all pods to run in a critical section controlled by the CPU manager's lock (the same one used during Allocate).
  2. During the iteration, CPU sets are not yet applied but only collected into local variables:
    exclusiveCPUContainers and nonExclusiveCPUContainers. Holding the lock guarantees a consistent state.
  3. After collection, and outside the critical section, the CPU sets are applied to the runtime in three steps:

Step 3.1. remove scaled down exclusive CPUs from containers:

  • since containers using exclusive CPUs cannot be scaled down to an empty CPU set (they must keep their Original CPU set), this is a safe operation that will never try to set an empty set in the runtime
  • after this step, all CPUs that now belong to the default CPU set are no longer used exclusively by any container

Step 3.2. apply CPU Sets for all non-exclusive containers:

  • these containers use the default CPUSet
  • if the default CPUSet shrank since the last reconcileState call because some CPUs were allocated as exclusive, the removed CPUs are no longer used by non-exclusive containers after this step completes

Step 3.3. set final CPU Sets for containers using exclusive CPUs
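The three-pass ordering can be sketched as follows; this is a minimal illustration in Go, assuming a simplified CPUSet type, a container struct, and an applyCPUSet stand-in for updateContainerCPUSet (none of these are the PR's actual identifiers):

```go
package main

import (
	"fmt"
	"sort"
)

// CPUSet is a minimal stand-in for k8s.io/utils/cpuset.CPUSet,
// just enough to illustrate the three-pass apply order.
type CPUSet map[int]struct{}

func New(cpus ...int) CPUSet {
	s := CPUSet{}
	for _, c := range cpus {
		s[c] = struct{}{}
	}
	return s
}

func (s CPUSet) Intersection(o CPUSet) CPUSet {
	r := CPUSet{}
	for c := range s {
		if _, ok := o[c]; ok {
			r[c] = struct{}{}
		}
	}
	return r
}

func (s CPUSet) Equals(o CPUSet) bool {
	return len(s) == len(o) && len(s.Intersection(o)) == len(s)
}

func (s CPUSet) List() []int {
	out := make([]int, 0, len(s))
	for c := range s {
		out = append(out, c)
	}
	sort.Ints(out)
	return out
}

type container struct {
	name        string
	allocated   CPUSet // target set collected under the manager lock
	lastApplied CPUSet // what the runtime currently has (lastUpdateState)
}

// applyCPUSet is a hypothetical stand-in for updateContainerCPUSet.
func applyCPUSet(c *container, s CPUSet) {
	fmt.Printf("%s -> %v\n", c.name, s.List())
	c.lastApplied = s
}

// reconcile sketches the three passes described above.
func reconcile(exclusive, shared []*container) {
	// Pass 1: shrink exclusive containers to allocated ∩ lastApplied,
	// releasing CPUs they no longer own before anyone else gets them.
	for _, c := range exclusive {
		if t := c.allocated.Intersection(c.lastApplied); !t.Equals(c.lastApplied) {
			applyCPUSet(c, t)
		}
	}
	// Pass 2: apply the (possibly shrunken) default set to shared containers.
	for _, c := range shared {
		if !c.allocated.Equals(c.lastApplied) {
			applyCPUSet(c, c.allocated)
		}
	}
	// Pass 3: apply the final sets to exclusive containers; the CPUs they
	// gain are no longer used by any other container.
	for _, c := range exclusive {
		if !c.allocated.Equals(c.lastApplied) {
			applyCPUSet(c, c.allocated)
		}
	}
}

func main() {
	// A scales up to take CPU 4, which B must first give back.
	a := &container{name: "A", allocated: New(1, 2, 4), lastApplied: New(1, 2)}
	b := &container{name: "B", allocated: New(3), lastApplied: New(3, 4)}
	reconcile([]*container{a, b}, nil)
	// B is shrunk to {3} in pass 1 before A gains CPU 4 in pass 3.
}
```

Note how the ordering alone guarantees that no CPU is ever assigned to two containers at once, without reading state back from the runtime.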

Which issue(s) this PR fixes:

Resizing integer CPU resources of pods with the Guaranteed QoS class without restarting containers, when the CPU policy is static and InPlacePodVerticalScaling is enabled, requires handling more complex scenarios when applying the allocated CPUs to the containers' runtime.

The algorithm used in reconcileState in the CPU manager has a few drawbacks that may lead to temporary conflicts in CPU usage (exclusive CPUs used by multiple containers):

Consistent state application

The reconcileState function does not apply the CPU sets of all containers as one consistent state. The loop over all pods and containers applies CPU sets one by one without any critical section. During the iteration, the allocated CPU sets and the default set can be changed by a parallel resize, which executes Allocate in another goroutine, changing allocations, e.g.

  1. reconcileState applies the default CPU set to container A's runtime
  2. (meanwhile) container B is being resized; Allocate removes some CPUs from the default CPU set and uses them as additional exclusive CPUs for container B
  3. the reconcileState loop continues and applies the new CPU set to container B's runtime

From this moment, the CPUs allocated in step 2 are applied to the runtime of both containers A and B.
The situation fixes itself the next time reconcileState is executed (after 10 s).

Temporary conflicts when moving CPUs between containers

Consider the following scenario:

  1. container A uses CPUs [1, 2]; container B uses [3, 4]
  2. container B is resized down to its original size, so its allocation changes to [3]
  3. container A is resized up and gets the additional CPU 4, so its allocation is now [1, 2, 4]
  4. reconcileState applies the new CPU set to container A's runtime
    (so container A's runtime uses [1, 2, 4], while container B still uses [3, 4])
  5. reconcileState applies the new CPU set to container B's runtime

Between steps 4 and 5, CPU 4 is applied to both containers. The situation is temporary and usually very short, but if it coincides with a kubelet restart or some delay, it might affect the operation of the containers.
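The transient overlap in this scenario can be checked with a small sketch; overlap is a hypothetical helper, not kubelet code:

```go
package main

import "fmt"

// overlap returns CPUs present in both runtime cpusets.
func overlap(a, b []int) []int {
	in := map[int]bool{}
	for _, c := range a {
		in[c] = true
	}
	var out []int
	for _, c := range b {
		if in[c] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Runtime state between steps 4 and 5 of the scenario:
	// A already updated, B not yet.
	runtimeA := []int{1, 2, 4}
	runtimeB := []int{3, 4}
	fmt.Println(overlap(runtimeA, runtimeB)) // prints [4]: CPU 4 is briefly shared
}
```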

Special notes for your reviewer:

Only the patch "rework CPUSet reconciliation algorithm" contains changes to the code.
The other patches extend tests and add test cases covering 100% of the changes made and the scenarios the new algorithm fixes.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


err := m.updateContainerCPUSet(ctx, rcat.containerID, rcat.allocatedSet)
if err != nil {
	logger.Error(err, "failed to update container", "containerID", rcat.containerID, "cpuSet", rcat.allocatedSet)
	failure = append(failure, rcat.reconciledContainer)
Why not update ok = false in this err case? Do you have other considerations?

lukaszwojciechowski (Author):
You are absolutely right; the line is missing. I will add it.

We cannot allow the 3rd pass to execute if a failure happens in the 2nd pass, because some of the CPUs currently used by nonExclusiveCPUContainers might not be released while being required by exclusiveCPUContainers.

shiya0705:
The error relationship between passes is not absolute. If a scale-down of an exclusive container fails in the 1st pass, a container in a later pass may still apply its cpuset successfully, e.g.

  1. 1st pass, exclusive containers scale down: container #0: {1, 2, 3, 4} -> {1, 2}; container #1: {5, 6} -> {5}
  2. 2nd pass, the non-exclusive cpuset scales up: {7, 8} -> {7, 8, 6}
  3. 3rd pass, exclusive containers scale up: container #2: {9} -> {9, 3}; container #3: {10} -> {10, 4}

If container #0 succeeds and container #1 fails, containers #2 and #3 can still succeed, which means a single failure in the 1st pass does not have to block the 3rd pass.

If an ok check runs before each pass, one container's update failure affects all containers in the following passes, even when there is no cpuset conflict between them.

How about deleting the ok check before every stage, and changing the process as below:

  1. 1st pass: if an exclusive container's scale-down fails, add the container to the failure list and continue

  2. 2nd pass: before updateContainerCPUSet, check intersection(non-exclusive cset, Union(lcsets of containers in the failure list))
    2-1) if it is not empty, append all non-exclusive containers to the failure list
    2-2) if it is empty, there is no cpuset conflict; continue with updateContainerCPUSet for each container, appending any container whose update fails to the failure list

  3. 3rd pass: loop over the 3rd-pass containers and check intersection(container cset, Union(lcsets of containers in the failure list))
    3-1) if it is not empty, append the container to the failure list and continue
    3-2) if it is empty, there is no cpuset conflict; call updateContainerCPUSet for the container and continue

With this design, earlier failures only affect the containers whose cpusets actually conflict with them, and every container whose cpuset update fails is appended to the failure list.
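A minimal sketch of this failure-list check, assuming simplified set and cont types (illustrative names and helpers, not the PR's identifiers):

```go
package main

import "fmt"

type set map[int]struct{}

func newSet(cpus ...int) set {
	s := set{}
	for _, c := range cpus {
		s[c] = struct{}{}
	}
	return s
}

func (s set) addAll(o set) {
	for c := range o {
		s[c] = struct{}{}
	}
}

func (s set) intersects(o set) bool {
	for c := range s {
		if _, ok := o[c]; ok {
			return true
		}
	}
	return false
}

type cont struct {
	name   string
	target set // cpuset to apply in this pass
	last   set // last-applied cpuset (lcset)
}

// reconcilePass applies targets while tracking the union of failed
// containers' last-applied CPUs; a container whose target overlaps that
// union is moved to the failure list instead of being updated.
// updateFails simulates runtime errors for specific containers.
func reconcilePass(cs []*cont, updateFails map[string]bool) (failed []string) {
	failedUnion := set{}
	for _, c := range cs {
		if c.target.intersects(failedUnion) || updateFails[c.name] {
			failed = append(failed, c.name)
			failedUnion.addAll(c.last)
			continue
		}
		c.last = c.target // update succeeded
	}
	return failed
}

func main() {
	// Container #1's scale-down {5,6}->{5} fails, so CPU 6 stays in use.
	// Container #2 wants {7,8,6}, which conflicts, so it is skipped;
	// container #3 wants {10,4}, which does not, so it proceeds.
	c1 := &cont{name: "#1", target: newSet(5), last: newSet(5, 6)}
	c2 := &cont{name: "#2", target: newSet(7, 8, 6), last: newSet(7, 8)}
	c3 := &cont{name: "#3", target: newSet(10, 4), last: newSet(10)}
	fmt.Println(reconcilePass([]*cont{c1, c2, c3}, map[string]bool{"#1": true}))
	// prints [#1 #2]
}
```

The key design point: a failure only propagates along actual cpuset conflicts, not across whole passes.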

lukaszwojciechowski (Author):
@shiya0705 Thank you for your design proposal, I think it will be good to implement it.

However, I think we need to consider one more scenario:
what if this is the first run of the reconcile loop after a kubelet restart?
In that situation we don't know what is set in the runtime (lastUpdateState will be an empty set for all containers). If updateContainerCPUSet fails for any container in the 2nd pass and we add the shared-CPU containers to the failure list, the union of their lcsets will remain empty. Because of that, the intersection of a container's cset with this union will also be empty, and we will continue with all exclusive-CPU containers, possibly leading to conflicts, e.g.

  1. initial state:
    defaultSet={0,1,2,3}, allocations={A:{4,5}},
    lastUpdateState={A:{4,5}, B:{0,1,2,3}},
    runtime={A:{4,5}, B:{0,1,2,3}}
  2. scale up of A (just allocation):
    defaultSet={0,1}, allocations={A:{2,3,4,5}},
    lastUpdateState={A:{4,5}, B:{0,1,2,3}},
    runtime={A:{4,5}, B:{0,1,2,3}}
  3. kubelet restart
    defaultSet={0,1}, allocations={A:{2,3,4,5}},
    lastUpdateState={},
    runtime={A:{4,5}, B:{0,1,2,3}}
  4. reconcile of B fails, we continue with setting up A:
    defaultSet={0,1}, allocations={A:{2,3,4,5}},
    lastUpdateState={A:{2,3,4,5}},
    runtime={A:{2,3,4,5}, B:{0,1,2,3}} <-- conflict on CPUs: 2,3

@shiya0705 How do you propose to cover this scenario?
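The hole can be demonstrated with a tiny sketch: with an empty lastUpdateState, the union of failed containers' lcsets is empty, so the intersection check passes even though the runtime still holds conflicting CPUs (intersect is a hypothetical helper, not kubelet code):

```go
package main

import "fmt"

// intersect returns CPUs present in both sets.
func intersect(a, b map[int]bool) []int {
	var out []int
	for c := range a {
		if b[c] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// After a kubelet restart, lastUpdateState is empty, so even though
	// container B's reconcile failed, the union of failed lcsets is empty.
	failedUnion := map[int]bool{}
	// Container A's target set from the scenario: {2, 3, 4, 5}.
	targetA := map[int]bool{2: true, 3: true, 4: true, 5: true}
	fmt.Println(len(intersect(targetA, failedUnion))) // prints 0: check passes
	// ...yet the runtime still has B on {0, 1, 2, 3}, so applying A's set
	// would conflict on CPUs 2 and 3.
}
```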

shiya0705 commented Mar 3, 2026:

If lastUpdateState is empty, how about reading the cpuset from the runtime using func (m *kubeGenericRuntimeManager) GetContainerStatus, and using that cpuset to replace lastUpdateState?

lukaszwojciechowski (Author):

Maybe I am missing something, so please correct me if I don't see a valid possibility, but to me it looks like:

That would be a huge modification, and we would probably need to get the community's opinion first.
The communication is intended to be one-directional only, kubelet -> runtime,
with the exception of requesting status (limited to only several basic fields).

Currently, getting the CPUSet from the runtime is not implemented on any layer, so there would be a lot to do to implement it through all the layers.
Also, the runtime is completely separate from the kubelet itself, so different runtimes might not want to implement it, and even if they did, that would be an API change which needs to be introduced as a feature.


So without reading the current runtime state, we cannot know what is set and cannot avoid collisions.
Accepting that, I think your approach @shiya0705 is fine. I will implement it.

shiya0705:

In func (m *kubeGenericRuntimeManager) GetContainerStatus,
resp, err := m.runtimeService.ContainerStatus(ctx, id.ID, false) reads the container status from the runtime, and the cpuset is included in resp (of type ContainerStatusResponse):
ContainerStatusResponse
-> ContainerStatus
-> ContainerResources
-> LinuxContainerResources
-> CpusetCpus string
So I think it is possible to read the cpuset from the runtime, but it needs some modification.

Another thing we need to consider: if updateContainerCPUSet returns an error, then even though it is possible to get the cpuset from the runtime, that get may also fail and lead to a conflict.

If we accept that, I think this approach is fine.

lukaszwojciechowski (Author):

Thanks @shiya0705

I modified the algorithm as you suggested at the beginning. I also added some more test cases to verify failures during multi-pass reconciliation.

I think we cannot go with adding CPUSetCPUs to the runtime response. That would change the protocol between Kubernetes and the runtime; it is a huge change that requires a separate FeatureGate and consultation with the projects implementing runtimes.

Please make another review round.

		continue
	}
	m.lastUpdateState.SetCPUSet(rca.podUID, rca.containerName, iset)
}

The container is not appended to the success list in this loop when its updateContainerCPUSet succeeds.
Even though it could be updated in the third loop (step 3-3), the prerequisite is that no container failed in the previous loop.

lukaszwojciechowski (Author):

Yes, that's intentional, because the container is not fully set yet. The exclusiveCPUContainers will get to the success or failure list in the 3rd pass.

However, since you opened the discussion about it, maybe we can chat about it:

  • First of all, please notice that the success and failure lists are used only in tests
  • Second, there is no definition of what success or failure means
    (before my patch it was easier, because every container ended up in one of the categories)

So here are possible solutions for how we can handle that:

Option 1
Define success as the successful application of the final state to the container.
Containers end up on the success list if the application of their CPU sets is complete, on the failure list if an error happened, or on neither list if they were not processed at all or only partially.

Results:

  • if there is an error in the 1st pass, only the failed containers from exclusiveCPUContainers are reported on the failure list; nonExclusiveCPUContainers are not reported
  • if there is an error in the 2nd pass (1st pass without errors), all nonExclusiveCPUContainers are returned as success or failure; none of the exclusiveCPUContainers are reported
  • if we reach the 3rd pass (no errors in passes 1 and 2), all containers are reported in one of the two categories

Option 2
Make every container appear on one of the returned lists. That would require adding 2 more lists to the return value: unprocessed and partial.

Results:

  • if there is an error in the 1st pass, the failed containers from exclusiveCPUContainers are reported on the failure list, partially set-up containers from exclusiveCPUContainers on the partial list, and nonExclusiveCPUContainers on the unprocessed list
  • if there is an error in the 2nd pass (1st pass without errors), all nonExclusiveCPUContainers are returned as success or failure, and all exclusiveCPUContainers are reported as partial
  • if we reach the 3rd pass (no errors in passes 1 and 2), all containers are reported in one of two categories, success or failure; the partial and unprocessed lists are empty

@shiya0705, @esotsal please let me know your opinion.
After we decide, I will apply the changes to the code (or not) and add tests to cover the scenarios.

My opinion is to leave this code as is (so choosing Option 1), but add unit tests to cover all execution paths.

shiya0705 commented Mar 2, 2026:

Both of these options assume that if one container fails in an earlier stage, all containers in the following stages stop their cpuset updates. That may not be very friendly.
Option 1 reports only the containers for which updateContainerCPUSet failed, while unprocessed and partial containers whose cpusets were not updated are silently ignored; option 2 seems to cover things more comprehensively.

I added a new design option in comment https://github.com/esotsal/kubernetes/pull/11#discussion_r2870674556; let's check if it helps.

@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from 7a6a74b to ff01a41 Compare February 16, 2026 14:09
@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from a1182a8 to 7321dd1 Compare February 23, 2026 08:11
@esotsal esotsal force-pushed the policy_static branch 4 times, most recently from 5bd954d to 3452089 Compare February 27, 2026 12:49
@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from 7bbfcdd to 3e66885 Compare March 3, 2026 13:00
@esotsal esotsal force-pushed the policy_static branch 8 times, most recently from 044d951 to f52feef Compare March 6, 2026 10:56
@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from 212ea47 to 2411032 Compare March 10, 2026 08:27
esotsal and others added 9 commits March 10, 2026 15:48
Use new topology.Allocation struct (a CPU set plus
alignment metadata) instead of CPU set, due to rebase.

Remove duplicate unnecessary SetDefaultCPUSet call as per
review comment.
- Revert introduction of API env mustKeepCPUs
- Replace mustKeepCPUs with local checkpoint "Original"
- Introduce "Original" / "Resized" in CPUManagerCheckpointV3 format
- Add logic, refactor with Beta candidate
- Fix lint issues
- Fail if mustKeepCPUs are not subset of resulted CPUs
- Fail if reusableCPUsForResize, mustKeepCPUs are not a subset
  of aligned CPUs
- Fail if mustKeepCPUs are not a subset of reusable CPUs
- TODO improve align resize tests, go through testing, corner cases
       refactor using cpumanager_test.go
- TODO improve CPUManagerCheckpointV3 tests
- TODO address code review/feedback to try different approach to allocate
       stepwise instead of once off when resizing
- TODO check init-containers
- TODO check migration from v2 to v3 CPU Manager checkpoint
- TODO check kubectl failure when prohibited can this be done earlier?
- WIP  update CPU Manager tests to use refactored cpu_manager_test
- TODO update topologymanager,cpumanager,memorymanager documentation
To implement the design approved by KEP, update admit handler
for topology/cpu manager to perform the appropriate feasibility checks
on lifecycle.ResizeOperation.
…rror

Handle first review round, not ready yet
Enhance CPU manager's test of reconcileState function
(the one that actuates allocated CPU sets in runtime).

The improvement involves three elements:
1) verification if lastUpdateState contains expected values;
2) enabling verification of how the reconcile process completed
   for multiple containers, not just a single one;
3) extending test cases to cover the above changes
   and add a simple multi-container test.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Extend mockRuntimeService in the CPU manager tests to store CPUSets that
were applied to the runtime. Additionally, after each update of container
resources (if the testCPUConflicts flag is enabled), the mock runtime
verifies if exclusive CPUs are assigned to a single container only.

The extended mockRuntimeService is applied to TestReconcileState, so it
verifies conflicts after each update and after reconciliation is
completed, it is verified if the runtime state matches expectations.
Proper fields are added to each of the existing test cases.

Additionally, the mockRuntimeService has been enhanced to support returning
a sequence of errors from UpdateContainerResources calls. The err field was
changed from a single error to a slice of errors ([]error), and the
UpdateContainerResources function now returns the first error from the slice
and removes it, allowing different errors to be returned for successive calls.
This enhancement enables more sophisticated testing scenarios where multiple
container resource updates may fail at different points in the test sequence.

Proper fields are added to each of the existing test cases to accommodate
the new error slice functionality and CPU set tracking capabilities.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Add test cases for verification of CPUSets reconciliation
for containers using exclusive CPUs.

These test cases verify behavior of CPUs scaling with
InPlacePodVerticalScalingExclusiveCPUs enabled.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
The former implementation of reconcileState in CPU manager had two
issues:
1) It didn't apply CPU sets of all containers as a consistent state.
The loop for all pods and containers applied CPUSets one by one without
any critical section. During iteration over loop, the allocated CPUSets
and default set could have been changed by executing Allocate for needs
of resize or appearance of new container. Such situation could lead to
conflicts of exclusive CPUs, e.g.

a) reconcileState applies default CPU Set to container A runtime
b) Allocate removes some CPUs from default CPU Set and uses them as
additional exclusive CPUs for container B which resizes up
c) reconcileState loop continues and applies new CPU Set to container B
runtime
The CPUs that were allocated in step b) are now assigned to both
containers A and B in runtime.

2) It didn't consider temporary conflicts when moving CPUs from one
   container to another. For example:
a) container A uses CPUs: 1, 2; container B uses CPUs: 3, 4
b) container B scales down by one cpu, and now only CPU: 3 is allocated
for it
c) container A scales up and receives CPU: 4 during allocation (so now
it has CPUs: 1, 2, 4 allocated)
d) reconcileState applies new CPU Set to container A runtime: 1, 2, 4
e) reconcileState applies new CPU Set to container B runtime: 3
Between steps d) and e) CPU: 4 is assigned to both container A and
container B. If kubelet is restarted that time, the situation will hold
for some time.

The new algorithm:
1) Modifies the loop iterating over all containers in all pods to act in
a critical section controlled by CPU manager's lock - same that is
used during Allocate. During the iteration CPU Sets are not yet applied
but only collected to local variables: exclusiveCPUContainers and
nonExclusiveCPUContainers. Usage of the lock guarantees consistent
state.

2) After collection and outside critical section CPU Sets are applied to
runtime in three steps:
2.1) remove scaled down exclusive CPUs from containers
* as containers using exclusive CPUs cannot be scaled down to 0, because
they need to retain Original CPU Set, it is a safe operation that won't
try to set an empty set in runtime
* after this operation all CPUs that belong now to default CPU Set are
no longer used exclusively by any container, so next step can be applied
2.2) apply CPU Sets for all non-exclusive containers
* these containers will use default CPUSet
* if default CPUSet shrank since last reconcileState call due to
allocation of the CPUs as exclusive, the CPUs removed from it are now
no longer used by non-exclusive containers, so next step can be applied
2.3) set final CPU Sets for containers using exclusive CPUs

3) Improves conflict detection by tracking failed container updates and
preventing further updates that would conflict with previously failed ones.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
This commit adds extensive test coverage for the new multi-pass reconciliation
algorithm in the CPU manager, covering various failure scenarios
of UpdateContainerResources function during the reconciliation process
to ensure proper handling of CPU conflicts.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
// Determine the CPU set to use based on the pass
var targetCPUSet cpuset.CPUSet
if preliminary {
targetCPUSet = rca.allocatedSet.Intersection(lcset)

preliminary is true only when removing CPUs from containers using exclusive CPUs; in that case
targetCPUSet = rca.allocatedSet.Intersection(lcset) and targetCPUSet = rca.allocatedSet are the same. No need to add a preliminary condition check here.

lukaszwojciechowski (Author):

We also need to think about the case when the set of CPUs is not only scaled down but changed.

Let's consider the following scenario for a container:

  1. initial scenario state: allocated: {original: {1}, resized: {1, 2, 3}}, lastUpdateState: {1, 2, 3}
  2. scale down 3 -> 1: allocated: {original: {1}, resized: {1}}, lastUpdateState: {1, 2, 3}
  3. meanwhile some other container is resized and gets CPUs 2 and 3 allocated
  4. scale up 1 -> 2: allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
  5. only now is reconcileState launched
    The targetCPUSet for the first pass must be {1}, the intersection of allocated and lastUpdateState.
    We don't want to use {1, 4}, as CPU 4 might still be used by some other container.
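The first-pass computation for this scenario, as a small standalone sketch (intersection is an illustrative helper, not the PR's code):

```go
package main

import (
	"fmt"
	"sort"
)

// intersection returns the sorted CPUs present in both slices.
func intersection(a, b []int) []int {
	in := map[int]bool{}
	for _, c := range a {
		in[c] = true
	}
	out := []int{}
	for _, c := range b {
		if in[c] {
			out = append(out, c)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	allocated := []int{1, 4}     // resized set after steps 2-4
	lastUpdate := []int{1, 2, 3} // what the runtime still has
	// The first pass shrinks to the intersection {1}: CPUs 2 and 3 are
	// released, and CPU 4 (possibly still used elsewhere) is not yet taken.
	fmt.Println(intersection(allocated, lastUpdate)) // prints [1]
}
```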


I understand your design idea; it seems reasonable.

}

// Check if update is needed
if !targetCPUSet.Equals(lcset) {

In the first pass, updateContainers(exclusiveCPUContainers, true), all containers in exclusiveCPUContainers that have been resized will update their cpuset, not only the containers whose CPUs were removed.

lukaszwojciechowski (Author):

Yes, lastUpdateState is updated each time changes to the runtime are made.
However, the change in the first pass (updateContainers(exclusiveCPUContainers, true)) might not be final, since only some CPUs were removed, so in the 3rd pass we still need to check whether another update is required.

Look at the scenario from the comment above.
reconcileState runs with the following state: allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
So

  • in the 1st pass there will be an update to {1}: only the removal of CPUs 2 and 3
  • in the 3rd pass there will be an update to {1, 4}: the final set, applied once we have the guarantee that CPU 4 is no longer used by any other container


You are right~

@esotsal esotsal force-pushed the policy_static branch 6 times, most recently from aa5c826 to ecd92e6 Compare March 14, 2026 07:31

esotsal commented Mar 18, 2026

Since we missed the deadline for v1.36 and are now aiming for v1.37, I think it makes sense to merge this into the main PR and continue the discussion there. @lukaszwojciechowski, please rebase to merge your PR.

4 participants