Fix CPU sets reconcile for in-place vertical scaling with exclusive CPUs #11
lukaszwojciechowski wants to merge 15 commits into esotsal:policy_static
Conversation
	err := m.updateContainerCPUSet(ctx, rcat.containerID, rcat.allocatedSet)
	if err != nil {
		logger.Error(err, "failed to update container", "containerID", rcat.containerID, "cpuSet", rcat.allocatedSet)
		failure = append(failure, rcat.reconciledContainer)
Why not set ok = false in this error case? Or do you have another consideration?
You are absolutely right, the line is missing. I will add it.
We cannot allow the 3rd pass to be executed if a failure happens in the 2nd pass, because some of the CPUs currently used by nonExclusiveCPUContainers might not be released while they are required by exclusiveCPUContainers.
The error relationship between phases is not absolute. If an exclusive container's scale-down fails in the 1st pass, a container in another pass may still apply its cpuset successfully. For example:
1) 1st pass, exclusive scale down: container #0: {1, 2, 3, 4} -> {1, 2}; container #1: {5, 6} -> {5}
2) 2nd pass, non-exclusive scale up: {7, 8} -> {7, 8, 6}
3) 3rd pass, exclusive scale up: container #2: {9} -> {9, 3}; container #3: {10} -> {10, 4}
If container #0 succeeds and container #1 fails, containers #2 and #3 can still succeed. That means a single failure in the 1st pass does not imply that 3rd-pass containers must fail.
If we use an ok check before each pass, one container update failure affects all containers in the following passes, even when there is no cpuset conflict between them.
How about deleting the ok check before every stage, and changing the process as below:
- 1st pass: if an exclusive container's scale-down fails, add this container to the failure list and continue.
- 2nd pass: before updateContainerCPUSet, check intersection(non-exclusive cset, union(lcsets of containers in the failure list)):
  2-1) If it is not empty, append all non-exclusive containers to the failure list.
  2-2) If it is empty, there is no cpuset conflict; continue to updateContainerCPUSet for each container. Append any container whose update fails to the failure list and continue.
- 3rd pass: for each 3rd-pass container, check intersection(container cset, union(lcsets of containers in the failure list)):
  3-1) If it is not empty, append this container to the failure list and continue.
  3-2) If it is empty, there is no cpuset conflict; call updateContainerCPUSet for the container and continue.
With this design, failed containers from an earlier pass only affect containers whose cpusets conflict with them, and every container whose cpuset update fails is appended to the failure list.
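The intersection checks in this proposal can be sketched roughly as follows; this is an illustrative toy (a map-based set standing in for the real cpuset.CPUSet, with hypothetical values), not the actual kubelet code:

```go
package main

import "fmt"

// CPUSet is a toy stand-in for the real cpuset.CPUSet type.
type CPUSet map[int]struct{}

func New(cpus ...int) CPUSet {
	s := CPUSet{}
	for _, c := range cpus {
		s[c] = struct{}{}
	}
	return s
}

// Intersects reports whether two sets share at least one CPU.
func (s CPUSet) Intersects(other CPUSet) bool {
	for c := range s {
		if _, ok := other[c]; ok {
			return true
		}
	}
	return false
}

// Union merges the last-known cpusets (lcsets) of all failed containers.
func Union(sets ...CPUSet) CPUSet {
	out := CPUSet{}
	for _, s := range sets {
		for c := range s {
			out[c] = struct{}{}
		}
	}
	return out
}

func main() {
	// 1st pass: container #1 failed to scale down from {5, 6} to {5},
	// so its lcset {5, 6} goes into the failure union.
	failedUnion := Union(New(5, 6))

	// 2nd pass: the non-exclusive target {7, 8, 6} overlaps CPU 6,
	// so non-exclusive containers must be appended to the failure list.
	fmt.Println(New(7, 8, 6).Intersects(failedUnion)) // true

	// 3rd pass: exclusive target {9, 3} has no overlap, safe to apply.
	fmt.Println(New(9, 3).Intersects(failedUnion)) // false
}
```

With this shape, a failure only propagates to containers whose target sets actually overlap a failed container's last-known CPUs.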
@shiya0705 Thank you for your design proposal; I think it will be good to implement it.
However, I think we need to consider one more scenario:
What if this is the first run of the reconcile loop after a kubelet restart?
In such a situation we don't know what is set in the runtime (lastUpdateState will be an empty set for all containers). If updateContainerCPUSet fails for any container in the 2nd pass and we add the shared-CPU containers to the failure list, the union of their lcsets will remain empty. Because of that, the intersection of a container's cset with this union will also be empty, and we will try to continue with all exclusive-CPU containers, possibly leading to conflicts, e.g.:
- initial state:
  defaultSet={0,1,2,3}, allocations={A:{4,5}},
  lastUpdateState={A:{4,5}, B:{0,1,2,3}},
  runtime={A:{4,5}, B:{0,1,2,3}}
- scale up of A (just allocation):
  defaultSet={0,1}, allocations={A:{2,3,4,5}},
  lastUpdateState={A:{4,5}, B:{0,1,2,3}},
  runtime={A:{4,5}, B:{0,1,2,3}}
- kubelet restart:
  defaultSet={0,1}, allocations={A:{2,3,4,5}},
  lastUpdateState={},
  runtime={A:{4,5}, B:{0,1,2,3}}
- reconcile of B fails, we continue with setting up A:
  defaultSet={0,1}, allocations={A:{2,3,4,5}},
  lastUpdateState={A:{2,3,4,5}},
  runtime={A:{2,3,4,5}, B:{0,1,2,3}} <-- conflict on CPUs: 2,3
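The failure mode here is a vacuously passing intersection check. A small illustrative sketch (plain maps and hypothetical values, not the real code) shows why an empty failure union lets the conflicting update through:

```go
package main

import "fmt"

func set(cpus ...int) map[int]bool {
	m := map[int]bool{}
	for _, c := range cpus {
		m[c] = true
	}
	return m
}

func intersects(a, b map[int]bool) bool {
	for c := range a {
		if b[c] {
			return true
		}
	}
	return false
}

func main() {
	// After a kubelet restart, lastUpdateState is empty, so a failed
	// shared-CPU container B contributes an empty lcset to the union.
	failedUnion := set() // union of empty lcsets stays empty

	// Exclusive container A wants {2, 3, 4, 5}; the runtime still has
	// B on {0, 1, 2, 3}, but the check cannot see that.
	aTarget := set(2, 3, 4, 5)
	fmt.Println(intersects(aTarget, failedUnion)) // false: the check
	// passes vacuously, yet applying A's set conflicts with B on 2, 3.
}
```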
@shiya0705 How do you propose to cover this scenario?
If lastUpdateState is empty, how about reading the cpuset from the runtime using func (m *kubeGenericRuntimeManager) GetContainerStatus, and using that cpuset in place of lastUpdateState?
Maybe I missed something, so please correct me if I don't see a valid possibility, but to me it looks like:
That would be a huge modification, and we would probably need to get community opinion first.
The communication is intended to be one direction only: kubelet -> runtime,
with the exception of requesting status (but limited to only several basic fields).
Currently, getting the CPUSet from the runtime is not implemented on any layer, so there would be a lot to do to implement it through all the layers.
Also, the runtime is completely separate from kubelet itself, so different runtimes might not want to implement that, and even if they did, it is an API change that needs to be introduced as a feature.
So without reading the current runtime state, we cannot know what is set and cannot avoid collisions.
Accepting that, I think your approach @shiya0705 is fine. I will implement it.
In func (m *kubeGenericRuntimeManager) GetContainerStatus,
resp, err := m.runtimeService.ContainerStatus(ctx, id.ID, false) reads the container status from the runtime, and the cpuset is already included in resp (of type ContainerStatusResponse):
ContainerStatusResponse
-> ContainerStatus
-> ContainerResources
-> LinuxContainerResources
-> CpusetCpus string
So I think it is possible to read the cpuset from the runtime, but it needs some modification.
Another thing we need to consider: if updateContainerCPUSet returns an error, even though it is possible to get the cpuset from the runtime, the get may also fail and still lead to a conflict.
If we accept that, I think this approach is fine.
Thanks @shiya0705
I modified the algorithm as you suggested at the beginning. I also added some more test cases to verify failures during multi-pass reconciliation.
I think we cannot go with adding CPUSetCPUs to the runtime response. That would change the protocol between Kubernetes and the runtime, and it is a huge change that requires a separate feature gate and consultation with the projects implementing runtimes.
Please make another round of review.
		continue
	}
	m.lastUpdateState.SetCPUSet(rca.podUID, rca.containerName, iset)
}
The container is not appended to the success list in this loop when its updateContainerCPUSet succeeds.
It could be updated in the third loop (step 3-3); however, the prerequisite is that no container failed in the previous loop.
Yes, that's intentional, because it is not fully set yet. The exclusiveCPUContainers will get to the success or failure list in the 3rd pass.
However, since you opened the discussion about it, let's chat about it:
- First of all, please notice that the success and failure lists are used only in tests.
- Second, there is no definition of what success or failure means.
(Before my patch it was easier, because every container ended up in one of the categories.)
So here are possible solutions for how we can handle that:
Option 1
Define success as a successful application of the final state to the container.
So containers can be on the success list if application of CPU sets is complete, on the failure list if an error happened, or not present on any of the lists if they were not processed at all or only partially.
Results:
- If there is an error in the 1st pass, only failed containers from exclusiveCPUContainers will be reported on the failure list; nonExclusiveCPUContainers won't be reported.
- If there is an error in the 2nd pass (1st pass had no errors), all nonExclusiveCPUContainers will be returned as success or failure; none of the exclusiveCPUContainers will be reported.
- If we reach the 3rd pass (no errors in passes 1 and 2), all containers will be reported in one of the two categories.
Option 2
Make every container appear on one of the returned lists. That would require adding 2 more lists to be returned: unprocessed and partial.
Results:
- If there is an error in the 1st pass, failed containers from exclusiveCPUContainers will be reported on the failure list; partially set up containers from exclusiveCPUContainers will be reported on the partial list; nonExclusiveCPUContainers will be reported on the unprocessed list.
- If there is an error in the 2nd pass (1st pass had no errors), all nonExclusiveCPUContainers will be returned as success or failure; all exclusiveCPUContainers will be reported as partial.
- If we reach the 3rd pass (no errors in passes 1 and 2), all containers will be reported as success or failure; the partial and unprocessed lists will be empty.
@shiya0705 , @esotsal please let me know your opinion.
After we decide I will apply changes to the code (or not) and add tests to cover the scenarios.
My opinion is to leave this code as is (so choosing Option 1), but add unit tests to cover all execution paths.
Both of these options assume that if one container fails in a previous stage, all containers in the following stages stop their cpuset updates. That may not be very friendly.
Option 1 only reports the containers whose updateContainerCPUSet failed; unprocessed and partial containers whose cpusets were not updated are ignored. Option 2 seems to cover this more comprehensively.
I added a new design option in comment https://github.com/esotsal/kubernetes/pull/11#discussion_r2870674556; let's check if it helps.
Use the new topology.Allocation struct (a CPU set plus alignment metadata) instead of a CPU set, due to rebase. Remove a duplicate, unnecessary SetDefaultCPUSet call as per review comment.
- Revert introduction of API env mustKeepCPUs
- Replace mustKeepCPUs with local checkpoint "Original"
- Introduce "Original" / "Resized" in CPUManagerCheckpointV3 format
- Add logic, refactor with Beta candidate
- Fix lint issues
- Fail if mustKeepCPUs are not subset of resulted CPUs
- Fail if reusableCPUsForResize, mustKeepCPUs are not a subset
of aligned CPUs
- Fail if mustKeepCPUs are not a subset of reusable CPUs
- TODO improve align resize tests, go through testing, corner cases
refactor using cpumanager_test.go
- TODO improve CPUManagerCheckpointV3 tests
- TODO address code review/feedback to try different approach to allocate
stepwise instead of once off when resizing
- TODO check init-containers
- TODO check migration from v2 to v3 CPU Manager checkpoint
- TODO check kubectl failure when prohibited can this be done earlier?
- WIP update CPU Manager tests to use refactored cpu_manager_test
- TODO update topologymanager,cpumanager,memorymanager documentation
To implement the design approved by KEP, update admit handler for topology/cpu manager to perform the appropriate feasibility checks on lifecycle.ResizeOperation.
…rror Handle first review round, not ready yet
Enhance the CPU manager's test of the reconcileState function (the one that actuates allocated CPU sets in the runtime). The improvement involves three elements:
1) verification that lastUpdateState contains expected values;
2) enabling verification of how the reconcile process completed for multiple containers, not just a single one;
3) extending test cases to cover the above changes and adding a simple multi-container test.
Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Extend mockRuntimeService in the CPU manager tests to store CPUSets that were applied to the runtime. Additionally, after each update of container resources (if the testCPUConflicts flag is enabled), the mock runtime verifies that exclusive CPUs are assigned to a single container only.

The extended mockRuntimeService is applied to TestReconcileState, so it verifies conflicts after each update; after reconciliation is completed, it verifies that the runtime state matches expectations. Proper fields are added to each of the existing test cases.

Additionally, mockRuntimeService has been enhanced to support returning a sequence of errors from UpdateContainerResources calls. The err field was changed from a single error to a slice of errors ([]error), and UpdateContainerResources now returns the first error from the slice and removes it, allowing different errors to be returned on successive calls. This enables more sophisticated testing scenarios where multiple container resource updates may fail at different points in the test sequence. Proper fields are added to each of the existing test cases to accommodate the new error-slice functionality and CPU set tracking capabilities.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Add test cases for verification of CPUSet reconciliation for containers using exclusive CPUs. These test cases verify the behavior of CPU scaling with InPlacePodVerticalScalingExclusiveCPUs enabled. Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
The former implementation of reconcileState in the CPU manager had two issues:

1) It didn't apply the CPU sets of all containers as a consistent state. The loop over all pods and containers applied CPUSets one by one without any critical section. During the iteration, the allocated CPUSets and the default set could have been changed by executing Allocate for the needs of a resize or the appearance of a new container. Such a situation could lead to conflicts of exclusive CPUs, e.g.:
   a) reconcileState applies the default CPU set to container A's runtime
   b) Allocate removes some CPUs from the default CPU set and uses them as additional exclusive CPUs for container B, which resizes up
   c) the reconcileState loop continues and applies the new CPU set to container B's runtime
   The CPUs that were allocated in step b) are now assigned to both containers A and B in the runtime.

2) It didn't consider temporary conflicts when moving CPUs from one container to another. For example:
   a) container A uses CPUs 1, 2; container B uses CPUs 3, 4
   b) container B scales down by one CPU, and now only CPU 3 is allocated to it
   c) container A scales up and receives CPU 4 during allocation (so it now has CPUs 1, 2, 4 allocated)
   d) reconcileState applies the new CPU set to container A's runtime: 1, 2, 4
   e) reconcileState applies the new CPU set to container B's runtime: 3
   Between steps d) and e), CPU 4 is assigned to both container A and container B. If kubelet is restarted at that time, the situation will hold for some time.

The new algorithm:
1) Modifies the loop iterating over all containers in all pods to act in a critical section controlled by the CPU manager's lock (the same one used during Allocate). During the iteration, CPU sets are not yet applied but only collected into local variables: exclusiveCPUContainers and nonExclusiveCPUContainers. Usage of the lock guarantees a consistent state.
2) After collection, and outside the critical section, CPU sets are applied to the runtime in three steps:
   2.1) Remove scaled-down exclusive CPUs from containers.
        * As containers using exclusive CPUs cannot be scaled down to 0 (they need to retain the Original CPU set), this is a safe operation that won't try to set an empty set in the runtime.
        * After this operation, all CPUs that now belong to the default CPU set are no longer used exclusively by any container, so the next step can be applied.
   2.2) Apply CPU sets for all non-exclusive containers.
        * These containers will use the default CPUSet.
        * If the default CPUSet shrank since the last reconcileState call due to allocation of CPUs as exclusive, the CPUs removed from it are no longer used by non-exclusive containers, so the next step can be applied.
   2.3) Set final CPU sets for containers using exclusive CPUs.
3) Improves conflict detection by tracking failed container updates and preventing further updates that would conflict with previously failed ones.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
This commit adds extensive test coverage for the new multi-pass reconciliation algorithm in the CPU manager, covering various failure scenarios of UpdateContainerResources function during the reconciliation process to ensure proper handling of CPU conflicts. Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
	// Determine the CPU set to use based on the pass
	var targetCPUSet cpuset.CPUSet
	if preliminary {
		targetCPUSet = rca.allocatedSet.Intersection(lcset)
preliminary is true only in the case of removing CPUs from containers using exclusive CPUs. In that case,
targetCPUSet = rca.allocatedSet.Intersection(lcset) and targetCPUSet = rca.allocatedSet are the same. There is no need to add a preliminary condition check here.
We also need to think about the case where the set of CPUs is not only scaled down, but changed.
Let's consider the following scenario for a container:
- initial scenario state:
  allocated: {original: {1}, resized: {1, 2, 3}}, lastUpdateState: {1, 2, 3}
- scale down 3 -> 1:
  allocated: {original: {1}, resized: {1}}, lastUpdateState: {1, 2, 3}
- meanwhile, some other container is resized and gets allocated CPUs 2 and 3
- scale up 1 -> 2:
  allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
- only now is reconcileState launched
The targetCPUSet for the first pass must be {1} - the intersection of allocated and lastUpdateState.
We don't want to use {1, 4}, as CPU 4 might still be used by some other container.
I understand your design idea. It seems reasonable.
	}

	// Check if update is needed
	if !targetCPUSet.Equals(lcset) {
In the first pass, updateContainers(exclusiveCPUContainers, true), all containers in exclusiveCPUContainers that have been resized will update their cpuset, not only the containers whose CPUs were removed.
Yes, they update lastUpdateState each time changes to the runtime are made.
However, the change in the first pass (updateContainers(exclusiveCPUContainers, true)) might not be final - only some CPUs were removed - so in the 3rd pass we still need to check whether another update is required.
Look at the scenario from the above comment.
reconcileState is run with the following state: allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
So:
- in the 1st pass there will be an update to {1} - only the removal of CPUs 2 and 3
- in the 3rd pass there will be an update to {1, 4} - the final set, applied once we have the guarantee that CPU 4 is no longer used by any other container.
Since we missed the deadline for v1.36 and are now aiming for v1.37, I think it makes sense to merge this into the main PR and continue the discussion there. Please @lukaszwojciechowski rebase to merge your PR.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR modifies the algorithm that applies CPU sets to container runtimes in the CPU manager's reconciliation loop.
It solves conflicts of CPU usage that can appear in scenarios of resizing exclusive CPU resources without restart, with Guaranteed QoS class, the static CPU policy, and InPlacePodVerticalScaling enabled.
For details about the conflicting scenarios, see the "Which issue(s) this PR fixes" section below.
PR structure
PR consists of 5 commits:
- verification that lastUpdateState contains expected values
- enabling verification of how the reconcile process completed for multiple containers, not just a single one
- extending test cases to cover the above changes and adding a simple multi-container test
- extending mockRuntimeService in the CPU manager tests to store CPUSets that were applied to the runtime
- verifying that exclusive CPUs are assigned to a single container only
- verifying that the runtime state matches expectations
- adding test cases for verification of CPUSet reconciliation for containers using exclusive CPUs
- modification of the algorithm (NOTE: test cases pass again)
- adding test cases covering UpdateContainerResources failures at different passes
Algorithm:
1. Collect, in a critical section controlled by the CPU manager's lock (the same one used during Allocate), the CPU sets of all containers into two local variables: exclusiveCPUContainers and nonExclusiveCPUContainers. Usage of the lock guarantees a consistent state.
2. Outside the critical section, apply the collected CPU sets to the runtime in three steps:
   - Step 3.1: remove scaled-down exclusive CPUs from containers
   - Step 3.2: apply CPU sets for all non-exclusive containers
   - Step 3.3: set final CPU sets for containers using exclusive CPUs
Which issue(s) this PR fixes:
Resize of integer CPU resources of pods with Guaranteed QoS class, without restarting containers, when the CPU policy is static and InPlacePodVerticalScaling is enabled, requires handling more complex scenarios when applying the allocated CPUs to containers' runtime.
The algorithm used in reconcileState in the CPU manager has a few drawbacks that may lead to temporary conflicts of CPU usage (exclusive CPUs used by multiple containers):
Consistent state application
The reconcileState function does not apply the CPU sets of all containers as a consistent state. The loop over all pods and containers applies CPU sets one by one without any critical section. During the iteration, the allocated CPU sets and the default set can be changed by a parallel resize, which executes Allocate in another goroutine and changes the allocations, e.g.:
1. reconcileState applies the default CPU set to container A's runtime
2. Allocate removes some CPUs from the default CPU set and uses them as additional exclusive CPUs for container B, which resizes up
3. the reconcileState loop continues and applies the new CPU set to container B's runtime
From this moment, the CPUs that were allocated in step 2 are applied to the runtimes of both containers A and B.
The situation will fix itself the next time reconcileState is executed (after 10 s).
Temporary conflicts when moving CPUs between containers
Consider following scenario:
[1, 2]; container B uses[3, 4][3][1, 2, 4](so container's A runtime uses
[1, 2, 4], while container B still uses[3, 4])Between steps 4. and 5. CPU 4 is applied in both containers. Situation is temporary and usually very short, but when it is coincided with kubelet restart or some delay, it might affect operation of containers.
Special notes for your reviewer:
Only the patch "rework CPUSet reconciliation algorithm" contains changes to the code.
The other patches extend tests and add test cases covering 100% of the changes made and the scenarios that the new algorithm fixes.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: