FIXES: Hugepage double allocation and add node uncordon retries functionality #23
psiwczak wants to merge 8 commits into Viasat:master from
Conversation
React to the following k8s cluster conditions:
* K8s node is deleted (NHD deletes the node from the list of nodes)
* K8s node is cordoned (NHD removes the node's "active" mark)
* K8s node becomes NotReady (NHD removes the node's "active" mark)
* K8s node has its NHD taint removed (NHD removes the node's "active" mark)
* K8s node is added to the cluster (NHD adds the node as "inactive" to the node list)
* K8s node becomes Ready (NHD scans and updates the node's properties and state, and activates it if the scan is successful)
If a node enters the Unreachable state, cordon it. If a node enters the Reachable state AND is in the Ready state, send an uncordon signal to NHDScheduler.
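The reactions listed in the first commit message above amount to a small event-to-action mapping; the Unreachable/Reachable handling in the second follows the same pattern. The sketch below is illustrative only: the event names and the NHD method names (handle_node_event, deactivate_node, and so on) are assumptions, not NHD's actual API.

```python
from enum import Enum, auto

class NodeEvent(Enum):
    DELETED = auto()
    CORDONED = auto()
    NOT_READY = auto()
    TAINT_REMOVED = auto()
    ADDED = auto()
    READY = auto()

def handle_node_event(nhd, node_name, event):
    if event == NodeEvent.DELETED:
        nhd.remove_node(node_name)             # drop the node from the node list
    elif event in (NodeEvent.CORDONED, NodeEvent.NOT_READY, NodeEvent.TAINT_REMOVED):
        nhd.deactivate_node(node_name)         # clear the node's "active" mark
    elif event == NodeEvent.ADDED:
        nhd.add_node(node_name, active=False)  # track the new node as inactive
    elif event == NodeEvent.READY:
        if nhd.scan_node(node_name):           # rescan node properties and state
            nhd.activate_node(node_name)       # activate only if the scan succeeded
```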
Detect nodes coming and going
Reset the list of NICs and GPUs before parsing labels
This commit fixes the following issues:
1. Hugepages are overallocated each time a node goes through a cordon/uncordon cycle (see the sketch below).
2. On loaded clusters it can take a few seconds for K8s to sync the node state after an uncordon. Before the fix, NHD would mark the uncordoned node as non-active if it did not report a "schedulable" state right after the uncordon event. This commit adds a retry count set by the NODE_ACTIVE_RETRIES constant in NHDScheduler.py; NHD retries the active-state check up to that many times, with a 1 second sleep in between.
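A minimal sketch of how the first fix might look, assuming the node object keeps per-scan hugepage counters and NIC/GPU lists; the attribute and method names here are placeholders, not NHD's actual fields:

```python
def rescan_node(node):
    node.free_hugepages = 0  # reset hugepage accounting from the previous scan
    node.nics = []           # reset the NIC list before parsing labels
    node.gpus = []           # reset the GPU list before parsing labels
    node.parse_labels()      # re-read resources from the fresh node labels
```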
pods = self.k8s.GetNodeScheduledPods(sched_name=self.sched_name, node_name=node_name)
self.logger.info(f'Found scheduled pods: {pods} for node: {node_name}')
for p in pods:
    if p[3] in ('Running', 'CrashLoopBackOff', 'Pending'):
The fourth value in the tuple returned by self.k8s.GetNodeScheduledPods is the pod's status.phase. This is checking for 'CrashLoopBackOff', yet that is not a valid pod status.phase (I believe it is a container state). Further, this is not reclaiming the resources of pods in the Failed state (e.g. evicted due to overrunning ephemeral storage because of runaway logs), so it has the effect of double booking the resources consumed by the failed pod by failing to update the accounting properly.
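A minimal sketch of the point being made, assuming p[3] is the pod's status.phase as returned by GetNodeScheduledPods; the accounting helpers are placeholders, not NHD functions. The valid phases are Pending, Running, Succeeded, Failed and Unknown, so 'CrashLoopBackOff' can never match, and Failed pods need their resources reclaimed:

```python
# Valid pod phases that still hold node resources; 'CrashLoopBackOff' is a
# container state, not a phase, so checking for it never matches.
ACTIVE_PHASES = ('Pending', 'Running')

def account_scheduled_pods(pods, account_pod, reclaim_pod):
    for p in pods:
        phase = p[3]               # status.phase from GetNodeScheduledPods
        if phase in ACTIVE_PHASES:
            account_pod(p)         # pod still consumes node resources
        elif phase == 'Failed':
            reclaim_pod(p)         # e.g. evicted pods must release their accounting
```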
for _,n in self.nodes.items():
    n.PrintResourceStats()

def PrintSingleNodeResources(self,node):
I don't understand the point of this function. Whoever calls it has to pass the node, so why wouldn't the caller just invoke node.PrintResourceStats() rather than invoking self.PrintSingleNodeResources(node)?
if active:
    break
self.logger.info(f'Node {v.name} not active yet - will retry status check. Count: {i}')
time.sleep(1)
It's probably better to set a wake time here and check in the idle part above. Otherwise you have a single-threaded controller sleeping for up to 10 seconds and causing a backlog of events.
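A sketch of this suggestion, assuming a single-threaded controller loop with an idle branch; the names pending_checks and is_node_active are illustrative, not NHD's actual API:

```python
import time

# Instead of blocking in time.sleep(), record a wake deadline per node and
# re-check it whenever the event loop is idle, so other events keep flowing.
pending_checks = {}  # node name -> (monotonic wake time, remaining retries)

def schedule_recheck(node_name, retries):
    pending_checks[node_name] = (time.monotonic() + 1.0, retries)

def process_idle(is_node_active):
    now = time.monotonic()
    for node_name, (wake_at, retries) in list(pending_checks.items()):
        if now < wake_at:
            continue                                  # not due yet
        if is_node_active(node_name) or retries <= 0:
            del pending_checks[node_name]             # activated, or out of retries
        else:
            schedule_recheck(node_name, retries - 1)  # try again in ~1 second
```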
This commit fixes the following issues:

1. Hugepages are overallocated each time a node goes through a cordon/uncordon cycle.
2. On loaded clusters it can take K8s a few seconds to sync the node state when a node is uncordoned. Before the fix, NHD would mark the freshly uncordoned node as non-active if K8s did not report a "schedulable" state right after the uncordon event, which does not always happen instantly. This commit adds a retry count set by the NODE_ACTIVE_RETRIES constant in NHDScheduler.py; NHD retries the active-state check up to that many times, with a 1 second sleep in between. A sketch of this retry loop is shown below.
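A sketch of the retry behaviour described above; the NODE_ACTIVE_RETRIES value, the node object, and its is_schedulable() check are simplified assumptions, not NHD's actual code:

```python
import time

NODE_ACTIVE_RETRIES = 10  # assumed value of the constant in NHDScheduler.py

def wait_for_active(node, logger):
    for i in range(NODE_ACTIVE_RETRIES):
        if node.is_schedulable():  # K8s may take a few seconds to sync after uncordon
            return True
        logger.info(f'Node {node.name} not active yet - will retry status check. Count: {i}')
        time.sleep(1)              # 1 second between retries
    return False
```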