FIXES: Hugepage double allocation and add node uncordon retries functionality #23

Open

psiwczak wants to merge 8 commits into Viasat:master from psiwczak:pstest8

Conversation

@psiwczak
Contributor

psiwczak commented Aug 3, 2021

This commit fixes the following issues:

  1. Hugepages are overallocated each time a node goes through a cordon/uncordon cycle.

  2. On loaded clusters it can take K8s a few seconds to sync the node state after an uncordon.
    Before the fix, NHD would mark a freshly uncordoned node as non-active if K8s did not report
    the "schedulable" state immediately after the uncordon event, which does not always happen instantly.
    This commit adds a retry count set by the NODE_ACTIVE_RETRIES constant in NHDScheduler.py. NHD rechecks
    whether the node is active up to that many times, with a 1 second sleep in between (see the sketch after this list).
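
A minimal sketch of the retry behaviour described in item 2, assuming a hypothetical IsNodeSchedulable helper on NHD's K8s wrapper and an illustrative value for NODE_ACTIVE_RETRIES; the real constant and check live in NHDScheduler.py:

import time

NODE_ACTIVE_RETRIES = 10  # illustrative value; the real constant is defined in NHDScheduler.py

def wait_until_schedulable(k8s, node_name):
    """Poll K8s after an uncordon until the node reports schedulable, or give up."""
    for i in range(NODE_ACTIVE_RETRIES):
        if k8s.IsNodeSchedulable(node_name):  # hypothetical helper, not NHD's actual API
            return True
        # On loaded clusters K8s can take a few seconds to sync the node state.
        time.sleep(1)
    return False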

psiwczak added 8 commits June 21, 2021 22:04
React to the following k8s cluster conditions:

* K8s node is deleted
  (NHD deletes the node from the list of nodes)
* K8s node is cordoned
  (NHD removes the node's "active" mark)
* K8s node becomes NotReady
  (NHD removes the node's "active" mark)
* K8s node has its NHD taint removed
  (NHD removes the node's "active" mark)

* K8s node is added to the cluster
  (NHD adds the node as "inactive" to the node list)
* K8s node becomes Ready
  (NHD scans & updates the node's properties & state, and activates it if the scan is successful)
If a node enters the Unreachable state - cordon it.
If a node becomes Reachable AND is in the Ready state -
send an uncordon signal to NHDScheduler.
Detect nodes coming and going
Reset the list of NICs and GPUs before parsing labels
This commit fixes the following issues:
1. Hugepages are overallocated each time a node goes through a cordon/uncordon cycle.

2. On loaded clusters it can take a few seconds to sync the node state when uncordoned.
   Before the fix, NHD would mark the uncordoned node as non-active if it did not report
   the "schedulable" state right after the uncordon event. This commit adds a retry count set by
   the NODE_ACTIVE_RETRIES constant in NHDScheduler.py. NHD will recheck whether the node is
   active up to this number of times, with a 1 second sleep in between.
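
A minimal sketch of the kind of node watch loop these commits describe, assuming the kubernetes Python client; NHD's actual watch and handler code lives in its own modules and is structured differently:

from kubernetes import client, config, watch

config.load_incluster_config()  # assumes NHD runs inside the cluster
v1 = client.CoreV1Api()

def node_is_ready(node):
    # A node is Ready when its 'Ready' condition reports status 'True'.
    return any(c.type == 'Ready' and c.status == 'True'
               for c in (node.status.conditions or []))

for event in watch.Watch().stream(v1.list_node):
    node = event['object']
    if event['type'] == 'DELETED':
        pass  # drop the node from NHD's node list
    elif node.spec.unschedulable or not node_is_ready(node):
        pass  # clear the node's "active" mark
    else:
        pass  # (re)scan the node and mark it active if the scan succeeds
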
    pods = self.k8s.GetNodeScheduledPods(sched_name=self.sched_name, node_name=node_name)
    self.logger.info(f'Found scheduled pods: {pods} for node: {node_name}')
    for p in pods:
        if p[3] in ('Running', 'CrashLoopBackOff', 'Pending'):

The fourth value in the tuple returned by self.k8s.GetNodeScheduledPods is the pod's status.phase. This is checking for 'CrashLoopBackOff', yet that is not a valid pod status.phase (I believe it is a container state). Further, this does not reclaim the resources of pods in the Failed state (e.g. evicted for overrunning ephemeral storage due to runaway logs), so it has the effect of double booking the failed pod's resources by failing to update the accounting.
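
A minimal sketch of the check this comment is pointing at: decide booking vs. reclaim from the pod's status.phase, treating Succeeded/Failed pods as terminal. The p[3] tuple layout follows the snippet above; the booking/reclaim steps are placeholders, not NHD's actual API:

# Valid pod status.phase values are Pending, Running, Succeeded, Failed, Unknown;
# 'CrashLoopBackOff' is a container waiting-state reason, not a phase.
ACTIVE_PHASES = ('Pending', 'Running')
TERMINAL_PHASES = ('Succeeded', 'Failed')

for p in pods:
    phase = p[3]  # fourth tuple element is the pod's status.phase
    if phase in ACTIVE_PHASES:
        pass  # keep the pod's CPU/hugepage/NIC/GPU resources booked on the node
    elif phase in TERMINAL_PHASES:
        pass  # release the pod's bookings so they are not double counted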

        for _, n in self.nodes.items():
            n.PrintResourceStats()

    def PrintSingleNodeResources(self, node):

I don't understand the point of this function. Whoever calls it has to pass the node, so why wouldn't the caller just invoke node.PrintResourceStats() rather than invoking self.PrintSingleNodeResources(node)?

    if active:
        break
    self.logger.info(f'Node {v.name} not active yet - will retry status check. Count: {i}')
    time.sleep(1)

It's probably better to set a wake time here and check in the idle part above. Otherwise you have a single-threaded controller sleeping for up to 10 seconds and causing a backlog of events.
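
A minimal sketch of the alternative this comment describes: record a wake time instead of blocking in the handler, and let the controller's idle loop perform the re-check. Attribute and callback names here are assumptions, not NHD's actual fields:

import time

def mark_for_recheck(node, delay=1.0):
    # Called from the event handler instead of time.sleep(1).
    node.active_recheck_at = time.monotonic() + delay

def idle_tick(nodes, recheck_node_active):
    # Called from the controller's idle loop; never blocks the event thread.
    now = time.monotonic()
    for node in nodes.values():
        wake = getattr(node, 'active_recheck_at', None)
        if wake is not None and now >= wake:
            node.active_recheck_at = None
            recheck_node_active(node)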
