FIXES: Hugepage double allocation and add node uncordon retries functionality #23
psiwczak wants to merge 8 commits into Viasat:master from
Conversation
React to the following k8s cluster conditions:
* K8s node is deleted (NHD deletes the node from the list of nodes)
* K8s node is cordoned (NHD removes the node's "active" mark)
* K8s node becomes NotReady (NHD removes the node's "active" mark)
* K8s node has its NHD taint removed (NHD removes the node's "active" mark)
* K8s node is added to the cluster (NHD adds the node as "inactive" to the node list)
* K8s node becomes Ready (NHD scans and updates the node's properties and state, and activates it if the scan is successful)
If a node enters the Unreachable state, cordon it. If a node enters the Reachable state AND is in the Ready state, send an uncordon signal to NHDScheduler.
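The reactions listed in the first commit message above amount to a small event-to-action mapping; the Unreachable/Reachable handling in the second follows the same pattern. The sketch below is illustrative only: the event names and the NHD method names (handle_node_event, deactivate_node, and so on) are assumptions, not NHD's actual API.

```python
from enum import Enum, auto

class NodeEvent(Enum):
    DELETED = auto()
    CORDONED = auto()
    NOT_READY = auto()
    TAINT_REMOVED = auto()
    ADDED = auto()
    READY = auto()

def handle_node_event(nhd, node_name, event):
    if event == NodeEvent.DELETED:
        nhd.remove_node(node_name)             # drop the node from the node list
    elif event in (NodeEvent.CORDONED, NodeEvent.NOT_READY, NodeEvent.TAINT_REMOVED):
        nhd.deactivate_node(node_name)         # clear the node's "active" mark
    elif event == NodeEvent.ADDED:
        nhd.add_node(node_name, active=False)  # track the new node as inactive
    elif event == NodeEvent.READY:
        if nhd.scan_node(node_name):           # rescan node properties and state
            nhd.activate_node(node_name)       # activate only if the scan succeeded
```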
Detect nodes coming and going
Reset the list of NICs and GPUs before parsing labels
This commit fixes the following issues:
1. Hugepages are overallocated each time a node goes through a cordon/uncordon cycle (see the sketch below).
2. On loaded clusters it can take a few seconds for K8s to sync the node state after an uncordon. Before the fix, NHD would mark the uncordoned node as non-active if it did not report a "schedulable" state right after the uncordon event. This commit adds a retry count set by the NODE_ACTIVE_RETRIES constant in NHDScheduler.py; NHD retries the active-state check up to that many times, with a 1 second sleep in between.
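A minimal sketch of how the first fix might look, assuming the node object keeps per-scan hugepage counters and NIC/GPU lists; the attribute and method names here are placeholders, not NHD's actual fields:

```python
def rescan_node(node):
    node.free_hugepages = 0  # reset hugepage accounting from the previous scan
    node.nics = []           # reset the NIC list before parsing labels
    node.gpus = []           # reset the GPU list before parsing labels
    node.parse_labels()      # re-read resources from the fresh node labels
```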
pods = self.k8s.GetNodeScheduledPods(sched_name=self.sched_name, node_name=node_name)
self.logger.info(f'Found scheduled pods: {pods} for node: {node_name}')
for p in pods:
    if p[3] in ('Running', 'CrashLoopBackOff', 'Pending'):
The fourth value in the tuple returned by self.k8s.GetNodeScheduledPods is the pod's status.phase. This is checking for 'CrashLoopBackOff', yet that is not a valid pod status.phase (I believe it is a container state). Further, this is not reclaiming the resources of pods in the Failed state (e.g. evicted due to overrunning ephemeral storage because of runaway logs), so it has the effect of double booking the resources consumed by the failed pod by failing to update the accounting properly.
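A minimal sketch of the point being made, assuming p[3] is the pod's status.phase as returned by GetNodeScheduledPods; the accounting helpers are placeholders, not NHD functions. The valid phases are Pending, Running, Succeeded, Failed and Unknown, so 'CrashLoopBackOff' can never match, and Failed pods need their resources reclaimed:

```python
# Valid pod phases that still hold node resources; 'CrashLoopBackOff' is a
# container state, not a phase, so checking for it never matches.
ACTIVE_PHASES = ('Pending', 'Running')

def account_scheduled_pods(pods, account_pod, reclaim_pod):
    for p in pods:
        phase = p[3]               # status.phase from GetNodeScheduledPods
        if phase in ACTIVE_PHASES:
            account_pod(p)         # pod still consumes node resources
        elif phase == 'Failed':
            reclaim_pod(p)         # e.g. evicted pods must release their accounting
```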
for _,n in self.nodes.items():
    n.PrintResourceStats()

def PrintSingleNodeResources(self,node):
I don't understand the point of this function. Whoever calls it has to pass the node, so why wouldn't the caller just invoke node.PrintResourceStats() rather than invoking self.PrintSingleNodeResources(node)?
if active:
    break
self.logger.info(f'Node {v.name} not active yet - will retry status check. Count: {i}')
time.sleep(1)
It's probably better to set a wake time here and check in the idle part above. Otherwise you have a single-threaded controller sleeping for up to 10 seconds and causing a backlog of events.
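A sketch of this suggestion, assuming a single-threaded controller loop with an idle branch; the names pending_checks and is_node_active are illustrative, not NHD's actual API:

```python
import time

# Instead of blocking in time.sleep(), record a wake deadline per node and
# re-check it whenever the event loop is idle, so other events keep flowing.
pending_checks = {}  # node name -> (monotonic wake time, remaining retries)

def schedule_recheck(node_name, retries):
    pending_checks[node_name] = (time.monotonic() + 1.0, retries)

def process_idle(is_node_active):
    now = time.monotonic()
    for node_name, (wake_at, retries) in list(pending_checks.items()):
        if now < wake_at:
            continue                                  # not due yet
        if is_node_active(node_name) or retries <= 0:
            del pending_checks[node_name]             # activated, or out of retries
        else:
            schedule_recheck(node_name, retries - 1)  # try again in ~1 second
```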
This commit fixes the following issues:

1. Hugepages are overallocated each time a node goes through a cordon/uncordon cycle.
2. On loaded clusters it can take K8s a few seconds to sync the node state when a node is uncordoned. Before the fix, NHD would mark the freshly uncordoned node as non-active if K8s did not report a "schedulable" state right after the uncordon event, which does not always happen instantly. This commit adds a retry count set by the NODE_ACTIVE_RETRIES constant in NHDScheduler.py; NHD retries the active-state check up to that many times, with a 1 second sleep in between. A sketch of this retry loop is shown below.
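A sketch of the retry behaviour described above; the NODE_ACTIVE_RETRIES value, the node object, and its is_schedulable() check are simplified assumptions, not NHD's actual code:

```python
import time

NODE_ACTIVE_RETRIES = 10  # assumed value of the constant in NHDScheduler.py

def wait_for_active(node, logger):
    for i in range(NODE_ACTIVE_RETRIES):
        if node.is_schedulable():  # K8s may take a few seconds to sync after uncordon
            return True
        logger.info(f'Node {node.name} not active yet - will retry status check. Count: {i}')
        time.sleep(1)              # 1 second between retries
    return False
```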