Description
Today the MCO makes no attempt to order which of the candidate nodes it updates next. One problem we're thinking about is that workloads can easily be disrupted multiple times during a single OS upgrade. This is particularly relevant on bare metal, where a node may run many pods, possibly including pods that are expensive to reschedule, such as CNV workloads.
When we drain a node, its pods are rescheduled across the remaining nodes. We then upgrade one of those nodes, quite possibly moving some of the same workload pods again, and so on.
One idea is to add minimal hooks so that a separate controller could influence this ordering today.
For example, if we supported a label such as machineconfig.openshift.io/upgrade-weight=42 and the node controller picked the highest-weight node first, the separate controller could also mark the next $number nodes in the upgrade ordering as unschedulable, ensuring that pods drained from the current node don't land on them.
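A minimal sketch of the proposed ordering follows, under several assumptions not stated in the proposal: the label value is parsed as an integer, unlabeled or malformed nodes default to weight 0, and ties are broken by node name for determinism. `Node`, `upgrade_weight`, `upgrade_order`, and `pick_next_and_cordon_upcoming` are hypothetical names, not MCO APIs, and setting `unschedulable` only stands in for cordoning the node (`spec.unschedulable`) through the Kubernetes API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Label name taken from the proposal; everything else here is illustrative.
UPGRADE_WEIGHT_LABEL = "machineconfig.openshift.io/upgrade-weight"


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a Kubernetes Node object."""
    name: str
    labels: dict[str, str] = field(default_factory=dict)
    unschedulable: bool = False  # mirrors spec.unschedulable (cordon)


def upgrade_weight(node: Node, default: int = 0) -> int:
    """Read the weight label; missing or non-integer values fall back to `default` (assumed 0)."""
    raw = node.labels.get(UPGRADE_WEIGHT_LABEL)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def upgrade_order(candidates: list[Node]) -> list[Node]:
    """Highest weight first; ties broken by node name (an assumption)."""
    return sorted(candidates, key=lambda n: (-upgrade_weight(n), n.name))


def pick_next_and_cordon_upcoming(
    candidates: list[Node], lookahead: int
) -> tuple[Optional[Node], list[Node]]:
    """Pick the node to upgrade now and mark the next `lookahead` nodes unschedulable,
    so pods drained from the current node are not placed on nodes about to be upgraded."""
    ordered = upgrade_order(candidates)
    if not ordered:
        return None, []
    current, upcoming = ordered[0], ordered[1:1 + lookahead]
    for node in upcoming:
        node.unschedulable = True  # a real controller would cordon via the API
    return current, upcoming


# Usage: worker-b has the highest weight; worker-a and worker-d come next.
nodes = [
    Node("worker-a", {UPGRADE_WEIGHT_LABEL: "10"}),
    Node("worker-b", {UPGRADE_WEIGHT_LABEL: "42"}),
    Node("worker-c"),
    Node("worker-d", {UPGRADE_WEIGHT_LABEL: "7"}),
]
current, upcoming = pick_next_and_cordon_upcoming(nodes, lookahead=2)
print(current.name, [n.name for n in upcoming])
```

The lookahead plays the role of $number in the proposal: a larger value protects more upcoming nodes from receiving drained pods, at the cost of temporarily reducing schedulable capacity.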
Without excess capacity, or a scheduler change to more strongly prefer packing nodes, it seems hard to avoid multiple disruptions entirely, but the label would enable this baseline integration.