The CI team is trying to use new AWS m5d.xlarge instances, which have two NVMe disks attached. We crafted a custom machineconfig that sets up a RAID partition across them to enable that.
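For concreteness, a minimal sketch of that kind of machineconfig, assuming Ignition spec 2.2; the object name, RAID level, device paths, and filesystem choice below are illustrative assumptions, not the exact config we used:

```sh
# Sketch only: RAIDs two instance-store NVMe disks and puts a filesystem on
# the array. Device names (/dev/nvme1n1, /dev/nvme2n1) are assumptions.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-nvme-raid
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      raid:
        - name: data
          level: raid0
          devices:
            - /dev/nvme1n1
            - /dev/nvme2n1
      filesystems:
        - name: data
          mount:
            device: /dev/md/data
            format: xfs
EOF
```

A systemd mount unit (not shown) would still be needed to actually mount the array at boot.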
We added this to the main worker pool. The MCD will fail to roll out the partitioning on the existing workers, but that's fine because the plan was to "roll" the worker pool: land the new MC in the pool, have new workers come online with that config, then scale down the old workers (sketched below).
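Roughly, the roll would look like this; the machineset and node names are hypothetical:

```sh
# New workers come up with the merged worker config, old ones are drained away
# rather than updated in place.
oc scale machineset ci-worker-m5d --replicas=3 -n openshift-machine-api
oc adm drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-local-data
oc scale machineset ci-worker-old --replicas=0 -n openshift-machine-api
```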
However, there are a few issues here.
First, this whole thing would obviously be a lot better if we had machineset-specific machineconfigs. That would solve a bunch of races and be much more elegant.
What we're seeing right now is that one new m5d node went OutOfDisk=true because it was booted with just a 16G root volume from the old config. That unschedulable node then blocks rollout of further changes.
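The stuck state is visible from the node conditions and the pool status; a quick way to check (node name hypothetical):

```sh
oc get nodes                                        # the m5d node shows up unschedulable
oc describe node ip-10-0-1-23.ec2.internal | grep -i outofdisk
oc get machineconfigpool worker                     # pool reports it is not fully updated
```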
I think we can unstick ourselves here by deleting that node and getting the MCO to roll out the new config.
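Something like the following should do it; the machine and node names here are hypothetical:

```sh
# Deleting the backing Machine removes the bad node, and the machineset
# replaces it with one built from the new config.
oc delete machine ci-worker-abc123 -n openshift-machine-api
# Or, if only the stale Node object needs to go:
oc delete node ip-10-0-1-23.ec2.internal
```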