CNI wrapper for Windows AKS nodes, to solve the thundering herd problem when a node reboots

BroadcastRadio/aks-windows-pacemaker

Pacemaker for AKS Windows

What is this project?

This project aims to solve the 'thundering herd' problem on Windows nodes in Azure Kubernetes Service (AKS): when many pods start at once, for example after a node reboots, the node can be overwhelmed. Windows nodes are particularly affected, because setting up networking for a pod can be fragile under high load.

Under that load, health checks can time out and the node can enter the NotReady state. On Azure in particular, the auto-repair feature built into the control plane may then restart the node.

For some use cases, a custom scheduler may suffice. However, a scheduler only acts when pods are being scheduled, so it does not help when, for instance, a node reboots. On Linux nodes, you can use pod-pacemaker, which is the inspiration for this project, but we needed a solution for Windows Server nodes.

How does it work?

We need to hook into the pod startup mechanism. Windows Server on AKS doesn't support custom CNI plugins, as it uses the custom azure-vnet.exe CNI plugin.

This project works by replacing the CNI plugin azure-vnet.exe with a wrapper. The wrapper waits for a signal from the pacemaker daemonset, sent over a named pipe, before running the original azure-vnet.exe plugin. The daemonset monitors the status of the local node and allows another pod to start once enough of the previously starting pods have entered a Running (or failed) state.

How do I install it?

You'll need the Go toolchain.

  1. Clone this repository
  2. Run make (or use the pre-built binaries which we've left in this repo)
  3. Run docker build . -t yourrepo.docker/pacemaker/pacemaker:latest
  4. Edit the image path in pacemaker.yaml, and apply the configuration

What is the admission webhook and how do I install it?

This is a simple mutating webhook that marks each new node with the bootstrapping taint, which prevents pods from being scheduled on a brand-new node. It is not strictly required, but we recommend it for that reason.

To install it, create a new Worker with the contents of webhook.js (we recommend Cloudflare Workers, which has a free plan). Then edit admission-webhook.yaml with your Worker URL and apply the configuration to your cluster.

Limitations

We've only tested this on AKS clusters using Azure CNI Node Subnet. It may also work with Azure CNI Overlay. If it does, or if you run into any problems, please open an issue.

Pull requests welcome!
