Coordinate Cyclops with Cluster Autoscaler to prevent node upgrade conflicts#134

Merged
dnorman3 merged 17 commits into master from dnorman3/scale-down-disabled-annotation
Feb 5, 2026

Conversation

dnorman3 (Contributor) commented Jan 27, 2026

Cluster Autoscaler Annotation Management - Feature Summary

Adds annotation management to coordinate between Cyclops and Cluster Autoscaler during node cycling, preventing Cluster Autoscaler from removing new nodes before old nodes are drained.

Problem

During node cycling, Cluster Autoscaler can remove new nodes before Cyclops finishes draining old nodes, causing cycling to fail or nodes to be terminated prematurely.

Solution

  • Adds cluster-autoscaler.kubernetes.io/scale-down-disabled: "true" annotation to new nodes during ScalingUp phase
  • Removes annotation in both Successful and Healing phase transitions
  • Marker annotation tracking: Uses cyclops.atlassian.com/annotation-managed to track ownership (distinguishes Cyclops-managed vs pre-existing annotations)
  • Pre-existing preservation: Preserves annotations set by ASG Launch Templates or other external sources
  • Opt-out feature: NodeGroup annotation cyclops.atlassian.com/disable-annotation-management: "true" disables management (default: enabled)
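The add-with-marker behaviour above can be sketched as a pure function over a node's annotation map. This is a simplified illustration, not the actual implementation: the real `addScaleDownDisabledAnnotation()` helper in `util.go` operates on Kubernetes `Node` objects via the API, and its exact logic may differ.

```go
package main

import "fmt"

const (
	scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"
	cyclopsManagedAnnotation    = "cyclops.atlassian.com/annotation-managed"
)

// addScaleDownDisabled sets the scale-down-disabled annotation plus a marker
// recording that Cyclops added it. If the annotation is already present
// (e.g. set by an ASG Launch Template), it is left alone and no marker is
// written, so later cleanup will not touch it.
func addScaleDownDisabled(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if _, exists := annotations[scaleDownDisabledAnnotation]; exists {
		return annotations // pre-existing: preserve, don't claim ownership
	}
	annotations[scaleDownDisabledAnnotation] = "true"
	annotations[cyclopsManagedAnnotation] = "true"
	return annotations
}

func main() {
	fresh := addScaleDownDisabled(nil)
	fmt.Println(fresh[scaleDownDisabledAnnotation]) // true

	pre := addScaleDownDisabled(map[string]string{scaleDownDisabledAnnotation: "true"})
	_, managed := pre[cyclopsManagedAnnotation]
	fmt.Println(managed) // false: externally-set annotation is not claimed
}
```

The ownership marker is what makes the feature safe to mix with externally-managed annotations: only entries Cyclops itself wrote carry the marker.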

Key Changes

pkg/controller/cyclenoderequest/transitioner/util.go:

  • addScaleDownDisabledAnnotation() - Adds annotation + marker (preserves pre-existing)
  • cleanupScaleDownDisabledAnnotations() - Removes annotations only if marker present
  • shouldManageAnnotations() - Checks NodeGroup opt-out annotation
  • getNodeGroup() - Finds matching NodeGroup resource

pkg/controller/cyclenoderequest/transitioner/transitions.go:

  • transitionScalingUp() - Adds annotations (if enabled)
  • transitionSuccessful() / transitionHealing() - Removes annotations (if added)

pkg/controller/cyclenoderequest/transitioner/transitioner.go:

  • Constants: nodeGroupAnnotationKey, cyclopsManagedAnnotation

Opt-Out Configuration

apiVersion: atlassian.com/v1
kind: NodeGroup
metadata:
  annotations:
    cyclops.atlassian.com/disable-annotation-management: "true"  # Disables management

Why NodeGroup? Persistent configuration (survives CNR deletion), aligned with node lifecycle, prevents conflicts with overlapping CNRs.
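The opt-out check reduces to a single lookup on the NodeGroup's annotations. A minimal sketch, assuming a plain map of annotations; the real `shouldManageAnnotations()` helper also has to fetch the matching NodeGroup resource via `getNodeGroup()`:

```go
package main

import "fmt"

// nodeGroupOptOutAnnotation is the NodeGroup annotation that disables
// Cyclops' annotation management for that group.
const nodeGroupOptOutAnnotation = "cyclops.atlassian.com/disable-annotation-management"

// shouldManageAnnotations returns true unless the NodeGroup has explicitly
// opted out. Management is the default: a missing annotation, or any value
// other than "true", leaves the feature enabled.
func shouldManageAnnotations(nodeGroupAnnotations map[string]string) bool {
	return nodeGroupAnnotations[nodeGroupOptOutAnnotation] != "true"
}

func main() {
	fmt.Println(shouldManageAnnotations(nil)) // true: default is enabled
	fmt.Println(shouldManageAnnotations(map[string]string{
		nodeGroupOptOutAnnotation: "true",
	})) // false: explicitly opted out
}
```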

Design Decisions

  • Marker annotation ensures safe cleanup (only removes Cyclops-managed annotations)
  • Pre-existing annotations preserved for backward compatibility
  • Best-effort operations (don't block cycling)
  • Dual cleanup paths (Successful + Healing phases)
  • Opt-out approach (default enabled)
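The "marker ensures safe cleanup" decision can be illustrated at the map level. Again a sketch under the same simplifying assumption; the real `cleanupScaleDownDisabledAnnotations()` works against the Kubernetes API from both the Successful and Healing transitions:

```go
package main

import "fmt"

const (
	scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"
	cyclopsManagedAnnotation    = "cyclops.atlassian.com/annotation-managed"
)

// cleanupScaleDownDisabled removes the scale-down-disabled annotation only
// when the Cyclops ownership marker is present. Annotations set by other
// systems (e.g. an ASG Launch Template) carry no marker, so both cleanup
// paths leave them untouched.
func cleanupScaleDownDisabled(annotations map[string]string) {
	if annotations[cyclopsManagedAnnotation] != "true" {
		return // not ours to remove
	}
	delete(annotations, scaleDownDisabledAnnotation)
	delete(annotations, cyclopsManagedAnnotation)
}

func main() {
	managed := map[string]string{
		scaleDownDisabledAnnotation: "true",
		cyclopsManagedAnnotation:    "true",
	}
	cleanupScaleDownDisabled(managed)
	fmt.Println(len(managed)) // 0: annotation and marker both removed

	external := map[string]string{scaleDownDisabledAnnotation: "true"}
	cleanupScaleDownDisabled(external)
	fmt.Println(len(external)) // 1: externally-set annotation preserved
}
```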

Verification

# Check nodes with annotations
kubectl get nodes -o json | jq -r '.items[] | select(.metadata.annotations["cluster-autoscaler.kubernetes.io/scale-down-disabled"] == "true") | .metadata.name'

# Check opt-out configuration
kubectl get nodegroup <name> -o jsonpath='{.metadata.annotations.cyclops\.atlassian\.com/disable-annotation-management}'

# Disable the feature for a NodeGroup (opt-out)
kubectl annotate nodegroup <name> cyclops.atlassian.com/disable-annotation-management="true"

# Re-enable by removing the opt-out annotation (trailing dash deletes it)
kubectl annotate nodegroup <name> cyclops.atlassian.com/disable-annotation-management-

mwhittington21 (Collaborator)

@dnorman3 I think you also need to add logic to remove this annotation from a node if we transition to Healing. The logic would go somewhere here. The reason is that if a CNR fails, we need to remove the annotation or the node will be ignored by Cluster Autoscaler forever.

mwhittington21 (Collaborator) previously approved these changes Jan 27, 2026:

lgtm

@dnorman3 dnorman3 marked this pull request as ready for review January 28, 2026 05:58
// during the cycling process from being removed by Cluster Autoscaler before
// the corresponding old nodes are fully terminated.
// See: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node
const clusterAutoscalerScaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"
Member

could we generalise this behaviour and give the annotation via the CNR instead?

dnorman3 (Contributor Author)
A few reasons why I didn't:

The annotation is tied to the node lifecycle, not the CNR lifecycle. If the annotations were in the CNR spec, deleting a CNR before cleanup could leave stale annotations.

Cluster Autoscaler reads annotations from the Node object, not the CNR. If we put them in the CNR spec, we would have to sync them to nodes.

CNRs can affect overlapping node groups, so keeping annotations in the CNR spec could create race conditions and ownership ambiguity.

Member
> If we put them in the CNR spec, we would have to sync them to nodes.

This is exactly what I'm suggesting. We are already doing this in the code; I am just proposing that the annotation(s) get defined in the CNR rather than hardcoded.

> CNRs can affect overlapping node groups

Each CNR is generated by the observer for a single node group. If a CNR affects nodes across node groups, that would be a misconfiguration somewhere.

Collaborator
I don't think we need to be generic here; the annotations are very specific to cluster-autoscaler. If we need another annotation, we probably want a code change, because it will need to be used under specific conditions.

Collaborator
What this comment has caused me to re-evaluate is that we should add a field on the CNR to enable/disable this behaviour and default it to true. That way if any users of the software have an issue with this particular annotation/workflow, they can disable the feature and keep their cluster working.

Member
If we have the annotation defined in the CNR then that achieves it. It wouldn't be enabled by default, because the annotations need to be added for the behaviour to take effect. I'd argue this feature is just about adding annotations to new instances and then removing them later. We are doing it for cluster-autoscaler, of course, but it can apply more widely.

dnorman3 (Contributor Author) commented Jan 29, 2026
I added the option to opt out of it in the NodeGroup (not the CNR); see 67d6d0f#commitcomment-175885640 for the reasoning.

Collaborator
I like putting it on the NodeGroup; it feels like something that should be associated with a base node group config.

@dnorman3 dnorman3 merged commit a2fcd45 into master Feb 5, 2026
6 checks passed
@dnorman3 dnorman3 deleted the dnorman3/scale-down-disabled-annotation branch February 5, 2026 04:36