@IsaacJames IsaacJames commented Jan 5, 2026

Rollback Controller Implementation

This PR implements the RollbackReconciler, which manages Rollback resources by triggering deployments through a configurable CICD backend and polling until they complete.

internal/controller/deploy/rollback_controller.go

  • Reconciliation loop:
    • Watches Rollback resources and triggers deployments via the injected Deployer
    • Polls for deployment status until completion (succeeded/failed)
    • Tracks attempt count with configurable retry limit
    • Updates rollback status conditions (InProgress, Succeeded)
    • Finds the currently active release to populate FromReleaseName

cmd/rollback-manager/main.go

  • Configurable CICD backend via --cicd-backend flag (noop, github)
  • GitHub token via --github-token flag / ROLLBACK_MANAGER_GITHUB_TOKEN env var
  • Rollback history limit via --rollback-history-limit flag

Reconcile Loop

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Rollback Reconcile Loop                           │
└─────────────────────────────────────────────────────────────────────────────┘

                              ┌──────────────┐
                              │   Rollback   │
                              │   Created    │
                              └──────┬───────┘
                                     │
                                     ▼
                        ┌────────────────────────┐
                        │  Succeeded condition   │───Yes───▶ Done (skip)
                        │       exists?          │
                        └────────────┬───────────┘
                                     │ No
                                     ▼
                        ┌────────────────────────┐
                        │  Fetch target Release  │───NotFound───▶ markRollbackFailed
                        └────────────┬───────────┘
                                     │ Found
                                     ▼
                        ┌────────────────────────┐
                        │  InProgress == True?   │
                        └────────────┬───────────┘
                                     │
                    ┌────────────────┴────────────────┐
                    │ No                              │ Yes
                    ▼                                 ▼
        ┌───────────────────────┐       ┌───────────────────────┐
        │  triggerDeployment()  │       │ pollDeploymentStatus()│
        └───────────┬───────────┘       └───────────┬───────────┘
                    │                               │
                    ▼                               ▼
        ┌───────────────────────┐       ┌───────────────────────┐
        │  AttemptCount >= 3?   │       │   Deployment Status   │
        └───────────┬───────────┘       └───────────┬───────────┘
                    │                               │
        ┌───────────┴───────────┐       ┌───────────┼───────────┬───────────┐
        │ Yes                   │ No    │ Succeeded │ Failed    │ Pending/  │
        ▼                       ▼       ▼           ▼           │ InProgress│
   markRollback           Call CICD  markRollback  Retry?       ▼           
   Failed()               Deployer   Succeeded()     │      Requeue         
                              │                      │      (15s)           
                              ▼               ┌──────┴──────┐               
                    Set InProgress=True       │ Yes         │ No            
                    Requeue (15s)             ▼             ▼               
                                        triggerDeployment  markRollback     
                                        (immediate retry)  Failed()         
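
As a rough illustration of the branching in the diagram above, the top-level decision can be sketched as a pure function. This is not the controller code from this PR: the State struct, the Action names, and nextAction are hypothetical, and the limit of 3 is hard-coded here only to mirror the diagram.

```go
package main

import "fmt"

// Action names the next step the reconciler would take.
// These identifiers are illustrative, not taken from the PR.
type Action string

const (
	ActionSkip       Action = "skip"
	ActionMarkFailed Action = "markRollbackFailed"
	ActionTrigger    Action = "triggerDeployment"
	ActionPoll       Action = "pollDeploymentStatus"
)

// State is a simplified, hypothetical view of a Rollback's status.
type State struct {
	HasSucceededCondition bool // a Succeeded condition exists (True or False)
	ReleaseFound          bool // the target Release could be fetched
	InProgress            bool // InProgress condition is True
	AttemptCount          int
}

// maxAttempts mirrors the "AttemptCount >= 3?" box in the diagram.
const maxAttempts = 3

// nextAction walks the diagram from top to bottom.
func nextAction(s State) Action {
	switch {
	case s.HasSucceededCondition:
		return ActionSkip // terminal state already recorded
	case !s.ReleaseFound:
		return ActionMarkFailed
	case s.InProgress:
		return ActionPoll
	case s.AttemptCount >= maxAttempts:
		return ActionMarkFailed
	default:
		return ActionTrigger
	}
}

func main() {
	fmt.Println(nextAction(State{ReleaseFound: true}))
	fmt.Println(nextAction(State{ReleaseFound: true, InProgress: true}))
	fmt.Println(nextAction(State{ReleaseFound: true, AttemptCount: 3}))
}
```

Keeping the decision separate from the side effects (calling the Deployer, updating status) is only one possible layout; it makes the branches easy to unit test.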

Error Handling

  • Retryable errors (5xx, rate limits): Requeue after 15s, increment attempt count
  • Non-retryable errors: Mark rollback as failed immediately
  • Max retries exceeded: Mark rollback as failed after 3 attempts
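
A minimal sketch of the first two rules, assuming a DeployerError type with a Retryable field similar to the one referenced in the review below. The Outcome struct and handleTriggerError are hypothetical names, and treating any error that is not a DeployerError as non-retryable is an assumption. The attempt limit is not checked here because, per the diagram, it is evaluated at the start of the next trigger.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// DeployerError is a stand-in for the error type returned by the CICD backend.
type DeployerError struct {
	Retryable bool
	Message   string
}

func (e *DeployerError) Error() string { return e.Message }

// Outcome is a hypothetical summary of what the reconciler should do next.
type Outcome struct {
	Failed       bool          // mark the rollback as failed
	RequeueAfter time.Duration // non-zero means requeue after this delay
	AttemptCount int
}

// requeueInterval matches the 15s interval described above.
const requeueInterval = 15 * time.Second

// handleTriggerError maps a trigger error to an outcome.
func handleTriggerError(err error, attemptCount int) Outcome {
	var de *DeployerError
	if errors.As(err, &de) && de.Retryable {
		// Retryable: count the attempt and try again later.
		return Outcome{RequeueAfter: requeueInterval, AttemptCount: attemptCount + 1}
	}
	// Non-retryable (or unknown) error: fail immediately.
	return Outcome{Failed: true, AttemptCount: attemptCount}
}

func main() {
	retryable := &DeployerError{Retryable: true, Message: "503 from backend"}
	fatal := &DeployerError{Retryable: false, Message: "workflow not found"}
	fmt.Printf("%+v\n", handleTriggerError(retryable, 1))
	fmt.Printf("%+v\n", handleTriggerError(fatal, 1))
}
```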

Rollback Status Conditions

Condition    Status              Meaning
InProgress   True                Deployment is running
InProgress   False + Succeeded   Terminal state reached
Succeeded    True                Rollback completed successfully
Succeeded    False               Rollback failed (non-retryable error, target Release not found, or retries exhausted)

Manager Configuration

--cicd-backend       CICD backend to use (noop, github) [default: noop]
--github-token       GitHub API token (required for github backend)
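
For illustration, the flags above could be wired up roughly as follows with the standard flag package. This is a sketch rather than the code in cmd/rollback-manager/main.go: parseConfig and Config are hypothetical names, and letting --github-token take precedence over ROLLBACK_MANAGER_GITHUB_TOKEN is an assumption. The --rollback-history-limit flag is left out because its default is not stated here.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// Config holds the parsed manager options (hypothetical struct).
type Config struct {
	CICDBackend string
	GitHubToken string
}

// parseConfig parses the flags, using the env var as the token's default.
// getenv is injected so the function can be tested without touching the real environment.
func parseConfig(args []string, getenv func(string) string) (Config, error) {
	fs := flag.NewFlagSet("rollback-manager", flag.ContinueOnError)
	var cfg Config
	fs.StringVar(&cfg.CICDBackend, "cicd-backend", "noop", "CICD backend to use (noop, github)")
	fs.StringVar(&cfg.GitHubToken, "github-token", getenv("ROLLBACK_MANAGER_GITHUB_TOKEN"),
		"GitHub API token (required for github backend)")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	switch cfg.CICDBackend {
	case "noop":
		// No credentials needed.
	case "github":
		if cfg.GitHubToken == "" {
			return Config{}, errors.New("--github-token is required for the github backend")
		}
	default:
		return Config{}, fmt.Errorf("unknown --cicd-backend %q", cfg.CICDBackend)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	// Never print the token itself.
	fmt.Printf("backend=%s token-set=%t\n", cfg.CICDBackend, cfg.GitHubToken != "")
}
```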

Contributor

@goelozev goelozev left a comment

Looks good!

return false
}

// TODO: move this into api/deploy/v1alpha1/release_helpers.go once release reconciler PR is merged
Contributor

👍

@IsaacJames IsaacJames force-pushed the CI-3637-add-rollback-controller branch 2 times, most recently from e41713a to 2adf735 Compare January 12, 2026 15:25
@IsaacJames IsaacJames changed the title WIP: add rollback controller reconcile loop Add rollback controller reconcile loop Jan 12, 2026
@IsaacJames IsaacJames marked this pull request as ready for review January 12, 2026 16:15
@IsaacJames IsaacJames force-pushed the CI-3637-add-rollback-controller branch 2 times, most recently from 1eb2a73 to 05eccce Compare January 12, 2026 17:35
@IsaacJames IsaacJames force-pushed the CI-3637-add-rollback-controller branch from 05eccce to c203ff0 Compare January 12, 2026 17:53
Contributor

@goelozev goelozev left a comment

Fabulous!

Contributor

@0x0013 0x0013 left a comment

Nicely done!

Added a couple of comments.

Comment on lines 203 to 209
meta.SetStatusCondition(&rollback.Status.Conditions, metav1.Condition{
    Type: deployv1alpha1.RollbackConditionInProgress,
    Status: metav1.ConditionTrue,
    Reason: "Retrying",
    Message: fmt.Sprintf("Deployment attempt %d failed: %s. Retrying...", rollback.Status.AttemptCount, statusResp.Message),
    LastTransitionTime: now,
})
Contributor

We're setting the condition here, but it looks like it will be overwritten almost immediately in triggerDeployment.

Author

@IsaacJames IsaacJames Jan 14, 2026

Not necessarily: if triggerDeployment fails to trigger the deployment with a retryable error, the condition won't be updated and the Rollback will be requeued for reconciliation. That said, I'm now wondering whether the rollback.Status.Message set on line 144 would be better placed on the InProgress condition.

Author

@0x0013 I've decided to simplify this a bit and agree it doesn't make sense to set the InProgress=True condition here in pollDeploymentStatus. Instead, we now set it once at the start of triggerDeployment with Reason=DeploymentTriggering and once at the end after a successful trigger with Reason=DeploymentTriggered.

This should make the behaviour clearer when a Rollback gets stuck trying and failing to trigger a deployment via the cicd backend, as the reason DeploymentTriggering will remain, whilst the failures themselves will be recorded in the rollback.Status.Message.
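
To make the intended behaviour concrete, here is a simplified sketch using local stand-in types rather than metav1.Condition. The Status and Condition structs, setCondition, and the errBackendUnavailable value are illustrative only; the reasons DeploymentTriggering and DeploymentTriggered are the ones described above.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition and Status are pared-down stand-ins for the real API types.
type Condition struct {
	Type   string
	Status string
	Reason string
}

type Status struct {
	Conditions []Condition
	Message    string
}

// errBackendUnavailable is an example trigger failure used in this sketch.
var errBackendUnavailable = errors.New("cicd backend unavailable (503)")

// setCondition replaces the condition with the same Type, or appends it.
func setCondition(s *Status, c Condition) {
	for i := range s.Conditions {
		if s.Conditions[i].Type == c.Type {
			s.Conditions[i] = c
			return
		}
	}
	s.Conditions = append(s.Conditions, c)
}

// reasonOf returns the Reason of the named condition, or "" if absent.
func reasonOf(s *Status, condType string) string {
	for _, c := range s.Conditions {
		if c.Type == condType {
			return c.Reason
		}
	}
	return ""
}

// triggerDeployment sets InProgress=True before and after calling the backend.
// On failure the reason stays DeploymentTriggering and the error goes to Message.
func triggerDeployment(s *Status, deploy func() error) error {
	setCondition(s, Condition{Type: "InProgress", Status: "True", Reason: "DeploymentTriggering"})
	if err := deploy(); err != nil {
		s.Message = err.Error()
		return err
	}
	setCondition(s, Condition{Type: "InProgress", Status: "True", Reason: "DeploymentTriggered"})
	return nil
}

func main() {
	var s Status
	_ = triggerDeployment(&s, func() error { return errBackendUnavailable })
	fmt.Println(reasonOf(&s, "InProgress"), "|", s.Message)

	_ = triggerDeployment(&s, func() error { return nil })
	fmt.Println(reasonOf(&s, "InProgress"))
}
```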

Status: metav1.ConditionFalse,
Reason: "Completed",
Message: "Rollback deployment completed",
LastTransitionTime: now,
Contributor

Same as noted previously, the transition time should get set by setStatusCondition.

Author

I'm leaning towards leaving this (and the LastTransitionTime below) as set explicitly since this is the only transition the InProgress condition will make to False (and likewise below with the Succeeded condition to True).

The value I see is that by setting rollback.Status.CompletionTime and the LastTransitionTime of both conditions in this function to the same now value, it'll be clearer that the conditions transitioned to the terminal states as part of the completion of the rollback (particularly programmatically, if we ever need to compare the timestamps).
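
A small sketch of what I mean, using stand-in types instead of the real API structs. markSucceeded, the exampleNow timestamp, and the Reason on the Succeeded condition are placeholders for illustration, and the conditions are overwritten wholesale here only to keep the example short.

```go
package main

import (
	"fmt"
	"time"
)

// Condition and RollbackStatus are simplified stand-ins for the API types.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime time.Time
}

type RollbackStatus struct {
	Conditions     []Condition
	CompletionTime *time.Time
}

// exampleNow is an arbitrary fixed timestamp used only by this sketch.
var exampleNow = time.Date(2026, 1, 14, 11, 0, 0, 0, time.UTC)

// markSucceeded stamps CompletionTime and both terminal conditions with the same value.
func markSucceeded(s *RollbackStatus, now time.Time) {
	s.CompletionTime = &now
	s.Conditions = []Condition{
		{Type: "InProgress", Status: "False", Reason: "Completed", LastTransitionTime: now},
		// The Reason below is a placeholder, not the one used in the PR.
		{Type: "Succeeded", Status: "True", Reason: "Completed", LastTransitionTime: now},
	}
}

func main() {
	var s RollbackStatus
	markSucceeded(&s, exampleNow)
	for _, c := range s.Conditions {
		// Equal timestamps make "transitioned as part of completion" checkable in code.
		fmt.Println(c.Type, c.LastTransitionTime.Equal(*s.CompletionTime))
	}
}
```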

@IsaacJames IsaacJames force-pushed the CI-3637-add-rollback-controller branch from aa85c7c to a7ac338 Compare January 14, 2026 11:17
@IsaacJames IsaacJames requested review from 0x0013 and goelozev January 14, 2026 11:18
Contributor

@0x0013 0x0013 left a comment

Added a few more comments.

One thing I'm not fully convinced about is setting RollbackConditionInProgress with the "DeploymentTriggering" reason. While I understand the idea behind the behavior, I think I would prefer RollbackConditionInProgress==True to mean that the deployment has already been triggered. That would make it clearer what the condition state implies.

However, I haven't made up my mind yet, and I will think a little more about whether the current approach could be problematic down the line.

case cicd.DeploymentStatusPending, cicd.DeploymentStatusInProgress:
    // Update message and continue polling
    rollback.Status.Message = statusResp.Message
    rollback.Status.LastHTTPCallTime = &now
Contributor

I'm not sure what the LastHTTPCallTime field is intended for, but we only update it here, when the status is pending or in progress. Shouldn't it also be updated when the status is failed or succeeded?

I'm aware that a newly triggered deployment would update the field as well, but I don't think the other cases would.

logger.Error(err, "failed to trigger deployment")

// Check if error is retryable
if deployerErr, ok := err.(*cicd.DeployerError); ok && deployerErr.Retryable {
Contributor

Suggested change
- if deployerErr, ok := err.(*cicd.DeployerError); ok && deployerErr.Retryable {
+ var deployerErr *cicd.DeployerError
+ if errors.As(err, &deployerErr) && deployerErr.Retryable {

This should have the same outcome, but it is better practice: errors.As() checks every error in the chain, so it still matches if the cicd.DeployerError was wrapped before being returned.
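
A self-contained illustration of the difference, using a local DeployerError stand-in rather than the real cicd package. The helper names are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// DeployerError is a local stand-in for cicd.DeployerError.
type DeployerError struct {
	Retryable bool
	Message   string
}

func (e *DeployerError) Error() string { return e.Message }

// retryableByAssertion only matches when err itself is a *DeployerError.
func retryableByAssertion(err error) bool {
	de, ok := err.(*DeployerError)
	return ok && de.Retryable
}

// retryableByAs also matches a *DeployerError anywhere in the wrap chain.
func retryableByAs(err error) bool {
	var de *DeployerError
	return errors.As(err, &de) && de.Retryable
}

// wrap adds context with %w, as a caller might do before returning the error.
func wrap(err error) error {
	return fmt.Errorf("trigger deployment: %w", err)
}

func main() {
	base := &DeployerError{Retryable: true, Message: "rate limited"}
	wrapped := wrap(base)

	fmt.Println(retryableByAssertion(base), retryableByAs(base))       // true true
	fmt.Println(retryableByAssertion(wrapped), retryableByAs(wrapped)) // false true
}
```

With the type assertion, a wrapped retryable error would be treated as non-retryable, which under the rules in the PR description would fail the rollback immediately instead of requeueing it.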
