scality · JBWatenbergScality · Jan 12, 2026 · Jan 12, 2026
@@ -112,6 +112,7 @@ __Bert-E__.
 | options name              | description              | requires admin rights? | requires pull request author? |
 |:------------------------- |:------------------------ |:----------------------:|:-----------------------------:|
 | after_pull_request        | Wait for the given pull request id to be merged before continuing with the current one. May be used like this: @bert-e after_pull_request=< pr_id_1 > ... | no | no
+| babysit                   | Automatically retry failed GitHub Actions builds (see [Babysit](#babysit) section for details) | no | no
 | bypass_author_approval    | Bypass the pull request author's approval   | yes | no
 | bypass_build_status       | Bypass the build and test status| yes | no
 | bypass_incompatible_branch | Bypass the check on the source branch prefix | yes | no
@@ -482,6 +483,9 @@ to progress to the next step.  message code
 | 122   | Unknown command | One of the participants asked __Bert-E__ to activate an option, or execute a command he doesn't know. Edit the corresponding message if it contains a typo. Delete it otherwise
 | 123   | Not authorized | One of the participants asked __Bert-E__ to activate a privileged option, or execute a privileged command, but doesn't have enough credentials to do so. Delete the corresponding command ask a __Bert-E__ administrator to run/set the desired command/option.
 | 134   | Not author | One of the participants asked __Bert-E__ to activate an authored option, but the participant is not the author of the pull request.
+| 140   | Babysit: Retrying build | __Bert-E__ is automatically retrying failed GitHub Actions jobs because the babysit option is enabled. No action required - wait for the new build to complete.
+| 141   | Babysit: Maximum retries reached | __Bert-E__ has exhausted all automatic retry attempts. Investigate the build failure. To get more retries, comment `@bert-e babysit` again.
+| 142   | Babysit: Cancelled | __Bert-E__ cancelled the babysit option because new commits were pushed. To re-enable automatic retries, comment `@bert-e babysit` again.
 
 Queues
 ------
@@ -562,6 +566,96 @@ All those states can be found on Bert-E's UI.
 > Note: Bert-E will not notify the user if a build
 fails inside the queue.
 
+Babysit
+-------
+
+__The babysit option enables automatic retry of failed GitHub Actions builds.__
+
+When working with GitHub Actions, builds can sometimes fail due to flaky tests,
+transient infrastructure issues, or other temporary problems. The `babysit`
+option allows __Bert-E__ to automatically retry failed workflow runs, reducing
+the need for manual intervention.
+
+### Enabling Babysit
+
+To enable babysit on a pull request, comment:
+
+    @bert-e babysit
+
+### How It Works
+
+When babysit is enabled and a build fails:
+
+1. __Bert-E__ detects the failed GitHub Actions workflow runs
+2. For each failed workflow, __Bert-E__ triggers GitHub's "Re-run failed jobs"
+3. __Bert-E__ posts a comment indicating the retry attempt
+4. This process repeats until the build succeeds or the maximum retry limit
+   is reached
+
+### Scope of Babysit
+
+The babysit behavior applies to:
+
+* **Integration branches** (`w/x.y/...`): Failed builds on integration branches
+  are automatically retried
+* **Queue branches** (`q/...`): Failed builds in the merge queue are also
+  retried if babysit was enabled on the corresponding pull request
+* **All workflow runs individually**: Each GitHub Actions workflow is tracked
+  and retried independently. If you have multiple workflows (e.g., CI, Tests,
+  Lint), each one has its own retry counter. This means:
+  - If CI fails 5 times but Tests only fails twice, CI is exhausted while
+    Tests can still be retried 3 more times
+  - Only workflows that haven't reached their retry limit are retried
+  - __Bert-E__ shows a table with each workflow's retry count in the comments
+
+### Maximum Retries
+
+By default, __Bert-E__ will retry failed builds up to **5 times**. After the
+maximum number of retries is reached, __Bert-E__ posts a `BabysitExhausted`
+message indicating that automatic retries have been exhausted.
+
+This limit can be configured per repository by setting the
+`max_babysit_retries` parameter in the repository's __Bert-E__ configuration:
+
+```yaml
+max_babysit_retries: 10  # Allow up to 10 retries instead of the default 5
+```
+
+### Re-enabling Babysit After Exhaustion
+
+If the maximum retries have been exhausted but you want to continue with
+automatic retries, simply comment `@bert-e babysit` again. This resets the
+retry counter and allows for another round of automatic retries.
+
+### Babysit Cancellation on New Commits
+
+**Important:** If you push new commits to your branch after enabling babysit,
+the babysit option is automatically cancelled. This prevents stale retry
+attempts from continuing on outdated code.
+
+When this happens, __Bert-E__ will post a `BabysitCancelled` message explaining
+that new commits were detected. To re-enable automatic retries for the new
+commits, you must comment `@bert-e babysit` again.
+
+> **Example workflow:**
+>
+> 1. You comment `@bert-e babysit`
+> 2. Build fails, __Bert-E__ retries (attempt 1/5)
+> 3. Build fails again, __Bert-E__ retries (attempt 2/5)
+> 4. You push a fix to address the build failure
+> 5. __Bert-E__ detects the new commit and cancels babysit
+> 6. Build fails on the new commit
+> 7. You comment `@bert-e babysit` again to enable retries for the new code
+> 8. __Bert-E__ retries (attempt 1/5 - counter is reset)
+
+### Limitations
+
+* Babysit only works with **GitHub Actions** (`build_key: github_actions`)
+* Babysit is not available for other CI systems (Bitbucket Pipelines, Jenkins,
+  etc.)
+* Babysit does not bypass build failures - if the issue is not transient, the
+  build will continue to fail after all retries are exhausted
+
 Going further with __Bert-E__
 -----------------------------
 Do you like __Bert-E__? Would like to use it on your own projects?

@@ -585,3 +585,25 @@ class JobFailure(SilentException):
 
 class QueueBuildFailed(SilentException):
     code = 309
+
+
+class BabysitRetry(TemplateException):
+    """Raised when babysit mode triggers a retry of failed GitHub Actions."""
+    code = 140
+    template = 'babysit_retry.md'
+    dont_repeat_if_in_history = 0  # allow repeating for each retry
+    status = "in_progress"
+
+
+class BabysitExhausted(TemplateException):
+    """Raised when babysit mode has exhausted all retry attempts."""
+    code = 141
+    template = 'babysit_exhausted.md'
+    status = "failure"
+
+
+class BabysitCancelled(TemplateException):
+    """Raised when babysit mode is cancelled due to new commits."""
+    code = 142
+    template = 'babysit_cancelled.md'
+    status = "in_progress"
@@ -518,6 +518,17 @@ def create_pull_request(self, title, src_branch, dst_branch, description,
         return PullRequest.create(self.client, data=kwargs, owner=self.owner,
                                   repo=self.slug)
 
+    def rerun_failed_workflow_jobs(self, run_id: int) -> None:
+        """Re-run only the failed jobs of a workflow run.
+
+        Args:
+            run_id: The ID of the workflow run to re-run failed jobs for.
+
+        """
+        url = (f'/repos/{self.owner}/{self.slug}/actions/runs/'
+               f'{run_id}/rerun-failed-jobs')
+        self.client.post(url, data='{}')
+
 
 class AggregatedStatus(base.AbstractGitHostObject):
     GET_URL = '/repos/{owner}/{repo}/commits/{ref}/status'
@@ -640,6 +651,34 @@ def branch(self) -> str | None:
             return self._workflow_runs[0]['head_branch']
         return None
 
+    def get_failed_runs(self):
+        """Get workflow runs that have failed.
+
+        This method filters workflow runs to keep only the most relevant run
+        per workflow (same logic as remove_unwanted_workflows), then returns
+        those that have failed.
+
+        Returns:
+            List of dicts with 'id' and 'run_attempt' for each failed run.
+        """
+        # First, filter to get the best run per workflow (same as state check)
+        self.remove_unwanted_workflows()
+
+        failed_runs = []
+        for run in self._workflow_runs:
+            if run.get('conclusion') == 'failure':
+                failed_runs.append({
+                    'id': run['id'],
+                    'run_attempt': run.get('run_attempt', 1),
+                    'workflow_id': run.get('workflow_id'),
+                    'name': run.get('name', 'unknown'),
+                    'html_url': run.get('html_url', ''),
+                })
+                LOG.debug(
+                    "Babysit: found failed run id=%d, run_attempt=%d, name=%s",
+                    run['id'], run.get('run_attempt', 1), run.get('name', ''))
+        return failed_runs
+
     def remove_unwanted_workflows(self):
         """
         Remove two things:

@@ -143,6 +143,9 @@ class WorkflowRun(GitHubSchema):
     event = fields.Str()
     repository = fields.Nested(Repo)
     workflow_id = fields.Integer()
+    # run_attempt indicates the number of times this workflow has been run
+    # Defaults to 1 for first run, increments with each rerun
+    run_attempt = fields.Integer(load_default=1)
 
 
 class AggregateWorkflowRuns(GitHubSchema):

@@ -195,6 +195,9 @@ class Meta:
 
     send_bot_status = fields.Bool(required=False, load_default=False)
 
+    # Babysit feature: automatic retry of failed GitHub Actions
+    max_babysit_retries = fields.Int(required=False, load_default=5)
+
     @pre_load(pass_many=True)
     def load_env(self, data, **kwargs):
         """Load environment variables"""

@@ -0,0 +1,14 @@
+{% extends "message.md" %}
+
+{% block title -%}
+Babysit: Cancelled
+{% endblock %}
+
+{% block message %}
+**Babysit mode has been cancelled** because new commits were pushed to the branch.
+
+Previous retries were for commit `{{ previous_commit[:7] }}`, but the current commit is `{{ current_commit[:7] }}`.
+
+If you want to enable automatic retries for the new commits, please comment `@{{ robot }} babysit` again.
+{% endblock %}
+
@@ -0,0 +1,23 @@
+{% extends "message.md" %}
+
+{% block title -%}
+Babysit: Maximum retries reached
+{% endblock %}
+
+{% block message %}
+The {% if build_url -%}[build]({{ build_url }}) {% else -%}build {% endif -%}
+has exhausted all automatic retry attempts on branch `{{ branch.name }}`.
+
+**Exhausted workflows** ({{ max_retries }} retries each):
+{% for wf in exhausted_workflows -%}
+- `{{ wf }}`
+{% endfor %}
+To investigate:
+- Review the [build logs]({{ build_url }}) for the failure cause
+- Check if this is a flaky test or a genuine issue
+
+To get more retries:
+- Fix the issue and push new commits (babysit will continue with fresh retries), or
+- Comment `@{{ robot }} babysit` again to reset the retry counter
+{% endblock %}
+
@@ -0,0 +1,20 @@
+{% extends "message.md" %}
+
+{% block title -%}
+Babysit: Retrying build
+{% endblock %}
+
+{% block message %}
+The {% if build_url -%}[build]({{ build_url }}) {% else -%}build {% endif -%}
+failed on branch `{{ branch.name }}` (commit `{{ commit_sha[:7] }}`).
+
+**Babysit mode is active** - automatically retrying failed workflows:
+
+| Workflow | Retry |
+|:---------|:-----:|
+{% for wf in workflows -%}
+| `{{ wf.name }}` | {{ wf.retry_count }}/{{ max_retries }} |
+{% endfor %}
+Please wait for the new build to complete.
+{% endblock %}
+