94 changes: 94 additions & 0 deletions bert_e/docs/USER_DOC.md
@@ -112,6 +112,7 @@ __Bert-E__.
| options name | description | requires admin rights? | requires pull request author? |
|:------------------------- |:------------------------ |:----------------------:|:-----------------------------:|
| after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. May be used like this: @bert-e after_pull_request=< pr_id_1 > ... | no | no
| babysit | Automatically retry failed GitHub Actions builds (see [Babysit](#babysit) section for details) | no | no
| bypass_author_approval | Bypass the pull request author's approval | yes | no
| bypass_build_status | Bypass the build and test status| yes | no
| bypass_incompatible_branch | Bypass the check on the source branch prefix | yes | no
@@ -482,6 +483,9 @@ to progress to the next step. message code
| 122 | Unknown command | One of the participants asked __Bert-E__ to activate an option, or execute a command he doesn't know. Edit the corresponding message if it contains a typo. Delete it otherwise
| 123 | Not authorized | One of the participants asked __Bert-E__ to activate a privileged option, or execute a privileged command, but doesn't have enough credentials to do so. Delete the corresponding command and ask a __Bert-E__ administrator to run/set the desired command/option.
| 134 | Not author | One of the participants asked __Bert-E__ to activate an authored option, but the participant is not the author of the pull request.
| 140 | Babysit: Retrying build | __Bert-E__ is automatically retrying failed GitHub Actions jobs because the babysit option is enabled. No action required - wait for the new build to complete.
| 141 | Babysit: Maximum retries reached | __Bert-E__ has exhausted all automatic retry attempts. Investigate the build failure. To get more retries, comment `@bert-e babysit` again.
| 142 | Babysit: Cancelled | __Bert-E__ cancelled the babysit option because new commits were pushed. To re-enable automatic retries, comment `@bert-e babysit` again.

Queues
------
@@ -562,6 +566,96 @@ All those states can be found on Bert-E's UI.
> Note: Bert-E will not notify the user if a build
fails inside the queue.

Babysit
-------

__The babysit option enables automatic retry of failed GitHub Actions builds.__

When working with GitHub Actions, builds can sometimes fail due to flaky tests,
transient infrastructure issues, or other temporary problems. The `babysit`
option allows __Bert-E__ to automatically retry failed workflow runs, reducing
the need for manual intervention.

### Enabling Babysit

To enable babysit on a pull request, comment:

    @bert-e babysit

### How It Works

When babysit is enabled and a build fails:

1. __Bert-E__ detects the failed GitHub Actions workflow runs
2. For each failed workflow, __Bert-E__ triggers GitHub's "Re-run failed jobs"
3. __Bert-E__ posts a comment indicating the retry attempt
4. This process repeats until the build succeeds or the maximum retry limit
is reached
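
The steps above can be sketched as a small Python loop. This is an illustrative model only, not __Bert-E__'s actual implementation: `babysit` and `run_build` are hypothetical names, and the real bot reacts to build status events rather than running a blocking loop.

```python
from typing import Callable, List


def babysit(run_build: Callable[[], bool], max_retries: int = 5) -> List[str]:
    """Retry a failing build until it passes or the retry limit is hit.

    `run_build` is a hypothetical callable that returns True on success.
    The returned list models the comments that would be posted.
    """
    events: List[str] = []
    retries = 0
    while not run_build():
        if retries >= max_retries:
            # No retries left: stop and report exhaustion.
            events.append("exhausted")
            return events
        retries += 1
        events.append(f"retry {retries}/{max_retries}")
    events.append("success")
    return events


# A flaky build that fails twice, then passes.
outcomes = iter([False, False, True])
flaky_events = babysit(lambda: next(outcomes))
print(flaky_events)
```

A build that never passes ends with an `exhausted` event once `max_retries` retries have been used.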

### Scope of Babysit

The babysit behavior applies to:

* **Integration branches** (`w/x.y/...`): Failed builds on integration branches
  are automatically retried
* **Queue branches** (`q/...`): Failed builds in the merge queue are also
  retried if babysit was enabled on the corresponding pull request
* **All workflow runs individually**: Each GitHub Actions workflow is tracked
  and retried independently. If you have multiple workflows (e.g., CI, Tests,
  Lint), each one has its own retry counter. This means:
    - With the default limit of 5, if CI has already been retried 5 times but
      Tests only twice, CI is exhausted while Tests can still be retried 3
      more times
    - Only workflows that haven't reached their retry limit are retried
    - __Bert-E__ shows a table with each workflow's retry count in the comments
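
The per-workflow counters can be pictured with the following sketch. It assumes a hypothetical helper `plan_retries` and a plain dictionary of counters keyed by workflow ID; how __Bert-E__ actually stores the counters is not described here. The input dicts borrow the `workflow_id` and `name` keys used for failed runs.

```python
from typing import Dict, List, Tuple


def plan_retries(
    failed_runs: List[dict],
    retry_counts: Dict[int, int],
    max_retries: int = 5,
) -> Tuple[List[dict], List[str]]:
    """Split failed workflows into those to retry and those exhausted.

    `retry_counts` maps a workflow ID to the number of retries already
    used (a hypothetical storage format chosen for this example).
    """
    to_retry: List[dict] = []
    exhausted: List[str] = []
    for run in failed_runs:
        used = retry_counts.get(run["workflow_id"], 0)
        if used < max_retries:
            # This retry will be attempt number `used + 1`.
            to_retry.append({"name": run["name"], "retry_count": used + 1})
        else:
            exhausted.append(run["name"])
    return to_retry, exhausted


failed = [
    {"workflow_id": 1, "name": "CI"},
    {"workflow_id": 2, "name": "Tests"},
]
counts = {1: 5, 2: 2}  # CI already retried 5 times, Tests twice
to_retry, exhausted = plan_retries(failed, counts)
print(to_retry, exhausted)
```

With these inputs only `Tests` is retried (as retry 3 of 5), while `CI` is reported as exhausted.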

### Maximum Retries

By default, __Bert-E__ will retry failed builds up to **5 times**. After the
maximum number of retries is reached, __Bert-E__ posts a `BabysitExhausted`
message indicating that automatic retries have been exhausted.

This limit can be configured per repository by setting the
`max_babysit_retries` parameter in the repository's __Bert-E__ configuration:

```yaml
max_babysit_retries: 10 # Allow up to 10 retries instead of the default 5
```

### Re-enabling Babysit After Exhaustion

If the maximum retries have been exhausted but you want to continue with
automatic retries, simply comment `@bert-e babysit` again. This resets the
retry counter and allows for another round of automatic retries.

### Babysit Cancellation on New Commits

**Important:** If you push new commits to your branch after enabling babysit,
the babysit option is automatically cancelled. This prevents stale retry
attempts from continuing on outdated code.

When this happens, __Bert-E__ will post a `BabysitCancelled` message explaining
that new commits were detected. To re-enable automatic retries for the new
commits, you must comment `@bert-e babysit` again.

> **Example workflow:**
>
> 1. You comment `@bert-e babysit`
> 2. Build fails, __Bert-E__ retries (attempt 1/5)
> 3. Build fails again, __Bert-E__ retries (attempt 2/5)
> 4. You push a fix to address the build failure
> 5. __Bert-E__ detects the new commit and cancels babysit
> 6. Build fails on the new commit
> 7. You comment `@bert-e babysit` again to enable retries for the new code
> 8. __Bert-E__ retries (attempt 1/5 - counter is reset)
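
The example workflow above can be modelled with a small state object. This is a sketch under assumptions: `BabysitState` and its methods are hypothetical names, and the cancellation is modelled at the moment the next build failure is evaluated, which may differ from when __Bert-E__ actually detects the new commit.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BabysitState:
    """Hypothetical per-pull-request babysit state."""
    commit_sha: Optional[str] = None  # commit babysit was enabled on
    retry_counts: Dict[str, int] = field(default_factory=dict)

    def enable(self, sha: str) -> None:
        """Model a `@bert-e babysit` comment: start with fresh counters."""
        self.commit_sha = sha
        self.retry_counts = {}

    def on_build_failure(self, current_sha: str) -> str:
        """Decide what to do when a build fails on `current_sha`."""
        if self.commit_sha is None:
            return "ignored"  # babysit is not enabled
        if current_sha != self.commit_sha:
            # New commits were pushed: cancel babysit and drop the counters.
            self.commit_sha = None
            self.retry_counts = {}
            return "cancelled"
        return "retry"


state = BabysitState()
state.enable("1111111")
history = [
    state.on_build_failure("1111111"),  # same commit
    state.on_build_failure("2222222"),  # new commit pushed
    state.on_build_failure("2222222"),  # babysit no longer enabled
]
state.enable("2222222")  # user comments `@bert-e babysit` again
history.append(state.on_build_failure("2222222"))
print(history)
```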

### Limitations

* Babysit only works with **GitHub Actions** (`build_key: github_actions`)
* Babysit is not available for other CI systems (Bitbucket Pipelines, Jenkins,
etc.)
* Babysit does not bypass build failures - if the issue is not transient, the
build will continue to fail after all retries are exhausted

Going further with __Bert-E__
-----------------------------
Do you like __Bert-E__? Would you like to use it on your own projects?
22 changes: 22 additions & 0 deletions bert_e/exceptions.py
@@ -585,3 +585,25 @@ class JobFailure(SilentException):

class QueueBuildFailed(SilentException):
    code = 309


class BabysitRetry(TemplateException):
    """Raised when babysit mode triggers a retry of failed GitHub Actions."""
    code = 140
    template = 'babysit_retry.md'
    dont_repeat_if_in_history = 0  # allow repeating for each retry
    status = "in_progress"


class BabysitExhausted(TemplateException):
    """Raised when babysit mode has exhausted all retry attempts."""
    code = 141
    template = 'babysit_exhausted.md'
    status = "failure"


class BabysitCancelled(TemplateException):
    """Raised when babysit mode is cancelled due to new commits."""
    code = 142
    template = 'babysit_cancelled.md'
    status = "in_progress"
39 changes: 39 additions & 0 deletions bert_e/git_host/github/__init__.py
@@ -518,6 +518,17 @@ def create_pull_request(self, title, src_branch, dst_branch, description,
        return PullRequest.create(self.client, data=kwargs, owner=self.owner,
                                  repo=self.slug)

    def rerun_failed_workflow_jobs(self, run_id: int) -> None:
        """Re-run only the failed jobs of a workflow run.

        Args:
            run_id: The ID of the workflow run to re-run failed jobs for.

        """
        url = (f'/repos/{self.owner}/{self.slug}/actions/runs/'
               f'{run_id}/rerun-failed-jobs')
        self.client.post(url, data='{}')


class AggregatedStatus(base.AbstractGitHostObject):
    GET_URL = '/repos/{owner}/{repo}/commits/{ref}/status'
@@ -640,6 +651,34 @@ def branch(self) -> str | None:
            return self._workflow_runs[0]['head_branch']
        return None

    def get_failed_runs(self):
        """Get workflow runs that have failed.

        This method filters workflow runs to keep only the most relevant run
        per workflow (same logic as remove_unwanted_workflows), then returns
        those that have failed.

        Returns:
            List of dicts with 'id', 'run_attempt', 'workflow_id', 'name'
            and 'html_url' for each failed run.
        """
        # First, filter to get the best run per workflow (same as state check)
        self.remove_unwanted_workflows()

        failed_runs = []
        for run in self._workflow_runs:
            if run.get('conclusion') == 'failure':
                failed_runs.append({
                    'id': run['id'],
                    'run_attempt': run.get('run_attempt', 1),
                    'workflow_id': run.get('workflow_id'),
                    'name': run.get('name', 'unknown'),
                    'html_url': run.get('html_url', ''),
                })
                LOG.debug(
                    "Babysit: found failed run id=%d, run_attempt=%d, name=%s",
                    run['id'], run.get('run_attempt', 1), run.get('name', ''))
        return failed_runs

    def remove_unwanted_workflows(self):
        """
        Remove two things:
3 changes: 3 additions & 0 deletions bert_e/git_host/github/schema.py
@@ -143,6 +143,9 @@ class WorkflowRun(GitHubSchema):
    event = fields.Str()
    repository = fields.Nested(Repo)
    workflow_id = fields.Integer()
    # run_attempt indicates the number of times this workflow has been run
    # Defaults to 1 for first run, increments with each rerun
    run_attempt = fields.Integer(load_default=1)


class AggregateWorkflowRuns(GitHubSchema):
3 changes: 3 additions & 0 deletions bert_e/settings.py
@@ -195,6 +195,9 @@ class Meta:

    send_bot_status = fields.Bool(required=False, load_default=False)

    # Babysit feature: automatic retry of failed GitHub Actions
    max_babysit_retries = fields.Int(required=False, load_default=5)

    @pre_load(pass_many=True)
    def load_env(self, data, **kwargs):
        """Load environment variables"""
14 changes: 14 additions & 0 deletions bert_e/templates/babysit_cancelled.md
@@ -0,0 +1,14 @@
{% extends "message.md" %}

{% block title -%}
Babysit: Cancelled
{% endblock %}

{% block message %}
**Babysit mode has been cancelled** because new commits were pushed to the branch.

Previous retries were for commit `{{ previous_commit[:7] }}`, but the current commit is `{{ current_commit[:7] }}`.

If you want to enable automatic retries for the new commits, please comment `@{{ robot }} babysit` again.
{% endblock %}

23 changes: 23 additions & 0 deletions bert_e/templates/babysit_exhausted.md
@@ -0,0 +1,23 @@
{% extends "message.md" %}

{% block title -%}
Babysit: Maximum retries reached
{% endblock %}

{% block message %}
The {% if build_url -%}[build]({{ build_url }}) {% else -%}build {% endif -%}
has exhausted all automatic retry attempts on branch `{{ branch.name }}`.

**Exhausted workflows** ({{ max_retries }} retries each):
{% for wf in exhausted_workflows -%}
- `{{ wf }}`
{% endfor %}
To investigate:
- Review the {% if build_url %}[build logs]({{ build_url }}){% else %}build logs{% endif %} for the failure cause
- Check if this is a flaky test or a genuine issue

To get more retries:
- Fix the issue, push new commits, then comment `@{{ robot }} babysit` again (pushing new commits cancels babysit), or
- Comment `@{{ robot }} babysit` again to reset the retry counter
{% endblock %}

20 changes: 20 additions & 0 deletions bert_e/templates/babysit_retry.md
@@ -0,0 +1,20 @@
{% extends "message.md" %}

{% block title -%}
Babysit: Retrying build
{% endblock %}

{% block message %}
The {% if build_url -%}[build]({{ build_url }}) {% else -%}build {% endif -%}
failed on branch `{{ branch.name }}` (commit `{{ commit_sha[:7] }}`).

**Babysit mode is active** - automatically retrying failed workflows:

| Workflow | Retry |
|:---------|:-----:|
{% for wf in workflows -%}
| `{{ wf.name }}` | {{ wf.retry_count }}/{{ max_retries }} |
{% endfor %}
Please wait for the new build to complete.
{% endblock %}
