
Enhancement: Additional mirror metrics and health (RFE-6452)#35

Open
joseorpa wants to merge 5 commits into quay:main from joseorpa:repository-mirror-metrics

Conversation

@joseorpa

This enhancement introduces comprehensive metrics and health endpoints for Quay's repository mirroring functionality. The goal is to provide improved visibility, monitoring, and alerting for mirror worker operations, with data available on a per-mirror basis.

This addresses the current lack of mirror metrics.
Relates to: RFE-6452

@HammerMeetNail HammerMeetNail changed the title Enhancement: Additional mirror metrics and health Enhancement: Additional mirror metrics and health (RFE-6452) Oct 22, 2025
@ibazulic
Member

Hey,

Thanks for the detailed document, very nice work. I do have some comments on it, hopefully it will help drive the discussion further!

Let me just preface everything with the statement that repository mirroring is not a crucial Quay component, much as builders are not a crucial Quay component. They are an addon to Quay that one can use (but is not necessarily forced to use). The crucial function of Quay is to provide registry functionality, meaning storing Docker and OCI images.

With that said, I'd like to comment on the health checks first. Repository mirror workers are small Python workers that basically start skopeo with a set of parameters defined in the repo mirroring configuration. The only functionality that repo mirror workers require is access to the database, much like any other worker that runs in the main Quay pod. The proposal to run health checks would be equivalent to asking for a health check on the GC processes or anything else that runs in the background. The decision to run repo mirroring workers outside of the main Quay pod was made when this feature was implemented, but this worker might as well have been run inside the main Quay pod, where it would behave exactly the same way as any other Quay backend worker.

The worker itself, the piece of code that runs skopeo based on conditions in the mirroring configuration, does not communicate directly with skopeo once the process has begun. It only knows when the process ends and what message it returns; skopeo runs independently of the worker and returns a success/fail message at the end. One mirroring worker can run exactly one skopeo mirror process at a time (similarly to builders: one builder builds one image, not five).

So the question is what exactly a health check would do here. Just reporting health: ok makes little sense to me; if the worker doesn't start, we know there's something wrong (with the db, because that's the only thing it communicates with). Even if it does start, a mirror worker without the main Quay app pod has no real use. Note that the state of each repo mirror is stored in the db, per mirror configuration. So if one wants to know what failed and what succeeded, per repo, a much better option is to set repo notifications for mirroring events. If we don't have enough events, we should expand their number and make them more configurable.
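To illustrate the shape of this, the worker/skopeo relationship described above looks roughly like the sketch below (illustrative only, not the actual worker code; the function name and skopeo arguments are assumptions):

```
import subprocess

def run_mirror(source_ref: str, dest_ref: str) -> bool:
    """Illustrative only: launch skopeo and wait for it to finish.
    The worker has no contact with skopeo while it runs; it only
    observes the exit status and output once the process ends."""
    result = subprocess.run(
        ["skopeo", "copy", "--all",
         f"docker://{source_ref}", f"docker://{dest_ref}"],
        capture_output=True,
        text=True,
    )
    # Success/failure is only known after skopeo exits.
    return result.returncode == 0
```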

Concerning metrics, when you say:

- Tags pending synchronization: Total number of tags not yet synchronized across all mirrored repositories

you don't really get anything useful here. First, you cannot know how many tags are not synchronized: you would need to check and count all tags in each upstream repository, compare that list of tags per repository with the list of tags that we already have stored (again per repository), and then calculate the difference. That doesn't tell you much; if you have 50 repos and 842 tags to sync, how do you know which repo contains these tags? You'd need a metric like this per repository, not across all mirrored repos. Either of these resolutions would require a lot of queries against the most expensive table we have, tag. And there would be a lot of them, because this metric would need to be refreshed very, very often. There's also the question of what happens if the upstream repo has fewer tags than what you have stored locally, but okay, this would be simple to resolve (we'd need to take the absolute value of the result to ensure the metric stays positive).
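Per repository, the computation being described would look something like this (a sketch; how the two tag lists are obtained is the expensive part left out here):

```
def pending_tags(upstream_tags: set[str], local_tags: set[str]) -> int:
    """Count tags that exist upstream but are not yet mirrored locally.
    Using the set difference also sidesteps the negative-count problem
    when the upstream repo has fewer tags than the local copy."""
    return len(upstream_tags - local_tags)
```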

With all that said, I completely agree that these three metrics would in fact be useful and would be good to have them implemented:

   - Status of the last synchronization: An indicator (success/fail/in-progress) of the latest synchronization attempt per repository
   - Complete synchronization per repository: A boolean metric (0/1) indicating if a specific mirrored repository has successfully synchronized all its tags since the last run
   - Synchronization failure counter: A cumulative counter of mirroring failures per repository for alerting purposes

We do report all these errors in the audit log so it would be easy to set up counters to get them.
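For reference, these three metrics could be declared with prometheus_client along these lines (a rough sketch; the names follow the convention used later in this thread, and the label sets are just a suggestion):

```
from prometheus_client import Counter, Gauge

# Assumed names and labels, mirroring the convention used later in this PR.
repo_mirror_last_sync_status = Gauge(
    "quay_repository_mirror_last_sync_status",
    "Status of the last synchronization attempt per repository",
    labelnames=["namespace", "repository", "status"],
)
repo_mirror_sync_complete = Gauge(
    "quay_repository_mirror_sync_complete",
    "Whether the repository has synchronized all tags since the last run (0/1)",
    labelnames=["namespace", "repository"],
)
repo_mirror_sync_failures = Counter(
    "quay_repository_mirror_sync_failures_total",
    "Cumulative mirroring failures per repository",
    labelnames=["namespace", "repository", "reason"],
)
```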

Cheers!

@joseorpa
Author

I agree with you on the metric "Tags pending synchronization": what I have here is a "lost in translation" issue. I meant: total number of tags not yet synchronized for each mirrored repository.

@joseorpa
Author

Hi, I have updated the enhancement accordingly, and added an alternatives section.

joseorpa pushed a commit to joseorpa/quay that referenced this pull request Oct 24, 2025
…o: RFE-6452. This commit introduces changes to workers/repomirrorworker/__init__.py and docs/mirroring-metrics.md to add the following metrics:

- Tags pending synchronization: Total number of tags not yet synchronized for each mirrored repository
- Status of the last synchronization: An indicator (success/fail/in-progress) of the latest synchronization attempt per repository
- Complete synchronization per repository: A boolean metric (0/1) indicating if a specific mirrored repository has successfully synchronized all its tags since the last run
- Synchronization failure counter: A cumulative counter of mirroring failures per repository for alerting purposes
Does not cover the health endpoint implementation, as it has not been discussed yet.
joseorpa pushed a commit to joseorpa/quay that referenced this pull request Oct 24, 2025
Add four new metrics to monitor mirroring operations:
- quay_repository_mirror_pending_tags: Track tags pending sync
- quay_repository_mirror_last_sync_status: Last sync status indicator
- quay_repository_mirror_sync_complete: Boolean for complete sync
- quay_repository_mirror_sync_failures_total: Failure counter for alerting

Includes comprehensive documentation with alerting examples and troubleshooting guides.
Relates to enhancement Additional mirror metrics (quay/enhancements#35) and to RFE-6452.
Does not cover the health endpoint implementation, as it has not been discussed yet.

**2. Last Synchronization Status**
```
# HELP quay_repository_mirror_last_sync_status Status of the last synchronization attempt (0=failed, 1=success, 2=in_progress)
```

When the gauge returns 0 (failed), I wonder if we'd want to add a last_error_reason label to this metric, with the label mirroring the content of the reason label used on quay_repository_mirror_sync_failures_total.

This way, users could instantly query the failure reason for a specific repository that is currently failing (e.g., quay_repository_mirror_last_sync_status{namespace="org1", status="failed"}) without needing to correlate it with the failure rate from the counter.
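For instance, a failed sample carrying the proposed label might look like this (hypothetical exposition output):

```
quay_repository_mirror_last_sync_status{namespace="org1",repository="repo1",last_error_reason="auth_failed"} 0
```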

- **Permissions**: Same as viewing repository information (user must have access to at least one mirrored repository)
- **Response Format**: JSON
- **Response Codes**:
  - 200: Success, returns health status

I wonder if we'd want this /api/v1/repository/mirror/health endpoint to return 503 Service Unavailable when the boolean healthy field is false.

Something like:

  • If the JSON response body shows "healthy": true, return 200 OK.
  • If the JSON response body shows "healthy": false due to any critical issue (e.g., zero workers, database error, etc), return 503 Service Unavailable.

This way, the overall mirror worker health status is immediately clear without parsing the JSON body.
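A rough sketch of that behavior (illustrative; assumes a Flask-style handler and a hypothetical get_mirror_health() helper, not actual Quay code):

```
from flask import Flask, jsonify

app = Flask(__name__)

def get_mirror_health() -> dict:
    """Hypothetical helper; a real implementation would check
    worker liveness and database connectivity."""
    return {"healthy": True, "workers": 1, "issues": []}

@app.route("/api/v1/repository/mirror/health")
def mirror_health():
    status = get_mirror_health()
    # 200 when healthy; 503 makes failure visible to load balancers
    # and monitors without parsing the JSON body.
    return jsonify(status), (200 if status["healthy"] else 503)
```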

@tlwu2013

tlwu2013 commented Dec 1, 2025

hey @joseorpa, thanks for sharing this PR with me.

Overall, this looks great to me, and I only have two nits as suggestions in comments.
Awesome work!

@joseorpa
Author

joseorpa commented Dec 9, 2025

Hey @tlwu2013 , thanks for the feedback, much appreciated!
I've updated the enhancement; I will be modifying the code as well this week.
Cheers!

@joseorpa
Author

joseorpa commented Dec 9, 2025

Hey @tlwu2013 , I have used

```
quay_repository_mirror_last_sync_status{namespace="org1",repository="repo1",last_error_reason=""} 1
quay_repository_mirror_last_sync_status{namespace="org2",repository="repo2",last_error_reason="auth_failed"} 0
```

Should I use 0 for no errors and 1 for failed somehow? I see in:

```
repo_mirror_last_sync_status = Gauge(
    "quay_repository_mirror_last_sync_status",
    "status of the last synchronization attempt (1=SUCCESS, 0=NEVER_RUN, -1=FAIL, -2=CANCEL, 2=SYNCING, 3=SYNC_NOW)",
    labelnames=["namespace", "repository", "status"],
)
```

that other status values exist; should I update that?

@tlwu2013

tlwu2013 commented Jan 6, 2026

hey @joseorpa,

Regarding the status values, in the Prometheus State Metric pattern, the value should be 1 whenever that specific status label is active.

In other words, rather than using integers to represent "Syncing" vs "Failed", simply use a status label instead.

If we use status="FAIL" 0, we can't use the sum() function to count failures because it would sum to zero. By using status="FAIL" 1, an admin can easily see a count of all failing mirrors across the entire registry with a simple query, as shown in the example below:

For example (a failure case):
quay_repository_mirror_last_sync_status{status="FAIL", last_error_reason="auth_failed"} 1

This makes the math “meaningful”, following the best practice in the Official Prometheus Naming Guide:

> As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful
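Concretely, counting all failing mirrors registry-wide then becomes a single query (assuming the labels above):

```
sum(quay_repository_mirror_last_sync_status{status="FAIL"})
```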

…rics

This update refines the `quay_repository_mirror_last_sync_status` metric by introducing a `status` label to indicate the current state (success, failed, in_progress) and modifying the `last_error_reason` label for clarity. Additionally, adjustments were made to the Prometheus queries in the metrics overview to improve aggregation and reporting capabilities. This change enhances the visibility of synchronization statuses across repositories.
@joseorpa joseorpa requested a review from tlwu2013 January 23, 2026 14:31
