Enhancement: Additional mirror metrics and health (RFE-6452)#35
This enhancement introduces comprehensive metrics and health endpoints for Quay's repository mirroring functionality. The goal is to provide improved visibility, monitoring, and alerting for mirror worker operations, with data available on a per-mirror basis. This addresses the current lack of mirror metrics. Relates to: RFE-6452
|
Hey, thanks for the detailed document, very nice work. I do have some comments on it; hopefully they will help drive the discussion further! Let me just preface everything with the statement that repository mirroring is not a crucial Quay component, much as builders are not a crucial Quay component. They are add-ons to Quay that one can use (but is not necessarily forced to use). The crucial function of Quay is to provide registry functionality, meaning storing Docker and OCI images. With that said, I'd like to comment on the health checks first. Repository mirror workers are small Python workers that basically start Concerning metrics, when you say: you don't really get anything useful here. First, you cannot know how many tags are not synchronized: you would need to check and count all tags in each upstream repository, compare that list of tags per repository with the list of tags that we already have stored (again per repository), and then calculate the difference. That doesn't tell you much: if you have 50 repos and 842 tags to sync, how do you know which repo contains these tags? You'd need a metric like this per repository, not across all mirrored repos. Either of these resolutions would require a lot of queries to be done on the most expensive table we have, With all that said, I completely agree that these three metrics would in fact be useful and that it would be good to have them implemented: We do report all these errors in the audit log, so it would be easy to set up counters to get them. Cheers! |
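The per-repository comparison described in the comment above can be sketched roughly as follows. This is a minimal illustration, not Quay's implementation: the function name `pending_tags_per_repo` and the in-memory dictionaries are hypothetical stand-ins for the upstream registry listing and the locally stored tags, which in practice would require the expensive queries mentioned above.

```python
from typing import Dict, Set


def pending_tags_per_repo(
    upstream: Dict[str, Set[str]],
    mirrored: Dict[str, Set[str]],
) -> Dict[str, int]:
    """Count, per repository, the upstream tags not yet stored locally.

    Both arguments map a repository name to its set of tag names.
    (Hypothetical inputs; real data would come from the upstream
    registry and from Quay's database.)
    """
    return {
        repo: len(tags - mirrored.get(repo, set()))
        for repo, tags in upstream.items()
    }


# Illustrative data only.
upstream = {"org1/app": {"v1", "v2", "latest"}, "org1/db": {"14", "15"}}
mirrored = {"org1/app": {"v1"}, "org1/db": {"14", "15"}}

pending = pending_tags_per_repo(upstream, mirrored)
total_pending = sum(pending.values())
```

The single aggregate `total_pending` does not say which repository is behind, whereas the per-repository dictionary does, which is the distinction the comment draws.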
|
I agree with you on the metric "Tags pending synchronization". What I have here is a "lost in translation" issue. I meant: total number of tags not yet synchronized for each mirrored repository. |
|
Hi, I have updated the enhancement accordingly, and added an alternatives section. |
…o: RFE-6452 . This commit introduces changes to workers/repomirrorworker/__init__.py and docs/mirroring-metrics.md to add the following metrics:
- Tags pending synchronization: Total number of tags not yet synchronized for each mirrored repository
- Status of the last synchronization: An indicator (success/fail/in-progress) of the latest synchronization attempt per repository
- Complete synchronization per repository: A boolean metric (0/1) indicating whether a specific mirrored repository has successfully synchronized all its tags since the last run
- Synchronization failure counter: A cumulative counter of mirroring failures per repository for alerting purposes

Does not cover implementing a health endpoint, as it has not been discussed yet.
Add four new metrics to monitor mirroring operations:
- quay_repository_mirror_pending_tags: Track tags pending sync
- quay_repository_mirror_last_sync_status: Last sync status indicator
- quay_repository_mirror_sync_complete: Boolean for complete sync
- quay_repository_mirror_sync_failures_total: Failure counter for alerting

Includes comprehensive documentation with alerting examples and troubleshooting guides. Relates to enhancement Additional mirror metrics quay/enhancements#35 and to: RFE-6452

Does not cover implementing a health endpoint, as it has not been discussed yet.
|
|
||
| **2. Last Synchronization Status** | ||
| ``` | ||
| # HELP quay_repository_mirror_last_sync_status Status of the last synchronization attempt (0=failed, 1=success, 2=in_progress) |
When the gauge returns 0 (failed), I wonder if we'd want to add a last_error_reason label to this metric, with the label mirroring the content of the reason label used on quay_repository_mirror_sync_failures_total.
This way, it'd allow users to instantly query the failed reason for a specific repository that is currently failing (e.g., quay_repository_mirror_last_sync_status{namespace="org1", status="failed"}) without the need to associate the failure rate from the counter.
| - **Permissions**: Same as viewing repository information (user must have access to at least one mirrored repository) | ||
| - **Response Format**: JSON | ||
| - **Response Codes**: | ||
| - 200: Success, returns health status |
I wonder if we'd want this /api/v1/repository/mirror/health endpoint to return 503 Service Unavailable when the healthy: boolean field is false.
Something like:
- If the JSON response body shows `"healthy": true`, return `200 OK`.
- If the JSON response body shows `"healthy": false` due to any critical issue (e.g., zero workers, database error, etc.), return `503 Service Unavailable`.
This way, the overall mirror worker health status is immediately clear without parsing the JSON body.
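To make the suggestion concrete, here is a minimal sketch of how the response body and HTTP status code could be derived together. It is an assumption-laden illustration rather than the endpoint's actual code: the function name `health_response` and the `workers_active` and `database_ok` fields are hypothetical; only the `healthy` field and the 200/503 mapping come from the comment above.

```python
from http import HTTPStatus
from typing import Any, Dict, Tuple


def health_response(workers_active: int, database_ok: bool) -> Tuple[Dict[str, Any], int]:
    """Build a (JSON body, HTTP status code) pair for a mirror health check.

    Treats zero workers or a database error as critical, matching the
    examples given in the review comment. Field names other than
    "healthy" are hypothetical.
    """
    healthy = workers_active > 0 and database_ok
    body = {
        "healthy": healthy,
        "workers_active": workers_active,
        "database_ok": database_ok,
    }
    status = HTTPStatus.OK if healthy else HTTPStatus.SERVICE_UNAVAILABLE
    return body, int(status)


ok_body, ok_code = health_response(workers_active=2, database_ok=True)
bad_body, bad_code = health_response(workers_active=0, database_ok=True)
```

With this shape, a load balancer or probe can act on the status code alone, while the JSON body remains available for diagnosis.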
|
hey @joseorpa, thanks for sharing this PR with me. Overall, this looks great to me, and I only have two nits as suggestions in comments. |
|
Hey @tlwu2013 , thanks for the feedback, much appreciated! |
|
Hey @tlwu2013 , I have used |
|
hey @joseorpa, Regarding the status values, in the Prometheus State Metric pattern, the value should be 1 whenever that specific status label is active. In other words, rather than using integers to represent "Syncing" vs "Failed", simply use a If we use For example (a failure case): This makes the math “meaningful”, following the best practice in the Official Prometheus Naming Guide:
|
…rics This update refines the `quay_repository_mirror_last_sync_status` metric by introducing a `status` label to indicate the current state (success, failed, in_progress) and modifying the `last_error_reason` label for clarity. Additionally, adjustments were made to the Prometheus queries in the metrics overview to improve aggregation and reporting capabilities. This change enhances the visibility of synchronization statuses across repositories.
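As a rough illustration of the label-based status described above, the sketch below renders one sample per status value, with 1 on the active status and 0 on the others. It is not the code from this change: the function name, the `repository` label, emitting explicit 0 samples for inactive statuses, and attaching `last_error_reason` only to the active `failed` series are all assumptions made for the example, and label values are assumed not to need escaping.

```python
from typing import List

METRIC = "quay_repository_mirror_last_sync_status"
STATES = ("success", "failed", "in_progress")


def render_last_sync_status(
    namespace: str,
    repository: str,
    active: str,
    last_error_reason: str = "",
) -> List[str]:
    """Render one text-format sample line per status for a repository."""
    if active not in STATES:
        raise ValueError(f"unknown status: {active!r}")
    lines = []
    for state in STATES:
        labels = {"namespace": namespace, "repository": repository, "status": state}
        # Assumption: the error reason is only attached to an active failure.
        if state == "failed" and active == "failed" and last_error_reason:
            labels["last_error_reason"] = last_error_reason
        label_text = ",".join(f'{key}="{value}"' for key, value in labels.items())
        value = 1 if state == active else 0
        lines.append(f"{METRIC}{{{label_text}}} {value}")
    return lines


sample = render_last_sync_status("org1", "repo1", "failed", "auth_failed")
```

Because exactly one series per repository carries the value 1, summing the metric over a given `status` label yields the number of repositories currently in that state.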