Fix/337 move metrics to cache refresh #338
Open
gimballock wants to merge 1 commit into stratum-mining:main from
Conversation
gimballock force-pushed from 56a1252 to a0e972e
gimballock (Author) commented:
Addresses #337
gimballock force-pushed from 0a55dce to 13457ee
gimballock force-pushed from 04dfe60 to 1b016e3
gimballock force-pushed from 150b77a to 985afe4
gimballock force-pushed from 28e6fe2 to 224e8fc
gimballock force-pushed from 9443c5d to 7d09803
Shourya742 (Collaborator) reviewed on Apr 5, 2026:
Can you explain what this PR does?
gimballock force-pushed from de781ad to 01a7d9e
gimballock (Author) commented:
Sorry for forgetting the description; it might have gotten attention sooner if I had made it more recognizable, lol.
gimballock force-pushed from c6f46b5 to a98c0d2
gimballock force-pushed from a98c0d2 to 9229bc0
…hotCache::refresh (stratum-mining#337)

Move all Prometheus gauge updates (set + stale-label removal) out of the /metrics HTTP handler and into SnapshotCache::refresh(), which runs as a periodic background task. This eliminates the GaugeVec reset gap where label series momentarily disappeared on every scrape.

Changes:
- SnapshotCache now owns PrometheusMetrics and PreviousLabelSets
- refresh() updates snapshot data AND Prometheus gauges atomically
- /metrics handler reduced to: set uptime gauge, gather, encode
- ServerState simplified (no more PreviousLabelSets or Mutex)
- Tests updated to wire metrics through cache via with_metrics()
- Integration tests: replace fixed-sleep assertions with poll_until_metric_gte (100ms poll, 5s deadline) for CI resilience
- Clone impl preserves previous_labels for correct stale-label detection
- debug-level tracing on stale label removal errors
- debug_assert on with_metrics double-attachment

Closes stratum-mining#337
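For illustration, here is a minimal sketch of what the reduced handler described above could look like with the prometheus crate. The function name, signature, and wiring are assumptions for the sketch, not the PR's actual code; only the "set uptime gauge, gather, encode" shape comes from the commit message.

```rust
use std::time::Instant;

use prometheus::{Gauge, Registry, TextEncoder};

// Hypothetical handler body: with all GaugeVec updates moved into
// SnapshotCache::refresh(), the /metrics endpoint only has to set the
// uptime gauge, gather the registry, and text-encode the result.
fn metrics_handler(
    registry: &Registry,
    uptime_gauge: &Gauge,
    started: Instant,
) -> prometheus::Result<String> {
    uptime_gauge.set(started.elapsed().as_secs_f64());
    TextEncoder::new().encode_to_string(&registry.gather())
}
```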
gimballock force-pushed from 9229bc0 to 9059932
Problem
Monitoring integration tests were flaky due to race conditions between metric collection and assertion. The /metrics handler synchronously collected data by acquiring locks on business logic structures, creating two issues: every scrape contended with business logic for those locks, and the handler's reset-then-repopulate pattern opened a window on each scrape in which per-channel label series momentarily disappeared.
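To make the race concrete, a hedged sketch of the reset-then-repopulate pattern follows; the function shape and data types are illustrative, and only the `.reset()` + repopulate sequence is taken from this PR.

```rust
use prometheus::GaugeVec;

// Illustrative sketch of the old pattern: a scrape that lands between
// reset() and the repopulation loop observes no per-channel series at all.
fn old_handler_update(hashrate: &GaugeVec, channels: &[(String, String, f64)]) {
    // Wipe every label combination so stale channels drop out...
    hashrate.reset();
    // ...then repopulate. Any concurrent gather() in this window sees an
    // empty GaugeVec: the "reset gap".
    for (channel_id, user_identity, rate) in channels {
        hashrate
            .with_label_values(&[channel_id.as_str(), user_identity.as_str()])
            .set(*rate);
    }
}
```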
External Evidence of the Race Condition
1. Issue #301: CI Flakiness (Reporter: lucasbalieiro)
Test `monitoring::snapshot_cache::tests::test_snapshot_refresh` was breaking coverage CI. CI runs: https://github.com/stratum-mining/sv2-apps/actions/runs/22486794825
2. Commit ff46a73 by Lucas Balieiro
The "fix" was removing timing assertions rather than fixing the root cause.
Scope: Only Per-Connection/Per-Channel Metrics Affected
Only GaugeVec metrics with per-channel labels are affected by this race condition. Simple gauges (totals, uptime) are not affected.
Affected metrics (these use `.reset()` + repopulate in the old handler):

- sv2_client_channel_hashrate{client_id, channel_id, user_identity}
- sv2_client_shares_accepted_total{client_id, channel_id, user_identity}
- sv2_server_channel_hashrate{channel_id, user_identity}
- sv2_server_shares_accepted_total{channel_id, user_identity}

Why these specifically: When a channel disconnects, its label combination becomes stale. The old code called `.reset()` on every /metrics request to clean up stale labels, creating a gap where all per-channel metrics temporarily disappear.

Not affected: Simple gauges like `sv2_uptime_seconds`, `sv2_clients_total`, and `sv2_client_hashrate_total`; these don't need label cleanup and were never reset.

Recent Commits Amplify the Problem
Share accounting fixes (6994167) and blocks_found metrics (d3d703c) keep adding more metric collection code to the handler, making the race window larger.
Solution
Move gauge updates into the background SnapshotCache refresh task. Instead of `.reset()` + repopulate-all, we now `.set()` current labels and `.remove()` only labels that are no longer present. The handler's doc comment confirms the fix.
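A sketch of the set-then-remove-stale strategy follows, assuming a `HashSet<Vec<String>>` stands in for the PreviousLabelSets the cache now owns; apart from the prometheus `GaugeVec` calls and the debug-level tracing the commit message mentions, names and types are assumptions.

```rust
use std::collections::HashSet;

use prometheus::GaugeVec;

// Illustrative refresh-side update: overwrite live series in place,
// then remove only the label sets that vanished since the last refresh.
fn update_channel_gauges(
    hashrate: &GaugeVec,
    previous: &mut HashSet<Vec<String>>,
    current: &[(String, String, f64)],
) {
    let mut seen = HashSet::new();
    for (channel_id, user_identity, rate) in current {
        // set() overwrites in place, so existing series never disappear.
        hashrate
            .with_label_values(&[channel_id.as_str(), user_identity.as_str()])
            .set(*rate);
        seen.insert(vec![channel_id.clone(), user_identity.clone()]);
    }
    // Remove only the series whose label sets are gone (stale channels).
    for stale in previous.difference(&seen) {
        let labels: Vec<&str> = stale.iter().map(String::as_str).collect();
        if let Err(e) = hashrate.remove_label_values(&labels) {
            // Per the commit message, removal errors are logged at debug
            // level rather than failing the refresh.
            tracing::debug!("failed to remove stale label set: {e}");
        }
    }
    *previous = seen;
}
```

Because live series are overwritten in place and only vanished label sets are removed, a concurrent gather() always observes a complete set of current series; the reset gap cannot occur.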
Verification
Current Branch (with fix)
Monitoring integration tests pass consistently.
Impact Assessment
Before fix (evidence from Issue #301 and ff46a73):

- `poll_until_metric_gte` uses a 5-second timeout with 100ms polling intervals

After fix:
Test Infrastructure Evidence
The very existence of
poll_until_metric_gteinprometheus_metrics_assertions.rsdemonstrates the problem:This polling exists because without atomic updates, metrics could be missing during the reset/repopulate gap.
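As a rough reconstruction of that helper (the real one lives in prometheus_metrics_assertions.rs; this signature and the metric-reading callback are assumed), the polling pattern with the 100ms interval and 5s deadline looks like:

```rust
use std::time::{Duration, Instant};

// Hypothetical poll_until_metric_gte: repeatedly read a metric value
// until it reaches the threshold or the 5-second deadline expires.
// `read_metric` stands in for whatever fetches the current value of a
// named metric from the /metrics endpoint.
fn poll_until_metric_gte(mut read_metric: impl FnMut() -> f64, threshold: f64) -> bool {
    let deadline = Instant::now() + Duration::from_secs(5);
    while Instant::now() < deadline {
        if read_metric() >= threshold {
            return true;
        }
        std::thread::sleep(Duration::from_millis(100));
    }
    false
}
```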
Changes
Closes #337