Fix/337 move metrics to cache refresh by gimballock · Pull Request #338 · stratum-mining/sv2-apps

gimballock · 2026-03-17T21:54:09Z

Problem

Monitoring integration tests were flaky due to race conditions between metric collection and assertion. The /metrics handler synchronously collected data by acquiring locks on business logic structures, creating two issues:

Test flakiness: Data could change between scrape and assertion
Production DoS: Rapid scrapes could block share validation/job distribution

External Evidence of the Race Condition

1. Issue #301: CI Flakiness (Reporter: lucasbalieiro)
Test monitoring::snapshot_cache::tests::test_snapshot_refresh was breaking coverage CI:

failures:
    monitoring::snapshot_cache::tests::test_snapshot_refresh

thread 'monitoring::snapshot_cache::tests::test_snapshot_refresh' panicked at 
stratum-apps/src/monitoring/snapshot_cache.rs:216:31:
assertion failed: metrics_text.contains(...)

CI runs: https://github.com/stratum-mining/sv2-apps/actions/runs/22486794825

2. Commit ff46a73 by Lucas Balieiro

remove time sensitive assertion from `monitoring::snapshot_cache::tests::test_snapshot_refresh`

this is being removed because it was causing the CI job responsible for 
generating the code coverage to be flaky.

the CI environment has very constrained resources and the test is expecting 
some things to happens in less than 100ms

The "fix" was removing timing assertions rather than fixing the root cause.

Scope: Only Per-Connection/Per-Channel Metrics Affected

Only GaugeVec metrics with per-channel labels are affected by this race condition. Simple gauges (totals, uptime) are not affected.

Affected metrics (these use .reset() + repopulate in the old handler):

sv2_client_channel_hashrate{client_id, channel_id, user_identity}
sv2_client_shares_accepted_total{client_id, channel_id, user_identity}
sv2_server_channel_hashrate{channel_id, user_identity}
sv2_server_shares_accepted_total{channel_id, user_identity}

Why these specifically: When a channel disconnects, its label combination becomes stale. The old code called .reset() on every /metrics request to clean up stale labels, creating a gap where all per-channel metrics temporarily disappear.

Not affected: Simple gauges like sv2_uptime_seconds, sv2_clients_total, sv2_client_hashrate_total — these don't need label cleanup and were never reset.

Recent Commits Amplify the Problem

Share accounting fixes (6994167) and blocks_found metrics (d3d703c) keep adding more metric collection code to the handler, making the race window larger.

Solution

Move gauge updates into the background SnapshotCache refresh task:

Atomic updates: Snapshot data and Prometheus gauges are now updated together during cache refresh, not on-demand during scrapes
Targeted stale-label removal: Instead of .reset() + repopulate-all, we now .set() current labels and .remove() only labels that are no longer present
Simplified /metrics handler: Now just sets uptime gauge, gathers, and encodes — no more label manipulation

The handler's doc comment confirms the fix:

"Because metric values are always kept in sync with the snapshot data, there is
never a gap where label series momentarily disappear. Tests can assert on metrics
directly after a cache refresh without polling for transient states."

Verification

Current Branch (with fix)

Monitoring integration tests pass consistently:

cargo nextest run -E 'test(monitoring)' --package integration_tests_sv2
PASS [ 17.859s] pool_monitoring_with_sv2_mining_device
PASS [ 14.763s] jd_aggregated_topology_monitoring  
PASS [ 13.687s] pool_and_tproxy_monitoring_with_sv1_miner

Impact Assessment

Before fix (evidence from Issue #301 and ff46a73):

CI failures under load (constrained resources)
Timing assertions removed to stop CI breakage
poll_until_metric_gte uses 5-second timeout with 100ms polling intervals
Up to 50 polling attempts per metric assertion

After fix:

Atomic updates eliminate the race window
Polling now only waits for cache refresh timing (~1s in tests), not race recovery
Estimated 5× reduction in retry attempts (10 vs 50 polls typical)
CI failures resolved

Test Infrastructure Evidence

The very existence of poll_until_metric_gte in prometheus_metrics_assertions.rs demonstrates the problem:

/// Poll /metrics until metric_name is present with a value >= min...
/// The handler calls .reset() on every /metrics request before repopulating
/// from the cached snapshot, so a label combination is only present when
/// the snapshot contains a non-default value for it.
pub async fn poll_until_metric_gte(...) -> String

This polling exists because without atomic updates, metrics could be missing during the reset/repopulate gap.

Changes

SnapshotCache now owns PrometheusMetrics and PreviousLabelSets
refresh() updates snapshot data AND Prometheus gauges atomically
/metrics handler reduced to: set uptime gauge, gather, encode
ServerState simplified (no more PreviousLabelSets or Mutex)

Closes #337

gimballock · 2026-03-17T23:09:00Z

Addresses #337

Shourya742

Can you explain what this PR does?

gimballock · 2026-04-05T15:19:13Z

Can you explain what this PR does?

Sorry for forgetting the description, it might have gotten attention sooner if I had made it more recognizable, lol.

…hotCache::refresh (stratum-mining#337) Move all Prometheus gauge updates (set + stale-label removal) out of the /metrics HTTP handler and into SnapshotCache::refresh(), which runs as a periodic background task. This eliminates the GaugeVec reset gap where label series momentarily disappeared on every scrape. Changes: - SnapshotCache now owns PrometheusMetrics and PreviousLabelSets - refresh() updates snapshot data AND Prometheus gauges atomically - /metrics handler reduced to: set uptime gauge, gather, encode - ServerState simplified (no more PreviousLabelSets or Mutex) - Tests updated to wire metrics through cache via with_metrics() - Integration tests: replace fixed-sleep assertions with poll_until_metric_gte (100ms poll, 5s deadline) for CI resilience - Clone impl preserves previous_labels for correct stale-label detection - debug-level tracing on stale label removal errors - debug_assert on with_metrics double-attachment Closes stratum-mining#337

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 3 times, most recently from 56a1252 to a0e972e Compare March 17, 2026 23:06

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 4 times, most recently from 0a55dce to 13457ee Compare March 18, 2026 16:59

gimballock changed the title ~~[Draft] Fix/337 move metrics to cache refresh~~ Fix/337 move metrics to cache refresh Mar 18, 2026

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 3 times, most recently from 04dfe60 to 1b016e3 Compare March 23, 2026 14:14

GitGab19 linked an issue Mar 23, 2026 that may be closed by this pull request

fix GaugeVec reset gap in handle_prometheus_metrics #337

Open

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 2 times, most recently from 150b77a to 985afe4 Compare March 24, 2026 15:42

gimballock mentioned this pull request Mar 24, 2026

refine poll_until_metric_gte to support label-specific assertions #336

Open

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 5 times, most recently from 28e6fe2 to 224e8fc Compare March 30, 2026 14:21

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch from 9443c5d to 7d09803 Compare April 4, 2026 02:21

Shourya742 reviewed Apr 5, 2026

View reviewed changes

Comment thread docker/Dockerfile Outdated

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 2 times, most recently from de781ad to 01a7d9e Compare April 5, 2026 15:09

gimballock mentioned this pull request Apr 5, 2026

Add integration and unit tests for monitoring JSON API endpoints #373

Open

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 3 times, most recently from c6f46b5 to a98c0d2 Compare April 13, 2026 13:38

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch from a98c0d2 to 9229bc0 Compare April 13, 2026 13:41

gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch from 9229bc0 to 9059932 Compare April 14, 2026 23:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/337 move metrics to cache refresh#338

Fix/337 move metrics to cache refresh#338
gimballock wants to merge 1 commit intostratum-mining:mainfrom
fossatmara:fix/337-move-metrics-to-cache-refresh

gimballock commented Mar 17, 2026 •

edited

Loading

Uh oh!

gimballock commented Mar 17, 2026

Uh oh!

Shourya742 left a comment

Uh oh!

Uh oh!

gimballock commented Apr 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gimballock commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

External Evidence of the Race Condition

Scope: Only Per-Connection/Per-Channel Metrics Affected

Recent Commits Amplify the Problem

Solution

Verification

Current Branch (with fix)

Impact Assessment

Test Infrastructure Evidence

Changes

Uh oh!

gimballock commented Mar 17, 2026

Uh oh!

Shourya742 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gimballock commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gimballock commented Mar 17, 2026 •

edited

Loading

gimballock commented Apr 5, 2026 •

edited

Loading