Config toggle between constant and *non-constant* growth metrics #297
gimballock started this conversation in Ideas
We have two open PRs adding new per-channel and per-client metrics:
Together with the existing per-channel hashrate and shares-accepted metrics, these bring the total to 13 high-cardinality GaugeVec definitions — each producing N time series where N is the number of connected miners/channels. The metric namespace is already hard to navigate: determining which metrics track individual miners vs aggregates, or which apply to upstream vs downstream, requires reading the code.
This is a problem for long-running services like a mining pool. Prometheus stores every distinct label combination as a separate time series, and those series persist for the configured retention period — or until you run out of disk. With a default 15-second scrape interval, the storage growth is significant:
Today you can disable monitoring entirely, but then you have zero observability — no hashrate, no client counts, nothing. There's no middle ground between "everything, including per-miner detail that grows with your fleet" and "nothing at all."
We need two modes, controlled by a config toggle: one where monitoring storage growth is constant (independent of miner count) and one where it scales at least linearly with miner count. The "no metrics" option could eventually just become the constant-storage mode — you always get aggregate observability without the unbounded storage cost.
Current metric sprawl
Today's metric names encode three concerns in the name itself: protocol (sv2_/sv1_), direction (server_/client_), and granularity (channel or not). This leads to parallel pairs that are structurally identical but have different names:
...and so on for every new observable
Adding bytes and rejections doubles this further. The server/client distinction is really just a direction — upstream vs downstream — and is better expressed as a label.
Proposal
1. Consolidate server/client into a direction label
Replace parallel metric pairs with a single metric using direction="upstream"|"downstream":
This cuts the number of metric names roughly in half. Grafana queries become simpler — one query shows both directions as separate lines automatically.
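For illustration, in Prometheus exposition format (the concrete metric names here are hypothetical; only the direction label scheme comes from this proposal):

```text
# Before: parallel pairs, one metric name per side
sv2_server_channel_hashrate{channel_id="7"}  123.4
sv2_client_channel_hashrate{channel_id="7"}  456.7

# After: one metric name, direction as a label
sv2_channel_hashrate{channel_id="7",direction="downstream"}  123.4
sv2_channel_hashrate{channel_id="7",direction="upstream"}    456.7
```

A single panel query such as `sum by (direction) (sv2_hashrate_total)` then plots both directions as separate lines, instead of needing one query per metric name.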
SV1 metrics remain in their own sv1_* namespace. SV1 has a different protocol model (no channels, no upstream direction) and keeping it separate makes discoverability straightforward: grep for sv1_ to know what exists for SV1 miners, rather than having to query for a protocol label value.
2. Add a monitoring_detailed_metrics config toggle
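A sketch of how the toggle might look in the proxy's TOML config (the surrounding keys and their defaults are hypothetical; only the `monitoring_detailed_metrics` name comes from this proposal):

```toml
# Tier 1 aggregates are always exported when monitoring is on:
# constant storage cost regardless of fleet size.
monitoring_enabled = true

# Tier 2: opt in to per-channel/per-client series.
# Storage grows with the number of connected miners.
monitoring_detailed_metrics = false
```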
The JSON API (/api/v1/clients/{id}/channels, etc.) is unaffected by this toggle — it always returns full per-channel data. The difference is that the JSON API serves data on-demand without creating persistent time series, so it doesn't contribute to storage growth.

3. Add aggregate rejection rate and cull redundant per-channel metrics
The per-channel shares_accepted GaugeVecs are redundant with per-channel hashrate (hashrate is derived from accepted shares). Cull them.

For Tier 1 aggregates, only add metrics that are independently useful to an operator who will never enable Tier 2 — not mechanical counterparts for every per-channel metric. Most per-channel metrics are diagnostic tools whose value depends on per-device attribution; summing them into an aggregate destroys the information that made them useful. It's better to show nothing than to show a number that invites misinterpretation.
Applying this filter:
- Add: sv2_shares_rejected_total{direction} — scalar aggregate. This is the missing "error rate" signal. Every monitoring system needs a success/error rate. Right now Tier 1 has throughput (hashrate) and availability (connections) but zero quality signal. You can't distinguish "100 miners hashing fine" from "100 miners submitting garbage that's all being rejected." The direction label adds real diagnostic value even at aggregate level: rejected{direction="upstream"} > 0 with rejected{direction="downstream"} == 0 means the pool is rejecting your work but your miners are fine — the problem is between your proxy and the pool. That's actionable without per-device breakdown.
- Don't add: sv2_shares_accepted_total{direction} — redundant with sv2_hashrate_total{direction}. Hashrate is derived from accepted shares. The aggregate accepted count doesn't answer a question that hashrate doesn't already answer.
- Cull: the per-channel shares_accepted GaugeVecs (the information is already captured by per-channel hashrate in detailed mode).

Before and after
Before (current main + PRs #285 and #292):
After (this proposal):
With detailed metrics off: 15 fixed series, ~1.0 MB/week regardless of miner count.
With detailed metrics on: 15 + 7*N series, same as today but better organized.
Justification for Tier 2 (detailed) metrics
Each Tier 2 metric must justify the storage cost it imposes. These are the operational scenarios where per-device/per-channel resolution is necessary and aggregates are insufficient:
Hardware failure detection. A degrading hash board (thermal throttling, fan failure, memory errors) reduces one miner's hashrate while the aggregate dips by a barely-noticeable percentage. Per-channel hashrate identifies which machine is sick. Per-channel hashrate variance over time (stddev_over_time) distinguishes intermittent throttling from steady degradation. Per-device alerting (sv2_channel_hashrate < 0.7 * avg_over_time(...)) is impossible with aggregates — the aggregate can't tell you whether a 2% fleet-wide dip is one machine at 0% or five machines each down 20%.
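The per-device alert mentioned above could be written as a Prometheus alerting rule roughly like this (the threshold, window, and `for` duration are illustrative choices, and `sv2_channel_hashrate`'s exact labels depend on the final metric set):

```yaml
groups:
  - name: miner-health
    rules:
      - alert: MinerHashrateDegraded
        # Fires per channel when current hashrate drops below 70% of
        # that channel's own 6h average; a fleet-wide aggregate cannot
        # attribute the dip to a specific machine.
        expr: sv2_channel_hashrate < 0.7 * avg_over_time(sv2_channel_hashrate[6h])
        for: 15m
        labels:
          severity: warning
```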
Firmware regression analysis. When rolling out new firmware to a subset of the fleet, per-device rejection rates grouped by firmware cohort (encoded in user_identity) are the only way to A/B test the change. The reason label on per-channel rejections distinguishes firmware bugs (elevated stale-share = slow job switching, invalid-nonce = hash engine bug) from operational issues. Aggregate rejection rate tells you "something got worse" but not which cohort caused it.
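Assuming the per-channel rejection and submission counters carry user_identity and reason labels (the submitted-shares counter name here is hypothetical), a cohort comparison could be a query like:

```promql
# Rejection rate per firmware cohort, split by reason.
# group_left keeps the per-reason breakdown on the left-hand side.
sum by (user_identity, reason) (rate(sv2_channel_shares_rejected_total[10m]))
  / ignoring(reason) group_left
sum by (user_identity) (rate(sv2_channel_shares_submitted_total[10m]))
```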
Template provider and upstream diagnosis. Per-channel bytes received from upstream detects when a pool stops sending jobs to specific channels. Per-channel rejection rates with the direction label distinguish "the pool is rejecting everyone equally" (template/upstream issue) from "the pool is rejecting only specific channels" (pool-side bug or routing issue). The latter is invisible at aggregate level.
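The upstream-vs-downstream divergence described in the Tier 1 discussion is expressible entirely against the aggregate counters (rule shape illustrative):

```promql
# Pool is rejecting our work while local miners submit cleanly:
# the problem sits between the proxy and the pool.
rate(sv2_shares_rejected_total{direction="upstream"}[5m]) > 0
  and ignoring(direction)
rate(sv2_shares_rejected_total{direction="downstream"}[5m]) == 0
```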
Experimental optimization. Overclocking profiles, target tuning, share batch sizes, and network configuration changes require per-subject measurement. Statistical significance testing needs individual data points, not just the mean. Per-channel bytes-per-share ratios detect protocol-level inefficiencies introduced by configuration changes. Aggregate data cannot distinguish "all miners improved 10%" from "half improved 20% and half stayed the same."
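A per-channel bytes-per-share ratio (both metric names hypothetical, shown only to illustrate the shape of the query) might look like:

```promql
# Upstream bytes consumed per accepted share, per channel:
# a drift after a config change points at protocol-level inefficiency.
rate(sv2_channel_bytes_total{direction="upstream"}[30m])
  / ignoring(direction)
rate(sv2_channel_shares_accepted_total[30m])
```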
Multi-tenant accountability. In hosted mining or mining-as-a-service, per-identity hashrate over time is required for billing and SLA verification. This is impossible to derive from aggregate metrics.
What this doesn't save
This is primarily a Prometheus storage concern. The JSON API (/api/v1/clients/{id}/channels) continues to serve full per-channel detail regardless of the toggle — it needs to for operational use. The difference is that the API returns data on-demand from a periodically-refreshed snapshot, not as an infinite time series. The snapshot memory cost (~276 bytes per channel) is the same either way.
Summary of benefits
Design guideline for future metrics
When adding a new observable: