Config toggle between constant and *non-constant* growth metrics #297
gimballock started this conversation in Ideas
We have two open PRs adding new per-channel and per-client metrics:
Together with the existing per-channel hashrate and shares-accepted metrics, these bring the total to 13 high-cardinality GaugeVec definitions — each producing N time series where N is the number of connected miners/channels. The metric namespace is already hard to navigate: determining which metrics track individual miners vs aggregates, or which apply to upstream vs downstream, requires reading the code.
This is a problem for long-running services like a mining pool. Prometheus stores every distinct label combination as a separate time series, and those series persist for the configured retention period — or until you run out of disk. With a default 15-second scrape interval, the storage growth is significant:
Today you can disable monitoring entirely, but then you have zero observability — no hashrate, no client counts, nothing. There's no middle ground between "everything, including per-miner detail that grows with your fleet" and "nothing at all."
We need two modes, controlled by a config toggle: one where monitoring storage growth is constant (independent of miner count) and one where it scales at least linearly with miner count. The "no metrics" option could eventually just become the constant-storage mode — you always get aggregate observability without the unbounded storage cost.
Current metric sprawl
Today's metric names encode three concerns in the name itself: protocol (sv2_/sv1_), direction (server_/client_), and granularity (channel or not). This leads to parallel pairs that are structurally identical but have different names:
...and so on for every new observable
Adding bytes and rejections doubles this further. The server/client distinction is really just a direction — upstream vs downstream — and is better expressed as a label.
Proposal
1. Consolidate server/client into a direction label
Replace parallel metric pairs with a single metric using direction="upstream"|"downstream":
This cuts the number of metric names roughly in half. Grafana queries become simpler — one query shows both directions as separate lines automatically.
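For illustration, in Prometheus exposition format (the concrete metric names here are hypothetical; only the direction label scheme comes from this proposal):

```text
# Before: parallel pairs, one metric name per side
sv2_server_channel_hashrate{channel_id="7"}  123.4
sv2_client_channel_hashrate{channel_id="7"}  456.7

# After: one metric name, direction as a label
sv2_channel_hashrate{channel_id="7",direction="downstream"}  123.4
sv2_channel_hashrate{channel_id="7",direction="upstream"}    456.7
```

A single panel query such as `sum by (direction) (sv2_hashrate_total)` then plots both directions as separate lines, instead of needing one query per metric name.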
SV1 metrics remain in their own sv1_* namespace. SV1 has a different protocol model (no channels, no upstream direction) and keeping it separate makes discoverability straightforward: grep for sv1_ to know what exists for SV1 miners, rather than having to query for a protocol label value.
2. Add a monitoring_detailed_metrics config toggle
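A sketch of how the toggle might look in the proxy's TOML config (the surrounding keys and their defaults are hypothetical; only the `monitoring_detailed_metrics` name comes from this proposal):

```toml
# Tier 1 aggregates are always exported when monitoring is on:
# constant storage cost regardless of fleet size.
monitoring_enabled = true

# Tier 2: opt in to per-channel/per-client series.
# Storage grows with the number of connected miners.
monitoring_detailed_metrics = false
```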
The JSON API (/api/v1/clients/{id}/channels, etc.) is unaffected by this toggle — it always returns full per-channel data. The difference is that the JSON API serves data on-demand without creating persistent time series, so it doesn't contribute to storage growth.

3. Add aggregate rejection rate and cull redundant per-channel metrics
The per-channel shares_accepted GaugeVecs are redundant with per-channel hashrate (hashrate is derived from accepted shares). Cull them.

For Tier 1 aggregates, only add metrics that are independently useful to an operator who will never enable Tier 2 — not mechanical counterparts for every per-channel metric. Most per-channel metrics are diagnostic tools whose value depends on per-device attribution; summing them into an aggregate destroys the information that made them useful. It's better to show nothing than to show a number that invites misinterpretation.
Applying this filter:
- Add: sv2_shares_rejected_total{direction} — scalar aggregate. This is the missing "error rate" signal. Every monitoring system needs a success/error rate. Right now Tier 1 has throughput (hashrate) and availability (connections) but zero quality signal. You can't distinguish "100 miners hashing fine" from "100 miners submitting garbage that's all being rejected." The direction label adds real diagnostic value even at aggregate level: rejected{direction="upstream"} > 0 with rejected{direction="downstream"} == 0 means the pool is rejecting your work but your miners are fine — the problem is between your proxy and the pool. That's actionable without per-device breakdown.
- Don't add: sv2_shares_accepted_total{direction} — redundant with sv2_hashrate_total{direction}. Hashrate is derived from accepted shares. The aggregate accepted count doesn't answer a question that hashrate doesn't already answer.
- Cull: the per-channel shares_accepted GaugeVecs (the information is already captured by per-channel hashrate in detailed mode).

Before and after
Before (current main + PRs #285 and #292):
After (this proposal):
With detailed metrics off: 15 fixed series, ~1.0 MB/week regardless of miner count.
With detailed metrics on: 15 + 7*N series, same as today but better organized.
Justification for Tier 2 (detailed) metrics
Each Tier 2 metric must justify the storage cost it imposes. These are the operational scenarios where per-device/per-channel resolution is necessary and aggregates are insufficient:
Hardware failure detection. A degrading hash board (thermal throttling, fan failure, memory errors) reduces one miner's hashrate while the aggregate dips by a barely-noticeable percentage. Per-channel hashrate identifies which machine is sick. Per-channel hashrate variance over time (stddev_over_time) distinguishes intermittent throttling from steady degradation. Per-device alerting (sv2_channel_hashrate < 0.7 * avg_over_time(...)) is impossible with aggregates — the aggregate can't tell you whether a 2% fleet-wide dip is one machine at 0% or five machines each down 20%.
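The per-device alert mentioned above could be written as a Prometheus alerting rule roughly like this (the threshold, window, and `for` duration are illustrative choices, and `sv2_channel_hashrate`'s exact labels depend on the final metric set):

```yaml
groups:
  - name: miner-health
    rules:
      - alert: MinerHashrateDegraded
        # Fires per channel when current hashrate drops below 70% of
        # that channel's own 6h average; a fleet-wide aggregate cannot
        # attribute the dip to a specific machine.
        expr: sv2_channel_hashrate < 0.7 * avg_over_time(sv2_channel_hashrate[6h])
        for: 15m
        labels:
          severity: warning
```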
Firmware regression analysis. When rolling out new firmware to a subset of the fleet, per-device rejection rates grouped by firmware cohort (encoded in user_identity) are the only way to A/B test the change. The reason label on per-channel rejections distinguishes firmware bugs (elevated stale-share = slow job switching, invalid-nonce = hash engine bug) from operational issues. Aggregate rejection rate tells you "something got worse" but not which cohort caused it.
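Assuming the per-channel rejection and submission counters carry user_identity and reason labels (the submitted-shares counter name here is hypothetical), a cohort comparison could be a query like:

```promql
# Rejection rate per firmware cohort, split by reason.
# group_left keeps the per-reason breakdown on the left-hand side.
sum by (user_identity, reason) (rate(sv2_channel_shares_rejected_total[10m]))
  / ignoring(reason) group_left
sum by (user_identity) (rate(sv2_channel_shares_submitted_total[10m]))
```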
Template provider and upstream diagnosis. Per-channel bytes received from upstream detects when a pool stops sending jobs to specific channels. Per-channel rejection rates with the direction label distinguish "the pool is rejecting everyone equally" (template/upstream issue) from "the pool is rejecting only specific channels" (pool-side bug or routing issue). The latter is invisible at aggregate level.
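The upstream-vs-downstream divergence described in the Tier 1 discussion is expressible entirely against the aggregate counters (rule shape illustrative):

```promql
# Pool is rejecting our work while local miners submit cleanly:
# the problem sits between the proxy and the pool.
rate(sv2_shares_rejected_total{direction="upstream"}[5m]) > 0
  and ignoring(direction)
rate(sv2_shares_rejected_total{direction="downstream"}[5m]) == 0
```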
Experimental optimization. Overclocking profiles, target tuning, share batch sizes, and network configuration changes require per-subject measurement. Statistical significance testing needs individual data points, not just the mean. Per-channel bytes-per-share ratios detect protocol-level inefficiencies introduced by configuration changes. Aggregate data cannot distinguish "all miners improved 10%" from "half improved 20% and half stayed the same."
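A per-channel bytes-per-share ratio (both metric names hypothetical, shown only to illustrate the shape of the query) might look like:

```promql
# Upstream bytes consumed per accepted share, per channel:
# a drift after a config change points at protocol-level inefficiency.
rate(sv2_channel_bytes_total{direction="upstream"}[30m])
  / ignoring(direction)
rate(sv2_channel_shares_accepted_total[30m])
```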
Multi-tenant accountability. In hosted mining or mining-as-a-service, per-identity hashrate over time is required for billing and SLA verification. This is impossible to derive from aggregate metrics.
What this doesn't save
This is primarily a Prometheus storage concern. The JSON API (/api/v1/clients/{id}/channels) continues to serve full per-channel detail regardless of the toggle — it needs to for operational use. The difference is that the API returns data on-demand from a periodically-refreshed snapshot, not as an infinite time series. The snapshot memory cost (~276 bytes per channel) is the same either way.
Summary of benefits
Design guideline for future metrics
When adding a new observable: