Block count and plumbing for low-latency metrics #271
Replies: 2 comments 1 reply
I don't think the small latency is a strong enough point IMO, because you're not going to rely on the monitoring data for any urgent actions. You're going to use it for monitoring your operations, so waiting ~15s to get the updated value should be acceptable. Have you considered simply adding this new field to the current `ShareAccounting`?
I agree that this is an easy one to incorporate into `ShareAccounting`, and that's probably the right call. The argument I'm making is simply that blocks found are a business/revenue metric, and time sensitivity for this metric could be considered a different category beyond operational service-level objectives. No one is asking me for this specifically, but I'm curious what the parameters of this position are. There is a lot written about what we are calling event-based vs snapshot-based metrics, and I get that we want to resist event-based as long as possible because of the tighter coupling of metric libraries to core logic, plus the overhead of doing non-essential work in critical-path logic. So I'm just wondering: is there any situation where you can imagine needing to break out of the snapshot-based metric model? Does `ShareAccounting` get too big at some point for internal tracking, or is there some type of event so important that 15s is a significant, actionable time savings?
Block Count Metric — Design Discussion
We need a `blocks_found` metric for the pool. I want to lay out two approaches and make a case for why this is a good opportunity to introduce event-driven metrics alongside our existing snapshot-based system.

Background: How Metrics Work Today
Currently, all our Prometheus metrics are gauges populated from a `SnapshotCache` that refreshes every ~15 seconds. On each refresh cycle, a background task acquires locks on the business logic, reads the current state (channels, hashrate, shares, etc.), and writes it into a cache. When Prometheus scrapes `/metrics`, we just read from that cache, with no lock contention on the hot path. This was a deliberate design to prevent monitoring scrapes from becoming a DoS vector against share validation.

This works well for continuously varying state like hashrate, channel counts, and connected clients. But it has a fundamental mismatch with discrete events like finding a block.
Option A: Snapshot-Based (Path of Least Resistance)
Add an internal `blocks_found: AtomicU32` counter somewhere in the channel manager. The 15s snapshot refresh reads it and exposes it as a Prometheus gauge.

Pros:

- Minimal change: it follows the existing `SnapshotCache` pattern, so there's no new plumbing or abstraction.
- Nothing in the block-found call site beyond a single atomic increment.
Cons:

- Up to ~15s of latency between finding a block and the metric reflecting it.
- Counters, not gauges, are the right Prometheus type for discrete events: `rate()` and `increase()` are designed for them. When you model a monotonically increasing value as a gauge, you lose `rate()` reset-detection and PromQL ergonomics.
- Alerting has to be written as `changes(sv2_blocks_found_total[1m]) > 0` instead of the idiomatic `increase(sv2_blocks_found_counter[5m])`. The former is fragile and sensitive to scrape alignment.

Option B: Event-Driven Counter (The Right Metric Type)
Expose a `prometheus::Counter` (or `IntCounter`) directly and increment it at the call site when a block is found. Two sub-options for how to get the counter to the call site:

1. Pass the metric through the component hierarchy: thread the `IntCounter` (or a small metrics handle struct) down through constructors so it's available where the `SubmitSharesSuccess` / block-found logic runs. Simple and explicit, but it adds a parameter to a few constructors.
2. Callback / event-bus pattern: register a closure or channel that the block-found handler fires into, and the monitoring layer listens on the other end. More decoupled, but more machinery.
For a single counter, option (1) is probably the right call — it's a few lines of plumbing, no new abstractions. We can revisit with a callback pattern if we end up needing many event-driven metrics.
Pros:

- Zero latency: the metric updates the instant the block is found, visible on the next scrape.
- Correct metric type: `rate()`, `increase()`, and `resets()` all work correctly out of the box. Alerting is idiomatic: `increase(sv2_blocks_found_total[5m]) > 0`.

Cons:

- Couples a metrics handle to core logic: the counter has to be threaded through a few constructors.
- The increment runs on the critical path, though a single atomic add is negligible overhead.
Why This Matters Beyond Block Counts
The snapshot pattern is the right default for high-cardinality state that changes continuously (per-channel hashrate, client counts). But discrete, low-frequency, high-importance events are the textbook use case for counters. Blocks found is the clearest example, but the same argument applies to future metrics like error counts, reconnection events, or job distribution failures.
Establishing the event-driven counter pattern now — even for just one metric — gives us the foundation to handle these correctly going forward without having to retrofit later.
Practical Impact
For block counts specifically, the snapshot-gauge approach would work. Blocks are rare enough that we're unlikely to miss increments between snapshots. But "it works" and "it's the right tool" are different things. The latency difference matters for alerting, the counter semantics matter for PromQL ergonomics, and the pattern matters for what comes next.
I'd advocate for Option B(1): pass an `IntCounter` handle to the block-found code path. It's maybe 10-15 lines of plumbing across 2-3 files, and it gives us a properly typed, zero-latency metric with idiomatic alerting support.

Thoughts?