Block count and plumbing for low-latency metrics #271
Replies: 2 comments 1 reply
I don't think the small latency is a strong enough point IMO, because you're not going to rely on the monitoring data for any urgent actions. You're going to use it for monitoring your operations, so waiting ~15s to get the updated value should be acceptable. Have you considered simply adding this new field to the current `ShareAccounting`?
I agree that this is an easy one to incorporate into `ShareAccounting`, and that's probably the right call. The argument I'm making is simply that blocks found are a business/revenue metric, and time sensitivity for this metric could be considered a different category beyond operational service-level objectives. No one is asking me for this specifically, but I'm curious what the parameters of this position are. There is a lot written about what we are calling event-based vs snapshot-based metrics, and I get that we want to resist event-based as long as possible because of the tighter coupling of metric libraries to core logic, plus the overhead of doing non-essential work in critical-path logic. So I'm just wondering: is there any situation where you can imagine needing to break out of the snapshot-based metric model? Does `ShareAccounting` get too big at some point for internal tracking, or is there some type of event so important that 15s is a significant, actionable time savings?
Block Count Metric — Design Discussion
We need a `blocks_found` metric for the pool. I want to lay out two approaches and make a case for why this is a good opportunity to introduce event-driven metrics alongside our existing snapshot-based system.

Background: How Metrics Work Today
Currently, all our Prometheus metrics are gauges populated from a `SnapshotCache` that refreshes every ~15 seconds. On each refresh cycle, a background task acquires locks on the business logic, reads the current state (channels, hashrate, shares, etc.), and writes it into a cache. When Prometheus scrapes `/metrics`, we just read from that cache, with no lock contention on the hot path. This was a deliberate design to prevent monitoring scrapes from becoming a DoS vector against share validation.

This works well for continuously varying state like hashrate, channel counts, and connected clients. But it has a fundamental mismatch with discrete events like finding a block.
Option A: Snapshot-Based (Path of Least Resistance)
Add an internal `blocks_found: AtomicU32` counter somewhere in the channel manager. The 15s snapshot refresh reads it and exposes it as a Prometheus gauge.

Pros:

- Minimal change: it follows the existing `SnapshotCache` pattern, so there's no new plumbing or abstraction.
- Nothing in the block-found call site beyond a single atomic increment.
Cons:

- Up to ~15s of latency between finding a block and the metric reflecting it.
- Counters, not gauges, are the right Prometheus type for discrete events: `rate()` and `increase()` are designed for them. When you model a monotonically increasing value as a gauge, you lose `rate()` reset-detection and PromQL ergonomics.
- Alerting has to be written as `changes(sv2_blocks_found_total[1m]) > 0` instead of the idiomatic `increase(sv2_blocks_found_counter[5m])`. The former is fragile and sensitive to scrape alignment.

Option B: Event-Driven Counter (The Right Metric Type)
Expose a `prometheus::Counter` (or `IntCounter`) directly and increment it at the call site when a block is found. Two sub-options for how to get the counter to the call site:

1. Pass the metric through the component hierarchy: thread the `IntCounter` (or a small metrics handle struct) down through constructors so it's available where the `SubmitSharesSuccess` / block-found logic runs. Simple and explicit, but it adds a parameter to a few constructors.
2. Callback / event-bus pattern: register a closure or channel that the block-found handler fires into, and the monitoring layer listens on the other end. More decoupled, but more machinery.
For a single counter, option (1) is probably the right call — it's a few lines of plumbing, no new abstractions. We can revisit with a callback pattern if we end up needing many event-driven metrics.
Pros:

- Zero latency: the metric updates the instant the block is found, visible on the next scrape.
- Correct metric type: `rate()`, `increase()`, and `resets()` all work correctly out of the box. Alerting is idiomatic: `increase(sv2_blocks_found_total[5m]) > 0`.

Cons:

- Couples a metrics handle to core logic: the counter has to be threaded through a few constructors.
- The increment runs on the critical path, though a single atomic add is negligible overhead.
Why This Matters Beyond Block Counts
The snapshot pattern is the right default for high-cardinality state that changes continuously (per-channel hashrate, client counts). But discrete, low-frequency, high-importance events are the textbook use case for counters. Blocks found is the clearest example, but the same argument applies to future metrics like error counts, reconnection events, or job distribution failures.
Establishing the event-driven counter pattern now — even for just one metric — gives us the foundation to handle these correctly going forward without having to retrofit later.
Practical Impact
For block counts specifically, the snapshot-gauge approach would work. Blocks are rare enough that we're unlikely to miss increments between snapshots. But "it works" and "it's the right tool" are different things. The latency difference matters for alerting, the counter semantics matter for PromQL ergonomics, and the pattern matters for what comes next.
I'd advocate for Option B(1): pass an `IntCounter` handle to the block-found code path. It's maybe 10-15 lines of plumbing across 2-3 files, and it gives us a properly typed, zero-latency metric with idiomatic alerting support.

Thoughts?