Add redis metrics giving better view into redis #4743
Greptile Summary
This PR adds a new Redis stream metrics subsystem to the event ingester: a
Confidence Score: 4/5
Safe to merge after fixing the zero CollectionInterval panic; the feature is disabled by default, so it only affects users who opt in without setting the interval.
Most concerns from prior review rounds are addressed. One new P1 remains: an unchecked zero CollectionInterval causes a hard panic. The feature defaults to disabled, so production impact is low.
Files flagged: internal/eventingester/metrics/redis/collector.go (CollectionInterval zero guard), internal/eventingester/repository/scanner_test.go (Redis DB conflict)
Sequence Diagram
sequenceDiagram
participant App as ingester.Run
participant LC as LeaderController
participant C as Collector.Run (goroutine)
participant S as Scanner.ScanAll
participant R as Redis
participant P as Prometheus Scraper
App->>LC: leaderController.Run(ctx)
App->>C: collector.Run(ctx)
Note over C: initial jitter delay [0, initialDelayMax)
loop Every CollectionInterval
C->>LC: GetToken().Leader()?
alt is leader
C->>S: ScanAll(ctx)
S->>R: SCAN Events:* (batched)
S->>R: Pipeline: XINFO STREAM + MEMORY USAGE
R-->>S: stream info + memory bytes
S-->>C: []StreamInfo
C->>C: resetMetricsForNewCycle()
C->>C: update topN/queue gauges + histograms
C->>C: collectSnapshot() state.Store(metrics)
else not leader
C->>C: ClearState()
end
end
P->>C: Collect(ch)
C->>LC: GetToken().Leader()?
alt is leader
C->>P: emit cached []prometheus.Metric from state
else
C-->>P: return (no metrics)
end
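The leader gating in the diagram can be sketched roughly as below. This is an illustration only: `collector`, `snapshot`, `runCycle` and `collect` are hypothetical names, and a plain map stands in for the cached `[]prometheus.Metric`. The point is that the collection loop and the scrape path both check leadership, and that a non-leader clears its cached state.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot stands in for the cached metrics stored after each scan.
type snapshot struct {
	values map[string]float64
}

// collector is a hypothetical, simplified version of the collector in the diagram.
type collector struct {
	isLeader func() bool
	state    atomic.Pointer[snapshot]
}

// runCycle is one iteration of the collection loop: scan and cache when
// leader, clear the cached state otherwise.
func (c *collector) runCycle(scan func() map[string]float64) {
	if !c.isLeader() {
		c.state.Store(nil)
		return
	}
	c.state.Store(&snapshot{values: scan()})
}

// collect mirrors the scrape path: only the leader emits the cached values.
func (c *collector) collect() map[string]float64 {
	if !c.isLeader() {
		return nil
	}
	s := c.state.Load()
	if s == nil {
		return nil
	}
	return s.values
}

func main() {
	leader := true
	c := &collector{isLeader: func() bool { return leader }}

	c.runCycle(func() map[string]float64 {
		return map[string]float64{"queue-a": 3}
	})
	fmt.Println("as leader:", c.collect())

	leader = false
	fmt.Println("as follower:", c.collect())
}
```

Storing the snapshot behind an atomic pointer keeps the scrape path from blocking on a scan that is still in progress.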
bytesHistogram: prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "armada_redis_stream_size_bytes_distribution",
	Help:    "Distribution of Redis stream sizes in bytes",
	Buckets: prometheus.ExponentialBuckets(1024, 2, 20),
These buckets are too small
Discussed this.
Make the byte and event histograms 30 buckets, and the age histogram 20 buckets.
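To make the bucket discussion concrete: with a start of 1024 and a factor of 2, the largest of 20 buckets is 1024 × 2^19 = 512 MiB, while 30 buckets reach 1024 × 2^29 = 512 GiB. The sketch below reimplements the bucket arithmetic with the standard library so that it runs without the Prometheus client. `exponentialBuckets` is a local stand-in for `prometheus.ExponentialBuckets`, assuming the same start/factor/count semantics.

```go
package main

import "fmt"

// exponentialBuckets is a local stand-in for prometheus.ExponentialBuckets:
// it returns count upper bounds, starting at start and multiplying by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	const mib = 1 << 20
	for _, n := range []int{20, 30} {
		b := exponentialBuckets(1024, 2, n)
		fmt.Printf("%d buckets: largest upper bound = %.0f MiB\n", n, b[len(b)-1]/mib)
	}
}
```

Any stream above the largest bound lands in the implicit `+Inf` bucket, which is why the smaller range loses resolution for large streams.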
),
topNEventsGauge: prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "armada_redis_stream_event_count",
It'd be good to use a constant prefix, as we do everywhere else.
Possibly it should be armada_event_redis, although I'm not 100% sure:
- I can envision us having more Redis instances in future
- So either the name needs to distinguish them, or we need another label (instance, purpose, something like that)
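One way the constant-prefix suggestion could look is sketched below. `metricPrefix` and `metricName` are hypothetical names, and `armada_event_redis_` is only the candidate prefix floated in this comment, not a settled choice.

```go
package main

import "fmt"

// metricPrefix is hypothetical; the review only floats "armada_event_redis"
// as one possible prefix.
const metricPrefix = "armada_event_redis_"

// metricName builds a full metric name from a suffix so that every metric
// in the package shares the same prefix.
func metricName(suffix string) string {
	return metricPrefix + suffix
}

func main() {
	fmt.Println(metricName("stream_event_count"))
	fmt.Println(metricName("stream_size_bytes_distribution"))
}
```

The alternative raised above, a distinguishing label, would keep a generic name and add a constant label per Redis instance.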
pipelineBatchSize: 500
interBatchDelay: 100ms
memoryUsageSamples: 5
# metricsRedis: # Optional: separate Redis for metrics collection
Does this Redis need "metrics" in its name? It is already under the metrics section.
…r defaults
- Restructure config types: RedisMemoryMetrics (top-level) → Metrics.Redis (nested)
- Update server wiring to read config.Metrics.Redis.* (11 references)
- Apply reviewer-requested defaults: collectionInterval 15m, memoryUsageSamples 25
- Update all YAML surfaces: primary, local dev, and Helm values
- Preserve dev-specific values in local configs (enabled, fast interval, standalone)
Addresses JamesMurkin review comments on PR #4743
// Delay to guard against crash loops during startup and to prevent a thundering herd on leadership changes.
// The delay is drawn uniformly from [0, 1 minute), so collectors that start together are spread across up to a minute.
initialDelay := time.Duration(rand.Int64N(int64(1 * time.Minute)))
@JamesMurkin here is the initial delay, as a random interval of up to a minute. I think this is fine, but we can change it.
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
masipauskas left a comment
notes:
- high-level skim through, lgtm.
- I don't see how metrics from the topN/queue histograms drop out for deleted queues, instead of being reset to 0.
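On the deleted-queue question, the difference between dropping a series and zeroing it can be shown with a plain map standing in for a labelled gauge vector. This is a sketch only: `gaugeSet` and `rebuild` are hypothetical, and whether the PR's `resetMetricsForNewCycle()` behaves this way is not confirmed here. With the Prometheus Go client, the comparable step would be calling `Reset()` on the `GaugeVec` before repopulating it, which removes all existing label series.

```go
package main

import "fmt"

// gaugeSet is a stand-in for a labelled gauge vector keyed by queue name.
type gaugeSet map[string]float64

// rebuild creates a fresh set from the latest scan, so queues missing from
// the scan have no entry at all rather than a stale or zero value.
func rebuild(scan map[string]float64) gaugeSet {
	fresh := make(gaugeSet, len(scan))
	for queue, value := range scan {
		fresh[queue] = value
	}
	return fresh
}

func main() {
	gauges := rebuild(map[string]float64{"queue-a": 10, "queue-b": 4})

	// queue-b was deleted before the next cycle.
	gauges = rebuild(map[string]float64{"queue-a": 12})

	_, stillPresent := gauges["queue-b"]
	fmt.Println("queue-a:", gauges["queue-a"])
	fmt.Println("queue-b still exported:", stillPresent)
}
```

Rebuilding from scratch each cycle avoids having to track which queues disappeared between scans.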
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>