Add redis metrics giving better view into redis by nikola-jokic · Pull Request #4743 · armadaproject/armada

nikola-jokic · 2026-03-06T12:27:14Z

No description provided.

greptile-apps · 2026-03-17T14:55:55Z

Greptile Summary

This PR adds a new Redis stream metrics subsystem to the event ingester: a Scanner that pipelines XINFO STREAM + MEMORY USAGE calls across all Events:* keys, and a Prometheus Collector that periodically runs the scanner, caches an atomic snapshot, and gates metric emission behind Kubernetes or standalone leader election.

P1 — CollectionInterval: 0 panics on startup: time.NewTicker is called unconditionally with c.config.CollectionInterval; if the field is absent from the YAML, Go's zero-value time.Duration(0) is passed and the process crashes with "non-positive interval for NewTicker". A guard or default (similar to the InitialCollectionDelayMax pattern already used two lines above) should be added before the NewTicker call.

Confidence Score: 4/5

Safe to merge after fixing the zero CollectionInterval panic; the feature is disabled by default so it only affects users who opt in without setting the interval.

Most concerns from prior review rounds are addressed. One new P1 remains: an unchecked zero CollectionInterval causes a hard panic. Feature defaults to disabled so production impact is low.

internal/eventingester/metrics/redis/collector.go (CollectionInterval zero guard), internal/eventingester/repository/scanner_test.go (Redis DB conflict)

Important Files Changed

Filename	Overview
internal/eventingester/metrics/redis/collector.go	New Prometheus collector with leader-gated snapshot caching; panics on zero CollectionInterval (P1), otherwise the reset-per-cycle and atomic snapshot approach is sound.
internal/eventingester/repository/scanner.go	New Redis SCAN + pipeline scanner; properly handles NOSTREAM/WRONGTYPE/redis.Nil per-key errors and the pipeline-level error guard, age computation is correct.
internal/eventingester/repository/types.go	Adds StreamInfo, RedisClient interface, and parseStreamKey; comment and implementation correctly document that jobSetId (not queue) may contain ':'.
internal/eventingester/ingester.go	Wires Redis metrics collector into an errgroup alongside the existing ingester pipeline; correctly conditionally creates separate Redis client for metrics, leader election setup looks correct.
internal/eventingester/repository/scanner_test.go	Good unit/integration tests for the scanner; uses DB 11 which conflicts with existing event_test.go and eventstore_test.go (P2 flakiness risk).
internal/eventingester/metrics/redis/collector_test.go	Comprehensive unit tests covering top-N, queue aggregation, histogram resets, leadership transitions, and concurrent collection; uses DB 12 (no conflict).
internal/common/constants/constants.go	Promotes local eventStreamPrefix constant "Events:" to a shared package-level constant; refactored correctly across eventstore.go, event_repository.go, and tests.
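
The types.go row above notes that the jobSetId, not the queue, may contain ':'. A minimal sketch of a parser that respects that rule, assuming keys have the form `Events:<queue>:<jobSetId>`; the function name, signature, and sample key are illustrative and not taken from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// eventStreamPrefix mirrors the shared "Events:" constant described above.
const eventStreamPrefix = "Events:"

// parseStreamKeySketch is a hypothetical stand-in for parseStreamKey.
// It assumes keys look like "Events:<queue>:<jobSetId>" and splits only on
// the first ':' after the prefix, so any further ':' stays in the jobSetId.
func parseStreamKeySketch(key string) (queue, jobSetID string, ok bool) {
	if !strings.HasPrefix(key, eventStreamPrefix) {
		return "", "", false
	}
	parts := strings.SplitN(strings.TrimPrefix(key, eventStreamPrefix), ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	// The sample key is made up for illustration.
	queue, jobSetID, ok := parseStreamKeySketch("Events:team-a:nightly:2026-03-17")
	fmt.Println(queue, jobSetID, ok)
}
```

Splitting with a limit of 2 is what keeps a jobSetId such as `nightly:2026-03-17` intact; an unbounded split would break it into extra fields.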

Sequence Diagram

sequenceDiagram
    participant App as ingester.Run
    participant LC as LeaderController
    participant C as Collector.Run (goroutine)
    participant S as Scanner.ScanAll
    participant R as Redis
    participant P as Prometheus Scraper

    App->>LC: leaderController.Run(ctx)
    App->>C: collector.Run(ctx)
    Note over C: initial jitter delay [0, initialDelayMax)
    loop Every CollectionInterval
        C->>LC: GetToken().Leader()?
        alt is leader
            C->>S: ScanAll(ctx)
            S->>R: SCAN Events:* (batched)
            S->>R: Pipeline: XINFO STREAM + MEMORY USAGE
            R-->>S: stream info + memory bytes
            S-->>C: []StreamInfo
            C->>C: resetMetricsForNewCycle()
            C->>C: update topN/queue gauges + histograms
            C->>C: collectSnapshot() state.Store(metrics)
        else not leader
            C->>C: ClearState()
        end
    end
    P->>C: Collect(ch)
    C->>LC: GetToken().Leader()?
    alt is leader
        C->>P: emit cached []prometheus.Metric from state
    else
        C-->>P: return (no metrics)
    end
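
The diagram's two leader checks (one on the collection tick, one on the scrape) can be sketched roughly as follows. This is a simplified illustration, not the PR's collector: the `snapshotCache` type and its methods are hypothetical, and the cached value is a `[]string` where the PR stores `[]prometheus.Metric`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshotCache is a hypothetical, simplified stand-in for the collector's
// cached state.
type snapshotCache struct {
	state atomic.Value // always holds a []string
}

// update models one collection tick: the leader scans and stores a fresh
// snapshot; a non-leader clears whatever was cached.
func (c *snapshotCache) update(isLeader bool, scan func() []string) {
	if !isLeader {
		c.state.Store([]string(nil)) // analogue of ClearState()
		return
	}
	c.state.Store(scan())
}

// collect models a Prometheus scrape: only the leader emits the cached
// snapshot, and it never touches Redis on the scrape path.
func (c *snapshotCache) collect(isLeader bool) []string {
	if !isLeader {
		return nil
	}
	snap, _ := c.state.Load().([]string)
	return snap
}

func main() {
	var c snapshotCache
	c.update(true, func() []string { return []string{"stream_a", "stream_b"} })
	fmt.Println(len(c.collect(true)), len(c.collect(false)))

	c.update(false, nil) // leadership lost: the cache is cleared
	fmt.Println(len(c.collect(true)))
}
```

Storing a whole snapshot atomically means a scrape sees either the previous cycle or the new one, never a half-updated mix.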

Comments Outside Diff (1)

internal/eventingester/metrics/redis/collector.go, line 206 (link)

time.NewTicker panics on zero CollectionInterval

time.NewTicker(d) panics with "non-positive interval for NewTicker" if d <= 0. If collectionInterval is absent or zero in the YAML, the zero-value time.Duration(0) is passed here, causing a hard panic at startup whenever the feature is enabled. There is no validation or default guard for this field.

Add a guard before calling NewTicker:

JamesMurkin · 2026-03-19T10:37:07Z

+		bytesHistogram: prometheus.NewHistogram(prometheus.HistogramOpts{
+			Name:    "armada_redis_stream_size_bytes_distribution",
+			Help:    "Distribution of Redis stream sizes in bytes",
+			Buckets: prometheus.ExponentialBuckets(1024, 2, 20),


These buckets are too small

Discussed this: use 30 buckets for the bytes and events histograms, and 20 buckets for the age histogram.
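
To see what the bucket count changes, here is a small sketch that reproduces the arithmetic of `prometheus.ExponentialBuckets` without importing the client library. It assumes the start (1024) and factor (2) from the diff stay as they are and only the count moves from 20 to 30; that is an assumption, since the thread does not say.

```go
package main

import "fmt"

// exponentialBuckets mirrors the arithmetic of prometheus.ExponentialBuckets:
// count upper bounds, beginning at start, each factor times the previous one.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	for _, n := range []int{20, 30} {
		b := exponentialBuckets(1024, 2, n)
		fmt.Printf("%d buckets: largest finite bound = %.0f bytes\n", n, b[len(b)-1])
	}
}
```

With 20 buckets the largest finite bound is 1024 × 2^19 bytes (512 MiB), so every larger stream lands in the +Inf bucket; with 30 buckets the bound rises to 1024 × 2^29 bytes (512 GiB).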

JamesMurkin · 2026-03-19T10:39:00Z

+		),
+		topNEventsGauge: prometheus.NewGaugeVec(
+			prometheus.GaugeOpts{
+				Name: "armada_redis_stream_event_count",


It'd be good to use a constant prefix, as we do everywhere else.

Possibly it should be armada_event_redis, although I'm not 100% sure.

I can envision us having more Redis instances in future.

So either the name would need to distinguish between them, or we'd need another label (instance, purpose, or something like that).
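
A rough sketch of the constant-prefix idea. The prefix value and the `metricName` helper are hypothetical; the thread floats armada_event_redis as a candidate without settling on it.

```go
package main

import "fmt"

// metricPrefix is a hypothetical shared prefix for the event Redis metrics.
const metricPrefix = "armada_event_redis_"

// metricName builds a full metric name from a short suffix, so every metric
// in the package shares one prefix that can be changed in a single place.
func metricName(suffix string) string {
	return metricPrefix + suffix
}

func main() {
	fmt.Println(metricName("stream_event_count"))
	fmt.Println(metricName("stream_size_bytes_distribution"))
}
```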

JamesMurkin · 2026-03-19T10:43:18Z

+  pipelineBatchSize: 500
+  interBatchDelay: 100ms
+  memoryUsageSamples: 5
+  # metricsRedis:  # Optional: separate Redis for metrics collection


Does this redis need metrics in its name? It is already under the metrics section

…r defaults
- Restructure config types: RedisMemoryMetrics (top-level) → Metrics.Redis (nested)
- Update server wiring to read config.Metrics.Redis.* (11 references)
- Apply reviewer-requested defaults: collectionInterval 15m, memoryUsageSamples 25
- Update all YAML surfaces: primary, local dev, and Helm values
- Preserve dev-specific values in local configs (enabled, fast interval, standalone)

Addresses JamesMurkin review comments on PR #4743

nikola-jokic · 2026-04-01T18:58:15Z

+
+	// Delay to guard against crash loops during startup and to prevent a thundering herd on leadership changes.
+	// The delay is drawn from [0, 1 minute), so collector start times are spread across a window of up to 1 minute.
+	initialDelay := time.Duration(rand.Int64N(int64(1 * time.Minute)))


@JamesMurkin here is the initial delay as a random interval of up to a minute. I think this might be fine, but we can change that.

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

masipauskas

notes:

high level skim through, lgtm.
I don't see how metrics from topN / queue histograms drop out for deleted queues, instead of being reset to 0.

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com> Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> Signed-off-by: David Slear <david_slear@yahoo.com>

JamesMurkin reviewed Mar 6, 2026

Comment thread internal/server/submit/validation/submit_request.go Outdated

nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 4 times, most recently from 16d6bf3 to 45e5f0e Compare March 17, 2026 14:30

nikola-jokic marked this pull request as ready for review March 17, 2026 14:49

greptile-apps bot reviewed Mar 17, 2026

Comment thread internal/eventingester/repository/scanner.go

Comment thread internal/server/redismetrics/scanner.go Outdated

Comment thread internal/eventingester/repository/types.go

JamesMurkin reviewed Mar 19, 2026

Comment thread config/server/config.yaml Outdated

JamesMurkin reviewed Mar 19, 2026

Comment thread config/server/config.yaml Outdated

JamesMurkin reviewed Mar 19, 2026

Comment thread internal/server/redismetrics/scanner.go Outdated

JamesMurkin reviewed Mar 19, 2026

Comment thread internal/server/redismetrics/scanner.go Outdated

greptile-apps bot reviewed Mar 19, 2026

Comment thread internal/server/redismetrics/scanner.go Outdated

nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 4 times, most recently from 0a68a51 to 25ac033 Compare March 20, 2026 10:32

greptile-apps bot reviewed Mar 20, 2026

Comment thread internal/eventingester/repository/scanner.go

Comment thread internal/server/redismetrics/collector.go Outdated

greptile-apps bot reviewed Mar 20, 2026

Comment thread internal/eventingester/repository/scanner.go

Comment thread internal/eventingester/metrics/redis/collector.go

Comment thread internal/eventingester/metrics/redis/collector.go

Comment thread internal/server/redismetrics/collector.go Outdated

greptile-apps bot reviewed Mar 20, 2026

Comment thread internal/eventingester/metrics/redis/collector.go

Comment thread internal/eventingester/metrics/redis/collector.go

greptile-apps bot reviewed Mar 20, 2026

Comment thread internal/eventingester/repository/scanner.go

nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 2 times, most recently from d5ddeb6 to e062235 Compare March 23, 2026 22:53

nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 3 times, most recently from ff9d09b to 593abe5 Compare March 31, 2026 15:44

nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 2 times, most recently from 1b99cc9 to 2c609b1 Compare April 1, 2026 18:47

nikola-jokic commented Apr 1, 2026

nikola-jokic added 7 commits April 1, 2026 21:05

Redis metrics package

0043d90

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

fix tests

0baf386

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

refactor

134abfa

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

ok

c2d1eca

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

ok

84cd65b

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

add delay collecting metrics

79932d2

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

refactoring to use constant for redis key prefix

d8d778e

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

nikola-jokic force-pushed the nikola-jokic/redis-metrics branch from 900d47a to d8d778e Compare April 1, 2026 19:05

nikola-jokic added 5 commits April 1, 2026 23:44

fix typo

c828186

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

configuratble initial delay for tests

b7d7524

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

format

4d87be9

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

fmt

83d2a53

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>

Merge branch 'master' into nikola-jokic/redis-metrics

c276d18

JamesMurkin previously approved these changes Apr 2, 2026

Merge branch 'master' into nikola-jokic/redis-metrics

7b61822

JamesMurkin dismissed their stale review via 7b61822 April 13, 2026 16:31

JamesMurkin added 7 commits April 13, 2026 17:34

Set default InitialCollectionDelayMax

0542f55

Merge branch 'master' into nikola-jokic/redis-metrics

e16ebbd

Minor code review fixes

b283b64

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Move redis metrics to be subfolder of metrics

30d6b24

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Simplify

9c655be

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

move files

17a4e3d

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

format

48edf0b

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

masipauskas approved these changes Apr 15, 2026

JamesMurkin merged commit 9d1164b into master Apr 15, 2026
18 checks passed

JamesMurkin deleted the nikola-jokic/redis-metrics branch April 15, 2026 16:38

3 participants
