Skip to content

Add redis metrics giving better view into redis #4743

Merged
JamesMurkin merged 20 commits intomasterfrom
nikola-jokic/redis-metrics
Apr 15, 2026
Merged

Add redis metrics giving better view into redis #4743
JamesMurkin merged 20 commits intomasterfrom
nikola-jokic/redis-metrics

Conversation

@nikola-jokic
Copy link
Copy Markdown
Contributor

No description provided.

Comment thread internal/server/submit/validation/submit_request.go Outdated
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 4 times, most recently from 16d6bf3 to 45e5f0e Compare March 17, 2026 14:30
@nikola-jokic nikola-jokic marked this pull request as ready for review March 17, 2026 14:49
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR adds a new Redis stream metrics subsystem to the event ingester: a Scanner that pipelines XINFO STREAM + MEMORY USAGE calls across all Events:* keys, and a Prometheus Collector that periodically runs the scanner, caches an atomic snapshot, and gates metric emission behind Kubernetes or standalone leader election.

  • P1 — CollectionInterval: 0 panics on startup: time.NewTicker is called unconditionally with c.config.CollectionInterval; if the field is absent from the YAML, Go's zero-value time.Duration(0) is passed and the process crashes with "non-positive interval for NewTicker". A guard or default (similar to the InitialCollectionDelayMax pattern already used two lines above) should be added before the NewTicker call.

Confidence Score: 4/5

Safe to merge after fixing the zero CollectionInterval panic; the feature is disabled by default so it only affects users who opt in without setting the interval.

Most concerns from prior review rounds are addressed. One new P1 remains: an unchecked zero CollectionInterval causes a hard panic. Feature defaults to disabled so production impact is low.

internal/eventingester/metrics/redis/collector.go (CollectionInterval zero guard), internal/eventingester/repository/scanner_test.go (Redis DB conflict)

Important Files Changed

Filename Overview
internal/eventingester/metrics/redis/collector.go New Prometheus collector with leader-gated snapshot caching; panics on zero CollectionInterval (P1), otherwise the reset-per-cycle and atomic snapshot approach is sound.
internal/eventingester/repository/scanner.go New Redis SCAN + pipeline scanner; properly handles NOSTREAM/WRONGTYPE/redis.Nil per-key errors and the pipeline-level error guard, age computation is correct.
internal/eventingester/repository/types.go Adds StreamInfo, RedisClient interface, and parseStreamKey; comment and implementation correctly document that jobSetId (not queue) may contain ':'.
internal/eventingester/ingester.go Wires Redis metrics collector into an errgroup alongside the existing ingester pipeline; correctly conditionally creates separate Redis client for metrics, leader election setup looks correct.
internal/eventingester/repository/scanner_test.go Good unit/integration tests for the scanner; uses DB 11 which conflicts with existing event_test.go and eventstore_test.go (P2 flakiness risk).
internal/eventingester/metrics/redis/collector_test.go Comprehensive unit tests covering top-N, queue aggregation, histogram resets, leadership transitions, and concurrent collection; uses DB 12 (no conflict).
internal/common/constants/constants.go Promotes local eventStreamPrefix constant "Events:" to a shared package-level constant; refactored correctly across eventstore.go, event_repository.go, and tests.

Sequence Diagram

sequenceDiagram
    participant App as ingester.Run
    participant LC as LeaderController
    participant C as Collector.Run (goroutine)
    participant S as Scanner.ScanAll
    participant R as Redis
    participant P as Prometheus Scraper

    App->>LC: leaderController.Run(ctx)
    App->>C: collector.Run(ctx)
    Note over C: initial jitter delay [0, initialDelayMax)
    loop Every CollectionInterval
        C->>LC: GetToken().Leader()?
        alt is leader
            C->>S: ScanAll(ctx)
            S->>R: SCAN Events:* (batched)
            S->>R: Pipeline: XINFO STREAM + MEMORY USAGE
            R-->>S: stream info + memory bytes
            S-->>C: []StreamInfo
            C->>C: resetMetricsForNewCycle()
            C->>C: update topN/queue gauges + histograms
            C->>C: collectSnapshot() state.Store(metrics)
        else not leader
            C->>C: ClearState()
        end
    end
    P->>C: Collect(ch)
    C->>LC: GetToken().Leader()?
    alt is leader
        C->>P: emit cached []prometheus.Metric from state
    else
        C-->>P: return (no metrics)
    end
Loading

Comments Outside Diff (1)

  1. internal/eventingester/metrics/redis/collector.go, line 206 (link)

    time.NewTicker panics on zero CollectionInterval

    time.NewTicker(d) panics with "non-positive interval for NewTicker" if d <= 0. If collectionInterval is absent or zero in the YAML, the zero-value time.Duration(0) is passed here, causing a hard panic at startup whenever the feature is enabled. There is no validation or default guard for this field.

    Add a guard before calling NewTicker:

Reviews (29): Last reviewed commit: "format" | Re-trigger Greptile

Comment thread internal/eventingester/repository/scanner.go
Comment thread internal/server/redismetrics/scanner.go Outdated
Comment thread internal/eventingester/repository/types.go
bytesHistogram: prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "armada_redis_stream_size_bytes_distribution",
Help: "Distribution of Redis stream sizes in bytes",
Buckets: prometheus.ExponentialBuckets(1024, 2, 20),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These buckets are too small

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this

Make byte and events 30 buckets, and age 20 buckets

),
topNEventsGauge: prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "armada_redis_stream_event_count",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to use a constant prefix - as we do everywhere else

Possibly it should be armada_event_redis although I'm not 100%

  • I can envision we have more redis in future
  • So either we'd need the name to distinguish the difference or we'd need another label (instance/purpose something like that)

Comment thread config/server/config.yaml Outdated
Comment thread config/server/config.yaml Outdated
pipelineBatchSize: 500
interBatchDelay: 100ms
memoryUsageSamples: 5
# metricsRedis: # Optional: separate Redis for metrics collection
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this redis need metrics in its name? It is already under the metrics section

Comment thread config/server/config.yaml Outdated
Comment thread internal/server/redismetrics/scanner.go Outdated
Comment thread internal/server/redismetrics/scanner.go Outdated
nikola-jokic added a commit that referenced this pull request Mar 19, 2026
…r defaults

- Restructure config types: RedisMemoryMetrics (top-level) → Metrics.Redis (nested)
- Update server wiring to read config.Metrics.Redis.* (11 references)
- Apply reviewer-requested defaults: collectionInterval 15m, memoryUsageSamples 25
- Update all YAML surfaces: primary, local dev, and Helm values
- Preserve dev-specific values in local configs (enabled, fast interval, standalone)

Addresses JamesMurkin review comments on PR #4743
Comment thread internal/server/redismetrics/scanner.go Outdated
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 4 times, most recently from 0a68a51 to 25ac033 Compare March 20, 2026 10:32
Comment thread internal/eventingester/repository/scanner.go
Comment thread internal/server/redismetrics/collector.go Outdated
Comment thread internal/eventingester/repository/scanner.go
Comment thread internal/eventingester/metrics/redis/collector.go
Comment thread internal/eventingester/metrics/redis/collector.go
Comment thread internal/server/redismetrics/collector.go Outdated
Comment thread internal/eventingester/metrics/redis/collector.go
Comment thread internal/eventingester/metrics/redis/collector.go
Comment thread internal/eventingester/repository/scanner.go
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 2 times, most recently from d5ddeb6 to e062235 Compare March 23, 2026 22:53
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 3 times, most recently from ff9d09b to 593abe5 Compare March 31, 2026 15:44
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/redis-metrics branch 2 times, most recently from 1b99cc9 to 2c609b1 Compare April 1, 2026 18:47

// Delay to guard against crash loops during startup and to prevent thundering herd on leadership changes
// The delay is [0, 1 minute) to ensure that in the worst case, all collectors will be staggered by at least 1 minute.]
initialDelay := time.Duration(rand.Int64N(int64(1 * time.Minute)))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JamesMurkin here is the initial delay as random interval up to a minute. I think this might be fine, but we can change that

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/redis-metrics branch from 900d47a to d8d778e Compare April 1, 2026 19:05
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
JamesMurkin
JamesMurkin previously approved these changes Apr 2, 2026
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Copy link
Copy Markdown
Contributor

@masipauskas masipauskas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notes:

  • high level skim through, lgtm.
  • I don't see how metrics from topN / queue histograms drop out for deleted queues, instead of being reset to 0.

@JamesMurkin JamesMurkin merged commit 9d1164b into master Apr 15, 2026
18 checks passed
@JamesMurkin JamesMurkin deleted the nikola-jokic/redis-metrics branch April 15, 2026 16:38
dslear pushed a commit that referenced this pull request Apr 16, 2026
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants