Prometheus: New component to add metric-processing to Alloy #5960

@monsdar

Description

Component(s)

prometheus.relabel

Request

This request is about aggregating or combining existing metrics into new high-level metrics according to a set of rules. I have the following concrete use case in mind:

Use case

I'd like to combine a number of metrics from different sources into a high-level metric to show the "healthiness" of a host. I have the following metrics for example:

  • Gauge: disk_space_used(host=hostX) Range 0.0 (empty) to 1.0 (fully used disk)
  • Gauge: service_online(host=hostX, service=serviceX) (0 when offline, 1 when online)
  • Counter: failed_login_attempts(host=hostX) (counting number of failed attempts)

These should be combined to a new high-level metric:

  • Gauge: host_healthiness(host=hostX) Range 0.0 (unhealthy) to 1.0 (healthy)

There are conditions that drive the resulting value:

  • Host is healthy (1.0) when disk_space_used is below 0.9
    • Host is unhealthy (0.0) when disk_space_used is over 0.99
    • Host is "not-so-healthy" (0.7) when disk_space_used is over 0.9, but under 0.99
  • Host is healthy when all services of a host are online
  • Host is healthy when failed login attempts over the time_range are below 3
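As a rough illustration, the conditions above could be written as a single function. This is only a sketch in Python, not Alloy configuration. The function name and parameters are hypothetical, and it makes several assumptions the list does not spell out: an offline service or 3 or more failed logins forces the score to 0.0, the boundary values 0.9 and 0.99 fall into the 0.7 tier, and `failed_logins_in_range` is the increase of the failed_login_attempts counter over the time range.

```python
from typing import Iterable


def host_health(disk_space_used: float,
                services_online: Iterable[float],
                failed_logins_in_range: float) -> float:
    """Combine the three input metrics into one score between 0.0 and 1.0.

    Assumption: any failing "hard" condition makes the host unhealthy (0.0);
    otherwise the disk usage tier decides the score.
    """
    # Every service_online sample must be 1 for the host to count as healthy.
    if any(value < 1 for value in services_online):
        return 0.0
    # "Below 3" failed logins is healthy, so 3 or more is treated as unhealthy.
    if failed_logins_in_range >= 3:
        return 0.0
    # Disk usage tiers: over 0.99 is unhealthy, 0.9 to 0.99 is "not-so-healthy".
    if disk_space_used > 0.99:
        return 0.0
    if disk_space_used >= 0.9:
        return 0.7
    return 1.0
```

For example, `host_health(0.95, [1, 1], 0)` lands in the "not-so-healthy" tier, while one offline service drops the score to 0.0 regardless of disk usage.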

Suggestion

I think this functionality could be added to Alloy in a similar fashion as the loki.process stages work:

  • prometheus.scrape collects metrics from a number of scrape-targets and forwards the metrics to a new prometheus.process component
  • prometheus.process goes through several stages:
    • add_gauge creates a new metric called host_healthiness with an initial value of 1.0 and a label matching the input metrics hostname value. (Optional: If the metric already exists it changes the value accordingly)
    • set_value has a condition disk_space_used >= 0.9. If this condition applies, the value of host_healthiness is multiplied by 0.7, decreasing the "overall health score"
    • set_value can also set absolute values. For example when failed_login_attempts > 3 it sets the host_healthiness value to 0.0 (host unhealthy)
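To make the intended stage semantics more concrete, here is a small Python model of the pipeline described above. It is not Alloy syntax: prometheus.process does not exist yet, and the class names, fields (`multiply_by`, `absolute`), and the `process` helper are hypothetical. It also assumes the disk condition refers to the disk_space_used gauge from the use case, works on the samples of a single host, and does not model the optional "metric already exists" behaviour of add_gauge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union

# Samples of one host: metric name -> current value (labels omitted for brevity).
Samples = Dict[str, float]


@dataclass
class AddGauge:
    """Hypothetical add_gauge stage: creates a gauge with an initial value."""
    name: str
    initial: float = 1.0

    def apply(self, samples: Samples) -> None:
        samples[self.name] = self.initial


@dataclass
class SetValue:
    """Hypothetical set_value stage: changes a metric when a condition holds."""
    target: str
    condition: Callable[[Samples], bool]
    multiply_by: Optional[float] = None  # relative change, e.g. 0.7
    absolute: Optional[float] = None     # absolute change, e.g. 0.0

    def apply(self, samples: Samples) -> None:
        if not self.condition(samples):
            return
        if self.absolute is not None:
            samples[self.target] = self.absolute
        elif self.multiply_by is not None:
            samples[self.target] *= self.multiply_by


Stage = Union[AddGauge, SetValue]


def process(samples: Samples, stages: List[Stage]) -> Samples:
    """Run all stages in order on a copy of the scraped samples."""
    out = dict(samples)
    for stage in stages:
        stage.apply(out)
    return out


STAGES: List[Stage] = [
    AddGauge("host_healthiness", initial=1.0),
    SetValue("host_healthiness",
             condition=lambda s: s["disk_space_used"] >= 0.9,
             multiply_by=0.7),
    SetValue("host_healthiness",
             condition=lambda s: s["failed_login_attempts"] > 3,
             absolute=0.0),
]
```

Calling `process({"disk_space_used": 0.95, "failed_login_attempts": 0}, STAGES)` returns the input samples plus a host_healthiness entry. Because stages run in order, a later absolute set_value overrides an earlier multiplication.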

In the end, the collected metrics are either dropped or sent on to Prometheus along with the new host_healthiness metric. In Grafana, all you have to do is add a widget that shows the current value of host_healthiness, perhaps a Timeline widget that lets you quickly see which hosts are available and what state they have been in over time.

I'm sure this is just the tip of the iceberg for a feature that could make Alloy more useful for scraping Prometheus targets. I'm also aware that this could get complex quite quickly and needs proper analysis to get the design and usage right. I consider this suggestion a starting point and would like to update the issue as the discussion goes on.

Please let me know whether there are any alternatives I haven't considered yet. This request was created as a follow-up to my Stack Overflow question about implementing such a thing.
