From c4119662a2964e71ee3e4b18f82d608883ba0865 Mon Sep 17 00:00:00 2001
From: Frank van Lankvelt
Date: Fri, 9 May 2025 14:19:36 +0200
Subject: [PATCH] Describe howto create custom Dynamic Threshold monitors

---
 SUMMARY.md                                    |  1 +
 .../k8s-dynamic-threshold-monitors.md         | 50 +++++++++++++++++++
 2 files changed, 51 insertions(+)
 create mode 100644 use/alerting/k8s-dynamic-threshold-monitors.md

diff --git a/SUMMARY.md b/SUMMARY.md
index b225df7d2..f10e1c95b 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -34,6 +34,7 @@
  * [Customize](dynamic/customize-alerting.md)
  * [Add a monitor using the CLI](use/alerting/k8s-add-monitors-cli.md)
  * [Derived State monitor](use/alerting/k8s-derived-state-monitors.md)
+ * [Dynamic Threshold monitor](use/alerting/k8s-dynamic-threshold-monitors.md)
  * [Override monitor arguments](use/alerting/k8s-override-monitor-arguments.md)
  * [Write a remediation guide](use/alerting/k8s-write-remediation-guide.md)
 
diff --git a/use/alerting/k8s-dynamic-threshold-monitors.md b/use/alerting/k8s-dynamic-threshold-monitors.md
new file mode 100644
index 000000000..d331bc47a
--- /dev/null
+++ b/use/alerting/k8s-dynamic-threshold-monitors.md
@@ -0,0 +1,50 @@
+---
+description: SUSE Observability
+---
+
+# Dynamic Threshold Monitors
+
+## Overview
+
+For metrics that vary significantly over time and differ from service to service, a Dynamic Threshold monitor provides simple and performant anomaly detection. It uses data from 1, 2 or 3 weeks ago, in addition to the recent past, as context to compare current data to.
+
+Data from the "check window" is compared to the historic context using the Anderson-Darling test. This test imposes very few assumptions on the data distribution and is particularly sensitive to outliers on the upper and lower ends of the distribution. The metric can be smooth, spiky or have a couple of "levels". Because data values are compared directly, without any model fitting, the Dynamic Threshold monitor is very robust.
+
+For metrics that vary smoothly over time (e.g. on a timescale of 5 minutes), the effective number of data points is smaller than the raw number. The Dynamic Threshold monitor compensates for this, so the same monitor can be used for a wide range of metrics without adjusting its parameters.
+
+There are a couple of parameters that can be set for the monitor function:
+* `falsePositiveRate`: say `!!float 1e-8` - the sensitivity of the monitor to deviating behavior. A lower value suppresses more (false) positives but may also lead to false negatives (unnoticed anomalies).
+* `checkWindowMinutes`: say `10` minutes - the check window is a trade-off between quick alerting (small values) and correctly identified anomalies (large values). A handful of data points works well in practice.
+* `historicWindowMinutes`: say `120` (2 hours) - the window is bracketed around the current time of day, one or more weeks ago, so it runs from 1 hour before that time to 1 hour after. The 2 hours before the check window are also used. The Dynamic Threshold monitor compares the distribution of this historic data with the data points in the check window.
+* `historySizeWeeks`: say `2` - the number of weeks that data is taken from for historic context. Can be `1`, `2` or `3`.
+* `removeTrend`: for metrics that have trend behavior (say, number of requests), such that the absolute value differs from week to week, this trend (the average value) can be accounted for.
+* `includePreviousDay`: typically `false` - for metrics that do not have a weekly but only a daily pattern, this allows the use of more recent data.
+
+## Dynamic Threshold Monitor example
+
+A Monitor implemented using the Dynamic Threshold function looks like:
+
+```
+  - _type: "Monitor"
+    name: ""
+    identifier: "urn:custom:monitor:"
+    status: "DISABLED"
+    description: ""
+    function: {{ get "urn:stackpack:aad-v2:shared:monitor-function:dt" }}
+    arguments:
+      telemetryQuery:
+        query: ""
+        unit: s
+        aliasTemplate: ""
+      topologyQuery:
+      falsePositiveRate:
+      checkWindowMinutes:
+      historicWindowMinutes:
+      historySizeWeeks:
+      includePreviousDay:
+      removeTrend:
+    intervalSeconds: 60
+    remediationHint: ""
+```
+
+The monitor can be implemented using the guide at [Add a threshold monitor to components using the CLI](/use/alerting/k8s-add-monitors-cli.md).
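
Review note: the window parameters described in the new page can be sketched in Python to check that the wording is unambiguous. This is only an illustration of how the historic context intervals might be derived from `checkWindowMinutes`, `historicWindowMinutes`, `historySizeWeeks` and `includePreviousDay`; it is not the monitor function's actual implementation. The function name `historic_windows` is hypothetical, and two points are assumptions the page does not confirm: that the recent-past window has the same length as `historicWindowMinutes`, and that the previous-day window is bracketed the same way as the weekly ones.

```python
from datetime import datetime, timedelta


def historic_windows(now, check_window_minutes=10, historic_window_minutes=120,
                     history_size_weeks=2, include_previous_day=False):
    """Return (start, end) intervals used as historic context (illustrative only)."""
    if history_size_weeks not in (1, 2, 3):
        raise ValueError("historySizeWeeks must be 1, 2 or 3")

    full = timedelta(minutes=historic_window_minutes)
    half = full / 2
    check_start = now - timedelta(minutes=check_window_minutes)

    # Recent past: the period immediately before the check window.
    # ASSUMPTION: its length equals historicWindowMinutes.
    windows = [(check_start - full, check_start)]

    # Same time of day, 1..N weeks ago, bracketed around "now".
    for week in range(1, history_size_weeks + 1):
        center = now - timedelta(weeks=week)
        windows.append((center - half, center + half))

    if include_previous_day:
        # ASSUMPTION: the previous-day window is bracketed like the weekly ones.
        center = now - timedelta(days=1)
        windows.append((center - half, center + half))

    return windows


if __name__ == "__main__":
    for start, end in historic_windows(datetime(2025, 5, 9, 14, 0)):
        print(start.isoformat(), "->", end.isoformat())
```

With the suggested values (`10`, `120`, `2`), the sketch yields one recent-past interval plus one interval per week of history, which matches the description of `historicWindowMinutes` in the parameter list.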