Issue by Dieterbe
Wednesday Sep 09, 2015 at 13:05 GMT
Originally opened as raintank/grafana#458
- people can shoot themselves in the foot: any alerting setting that sits right around the level your data dances around (the "sweet spot", henceforth referred to as the "sour spot") can cause a lot of alert notifications. the state basically keeps flipping between critical and ok, because at each point in time the data has changed just enough to cross the threshold. ("flapping", as per nagios.) our current default settings put a lot of people right in the sour spot, but even with adjusted defaults the problem remains.
- people's typical data might be outside of the sour spot, but during a service degradation the number of additional failures might be just enough to push their data into the sour spot, and they still become flap victims. (for example: they configured an alert for when 6 out of 20 collectors return errors, and they normally have <2 errors at the same time, but a service degradation brings them to 5~7 erroring collectors.)
- frankly, if our collectors suffer from subtle issues that we can't easily detect or remediate in a timely manner, they might also contribute to the user's data going into the sour spot.
in all these cases, we can't just keep sending emails to people.
- nagios famously has "flap detection": it suppresses alerts if it considers the service to have been flapping during a window of the last 21 points (see https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/flapping.html). for our case, this window is actually pretty short.
- other systems I believe have a hard limit on max emails per hour/day.
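
to make the flap detection idea concrete, here is a minimal sketch in Python. it is a simplified, unweighted version of the idea (nagios itself weights recent state changes more heavily), and the class name `FlapDetector` as well as the `high` and `low` threshold values are assumptions picked for illustration, not nagios defaults or anything we have implemented.

```python
from collections import deque


class FlapDetector:
    """Hypothetical sketch: flag a check as flapping when too many of the
    recent state transitions are changes (simplified, unweighted)."""

    def __init__(self, window=21, high=50.0, low=25.0):
        # window=21 matches the "last 21 points" mentioned above;
        # high/low are illustrative thresholds (assumptions), in percent.
        self.states = deque(maxlen=window)
        self.high = high
        self.low = low
        self.flapping = False

    def percent_change(self):
        """Percentage of possible transitions in the window that changed state."""
        states = list(self.states)
        if len(states) < 2:
            return 0.0
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        # A full window of N points has N-1 possible transitions.
        return 100.0 * changes / (self.states.maxlen - 1)

    def observe(self, state):
        """Record a new state and return True while the check is flapping."""
        self.states.append(state)
        pct = self.percent_change()
        # Two thresholds give hysteresis, so the detector itself does not flap.
        if not self.flapping and pct >= self.high:
            self.flapping = True
        elif self.flapping and pct < self.low:
            self.flapping = False
        return self.flapping
```

a caller would feed each new check result into `observe()` and hold back notifications while it returns `True`. with a window this size, how long "21 points" lasts depends entirely on the check frequency, which is why the window can be short in practice.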
I think we should do something similar, but also send them an email when we start suppressing notifications, explaining what is happening, what we did, and that they might want to change their settings.
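
a rough sketch of what that could look like, assuming a simple sliding-window limit per alert. everything here is hypothetical: the `NotificationLimiter` name, the default of 5 notifications per hour, and the three return values are placeholders to illustrate the behaviour, not a proposal for the actual implementation.

```python
from collections import deque


class NotificationLimiter:
    """Hypothetical sketch: cap notifications per time window and emit a
    single explanatory notice when suppression starts."""

    def __init__(self, max_per_window=5, window_seconds=3600):
        # Both defaults are illustrative assumptions.
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.sent = deque()       # timestamps of notifications actually sent
        self.suppressing = False  # True once the suppression notice went out

    def decide(self, now):
        """Return 'send', 'send_suppression_notice' or 'suppress' for a
        notification that wants to go out at time `now` (in seconds)."""
        # Forget notifications that have fallen out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()

        if len(self.sent) < self.max_per_window:
            self.sent.append(now)
            self.suppressing = False
            return "send"

        if not self.suppressing:
            # First notification over the limit: tell the user once what is
            # happening and that they may want to adjust their settings.
            self.suppressing = True
            return "send_suppression_notice"

        return "suppress"
```

for example, with `NotificationLimiter(max_per_window=3)`, the first three state changes within an hour are delivered, the fourth triggers the one-time explanation email, and further ones are dropped until older notifications age out of the window.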