flapping detection and alert suppression #24

@woodsaj

Issue by Dieterbe
Wednesday Sep 09, 2015 at 13:05 GMT
Originally opened as raintank/grafana#458


  1. people can shoot themselves in the foot:
    any alert threshold set right around the value your data hovers at (the "sweet spot", henceforth referred to as the "sour spot") can cause a lot of alert notifications: the state keeps flipping between critical and ok, because at each point in time the data has changed just enough to cross the threshold. ("flapping", as per nagios)
    our current default settings put a lot of people right in the sour spot, but even with adjusted defaults the problem remains.

  2. people's typical data might be outside of the sour spot, but during a service degradation the number of additional failures might be just enough to push their data into the sour spot, and they still become flap victims. (for example: they configured an alert if 6 out of 20 collectors return errors, they normally have <2 errors at the same time, but a service degradation brings them to 5~7 erroring collectors)

  3. frankly, if our collectors suffer from subtle issues that we can't easily detect or remediate in time, they might also push the user's data into the sour spot.

in all these cases, we can't just keep sending emails to people.
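
To make the second case concrete, here is a minimal sketch of a naive alerter that notifies on every state change. The 6-out-of-20 threshold comes from the example above; treating it as "6 or more", the function name `notifications` and the two series of error counts are assumptions made up for illustration, not real data.

```python
THRESHOLD = 6  # critical when 6 or more of the 20 collectors return errors (assumed ">=")


def notifications(error_counts, threshold=THRESHOLD):
    """Return (index, state) for every check at which a naive alerter would notify."""
    sent = []
    state = "ok"  # assume the check starts out healthy
    for i, errors in enumerate(error_counts):
        new_state = "critical" if errors >= threshold else "ok"
        if new_state != state:
            # every state change produces a notification
            sent.append((i, new_state))
            state = new_state
    return sent


# Made-up series: normal operation (<2 errors) vs. a degradation (5~7 errors).
healthy = [0, 1, 1, 0, 1, 0, 1, 1]
degraded = [5, 6, 5, 7, 5, 6, 7, 5]

print(len(notifications(healthy)), "notifications while healthy")
print(len(notifications(degraded)), "notifications while degraded")
```

With these made-up series the healthy data never notifies, while the degraded data notifies on most checks even though the situation itself has not really changed.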

I think we should do something similar to nagios' flap detection, but also send people an email when we start suppressing notifications, explaining what is happening, what we did, and that they might want to change their settings.
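
A rough sketch of what this could look like, loosely in the spirit of nagios-style flap detection: keep a sliding window of recent check states, start suppressing once the share of state changes in that window gets too high, and send a single explanatory message at that point. The class name `FlapSuppressor`, the window size, both thresholds and the message labels are all hypothetical, chosen for illustration only; this is not how nagios or our alerting is actually implemented.

```python
from collections import deque


class FlapSuppressor:
    """Suppress notifications while a check flips state too often (illustrative only)."""

    def __init__(self, window=10, high=0.3, low=0.2):
        self.window = window  # number of recent check results kept (assumed value)
        self.high = high      # start suppressing at or above this change ratio (assumed)
        self.low = low        # stop suppressing at or below this change ratio (assumed)
        self.states = deque(maxlen=window)
        self.flapping = False
        self.last_notified = "ok"  # assume the check starts out healthy

    def change_ratio(self):
        states = list(self.states)
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        # Divide by the maximum possible number of changes in a full window,
        # so a partially filled window is judged conservatively.
        return changes / (self.window - 1)

    def observe(self, state):
        """Record one check result and return the messages that should be sent."""
        self.states.append(state)
        ratio = self.change_ratio()
        if not self.flapping and ratio >= self.high:
            self.flapping = True
            # one explanatory email: what is happening, what we did, what to change
            return ["flapping-started"]
        messages = []
        if self.flapping and ratio <= self.low:
            self.flapping = False
            messages.append("flapping-stopped")
        if not self.flapping and state != self.last_notified:
            messages.append(state)
            self.last_notified = state
        return messages


# Made-up sequence: a flapping episode followed by a stable recovery.
sequence = ["ok", "critical", "ok", "critical", "ok", "critical", "critical", "ok"] + ["ok"] * 7
suppressor = FlapSuppressor()
sent = [msg for state in sequence for msg in suppressor.observe(state)]
print(sent)
```

Using two thresholds (`high` to start suppressing, `low` to stop) is a deliberate choice in this sketch: with a single threshold, the suppression itself could flap.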
