Make health checks perform more meaningful checks #583

@MSch

Right now riak_core_node_watcher performs health checks by just sending messages between processes on other nodes and failing if those time out. Recently we had a really hard-to-debug network issue where connections established from one node to another were extremely slow (heavy packet loss, leading to <1 Mbit/s throughput) while the opposite direction worked perfectly.

Ceph has a more meaningful health check: it measures throughput and latency to another node and marks the node as up only if both meet a sane threshold; otherwise the node won't be used. Having such a feature in Riak would have helped us immensely in tracking down what was going on and in preventing timeouts for our Riak clients, as it would with any other weird network issue.
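To make the idea concrete, here is a minimal sketch of a threshold-based link check. It is not Riak or Ceph code: the names `LinkSample`, `LinkThresholds`, and `link_is_healthy`, and the threshold values, are all hypothetical, and how the samples get measured is left out. The point is only that a link counts as up when both latency and throughput are acceptable, judged separately per direction.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class LinkSample:
    """One measurement of a single direction of a link (hypothetical shape)."""
    bytes_sent: int    # payload size pushed to the peer
    elapsed_s: float   # time until the peer acknowledged the whole payload
    rtt_ms: float      # round-trip time of a small probe message


@dataclass(frozen=True)
class LinkThresholds:
    """Illustrative limits; real values would have to be configurable."""
    max_latency_ms: float = 50.0
    min_throughput_mbit: float = 10.0


def throughput_mbit(sample: LinkSample) -> float:
    """Throughput of one sample in Mbit/s (0.0 for a degenerate sample)."""
    if sample.elapsed_s <= 0:
        return 0.0
    return sample.bytes_sent * 8 / sample.elapsed_s / 1_000_000


def link_is_healthy(samples, thresholds=LinkThresholds()) -> bool:
    """Up only if median latency AND median throughput are within limits."""
    if not samples:
        return False  # no successful measurement at all: treat as down
    latency = median(s.rtt_ms for s in samples)
    throughput = median(throughput_mbit(s) for s in samples)
    return (latency <= thresholds.max_latency_ms
            and throughput >= thresholds.min_throughput_mbit)


# Each direction is judged on its own, so an asymmetric failure is visible.
a_to_b = [LinkSample(1_000_000, 10.0, 5.0)]   # 0.8 Mbit/s: below the limit
b_to_a = [LinkSample(10_000_000, 1.0, 2.0)]   # 80 Mbit/s: fine
view = {("a", "b"): link_is_healthy(a_to_b),
        ("b", "a"): link_is_healthy(b_to_a)}
```

In this sketch a plain message ping would still succeed in both directions, since the slow link answers well within a typical timeout; only the throughput threshold flags the a-to-b direction as down.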

Furthermore, Ceph also exposes its automatic up/down states, while Riak doesn't. It would be nice if riak-admin member-status had one more column reporting the automatic up/down state as seen from this node (I think this state isn't cluster-wide?).
