Make health checks perform more meaningful checks #583

Right now `riak_core_node_watcher` performs health checks by just sending messages between processes on other nodes and failing if those time out. Recently we had a really [hard-to-debug network issue](https://bugzilla.kernel.org/show_bug.cgi?id=73891) where connections established from one node to another were extremely slow (huge packet loss, leading to <1 Mbit/s throughput) while the other direction worked perfectly.

[Ceph](http://ceph.com/) has a more meaningful health check: it measures throughput and latency to another node and only marks the node as up if both meet a sane threshold; otherwise the node won't be used. Having such a feature in Riak would have helped us immensely in tracking down what was going on and in preventing timeouts for our Riak clients, as it would with any other weird network issue.
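To make the request concrete, here is a rough sketch of the kind of decision I have in mind. It is plain Python rather than Erlang, it is not how Ceph or `riak_core_node_watcher` actually work, and every name in it (`LinkSample`, `Thresholds`, `link_is_healthy`, `node_is_up`) as well as the threshold values are made up for illustration. It also assumes the peer can report the measurements for the inbound direction, so that an asymmetric link like ours gets caught.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class LinkSample:
    """One hypothetical measurement of a link in a single direction."""
    latency_ms: float
    throughput_mbit_s: float


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits only; real values would need to be configurable."""
    max_latency_ms: float = 50.0
    min_throughput_mbit_s: float = 10.0


def link_is_healthy(samples, thresholds=Thresholds()):
    """Return True if the median latency and throughput are within limits.

    No samples at all is treated as unhealthy, which matches the current
    behaviour of failing when messages time out.
    """
    if not samples:
        return False
    latency = median(s.latency_ms for s in samples)
    throughput = median(s.throughput_mbit_s for s in samples)
    return (latency <= thresholds.max_latency_ms
            and throughput >= thresholds.min_throughput_mbit_s)


def node_is_up(outbound, inbound, thresholds=Thresholds()):
    """Mark a node as up only if both directions of the link are healthy."""
    return (link_is_healthy(outbound, thresholds)
            and link_is_healthy(inbound, thresholds))


# Our situation: one direction crawls at under 1 Mbit/s, the other is fine.
slow_direction = [LinkSample(2.0, 0.8)] * 3
fast_direction = [LinkSample(2.0, 900.0)] * 3
node_up = node_is_up(slow_direction, fast_direction)
```

With these made-up numbers `node_up` is false even though latency is fine in both directions, which is the case a plain message timeout did not catch for us. Using the median over a few samples is just one way to avoid flapping on a single bad measurement.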

Furthermore, Ceph also exposes its automatic up/down states, while Riak doesn't. It would be nice if `riak-admin member-status` had one more column reporting the automatic up/down state as seen from this node (I think this state isn't cluster-wide?).

