Description
Right now riak_core_node_watcher performs health checks by just sending messages between processes on other nodes and failing if those time out. Recently we had a really hard-to-debug network issue where connections established from one node to another were extremely slow (huge packet loss, leading to <1 Mbit/s throughput) while the other direction worked perfectly.
Ceph has a more meaningful health check: it measures throughput and latency to another node, and the node is marked as up only if both pass a sane lower bound; otherwise it won't be used. Having such a feature in Riak would have helped us immensely in tracking down what was going on and in preventing timeouts for our Riak clients, as it would with any other weird network issue.
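To make the request concrete, here is a rough sketch of the kind of check I have in mind. It is written in Python purely for illustration and is not how Ceph or riak_core actually implement anything; the names `LinkSample`, `Thresholds`, `link_is_healthy`, and `node_is_up`, as well as the threshold values, are made up for this example. The point is that each direction is judged separately, so a link that is fine one way and lossy the other still fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSample:
    """One measurement of a single direction of a node-to-node link."""
    latency_ms: float
    throughput_mbit_s: float


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical bounds; real values would need to be configurable."""
    max_latency_ms: float = 50.0
    min_throughput_mbit_s: float = 10.0


def link_is_healthy(sample: LinkSample, limits: Thresholds) -> bool:
    # A direction passes only if latency is low enough AND throughput is high enough.
    return (sample.latency_ms <= limits.max_latency_ms
            and sample.throughput_mbit_s >= limits.min_throughput_mbit_s)


def node_is_up(outbound: LinkSample, inbound: LinkSample,
               limits: Thresholds = Thresholds()) -> bool:
    # Both directions must pass, which catches asymmetric failures.
    return link_is_healthy(outbound, limits) and link_is_healthy(inbound, limits)


# The situation described above: one direction crawls, the other is fine.
slow_outbound = LinkSample(latency_ms=5.0, throughput_mbit_s=0.8)
good_inbound = LinkSample(latency_ms=5.0, throughput_mbit_s=900.0)
asymmetric_result = node_is_up(slow_outbound, good_inbound)
symmetric_result = node_is_up(good_inbound, good_inbound)
```

With these made-up thresholds, the asymmetric case is reported as down even though a plain message ping with a generous timeout would likely still succeed.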
Furthermore, Ceph also exposes its automatic up/down states while Riak doesn't. It would be nice if riak-admin member-status had one more column reporting the automatic up/down state as seen from this node (I think this state isn't cluster-wide?).