Description
The riak_test ensemble_ring_changes attempts to test ensemble when subject to various cluster admin changes. Unfortunately the original test didn't wait for some of the cluster operations to complete before making its assertions, and so the test would pass without the scenario actually being proven.
There is now an updated version of this test, which fails most of the time - https://github.com/basho/riak_test/pulls.
The failure occurs after the force-replace operation. After this some ensembles are stuck in the probe state. Generally the situation is as follows:
- There are peers probing, but those peers think that there is a quorum of nodes on a now unreachable node (the one which has left under the force-leave).
- The probe state constantly returns with timeout errors (unavailable nodes lead to immediate nacks, which are then interpreted as a quorum for timeout); the peer then does a probe delay and re-probes.
- The `riak_ensemble_manager` has a more correct view (i.e. one that doesn't show ensemble peers on the unavailable node) and this is checked after the probe has failed.
- The `riak_ensemble_manager` has a lower `{epoch, sqn}` than that of the peer view, so the peers persist in using their own incorrect view and the cycle of probe failures continues.
- Despite there being no nodes down, multiple ensembles are unusable.
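
The stuck cycle described in the list above can be illustrated with a toy model. This is a minimal Python sketch, not the riak_ensemble code: the `View` type, the function names, the node names and the version numbers are all hypothetical, and the rule "keep whichever view has the higher `{epoch, sqn}`" is assumed from the description above rather than taken from the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Hypothetical stand-in for an ensemble view."""
    version: tuple      # (epoch, sqn), compared element by element
    members: frozenset  # set of (peer_id, node) pairs


def probe(view, reachable_nodes):
    """Succeed only if a majority of the view's peers sit on reachable nodes."""
    up = sum(1 for (_peer, node) in view.members if node in reachable_nodes)
    return up > len(view.members) // 2


def choose_view(peer_view, manager_view):
    """Assumed rule: the manager's view only wins if its version is higher."""
    if manager_view.version > peer_view.version:
        return manager_view
    return peer_view


def run_probe_cycles(peer_view, manager_view, reachable_nodes, max_cycles=5):
    """Return the number of failed probes before success, or None if stuck."""
    for cycle in range(max_cycles):
        if probe(peer_view, reachable_nodes):
            return cycle
        # Probe failed: consult the manager, then delay and re-probe.
        peer_view = choose_view(peer_view, manager_view)
    return None


# dev4 has left the cluster; every remaining node is up.
reachable = frozenset({"dev1", "dev2", "dev3"})

# The peer still believes two of three peers live on dev4.
stale_peer_view = View((5, 10), frozenset({(1, "dev1"), (2, "dev4"), (3, "dev4")}))

# The manager's view is correct but carries a lower (epoch, sqn).
manager_low = View((4, 7), frozenset({(1, "dev1"), (2, "dev2"), (3, "dev3")}))
# The same correct view with a higher version, for contrast.
manager_high = View((6, 0), manager_low.members)

stuck = run_probe_cycles(stale_peer_view, manager_low, reachable)
recovered = run_probe_cycles(stale_peer_view, manager_high, reachable)
```

In this model `stuck` is `None` (the peer never adopts the manager's view, so every probe fails), while `recovered` is `1` (one failed probe, after which the higher-versioned manager view is adopted and the next probe succeeds).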
This is true on both the current develop-3.0 branch and the develop-3.0-lastgaspring branch, which corrects the issue in #128.