Stuck in probe state after force leave #129

@martinsumner

The riak_test ensemble_ring_changes attempts to test ensemble when subject to various cluster admin changes. Unfortunately the original test didn't wait for some of the cluster operations to complete before making its assertions, and so the test would pass without the scenario actually being proven.

There is now an updated version of this test, which fails most of the time - https://github.com/basho/riak_test/pulls.

The failure occurs after the force-replace operation. After this some ensembles are stuck in the probe state. Generally the situation is as follows:

  1. There are peers probing, but those peers think that a quorum of the ensemble's peers is on a now unreachable node (the one which has left under the force-leave).

  2. The probe state constantly returns with timeout errors (unavailable nodes lead to immediate nacks, which are then interpreted as a quorum for timeout); the peer then does a probe delay and re-probes.

  3. The riak_ensemble_manager has a more correct view (i.e. one that doesn't show ensemble peers on the unavailable node) and this is checked after the probe has failed.

  4. The riak_ensemble_manager has a lower {epoch, sqn} than that of the peer view - so the peers persist in using their own incorrect view and the cycle of probe failures continues.

  5. Despite there being no nodes down, multiple ensembles are unusable.
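
The cycle in steps 1 to 4 can be illustrated with a small model. This is only a sketch in Python, not the riak_ensemble code: the names `View`, `probe`, `choose_view` and `probe_cycle` are hypothetical, the node names are made up, and the probe is reduced to a simple majority check. It assumes that a peer only adopts the manager's view when that view has a strictly higher (epoch, seq) pair, which is the behaviour described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Hypothetical stand-in for an ensemble view: a version plus peer locations."""
    epoch: int
    seq: int
    peers: tuple  # node name for each ensemble peer (a node may appear twice)


def probe(view, reachable):
    """Return 'ok' if a majority of the view's peers sit on reachable nodes.

    Peers on unreachable nodes nack immediately, so without a majority of
    acks the probe ends in 'timeout'.
    """
    acks = sum(1 for node in view.peers if node in reachable)
    return "ok" if acks > len(view.peers) // 2 else "timeout"


def choose_view(peer_view, manager_view):
    """Assumed rule: adopt the manager's view only if its (epoch, seq) is higher."""
    if (manager_view.epoch, manager_view.seq) > (peer_view.epoch, peer_view.seq):
        return manager_view
    return peer_view


def probe_cycle(peer_view, manager_view, reachable, rounds=3):
    """Probe, then consult the manager's view, for a fixed number of rounds."""
    view = peer_view
    results = []
    for _ in range(rounds):
        results.append(probe(view, reachable))
        view = choose_view(view, manager_view)
    return results, view


# The peer's view places two of three peers on the departed node "gone".
stale = View(epoch=5, seq=2, peers=("gone", "gone", "node1"))
# The manager's view is more correct but carries a lower (epoch, seq).
fresh = View(epoch=4, seq=9, peers=("node1", "node2", "node3"))
reachable = {"node1", "node2", "node3"}

results, final_view = probe_cycle(stale, fresh, reachable)
```

In this model every round times out and the peer never leaves its stale view, even though probing with the manager's view would succeed immediately.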

This is true on both the current develop-3.0 branch as well as the develop-3.0-lastgaspring branch which corrects the issue in #128.
