Hi.
I'm testing memberlist in a ~4500-node cluster and I seem to have found a nontrivial issue: during mass node restarts, some nodes do not reset their "incarnation" numbers to high enough values, which leaves some nodes believing that other nodes are dead for a long time while they actually aren't.
As I understand, the scenario is the following:
- You begin to restart the whole cluster
- During the restart, node 1 happens to mark node 2 as dead at a large incarnation (say 100) and broadcasts it via gossip (deadNode 2, incarnation=100)
- Node 2 restarts and does its initial sync with node 1, but node 1 has already restarted too and doesn't know about node 2 at all, so node 2 keeps its incarnation at 1
- Node 3 restarts, and it happens to first receive the new metadata of the restarted node 2 with incarnation 1, and only AFTER that the deadNode gossip message for node 2 with incarnation 100
- Node 3 tries to re-gossip this deadNode 2/100 message, but again it happens to resend it to nodes 4, 5, and 6, which either are restarting at that very moment and so don't propagate the gossip further, OR have just restarted and don't know about node 2 at all (and thus ignore the deadNode 2 message)
- Anyway, the idea is that the deadNode 2/100 message doesn't get to the newly restarted node 2 and thus node 2 doesn't refute it and doesn't increase its incarnation number to 101
- We end up with node 3 thinking that node 2 is dead, while it is not
- Memberlist doesn't recheck dead nodes in any way, so it stays in this state indefinitely, at least until a full TCP sync with some other node fixes it - but that's an ugly solution, and full TCP sync has its own problem: "Full TCP sync in a large-scale cluster may "revive" a gone node for a long time" (#312)
I can't guarantee that it's the exact case which I observed (due to the lack of logging in this library) but I'm almost sure it's at least very close.
The general idea is that the "incarnation" mechanism is not a guarantee during mass restarts.
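To make the race concrete, here is a minimal, self-contained Go sketch of the ordering above. The types (`state`, `node`) and the update rules are my own simplified assumptions about SWIM-style alive/dead handling, not memberlist's actual implementation:

```go
package main

import "fmt"

// Simplified, hypothetical model of SWIM-style membership state.
// This is NOT memberlist's real code; the types and rules below are an
// assumption made only to illustrate the race described above.
type state struct {
	incarnation uint32
	alive       bool
}

type node struct {
	members map[string]state
}

// applyAlive roughly follows the usual SWIM rule: accept an alive message
// for a known node only if its incarnation is high enough to override the
// currently known state.
func (n *node) applyAlive(name string, inc uint32) {
	cur, known := n.members[name]
	if !known || inc > cur.incarnation || (!cur.alive && inc >= cur.incarnation) {
		n.members[name] = state{incarnation: inc, alive: true}
	}
}

// applyDead accepts a dead message if its incarnation is >= the known one.
func (n *node) applyDead(name string, inc uint32) {
	if cur, known := n.members[name]; known && inc >= cur.incarnation {
		n.members[name] = state{incarnation: inc, alive: false}
	}
}

func main() {
	node3 := &node{members: map[string]state{}}

	// Node 3 first learns about the freshly restarted node 2,
	// which came back with incarnation 1.
	node3.applyAlive("node2", 1)

	// Only afterwards does the stale "deadNode 2, incarnation 100" gossip
	// arrive. It wins (100 >= 1), and since node 2 never receives this
	// message, it never refutes it with incarnation 101.
	node3.applyDead("node2", 100)

	fmt.Printf("node3's view of node2: %+v\n", node3.members["node2"])
	// Prints: node3's view of node2: {incarnation:100 alive:false}
	// and nothing in the protocol re-checks that verdict afterwards.
}
```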
My ideas about how it could be fixed:
- Add retries/rechecks of dead nodes instead of relying on a single "deadNode" message being gossiped and then refuted.
- Use system time instead of the incarnation number (see the sketch after this list).
- Mark node as alive when it pings you.
- Request an update from the given node if you receive an "aliveNode" message from it with a smaller incarnation number than you already know.
- Re-broadcast deadNode messages even if we don't know about the given node at all.
- At least add logs for all "deadNode" / "suspectNode" messages and for all .
- A combination of all previous variants. :-)
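As an illustration of the "use system time" idea, here is a minimal Go sketch. `startingIncarnation` is a hypothetical helper, not part of memberlist's API, and the trade-offs (clock skew, uint32 range) would need real consideration:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch of the "use system time" idea from the list above.
// The function name and the uint32 choice are assumptions for illustration;
// memberlist does not expose anything like this today.

// startingIncarnation derives a node's initial incarnation from the wall
// clock (Unix seconds). After a restart this value is almost certainly
// higher than any incarnation the previous instance reached, assuming
// roughly synchronized clocks and fewer than one incarnation bump per
// second on average.
func startingIncarnation(now time.Time) uint32 {
	return uint32(now.Unix()) // fits in uint32 until the year 2106
}

func main() {
	inc := startingIncarnation(time.Now())
	fmt.Printf("restarted node starts at incarnation %d\n", inc)

	// A stale "deadNode, incarnation 100" gossiped before the restart now
	// loses the comparison (100 < inc), so other members keep treating the
	// node as alive even if the refutation message never reaches it.
}
```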
Generally, it seems that there's plenty of room for improvement in this library, but of course that requires active maintenance. Do you still maintain the library?