Hi.
I'm testing memberlist in a ~4500-node cluster and I seem to have found a nontrivial issue: during mass node restarts, some nodes do not reset their "incarnation" numbers to high enough values, which leaves some nodes believing that other nodes are dead for a long time while they actually aren't.
As I understand, the scenario is the following:
- You begin to restart the whole cluster
- During the restart, node 1 happens to mark node 2 as dead at a large incarnation (say 100) and broadcasts it via gossip (deadNode 2, incarnation=100)
- Node 2 restarts and does its initial sync with node 1, but node 1 has already restarted too and doesn't know about node 2 at all, so node 2 keeps its incarnation at 1
- Node 3 restarts, and it happens to first receive the new metadata of the restarted node 2 with incarnation 1, and only AFTER that the deadNode gossip message for node 2 with incarnation 100
- Node 3 tries to re-gossip this deadNode 2/100 message, but again it happens to resend it to nodes 4, 5, and 6, which either are restarting at that very moment and so don't propagate the gossip further, OR have just restarted and don't know about node 2 at all (and thus ignore the deadNode 2 message)
- Anyway, the idea is that the deadNode 2/100 message doesn't get to the newly restarted node 2 and thus node 2 doesn't refute it and doesn't increase its incarnation number to 101
- We end up with node 3 thinking that node 2 is dead, while it is not
- Memberlist doesn't recheck dead nodes in any way, so it stays in this state indefinitely, at least until a full TCP sync with some other node fixes it - but that's an ugly solution, and full TCP sync has its own problem: "Full TCP sync in a large-scale cluster may "revive" a gone node for a long time" (#312)
I can't guarantee that it's the exact case which I observed (due to the lack of logging in this library) but I'm almost sure it's at least very close.
The general idea is that the "incarnation" mechanism is not a guarantee during mass restarts.
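To make the race concrete, here is a minimal, self-contained Go sketch of the ordering above. The types (`state`, `node`) and the update rules are my own simplified assumptions about SWIM-style alive/dead handling, not memberlist's actual implementation:

```go
package main

import "fmt"

// Simplified, hypothetical model of SWIM-style membership state.
// This is NOT memberlist's real code; the types and rules below are an
// assumption made only to illustrate the race described above.
type state struct {
	incarnation uint32
	alive       bool
}

type node struct {
	members map[string]state
}

// applyAlive roughly follows the usual SWIM rule: accept an alive message
// for a known node only if its incarnation is high enough to override the
// currently known state.
func (n *node) applyAlive(name string, inc uint32) {
	cur, known := n.members[name]
	if !known || inc > cur.incarnation || (!cur.alive && inc >= cur.incarnation) {
		n.members[name] = state{incarnation: inc, alive: true}
	}
}

// applyDead accepts a dead message if its incarnation is >= the known one.
func (n *node) applyDead(name string, inc uint32) {
	if cur, known := n.members[name]; known && inc >= cur.incarnation {
		n.members[name] = state{incarnation: inc, alive: false}
	}
}

func main() {
	node3 := &node{members: map[string]state{}}

	// Node 3 first learns about the freshly restarted node 2,
	// which came back with incarnation 1.
	node3.applyAlive("node2", 1)

	// Only afterwards does the stale "deadNode 2, incarnation 100" gossip
	// arrive. It wins (100 >= 1), and since node 2 never receives this
	// message, it never refutes it with incarnation 101.
	node3.applyDead("node2", 100)

	fmt.Printf("node3's view of node2: %+v\n", node3.members["node2"])
	// Prints: node3's view of node2: {incarnation:100 alive:false}
	// and nothing in the protocol re-checks that verdict afterwards.
}
```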
My ideas about how it could be fixed:
- Add retries/rechecks of dead nodes instead of relying on a single "deadNode" message being gossiped and then refuted.
- Use system time instead of the incarnation number (see the sketch after this list).
- Mark node as alive when it pings you.
- Request an update from the given node if you receive an "aliveNode" message from it with a smaller incarnation number than you already know.
- Re-broadcast deadNode messages even if we don't know about the given node at all.
- At least add logs for all "deadNode" / "suspectNode" messages and for all .
- A combination of all previous variants. :-)
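As an illustration of the "use system time" idea, here is a minimal Go sketch. `startingIncarnation` is a hypothetical helper, not part of memberlist's API, and the trade-offs (clock skew, uint32 range) would need real consideration:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch of the "use system time" idea from the list above.
// The function name and the uint32 choice are assumptions for illustration;
// memberlist does not expose anything like this today.

// startingIncarnation derives a node's initial incarnation from the wall
// clock (Unix seconds). After a restart this value is almost certainly
// higher than any incarnation the previous instance reached, assuming
// roughly synchronized clocks and fewer than one incarnation bump per
// second on average.
func startingIncarnation(now time.Time) uint32 {
	return uint32(now.Unix()) // fits in uint32 until the year 2106
}

func main() {
	inc := startingIncarnation(time.Now())
	fmt.Printf("restarted node starts at incarnation %d\n", inc)

	// A stale "deadNode, incarnation 100" gossiped before the restart now
	// loses the comparison (100 < inc), so other members keep treating the
	// node as alive even if the refutation message never reaches it.
}
```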
Generally, it seems that there's plenty of room for improvement in this library, but of course that requires active maintenance. Do you still maintain the library?