Fix DecisionsInView reset to zero during same-height sync #663

Open
ivan-atme wants to merge 1 commit into hyperledger-labs:main from ivan-atme:fix-orderer-stuck

@ivan-atme

When `sync()` is called but the synchronizer returns the same block height as the controller ("already at target height"), `newDecisionsInView` stays at its zero-initialized value, because the update at line 641 is guarded by `latestDecisionSeq > controllerSequence`, which is false when the two are equal.

This zero is passed to `changeView()`, resetting `DecisionsInView` from its correct value (e.g. 9578) to 0. The next proposal from the leader carries the correct `DecisionsInView`, fails validation at view.go:577, and is rejected as "bad proposal". That triggers a recovery sync loop that costs ~10 seconds per cycle and can cause Hyperledger Fabric orderers to fall permanently behind.

Fix: decouple the `DecisionsInView` update from the checkpoint update by moving it to a separate `if latestDecisionMetadata != nil` block, so decisions are always derived from the sync response metadata regardless of whether the sequence advanced.
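
For reviewers, here is a minimal, self-contained sketch of the control flow described above. It is not the actual controller code: `ViewMetadata`, `syncResult`, `applySyncBuggy`, `applySyncFixed`, and `proposalAccepted` are hypothetical names used only for illustration. The sketch also assumes the metadata value is copied as-is and that validation is a plain equality check, neither of which is confirmed by this description.

```go
package main

import "fmt"

// ViewMetadata is a hypothetical stand-in for the metadata carried by the
// latest decision in a sync response.
type ViewMetadata struct {
	LatestSequence  uint64
	DecisionsInView uint64
}

// syncResult is a hypothetical holder for the values sync() would hand on
// to changeView().
type syncResult struct {
	checkpointSeq   uint64
	decisionsInView uint64
}

// applySyncBuggy models the behaviour before the fix: decisionsInView is only
// set inside the "sequence advanced" branch, so it stays 0 on a same-height sync.
func applySyncBuggy(md *ViewMetadata, controllerSeq uint64) syncResult {
	res := syncResult{checkpointSeq: controllerSeq} // decisionsInView is zero-initialized
	if md != nil && md.LatestSequence > controllerSeq {
		res.checkpointSeq = md.LatestSequence
		res.decisionsInView = md.DecisionsInView // assumption: value copied as-is
	}
	return res
}

// applySyncFixed models the proposed fix: the checkpoint update keeps its
// guard, while decisionsInView is derived whenever metadata is present.
func applySyncFixed(md *ViewMetadata, controllerSeq uint64) syncResult {
	res := syncResult{checkpointSeq: controllerSeq}
	if md != nil && md.LatestSequence > controllerSeq {
		res.checkpointSeq = md.LatestSequence
	}
	if md != nil {
		res.decisionsInView = md.DecisionsInView
	}
	return res
}

// proposalAccepted is a simplified stand-in for the proposal validation:
// the local counter must match the one carried by the leader's proposal.
func proposalAccepted(localDecisions, proposalDecisions uint64) bool {
	return localDecisions == proposalDecisions
}

func main() {
	// Same-height sync: the response is at the controller's own sequence.
	md := &ViewMetadata{LatestSequence: 100, DecisionsInView: 9578}
	const controllerSeq = 100

	buggy := applySyncBuggy(md, controllerSeq)
	fixed := applySyncFixed(md, controllerSeq)

	fmt.Println(buggy.decisionsInView, proposalAccepted(buggy.decisionsInView, 9578)) // 0 false
	fmt.Println(fixed.decisionsInView, proposalAccepted(fixed.decisionsInView, 9578)) // 9578 true
}
```

With the guard shared, the same-height case drops the counter to 0 and the next proposal is rejected. With the update in its own block, the counter survives and the proposal passes the simplified check.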

Signed-off-by: ivan-atme <ivan.laishevskiy@atme.com>
@HagarMeir
Contributor

Thank you for discovering this bug and providing a fix.
Can you please attach a log with the events you described?

@ivan-atme
Author

@HagarMeir I am attaching the orderers' logs: org0-orderer-1-stuck.zip.

Sorry, the logs are quite long (05:59:38-06:02:31). There is a reason for this (see below).

Additional context about the problem:

  • New blocks arrive steadily at approximately 15 blocks/min.
  • The problem recurs weekly: one of the orderers suddenly gets stuck.
  • Restarting the stuck orderer brings it back in sync.
  • The problem I am trying to solve in these logs is that org0-orderer-1 got stuck in the "otf" channel at height 2365915.

Timeline

| Time | Event |
|---|---|
| 05:59:39–05:59:50 | The bug fixed in this PR occurs here: DecisionsInView=0, proposal rejected, recovery sync |
| 05:59:50–06:00:01 | The bug reproduces again, but with a faster recovery sync |
| | Initially I was going to cut the logs here |
| 06:00:45–06:01:45 | But today I noticed that not just one but all orderers temporarily got stuck at height 2365915 here |
| 06:01:45–06:02:30 | The other orderers start producing new blocks; only org0-orderer-1 is still stuck at height 2365915 (until the end of the logs) |

Conclusion

Thus, the issue fixed in this PR was not the only cause of org0-orderer-1 getting stuck. I found a few more issues that together cause the orderer to get stuck. I describe two more proposed changes in an RFC draft: https://gist.github.com/ivan-atme/73c430c9a530985546af6f34139cddb6#file-rfc-smartbft-controller-bugs-upstream-md

Please let me know if there's anything I can do to make the review easier.
Thanks for your time!
