fix: re-announcement causing node stuck during block execution #1638
Fixes a critical race condition where re-announcing unconfirmed transactions during block execution causes cascading nonce validation failures, leading to nodes falling out of sync with the network.

The issue occurred when:
1. Block execution takes >1 second (e.g., 3.9s under heavy load)
2. Re-announcement timer triggers during execution window
3. Transactions validated against uncommitted DB state (stale nonces)
4. All re-announced txs fail validation with "invalid nonce" errors
5. Cascade of failures saturates node, preventing new block processing
6. Node falls behind, requires catchup mechanism to recover

Solution:
- Add CanReannounce() method to ConsensusEngine interface
- Returns false when status is Proposed or Executed
- Re-announcement logic checks CanReannounce() before proceeding
- Skips re-announcement during block execution with debug log

Impact:
- Eliminates invalid nonce cascades during block processing
- Prevents nodes from falling out of sync under heavy load
- Minimal performance impact (single mutex read per check)
- Re-announcement delayed by at most block execution time (~4s max)

Testing:
- Added comprehensive unit tests for CanReannounce()
- Verified thread safety with concurrent access test
- All existing consensus and node tests pass

Files changed:
- node/interfaces.go: Add CanReannounce() to interface
- node/consensus/engine.go: Implement CanReannounce()
- node/nogossip.go: Add re-announcement check
- node/node_test.go: Update test dummy
- node/consensus/can_reannounce_test.go: New unit tests

resolves: trufnetwork/truf-network#1305
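As a rough illustration of the solution described above, the check might look something like the following. This is a minimal sketch, not the code from node/consensus/engine.go: the `Status` type, its constants, the `Engine` struct, and `setStatus` are hypothetical stand-ins, and only the `CanReannounce()` name and its described behavior (false while the status is Proposed or Executed, guarded by a single read lock) come from the PR description.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical stand-in for the consensus engine's block status.
type Status int

const (
	Committed Status = iota // no block in flight; DB state is committed
	Proposed                // a block proposal is being processed
	Executed                // block executed but not yet committed
)

// Engine is an illustrative engine, not the real implementation.
type Engine struct {
	mtx    sync.RWMutex
	status Status
}

// setStatus is a hypothetical helper used only to drive the example.
func (e *Engine) setStatus(s Status) {
	e.mtx.Lock()
	defer e.mtx.Unlock()
	e.status = s
}

// CanReannounce reports whether re-announcing transactions is safe.
// It takes a single read lock and returns false while a block is
// Proposed or Executed, when nonces would be checked against
// uncommitted state.
func (e *Engine) CanReannounce() bool {
	e.mtx.RLock()
	defer e.mtx.RUnlock()
	return e.status == Committed
}

func main() {
	e := &Engine{}
	for _, s := range []Status{Committed, Proposed, Executed} {
		e.setStatus(s)
		fmt.Printf("status=%d canReannounce=%v\n", s, e.CanReannounce())
	}
}
```

Using a read lock keeps the check cheap for the re-announcement timer while still excluding concurrent status writes.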
Walkthrough
Adds a CanReannounce() bool method to the ConsensusEngine interface and its implementation (read-locked, true only when state is Committed), integrates the check into transaction re-announcement logic, and adds unit tests covering state behavior and concurrent access.
Sequence Diagram(s)
sequenceDiagram
participant Node as Node/Mempool
participant Nog as startTxAnns
participant CE as ConsensusEngine
participant SI as stateInfo
Node->>Nog: trigger re-announcement
Nog->>CE: CanReannounce()
CE->>SI: RLock state
alt state == Committed
SI-->>CE: true
CE-->>Nog: true
Nog->>Node: broadcast transactions
else state == Proposed/Executed
SI-->>CE: false
CE-->>Nog: false
Nog-->>Nog: debug log, return early
end
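The flow in the diagram can be sketched in Go roughly as follows. This is an assumption-laden illustration, not the code in node/nogossip.go: `reannounceOnce`, `fakeEngine`, the `[]string` transaction list, and the `broadcast` callback are all hypothetical; only the `ConsensusEngine` interface name, the `CanReannounce()` method, and the skip-with-debug-log behavior come from the PR.

```go
package main

import "fmt"

// ConsensusEngine lists only the method relevant to this sketch.
type ConsensusEngine interface {
	CanReannounce() bool
}

// fakeEngine is a hypothetical test double with a fixed answer.
type fakeEngine struct{ ok bool }

func (f fakeEngine) CanReannounce() bool { return f.ok }

// reannounceOnce broadcasts the pending transactions only if the engine
// says it is safe, and returns how many were broadcast. When a block is
// being processed it logs and returns early, leaving the transactions
// for the next timer tick.
func reannounceOnce(ce ConsensusEngine, pending []string, broadcast func(tx string)) int {
	if !ce.CanReannounce() {
		fmt.Println("debug: skipping re-announcement during block execution")
		return 0
	}
	for _, tx := range pending {
		broadcast(tx)
	}
	return len(pending)
}

func main() {
	pending := []string{"tx1", "tx2"}
	send := func(tx string) { fmt.Println("broadcast", tx) }

	fmt.Println("sent:", reannounceOnce(fakeEngine{ok: false}, pending, send))
	fmt.Println("sent:", reannounceOnce(fakeEngine{ok: true}, pending, send))
}
```

Skipping rather than blocking means the timer goroutine never waits on block execution; the cost is that re-announcement is deferred until a later tick.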
Estimated code review effort: 3 (Moderate), ~25 minutes
Bug Report Checklist
@MicBun, please use git blame and link the commit that introduced this bug. Send the following message in this PR:
@pr-time-tracker bug commit not caused by previous contributor
Log: not needed for now, as the issue has been resolved without further changes. It might be needed if the same issue appears again. Will keep the PR open for now.
resolves: https://github.com/trufnetwork/truf-network/issues/1305
This can be reviewed later, as it is not urgent to fix for now.