This repository was archived by the owner on Mar 1, 2022. It is now read-only.

Safety script for RPC nodes#37

Open
softalchemy wants to merge 10 commits into master from joeaba-safety-script

Conversation

@softalchemy
No description provided.

safety.sh Outdated
elif [ $DIFFSLOT -gt $FSNAPSHOT ]; then
cd ~
./stop
./fetch-snapshot.sh bv1
Contributor

What's your thinking behind the fetch-snapshot case? The problem with doing this is that it will create holes in the local rocksdb ledger data. When fetching data over RPC, we only fall back to BigTable for slots that are older than the first local rocksdb slot, so slots that land in a hole in the local ledger will not fall back to BigTable.

I'm thinking the rm -rf ledger/ case is the only one we need.
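
A minimal sketch of what the single-case logic could look like, assuming the script compares the cluster slot against the local node slot. The function name `decide_action` and its output strings are hypothetical and not from the PR; the 1500 value matches the `RMLEDGER` threshold discussed in this thread.

```shell
#!/usr/bin/env bash

# Threshold in slots; 1500 is the starting value discussed in this PR.
RMLEDGER=1500

# decide_action CLUSTER_SLOT NODE_SLOT  (hypothetical helper)
# Prints "wipe-ledger" when the node is more than RMLEDGER slots behind,
# otherwise "ok". There is deliberately no snapshot-only branch, since
# fetching a snapshot without wiping the ledger would leave a hole.
decide_action() {
  local cluster_slot=$1
  local node_slot=$2
  local diff=$((cluster_slot - node_slot))

  if [ "$diff" -gt "$RMLEDGER" ]; then
    echo "wipe-ledger"
  else
    echo "ok"
  fi
}
```

The caller would then run the stop, `rm -rf ledger/`, restart sequence only when the function prints `wipe-ledger`.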

softalchemy and others added 2 commits March 2, 2021 10:43
Co-authored-by: Michael Vines <mvines@gmail.com>
Author

@softalchemy softalchemy left a comment

How about now @mvines?
Or do you think 1,500 is too aggressive?

Contributor

@mvines mvines left a comment

1500 seems like a good place to start, at least.

safety.sh Outdated
logfile=`date +"%d-%b-%Y"`_log.log
echo "================= $currenttime" >> $logfile
RMLEDGER=1500
CLUSTERSLOT=$(solana slot -u http://10.142.0.4:8899)
Contributor

@mvines mvines Mar 2, 2021

Can you please turn 10.142.0.4 into a named constant, so it's clearer which host this is?
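
One possible shape for this, as a sketch only. The names `REFERENCE_RPC_URL` and `cluster_slot` are hypothetical; the thread does not say which host 10.142.0.4 is, so the comment beside the constant is where that would be documented.

```shell
#!/usr/bin/env bash

# RPC endpoint used as the reference for the current cluster slot.
# Hypothetical constant name; describe the actual host here.
REFERENCE_RPC_URL="http://10.142.0.4:8899"

# Query the reference endpoint for its current slot.
cluster_slot() {
  solana slot -u "$REFERENCE_RPC_URL"
}
```

With this in place, the script body would read `CLUSTERSLOT=$(cluster_slot)` and the address would live in exactly one spot.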

@@ -0,0 +1,18 @@
#/bin/bash
currenttime=`date +"%d-%b-%Y %H:%M:%S"`
Contributor

I'm a fan of adding set -e to the top of these kinds of scripts so they bail on failures (usually, at least; there are lots of caveats with set -e) instead of stumbling along and causing splash damage.
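
As an illustration of one such caveat, here is a small sketch; the function name `step_that_fails` is made up for the example. Bash ignores `set -e` for a command used as an `if` condition, and that includes everything inside a function called there.

```shell
#!/usr/bin/env bash
set -e

# Stand-in for a real step whose first command fails.
step_that_fails() {
  false                  # run bare at top level, this would abort the script
  echo "still running"   # reached anyway when the function is an if-condition
}

# Because the function is called as an if-condition, set -e is suspended
# for its whole body, so the failure of `false` goes unnoticed.
result="aborted"
if step_that_fails; then
  result="continued past failure"
fi
```

Calling destructive steps such as `rm -rf ledger/` on their own line, rather than inside a condition, keeps `set -e` effective for the commands that precede them.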

safety.sh Outdated
./stop
rm -rf ledger/
./restart
echo "Node was:" $DIFFSLOT " slots behind, Services has been stopped, ledger deleted and service restarted" >> $logfile
Contributor

Should we emit a metric here so we can get flagged in PagerDuty and/or on http://bit.ly/solana-cluster-sanity?

Emitting a metric is easy. Start by sourcing this script:

source "$here"/configure-metrics.sh
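
What follows the `source` line is not shown in this thread, so the sketch below only covers building the datapoint. It assumes the metrics backend accepts InfluxDB line protocol; the function name, the measurement name `rpc-safety-ledger-wipe`, and the tag and field names are all hypothetical. Whatever write helper `configure-metrics.sh` provides would be passed the resulting string.

```shell
#!/usr/bin/env bash

# format_ledger_wipe_datapoint HOST SLOTS_BEHIND  (hypothetical helper)
# Prints one InfluxDB line-protocol datapoint:
#   measurement,tag=value field=value
# The trailing "i" marks slots_behind as an integer field.
format_ledger_wipe_datapoint() {
  local host=$1
  local slots_behind=$2
  printf 'rpc-safety-ledger-wipe,host=%s slots_behind=%si\n' \
    "$host" "$slots_behind"
}
```

For example, `format_ledger_wipe_datapoint "$(hostname)" "$DIFFSLOT"` would produce the string to submit right after the ledger is deleted.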

softalchemy and others added 6 commits March 9, 2021 21:18
Co-authored-by: Michael Vines <mvines@gmail.com>
Co-authored-by: Michael Vines <mvines@gmail.com>
Co-authored-by: Michael Vines <mvines@gmail.com>
Co-authored-by: Michael Vines <mvines@gmail.com>
2 participants