fix: Cap WorkerLock retry intervals to 15 minutes
#19394
base: develop
Conversation
…es, continue logging at durations greater than 10 minutes
changelog.d/19394.bugfix
@@ -0,0 +1 @@
+Prevent excessively long numbers for the retry interval of `WorkerLock`s. Contributed by Famedly.
In #19390 (comment) (another Famedly PR):

> I am submitting this PR as an employee of Famedly, who has signed the corporate CLA, and used my company email in the commit.

I assume the same applies here?
Yes, this is correct. Will we have to state such each time we upstream changes?
    self._retry_interval = min(Duration(minutes=15).as_secs(), next * 2)
    if self._retry_interval > Duration(minutes=10).as_secs():  # >12 iterations
It would be nice to have these as constants, `WORKER_LOCK_MAX_RETRY_INTERVAL` and `WORKER_LOCK_WARN_RETRY_INTERVAL` (perhaps a better name), so we can share and better describe these values.
I had actually considered that, before also considering that a more flexible approach for different locks may be worth exploring. For example, when a lock is taken because an event is being persisted, the retry interval could be capped to a much smaller value, and the same goes for the logging of excessive times; whereas a lock for purging a room might start with a longer retry interval while keeping the same cap.

Perhaps these constants could serve as defaults, if that exploration bears fruit. I shall add it to my notes for more work in this area, but I would rather do that separately.
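For illustration, here is a minimal sketch of how the suggested constants could be wired in, assuming the names from the review comment and the warn-past-10-minutes behaviour shown in the diff; the helper function and values below are hypothetical, not Synapse's actual `WorkerLock` code:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constants as suggested in the review; names and values are
# assumptions based on this PR's diff, not Synapse's real API.
WORKER_LOCK_MAX_RETRY_INTERVAL_SECS = 15 * 60   # hard cap on the backoff
WORKER_LOCK_WARN_RETRY_INTERVAL_SECS = 10 * 60  # warn once waits get this long


def next_retry_interval(current_interval: float) -> float:
    """Double the retry interval, cap it, and warn on excessive waits."""
    interval = min(WORKER_LOCK_MAX_RETRY_INTERVAL_SECS, current_interval * 2)
    if interval > WORKER_LOCK_WARN_RETRY_INTERVAL_SECS:
        logger.warning("Lock retry interval is getting excessive: %.1fs", interval)
    return interval
```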
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
After the issue occurred again in our prod: my hypothesis would be that the issue is not primarily about the size of the growing timeout, but about the timeout being ignored at all? At least the logged timeout is not reflected in the timestamp deltas of the log lines?!
Yes, there is more than one thing going on here. This fix (switch…
Fixes the symptoms of #19315 but not the underlying reason causing the number to grow so large in the first place.
Copied from the original pull request on Famedly's Synapse repo (with some edits):
Basing the retry interval around 5 seconds leaves a big window of waiting, especially as this window is doubled on each retry, during which another worker could be making progress but cannot.
Right now, the retry interval in seconds looks like `[0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)]`, after which logging should start about excessive times, and the interval (relatively quickly) ends up extremely large, with an unrealistic expectation past the heat death of the universe (1 year in seconds = 31,536,000).

With this change, retry intervals in seconds should look more like `[0.2, 5, 10, 20, 40, 80, 160, 320, 640, 900, 900, (stays capped at 15 minutes = 900 seconds)]`.
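As a sanity check of the numbers above, here is a small standalone sketch (not the actual `WorkerLock` code) that reproduces both sequences, assuming a 5 s base that doubles each retry, with the 0.2 s first value standing in for the initial jittered wait:

```python
from typing import List, Optional


def backoff(cap_secs: Optional[float], retries: int = 12, start: float = 5.0) -> List[float]:
    """Return the retry intervals, optionally capped at ``cap_secs``."""
    intervals, interval = [0.2], start
    for _ in range(retries):
        intervals.append(interval)
        interval = interval * 2 if cap_secs is None else min(cap_secs, interval * 2)
    return intervals


print(backoff(cap_secs=None))     # uncapped: 0.2, 5, 10, 20, ... keeps doubling
print(backoff(cap_secs=15 * 60))  # capped:   0.2, 5, 10, ..., 640, 900, 900, ...
```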
Further suggested work in this area could be to define the cap, the retry interval starting point, and the multiplier depending on how frequently the lock should be checked (see the data below for reasons why). Increasing the jitter range may also be a good idea.
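As a rough illustration of that per-lock idea, a hedged sketch of what a configurable retry policy might look like; the class, field names, and example values are all assumptions, not anything that exists in Synapse today:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LockRetryPolicy:
    """Hypothetical per-lock backoff parameters: start, multiplier, cap, jitter."""

    start_secs: float = 5.0
    multiplier: float = 2.0
    cap_secs: float = 15 * 60
    jitter_secs: float = 0.2

    def next_interval(self, current_secs: float) -> float:
        # Grow the interval, cap it, and add a little jitter to avoid
        # retry storms when many workers wake up at once.
        grown = min(self.cap_secs, current_secs * self.multiplier)
        return grown + random.uniform(0, self.jitter_secs)


# e.g. event persistence might want tight retries, while purging a room
# could start slower but keep the same overall cap.
EVENT_PERSIST_POLICY = LockRetryPolicy(start_secs=0.5, cap_secs=30)
PURGE_ROOM_POLICY = LockRetryPolicy(start_secs=30, cap_secs=15 * 60)
```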
Pull Request Checklist
- Pull request is based on the develop branch
- Pull request includes a changelog file. The entry should:
  - Be a short description of your change which makes sense to users ("Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.")
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
- Pull request includes a sign off
- Code style is correct (run the linters)