Add configurable max time between successful resyncs #11577
base: master
Conversation
Logs an error if the time between resyncs surpasses the maximum. Signed-off-by: Leo Jenkins <ljenkins50@bloomberg.net>
Leo Jenkins seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Many thanks! I've made some specific comments, but I have an overall point as well: my medium-term plan is to eliminate periodic resync completely, because we don't actually have any need for it. So I'd just like to check that you won't mind if all of this is ripped out again if/when that happens?
To be clear, we will still have a start of day resync, each time that the Neutron server starts or restarts, but we won't have further resyncs periodically after the initial one.
| f" {cfg.CONF.calico.resync_max_interval_secs} seconds" | ||
| ) | ||
|
|
||
| eventlet.sleep(cfg.CONF.calico.resync_max_interval_secs / 5) |
It would be more precise to sleep until self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs, which means to sleep for a period of self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now().
Then if self.last_resync_time is still unchanged, we will log the ERROR at the earliest appropriate time. OTOH, if cfg.CONF.calico.resync_max_interval_secs / 5 is less than the period above, then the next check will inevitably not see a problem and will just sleep again.
I understand the current polling approach is a bit inefficient, but I think it's the best solution available, and here are a few reasons why.
I think the calculation self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now() would lead to negative sleep times if the max interval is exceeded.
A similar approach that we considered would be to sleep for cfg.CONF.calico.resync_max_interval_secs every time. However, this approach leads to an extremely high difference between the desired and enforced "max interval" in the worst case. For example, if cfg.CONF.calico.resync_max_interval_secs is 5 hours, then it is possible we would not actually log an error until the duration reaches ~10 hours. This would happen if the monitor thread checks whether the interval has exceeded the threshold right before it does so; the monitor thread would then sleep another 5 hours before checking again and observing the issue. Put simply, our worst-case error bound on the interval is equal to the time the monitor thread sleeps.
The current solution balances the worst-case error and the inefficiency of polling by sleeping for cfg.CONF.calico.resync_max_interval_secs / 5, but that was a fairly arbitrary choice.
As you mentioned, this will cause the error to be logged multiple times if the resync still has not executed. I think this is a feature, as it allows us to distinguish between one-off and persistent failures. It also allows us to set up alarms that increase in severity if the issue is unresolved.
I see that this interacts with the point about repeating the error if the problem persists. With that in mind, what about the best of both worlds:
- sleep for self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now(), if this is >= 0
- else sleep for cfg.CONF.calico.resync_max_interval_secs / 5
I like that -- it's the best of both worlds. I implemented this logic and added two tests for it.
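For readers skimming the thread, here is a minimal sketch of the agreed sleep logic. It assumes the names used above (self.last_resync_time, self._epoch, cfg.CONF.calico.resync_max_interval_secs) plus a clock source; the surrounding structure is an illustration, not the code actually merged in this PR.

```python
import time

import eventlet
from oslo_config import cfg
from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def resync_monitor_thread(self, launch_epoch):
    # Illustrative only: keep monitoring until a newer epoch supersedes
    # this thread.
    while self._epoch == launch_epoch:
        max_interval = cfg.CONF.calico.resync_max_interval_secs
        # Seconds until the last successful resync becomes "too old".
        # (Using time.time() as the clock is an assumption here.)
        remaining = self.last_resync_time + max_interval - time.time()
        if remaining >= 0:
            # Sleep until the earliest moment a violation could occur.
            eventlet.sleep(remaining)
        else:
            # The max interval has already been exceeded: log now, then poll
            # at a fraction of the interval so that a persistent failure
            # keeps logging and any alarms can escalate.
            LOG.error(
                "Time since last successful resync exceeds"
                f" {max_interval} seconds"
            )
            eventlet.sleep(max_interval / 5)
```

With this shape, a healthy system sleeps roughly once per resync interval, and only falls back to polling at max_interval / 5 while the problem persists.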
self.sleep_patcher.stop()
super(TestResyncMonitorThread, self).tearDown()

def increment_epoch(self, curr_epoch):
This feels a bit unintuitive. Could it be named set_epoch, remove + 1 below, and add + 1 to each of the callers?
(The key problem is that it doesn't get and increment the current self._epoch.)
I renamed it to simulate_epoch_progression to better describe how it's being used in the tests. Please let me know what you think.
I don't understand why this function takes a curr_epoch or init_epoch arg? Why would you not use getattr to get the current _epoch value, then increment, then setattr to set that?
Oh yeah, that's not necessary. I removed the curr_epoch argument. I had to replace it with expected_sleep_time in order to test the new sleep time logic; however, I think it is much simpler and easier to understand now.
def test_monitor_does_nothing_when_not_master(self):
    """Test that a driver that is not master does not monitor."""
    self.driver.elector.master.return_value = False
    self.mock_sleep.side_effect = self.increment_epoch(INITIAL_EPOCH)
I'm not following this. Is it important to set the epoch to INITIAL_EPOCH+1 ?
Oh, I guess it's a way to avoid ending up with multiple resync and monitor threads running?
Yeah, so the mock_sleep object and increment_epoch logic are there to turn a long-running thread like resync_monitor_thread into a function that runs once and returns. If you look at resync_monitor_thread, it loops forever while self._epoch == launch_epoch. So by incrementing self._epoch instead of sleeping, the thread will die after a single run.
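A hedged sketch of that mechanism, using the names from this thread (mock_sleep, self.driver._epoch); the helper body is an assumption rather than the PR's actual code, and the real helper also takes an expected_sleep_time argument, which is omitted here:

```python
def simulate_epoch_progression(self):
    # Return a fake sleep to use as mock_sleep.side_effect: instead of
    # blocking, each "sleep" bumps the driver's epoch, so the monitor
    # thread's `while self._epoch == launch_epoch` condition fails and the
    # thread returns after a single iteration.
    def _fake_sleep(seconds):
        self.driver._epoch += 1
    return _fake_sleep
```

A test would then wire it up with something like self.mock_sleep.side_effect = self.simulate_epoch_progression() before starting the thread.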
self.log_error.assert_not_called()

def test_monitor_exception_stops_elector(self):
This one feels like it would belong better in test_election.py - WDYT? (It doesn't have any resync-related detail.)
I think it belongs where it is because it calls and tests the monitor thread. Plus, it does not test any of the actual election logic (the actual elector is entirely mocked out), it just checks that the function is called.
I'm sorry, you're right. I must have misread before.
self.compaction_patcher = mock.patch(
    "networking_calico.plugins.ml2.drivers"
    ".calico.mech_calico.check_request_etcd_compaction"
)
I don't think you'll need this, as our master code has moved compaction to a separate thread.
And we've also done that on the 3.30 and 3.31 branches, so it also won't be needed for backports to those branches.
Yes! Removed!
self.log_error.assert_called_once()

def test_errors_continue_to_log(self):
    """Test that errors contine logging if resync does not occur."""
| """Test that errors contine logging if resync does not occur.""" | |
| """Test that errors continue logging if resync does not occur.""" |
Also I'm wondering why this is a useful behaviour. E.g. if the resync thread has locked up, why is it helpful to get N ERRORs, instead of just 1?
I didn't see this question previously, so I answered it here:
#11577 (comment)
/sem-approve

1 similar comment

/sem-approve
@LeoBJenkins WDYT about a release note and documentation for this? I think not the latter, as it's a small point, and one that is likely to be reversed in the nearish future. But perhaps a release note? If you agree, please fill in that section of the PR description.
Also needs
I will say that at scale, we still do face infrequent problems with e.g. live-migration that could make this change visible. It is likely sufficient to just note it as part of release notes so that operators are aware and can adapt accordingly, but at least an FYI for thought:

IMHO, a harsh reality is that Nova's live-migration process is kind of fragile - in essence, it hinges on an RPC request initiated from the control plane (with RabbitMQ as the substrate) towards a source

Upstream has floated the idea of this script to remove the duplicate bindings, but it does direct DB calls IIUC, which leaves the ML2 driver no opportunity to observe changes. In most cases, you'd expect the port-binding to remain ACTIVE at the source (i.e. in sync with
Many thanks Tyler, this is great to be aware of. It still seems to me that a full periodic resync is the wrong hammer for this nail. I was already thinking that we might create a way to explicitly trigger a resync for a particular port or network, for specific cases where that is needed, and it sounds like this could be one of those cases. Or, even more specifically, perhaps this is something we should design as part of our live migration support: something like schedule a recheck for the relevant port at a time after the live migration should have completed. Net, even if we remove the periodic resync, we'll still have these other solutions in mind:
WDYT?
This would be super! It sounds general enough that e.g. you could still trigger a resync via cron or something in a pinch if needed, in addition to leveraging it for more specific breakfix issues.
I'm happy to do whatever you think is best! What goes into a release note, and is there any documentation I can follow when writing one? Thank you!
No longer necessary in newer versions of networking calico.
Reformatted with `tox -e black`.
Better describes its use case.
Done!

/sem-approve
Formatting is good now, but there are a few UT issues; please look for "FAIL:" here:
I don't think we have any formal doc about this, but it's just a sentence or two summarising the change. Like one of the bullet points here: https://docs.tigera.io/calico/latest/release-notes/#enhancements |
Thanks; so I think that will be the plan in principle. But it will probably take a good while to land because we have a lot of higher-priority things to do first 😄 So it definitely makes sense to land this current PR for the interim.
Tests pass on my end.
I think I fixed those issues and the tests pass on my end, but I'm a little bit confused, since the tests passed previously as well.
/sem-approve

Great, thanks. I added a release note to the PR description.
nelljerram left a comment
Just two remaining questions now:
- about the epoch increment coding
- about the ideal sleep time.
Remaining questions are more debatable.
Only polls when things are going bad.
Description
This PR adds a configurable maximum time between successful resyncs, together with a monitoring thread that runs alongside the periodic_resync thread and logs an error if the interval between successful resyncs is too long.

These features are necessary because there is a lack of visibility into whether the resync process is completing regularly. The monitoring thread will give networking-calico users peace of mind, as they know the system will alert them via logged errors when the resync process is no longer meeting their expectations.
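For context, here is a sketch of how such an option is typically registered with oslo.config. The option name and the [calico] group come from the review comments above, but the default value and help text are assumptions for illustration, not necessarily what this PR uses.

```python
from oslo_config import cfg

calico_monitor_opts = [
    cfg.IntOpt(
        "resync_max_interval_secs",
        default=1800,  # assumed default, for illustration only
        help="Log an ERROR if the time since the last successful resync "
             "exceeds this many seconds.",
    ),
]

# Register under the [calico] section of the Neutron server's config.
cfg.CONF.register_opts(calico_monitor_opts, group="calico")
```

Operators would then tune the threshold by setting resync_max_interval_secs under [calico] in the Neutron server configuration.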
This PR has been tested both automatically (see the attached tests) and manually in a development cloud, where we verified that errors were logged only when the resync interval exceeded the configured maximum.
Other than increased logging (and a slight increase in threads to be scheduled), there should be no change to the functionality of networking-calico as a result of this PR.
Related issues/PRs
#11579
Todos
Release Note
Reminder for the reviewer
Make sure that this PR has the correct labels and milestone set.
Every PR needs one `docs-*` label.
- `docs-pr-required`: This change requires a change to the documentation that has not been completed yet.
- `docs-completed`: This change has all necessary documentation completed.
- `docs-not-required`: This change has no user-facing impact and requires no docs.

Every PR needs one `release-note-*` label.
- `release-note-required`: This PR has user-facing changes. Most PRs should have this label.
- `release-note-not-required`: This PR has no user-facing changes.

Other optional labels:
- `cherry-pick-candidate`: This PR should be cherry-picked to an earlier release. For bug fixes only.
- `needs-operator-pr`: This PR is related to install and requires a corresponding change to the operator.