
Conversation

@LeoBJenkins

@LeoBJenkins LeoBJenkins commented Dec 17, 2025

Description

This PR

  1. Creates a thread that monitors the activity of the periodic_resync thread and logs an error if the interval between successful resyncs is too long.
  2. Allows users to configure the maximum interval between successful resyncs.

These features are needed because there is currently no visibility into whether the resync process is completing regularly. The monitoring thread gives networking-calico users peace of mind, since the system will alert them via logged errors when the resync process is no longer meeting their expectations.
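To make the mechanism concrete, here is a minimal sketch of the idea only -- not the code in this PR. The option group name comes from the cfg.CONF.calico.* usage discussed below; the default value, the oslo-style logger, the resync_monitor_loop name, and the monotonic timestamp kept in last_resync_time are assumptions, and the sleep interval is refined later in the review discussion.

```python
import time

import eventlet
from oslo_config import cfg
from oslo_log import log as logging

LOG = logging.getLogger(__name__)

# Hypothetical registration of the new option under the [calico] group.
cfg.CONF.register_opts(
    [cfg.IntOpt("resync_max_interval_secs",
                default=1800,  # assumed default, not taken from the PR
                help="Maximum allowed seconds between successful resyncs.")],
    "calico")


def resync_monitor_loop(driver):
    """Log an ERROR whenever the last successful resync is older than the
    configured maximum interval."""
    max_interval = cfg.CONF.calico.resync_max_interval_secs
    while True:
        elapsed = time.monotonic() - driver.last_resync_time
        if elapsed > max_interval:
            LOG.error("No successful periodic resync for %.0f seconds; "
                      "the configured maximum is %d seconds",
                      elapsed, max_interval)
        eventlet.sleep(max_interval / 5)
```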

This PR has been tested both automatically -- see the attached tests -- and manually in a development cloud, where we verified that errors were only logged when the resync lasted longer than the configured maximum.

Other than increased logging (and a slight increase in threads to be scheduled), there should be no change to the functionality of networking-calico as a result of this PR.

Related issues/PRs

#11579

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Adds a configurable field `resync_max_interval_secs` that defines the maximum time allowed between successful periodic resyncs, and logs an error message if the interval exceeds the threshold.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Logs an error if the time between resyncs surpasses the maximum.

Signed-off-by: Leo Jenkins <ljenkins50@bloomberg.net>
@marvin-tigera marvin-tigera added this to the Calico v3.32.0 milestone Dec 17, 2025
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Dec 17, 2025
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Note: In case you are already a member of bloomberg, there is no need to sign the CLA again because bloomberg has already signed the (Corporate) CLA, hence just make sure that your membership is public. If you are not a member of bloomberg then you need to accept our CLA.


Leo Jenkins seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@LeoBJenkins LeoBJenkins marked this pull request as ready for review December 17, 2025 17:11
@LeoBJenkins LeoBJenkins requested a review from a team as a code owner December 17, 2025 17:11
Member

@nelljerram nelljerram left a comment

Many thanks! I've made some specific comments, but I have an overall point as well: my medium term plan is to eliminate periodic resync completely, because we don't actually have any need for it. So I'd just like to check that you won't mind if all of this is ripped out again if/when that happens?

To be clear, we will still have a start of day resync, each time that the Neutron server starts or restarts, but we won't have further resyncs periodically after the initial one.

f" {cfg.CONF.calico.resync_max_interval_secs} seconds"
)

eventlet.sleep(cfg.CONF.calico.resync_max_interval_secs / 5)
Member

It would be more precise to sleep until self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs, which means to sleep for a period of self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now().

Then if self.last_resync_time is still unchanged, we will log the ERROR at the earliest appropriate time. OTOH, if cfg.CONF.calico.resync_max_interval_secs / 5 is less than the period above, then the next check will inevitably not see a problem and will just sleep again.

Author

I understand the current polling approach is a bit inefficient, but I think it's the best solution available; here are a few reasons why.

I think the calculation self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now() would lead to negative sleep times if the max interval is exceeded.

A similar approach that we considered would be to sleep for cfg.CONF.calico.resync_max_interval_secs every time. However, this approach leads to an extremely high difference between the desired and enforced "max interval" in the worst case. For example, if cfg.CONF.calico.resync_max_interval_secs is 5 hours, then it is possible we would not actually log an error until the duration reaches ~10 hours. This would happen if the monitor thread checks to see if the interval has exceeded the threshold right before it does so. The monitor thread would then sleep another 5 hours before checking again and observing the issue. Or, put simply, our worst case error bound on the interval is equal to the time the monitor thread sleeps.

The current solution balances the worst-case error and the inefficiency of polling by sleeping for cfg.CONF.calico.resync_max_interval_secs / 5, but that was a fairly arbitrary choice.

As you mentioned, this will cause the error to be logged multiple times if the resync still has not executed. I think this is a feature, as it allows us to distinguish between one-off and persistent failures. It also allows us to set up alarms that increase in severity if the issue remains unresolved.

Member

I see that this interacts with the point about repeating the error if the problem persists. With that in mind, what about the best of both worlds:

  • self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now() if this is >= 0
  • else cfg.CONF.calico.resync_max_interval_secs / 5
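As a sketch, and using the names from the discussion rather than the merged code, that combined choice could look like:

```python
def next_monitor_sleep(last_resync_time, max_interval, now):
    """Choose how long the monitor thread should sleep.

    While the resync is not yet overdue, sleep exactly until the earliest
    moment it could become overdue; once overdue, fall back to polling at
    max_interval / 5 so the ERROR repeats while the problem persists.
    """
    remaining = last_resync_time + max_interval - now
    if remaining >= 0:
        return remaining
    return max_interval / 5
```

The monitor thread would then call something like eventlet.sleep(next_monitor_sleep(self.last_resync_time, cfg.CONF.calico.resync_max_interval_secs, time.monotonic())).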

Author

I like that -- it's the best of both worlds. I implemented this logic and added two tests for it.

self.sleep_patcher.stop()
super(TestResyncMonitorThread, self).tearDown()

def increment_epoch(self, curr_epoch):
Member

This feels a bit unintuitive. Could it be named set_epoch, remove + 1 below, and add + 1 to each of the callers?

Member

(The key problem is that it doesn't get and increment the current self._epoch.)

Author

I renamed it to simulate_epoch_progression to better describe how it's being used in the tests. Please let me know what you think.

Member

I don't understand why this function takes a curr_epoch or init_epoch arg? Why would you not use getattr to get the current _epoch value, then increment, then setattr to set that?

Author

Oh yeah, that's not necessary. I removed the curr_epoch argument. I had to replace it with expected_sleep_time in order to test the new sleep-time logic; however, I think it is much simpler and easier to understand now.

def test_monitor_does_nothing_when_not_master(self):
"""Test that a driver that is not master does not monitor."""
self.driver.elector.master.return_value = False
self.mock_sleep.side_effect = self.increment_epoch(INITIAL_EPOCH)
Member

I'm not following this. Is it important to set the epoch to INITIAL_EPOCH + 1?

Member

Oh, I guess it's a way to avoid ending up with multiple resync and monitor threads running?

Author

Yeah, so the mock_sleep object and the increment_epoch logic turn a long-running thread like resync_monitor_thread into a function that runs once and returns. If you look at resync_monitor_thread, it loops as long as self._epoch == launch_epoch. So by incrementing self._epoch instead of sleeping, the thread dies after a single run.
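For context, here is a rough sketch of that test technique; the helper name and exact wiring are illustrative, not the PR's final code:

```python
# Make the mocked eventlet.sleep advance the driver's _epoch so that the
# `while self._epoch == launch_epoch` loop in resync_monitor_thread exits
# after a single iteration instead of sleeping forever.
def bump_epoch_on_sleep(driver):
    def _side_effect(*args, **kwargs):
        driver._epoch += 1
    return _side_effect

# In a test, assuming mock_sleep patches eventlet.sleep:
#   self.mock_sleep.side_effect = bump_epoch_on_sleep(self.driver)
#   self.driver.resync_monitor_thread(INITIAL_EPOCH)
```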


self.log_error.assert_not_called()

def test_monitor_exception_stops_elector(self):
Member

This one feels like it would belong better in test_election.py - WDYT? (It doesn't have any resync-related detail.)

Author

I think it belongs where it is because it calls and tests the monitor thread. Plus, it does not test any of the actual election logic (the actual elector is entirely mocked out); it just checks that the function is called.

Member

I'm sorry, you're right. I must have misread before.

self.compaction_patcher = mock.patch(
"networking_calico.plugins.ml2.drivers"
".calico.mech_calico.check_request_etcd_compaction"
)
Member

I don't think you'll need this, as our master code has moved compaction to a separate thread.

Member

And we've also done that on the 3.30 and 3.31 branches, so it also won't be needed for backports to those branches.

Author

Yes! Removed!

self.log_error.assert_called_once()

def test_errors_continue_to_log(self):
"""Test that errors contine logging if resync does not occur."""
Member

Suggested change
"""Test that errors contine logging if resync does not occur."""
"""Test that errors continue logging if resync does not occur."""

Member

Also I'm wondering why this is a useful behaviour. E.g. if the resync thread has locked up, why is it helpful to get N ERRORs, instead of just 1?

Author

I didn't see this question previously, so I answered it here:
#11577 (comment)

@nelljerram
Member

/sem-approve

1 similar comment
@nelljerram
Member

/sem-approve

@nelljerram
Member

@LeoBJenkins WDYT about a release note and documentation for this? I think not the latter, as it's a small point, and one that is likely to be reversed in the nearish future. But perhaps a release note? If you agree, please fill in that section of the PR description.

@nelljerram
Member

Also needs make -C networking-calico fmtpy.

@tj90241
Contributor

tj90241 commented Dec 22, 2025

Many thanks! I've made some specific comments, but I have an overall point as well: my medium term plan is to eliminate periodic resync completely, because we don't actually have any need for it. So I'd just like to check that you won't mind if all of this is ripped out again if/when that happens?

To be clear, we will still have a start of day resync, each time that the Neutron server starts or restarts, but we won't have further resyncs periodically after the initial one.

I will say that at scale, we still do face infrequent problems with e.g. live-migration that could make this change visible. It is likely sufficient to just note it as part of release notes so that operators are aware and can adapt accordingly, but at least an FYI for thought:

IMHO, a harsh reality is that Nova's live-migration process is kind of fragile - in essence, it hinges on an RPC request initiated from the control plane (with RabbitMQ as the substrate) towards a source nova-compute service in a manner that is essentially unsupervised in the sense that if the source or destination nova-compute ever crashes mid-request, the state generated for the live-migration is never rectified or followed up on automatically. This can be problematic for Neutron, as one of the first steps of live-migration is unfortunately creating an intentionally duplicated port-binding at the destination for post-migration operations...

Upstream has floated the idea of this script to remove the duplicate bindings, but it does direct DB calls IIUC, which leaves the ML2 driver no opportunity to observe changes. In most cases, you'd expect the port-binding to remain ACTIVE at the source (i.e. in sync with etcd state?), but I'd be lying if I told you we were never in the situation where we had to help ourselves to the database to push the binding towards the destination with similar database updates. In such cases, we try to make it a habit of live-migrating the instance (again, a second time) after a failed live-migration to allow the ML2 driver to observe any changes and flush them to etcd, but the reality is that direct database updates still do happen in breakfix scenarios.

@nelljerram
Member

Many thanks! I've made some specific comments, but I have an overall point as well: my medium term plan is to eliminate periodic resync completely, because we don't actually have any need for it. So I'd just like to check that you won't mind if all of this is ripped out again if/when that happens?
To be clear, we will still have a start of day resync, each time that the Neutron server starts or restarts, but we won't have further resyncs periodically after the initial one.

I will say that at scale, we still do face infrequent problems with e.g. live-migration that could make this change visible. It is likely sufficient to just note it as part of release notes so that operators are aware and can adapt accordingly, but at least an FYI for thought:

IMHO, a harsh reality is that Nova's live-migration process is kind of fragile - in essence, it hinges on an RPC request initiated from the control plane (with RabbitMQ as the substrate) towards a source nova-compute service in a manner that is essentially unsupervised in the sense that if the source or destination nova-compute ever crashes mid-request, the state generated for the live-migration is never rectified or followed up on automatically. This can be problematic for Neutron, as one of the first steps of live-migration is unfortunately creating an intentionally duplicated port-binding at the destination for post-migration operations...

Upstream has floated the idea of this script to remove the duplicate bindings, but it does direct DB calls IIUC, which leaves the ML2 driver no opportunity to observe changes. In most cases, you'd expect the port-binding to remain ACTIVE at the source (i.e. in sync with etcd state?), but I'd be lying if I told you we were never in the situation where we had to help ourselves to the database to push the binding towards the destination with similar database updates. In such cases, we try to make it a habit of live-migrating the instance (again, a second time) after a failed live-migration to allow the ML2 driver to observe any changes and flush them to etcd, but the reality is that direct database updates still do happen in breakfix scenarios.

Many thanks Tyler, this is great to be aware of. It still seems to me that a full periodic resync is the wrong hammer for this nail. I was already thinking that we might create a way to explicitly trigger a resync for a particular port or network, for specific cases where that is needed, and it sounds like this could be one of those cases. Or, even more specifically, perhaps this is something we should design as part of our live migration support: something like schedule a recheck for the relevant port at a time after the live migration should have completed.

Net, even if we remove the periodic resync, we'll still have these other solutions in mind:

  1. Restart the Neutron server.
  2. Trigger an explicit resync for the relevant port or network.
  3. Something more automatic, but focused on the particular case of concern.

WDYT?

@tj90241
Contributor

tj90241 commented Dec 23, 2025

I was already thinking that we might create a way to explicitly trigger a resync for a particular port or network, for specific cases where that is needed, and it sounds like this could be one of those cases

This would be super! It sounds general enough that e.g. you could still trigger a resync via cron or something in a pinch if needed, in addition to leveraging it for more-specific breakfix issues.

@LeoBJenkins
Author

@LeoBJenkins WDYT about a release note and documentation for this? I think not the latter, as it's a small point, and one that is likely to be reversed in the nearish future. But perhaps a release note? If you agree, please fill in that section of the PR description.

I'm happy to do whatever you think is best! What goes into a release note/is there any documentation I can follow when writing one? Thank you!

Leo Jenkins added 3 commits December 29, 2025 11:41
No longer necessary in newer versions of networking-calico.
Reformatted with `tox -e black`.
Better describes its use case.
@LeoBJenkins
Author

Also needs make -C networking-calico fmtpy.

Done!

@nelljerram
Member

/sem-approve

@nelljerram
Member

Formatting is good now, but there are a few UT issues; please look for "FAIL:" here:
caracal-ut.log

@nelljerram
Member

@LeoBJenkins WDYT about a release note and documentation for this? I think not the latter, as it's a small point, and one that is likely to be reversed in the nearish future. But perhaps a release note? If you agree, please fill in that section of the PR description.

I'm happy to do whatever you think is best! What goes into a release note/is there any documentation I can follow when writing one? Thank you!

I don't think we have any formal doc about this, but it's just a sentence or two summarising the change. Like one of the bullet points here: https://docs.tigera.io/calico/latest/release-notes/#enhancements

@nelljerram
Member

I was already thinking that we might create a way to explicitly trigger a resync for a particular port or network, for specific cases where that is needed, and it sounds like this could be one of those cases

This would be super! It sounds general enough that e.g. you could still trigger a resync via cron or something in a pinch if needed in addition leveraging it for more-specific breakfix issues.

Thanks; so I think that will be the plan in principle. But it will probably take a good while to land because we have a lot of higher priority things to do first 😄 So it definitely makes sense to land this current PR for the interim.

Tests pass on my end.
@LeoBJenkins
Author

Formatting is good now, but there are a few UT issues; please look for "FAIL:" here: caracal-ut.log

I think I fixed those issues, and the tests pass on my end:

(py38) root@b79d05eecfcb:~# python -m subunit.run networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread | subunit2pyunit
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_errors_continue_to_log
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_errors_continue_to_log ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_does_nothing_when_not_master
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_does_nothing_when_not_master ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_exception_stops_elector
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_exception_stops_elector ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_logs_error_when_over_max
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_logs_error_when_over_max ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_no_error_if_interval_under_max
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_no_error_if_interval_under_max ... ok
Periodic resync thread exiting.
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_resync_resets_time
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_resync_resets_time ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.605s

OK

but I'm a little bit confused since the tests passed previously as well.

@nelljerram
Member

/sem-approve

@LeoBJenkins
Author

I don't think we have any formal doc about this, but it's just a sentence or two summarising the change. Like one of the bullet points here: https://docs.tigera.io/calico/latest/release-notes/#enhancements

Great, thanks. I added a release note to the PR description.

Member

@nelljerram nelljerram left a comment

Just two remaining questions now:

  • about the epoch increment coding
  • about the ideal sleep time.

@nelljerram nelljerram dismissed their stale review January 2, 2026 11:30

Remaining questions are more debatable

@nelljerram nelljerram added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Jan 2, 2026
Only polls when things are going bad.