
Conversation

@LeoBJenkins

@LeoBJenkins LeoBJenkins commented Dec 17, 2025

Description

This PR

  1. Creates a thread that monitors the activity of the periodic_resync thread and logs an error if the interval between successful resyncs is too long.
  2. Allows users to configure the maximum interval between successful resyncs.

These features are needed because there is currently no visibility into whether the resync process is completing regularly. The monitoring thread gives networking-calico users peace of mind, since the system will alert them via logged errors when the resync process is no longer meeting their expectations.
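To make the mechanism concrete, here is a minimal sketch of the idea only -- not the code in this PR. The option group name comes from the cfg.CONF.calico.* usage discussed below; the default value, the oslo-style logger, the resync_monitor_loop name, and the monotonic timestamp kept in last_resync_time are assumptions, and the sleep interval is refined later in the review discussion.

```python
import time

import eventlet
from oslo_config import cfg
from oslo_log import log as logging

LOG = logging.getLogger(__name__)

# Hypothetical registration of the new option under the [calico] group.
cfg.CONF.register_opts(
    [cfg.IntOpt("resync_max_interval_secs",
                default=1800,  # assumed default, not taken from the PR
                help="Maximum allowed seconds between successful resyncs.")],
    "calico")


def resync_monitor_loop(driver):
    """Log an ERROR whenever the last successful resync is older than the
    configured maximum interval."""
    max_interval = cfg.CONF.calico.resync_max_interval_secs
    while True:
        elapsed = time.monotonic() - driver.last_resync_time
        if elapsed > max_interval:
            LOG.error("No successful periodic resync for %.0f seconds; "
                      "the configured maximum is %d seconds",
                      elapsed, max_interval)
        eventlet.sleep(max_interval / 5)
```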

This PR has been tested both automatically -- see the attached tests -- and manually in a development cloud, where we verified that errors were only logged when the resync lasted longer than the configured maximum.

Other than increased logging (and a slight increase in threads to be scheduled), there should be no change to the functionality of networking-calico as a result of this PR.

Related issues/PRs

#11579

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Adds a configurable field `resync_max_interval_secs` that defines the maximum time allowed between successful periodic resyncs, and logs an error message if the interval exceeds the threshold.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Logs an error if the time between resyncs surpasses the maximum.

Signed-off-by: Leo Jenkins <ljenkins50@bloomberg.net>
@marvin-tigera marvin-tigera added this to the Calico v3.32.0 milestone Dec 17, 2025
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Dec 17, 2025
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Note: In case you are already a member of bloomberg, there is no need to sign the CLA again because bloomberg has already signed the (Corporate) CLA, hence just make sure that your membership is public. If you are not a member of bloomberg then you need to accept our CLA.


Leo Jenkins seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@LeoBJenkins LeoBJenkins marked this pull request as ready for review December 17, 2025 17:11
@LeoBJenkins LeoBJenkins requested a review from a team as a code owner December 17, 2025 17:11
Member

@nelljerram nelljerram left a comment

Many thanks! I've made some specific comments, but I have an overall point as well: my medium term plan is to eliminate periodic resync completely, because we don't actually have any need for it. So I'd just like to check that you won't mind if all of this is ripped out again if/when that happens?

To be clear, we will still have a start of day resync, each time that the Neutron server starts or restarts, but we won't have further resyncs periodically after the initial one.

f" {cfg.CONF.calico.resync_max_interval_secs} seconds"
)

eventlet.sleep(cfg.CONF.calico.resync_max_interval_secs / 5)
Member

It would be more precise to sleep until self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs, which means to sleep for a period of self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now().

Then if self.last_resync_time is still unchanged, we will log the ERROR at the earliest appropriate time. OTOH, if cfg.CONF.calico.resync_max_interval_secs / 5 is less than the period above, then the next check will inevitably not see a problem and will just sleep again.

Author

I understand the current polling approach is a bit inefficient, but I think it's the best solution available; here are a few reasons why.

I think the calculation self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now() would lead to negative sleep times if the max interval is exceeded.

A similar approach that we considered would be to sleep for cfg.CONF.calico.resync_max_interval_secs every time. However, this approach leads to an extremely high difference between the desired and enforced "max interval" in the worst case. For example, if cfg.CONF.calico.resync_max_interval_secs is 5 hours, then it is possible we would not actually log an error until the duration reaches ~10 hours. This would happen if the monitor thread checks to see if the interval has exceeded the threshold right before it does so. The monitor thread would then sleep another 5 hours before checking again and observing the issue. Or, put simply, our worst case error bound on the interval is equal to the time the monitor thread sleeps.

The current solution balances the worst-case error and the inefficiency of polling by sleeping for cfg.CONF.calico.resync_max_interval_secs / 5, but that was a fairly arbitrary choice.

As you mentioned, this will cause the error to be logged multiple times if the resync still has not executed. I think this is a feature, as it allows us to distinguish between one-off and persistent failures. It also allows us to set up alarms that increase in severity if the issue remains unresolved.

Member

I see that this interacts with the point about repeating the error if the problem persists. With that in mind, what about the best of both worlds:

  • self.last_resync_time + cfg.CONF.calico.resync_max_interval_secs - now() if this is >= 0
  • else cfg.CONF.calico.resync_max_interval_secs / 5
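As a sketch, and using the names from the discussion rather than the merged code, that combined choice could look like:

```python
def next_monitor_sleep(last_resync_time, max_interval, now):
    """Choose how long the monitor thread should sleep.

    While the resync is not yet overdue, sleep exactly until the earliest
    moment it could become overdue; once overdue, fall back to polling at
    max_interval / 5 so the ERROR repeats while the problem persists.
    """
    remaining = last_resync_time + max_interval - now
    if remaining >= 0:
        return remaining
    return max_interval / 5
```

The monitor thread would then call something like eventlet.sleep(next_monitor_sleep(self.last_resync_time, cfg.CONF.calico.resync_max_interval_secs, time.monotonic())).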

Author

I like that -- it's the best of both worlds. I implemented this logic and added two tests for it.

self.sleep_patcher.stop()
super(TestResyncMonitorThread, self).tearDown()

def increment_epoch(self, curr_epoch):
Member

This feels a bit unintuitive. Could it be named set_epoch, remove + 1 below, and add + 1 to each of the callers?

Member

(The key problem is that it doesn't get and increment the current self._epoch.)

Author

I renamed it to simulate_epoch_progression to better describe how it's being used in the tests. Please let me know what you think.

Member

I don't understand why this function takes a curr_epoch or init_epoch arg? Why would you not use getattr to get the current _epoch value, then increment, then setattr to set that?

Author

Oh yeah, that's not necessary. I removed the curr_epoch argument. I had to replace it with expected_sleep_time in order to test the new sleep-time logic; however, I think it is much simpler and easier to understand now.

def test_monitor_does_nothing_when_not_master(self):
"""Test that a driver that is not master does not monitor."""
self.driver.elector.master.return_value = False
self.mock_sleep.side_effect = self.increment_epoch(INITIAL_EPOCH)
Member

I'm not following this. Is it important to set the epoch to INITIAL_EPOCH + 1?

Member

Oh, I guess it's a way to avoid ending up with multiple resync and monitor threads running?

Author

Yeah, so the mock_sleep object and the increment_epoch logic turn a long-running thread like resync_monitor_thread into a function that runs once and returns. If you look at resync_monitor_thread, it loops as long as self._epoch == launch_epoch. So by incrementing self._epoch instead of sleeping, the thread dies after a single run.
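For context, here is a rough sketch of that test technique; the helper name and exact wiring are illustrative, not the PR's final code:

```python
# Make the mocked eventlet.sleep advance the driver's _epoch so that the
# `while self._epoch == launch_epoch` loop in resync_monitor_thread exits
# after a single iteration instead of sleeping forever.
def bump_epoch_on_sleep(driver):
    def _side_effect(*args, **kwargs):
        driver._epoch += 1
    return _side_effect

# In a test, assuming mock_sleep patches eventlet.sleep:
#   self.mock_sleep.side_effect = bump_epoch_on_sleep(self.driver)
#   self.driver.resync_monitor_thread(INITIAL_EPOCH)
```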


self.log_error.assert_not_called()

def test_monitor_exception_stops_elector(self):
Member

This one feels like it would belong better in test_election.py - WDYT? (It doesn't have any resync-related detail.)

Author

I think it belongs where it is because it calls and tests the monitor thread. Plus, it does not test any of the actual election logic (the actual elector is entirely mocked out); it just checks that the function is called.

Member

I'm sorry, you're right. I must have misread before.

self.compaction_patcher = mock.patch(
"networking_calico.plugins.ml2.drivers"
".calico.mech_calico.check_request_etcd_compaction"
)
Member

I don't think you'll need this, as our master code has moved compaction to a separate thread.

Member

And we've also done that on the 3.30 and 3.31 branches, so it also won't be needed for backports to those branches.

Author

Yes! Removed!

self.log_error.assert_called_once()

def test_errors_continue_to_log(self):
"""Test that errors contine logging if resync does not occur."""
Member

Suggested change
"""Test that errors contine logging if resync does not occur."""
"""Test that errors continue logging if resync does not occur."""

Member

Also I'm wondering why this is a useful behaviour. E.g. if the resync thread has locked up, why is it helpful to get N ERRORs, instead of just 1?

Author

I didn't see this question previously, so I answered it here:
#11577 (comment)

@nelljerram
Member

/sem-approve

1 similar comment
@nelljerram
Member

/sem-approve

@nelljerram
Member

@LeoBJenkins WDYT about a release note and documentation for this? I think not the latter, as it's a small point, and one that is likely to be reversed in the nearish future. But perhaps a release note? If you agree, please fill in that section of the PR description.

@nelljerram
Member

Also needs make -C networking-calico fmtpy.

@tj90241
Contributor

tj90241 commented Dec 22, 2025

Many thanks! I've made some specific comments, but I have an overall point as well: my medium term plan is to eliminate periodic resync completely, because we don't actually have any need for it. So I'd just like to check that you won't mind if all of this is ripped out again if/when that happens?

To be clear, we will still have a start of day resync, each time that the Neutron server starts or restarts, but we won't have further resyncs periodically after the initial one.

I will say that at scale, we still do face infrequent problems with e.g. live-migration that could make this change visible. It is likely sufficient to just note it as part of release notes so that operators are aware and can adapt accordingly, but at least an FYI for thought:

IMHO, a harsh reality is that Nova's live-migration process is kind of fragile - in essence, it hinges on an RPC request initiated from the control plane (with RabbitMQ as the substrate) towards a source nova-compute service in a manner that is essentially unsupervised in the sense that if the source or destination nova-compute ever crashes mid-request, the state generated for the live-migration is never rectified or followed up on automatically. This can be problematic for Neutron, as one of the first steps of live-migration is unfortunately creating an intentionally duplicated port-binding at the destination for post-migration operations...

Upstream has floated the idea of this script to remove the duplicate bindings, but it does direct DB calls IIUC, which leaves the ML2 driver no opportunity to observe changes. In most cases, you'd expect the port-binding to remain ACTIVE at the source (i.e. in sync with etcd state?), but I'd be lying if I told you we were never in the situation where we had to help ourselves to the database to push the binding towards the destination with similar database updates. In such cases, we try to make it a habit of live-migrating the instance (again, a second time) after a failed live-migration to allow the ML2 driver to observe any changes and flush them to etcd, but the reality is that direct database updates still do happen in breakfix scenarios.

@nelljerram
Member

Many thanks! I've made some specific comments, but I have an overall point as well: my medium term plan is to eliminate periodic resync completely, because we don't actually have any need for it. So I'd just like to check that you won't mind if all of this is ripped out again if/when that happens?
To be clear, we will still have a start of day resync, each time that the Neutron server starts or restarts, but we won't have further resyncs periodically after the initial one.

I will say that at scale, we still do face infrequent problems with e.g. live-migration that could make this change visible. It is likely sufficient to just note it as part of release notes so that operators are aware and can adapt accordingly, but at least an FYI for thought:

IMHO, a harsh reality is that Nova's live-migration process is kind of fragile - in essence, it hinges on an RPC request initiated from the control plane (with RabbitMQ as the substrate) towards a source nova-compute service in a manner that is essentially unsupervised in the sense that if the source or destination nova-compute ever crashes mid-request, the state generated for the live-migration is never rectified or followed up on automatically. This can be problematic for Neutron, as one of the first steps of live-migration is unfortunately creating an intentionally duplicated port-binding at the destination for post-migration operations...

Upstream has floated the idea of this script to remove the duplicate bindings, but it does direct DB calls IIUC, which leaves the ML2 driver no opportunity to observe changes. In most cases, you'd expect the port-binding to remain ACTIVE at the source (i.e. in sync with etcd state?), but I'd be lying if I told you we were never in the situation where we had to help ourselves to the database to push the binding towards the destination with similar database updates. In such cases, we try to make it a habit of live-migrating the instance (again, a second time) after a failed live-migration to allow the ML2 driver to observe any changes and flush them to etcd, but the reality is that direct database updates still do happen in breakfix scenarios.

Many thanks Tyler, this is great to be aware of. It still seems to me that a full periodic resync is the wrong hammer for this nail. I was already thinking that we might create a way to explicitly trigger a resync for a particular port or network, for specific cases where that is needed, and it sounds like this could be one of those cases. Or, even more specifically, perhaps this is something we should design as part of our live migration support: something like schedule a recheck for the relevant port at a time after the live migration should have completed.

Net, even if we remove the periodic resync, we'll still have these other solutions in mind:

  1. Restart the Neutron server.
  2. Trigger an explicit resync for the relevant port or network.
  3. Something more automatic, but focused on the particular case of concern.

WDYT?

@tj90241
Contributor

tj90241 commented Dec 23, 2025

I was already thinking that we might create a way to explicitly trigger a resync for a particular port or network, for specific cases where that is needed, and it sounds like this could be one of those cases

This would be super! It sounds general enough that e.g. you could still trigger a resync via cron or something in a pinch if needed, in addition to leveraging it for more-specific breakfix issues.

@LeoBJenkins
Author

@LeoBJenkins WDYT about a release note and documentation for this? I think not the latter, as it's a small point, and one that is likely to be reversed in the nearish future. But perhaps a release note? If you agree, please fill in that section of the PR description.

I'm happy to do whatever you think is best! What goes into a release note/is there any documentation I can follow when writing one? Thank you!

Leo Jenkins added 3 commits December 29, 2025 11:41
No longer necessary in newer versions of networking-calico.
Reformatted with `tox -e black`.
Better describes its use case.
@LeoBJenkins
Author

Also needs make -C networking-calico fmtpy.

Done!

@nelljerram
Member

/sem-approve

@nelljerram
Member

Formatting is good now, but there are a few UT issues; please look for "FAIL:" here:
caracal-ut.log

@nelljerram
Member

@LeoBJenkins WDYT about a release note and documentation for this? I think not the latter, as it's a small point, and one that is likely to be reversed in the nearish future. But perhaps a release note? If you agree, please fill in that section of the PR description.

I'm happy to do whatever you think is best! What goes into a release note/is there any documentation I can follow when writing one? Thank you!

I don't think we have any formal doc about this, but it's just a sentence or two summarising the change. Like one of the bullet points here: https://docs.tigera.io/calico/latest/release-notes/#enhancements

@nelljerram
Member

I was already thinking that we might create a way to explicitly trigger a resync for a particular port or network, for specific cases where that is needed, and it sounds like this could be one of those cases

This would be super! It sounds general enough that e.g. you could still trigger a resync via cron or something in a pinch if needed in addition leveraging it for more-specific breakfix issues.

Thanks; so I think that will be the plan in principle. But it will probably take a good while to land because we have a lot of higher priority things to do first 😄 So it definitely makes sense to land this current PR for the interim.

Tests pass on my end.
@LeoBJenkins
Author

Formatting is good now, but there are a few UT issues; please look for "FAIL:" here: caracal-ut.log

I think I fixed those issues, and the tests pass on my end:

(py38) root@b79d05eecfcb:~# python -m subunit.run networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread | subunit2pyunit
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_errors_continue_to_log
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_errors_continue_to_log ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_does_nothing_when_not_master
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_does_nothing_when_not_master ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_exception_stops_elector
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_exception_stops_elector ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_logs_error_when_over_max
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_logs_error_when_over_max ... ok
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_no_error_if_interval_under_max
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_monitor_no_error_if_interval_under_max ... ok
Periodic resync thread exiting.
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_resync_resets_time
networking_calico.plugins.ml2.drivers.calico.test.test_monitor_thread.TestResyncMonitorThread.test_resync_resets_time ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.605s

OK

but I'm a little bit confused since the tests passed previously as well.

@nelljerram
Member

/sem-approve

@LeoBJenkins
Author

I don't think we have any formal doc about this, but it's just a sentence or two summarising the change. Like one of the bullet points here: https://docs.tigera.io/calico/latest/release-notes/#enhancements

Great, thanks. I added a release note to the PR description.

Member

@nelljerram nelljerram left a comment

Just two remaining questions now:

  • about the epoch increment coding
  • about the ideal sleep time.

@nelljerram nelljerram dismissed their stale review January 2, 2026 11:30

Remaining questions are more debatable

@nelljerram nelljerram added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Jan 2, 2026
Only polls when things are going bad.