-
Notifications
You must be signed in to change notification settings - Fork 448
When purging room history, send cache replication message for _get_event_cache
#19404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…_event_cache` As described in element-hq/synapse-rust-apps#157
| self.invalidate_get_event_cache_after_txn(txn, event_id) | ||
| # Send that invalidation to replication so that other workers also invalidate | ||
| # the event cache. | ||
| self._send_invalidation_to_replication( | ||
| txn, "_get_event_cache", (event_id,) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just move the _send_invalidation_to_replication call inside invalidate_get_event_cache_after_txn instead of doing it separately?
Is there a reason that we sometimes invalidate_get_event_cache_after_txn but don't want to send it over replication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside of adding in additional replication call locations would be the potential for large amounts of replication overhead per event, which can consume a lot of resources on each worker.
In this case, it appears that the ph_cache_fake cache function is used to propagate a single replication message to invalidate events for the entire room (see below).
| self.invalidate_get_event_cache_after_txn(txn, event_id) | ||
| # Send that invalidation to replication so that other workers also invalidate | ||
| # the event cache. | ||
| self._send_invalidation_to_replication( | ||
| txn, "_get_event_cache", (event_id,) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/element-hq/synapse-rust-apps/issues/157 only calls out this spot we're fixing but there are a few more places that we call invalidate_get_event_cache_after_txn without calling _send_invalidation_to_replication 🤔
synapse/synapse/storage/databases/main/events.py
Lines 1095 to 1100 in 826a7dd
| # Once the txn completes, invalidate all of the relevant caches. Note that we do this | |
| # up here because it captures all the events_and_contexts before any are removed. | |
| for event, _ in events_and_contexts: | |
| self.store.invalidate_get_event_cache_after_txn(txn, event.event_id) | |
| if event.redacts: | |
| self.store.invalidate_get_event_cache_after_txn(txn, event.redacts) |
Related to the other discussion about doing it all the time: #19404 (comment)
| # Send that invalidation to replication so that other workers also invalidate | ||
| # the event cache. | ||
| self._send_invalidation_to_replication( | ||
| txn, "_get_event_cache", (event_id,) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To test this properly, I think this would be best served by some end-to-end Complement tests (WORKERS=1) but that means we need to introduce in-repo Complement tests to utilize the Synapse admin API's.
Perhaps we can also accomplish a test with BaseMultiWorkerStreamTestCase 🤞 although that's not very satisfying.
(putting this up for review without a test to answer the other questions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we have some Sytest tests for purging history: matrix-org/sytest -> tests/48admin.pl#L91-L618
The After /purge_history users still get pushed for new messages test is known to be flaky and even failed in the CI for this PR.
Since Sytest is a nightmare to use and debug, I'd rather just work on other tests than try to add a test that stresses the correct things there. And surprising that those tests haven't been failing (perhaps cache is already being cleared somehow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests passing may not be surprising anymore based on my latest knowledge outlined below: #19404 (comment)
|
|
||
| logger.info("[purge] done") | ||
|
|
||
| self._invalidate_caches_for_room_events_and_stream(txn, room_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Maybe it's not neccessary to add a replication line above because it is handled here.
The replication logic in synapse pro doesn't handle all streams yet, so I assumed at the time that we only needed to support the _get_event cache func.
It appears that Synapse relies on this invalidation of room history purging (which uses the ph_cache_fake cache func name) going out over replication in order to invalidate the get_event cache.
I think my original assessment of needing another outgoing replication message is wrong. And instead Synapse Pro just needs to add support for invalidating the get_event cache based on purge room history cache func replication messages.
Digging down here:
self._send_invalidation_to_replication(txn, PURGE_HISTORY_CACHE_NAME, [room_id]) elif row.cache_func == PURGE_HISTORY_CACHE_NAME: self._invalidate_caches_for_room_events(room_id) self._invalidate_local_get_event_cache_room_id(room_id) # type: ignore[attr-defined]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really appreciate you diving into this 🙇
This feels like a very brittle system if we forever have to align with the various replication tasks that Synapse does. We're basically having to copy all of the process_replication_rows(...) logic into the separate Synapse Pro codebase. And it's a total cat and mouse game where a feature could ship in Synapse and we forget to patch the replication in Synapse Pro rust apps with no way to catch what's wrong.
We could slightly mitigate things by having a known list of cache_func for the caches replication stream and blow up in CI/testing when we see an new/unknown cache_func that we need to handle before shipping things. This assumes proper test coverage of all features including stressing the replication. And this doesn't help the case where a bug is fixed for an existing cache_func condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it is seeming more and more fragile the closer I look.
Perhaps this is one to discuss in our Monday sync to get more minds involved in the discussion.
| self.invalidate_get_event_cache_after_txn(txn, event_id) | ||
| # Send that invalidation to replication so that other workers also invalidate | ||
| # the event cache. | ||
| self._send_invalidation_to_replication( | ||
| txn, "_get_event_cache", (event_id,) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside of adding in additional replication call locations would be the potential for large amounts of replication overhead per event, which can consume a lot of resources on each worker.
In this case, it appears that the ph_cache_fake cache function is used to propagate a single replication message to invalidate events for the entire room (see below).
When purging room history, send
cachereplication message for_get_event_cacheAs described in https://github.com/element-hq/synapse-rust-apps/issues/157
Dev notes
Relevant code:
synapse/synapse/storage/databases/main/events_worker.py
Lines 941 to 1019 in 826a7dd
invalidate_get_event_cache_after_txn_invalidate_async_get_event_cache_invalidate_local_get_event_cache_invalidate_local_get_event_cache_room_id_invalidate_async_get_event_cache_room_id_invalidate_caches_for_room_eventsPull Request Checklist
EventStoretoEventWorkerStore.".code blocks.