Conversation

Contributor

@MadLittleMods MadLittleMods commented Jan 22, 2026

When purging room history, send cache replication message for _get_event_cache

As described in https://github.com/element-hq/synapse-rust-apps/issues/157

Dev notes

Relevant code:

def invalidate_get_event_cache_after_txn(
    self, txn: LoggingTransaction, event_id: str
) -> None:
    """
    Prepares a database transaction to invalidate the get event cache for a given
    event ID when executed successfully. This is achieved by attaching two callbacks
    to the transaction, one to invalidate the async cache and one for the in memory
    sync cache (importantly called in that order).

    Arguments:
        txn: the database transaction to attach the callbacks to
        event_id: the event ID to be invalidated from caches
    """
    txn.async_call_after(self._invalidate_async_get_event_cache, event_id)
    txn.call_after(self._invalidate_local_get_event_cache, event_id)

async def _invalidate_async_get_event_cache(self, event_id: str) -> None:
    """
    Invalidates an event in the asynchronous get event cache, which may be remote.

    Arguments:
        event_id: the event ID to invalidate
    """
    await self._get_event_cache.invalidate((event_id,))

def _invalidate_local_get_event_cache(self, event_id: str) -> None:
    """
    Invalidates an event in local in-memory get event caches.

    Arguments:
        event_id: the event ID to invalidate
    """
    self._get_event_cache.invalidate_local((event_id,))
    self._event_ref.pop(event_id, None)
    self._current_event_fetches.pop(event_id, None)

def _invalidate_local_get_event_cache_room_id(self, room_id: str) -> None:
    """Clears the in-memory get event caches for a room.

    Used when we purge room history.
    """
    self._get_event_cache.invalidate_on_extra_index_local((room_id,))
    self._event_ref.clear()
    self._current_event_fetches.clear()

def _invalidate_async_get_event_cache_room_id(self, room_id: str) -> None:
    """
    Clears the async get_event cache for a room. Currently a no-op until
    an async get_event cache is implemented - see https://github.com/matrix-org/synapse/pull/13242
    for preliminary work.
    """

async def _get_events_from_cache(
    self, events: Iterable[str], update_metrics: bool = True
) -> dict[str, EventCacheEntry]:
    """Fetch events from the caches, both in memory and any external.

    May return rejected events.

    Args:
        events: list of event_ids to fetch
        update_metrics: Whether to update the cache hit ratio metrics
    """
    event_map = self._get_events_from_local_cache(
        events, update_metrics=update_metrics
    )

    missing_event_ids = [e for e in events if e not in event_map]
    event_map.update(
        await self._get_events_from_external_cache(
            events=missing_event_ids,
            update_metrics=update_metrics,
        )
    )

    return event_map
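The txn-callback mechanism above can be sketched with a minimal stand-in transaction. This is a toy, not Synapse's real LoggingTransaction: the point is just that both invalidations are deferred until the transaction commits, and fire in registration order (async/external cache first, then the local in-memory cache, as the docstring requires).

```python
import asyncio

class ToyTransaction:
    """Stand-in for Synapse's LoggingTransaction: collects callbacks to run
    only after the transaction commits successfully."""

    def __init__(self):
        self._after = []  # (is_async, fn, args)

    def call_after(self, fn, *args):
        self._after.append((False, fn, args))

    def async_call_after(self, fn, *args):
        self._after.append((True, fn, args))

    async def commit(self):
        # Callbacks fire in registration order: the async (possibly external)
        # cache invalidation first, then the local in-memory one.
        for is_async, fn, args in self._after:
            result = fn(*args)
            if is_async:
                await result

calls = []

async def invalidate_async(event_id):
    calls.append(("async", event_id))

def invalidate_local(event_id):
    calls.append(("local", event_id))

txn = ToyTransaction()
txn.async_call_after(invalidate_async, "$event1")
txn.call_after(invalidate_local, "$event1")
asyncio.run(txn.commit())
print(calls)  # [('async', '$event1'), ('local', '$event1')]
```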

  • invalidate_get_event_cache_after_txn
  • _invalidate_async_get_event_cache
  • _invalidate_local_get_event_cache
  • _invalidate_local_get_event_cache_room_id
  • _invalidate_async_get_event_cache_room_id

_invalidate_caches_for_room_events

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

Comment on lines 403 to +408
self.invalidate_get_event_cache_after_txn(txn, event_id)
# Send that invalidation to replication so that other workers also invalidate
# the event cache.
self._send_invalidation_to_replication(
    txn, "_get_event_cache", (event_id,)
)
Contributor Author

@MadLittleMods MadLittleMods Jan 23, 2026


Can we just move the _send_invalidation_to_replication call inside invalidate_get_event_cache_after_txn instead of doing it separately?

Is there a reason that we sometimes call invalidate_get_event_cache_after_txn but don't want to send the invalidation over replication?
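To make the proposal concrete, here is a hedged sketch (stand-in classes and a synchronous toy transaction, not Synapse's real ones) of folding the replication send into invalidate_get_event_cache_after_txn, so every caller gets both the cache invalidation and the replication message:

```python
class ToyTxn:
    """Toy transaction: runs registered callbacks on commit."""

    def __init__(self):
        self._after = []

    def call_after(self, fn, *args):
        self._after.append((fn, args))

    def commit(self):
        for fn, args in self._after:
            fn(*args)


class ToyStore:
    """Stand-in store showing the proposed shape: one entry point that both
    invalidates the cache and queues the replication message."""

    def __init__(self):
        self.log = []

    def _invalidate_local_get_event_cache(self, event_id):
        self.log.append(("local_invalidate", event_id))

    def _send_invalidation_to_replication(self, txn, cache_name, keys):
        # The real method writes rows the replication stream picks up;
        # here we just record the intent.
        self.log.append(("replicate", cache_name, keys))

    def invalidate_get_event_cache_after_txn(self, txn, event_id):
        txn.call_after(self._invalidate_local_get_event_cache, event_id)
        # Proposed addition: always notify other workers too.
        self._send_invalidation_to_replication(txn, "_get_event_cache", (event_id,))


store = ToyStore()
txn = ToyTxn()
store.invalidate_get_event_cache_after_txn(txn, "$event1")
txn.commit()
print(store.log)
# [('replicate', '_get_event_cache', ('$event1',)), ('local_invalidate', '$event1')]
```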

Member

@devonh devonh Jan 23, 2026


The downside of adding additional replication call locations is the potential for a large amount of replication overhead per event, which can consume a lot of resources on each worker.
In this case, it appears that the ph_cache_fake cache function is used to propagate a single replication message to invalidate events for the entire room (see below).
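The overhead argument can be illustrated with a toy comparison (illustrative numbers and a made-up room ID, nothing measured from Synapse): a per-event scheme sends one replication message per invalidated event, while the room-level ph_cache_fake approach sends a single message and lets each worker invalidate its own get_event entries for that room.

```python
# Toy comparison of per-event vs room-level invalidation traffic.
event_ids = [f"$event{i}" for i in range(1000)]  # a room being purged

# Per-event: one replication message per invalidated event.
per_event_messages = [("_get_event_cache", (eid,)) for eid in event_ids]

# Room-level: a single message keyed on the room; each worker then clears
# its local get_event entries for that room.
room_level_messages = [("ph_cache_fake", ("!room:example.com",))]

print(len(per_event_messages), len(room_level_messages))  # 1000 1
```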

Comment on lines 403 to +408
self.invalidate_get_event_cache_after_txn(txn, event_id)
# Send that invalidation to replication so that other workers also invalidate
# the event cache.
self._send_invalidation_to_replication(
    txn, "_get_event_cache", (event_id,)
)
Contributor Author


https://github.com/element-hq/synapse-rust-apps/issues/157 only calls out this spot we're fixing, but there are a few more places where we call invalidate_get_event_cache_after_txn without calling _send_invalidation_to_replication 🤔

# Once the txn completes, invalidate all of the relevant caches. Note that we do this
# up here because it captures all the events_and_contexts before any are removed.
for event, _ in events_and_contexts:
    self.store.invalidate_get_event_cache_after_txn(txn, event.event_id)
    if event.redacts:
        self.store.invalidate_get_event_cache_after_txn(txn, event.redacts)

Related to the other discussion about doing it all the time: #19404 (comment)

Comment on lines +404 to +408
# Send that invalidation to replication so that other workers also invalidate
# the event cache.
self._send_invalidation_to_replication(
    txn, "_get_event_cache", (event_id,)
)
Contributor Author

@MadLittleMods MadLittleMods Jan 23, 2026


To test this properly, I think this would be best served by some end-to-end Complement tests (WORKERS=1), but that means we need to introduce in-repo Complement tests to utilize the Synapse admin APIs.

Perhaps we can also accomplish a test with BaseMultiWorkerStreamTestCase 🤞 although that's not very satisfying.

(putting this up for review without a test to answer the other questions)

Contributor Author


It looks like we have some Sytest tests for purging history: matrix-org/sytest -> tests/48admin.pl#L91-L618

The After /purge_history users still get pushed for new messages test is known to be flaky and even failed in the CI for this PR.

Since Sytest is a nightmare to use and debug, I'd rather work on other tests than try to add a test there that stresses the correct things. It's also surprising that those tests haven't been failing (perhaps the cache is already being cleared somehow).

Member

@devonh devonh Jan 23, 2026


The tests passing may not be surprising anymore based on my latest knowledge outlined below: #19404 (comment)

@MadLittleMods MadLittleMods marked this pull request as ready for review January 23, 2026 02:49
@MadLittleMods MadLittleMods requested a review from a team as a code owner January 23, 2026 02:49

logger.info("[purge] done")

self._invalidate_caches_for_room_events_and_stream(txn, room_id)
Member


Hmm. Maybe it's not necessary to add a replication line above, because it is handled here.

The replication logic in Synapse Pro doesn't handle all streams yet, so I assumed at the time that we only needed to support the _get_event cache func.
It appears that Synapse relies on this purge-room-history invalidation (which uses the ph_cache_fake cache func name) going out over replication in order to invalidate the get_event cache.

I think my original assessment of needing another outgoing replication message is wrong. Instead, Synapse Pro just needs to add support for invalidating the get_event cache based on the purge room history cache func replication messages.

Digging down here:
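In code terms, the Pro-side handling could look roughly like the following hedged sketch (written in Python for illustration; the real consumer would live in the rust apps, and the handler names here are made up). It dispatches on the cache_func name carried by caches-stream rows, with ph_cache_fake mapped to a room-level get_event invalidation:

```python
# Toy replication consumer: dispatch on the cache_func name of a caches-stream row.
invalidated = []

def invalidate_event(keys):
    # Single-event invalidation, keyed on the event ID.
    invalidated.append(("event", keys[0]))

def invalidate_room_events(keys):
    # Purge-history invalidation: drop every cached event for the room.
    invalidated.append(("room", keys[0]))

CACHE_FUNC_HANDLERS = {
    "_get_event_cache": invalidate_event,
    # The purge path sends its invalidation under this fake cache name:
    "ph_cache_fake": invalidate_room_events,
}

def process_replication_row(cache_func, keys):
    handler = CACHE_FUNC_HANDLERS.get(cache_func)
    if handler is None:
        raise ValueError(f"unhandled cache_func: {cache_func}")
    handler(keys)

process_replication_row("ph_cache_fake", ("!room:example.com",))
print(invalidated)  # [('room', '!room:example.com')]
```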

Contributor Author


Really appreciate you diving into this 🙇

This feels like a very brittle system if we forever have to stay aligned with the various replication tasks that Synapse does. We're basically having to copy all of the process_replication_rows(...) logic into the separate Synapse Pro codebase. And it's a total cat-and-mouse game where a feature could ship in Synapse, we forget to patch the replication in the Synapse Pro rust apps, and there's no way to catch what's wrong.

We could slightly mitigate things by having a known list of cache_func names for the caches replication stream and blowing up in CI/testing when we see a new/unknown cache_func that we need to handle before shipping. This assumes proper test coverage of all features, including stressing the replication. And it doesn't help the case where a bug is fixed for an existing cache_func condition.
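A minimal sketch of that guard, assuming a hand-maintained allowlist (the names below are examples only, not an exhaustive inventory of Synapse's cache_func values):

```python
# Allowlist guard: fail loudly (e.g. in CI/tests) on any cache_func name
# that Synapse Pro doesn't yet know how to handle.
KNOWN_CACHE_FUNCS = {
    "_get_event_cache",
    "ph_cache_fake",
}

class UnknownCacheFuncError(Exception):
    pass

def check_cache_func(cache_func: str) -> None:
    if cache_func not in KNOWN_CACHE_FUNCS:
        raise UnknownCacheFuncError(
            f"Replication sent unknown cache_func {cache_func!r}; "
            "add handling before shipping."
        )

check_cache_func("_get_event_cache")  # known: passes silently

caught = False
try:
    check_cache_func("brand_new_cache")
except UnknownCacheFuncError:
    caught = True
print(caught)  # True
```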

Member


Indeed it is seeming more and more fragile the closer I look.
Perhaps this is one to discuss in our Monday sync to get more minds involved in the discussion.
