Implement series cache invalidation #1752
JamesGuthrie merged 1 commit into feature/series-cache-invalidation from
Conversation
Force-pushed from 71e1489 to a773990
antekresic left a comment
Overall looks good; a couple of questions/suggestions left.
| err, _ = insertSeries(ctx, conn, true, reqs...)
| }
| if isEpochAbort(err) {
|     // Epoch abort is non-recoverable. Return error here instead of retrying.
Just to confirm, this is non-recoverable because the series metadata is dropped at this point?
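To illustrate the control flow under discussion: a minimal sketch of a retry loop that bails out on an epoch abort instead of retrying. `insertSeries` and `isEpochAbort` are the names from the diff; everything else (the error value, the loop shape) is a hypothetical stand-in, not the real promscale implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errEpochAbort is a hypothetical stand-in for the database error raised
// when the insert's epoch assertion fails.
var errEpochAbort = errors.New("epoch abort: series cache was stale")

func isEpochAbort(err error) bool {
	return errors.Is(err, errEpochAbort)
}

// insertWithRetry sketches the control flow around insertSeries: transient
// errors are retried, but an epoch abort is returned immediately, because
// the cached series ids may refer to rows that have since been deleted.
func insertWithRetry(insert func() error, maxRetries int) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		err = insert()
		if err == nil {
			return nil
		}
		if isEpochAbort(err) {
			// Non-recoverable: the series metadata behind the cached
			// ids may already be gone, so retrying cannot succeed.
			return err
		}
	}
	return err
}

func main() {
	err := insertWithRetry(func() error { return errEpochAbort }, 3)
	fmt.Println(isEpochAbort(err)) // true: returned without retrying
}
```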
pkg/pgmodel/ingestor/ingestor.go (Outdated)
| }
| // Normalize and canonicalize ts.Labels.
| // After this point ts.Labels should never be used again.
|
|
| return
| }
| if cacheCurrentEpoch.After(dbCurrentEpoch) {
|     log.Warn("msg", "The connector's cache epoch is greater than the databases. This is unexpected", "connector_current_epoch", cacheCurrentEpoch, "database_current_epoch", dbCurrentEpoch)
I'd flush the cache in this case. Otherwise we will just get spammed by these messages, and the cache epoch is invalid anyway.
If we do flush the cache, it would probably make sense to also reset the cache's epoch to the epoch we just got from the DB. Seems like a sensible move.
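The behaviour the reviewers converge on here can be sketched as follows. This is a hypothetical miniature (`seriesCache`, `reconcileEpoch`, and the map-based storage are all illustrative names, not the real cache types): when the cache's epoch is somehow ahead of the database's, flush the cache and adopt the database epoch instead of just logging a warning on every check.

```go
package main

import "fmt"

// seriesCache is a hypothetical miniature of the connector's series cache,
// holding cached series ids keyed by label set, plus the epoch under which
// they were read.
type seriesCache struct {
	entries      map[string]int64
	currentEpoch int64
}

// reconcileEpoch flushes the cache and resets its epoch to the database's
// value whenever the cache claims to be ahead of the database, which is an
// invalid state.
func (c *seriesCache) reconcileEpoch(dbCurrentEpoch int64) {
	if c.currentEpoch > dbCurrentEpoch {
		// Invalid state: drop everything and resynchronise with the DB.
		c.entries = make(map[string]int64)
		c.currentEpoch = dbCurrentEpoch
	}
}

func main() {
	c := &seriesCache{
		entries:      map[string]int64{`metric{job="a"}`: 42},
		currentEpoch: 200,
	}
	c.reconcileEpoch(100) // DB reports an older epoch than the cache claims
	fmt.Println(len(c.entries), c.currentEpoch) // 0 100
}
```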
Force-pushed from a773990 to d70d62c
The first pass. I still need to take a better look at the tests.
Assuming I understand the following correctly, I think it more or less matches our model:
There is always a single Ingestor per connector process. The Ingestor spawns a single Dispatcher, which, in turn, spawns several Copier goroutines. Ingestor and Dispatcher together act as one of CacheRefreshWorkers from the model.
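A toy rendering of the topology described above: one Dispatcher fanning work out to several Copier goroutines over a channel. All names here are illustrative shorthand for the review discussion, not the real promscale types.

```go
package main

import (
	"fmt"
	"sync"
)

// request stands in for a batch of samples handed to a Copier.
type request struct{ id int }

// runDispatcher spawns numCopiers goroutines that drain the shared request
// channel, and returns once the channel is closed and fully drained.
func runDispatcher(reqs <-chan request, numCopiers int) {
	var wg sync.WaitGroup
	for i := 0; i < numCopiers; i++ {
		wg.Add(1)
		go func() { // one Copier goroutine
			defer wg.Done()
			for range reqs {
				// each Copier takes batches off the channel and would
				// write them to the database here
			}
		}()
	}
	wg.Wait()
}

func main() {
	reqs := make(chan request)
	go func() {
		for i := 0; i < 10; i++ {
			reqs <- request{id: i}
		}
		close(reqs)
	}()
	runDispatcher(reqs, 4)
	fmt.Println("all copiers drained")
}
```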
sumerman left a comment
Looks great. There's one extra-paranoia point that we discussed offline.
| return model.SeriesEpoch(dbCurrentEpoch), model.SeriesEpoch(dbDeleteEpoch), nil
| }
|
|
|
| func (p *PgxDispatcher) RefreshSeriesEpoch() {
My quick experiments with the TLA model suggest that it is safe as it is, but my paranoia wants a lock for this method to prevent more than one instance of it from running concurrently. The concurrent attempt doesn't have to block on the mutex; TryLock and bail should be sufficient.
I just observed this happening in the test setup. TryLock is a smart approach. Done.
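The TryLock-and-bail pattern agreed on here can be sketched like this (a minimal illustration, not the real `RefreshSeriesEpoch`; `refreshGuard` and `doRefresh` are hypothetical names). Note that `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// refreshGuard ensures at most one refresh runs at a time: a concurrent
// caller returns immediately instead of blocking on the mutex.
type refreshGuard struct {
	mu sync.Mutex
}

// refresh returns true if this call performed the refresh, false if it
// bailed because another refresh was already in flight. doRefresh stands
// in for the actual epoch-refresh logic.
func (g *refreshGuard) refresh(doRefresh func()) bool {
	if !g.mu.TryLock() { // non-blocking; requires Go 1.18+
		return false // another refresh is running: bail
	}
	defer g.mu.Unlock()
	doRefresh()
	return true
}

func main() {
	g := &refreshGuard{}
	ran := g.refresh(func() {})
	fmt.Println(ran) // true: the lock was free, so the refresh ran
}
```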
Force-pushed from f85d794 to 80ffe83
I've squashed down all of the changes during review and retargeted against the branch
This change implements invalidation of the series cache, and mechanisms to prevent the ingestion of data based on stale cache information.

In principle, the cache invalidation mechanism works as follows:

In the database, we track two values: `current_epoch` and `delete_epoch`. These are unix timestamps (which, for reasons of backwards-compatibility, are stored in a BIGINT field), updated every time that rows in the series table are deleted. `current_epoch` is set from `now()`, and `delete_epoch` is set from `now() - epoch_duration`. `epoch_duration` is a configurable parameter.

When a series row is to be deleted, instead of immediately deleting it, we set the `delete_epoch` column of the series row to the `current_epoch` timestamp (the time at which we decided that it will be deleted). After `epoch_duration` time elapses, the row is removed.

When the connector starts, it reads `current_epoch` from the database and stores this value with the series cache as `cache_current_epoch`. The connector periodically fetches the ids of series rows where `delete_epoch` is not null, together with `current_epoch`. It removes these entries from the cache, and updates `cache_current_epoch` to the value of `current_epoch` which was fetched from the database.

As the connector receives samples to be inserted, it tracks the smallest value of `cache_current_epoch` that it saw for that batch. When it comes to inserting the samples in a batch into the database, it asserts (in the database) that the cache which was read from was not stale. This is expressed as: `cache_current_epoch > delete_epoch`. If this is not the case, the insert is aborted.

This is a companion change to [1] which implements the database-side logic required for cache invalidation.

[1]: timescale/promscale_extension#529
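The staleness invariant from the description above can be modelled in a few lines. The real check runs inside the database; this is only an illustration of the comparison, with the variable names taken from the commit message.

```go
package main

import "fmt"

// insertAllowed models the database-side assertion described above: a
// batch insert is allowed only if the epoch under which the cache was
// read is strictly newer than the database's delete_epoch. Otherwise the
// cache may still hold ids of series rows already marked for deletion.
func insertAllowed(cacheCurrentEpoch, deleteEpoch int64) bool {
	return cacheCurrentEpoch > deleteEpoch
}

func main() {
	// Cache read at epoch 1000, deletions processed up to epoch 900: OK.
	fmt.Println(insertAllowed(1000, 900)) // true
	// Cache read at epoch 800, deletions processed up to epoch 900: abort.
	fmt.Println(insertAllowed(800, 900)) // false
}
```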
Force-pushed from 80ffe83 to df28b69