This repository was archived by the owner on Jan 20, 2025. It is now read-only.

PSC-STM-B6: Add tracking of which records have been transformed #254

@tiredpixel

Processing the same records more than once is not ideal, since each pass may replace the same statements again.

When we consume from S3, we transform each file only once, and when we consume from a Kinesis stream, we keep track of our stream pointer, so this doesn’t happen much in practice. However, when switching over from bulk files to the Kinesis stream, there is a danger that 48 hours or so of records will be processed more than once.

To fix this, it would make sense to keep track of the records transformed in the previous 48 hours, so these can be safely skipped.

  • When a record has been transformed, store the etag of the processed PSC record for a period longer than the maximum stream duration (e.g. 48 hours).
  • When transforming a PSC record, first check whether it has already been transformed in the last 48 hours, and skip it if so.

This will ensure that the same records don’t get processed multiple times in cases of duplicates or during the changeover.
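
The two steps above could be sketched roughly as follows. This is only an illustration in Python, not the transformer's actual code: `TransformedTracker`, `transform_once`, the `record["etag"]` layout, and the in-memory dictionary are all assumptions made for the example. A real deployment would presumably need a shared store with key expiry rather than process memory.

```python
import time
from typing import Callable, Dict

RETENTION_SECONDS = 48 * 60 * 60  # 48 hours, as proposed in this issue


class TransformedTracker:
    """Remembers etags of recently transformed records (hypothetical, in-memory)."""

    def __init__(self, ttl_seconds: float = RETENTION_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._seen: Dict[str, float] = {}  # etag -> time it was transformed

    def seen_recently(self, etag: str) -> bool:
        """Return True if this etag was marked within the retention window."""
        stamp = self._seen.get(etag)
        if stamp is None:
            return False
        if self._clock() - stamp >= self._ttl:
            del self._seen[etag]  # entry has expired, forget it
            return False
        return True

    def mark(self, etag: str) -> None:
        """Record that the record with this etag has just been transformed."""
        self._seen[etag] = self._clock()


def transform_once(record: dict, tracker: TransformedTracker,
                   transform: Callable[[dict], object]) -> bool:
    """Transform a record unless its etag was seen recently.

    Assumes the etag is available as record["etag"]; returns True if the
    record was transformed and False if it was skipped.
    """
    etag = record["etag"]
    if tracker.seen_recently(etag):
        return False
    transform(record)
    tracker.mark(etag)  # mark only after a successful transform
    return True
```

Marking the etag only after `transform` returns means a failed transform is retried rather than silently skipped, at the cost of a possible repeat if the process dies between the two steps.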

Estimate: 6 hours
