Hi,
Is it possible to check for duplicates within an unbounded streaming data set, not checking against another static data source but against the data that has streamed so far?
The flow is as follows.
Source Database ---> CDC ---> Kafka ----> Stream Processing (invoke Duke for duplicate check) -> Target Database
I would like to build the index as data streams in from the CDC, keep adding new records to it, and at the same time search it for each incoming message. What is the way to do this? Or do we always need at least two static data sets to find duplicates?
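To make the pattern concrete, here is a minimal Python sketch of what I mean by an incrementally built index: each message is first matched against everything indexed so far, then added to the index itself. This is only an illustration of the idea, not Duke's actual API; the blocking key (first three letters of a `name` field) and the similarity threshold are made-up assumptions.

```python
from difflib import SequenceMatcher

class StreamingDedupIndex:
    """Incrementally built index: match each incoming record against
    everything seen so far, then index the record itself."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.blocks = {}  # blocking key -> list of previously seen records

    def _block_key(self, record):
        # Hypothetical blocking key: first 3 letters of the name field.
        return record["name"][:3].lower()

    def _similarity(self, a, b):
        return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

    def process(self, record):
        """Return duplicates among records seen so far, then index this one."""
        key = self._block_key(record)
        candidates = self.blocks.get(key, [])
        dups = [c for c in candidates if self._similarity(record, c) >= self.threshold]
        self.blocks.setdefault(key, []).append(record)
        return dups

# Simulated stream of CDC messages arriving from Kafka.
stream = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "Globex Inc"},
    {"id": 3, "name": "Acme Corporation Ltd"},
]
index = StreamingDedupIndex()
for msg in stream:
    dups = index.process(msg)
    if dups:
        print(msg["id"], "matches", [d["id"] for d in dups])
```

The key point is that no second static data set exists: the index starts empty and grows one message at a time, with the lookup happening before each insert.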
Thank you.