
Streaming data deduplication #265

@sridharpattem


Hi,
Is it possible to check for duplicates within an unbounded streaming data set, matching not against a separate static data source but against the data that has streamed in so far?

The flow is as follows.

Source Database ---> CDC ---> Kafka ----> Stream Processing (invoke Duke for duplicate check) -> Target Database

I would like to build the index as data streams in from the CDC, keep extending the index with new records, and at the same time search the index for each incoming message. What is the way to do this? Or do we always need at least two static data sets to find duplicates?
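To make the intended behaviour concrete, here is a rough sketch of the pattern in Python. This is not Duke's API; the class and function names are made up for illustration, and the blocking/matching logic is a toy stand-in for whatever Duke would do internally.

```python
class StreamingDeduplicator:
    """Keeps an in-memory index of all records seen so far; each new
    record is first matched against the index, then added to it."""

    def __init__(self, key_fn, match_fn):
        self.key_fn = key_fn      # blocking key: narrows the candidate set
        self.match_fn = match_fn  # pairwise comparison: True if duplicate
        self.index = {}           # blocking key -> list of indexed records

    def process(self, record):
        key = self.key_fn(record)
        candidates = self.index.get(key, [])
        duplicates = [c for c in candidates if self.match_fn(record, c)]
        # Index the record so later messages can match against it.
        self.index.setdefault(key, []).append(record)
        return duplicates


# Toy usage: block on lowercased surname, match on exact email.
dedup = StreamingDeduplicator(
    key_fn=lambda r: r["name"].split()[-1].lower(),
    match_fn=lambda a, b: a["email"] == b["email"],
)
print(dedup.process({"name": "Ann Smith", "email": "ann@x.com"}))  # [] - nothing seen yet
print(dedup.process({"name": "A. Smith", "email": "ann@x.com"}))   # matches the first record
```

The question is essentially whether Duke can play the role of `StreamingDeduplicator` here, with its Lucene index taking the place of the in-memory dictionary, so that only one (unbounded) data set is needed rather than two static ones.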

Thank you.
