
Streaming data deduplication #265

@sridharpattem


Hi,
Is it possible to check for duplicates within an unbounded streaming data set, matching not against a separate static data source but against the data that has streamed in so far?

The flow is as follows.

Source Database ---> CDC ---> Kafka ----> Stream Processing (invoke Duke for duplicate check) -> Target Database

I would like to build the index as data streams in from the CDC, keep extending the index with new records, and at the same time search the index for each incoming message. What is the way to do this? Or do we always need at least two static data sets to find duplicates?
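To make the intended behaviour concrete, here is a rough sketch of the pattern in Python. This is not Duke's API; the class and function names are made up for illustration, and the blocking/matching logic is a toy stand-in for whatever Duke would do internally.

```python
class StreamingDeduplicator:
    """Keeps an in-memory index of all records seen so far; each new
    record is first matched against the index, then added to it."""

    def __init__(self, key_fn, match_fn):
        self.key_fn = key_fn      # blocking key: narrows the candidate set
        self.match_fn = match_fn  # pairwise comparison: True if duplicate
        self.index = {}           # blocking key -> list of indexed records

    def process(self, record):
        key = self.key_fn(record)
        candidates = self.index.get(key, [])
        duplicates = [c for c in candidates if self.match_fn(record, c)]
        # Index the record so later messages can match against it.
        self.index.setdefault(key, []).append(record)
        return duplicates


# Toy usage: block on lowercased surname, match on exact email.
dedup = StreamingDeduplicator(
    key_fn=lambda r: r["name"].split()[-1].lower(),
    match_fn=lambda a, b: a["email"] == b["email"],
)
print(dedup.process({"name": "Ann Smith", "email": "ann@x.com"}))  # [] - nothing seen yet
print(dedup.process({"name": "A. Smith", "email": "ann@x.com"}))   # matches the first record
```

The question is essentially whether Duke can play the role of `StreamingDeduplicator` here, with its Lucene index taking the place of the in-memory dictionary, so that only one (unbounded) data set is needed rather than two static ones.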

Thank you.
