Deduplication is a technique for removing duplicate copies of repeating data. It is useful in many different contexts, such as:
- in storage systems to reduce storage requirements
- in network transfer to reduce the number of bytes sent over the network
- in message-oriented systems to avoid processing the same message twice
- in targeted ads systems to avoid showing the user the same ad
- in product recommendation systems to avoid showing the user the same product
Deduplication systems can be categorized by when deduplication happens (post-process or inline) and by where it happens (target or source):
- post-process deduplication: new data is first stored on the device, and a separate process later analyzes it looking for duplicates.
- inline deduplication: duplicates are detected and eliminated as the data arrives, before it is stored.
- target deduplication: deduplication is done where the data is stored or processed.
- source deduplication: deduplication is done where the data is created.
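To make the difference between inline and post-process deduplication concrete, here is a minimal sketch. The function names `inline_dedup` and `post_process_dedup` are hypothetical and chosen only for illustration, and the sketch assumes the items are hashable and fit in memory.

```python
from typing import Iterable, Iterator, List


def inline_dedup(items: Iterable[str]) -> Iterator[str]:
    """Drop duplicates as items arrive, before anything is stored."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


def post_process_dedup(stored: List[str]) -> List[str]:
    """Scan data that was already stored and keep one copy of each item."""
    # dict.fromkeys keeps the first occurrence of each item, in order.
    return list(dict.fromkeys(stored))


incoming = ["a", "b", "a", "c", "b"]

# Inline: duplicates never reach storage.
inline_result = list(inline_dedup(incoming))

# Post-process: everything is written first, then cleaned up later.
stored = list(incoming)
post_result = post_process_dedup(stored)
```

Both paths end with the same unique items. The trade-off is that the inline path never stores the duplicates but does its work on the write path, while the post-process path temporarily needs room for all five items.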
I implemented three different approaches to deduplication, each with its own benefits. They are:
- Expiring key repository deduplicator.
- Bloom filter deduplicator.
- Cuckoo filter deduplicator.
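As a rough illustration of the first two approaches, the sketch below shows one possible shape for each. It is not the implementation described in this post: the class names, the `is_duplicate` method, the injectable `clock` parameter, and the default sizes are all assumptions made for the example. A cuckoo filter is omitted for brevity; it fills a similar role to the Bloom filter but can also support deleting keys.

```python
import hashlib
import time
from typing import Callable, Dict, Iterator


class ExpiringKeyDeduplicator:
    """Exact dedup: remembers each key for ttl_seconds, then forgets it."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the example can be tested
        self._expires_at: Dict[str, float] = {}

    def is_duplicate(self, key: str) -> bool:
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return True
        # First sighting, or the previous entry has expired: (re)register it.
        self._expires_at[key] = now + self._ttl
        return False


class BloomFilterDeduplicator:
    """Probabilistic dedup: no false negatives, false positives possible."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self._num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._num_bits

    def is_duplicate(self, key: str) -> bool:
        positions = list(self._positions(key))
        seen = all(self._bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self._bits[p // 8] |= 1 << (p % 8)
        return seen
```

The expiring key variant gives exact answers but stores every key until it expires, and this sketch only overwrites expired entries rather than evicting them. The Bloom filter variant uses a fixed amount of memory regardless of the number of keys, at the cost of occasionally reporting a new key as a duplicate.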