cmmasaba/deduplication

Deduplication

Deduplication is a technique for removing duplicate copies of repeated data. It is useful in many contexts, such as:

  • in storage systems, to reduce storage requirements
  • in network transfer, to reduce the number of bytes sent over the network
  • in message-oriented systems, to avoid processing the same message twice
  • in targeted ad systems, to avoid showing the user the same ad repeatedly
  • in product recommendation systems, to avoid showing the user the same product repeatedly

Deduplication systems can be categorized by when deduplication happens (items 1 and 2) and by where it happens (items 3 and 4):

  1. post-process deduplication: new data is first written to storage, and a separate process later scans it for duplicates.
  2. inline deduplication: duplicates are detected and eliminated as data arrives, before it is stored.
  3. target deduplication: deduplication is performed where the data is stored or processed.
  4. source deduplication: deduplication is performed where the data originates.
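
The timing difference between items 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not code from this repository: `inline_dedup` and `post_process_dedup` are hypothetical names, and an in-memory set stands in for whatever store a real system would use.

```python
from typing import Hashable, Iterable, Iterator, List


def inline_dedup(stream: Iterable[Hashable]) -> Iterator[Hashable]:
    """Drop duplicates as items arrive, before anything is stored."""
    seen = set()
    for item in stream:
        if item in seen:
            continue  # duplicate: it never reaches storage
        seen.add(item)
        yield item


def post_process_dedup(stored: List[Hashable]) -> List[Hashable]:
    """Scan data that was already stored and keep the first copy of each item."""
    seen = set()
    unique = []
    for item in stored:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

The two functions produce the same result; they differ in when the work happens. The inline version pays the lookup cost on every incoming item, while the post-process version temporarily needs room for the duplicates.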

I implemented three different approaches to deduplication, each with its own benefits. They are:

  1. Expiring key repository deduplicator.
  2. Bloom filter deduplicator.
  3. Cuckoo filter deduplicator.
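
To give a feel for the first two approaches, here is a rough in-memory sketch in Python. It is not the repository's implementation: the class names, the `is_duplicate` method, the default sizes, and the use of a plain dict and bytearray instead of Redis are all assumptions made for illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Iterator


class ExpiringKeyDeduplicator:
    """Remembers each key for ttl_seconds (in-memory stand-in for a key-value store)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._expires_at: Dict[str, float] = {}

    def is_duplicate(self, key: str) -> bool:
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return True  # key was seen within the TTL window
        # First sighting, or the earlier entry has expired: record it.
        self._expires_at[key] = now + self._ttl
        return False


class BloomFilterDeduplicator:
    """Fixed-size bit array: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self._num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._num_bits

    def is_duplicate(self, key: str) -> bool:
        seen = True
        for pos in self._positions(key):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self._bits[byte] & mask:
                seen = False  # at least one bit was unset, so the key is new
                self._bits[byte] |= mask
        return seen
```

The expiring-key version gives exact answers but stores every key until it expires. In Redis, a comparable check-and-set is commonly done with `SET key value NX EX ttl`. The Bloom filter uses constant memory at the cost of occasional false positives, and entries cannot be removed from it. A cuckoo filter offers a similar probabilistic membership test and additionally supports deletion.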

About

How to build a Redis-backed deduplicator, with examples featuring a key-value repository, a Bloom filter, and a cuckoo filter.
