Having duplicate images in a dataset creates two kinds of problems:
- It introduces bias into your dataset, giving your deep neural network additional opportunities to learn patterns specific to the duplicates.
- It hurts the ability of your model to generalize to new images outside of what it was trained on.

Identifying duplicates manually in a large dataset is a time-consuming and error-prone process. This project aims to remove duplicate images from the dataset automatically.
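
As a rough illustration of the idea, and not necessarily the method this project uses, the sketch below finds exact, byte-for-byte duplicates by hashing each file's contents with SHA-256. The function names, the `IMAGE_EXTENSIONS` set, and the `dry_run` flag are assumptions made for this example. Images that were resized or re-encoded will not match and would need a perceptual or feature-based comparison instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(dataset_dir):
    """Group image files by content hash; return only groups with 2+ files."""
    groups = defaultdict(list)
    # Sorted traversal makes the "first file kept" choice deterministic.
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def remove_duplicates(dataset_dir, dry_run=True):
    """Keep the first file of each duplicate group and remove the rest.

    With dry_run=True nothing is deleted; the function only reports
    which files would be removed.
    """
    removed = []
    for paths in find_exact_duplicates(dataset_dir):
        for extra in paths[1:]:
            if not dry_run:
                extra.unlink()
            removed.append(extra)
    return removed
```

Calling `remove_duplicates("path/to/dataset")` with the default `dry_run=True` lists the files that would be deleted, which makes it easy to review the result before running again with `dry_run=False`.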