This repository was archived by the owner on May 20, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 2
This repository was archived by the owner on May 20, 2024. It is now read-only.
Switch filter and verify function #4
Copy link
Copy link
Open
Labels
Description
Currently using a Scalable bloom filter from github.com/tylertreat/BoomFilters but suspect it might not be performing optimally/correctly?
- Amount of hash functions doesn't seem to fit with optimal amount
- Amount of
cellsdoesn't seem to fit with optimal amount (see above) - Default hash function used results in inaccurate false probability rate
- Repo seemed to be a little messy in general: some models not documented, models missing common features found in other models, unhandled errors and use of
unsafe(impressions from the time I wrote feedloggr v3 and first foundBoomFilters)
Might want to try to implement a Stable Scalable Bloom filter myself instead:
- Stable Bloom won't keep growing forever (sort of deletes items from itself to stay with a fixed size)
- Easier to track, verify and document correctness
- Performance less relevant vs correctness
- One less unknown dependency
Or there might be another filter model that fits better? Some kind of Cuckoo filter?
Think I found the original paper for a scalable bloom filter though, doi:10.1016/j.ipl.2006.10.007
Investigate:
References:
- https://en.wikipedia.org/wiki/Bloom_filter
- https://ricardoanderegg.com/posts/understanding-bloom-filters/
- https://sagi.io/bloom-filters-for-the-perplexed/
- http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
- https://bdupras.github.io/filter-tutorial/
- https://llimllib.github.io/bloomfilter-tutorial/
- https://www.waitingforcode.com/big-data-algorithms/stable-bloom-filter/read
- https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
- https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
- https://news.ycombinator.com/item?id=33318788
- https://austinhenley.com/blog/challengingalgorithms.html seems to have some good examples? (also might want to save the article itself)
- https://gopiandcode.uk/logs/log-bloomfilters-debunked.html seems to have "the most correct formula" for calculating false positive rates?