Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@adriangb implemented a really neat bloom filter sizing optimization in
A problem with most Parquet bloom filter implementations is that, to avoid wasting a lot of space, you need to know the Number of Distinct Values (NDV) in advance, which is challenging for most systems to configure correctly. This PR very nicely handles the case where the number of distinct values is not known beforehand -- the bloom filter undergoes a folding process.
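For context on why the NDV matters, the textbook sizing formula ties the minimal filter size directly to the number of distinct values and the target false positive probability: m = -n * ln(p) / (ln 2)^2 bits for n distinct values at probability p. A small hedged sketch of that formula (this is the classic bloom filter math, not the arrow-rs code; Parquet's split-block filters round sizes differently):

```python
import math

def optimal_num_bits(ndv: int, fpp: float) -> int:
    # Classic bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
    # Guessing the NDV too high wastes space; too low inflates the
    # false positive rate -- which is why adaptive sizing is attractive.
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))

# e.g. one million distinct values at a 1% false positive rate needs
# roughly 9.6 million bits (about 1.2 MB) -- per column chunk.
size = optimal_num_bits(1_000_000, 0.01)
```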
So if point lookups are important, systems can now enable bloom filters while writing Parquet and the writer will automatically resize the filters to the minimal size needed for a given false positive rate. This technique requires more CPU to compute and fold the bloom filters, but is very effective and results in bloom filters that are byte-for-byte identical to those produced if the NDV had been known beforehand.
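The core folding trick can be illustrated with a toy filter: if bit positions are always taken modulo a power-of-two filter size, OR-ing the upper half of the bit array into the lower half halves the filter while preserving membership of every inserted key. The sketch below is a simplified, hypothetical illustration only -- Parquet's actual split-block bloom filter (SBBF) layout and the arrow-rs implementation differ in detail:

```python
import hashlib

class FoldableBloomFilter:
    """Toy bloom filter whose bit positions are taken modulo a
    power-of-two size, so halving (folding) is lossless for
    previously inserted items. Illustration only."""

    def __init__(self, num_bits: int, num_hashes: int = 4):
        assert num_bits & (num_bits - 1) == 0, "size must be a power of two"
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity
        self.num_hashes = num_hashes

    def _positions(self, item: str):
        # Derive k independent positions; modulo the *current* size.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % len(self.bits)

    def insert(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def contains(self, item: str) -> bool:
        return all(self.bits[p] for p in self._positions(item))

    def fold(self) -> None:
        # OR the upper half into the lower half. Because m is a power
        # of two, h % (m // 2) == (h % m) % (m // 2), so every set bit
        # lands exactly where a lookup against the smaller filter probes.
        half = len(self.bits) // 2
        self.bits = bytearray(
            a | b for a, b in zip(self.bits[:half], self.bits[half:])
        )
```

A writer can therefore start with a generously sized filter and, once the true NDV is known, fold it repeatedly until it reaches the smallest size that still meets the target false positive rate.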
The high level idea is mentioned in other sources
However, it was not immediately apparent that those techniques apply to Parquet
Describe the solution you'd like
It would be great to write a blog post, ideally on one of
Explaining how the technique works
The idea would be:
- Advertise that Parquet bloom filters are more useful now in arrow-rs
- Explain enough of the implementation that other Parquet implementations could implement it
Describe alternatives you've considered
Additional context