Skip to content

[Blog] Split Bloom Filter Folding #9671

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@adriangb implemented a really neat bloom filter sizing optimization in

A problem with most Parquet bloom filter implementations is that to avoid wasting a lot of space, you need to know beforehand the Number of Distinct Values (NDV), which is challenging for most systems to configure correctly. This PR very nicely handles the case where the number of distinct values is not known beforehand -- the bloom filter undergoes a folding process.

So if point lookups are important, systems can now enable bloom filters while writing Parquet and the writer will automatically resize the filters to the minimal size needed for a false positive percentage. This technique requires more CPU to compute and fold the bloom filters, but is very effective and results in byte-for-byte identical bloom filters than if the NVD was known beforehand

The high level idea is mentioned in other sources

However, it was not immediately apparently that those techniques do apply to parquet

Describe the solution you'd like
If would be great to write a blog post, ideally on one of

Explaining how the technique works

The idea would be:

  1. Advertise that Parquet bloom filters are more useful now in arrow-rs
  2. Explain enough of the implementation that other Parquet implementations could implement it

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementAny new improvement worthy of a entry in the changelog

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions