Skip to content

[parquet] Add BloomFilterProperties builder API to make bloom filter configuration explicit #9667

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

This PR added bloom filter folding and improved the default sizing behavior for bloom filters.

Today, BloomFilterProperties is still a public struct with public fields, which makes the API harder to evolve cleanly. @adriangb came up with a way to use implicit internal state to avoid breaking the public API but that is somewhat confusing and not explicit in the APIs

While reviewing #9628, I also found a related behavioral issue in the current builder flow: disabling and re-enabling bloom filters does not fully reset the hidden "explicit NDV was set" state. That means re-enabling can preserve stale configuration intent instead of returning to the default behavior. The relevant disable path is also not covered by tests, and the per-column implicit-NDV resolution path is currently uncovered as well.

Describe the solution you'd like

Add a BloomFilterPropertiesBuilder and make bloom filter configuration more explicit.

In particular, this follow-on change should aim to:

  • introduce a dedicated builder for bloom filter configuration
  • make the differences between these cases clear:
    • default NDV derived from row group sizing
    • explicitly configured NDV
    • explicitly configured FPP
    • enabling/disabling bloom filters
  • Make BloomFilterProperties fields non pub
  • add relevant accessors

Describe alternatives you've considered

Keep the current API and add more documentation around the current semantics.

For example, a more explicit API could look something like:

let props = WriterProperties::builder()
    .set_column_bloom_filter_properties(
        ColumnPath::from("col"),
        BloomFilterProperties::new()
            .with_fpp(0.01)
            .with_max_ndv(10_000) // new builder PAIs
    )
    .build();

Compared with direct public-field configuration, a builder like this would make it clearer which values are explicitly configured and which are left to defaults.

Additional context

Follow-on from PR #9628:

Relevant review discussion:

Additional follow-up work for this issue should include regression tests for:

  • disable -> re-enable bloom filter configuration
  • column-specific bloom filter configuration with implicit NDV resolution
  • explicit vs default NDV behavior

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions