Is your feature request related to a problem or challenge?
This PR added bloom filter folding and improved the default sizing behavior for bloom filters.
Today, BloomFilterProperties is still a public struct with public fields, which makes the API harder to evolve cleanly. @adriangb came up with a way to use implicit internal state to avoid breaking the public API but that is somewhat confusing and not explicit in the APIs
While reviewing #9628, I also found a related behavioral issue in the current builder flow: disabling and re-enabling bloom filters does not fully reset the hidden "explicit NDV was set" state. That means re-enabling can preserve stale configuration intent instead of returning to the default behavior. The relevant disable path is also not covered by tests, and the per-column implicit-NDV resolution path is currently uncovered as well.
Describe the solution you'd like
Add a BloomFilterPropertiesBuilder and make bloom filter configuration more explicit.
In particular, this follow-on change should aim to:
- introduce a dedicated builder for bloom filter configuration
- make the differences between these cases clear:
- default NDV derived from row group sizing
- explicitly configured NDV
- explicitly configured FPP
- enabling/disabling bloom filters
- Make
BloomFilterProperties fields non pub
- add relevant accessors
Describe alternatives you've considered
Keep the current API and add more documentation around the current semantics.
For example, a more explicit API could look something like:
let props = WriterProperties::builder()
.set_column_bloom_filter_properties(
ColumnPath::from("col"),
BloomFilterProperties::new()
.with_fpp(0.01)
.with_max_ndv(10_000) // new builder PAIs
)
.build();
Compared with direct public-field configuration, a builder like this would make it clearer which values are explicitly configured and which are left to defaults.
Additional context
Follow-on from PR #9628:
Relevant review discussion:
Additional follow-up work for this issue should include regression tests for:
- disable -> re-enable bloom filter configuration
- column-specific bloom filter configuration with implicit NDV resolution
- explicit vs default NDV behavior
Is your feature request related to a problem or challenge?
This PR added bloom filter folding and improved the default sizing behavior for bloom filters.
Today,
BloomFilterPropertiesis still a public struct with public fields, which makes the API harder to evolve cleanly. @adriangb came up with a way to use implicit internal state to avoid breaking the public API but that is somewhat confusing and not explicit in the APIsWhile reviewing #9628, I also found a related behavioral issue in the current builder flow: disabling and re-enabling bloom filters does not fully reset the hidden "explicit NDV was set" state. That means re-enabling can preserve stale configuration intent instead of returning to the default behavior. The relevant disable path is also not covered by tests, and the per-column implicit-NDV resolution path is currently uncovered as well.
Describe the solution you'd like
Add a
BloomFilterPropertiesBuilderand make bloom filter configuration more explicit.In particular, this follow-on change should aim to:
BloomFilterPropertiesfields non pubDescribe alternatives you've considered
Keep the current API and add more documentation around the current semantics.
For example, a more explicit API could look something like:
Compared with direct public-field configuration, a builder like this would make it clearer which values are explicitly configured and which are left to defaults.
Additional context
Follow-on from PR #9628:
Relevant review discussion:
Additional follow-up work for this issue should include regression tests for: