Commit c6ea0a5
Add bloom filter folding to automatically size SBBF filters (#9628)
## Summary
Bloom filters now support **folding mode**: allocate a conservatively
large filter (sized for the worst-case number of distinct values, NDV),
insert all values during writing, then fold the filter down at flush time
to the smallest size that still meets the target false-positive
probability (FPP). This eliminates the need to guess NDV upfront and
produces optimally-sized filters automatically.
### Changes
- `BloomFilterProperties.ndv` changed from `u64` to `Option<u64>` — when
`None` (new default), the filter is sized based on
`max_row_group_row_count`; when `Some(n)`, the explicit NDV is used
- `DEFAULT_BLOOM_FILTER_NDV` redefined to
`DEFAULT_MAX_ROW_GROUP_ROW_COUNT as u64` (was hardcoded `1_000_000`)
- Added `Sbbf::fold_to_target_fpp()` and supporting methods
(`num_folds_for_target_fpp`, `fold_n`, `num_blocks`) with comprehensive
documentation
- `flush_bloom_filter()` in both `ColumnValueEncoderImpl` and
`ByteArrayEncoder` now folds the filter before returning it
- New `create_bloom_filter()` helper in `encoder.rs` centralizes bloom
filter construction logic
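A minimal sketch of the sizing decision described in the first bullet. The names `sizing_ndv` and `initial_bits` are hypothetical and are not the crate's API. The bit-count formula is an assumption derived from the `FPP = fill^8` model used in this PR (with fill approximated as `1 - e^(-8n/m)`), and it ignores any rounding or clamping the real implementation may apply.

```rust
/// Hypothetical helper: choose the NDV used to size the initial filter.
/// `None` falls back to the maximum row count of a row group.
fn sizing_ndv(ndv: Option<u64>, max_row_group_row_count: u64) -> u64 {
    ndv.unwrap_or(max_row_group_row_count)
}

/// Hypothetical helper: bits needed to hold `ndv` values at `fpp`,
/// solving fpp = (1 - e^(-8n/m))^8 for m. No rounding is applied.
fn initial_bits(ndv: u64, fpp: f64) -> f64 {
    -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln()
}
```

With `ndv = None` the filter starts out sized for a full row group and relies on folding to shrink it; with `Some(n)` the caller's estimate sets the starting size.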
### How folding works
The SBBF fold operation merges adjacent block pairs (`block[2i] |
block[2i+1]`) via bitwise OR, halving the filter size. This differs from
standard Bloom filter folding (which merges halves at distance `m/2`)
because SBBF uses multiplicative hashing for block selection:
```
block_index = ((hash >> 32) * num_blocks) >> 32
```
When `num_blocks` is halved, the new index becomes
`floor(original_index / 2)`, so blocks `2i` and `2i+1` both map to block
`i` of the folded filter, which is exactly the pair the fold merged.
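The index relationship and the pairwise merge can be sketched as follows. This is an illustration under stated assumptions, not the code added by this PR: the `Block` alias and the function names are hypothetical, and a block is modelled as eight 32-bit words.

```rust
/// One SBBF block modelled as 8 x 32-bit words (256 bits).
type Block = [u32; 8];

/// Block selection by multiplicative hashing, following the formula above.
fn block_index(hash: u64, num_blocks: usize) -> usize {
    (((hash >> 32) * num_blocks as u64) >> 32) as usize
}

/// Fold once: OR each adjacent pair of blocks, halving the filter.
/// Requires an even number of blocks.
fn fold_once(blocks: &[Block]) -> Vec<Block> {
    assert!(blocks.len() % 2 == 0, "fold requires an even block count");
    blocks
        .chunks_exact(2)
        .map(|pair| {
            let mut merged = [0u32; 8];
            for (w, slot) in merged.iter_mut().enumerate() {
                *slot = pair[0][w] | pair[1][w];
            }
            merged
        })
        .collect()
}
```

For any even `n`, `block_index(h, n / 2)` equals `block_index(h, n) / 2`, because halving the multiplier and then flooring gives the same result as flooring first and then halving.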
The number of safe folds is determined analytically from the average
per-block fill rate `f` (the fraction of set bits): after `k` folds the
expected fill is `1 - (1-f)^(2^k)`, and because each lookup tests 8 bits,
the estimated `FPP = fill^8`. This requires only a single popcount scan
over the blocks (no scratch allocation), then O(log N) floating-point ops
to find the largest fold count that still meets the target. The actual
fold is then performed in a single pass.
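The estimate above can be sketched as below. This is a hedged illustration, not the PR's `num_folds_for_target_fpp`: the function names are hypothetical, blocks are modelled as `[u32; 8]`, and the fold limit simply stops when the block count becomes odd.

```rust
/// Fraction of set bits across all 256-bit blocks (one popcount scan).
fn average_fill(blocks: &[[u32; 8]]) -> f64 {
    let set_bits: u64 = blocks
        .iter()
        .flatten()
        .map(|word| u64::from(word.count_ones()))
        .sum();
    set_bits as f64 / (blocks.len() as f64 * 256.0)
}

/// Estimated FPP after `k` folds: a bit stays clear only if it was clear
/// in all 2^k source blocks, and a lookup tests 8 bits.
fn fpp_after_folds(fill: f64, k: u32) -> f64 {
    let folded_fill = 1.0 - (1.0 - fill).powf((1u64 << k) as f64);
    folded_fill.powi(8)
}

/// Largest fold count whose estimated FPP stays within `target_fpp`.
fn folds_for_target(blocks: &[[u32; 8]], target_fpp: f64) -> u32 {
    if blocks.is_empty() {
        return 0;
    }
    let fill = average_fill(blocks);
    // Each fold halves the block count, so stop once it would be odd.
    let max_folds = blocks.len().trailing_zeros();
    let mut k = 0;
    while k < max_folds && fpp_after_folds(fill, k + 1) <= target_fpp {
        k += 1;
    }
    k
}
```

The loop runs at most `log2(num_blocks)` times, which matches the O(log N) cost stated above; the popcount scan in `average_fill` is the only pass over the filter data.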
### Benchmarks
Filter sized for 1M NDV, varying actual distinct values inserted.
Measured on Apple M3 Pro.
**Fold overhead (fold_to_target_fpp only):**
| Actual NDV | Time | Throughput |
|---|---|---|
| 1,000 | 39.1 µs | 838 Melem/s |
| 10,000 | 34.2 µs | 960 Melem/s |
| 100,000 | 32.5 µs | 1.01 Gelem/s |
**End-to-end (insert + fold) vs insert-only:**
| Actual NDV | Insert only | Insert + fold | Fold overhead (share of insert + fold) |
|---|---|---|---|
| 1,000 | 14.7 µs | 49.1 µs | 34.4 µs (70%) |
| 10,000 | 30.7 µs | 58.1 µs | 27.4 µs (47%) |
| 100,000 | 162.5 µs | 189.8 µs | 27.3 µs (14%) |
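The fold cost is dominated by the popcount scan over the initial (large)
filter, which is why it stays roughly constant (about 27 to 39 µs across
both tables) regardless of how many values were inserted. For the common
case (100K values into a 1M-NDV filter), folding accounts for only ~14%
of the total insert+fold time.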
### References
Sailhan & Stehr, ["Folding and Unfolding Bloom
Filters"](https://hal.science/hal-01126174v1/document), IEEE iThings
2012.
Liang, ["Blocked Bloom Filters: Speeding Up Point Lookups in Tiger
Postgres' Native
Columnstore"](https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore)
### Breaking changes
There are no breaking API changes.
However, default behavior changes: when bloom filters are enabled without
specifying the number of distinct values, the filter is now sized from
`max_row_group_row_count` and then folded down to the target FPP at flush
time. Previously it was sized using the fixed default
`DEFAULT_BLOOM_FILTER_NDV` (`1_000_000`).
## Test plan
- [x] All existing bloom filter unit tests pass
- [x] All existing integration tests (sync + async reader roundtrips)
pass
- [x] New unit tests: fold correctness, no false negatives after
folding, FPP target respected, minimum size guard
- [x] New unit tests: folded filter is bit-identical to a fresh filter
of the same size (proves correctness via two lemmas about SBBF hashing)
- [x] New unit tests: multi-step folding, folded FPP matches fresh FPP
empirically, fold size matches optimal fixed-size filter
- [x] New integration test: `i32_column_bloom_filter_fixed_ndv` —
roundtrip with both overestimated and underestimated NDV
- [x] Full `cargo test -p parquet` passes
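The two central properties in the test plan (no false negatives after folding, and a folded filter being bit-identical to a fresh filter of the same size) can be illustrated with a toy SBBF. This is a sketch under assumptions, not the PR's test code: all names are hypothetical, the salt constants are taken from the Parquet bloom filter specification as I understand it, and `next_hash` is a splitmix64-style stand-in for real hashed column values.

```rust
type Block = [u32; 8];

// Salt constants assumed from the Parquet bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Per-hash mask: one bit in each of the 8 words, from the low 32 hash bits.
fn mask(hash: u64) -> Block {
    let x = hash as u32;
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

fn block_index(hash: u64, num_blocks: usize) -> usize {
    (((hash >> 32) * num_blocks as u64) >> 32) as usize
}

fn insert(blocks: &mut [Block], hash: u64) {
    let idx = block_index(hash, blocks.len());
    let m = mask(hash);
    for i in 0..8 {
        blocks[idx][i] |= m[i];
    }
}

fn check(blocks: &[Block], hash: u64) -> bool {
    let idx = block_index(hash, blocks.len());
    let m = mask(hash);
    (0..8).all(|i| blocks[idx][i] & m[i] != 0)
}

/// OR adjacent block pairs; requires an even block count.
fn fold_once(blocks: &[Block]) -> Vec<Block> {
    blocks
        .chunks_exact(2)
        .map(|p| {
            let mut out = [0u32; 8];
            for i in 0..8 {
                out[i] = p[0][i] | p[1][i];
            }
            out
        })
        .collect()
}

/// Toy splitmix64-style generator standing in for hashed values.
fn next_hash(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}
```

Inserting the same hashes into a 64-block filter and a 32-block filter, then folding the larger one once, should yield identical block arrays, since the mask depends only on the low 32 bits of the hash and the block index halves exactly.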
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
1 parent 6f02fcf commit c6ea0a5
File tree: 7 files changed (+876, -50 lines)
- parquet
  - benches
  - src
    - arrow/arrow_writer
    - bloom_filter
    - column/writer
    - file