perf: add Bloom Filter for negative term lookups #130
Open
matheusvir wants to merge 4 commits into Sygil-Dev:main
Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>


What was done
Implemented Bloom filters to reduce latency on negative term lookups. When searching for a term that does not exist in the index, the engine previously had to perform expensive on-disk hash table reads before confirming the term's absence. With this change, a compact in-memory Bloom filter is checked first; if it reports the term as definitely absent, the disk read is skipped entirely.
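As a rough illustration of the pre-check described above, here is a minimal sketch. It is not the code in this PR: `TinyBloom` and `GuardedTerms` are hypothetical names, the in-memory `set` merely stands in for the on-disk hash table, and the way MD5 and SHA-256 are combined into bit positions is an assumption about how a double-hashing scheme could look.

```python
import hashlib


class TinyBloom:
    """Illustrative Bloom filter (hypothetical; not the PR's BloomFilter)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: position_i = (h1 + i * h2) mod num_bits.
        h1 = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
        # Force h2 to be odd so the stride is never zero.
        h2 = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(key)
        )


class GuardedTerms:
    """Hypothetical terms reader that consults the filter before 'disk'."""

    def __init__(self, terms):
        self._disk = set(terms)  # stand-in for the on-disk hash table
        self.disk_reads = 0      # counts how often the slow path runs
        self._bloom = TinyBloom()
        for term in self._disk:
            self._bloom.add(term)

    def __contains__(self, term: bytes) -> bool:
        if term not in self._bloom:
            return False         # definitely absent: skip the disk read
        self.disk_reads += 1     # possibly present: fall back to the lookup
        return term in self._disk
```

A term that was added always reaches the slow path, so existing terms are never hidden; an absent term only reaches it when the filter reports a false positive.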
Key changes:
- `src/whoosh/support/bloom.py`: Standalone `BloomFilter` class with double-hashing scheme (MD5 + SHA-256), configurable false positive rate, serialization/deserialization support, and filter merging capability.
- `src/whoosh/codec/whoosh3.py`: Integration into the W3 codec:
  - `W3FieldWriter` collects all term keys during indexing and writes a `.blm` file on close.
  - `W3TermsReader` loads the Bloom filter on initialization and short-circuits `__contains__`, `term_info`, `frequency`, and `doc_frequency` for absent terms.
  - `W3Codec` exposes `bloom_enabled` and `bloom_false_positive_rate` parameters (enabled by default, 1% FPR).
- `tests/test_bloom.py`: 8 classes, 35 tests (see Tests section below).
Design decisions:
- `.blm` files are handled gracefully (Bloom check is simply skipped).
Tests
A new test file `tests/test_bloom.py` was added with 8 classes and 35 tests:
- `TestBloomFilterUnit`: standalone `BloomFilter` correctness, covering `add()`/`__contains__`, zero-false-negative guarantee over 500 items, false positive rate within bounds, UTF-8 key encoding, `size_bytes`, `__repr__`.
- `TestBloomFilterSerialization`: `to_bytes()`/`from_bytes()` round-trips; invalid magic bytes and truncated data raise `ValueError`.
- `TestBloomFilterMerge`: merging two filters combines their items; mismatched parameters raise `ValueError`.
- `TestBloomFilterOptimalParams`: `_optimal_num_bits()` and `_optimal_num_hashes()` produce correct values; edge inputs (0, -1) are handled.
- `TestBloomCodecIntegration`: `W3Codec` writes a `.blm` file when enabled and skips it when disabled; negative term lookups are rejected by the filter; `term_info`, `frequency`, and `doc_frequency` short-circuit for absent terms; backward compatibility when no `.blm` file exists.
- `TestBloomEndToEnd`: full indexing and searching across TEXT, KEYWORD, and ID fields; multi-segment indexes; segment optimization (merge) preserves the filter.
All existing tests continue to pass with no regressions.
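For readers weighing the filter sizing strategy, the textbook Bloom filter formulas can be sketched as below. This assumes the PR's `_optimal_num_bits()` and `_optimal_num_hashes()` follow the standard derivation, which the description does not confirm; the function names here and the floor values used for non-positive inputs are hypothetical.

```python
import math


def optimal_num_bits(n_items: int, fp_rate: float) -> int:
    """m = -n * ln(p) / (ln 2)^2, rounded up."""
    if n_items <= 0:
        return 8  # assumed floor for edge inputs such as 0 or -1
    m = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    return max(8, math.ceil(m))


def optimal_num_hashes(num_bits: int, n_items: int) -> int:
    """k = (m / n) * ln 2, rounded to the nearest integer."""
    if n_items <= 0:
        return 1  # assumed floor for edge inputs
    return max(1, round(num_bits / n_items * math.log(2)))


def expected_fp_rate(num_bits: int, num_hashes: int, n_items: int) -> float:
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * n_items / num_bits)) ** num_hashes


bits = optimal_num_bits(1000, 0.01)
hashes = optimal_num_hashes(bits, 1000)
print(bits, hashes, expected_fp_rate(bits, hashes, 1000))
```

Under these formulas, 1,000 terms at a 1% target need 9,586 bits (about 1.2 KB) and 7 hash functions, which is why the filter can stay in memory even for large segments.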
Performance
All benchmarks were executed inside Docker containers to isolate the runtime environment and reduce host-specific variance from CPU scheduling, OS caching, and library versions.
Methodology
`time.perf_counter_ns()` with GC disabled during measurement.
Rationale
Whoosh already has internal term caching, which partially mitigates the cost of repeated negative lookups on the same terms. The Bloom filter provides an earlier, cheaper pre-check that runs before any cache or disk operation, allowing confirmed misses to be resolved in O(k) hash computations alone.
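The timing approach named under Methodology (`time.perf_counter_ns()` with GC disabled) can be sketched as follows. This is an assumed shape, not the actual benchmark script: `time_call_ns`, `summarize_ms`, and the default repeat count are hypothetical.

```python
import gc
import statistics
import time


def time_call_ns(fn, repeats: int = 30):
    """Time fn() `repeats` times, with the garbage collector paused."""
    samples = []
    gc_was_enabled = gc.isenabled()
    gc.disable()
    try:
        for _ in range(repeats):
            start = time.perf_counter_ns()
            fn()
            samples.append(time.perf_counter_ns() - start)
    finally:
        # Restore the collector to whatever state it was in before.
        if gc_was_enabled:
            gc.enable()
    return samples


def summarize_ms(samples_ns):
    """Return mean, median, and sample standard deviation in milliseconds."""
    ms = [s / 1_000_000 for s in samples_ns]
    return {
        "mean": statistics.mean(ms),
        "median": statistics.median(ms),
        "stdev": statistics.stdev(ms),
    }


samples = time_call_ns(lambda: sum(range(10_000)), repeats=10)
print(summarize_ms(samples))
```

Pausing the collector keeps a collection cycle from landing inside a timed call, which matters when individual samples are in the tens of milliseconds.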
Results
Analysis
The mean improvement is 11.93%. The difference between means (3.38 ms) is smaller than the baseline standard deviation (7.49 ms), so the result does not clear the strict statistical confirmation threshold used in this study. Even so, the improvement is consistent: the optimized median is 23.60 ms vs 28.36 ms in the baseline, and the standard deviation is reduced from 7.49 ms to 5.15 ms.
The modest absolute gain compared to the TinyDB Bloom Filter reflects that Whoosh already performs internal caching, which absorbs part of the negative-lookup cost in the baseline.
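The comparison in the Analysis can be rechecked with a few lines of arithmetic on the figures quoted above. The variable names are illustrative, and treating "difference of means greater than the baseline standard deviation" as the confirmation threshold is an assumption based on how the Analysis phrases it.

```python
# Figures quoted in the Analysis, in milliseconds.
baseline_median_ms = 28.36
optimized_median_ms = 23.60
baseline_std_ms = 7.49
optimized_std_ms = 5.15
mean_diff_ms = 3.38

# Relative change in the median and in the spread.
median_gain_pct = (baseline_median_ms - optimized_median_ms) / baseline_median_ms * 100
std_reduction_pct = (baseline_std_ms - optimized_std_ms) / baseline_std_ms * 100

# Assumed threshold: the mean difference must exceed the baseline spread.
clears_threshold = mean_diff_ms > baseline_std_ms

print(f"median gain: {median_gain_pct:.1f}%")
print(f"std reduction: {std_reduction_pct:.1f}%")
print(f"clears threshold: {clears_threshold}")
```

On these inputs the median improves by roughly 16.8% and the standard deviation shrinks by roughly 31.2%, while the mean difference stays below the baseline spread.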
Note: exact timings will vary across runs and machines. Because a Bloom filter is a probabilistic data structure, it can report false positives but never false negatives, so the filter never hides an existing term. False positives (targeted at 1% by default) simply fall back to the normal lookup path.
Reproducing the benchmark
The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.
Relevant files:
- `setup/whoosh-reloaded/Dockerfile`
- `experiments/whoosh-reloaded/baseline_whoosh-reloaded_bloom-filter.py`
- `experiments/whoosh-reloaded/experiment_whoosh-reloaded_bloom-filter.py`
- `experiments/whoosh-reloaded/run_bloom_filter.sh`
To run:

    # From the root of eda-oss-performance
    bash experiments/whoosh-reloaded/run_bloom_filter.sh

The runner checks out the baseline and experiment git refs, builds a separate Docker image for each, runs both containers, and writes results to `results/whoosh-reloaded/result_whoosh-reloaded_bloom-filter.json`.
Feedback on the `.blm` file format, filter sizing strategy, and segment merge behavior is welcome.