perf: add Bloom Filter for negative term lookups #130

Open
matheusvir wants to merge 4 commits into Sygil-Dev:main from matheusvir:optimization/bloom-filter-negative-lookup

@matheusvir

What was done

Implemented Bloom filters to reduce latency on negative term lookups. When searching for a term that does not exist in the index, the engine previously had to perform expensive on-disk hash table reads before confirming the term's absence. With this change, a compact in-memory Bloom filter is checked first — if the term is definitely not present, the disk read is skipped entirely.

Key changes:

  • src/whoosh/support/bloom.py — Standalone BloomFilter class with double-hashing scheme (MD5 + SHA-256), configurable false positive rate, serialization/deserialization support, and filter merging capability.
  • src/whoosh/codec/whoosh3.py — Integration into the W3 codec:
    • W3FieldWriter collects all term keys during indexing and writes a .blm file on close.
    • W3TermsReader loads the Bloom filter on initialization and short-circuits __contains__, term_info, frequency, and doc_frequency for absent terms.
    • W3Codec exposes bloom_enabled and bloom_false_positive_rate parameters (enabled by default, 1% FPR).
  • tests/test_bloom.py — 8 classes, 35 tests (see Tests section below).
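
To illustrate the double-hashing scheme named above, here is a minimal sketch of a Bloom filter that derives its k bit positions from an MD5 digest and a SHA-256 digest. The class name `SketchBloomFilter`, its constructor arguments, and the way the digests are truncated are assumptions made for illustration; the actual `BloomFilter` in `src/whoosh/support/bloom.py` may differ in all of these details.

```python
import hashlib
from typing import Iterator


class SketchBloomFilter:
    """Illustrative Bloom filter; not the BloomFilter class from this PR."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        # Double hashing: two base hashes yield k positions via
        # position_i = (h1 + i * h2) mod m.
        h1 = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
        # Force h2 to be odd so it is never zero (assumed detail).
        h2 = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

Because `add` sets exactly the bits that `__contains__` later checks, a key that was added can never be reported absent; only false positives are possible.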

Design decisions:

  • Backward compatible: indexes without .blm files are handled gracefully (Bloom check is simply skipped).
  • Per-segment filtering: each segment maintains its own Bloom filter, working naturally with multi-segment and optimized indexes.
  • Zero false negatives guaranteed: the filter only avoids unnecessary disk reads, never hides existing terms.
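
The design decisions above can be sketched as a reader that consults the filter before touching disk and falls back to the normal path when no filter was loaded. `SketchTermsReader`, `disk_lookup`, and the `disk_reads` counter are hypothetical names used only for this example; they do not mirror the internals of `W3TermsReader`.

```python
from typing import Callable, Container, Optional


class SketchTermsReader:
    """Hypothetical reader illustrating the short-circuit on absent terms."""

    def __init__(
        self,
        disk_lookup: Callable[[bytes], Optional[int]],
        bloom: Optional[Container[bytes]] = None,
    ) -> None:
        self._disk_lookup = disk_lookup
        # None models a segment written without a .blm file.
        self._bloom = bloom
        self.disk_reads = 0

    def doc_frequency(self, term: bytes) -> int:
        # Skip the disk read only when the filter says "definitely absent".
        if self._bloom is not None and term not in self._bloom:
            return 0
        # No filter, or a possible match (including false positives):
        # fall through to the normal lookup path.
        self.disk_reads += 1
        found = self._disk_lookup(term)
        return 0 if found is None else found


# Usage: a set stands in for the filter, a dict for the on-disk table.
table = {b"alpha": 3}
reader = SketchTermsReader(table.get, bloom=set(table))
legacy = SketchTermsReader(table.get)  # no filter available
```

With the filter present, a lookup for a missing term returns without incrementing `disk_reads`; the `legacy` reader answers the same query correctly but pays for the read.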

Tests

A new test file tests/test_bloom.py was added with 8 classes and 35 tests:

  • TestBloomFilterUnit: standalone BloomFilter correctness: add()/__contains__, zero-false-negative guarantee over 500 items, false positive rate within bounds, UTF-8 key encoding, size_bytes, __repr__.
  • TestBloomFilterSerialization: to_bytes()/from_bytes() round-trips; invalid magic bytes and truncated data raise ValueError.
  • TestBloomFilterMerge: merging two filters combines their items; mismatched parameters raise ValueError.
  • TestBloomFilterOptimalParams: _optimal_num_bits() and _optimal_num_hashes() produce correct values; edge inputs (0, -1) are handled.
  • TestBloomCodecIntegration: W3Codec writes a .blm file when enabled and skips it when disabled; negative term lookups are rejected by the filter; term_info, frequency, and doc_frequency short-circuit for absent terms; backward compatibility when no .blm file exists.
  • TestBloomEndToEnd: full indexing and searching across TEXT, KEYWORD, and ID fields; multi-segment indexes; segment optimization (merge) preserves the filter.
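
For readers unfamiliar with how a filter is sized from a target false positive rate, the textbook formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below applies them; the function names echo the helpers mentioned in the test list, but the signatures, the rounding, and the `ValueError` on invalid input are assumptions, since the PR text does not say how edge inputs are handled.

```python
import math


def optimal_num_bits(num_items: int, false_positive_rate: float) -> int:
    """Textbook bit count m for n items at target false positive rate p."""
    # Rejecting invalid input is an assumption made for this sketch.
    if num_items <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("need num_items > 0 and 0 < false_positive_rate < 1")
    bits = -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits)


def optimal_num_hashes(num_bits: int, num_items: int) -> int:
    """Textbook hash count k for m bits and n items, at least 1."""
    if num_bits <= 0 or num_items <= 0:
        raise ValueError("need num_bits > 0 and num_items > 0")
    return max(1, round(num_bits / num_items * math.log(2)))


# Usage: sizing for 1,000 terms at the default 1% target rate.
m = optimal_num_bits(1000, 0.01)
k = optimal_num_hashes(m, 1000)
```

At a 1% target this works out to a little under 10 bits per term and 7 hash functions, which is why the filter stays small relative to the term dictionary.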

All existing tests continue to pass with no regressions.


Performance

All benchmarks were executed inside Docker containers to pin library versions and isolate the runtime environment. This reduces, but does not eliminate, host-specific variance from sources such as CPU scheduling and OS caching.

Methodology

  • 50 total runs; first 10 (warmup) and last 10 (cooldown) discarded; 30 effective runs measured.
  • Index: 500,000 documents.
  • Workload: 1,000 queries per run using terms guaranteed to be absent from the index, directly targeting the negative-lookup path.
  • Fixed seed (42) for reproducibility.
  • Timing: time.perf_counter_ns() with GC disabled during measurement.
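
The methodology above can be sketched as a small timing harness. This is not the benchmark runner from the research repository; `timed_runs` and the set-based stand-in workload are hypothetical, and only `time.perf_counter_ns()`, the disabled GC, the fixed seed, and the 50/10/10 run split come from the description.

```python
import gc
import random
import time
from typing import Callable, List


def timed_runs(
    workload: Callable[[], object],
    total: int = 50,
    warmup: int = 10,
    cooldown: int = 10,
) -> List[float]:
    """Time `total` runs in ms and keep only the middle ones."""
    durations_ms: List[float] = []
    for _ in range(total):
        gc_was_enabled = gc.isenabled()
        gc.disable()  # keep collector pauses out of the measurement
        try:
            start = time.perf_counter_ns()
            workload()
            elapsed_ns = time.perf_counter_ns() - start
        finally:
            if gc_was_enabled:
                gc.enable()
        durations_ms.append(elapsed_ns / 1e6)
    # Discard the first `warmup` and last `cooldown` runs.
    return durations_ms[warmup:total - cooldown]


# Usage: 1,000 absent-term queries per run against a stand-in "index".
rng = random.Random(42)  # fixed seed for reproducibility
index_terms = {f"term{i}" for i in range(1000)}
absent = [f"missing{rng.randrange(10**9)}" for _ in range(1000)]
kept = timed_runs(lambda: [term in index_terms for term in absent])
```

The returned list holds the 30 effective runs, from which a mean and standard deviation can be computed with the `statistics` module.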

Rationale

Whoosh already has internal term caching, which partially mitigates the cost of repeated negative lookups on the same terms. The Bloom filter provides an earlier, cheaper pre-check that runs before any cache or disk operation, allowing confirmed misses to be resolved in O(k) hash computations alone.

Results

| Variant | Mean (ms) | Std dev (ms) | Runs |
|---|---|---|---|
| Baseline | 28.36 | 7.49 | 30 |
| Optimized | 24.98 | 5.15 | 30 |
| Improvement | 11.93% | | |

[Figure: Bloom Filter benchmark comparison]

Analysis

The mean improvement is 11.93%. The difference between means (3.38 ms) is smaller than the baseline standard deviation (7.49 ms), so the result does not clear the strict statistical confirmation threshold used in this study. Even so, the improvement is consistent: the optimized median is 23.60 ms vs 28.36 ms in the baseline, and the standard deviation is reduced from 7.49 ms to 5.15 ms.
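
The arithmetic behind this paragraph can be checked directly from the rounded values in the results table. The sketch assumes the threshold is "difference between means exceeds the baseline standard deviation", as the paragraph implies; the study's exact criterion is not spelled out here. Recomputing from the rounded means gives a relative improvement of about 11.9%; the reported 11.93% was presumably derived from unrounded measurements.

```python
# Values copied from the results table (already rounded to 2 decimals).
baseline_mean, baseline_std = 28.36, 7.49
optimized_mean, optimized_std = 24.98, 5.15

# Absolute and relative improvement of the mean.
diff_ms = baseline_mean - optimized_mean
relative_pct = diff_ms / baseline_mean * 100

# Assumed threshold: the gain must exceed the baseline's own spread.
clears_threshold = diff_ms > baseline_std

print(f"difference: {diff_ms:.2f} ms")
print(f"relative improvement: {relative_pct:.1f}%")
print(f"clears threshold: {clears_threshold}")
```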

The modest absolute gain compared to the TinyDB Bloom Filter reflects that Whoosh already performs internal caching, which absorbs part of the negative-lookup cost in the baseline.

Note: a Bloom filter is a probabilistic data structure, and exact timings will vary across runs and machines. The no-false-negative guarantee is preserved: the filter never hides an existing term. False positives (targeted at 1% by default) simply fall back to the normal lookup path.

Reproducing the benchmark

The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.

Relevant files:

To run:

# From the root of eda-oss-performance
bash experiments/whoosh-reloaded/run_bloom_filter.sh

The runner checks out the baseline and experiment git refs, builds a separate Docker image for each, runs both containers, and writes results to results/whoosh-reloaded/result_whoosh-reloaded_bloom-filter.json.


Feedback on the .blm file format, filter sizing strategy, and segment merge behavior is welcome.

Predd0o and others added 3 commits March 11, 2026 22:28
Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>
@matheusvir force-pushed the optimization/bloom-filter-negative-lookup branch from b5022a8 to 367cd19 on March 12, 2026 01:38
@matheusvir changed the title from "perf(whoosh): add Bloom Filter for negative term lookups" to "perf: add Bloom Filter for negative term lookups" on Mar 12, 2026
@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud
