perf: add Bloom Filter for negative term lookups by matheusvir · Pull Request #130 · Sygil-Dev/whoosh-reloaded

matheusvir · 2026-03-12T01:32:20Z

What was done

Implemented Bloom filters to reduce latency on negative term lookups. When searching for a term that does not exist in the index, the engine previously had to perform expensive on-disk hash table reads before confirming the term's absence. With this change, a compact in-memory Bloom filter is checked first — if the term is definitely not present, the disk read is skipped entirely.

Key changes:

src/whoosh/support/bloom.py — Standalone BloomFilter class with double-hashing scheme (MD5 + SHA-256), configurable false positive rate, serialization/deserialization support, and filter merging capability.
src/whoosh/codec/whoosh3.py — Integration into the W3 codec:
- W3FieldWriter collects all term keys during indexing and writes a .blm file on close.
- W3TermsReader loads the Bloom filter on initialization and short-circuits __contains__, term_info, frequency, and doc_frequency for absent terms.
- W3Codec exposes bloom_enabled and bloom_false_positive_rate parameters (enabled by default, 1% FPR).
tests/test_bloom.py — 8 classes, 35 tests (see Tests section below).
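
To make the double-hashing idea concrete, here is a minimal sketch of a Bloom filter that derives its k probe positions from one MD5 digest and one SHA-256 digest. It is not the code in src/whoosh/support/bloom.py: the class name SketchBloomFilter, the 8-byte digest truncation, and forcing the second hash to be odd are all assumptions made for illustration.

```python
import hashlib


class SketchBloomFilter:
    """Illustrative Bloom filter; hypothetical, not the PR's BloomFilter."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        data = key.encode("utf-8")
        # Double hashing: position_i = (h1 + i * h2) mod m.
        # Truncating each digest to 8 bytes is an assumption of this sketch.
        h1 = int.from_bytes(hashlib.md5(data).digest()[:8], "big")
        # "| 1" keeps the stride non-zero (also an assumption).
        h2 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bf = SketchBloomFilter(num_bits=4096, num_hashes=7)
for i in range(200):
    bf.add(f"term{i}")
```

Every key that was added sets all of its probe bits, so a membership test on it can never return False. That is where the no-false-negative property comes from.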

Design decisions:

Backward compatible: indexes without .blm files are handled gracefully (Bloom check is simply skipped).
Per-segment filtering: each segment maintains its own Bloom filter, working naturally with multi-segment and optimized indexes.
Zero false negatives guaranteed: the filter only avoids unnecessary disk reads, never hides existing terms.
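
The backward-compatibility and short-circuit behavior described above can be sketched roughly as follows. SketchTermsReader and its disk_lookup callback are hypothetical names and do not mirror W3TermsReader. A frozenset stands in for the Bloom filter here, so this sketch has no false positives at all, unlike a real filter.

```python
from typing import Callable, Container, Optional


class SketchTermsReader:
    """Hypothetical reader used only to illustrate the control flow."""

    def __init__(
        self,
        disk_lookup: Callable[[str], bool],
        bloom: Optional[Container[str]] = None,
    ):
        self._disk_lookup = disk_lookup
        self._bloom = bloom  # None models a segment without a .blm file
        self.disk_reads = 0

    def __contains__(self, term: str) -> bool:
        # Skip the expensive lookup only when a filter exists and
        # reports the term as definitely absent.
        if self._bloom is not None and term not in self._bloom:
            return False
        self.disk_reads += 1
        return self._disk_lookup(term)


on_disk = {"alpha", "beta"}
legacy = SketchTermsReader(on_disk.__contains__)  # no filter available
filtered = SketchTermsReader(on_disk.__contains__, bloom=frozenset(on_disk))
```

With no filter, every lookup falls through to the normal path, which is the behavior older indexes would see. With a filter, only terms it reports as possibly present reach the disk lookup.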

Tests

A new test file tests/test_bloom.py was added with 8 classes and 35 tests:

TestBloomFilterUnit — standalone BloomFilter correctness: add()/__contains__, zero-false-negative guarantee over 500 items, false positive rate within bounds, UTF-8 key encoding, size_bytes, __repr__.
TestBloomFilterSerialization — to_bytes()/from_bytes() round-trips; invalid magic bytes and truncated data raise ValueError.
TestBloomFilterMerge — merging two filters combines their items; mismatched parameters raise ValueError.
TestBloomFilterOptimalParams — _optimal_num_bits() and _optimal_num_hashes() produce correct values; edge inputs (0, -1) are handled.
TestBloomCodecIntegration — W3Codec writes a .blm file when enabled and skips it when disabled; negative term lookups are rejected by the filter; term_info, frequency, and doc_frequency short-circuit for absent terms; backward compatibility when no .blm file exists.
TestBloomEndToEnd — full indexing and searching across TEXT, KEYWORD, and ID fields; multi-segment indexes; segment optimization (merge) preserves the filter.
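
For readers who want to sanity-check the sizing that TestBloomFilterOptimalParams exercises, the textbook Bloom filter formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below implements those formulas. The function names are illustrative, and clamping non-positive item counts to 1 is an assumption, since the description does not say how the PR handles the 0 and -1 inputs.

```python
import math


def optimal_num_bits(n_items: int, fpr: float) -> int:
    """Bits needed for n_items at the target false positive rate."""
    n = max(1, n_items)  # assumption: clamp 0 and negative counts to 1
    return math.ceil(-n * math.log(fpr) / (math.log(2) ** 2))


def optimal_num_hashes(num_bits: int, n_items: int) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    n = max(1, n_items)
    return max(1, round(num_bits / n * math.log(2)))


def expected_fpr(num_bits: int, num_hashes: int, n_items: int) -> float:
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-num_hashes * n_items / num_bits)) ** num_hashes


m = optimal_num_bits(1000, 0.01)
k = optimal_num_hashes(m, 1000)
p = expected_fpr(m, k, 1000)
```

At a 1% target this works out to roughly 9.6 bits per item and 7 hash functions. Because k is rounded to an integer, the expected rate lands close to the target rather than exactly on it.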

All existing tests continue to pass with no regressions.

Performance

All benchmarks were executed inside Docker containers to isolate the runtime environment and reduce host-specific variance from library versions and system configuration. Containers share the host kernel, so some variance from CPU scheduling and OS caching remains.

Methodology

50 total runs; first 10 (warmup) and last 10 (cooldown) discarded; 30 effective runs measured.
Index: 500,000 documents.
Workload: 1,000 queries per run using terms guaranteed to be absent from the index, directly targeting the negative-lookup path.
Fixed seed (42) for reproducibility.
Timing: time.perf_counter_ns() with GC disabled during measurement.
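
The measurement loop described above could look roughly like the sketch below. It is not the script from the research repository: run_benchmark and its parameters are hypothetical names, and using the sample standard deviation is an assumption, since the description does not say which estimator was used.

```python
import gc
import random
import statistics
import time


def run_benchmark(workload, total_runs=50, warmup=10, cooldown=10, seed=42):
    """Time workload() total_runs times and keep only the middle runs."""
    random.seed(seed)  # fixed seed for reproducible query generation
    samples_ns = []
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the timings
    try:
        for _ in range(total_runs):
            start = time.perf_counter_ns()
            workload()
            samples_ns.append(time.perf_counter_ns() - start)
    finally:
        if gc_was_enabled:
            gc.enable()
    # Discard the first `warmup` and last `cooldown` runs.
    kept_ms = [ns / 1e6 for ns in samples_ns[warmup:total_runs - cooldown]]
    return statistics.mean(kept_ms), statistics.stdev(kept_ms), len(kept_ms)


# Stand-in workload: 1,000 lookups of absent keys in a plain set.
present = {f"term{i}" for i in range(10_000)}
mean_ms, std_ms, runs = run_benchmark(
    lambda: [f"missing{i}" in present for i in range(1000)]
)
```

With the defaults, 50 runs minus 10 warmup and 10 cooldown leaves the 30 effective runs reported in the results.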

Rationale

Whoosh already has internal term caching, which partially mitigates the cost of repeated negative lookups on the same terms. The Bloom filter provides an earlier, cheaper pre-check that runs before any cache or disk operation, allowing confirmed misses to be resolved in O(k) hash computations alone.

Results

Variant	Mean (ms)	Std dev (ms)	Runs
Baseline	28.36	7.49	30
Optimized	24.98	5.15	30
Improvement (mean)	11.93%		

Analysis

The mean improvement is 11.93%. The difference between means (3.38 ms) is smaller than the baseline standard deviation (7.49 ms), so the result does not clear the strict statistical confirmation threshold used in this study. Even so, the improvement is consistent: the optimized median is 23.60 ms vs 28.36 ms in the baseline, and the standard deviation is reduced from 7.49 ms to 5.15 ms.
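
The arithmetic behind this paragraph can be reproduced from the rounded values in the results table. The check below treats "difference between means exceeds the baseline standard deviation" as the confirmation threshold. That reading is inferred from the sentence above and is an assumption about how the study defines it.

```python
baseline_mean, baseline_std = 28.36, 7.49
optimized_mean, optimized_std = 24.98, 5.15

# Absolute and relative difference between the two means.
diff_ms = baseline_mean - optimized_mean
improvement_pct = 100 * diff_ms / baseline_mean

# Assumed threshold: the gain must exceed one baseline standard deviation.
clears_threshold = diff_ms > baseline_std

# Relative reduction in run-to-run spread.
std_reduction_pct = 100 * (baseline_std - optimized_std) / baseline_std

print(f"difference: {diff_ms:.2f} ms")
print(f"improvement: {improvement_pct:.2f}%")
print(f"clears threshold: {clears_threshold}")
```

Computed from the rounded table values, the improvement comes out at about 11.92%. The reported 11.93% was presumably derived from the unrounded measurements.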

The modest absolute gain compared to the TinyDB Bloom Filter reflects that Whoosh already performs internal caching, which absorbs part of the negative-lookup cost in the baseline.

Note: exact timings will vary across runs and machines. Because the Bloom filter is a probabilistic data structure, it can report false positives, but the no-false-negative guarantee is preserved: the filter never hides an existing term. False positives (targeted at 1% by default) simply fall back to the normal lookup path.

Reproducing the benchmark

The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.

Relevant files:

Dockerfile: setup/whoosh-reloaded/Dockerfile
Baseline script: experiments/whoosh-reloaded/baseline_whoosh-reloaded_bloom-filter.py
Experiment script: experiments/whoosh-reloaded/experiment_whoosh-reloaded_bloom-filter.py
Runner script: experiments/whoosh-reloaded/run_bloom_filter.sh

To run:

# From the root of eda-oss-performance
bash experiments/whoosh-reloaded/run_bloom_filter.sh

The runner checks out the baseline and experiment git refs, builds a separate Docker image for each, runs both containers, and writes results to results/whoosh-reloaded/result_whoosh-reloaded_bloom-filter.json.

Feedback on the .blm file format, filter sizing strategy, and segment merge behavior is welcome.

Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>

sonarqubecloud · 2026-03-12T03:16:21Z

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

Predd0o and others added 3 commits March 11, 2026 22:28

matheusvir force-pushed the optimization/bloom-filter-negative-lookup branch from b5022a8 to 367cd19 Compare March 12, 2026 01:38

matheusvir changed the title ~~perf(whoosh): add Bloom Filter for negative term lookups~~ perf: add Bloom Filter for negative term lookups Mar 12, 2026

feat(whoosh): implement bloom filter for negative term lookups

34caf23

matheusvir wants to merge 4 commits into Sygil-Dev:main from matheusvir:optimization/bloom-filter-negative-lookup
