[GLUTEN-11708][VL] Translate might_contain as a subfield filter for scan-level bloom filter pushdown by acvictor · Pull Request #11711 · apache/gluten

acvictor · 2026-03-06T06:45:39Z

What changes are proposed in this pull request?

This PR adds support for pushing might_contain(bloomFilter, value) down into Velox's subfield filter system via SparkExprToSubfieldFilterParser. Previously, might_contain was always evaluated as a post-scan expression. With this change, the bloom filter check can be applied at the storage scan level, allowing entire row groups to be skipped before the data is fully decoded.

Velox has two incompatible bloom filter implementations:

BloomFilter: used by bloom_filter_agg / might_contain (groups-of-64-bits, 4 hash functions)
SplitBlockBloomFilter: used by the existing BigintValuesUsingBloomFilter filter class (SIMD split-block)
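To make the first layout more concrete, here is a toy sketch of a "one 64-bit word per key, up to four bits per key" Bloom filter. It is only loosely modeled on the description above: the class name ToyWordBloom and the exact way the hash is sliced into bit positions are assumptions for illustration, not Velox's code. A split-block filter groups and probes its bits differently, which is why bytes produced for one layout cannot be probed by the other.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy "one 64-bit word per key" Bloom filter. The bit-selection scheme below
// is an assumption made for illustration, not Velox's actual code.
class ToyWordBloom {
 public:
  // numWords must be greater than zero.
  explicit ToyWordBloom(std::size_t numWords) : words_(numWords, 0) {}

  void insert(uint64_t hash) {
    words_[index(hash)] |= mask(hash);
  }

  // True only if every bit selected by the hash is set in the key's word.
  bool mayContain(uint64_t hash) const {
    const uint64_t m = mask(hash);
    return (words_[index(hash)] & m) == m;
  }

 private:
  // Four 6-bit slices of the hash each select one bit inside a 64-bit word
  // (the slices may coincide, so between one and four bits are set).
  static uint64_t mask(uint64_t hash) {
    return (1ULL << (hash & 63)) | (1ULL << ((hash >> 6) & 63)) |
        (1ULL << ((hash >> 12) & 63)) | (1ULL << ((hash >> 18) & 63));
  }

  // The remaining high bits select which word the key lives in.
  std::size_t index(uint64_t hash) const {
    return static_cast<std::size_t>((hash >> 24) % words_.size());
  }

  std::vector<uint64_t> words_;
};
```

A reader of the serialized words has to know both the word-selection rule and the mask rule, so the layout is part of the format, not an implementation detail.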

Since these are not interchangeable, this PR introduces a new SparkBloomFilter filter class that wraps the serialized BloomFilter<> data. It implements testInt64() using BloomFilter<>::mayContain() with folly::hasher<int64_t>(), which is the same code path used by the JNI mightContainLongOnSerializedBloom.
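The shape of such a filter class can be sketched roughly as follows. Everything here is a hypothetical stand-in so that the example compiles on its own: ToyFilter replaces the real filter base class, toyHash replaces folly::hasher<int64_t>, and the serialized layout (a flat array of 64-bit words) is assumed rather than taken from Velox.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a scan-level filter interface.
struct ToyFilter {
  virtual ~ToyFilter() = default;
  virtual bool testInt64(int64_t value) const = 0;
};

// Hypothetical stand-in for the hash (the PR uses folly::hasher<int64_t>).
inline uint64_t toyHash(int64_t value) {
  uint64_t x = static_cast<uint64_t>(value) + 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Assumed bit-selection rule: four 6-bit slices of the hash.
inline uint64_t toyMask(uint64_t hash) {
  return (1ULL << (hash & 63)) | (1ULL << ((hash >> 6) & 63)) |
      (1ULL << ((hash >> 12) & 63)) | (1ULL << ((hash >> 18) & 63));
}

// Builds the toy serialized form for a set of keys. numWords must be > 0.
inline std::string toySerialize(
    const std::vector<int64_t>& keys, std::size_t numWords) {
  std::vector<uint64_t> words(numWords, 0);
  for (int64_t key : keys) {
    const uint64_t h = toyHash(key);
    words[(h >> 24) % numWords] |= toyMask(h);
  }
  return std::string(
      reinterpret_cast<const char*>(words.data()),
      numWords * sizeof(uint64_t));
}

// Probes the serialized bytes directly, without building a filter object.
inline bool toyMayContain(const std::string& serialized, uint64_t hash) {
  const std::size_t numWords = serialized.size() / sizeof(uint64_t);
  if (numWords == 0) {
    return false; // Empty payload: nothing was ever inserted.
  }
  uint64_t word;
  std::memcpy(
      &word,
      serialized.data() + ((hash >> 24) % numWords) * sizeof(uint64_t),
      sizeof(word));
  const uint64_t m = toyMask(hash);
  return (word & m) == m;
}

// The filter owns the serialized bytes and hashes each value before probing.
class ToyMightContainFilter final : public ToyFilter {
 public:
  explicit ToyMightContainFilter(std::string serialized)
      : serialized_(std::move(serialized)) {}

  bool testInt64(int64_t value) const final {
    return toyMayContain(serialized_, toyHash(value));
  }

 private:
  std::string serialized_;
};
```

The point of the sketch is that the scan-level filter and the post-scan function must use the same hash and the same probe routine, otherwise the two paths could disagree on which rows pass.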

How was this patch tested?

Added a new test suite covering basic filtering, a null bloom filter, negation, a non-column value argument, a range test, and clone behavior.

Was this patch authored or co-authored using generative AI tooling?

No

Related issue #11708

acvictor · 2026-03-09T05:39:50Z

@zhztheplayer can you please review this PR?

zhztheplayer

Thanks.

zhztheplayer · 2026-03-09T09:36:11Z

cpp/velox/operators/functions/SparkExprToSubfieldFilterParser.cc

+  try {
+    evaluator->evaluate(exprSet.get(), rows, input, result);
+  } catch (const VeloxUserError&) {
+    return nullptr;
+  }


Why is the error swallowed?

Errors are intentionally swallowed because a non-evaluable expression simply means the filter cannot be pushed down. In that case, leafCallToSubfieldFilter returns std::nullopt, and Velox evaluates might_contain as a regular post-scan expression, which is the same behavior as before. Does this flow make sense? I have updated the comment as well.
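The fallback flow described in this reply can be sketched as below. All names are hypothetical (ToyUserError stands in for VeloxUserError, and the planner function is invented for the example); the sketch only shows the pattern of turning an evaluation failure into "no pushdown" instead of a query error.

```cpp
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the user-level error type caught in the PR.
struct ToyUserError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for a pushed-down filter: just the evaluated payload.
struct ToyPushdownFilter {
  std::string serializedBloom;
};

// Tries to evaluate the bloom filter argument up front. A user-level
// evaluation error means "cannot push down", so it becomes an empty optional.
// Other exception types are deliberately not caught here.
inline std::optional<ToyPushdownFilter> tryMakePushdownFilter(
    const std::function<std::string()>& evaluateBloomArg) {
  try {
    return ToyPushdownFilter{evaluateBloomArg()};
  } catch (const ToyUserError&) {
    return std::nullopt;
  }
}

enum class EvalSite { kScan, kPostScan };

// The caller keeps the expression as a post-scan filter when pushdown fails.
inline EvalSite planMightContain(
    const std::function<std::string()>& evaluateBloomArg) {
  return tryMakePushdownFilter(evaluateBloomArg).has_value()
      ? EvalSite::kScan
      : EvalSite::kPostScan;
}
```

Because the post-scan path still evaluates the original expression, any genuine error in it would still surface there; the pushdown attempt only decides where the check runs.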

zhztheplayer · 2026-03-09T09:37:04Z

cpp/velox/operators/functions/SparkExprToSubfieldFilterParser.cc

+}
+
+/// Filter backed by Velox's BloomFilter<> serialized data from bloom_filter_agg.
+class SparkBloomFilter final : public common::Filter {


Let's use SparkMightContain or so to align with Spark's function name.

zhztheplayer · 2026-03-09T09:39:23Z

cpp/velox/operators/functions/SparkExprToSubfieldFilterParser.cc

+        serializedData_(std::move(serializedData)) {}
+
+  bool testInt64(int64_t value) const final {
+    return BloomFilter<>::mayContain(serializedData_.data(), folly::hasher<int64_t>()(value));


We can create a bloom filter object as a class member, since bloomFilter.mayContain is faster than BloomFilter<>::mayContain.
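A rough sketch of the two options being compared, under assumed details: the serialized form is taken to be a flat array of 64-bit words, and the names mayContainSerialized and DecodedToyBloom are hypothetical. The per-call path re-derives the layout from the raw bytes on every probe, while the member path decodes once in the constructor and then probes aligned words directly.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumed bit-selection rule: four 6-bit slices of the hash.
inline uint64_t probeMask(uint64_t hash) {
  return (1ULL << (hash & 63)) | (1ULL << ((hash >> 6) & 63)) |
      (1ULL << ((hash >> 12) & 63)) | (1ULL << ((hash >> 18) & 63));
}

// Per-call path: interprets the serialized bytes again on every probe.
inline bool mayContainSerialized(const std::string& bytes, uint64_t hash) {
  const std::size_t numWords = bytes.size() / sizeof(uint64_t);
  if (numWords == 0) {
    return false;
  }
  uint64_t word;
  std::memcpy(
      &word,
      bytes.data() + ((hash >> 24) % numWords) * sizeof(uint64_t),
      sizeof(word));
  const uint64_t m = probeMask(hash);
  return (word & m) == m;
}

// Member path: decode once at construction, then probe the words directly.
class DecodedToyBloom {
 public:
  explicit DecodedToyBloom(const std::string& bytes)
      : words_(bytes.size() / sizeof(uint64_t)) {
    if (!words_.empty()) {
      std::memcpy(
          words_.data(), bytes.data(), words_.size() * sizeof(uint64_t));
    }
  }

  bool mayContain(uint64_t hash) const {
    if (words_.empty()) {
      return false;
    }
    const uint64_t m = probeMask(hash);
    return (words_[(hash >> 24) % words_.size()] & m) == m;
  }

 private:
  std::vector<uint64_t> words_;
};
```

Both paths must return the same answer for every hash; the difference is only in how much work is repeated per row, which matters for a filter that is called once per scanned value.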

github-actions bot added the VELOX label Mar 6, 2026

Translate might_contain as a subfield filter

01a1105

acvictor force-pushed the acvictor/mightContain branch from afcb288 to 15f1a6f, March 6, 2026 07:12

Add tests for might_contain subfield filter

0258168

acvictor force-pushed the acvictor/mightContain branch from 15f1a6f to 0258168, March 6, 2026 07:24

acvictor changed the title ~~[VL] Translate might_contain as a subfield filter for scan-level bloom filter pushdown~~ [GLUTEN-11708][VL] Translate might_contain as a subfield filter for scan-level bloom filter pushdown Mar 6, 2026

acvictor marked this pull request as ready for review March 6, 2026 09:29

acvictor mentioned this pull request Mar 6, 2026

[VL] Translate might_contain as a subfield filter #11708

Open

zhztheplayer reviewed Mar 9, 2026

Address comments

49b2ef5

acvictor wants to merge 3 commits into apache:main from acvictor:acvictor/mightContain
