Skip to content

[GLUTEN-11708][VL] Translate might_contain as a subfield filter for scan-level bloom filter pushdown#11711

Open
acvictor wants to merge 3 commits intoapache:mainfrom
acvictor:acvictor/mightContain
Open

[GLUTEN-11708][VL] Translate might_contain as a subfield filter for scan-level bloom filter pushdown#11711
acvictor wants to merge 3 commits intoapache:mainfrom
acvictor:acvictor/mightContain

Conversation

@acvictor
Copy link
Contributor

@acvictor acvictor commented Mar 6, 2026

What changes are proposed in this pull request?

This PR adds support for pushing might_contain(bloomFilter, value) down into Velox's subfield filter system via SparkExprToSubfieldFilterParser. Previously, might_contain was evaluated as a post-scan expression. With this change, the bloom filter check can be applied at the storage scan level allowing entire row groups to be skipped before data is fully decoded.

Velox has two incompatible bloom filter implementations:

  • BloomFilter: used by bloom_filter_agg / might_contain (groups-of-64-bits, 4 hash functions)
  • SplitBlockBloomFilter: used by the existing BigintValuesUsingBloomFilter filter class (SIMD split-block)

Since these are not interchangeable, a new SparkBloomFilter filter class is introduced that wraps the serialized BloomFilter<> data and implements testInt64() using BloomFilter<>::mayContain() with folly::hasher<int64_t>() which is the same code path used by the JNI mightContainLongOnSerializedBloom.

How was this patch tested?

Added new test suite covering basic filtering, null bloom filter, negation, non-column value, range test, and clone behavior.

Was this patch authored or co-authored using generative AI tooling?

No

Related issue #11708

@github-actions github-actions bot added the VELOX label Mar 6, 2026
@acvictor acvictor force-pushed the acvictor/mightContain branch from afcb288 to 15f1a6f Compare March 6, 2026 07:12
@acvictor acvictor force-pushed the acvictor/mightContain branch from 15f1a6f to 0258168 Compare March 6, 2026 07:24
@acvictor acvictor changed the title [VL] Translate might_contain as a subfield filter for scan-level bloom filter pushdown [GLUTEN-11708][VL] Translate might_contain as a subfield filter for scan-level bloom filter pushdown Mar 6, 2026
@acvictor acvictor marked this pull request as ready for review March 6, 2026 09:29
@acvictor
Copy link
Contributor Author

acvictor commented Mar 9, 2026

@zhztheplayer can you please review this PR?

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Comment on lines +37 to +41
try {
evaluator->evaluate(exprSet.get(), rows, input, result);
} catch (const VeloxUserError&) {
return nullptr;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why error is swallowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors are intentionally swallowed because a non-evaluable expression simply means the filter cannot be pushed down. So if the filter cannot be pushed down, leafCallToSubfieldFilter returns std::nullopt, and Velox will evaluate might_contain as a regular post-scan expression which is the same behavior as before. Does this flow make sense? I have updated the comment as well.

}

/// Filter backed by Velox's BloomFilter<> serialized data from bloom_filter_agg.
class SparkBloomFilter final : public common::Filter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use SparkMightContain or so to align with Spark's function name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

serializedData_(std::move(serializedData)) {}

bool testInt64(int64_t value) const final {
return BloomFilter<>::mayContain(serializedData_.data(), folly::hasher<int64_t>()(value));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can create a bloom filter object as class member, since bloomFilter.mayContain is faster than BloomFilter<>::mayContain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants