[GLUTEN-11383][VL] Allow bloom filter pushdown in hash probe #11392
rui-mo merged 1 commit into apache:main
Conversation
marin-ma
left a comment
Please run dev/gen-all-config-docs.sh when adding or modifying the configurations.
These configurations are not actually passed to and used by the Velox Backend. Do you plan to enable it in this patch?
buildConf("spark.gluten.sql.columnar.backend.velox.hash_probe_bloom_filter_pushdown_max_size")
  .doc("The maximum byte size of Bloom filter that can be generated from hash probe. When set to 0, no Bloom filter will be generated. To achieve optimal performance, this should not be much larger than the CPU cache size on the host.")
  .intConf
  .createWithDefault(0)
The configuration type and default value are incorrect.
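For illustration only, here is a sketch of how the entry might look with a byte-size type and a Long default. The key name and builder calls below are assumptions drawn from this thread's discussion, not the final merged code:

```scala
// Sketch only: a size-valued config is better expressed as a bytes conf.
// Key name and builder DSL are assumptions, mirroring this thread's suggestions.
buildConf("spark.gluten.sql.columnar.backend.velox.hashProbe.bloomFilterPushdown.maxSize")
  .doc("The maximum byte size of the Bloom filter generated from hash probe. 0 disables generation.")
  .bytesConf(ByteUnit.BYTE)
  .createWithDefault(0L)
```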
Seems like the variable name and the configuration don't match.
val COLUMNAR_VELOX_HASH_PROBE_DYNAMIC_FILTER_PUSHDOWN_ENABLED =
buildConf("spark.gluten.sql.columnar.backend.velox.hash_probe_bloom_filter_pushdown_max_size")
buildConf("spark.gluten.sql.columnar.backend.velox.hash_probe_dynamic_filter_pushdown_enabled")
  .doc("Whether hash probe can generate any dynamic filter (including Bloom filter) and push down to upstream operators.")
  .booleanConf
  .createWithDefault(true)
rui-mo
left a comment
+1. Could you please pass this configuration to Velox’s query context so that Gluten users can control the behavior via Scala configuration?
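A hypothetical sketch (not Gluten's actual API) of one way the forwarding could work: strip the Gluten prefix from the Scala config key and hand the remainder to the backend as a plain key/value entry in the Velox query-context config map. The prefix-stripping convention and resulting key name are assumptions for illustration:

```scala
// Hypothetical sketch: forward a Gluten Scala config to the Velox query
// context as a (key -> value) entry. The key naming scheme is an assumption,
// not Velox's real QueryConfig key.
object VeloxConfForwardSketch {
  val GlutenVeloxPrefix = "spark.gluten.sql.columnar.backend.velox."

  // Returns the (key, value) pair that would be placed in the Velox
  // query-context config map.
  def toVeloxQueryConfig(glutenKey: String, value: String): (String, String) = {
    require(glutenKey.startsWith(GlutenVeloxPrefix), s"not a Velox backend conf: $glutenKey")
    (glutenKey.stripPrefix(GlutenVeloxPrefix), value)
  }

  def main(args: Array[String]): Unit = {
    val (k, v) = toVeloxQueryConfig(
      GlutenVeloxPrefix + "hashProbe.bloomFilterPushdown.maxSize", "1048576")
    println(s"$k=$v") // prints hashProbe.bloomFilterPushdown.maxSize=1048576
  }
}
```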
Force-pushed f69a085 to 19ba6e9
  .createWithDefault(4194304L)

val COLUMNAR_VELOX_HASH_PROBE_BLOOM_FILTER_PUSHDOWN_MAX_SIZE =
  buildConf("spark.gluten.sql.columnar.backend.velox.hash_probe_bloom_filter_pushdown_max_size")
Should we follow the camel case naming convention for consistency instead of using underscore case?
The other variables in this scope all use underscores though.
@infvg, let me clarify. I meant the use of hash_probe_bloom_filter_pushdown_max_size in the config string. Should we convert it to camel case for consistency?
+1 Let's follow the camel case naming for the newly added configurations. If the configuration name is too long, you can break it into multiple parts. e.g. .hashProbe.bloomFilterPushDownMaxSize
For the variable name, please shorten it to HASH_PROBE_BLOOM_FILTER_PUSHDOWN_MAX_SIZE
Ah I see, I changed it. I set it to hashProbe.bloomFilterPushdown.maxSize, which feels more natural since we are setting the max size. Please let me know what you think.
def parquetUseColumnNames: Boolean = getConf(PARQUET_USE_COLUMN_NAMES)

def veloxHashProbeBloomFilterPushdownMaxSize: Long =
  getConf(COLUMNAR_VELOX_HASH_PROBE_BLOOM_FILTER_PUSHDOWN_MAX_SIZE)
veloxHashProbeBloomFilterPushdownMaxSize -> hashProbeBloomFilterPushdownMaxSize
Force-pushed d499080 to e61ad52
val metrics = join.get.metrics

assert(metrics.contains("hashProbeDynamicFiltersProduced"))
assert(metrics("hashProbeDynamicFiltersProduced").value > 0)
I investigated the cause of the test failure and identified the following issues:
- The test uses a broadcast hash join, which causes it to fail at L335. We can set the configuration spark.sql.autoBroadcastJoinThreshold to -1 to force the use of a shuffled hash join instead.
- After applying this change, the metrics assertion still fails because metrics("hashProbeDynamicFiltersProduced").value is zero. This happens because, in this scenario, the Velox plan consists of ValueStream operators as the children of the HashJoin, which is a typical join plan in Gluten when a shuffle exists, but does not support dynamic filter pushdown. Dynamic filters only take effect when the HashProbe child is a TableScan. I'm not sure whether such a case can be constructed in a Gluten unit test.
-- Project[4][expressions: (n4_2:BIGINT, "n3_2")] -> n4_2:BIGINT
-- Project[3][expressions: (n3_2:BIGINT, "n0_0"), (n3_3:BIGINT, "n1_0")] -> n3_2:BIGINT, n3_3:BIGINT
-- HashJoin[2][INNER n0_0=n1_0] -> n0_0:BIGINT, n1_0:BIGINT
-- ValueStream[0][] -> n0_0:BIGINT
-- ValueStream[1][] -> n1_0:BIGINT
assert(metrics("hashProbeReplacedWithDynamicFilterRows").value > 0)
    }
  }
}
@infvg I made a few changes to the test above and confirmed that bloom filter pushdown is working by printing the following metrics in Velox:
bloomFilter->blocksByteSize(): 144704
numFiltersProduced: 1
A good next step would be to pass the bloomFilter->blocksByteSize() metric from Velox to Gluten (see WholeStageResultIterator::collectMetrics()), and then verify in this test that both blocksByteSize and numFiltersProduced are greater than 0.
This is the test with my modifications:
withSQLConf(
VeloxConfig.HASH_PROBE_DYNAMIC_FILTER_PUSHDOWN_ENABLED.key -> "true",
VeloxConfig.HASH_PROBE_BLOOM_FILTER_PUSHDOWN_MAX_SIZE.key -> "1048576"
) {
withTable("probe_table", "build_table") {
spark.sql("""
CREATE TABLE probe_table USING PARQUET
AS SELECT id as a FROM range(110001)
""")
spark.sql("""
CREATE TABLE build_table USING PARQUET
AS SELECT id * 1000 as b FROM range(220002)
""")
runQueryAndCompare(
"SELECT a FROM probe_table JOIN build_table ON a = b"
) {
df =>
val join = find(df.queryExecution.executedPlan) {
case _: BroadcastHashJoinExecTransformer => true
case _ => false
}
assert(join.isDefined)
val metrics = join.get.metrics
// TODO: assert the relevant metrics.
}
}
}
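If the blocksByteSize metric is later surfaced on the Gluten side, the aggregation could be as simple as summing the per-operator values. A hypothetical sketch; the metric names and the Map[String, Long] data shape are assumptions for illustration, not the actual collectMetrics() structures:

```scala
// Hypothetical sketch: aggregate a per-operator Velox runtime metric
// (e.g. "bloomFilterBlocksByteSize") into one value for a Spark SQL metric.
object MetricAggregationSketch {
  // Sum the named metric across all operators, treating absence as 0.
  def sumMetric(operatorStats: Seq[Map[String, Long]], name: String): Long =
    operatorStats.map(_.getOrElse(name, 0L)).sum

  def main(args: Array[String]): Unit = {
    val stats = Seq(
      Map("bloomFilterBlocksByteSize" -> 144704L, "numFiltersProduced" -> 1L),
      Map("numFiltersProduced" -> 0L))
    println(sumMetric(stats, "bloomFilterBlocksByteSize")) // prints 144704
    println(sumMetric(stats, "numFiltersProduced"))        // prints 1
  }
}
```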
Force-pushed 31ed8b4 to a9fe475
  "set to 0, no Bloom filter will be generated. To achieve optimal performance, this should" +
  " not be much larger than the CPU cache size on the host.")
  .bytesConf(ByteUnit.BYTE)
  .createWithDefault(0)
Do we need to disable the bloom filter pushdown by default?
Velox has the default set to true & 0:
https://github.com/facebookincubator/velox/blob/10bdc0688f892dcda83cfbcf723f656ba5e1e6b4/velox/core/QueryConfig.h#L1304-L1308
aggSpilledFiles += aggMetrics.spilledFiles
flushRowCount += aggMetrics.flushRowCount
loadedToValueHook += aggMetrics.loadedToValueHook
bloomFilterBlocksByteSize += aggMetrics.bloomFilterBlocksByteSize
It seems like the bloom filter pushdown in aggregate hasn’t been confirmed to actually take effect yet, right?
Added ``spark.gluten.sql.columnar.backend.velox.hash_probe_bloom_filter_pushdown_max_size`` and ``spark.gluten.sql.columnar.backend.velox.hash_probe_dynamic_filter_pushdown_enabled`` as config options for Velox.

Resolves: #11383