-
@FelixYBW can you please take a look?
-
@shadowmmu Thanks for sharing the findings and analysis. It looks like in your example run of Q17 with Gluten, Spark's runtime filter is not triggered. Have you also tried lowering the application-side threshold?
Please also note that by default DBX enables the local caching feature, which can significantly improve performance for subsequent queries in a power-test run. It also collects runtime statistics, helping later queries generate better execution plans (their CBO is enabled by default).
-
@zhouyuan do you have any suggestions on how I can overcome this runtime filter bottleneck?
-
Velox uses a very simple BloomFilter (https://github.com/facebookincubator/velox/blob/main/velox/common/base/BloomFilter.h#L120) that differs from Spark's BloomFilter; this may affect the filter's efficiency.
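To illustrate the kind of design difference being pointed out: below is a toy "blocked" Bloom filter in the general spirit of the Velox header linked above (each key sets a few bits inside a single 64-bit word, which is cache-friendly but loses accuracy at the same total size), in contrast to Spark's classic filter, which scatters its hash bits across the whole bit array. This is an illustrative sketch only, not the actual Velox code; the class name and hashing scheme are made up.

```python
# Illustrative sketch only -- NOT the Velox implementation. Shows the general
# "blocked" design: each key touches bits within one 64-bit word.
import hashlib

class BlockedBloomFilter:
    def __init__(self, num_words: int):
        # One 64-bit word per block.
        self.words = [0] * num_words

    def _word_and_bits(self, key: bytes):
        # Derive one 64-bit hash, pick a word, then 4 bit positions inside it.
        h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "little")
        word = h % len(self.words)
        bits = [(h >> (8 + 6 * i)) & 63 for i in range(4)]
        return word, bits

    def insert(self, key: bytes) -> None:
        word, bits = self._word_and_bits(key)
        for b in bits:
            self.words[word] |= 1 << b

    def may_contain(self, key: bytes) -> bool:
        # May return false positives, never false negatives.
        word, bits = self._word_and_bits(key)
        return all(self.words[word] & (1 << b) for b in bits)
```

Because all of a key's bits live in one word, a lookup costs a single memory access, but once individual words fill up the false-positive rate degrades faster than in a classic filter of the same size.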


-
While benchmarking TPC-H 1TB on Gluten+Velox, we identified a major performance bottleneck compared to Databricks Photon in shuffle-heavy queries (Q8, Q9, Q17, Q18, Q21).
The Evidence
Databricks Photon filters lineitem records using Bloom filters, keeping disk I/O below 10%.
The Root Cause
Spark imposes a default 10 MB size limit on its runtime Bloom filter. We increased the limit to 1 GB to make the comparison with Databricks Photon fair; the Bloom filter is created after the config change, but its filtering effect is negligible.
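For reference, these are the Spark SQL settings we would expect to govern this behavior. This is a hedged sketch: the config names below come from Spark 3.3+'s SQLConf, and the values are illustrative, so verify both against your Spark version before relying on them.

```python
# Hedged sketch: Spark 3.3+ runtime Bloom filter knobs (names from SQLConf,
# values illustrative). Pass via --conf or SparkSession.builder.config(...).
runtime_filter_confs = {
    # Turn the runtime Bloom filter optimization on.
    "spark.sql.optimizer.runtime.bloomFilter.enabled": "true",
    # Raise the 10 MB build-side limit mentioned above to 1 GB.
    "spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold": "1GB",
    # Allow the filter itself to grow beyond the small default bit count.
    "spark.sql.optimizer.runtime.bloomFilter.maxNumBits": str(8 * 1024**3),
    "spark.sql.optimizer.runtime.bloomFilter.maxNumItems": str(400_000_000),
    # Lower the probe-side scan threshold so smaller scans still qualify.
    "spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold": "1GB",
}

for key, value in runtime_filter_confs.items():
    print(f"--conf {key}={value}")
```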
The Velox backend appears limited by a hardcoded Bloom filter size of 4,194,304 bits. This limit is too low to maintain a low false-positive rate at 1 TB cardinality, rendering the filter ineffective.
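A quick back-of-the-envelope calculation shows why 4,194,304 bits cannot stay selective at this scale. This uses the generic Bloom filter estimate p ≈ (1 − e^(−kn/m))^k, not Velox's exact implementation, and the key counts are illustrative:

```python
import math

def bloom_fpp(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Approximate Bloom filter false-positive probability:
    p ~ (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

M = 4_194_304  # the hardcoded limit cited above

# At a million build-side keys the filter still rejects most probes...
fpp_small = bloom_fpp(M, 1_000_000, 3)
# ...but at the key cardinalities a 1 TB join produces, nearly every probe
# is a false positive, so effectively nothing is pruned.
fpp_large = bloom_fpp(M, 100_000_000, 3)

print(f"FPP @ 1M keys:   {fpp_small:.3f}")
print(f"FPP @ 100M keys: {fpp_large:.6f}")
```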
Questions for Maintainers
Is there a way to configure maxNumBits for Bloom filters to scale beyond the current hardcoded limit?
Spark Plan for Q17
Databricks: [query plan screenshot]

Velox: [query plan screenshot]