Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[Performance] Velox Bloom Filter Inefficiency vs. Photon at 1TB Scale #11554

shadowmmu started this conversation in General
Discussion options

While benchmarking TPC-H 1TB on Gluten+Velox, we identified a major performance bottleneck compared to Databricks Photon in shuffle-heavy queries (Q8, Q9, Q17, Q18, Q21).

The Evidence

  • Photon: Effectively prunes 95-99% of lineitem records using Bloom Filters, keeping Disk I/O below 10%.
  • Velox: Shows negligible filtering, leading to 30-50% Disk I/O.
  • Total Runtime Impact: Photon achieved a 1.83x speedup overall, but when I/O-heavy queries are excluded, the gap narrows to 1.13x.

The Root Cause
Spark has a default limit of 10MB for Bloom Filter, so when we increased the limit to 1GB to make comparable with Databricks Photon, bloom filter is created after config change but filter is negligable.

The Velox backend seems limited by a hardcoded Bloom Filter size of 4,194,304 bits (). This limit is too low to maintain low false-positive rates for 1TB cardinality, rendering the filter ineffective.

Questions for Maintainers

  1. Are there plans to allow the maxNumBits for Bloom Filters to scale beyond the current hardcoded limit?
  2. Why is Velox failing to generate/utilize the filter as effectively as Photon at this scale?

Spark Plan for Q17

Databricks
Historical Spark UI for cluster 5202-183210-uyaoq4aw, driver 7894865519245697362 - Details for Query 19

Velox
localhost_18080_history_app-20260202185100-0001_SQL_execution__id=266

You must be logged in to vote

Replies: 4 comments 11 replies

Comment options

@FelixYBW can you please look.

You must be logged in to vote
0 replies
Comment options

@shadowmmu Thanks for sharing the findings and analysis.
Gluten actually allows to config this via:
spark.gluten.sql.columnar.backend.velox.bloomFilter.maxNumBits

Looks like in your example run for Q17 with Gluten, Spark run time filter is not triggered. Have you also tried to lower the application side threshold?

spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold = 0

Please also note by default DBX enabled the local caching feature which can significantly improve performance for subsequent queries in a power test run. It also collects runtime statistics, helping later queries generate better execution plans(their CBO is enabled by default)

You must be logged in to vote
7 replies
Comment options

These configs should respect Spark config except maxNumBits since memory issue. CC @zhouyuan
spark.gluten.sql.columnar.backend.velox.bloomFilter.maxNumBits

 val RUNTIME_BLOOM_FILTER_MAX_NUM_BITS =
 buildConf("spark.sql.optimizer.runtime.bloomFilter.maxNumBits")
 .doc("The max number of bits to use for the runtime bloom filter")
 .version("3.3.0")
 .longConf
 .createWithDefault(67108864L)
Comment options

Yes @jinchengchenghh, and also we have dedicated gluten configs as well for the above configs. But like is there any way to increase the values bloomFilter.maxNumBits beyond its default values to check its efficiency on larger scale.
Thanks -khem

Comment options

Please update the value 256 to a big value https://github.com/facebookincubator/velox/blob/main/velox/common/memory/MemoryAllocator.h#L492 to relax the memory restriction and take a try.

I don't try that, but this config default value is set by the size class memory limit.

Comment options

I think it will also require to update here as well to increase the hard-coded limit https://github.com/facebookincubator/velox/blob/da0d2fdba364c5e14ea98536ecae57d3d7562041/velox/core/QueryConfig.h#L1245

Comment options

yes

Comment options

@zhouyuan do you have any suggestion about how can I overcome this runtime filter bottleneck?
Also what should be the performance difference between photon and Velox.

You must be logged in to vote
4 replies
Comment options

zhouyuan Feb 3, 2026
Collaborator

Hi @shadowmmu Thanks for the detailed information — I see the issue now. This appears to be a hard limit in the memory allocator, and modifying line may help(https://github.com/facebookincubator/velox/blob/main/velox/common/memory/MemoryAllocator.h#L492)
Cc: @zhli1142015

To improve BHJ performance, Gluten has a WIP patch (#8931
) that should significantly improve performance for large BHJ workloads.

On the query planning side, DBX performs much better than vanilla Spark, especially on the DS benchmark. There are also several useful optimizations from the Spark community, but not been merged. Some of the cloud vendors also improved on the spark catalyst in their product to improve the planner.

For the Aggregation perf diff, could you please also the the example queries?

thanks, -yuan

Comment options

Thanks @zhouyuan for you effort,
Its now all clear about the BHJ, bloom filter and Optimized query plan.

For the bloom filter part, need to study Velox code if we can increase the limit without breaking anything.
if you have anything that might help, please share it here.

For the aggregation We have observed the difference in TPCH Q18 for instance.
Here is the related query plans
Velox
image

Photon
image

Comment options

Hi @shadowmmu Thanks for the detailed information — I see the issue now. This appears to be a hard limit in the memory allocator, and modifying line may help(https://github.com/facebookincubator/velox/blob/main/velox/common/memory/MemoryAllocator.h#L492) Cc: @zhli1142015

To improve BHJ performance, Gluten has a WIP patch (#8931 ) that should significantly improve performance for large BHJ workloads.

On the query planning side, DBX performs much better than vanilla Spark, especially on the DS benchmark. There are also several useful optimizations from the Spark community, but not been merged. Some of the cloud vendors also improved on the spark catalyst in their product to improve the planner.

For the Aggregation perf diff, could you please also the the example queries?

thanks, -yuan

Internally, we’ve removed this limitation and are using Spark’s default value of 64 MB for bloomFilter.maxNumBits. @shadowmmu Is 1 GB the default setting for all queries? What is the overall impact of this?

Comment options

Hi @zhli1142015 , the default limit is 10MB from spark side, I increased the limit to 1GB to compare it with Databrick's photon but it wasn't effective at all, like even after filtering it was nearly returning all the rows.
Performance wise Velox is nearly 5x slower for Q17 compared to Photon with the default limit of 10 MB for bloom filter.
And on full TPCH suite its 2x slower.
Changing to 1GB hasn't effected anything.

And btw how did you increased the bloomFilter.maxNumBits?

Comment options

Velox uses a very simple and different BloomFilter https://github.com/facebookincubator/velox/blob/main/velox/common/base/BloomFilter.h#L120 vs Spark BloomFilter, this may affect the filter efficiency.

You must be logged in to vote
0 replies
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

AltStyle によって変換されたページ (->オリジナル) /