[GLUTEN-7548][VL] Optimize BHJ in velox backend#8931
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/apache/incubator-gluten/issues Could you also rename the commit message and pull request title to follow the required format?
In the long term, we need to implement this the Spark way: broadcast the hash table instead of the raw table data.
@FelixYBW Yes, we will support the broadcast-hash-table approach after adding serialization/deserialization support to Velox's HashTable.
@zhztheplayer Is there a memory management issue in this solution? Is the memory allocated in storage memory? @JkSelf, will this solution be helpful toward the final solution?
@FelixYBW Yes, the primary difference between Design 1 and Design 2 is the need for serialization and deserialization of Velox's hash table. Most of the remaining code can be shared between the two designs. We will evaluate the TPCH performance after addressing the result mismatch issue. If the performance does not meet expectations, we will proceed with implementing Design 2.
Is there any test code covering all of our changes to the broadcast code? The tests are mostly based on Spark local mode, where no broadcasting will actually happen. cc @zjuwangg
@zhztheplayer Thank you for your review. It appears that the existing tests cover the broadcast changes introduced in this PR. I have added logging to the native broadcast hash table build code, and these logs will be printed during the CI process.
@jinchengchenghh @liujiayi771 @zhztheplayer I'm merging this PR now to unblock our internal tests. I'll follow up with another PR to resolve your remaining comments. Thanks for the comments!
zhztheplayer
left a comment
Some non-blocking comments.
backends-velox/src/main/scala/org/apache/spark/rpc/GlutenRpcMessages.scala
    isNullAwareAntiJoin,
    bloomFilterPushdownSize,
    threadBatches,
    defaultLeafVeloxMemoryPool());
defaultLeafVeloxMemoryPool should already be counted toward the off-heap memory pool, but I suggest verifying that if you'd like to.
What changes were proposed in this pull request?
This PR ports the BHJ optimization from the CK backend to the Velox backend, ensuring that the hash table is built only once per executor. The detailed design document can be found here.
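The build-once-per-executor idea can be sketched as follows. This is an illustrative sketch only, not the actual Gluten implementation: the class name `BroadcastBuildSideCache`, the method `getOrBuild`, and the `String` placeholder standing in for the native hash table handle are all hypothetical. The key point it demonstrates is keying an executor-local cache by broadcast ID so that concurrent tasks on the same executor trigger at most one build.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (hypothetical names, not the actual Gluten API):
// cache the built hash table per broadcast ID so that every task running
// on one executor reuses a single build instead of rebuilding it.
class BroadcastBuildSideCache {
    // broadcastId -> built hash table handle (a String stands in for the
    // real native handle here)
    private static final ConcurrentHashMap<Long, String> CACHE = new ConcurrentHashMap<>();
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    private static String buildHashTable(long broadcastId) {
        BUILD_COUNT.incrementAndGet();          // the expensive build happens here
        return "hash-table-for-" + broadcastId; // placeholder for the native handle
    }

    // computeIfAbsent guarantees at most one build per key on this executor,
    // even when multiple tasks request the same broadcast concurrently.
    static String getOrBuild(long broadcastId) {
        return CACHE.computeIfAbsent(broadcastId, BroadcastBuildSideCache::buildHashTable);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 8; i++) {           // simulate eight tasks on one executor
            getOrBuild(42L);
        }
        System.out.println(BUILD_COUNT.get());  // prints 1: built once, reused 7 times
    }
}
```

In the real backend the cached value would be the native hash table (or its memory handle) rather than a string, and eviction would have to be tied to the broadcast variable's lifecycle so the off-heap memory is released when the broadcast is destroyed.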
This PR improves performance by 1.29x on 3TB TPC-DS Q23a when compared with broadcast thresholds of 100MB and 10MB. It also addresses the out-of-memory (OOM) issue encountered in Q24a and Q24b with a 100MB threshold.
However, there is still room for further optimization: implementing concurrent hash table building would mitigate the performance degradation observed in Q67. We will continue to pursue such optimizations in subsequent PRs.
Note: This PR eliminates the need for #5563.
How was this patch tested?
Existing tests