Skip to content

[GLUTEN-10660][VL] Adding configuration for hash table build#10634

Merged
zhouyuan merged 6 commits intoapache:mainfrom
zhouyuan:wip_config_hashbuild
Sep 29, 2025
Merged

[GLUTEN-10660][VL] Adding configuration for hash table build#10634
zhouyuan merged 6 commits intoapache:mainfrom
zhouyuan:wip_config_hashbuild

Conversation

@zhouyuan
Copy link
Member

@zhouyuan zhouyuan commented Sep 5, 2025

What changes are proposed in this pull request?

Adding configurations for hash map build optimization in facebookincubator/velox#7066

How was this patch tested?

@github-actions github-actions bot added the VELOX label Sep 5, 2025
@zhouyuan zhouyuan force-pushed the wip_config_hashbuild branch 4 times, most recently from 718068d to e3578e4 Compare September 8, 2025 09:12
@liujiayi771
Copy link
Contributor

We need to set spark.gluten.velox.abandonbuild.noduphashminpct=100 to verify the behavior when deduplication is enabled.

@zhouyuan zhouyuan force-pushed the wip_config_hashbuild branch from 238f97b to 6682628 Compare September 8, 2025 13:09
@github-actions github-actions bot added the DOCS label Sep 8, 2025
@zhouyuan zhouyuan force-pushed the wip_config_hashbuild branch from 5294814 to 02acee6 Compare September 9, 2025 11:20
@github-actions github-actions bot removed the DOCS label Sep 9, 2025
@zhouyuan zhouyuan changed the title [VL] Adding configuration for hash table build [GLUTEN-10660][VL] Adding configuration for hash table build Sep 9, 2025
@zhouyuan zhouyuan marked this pull request as ready for review September 9, 2025 13:32
@github-actions
Copy link

github-actions bot commented Sep 9, 2025

#10660

@liujiayi771
Copy link
Contributor

liujiayi771 commented Sep 10, 2025

After the testing, should we enable this feature by default?
In addition, the config key may still change in the upstream Velox. It's better to wait until the code review comments are finalized.

@zhouyuan
Copy link
Member Author

zhouyuan commented Sep 11, 2025

@liujiayi771 I vote for enabling this by default. The staging patch in our Velox fork is not updated. We will update on that after facebookincubator/velox#7066 merged

@zhouyuan zhouyuan requested a review from FelixYBW September 11, 2025 08:40
@FelixYBW
Copy link
Contributor

At least for TPCDS test, I didn't see any performance regression with spark.gluten.velox.abandonbuild.noduphashminpct=100. I think we can enable it by default.

@liujiayi771
Copy link
Contributor

TPCDS has very few queries that contain semi/anti joins, and all of these involve duplicate join keys.


val VELOX_HASHMAP_ABANDON_BUILD_DUPHASH_MIN_PCT =
buildConf("spark.gluten.velox.abandonbuild.noduphashminpct")
.internal()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Signed-off-by: Yuan <yuanzhou@apache.org>
Signed-off-by: Yuan <yuanzhou@apache.org>
Signed-off-by: Yuan <yuanzhou@apache.org>
@zhouyuan zhouyuan force-pushed the wip_config_hashbuild branch from 7c1a2b6 to f030c73 Compare September 18, 2025 14:35
Signed-off-by: Yuan <yuanzhou@apache.org>
@github-actions github-actions bot added the DOCS label Sep 18, 2025
@zhouyuan
Copy link
Member Author

@liujiayi771 are you suggesting it's better to disable this feature by default?

val VELOX_HASHMAP_ABANDON_BUILD_DUPHASH_MIN_PCT =
buildConf("spark.gluten.velox.abandonbuild.noduphashminpct")
.experimental()
.doc("Experimental: abandon hashmap build if duplicated rows are more than this pct.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use "percentile" to avoid abbreviations in the doc.

Copy link
Contributor

@marin-ma marin-ma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@liujiayi771
Copy link
Contributor

@liujiayi771 are you suggesting it's better to disable this feature by default?

Yes, for the Q95 build side join keys, the duplication rate is very high, so the optimization is very significant. However, for production jobs without a high duplication rate, there could be a performance regression. It's necessary to adjust the abandon percent based on the specific job conditions.

Signed-off-by: Yuan <yuanzhou@apache.org>
@zhouyuan zhouyuan force-pushed the wip_config_hashbuild branch from 4f5bb36 to 0b3bee5 Compare September 26, 2025 13:08
Signed-off-by: Yuan <yuanzhou@apache.org>
@zhouyuan zhouyuan merged commit cc34b49 into apache:main Sep 29, 2025
56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants