Why do we recommend to disable "preferSortMergeJoin"? #6133

xumingming · 2024-06-18T14:17:14Z


https://github.com/apache/incubator-gluten/blob/800cadd0f4f71d0ebedb5fbf6428442ae52b77ac/docs/Configuration.md?plain=1#L21

Just curious: why do we recommend disabling preferSortMergeJoin? Do we have any benchmark results behind this recommendation? It would be great if you could share them 👍

The reason I ask is that Spark's own description of this config says sort merge join uses less memory and works efficiently when both join tables are large:

```scala
  val PREFER_SORTMERGEJOIN = buildConf("spark.sql.join.preferSortMergeJoin")
    .internal()
    .doc("When true, prefer sort merge join over shuffled hash join. " +
      "Sort merge join consumes less memory than shuffled hash join and it works efficiently " +
      "when both join tables are large. On the other hand, shuffled hash join can improve " +
      "performance (e.g., of full outer joins) when one of join tables is much smaller.")
    .version("2.0.0")
    .booleanConf
    .createWithDefault(true)
```
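
For anyone who wants to see what this flag changes on their own workload, here is a minimal sketch in vanilla Spark (no Gluten plugin configured). The object name, table sizes, and the 1 KB broadcast threshold are made up for illustration. As far as I understand the planner, setting the flag to false only allows shuffled hash join; Spark still applies its own size-based checks, so the join that actually shows up in the plan depends on table statistics.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark or Gluten.
object JoinStrategyCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-strategy-check")
      .master("local[*]")
      // Assumption: a tiny but positive threshold keeps both tables from
      // being broadcast, while still leaving room for the planner's
      // size-based shuffled hash join check to pass on the small side.
      .config("spark.sql.autoBroadcastJoinThreshold", "1024")
      .getOrCreate()

    // Illustrative data: one large side and one much smaller side.
    val large = spark.range(0L, 1000000L).withColumnRenamed("id", "k")
    val small = spark.range(0L, 1000L).withColumnRenamed("id", "k")

    for (prefer <- Seq("true", "false")) {
      // The flag is a runtime SQL conf, so it can be changed per session.
      spark.conf.set("spark.sql.join.preferSortMergeJoin", prefer)
      println(s"spark.sql.join.preferSortMergeJoin=$prefer")
      // Look for SortMergeJoin vs. ShuffledHashJoin in the printed plan.
      large.join(small, "k").explain()
    }

    spark.stop()
  }
}
```

Comparing the two printed plans, and then timing the same query under each setting, is probably the quickest way to check whether the recommendation holds for a given dataset.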

xumingming (Author) · 2024-07-08T09:16:02Z

@zhouyuan @z123 @PHILO-HE Do you have any information to share? For example, do you use SortMergeJoin or ShuffledHashJoin in production?


zhouyuan (Collaborator) · 2024-07-17T06:51:07Z

Hi, @xumingming
This is mostly because the SortMergeJoin implementation in Velox is not as well optimized as HashJoin. In our TPC-H benchmark, using HashJoin improves performance by up to ~2x, depending on the query. For vanilla Spark, I think the hash join implementation does not support spill, while sort merge join does. That may be the main reason why sort merge join is promoted there.

I don't have much information about production environments, but for both functionality and performance in Gluten/Velox, Hash Join is better. We have also been improving the merge join code path in Velox recently, but it still needs more testing and validation from Gluten users.

thanks, -yuan

1 reply

xumingming Jul 17, 2024
Author

@zhouyuan Thanks for the information, appreciated!
