[SPARK-44846][SQL] Convert the lower redundant Aggregate to Project i…#865

Open
shanxuecheng wants to merge 1 commit into Kyligence:kyspark-3.3.x-4.x-qa from shanxuecheng:AL-10326-4x
…n RemoveRedundantAggregates

What changes were proposed in this pull request?

This PR makes the `RemoveRedundantAggregates` rule remove a redundant `Aggregate` safely: instead of dropping the lower redundant `Aggregate`, the rule converts it to a `Project`.

Why are the changes needed?

After `RemoveRedundantAggregates` runs, the remaining `Aggregate` can contain complex grouping expressions. If its `aggregateExpressions` contain `if` / `case` branches, the `PushFoldableIntoBranches` rule may rewrite them so that the `groupingExpressions` are no longer subexpressions of the `aggregateExpressions`, which then causes a `boundReference` error.
For example:
```
SELECT c * 2 AS d
FROM (
         SELECT if(b > 1, 1, b) AS c
         FROM (
                  SELECT if(a < 0, 0, a) AS b
                  FROM VALUES (-1), (1), (2) AS t1(a)
              ) t2
         GROUP BY b
     ) t3
GROUP BY c
```
Before this PR:
```
== Optimized Logical Plan ==
Aggregate [if ((b#0 > 1)) 1 else b#0], [if ((b#0 > 1)) 2 else (b#0 * 2) AS d#2]
+- Project [if ((a#3 < 0)) 0 else a#3 AS b#0]
   +- LocalRelation [a#3]
```
```
== Error ==
Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7]
java.lang.IllegalStateException: Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7]
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466)
	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241)
	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240)
	at org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653)
        ......
```
After this PR:
```
== Optimized Logical Plan ==
Aggregate [c#1], [(c#1 * 2) AS d#2]
+- Project [if ((b#0 > 1)) 1 else b#0 AS c#1]
   +- Project [if ((a#3 < 0)) 0 else a#3 AS b#0]
      +- LocalRelation [a#3]
```
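The failure and the fix can be reproduced on a toy expression tree. The sketch below is not Catalyst code: `Attr`, `Lit`, `Gt`, `Mul`, `If`, `push_foldable` and `unresolved` are hypothetical stand-ins, and `unresolved` only approximates reference binding by treating every occurrence of a grouping expression as resolved and reporting the attributes that are left over.

```python
from dataclasses import dataclass


# Hypothetical toy expression nodes (not Catalyst classes).
@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Gt:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


@dataclass(frozen=True)
class If:
    cond: object
    then: object
    other: object


def children(e):
    """Direct child expressions of a node."""
    if isinstance(e, (Attr, Lit)):
        return []
    if isinstance(e, If):
        return [e.cond, e.then, e.other]
    return [e.left, e.right]


def push_foldable(e):
    """Toy rewrite: push a literal multiplier into both branches of an If."""
    if isinstance(e, Mul) and isinstance(e.left, If) and isinstance(e.right, Lit):
        def times(branch):
            # Fold literal * literal; otherwise keep the multiplication.
            if isinstance(branch, Lit):
                return Lit(branch.value * e.right.value)
            return Mul(branch, e.right)
        return If(e.left.cond, times(e.left.then), times(e.left.other))
    return e


def unresolved(e, grouping):
    """Attribute names left over once grouping expressions count as resolved."""
    if e in grouping:
        return []
    if isinstance(e, Attr):
        return [e.name]
    names = {n for child in children(e) for n in unresolved(child, grouping)}
    return sorted(names)


b, c = Attr("b"), Attr("c")
c_def = If(Gt(b, Lit(1)), Lit(1), b)  # c = if(b > 1, 1, b)

# Before: the lower Aggregate is dropped, so c is inlined into the upper one.
rewritten = push_foldable(Mul(c_def, Lit(2)))  # if(b > 1, 2, b * 2)
before = unresolved(rewritten, [c_def])

# After: a Project keeps c as an attribute, so the upper Aggregate groups by c.
after = unresolved(push_foldable(Mul(c, Lit(2))), [c])

print(before)  # ['b']
print(after)   # []
```

In this model the inlined expression stops containing the grouping expression once the multiplier is pushed into the branches, so `b` is left dangling. With the `Project` kept in place, the upper expression is just `c * 2`, and there is no branch to push into.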
Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

Closes apache#42633 from zml1206/SPARK-44846-2.

Authored-by: zml1206 <zhuml1206@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 32a87f0)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
github-actions bot added the SQL label on Mar 9, 2026