-
Notifications
You must be signed in to change notification settings - Fork 142
SNOW-2895675: Skip aliases when source/destination column are identical #4037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| isinstance(expr.child, (Attribute, UnresolvedAttribute)) | ||
| and origin == quoted_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not move this check here?
snowpark-python/src/snowflake/snowpark/_internal/analyzer/analyzer_utils.py
Lines 364 to 365 in 266334b
| def alias_expression(origin: str, alias: str) -> str: | |
| return origin + AS + alias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt it would be clearer to put this check outside the call to a direct SQL generation function. IMO it would be very unintuitive if a call to alias_expression had control logic that could emit SQL that did not represent an actual alias expression.
4c19d78 to
3b25773
Compare
| expr.child, df_aliased_col_name_to_real_col_name, parse_local_name | ||
| ) | ||
| if ( | ||
| isinstance(expr.child, (Attribute, UnresolvedAttribute)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are comparing origin against quoted_name, from analyzer code:
| if isinstance(expr, UnresolvedAttribute): |
I don't see UnresolvedAttribute is guaranteed to have quoted name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that all UnresolvedAttribute constructor calls are expected to quote the name of the attribute. You can verify this is the case with rg 'UnresolvedAttribute\(' -A5 src; every instance of the constructor call comes with quoting.
| c, from_.df_aliased_col_name_to_real_col_name, parse_local_name=True | ||
| ).strip(" "): | ||
| # SNOW-2895675: Always treat Aliases as "changed", even if it is an identity. | ||
| # The fact this check is needed may be a bug in column state analysis, and we should revisit it later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have a case that would fail if we don't treat aliases as changed?
and what's the generated sql
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests/integ/scala/test_dataframe_join_suite.py::test_name_alias_on_multiple_join was previously failing without this, but it looks like it no longer fails after your disambiguation change. I think the underlying bug still exists though, so I'll see if I can find another reproducer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is no longer necessary because of your changes to populate the alias map in dataframe.py. I've reverted this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, CI revealed some failures:
FAILED tests/integ/test_cte.py::test_sql_simplifier - assert 2 == 1
FAILED tests/integ/scala/test_dataframe_suite.py::test_rename_join_dataframe - snowflake.snowpark.exceptions.SnowparkSQLException: (1304): 01c22c14-0813-e...
FAILED tests/integ/test_dataframe.py::test_dataframe_alias - snowflake.snowpark.exceptions.SnowparkSQLAmbiguousJoinException: (1303): Th...
I'll add a comment mentioning them.
| expr.child, df_aliased_col_name_to_real_col_name, parse_local_name | ||
| ) | ||
| if ( | ||
| isinstance(expr.child, (Attribute, UnresolvedAttribute)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question on quotes
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2895675
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
When an alias clause would emit an alias that does not change an input column's name, as in
SELECT "A" AS "A", the alias is elided to justSELECT "A". This PR also includes a fix from @sfc-gh-aling to emitSELECT *(rather thanSELECT "col1", "col2") in join operations if no aliasing is necessary.As a result, the SQL for the final query emitted by
test_dataframe_join_suite.py::test_name_alias_on_multiple_joinis reduced from 1045B -> 799B, a 24% reduction. This fix was requested for SCOS; a sample query emulating a user workload similarly experiences a 15% reduction in query size in SCOS (2175B -> 1857B), and a 60% reduction for the Snowpark equivalent (328B -> 128B).Implementation Details
Though the original ask was specifically to implement this change for JOIN operations, this PR applies the optimization to all generated queries. It does so with changes in three locations:
unary_expression_extractorinanalyzer.py: Avoids emitting SQL for an alias in locations when possible, when traversing a query plan. This is the simplest location to make this change, as it avoids the need to track down all call sites that generate anAliasnode.derive_column_states_from_subqueryinselect_statement.py: This method compares the analyzed query strings of expressions to see if their values have changed. Previously, aliases always fully resolved ("A" AS "A"), so aliased columns were always assigned the CHANGED_EXP state; now, since redundant aliases resolve to just the column name ("A"), this method assigns UNCHANGED_EXP instead. My understanding is that this behavior should be correct, but this produced bugs in nested joins where a top-level projection did not properly use an alias for an ambiguous column._disambiguateindataframe.py: If two frames have no column names in common at all, then there is no need for disambiguation, and we can simply emit aSELECT *. This fix was proposed by @sfc-gh-aling in SNOW-2895675: removing redundant alias when join #4044.I could not track down the root cause of the aliasing issue, as it appeared even with simpler fixes like modifying
_alias_if_neededindataframe.py, and replacing analyzer calls withparse_local_nameas suggested by comments withinselect_statement.py. Based on running through the codebase (and asking Cursor), it is likely that a subquery alias mapping somewhere is not being populated somewhere during analysis, but it is not clear what step of the analysis process would be responsible for this.The fix I chose provides the largest benefit (removing redundant aliasing for all queries, not just joins) while adhering as closely as possible to previous behavior when analyzing column change states.
Why do we need to change both the analyzer and join disambiguation?
In cases where frames have no overlapping column names, the fix in #4044 to emit
SELECT *instead of individual column names is sufficient. This can also lead to a reduction in issued DESCRIBE queries. Example:However, if frames do have overlapping column names, explicit removal of aliasing is necessary to reduce query text size. This is a very realistic scenario, as JOIN operations frequently combine tables based on a common ID column or other key. Example:
The changes in this PR do not reduce the number of queries issued for this case, but can lead to a substantial reduction in text size when a joined DF is used repeatedly in a sub-query.