SNOW-2895675: Skip aliases when source/destination column are identical #4037

sfc-gh-joshi · 2025-12-16T20:30:50Z

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-2895675
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
- If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
Please describe how your code solves the related issue.

When an alias clause would emit an alias that does not change an input column's name, as in SELECT "A" AS "A", the alias is elided to just SELECT "A". This PR also includes a fix from @sfc-gh-aling to emit SELECT * (rather than SELECT "col1", "col2") in join operations if no aliasing is necessary.
As a result, the SQL for the final query emitted by test_dataframe_join_suite.py::test_name_alias_on_multiple_join is reduced from 1045B -> 799B, a 24% reduction. This fix was requested for SCOS; a sample query emulating a user workload similarly experiences a 15% reduction in query size in SCOS (2175B -> 1857B), and a 61% reduction for the Snowpark equivalent (328B -> 128B).
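
The size reductions quoted above can be re-derived from the byte counts. The snippet below is only an arithmetic check; the helper name `reduction_pct` is made up for illustration and is not part of Snowpark.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# Byte counts copied from the PR description.
sizes = {
    "test_name_alias_on_multiple_join": (1045, 799),
    "SCOS sample query": (2175, 1857),
    "Snowpark equivalent": (328, 128),
}

for name, (before, after) in sizes.items():
    pct = reduction_pct(before, after)
    print(f"{name}: {before}B -> {after}B ({pct:.1f}% smaller)")
```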

Implementation Details

Though the original ask was specifically to implement this change for JOIN operations, this PR applies the optimization to all generated queries. It does so with changes in three locations:

- unary_expression_extractor in analyzer.py: While traversing a query plan, skips emitting SQL for an alias wherever the alias is redundant. This is the simplest place to make the change, as it avoids the need to track down every call site that generates an Alias node.
- derive_column_states_from_subquery in select_statement.py: This method compares the analyzed query strings of expressions to see whether their values have changed. Previously, aliases always resolved in full ("A" AS "A"), so aliased columns were always assigned the CHANGED_EXP state; now that redundant aliases resolve to just the column name ("A"), this method assigns UNCHANGED_EXP instead. My understanding is that this behavior should be correct, but it produced bugs in nested joins where a top-level projection did not properly use an alias for an ambiguous column.
- _disambiguate in dataframe.py: If two frames have no column names in common, there is no need for disambiguation, and we can simply emit SELECT *. This fix was proposed by @sfc-gh-aling in SNOW-2895675: removing redundant alias when join #4044.
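
As a rough illustration of the first two points, here is a minimal, self-contained sketch. It does not use Snowpark's real classes: `Attribute`, `Alias`, `analyze`, `State` and `column_state` are simplified stand-ins invented for this example, and only the body of `alias_expression` mirrors the snippet quoted later in the review thread.

```python
from dataclasses import dataclass
from enum import Enum

AS = " AS "


@dataclass
class Attribute:
    """Stand-in for a resolved column reference; `name` is already quoted."""
    name: str


@dataclass
class Alias:
    """Stand-in for an alias node wrapping a child expression."""
    child: object
    name: str


def alias_expression(origin: str, alias: str) -> str:
    # Pure SQL generation: always emits a real alias clause.
    return origin + AS + alias


def analyze(expr) -> str:
    if isinstance(expr, Attribute):
        return expr.name
    if isinstance(expr, Alias):
        origin = analyze(expr.child)
        # Elide the alias only when it renames a plain column to itself.
        if isinstance(expr.child, Attribute) and origin == expr.name:
            return origin
        return alias_expression(origin, expr.name)
    raise TypeError(f"unsupported expression: {expr!r}")


class State(Enum):
    UNCHANGED_EXP = "unchanged"
    CHANGED_EXP = "changed"


def column_state(projected_sql: str, source_column: str) -> State:
    # Assumption: state is decided by comparing analyzed SQL strings.
    same = projected_sql.strip(" ") == source_column
    return State.UNCHANGED_EXP if same else State.CHANGED_EXP


identity = Alias(Attribute('"A"'), '"A"')
renamed = Alias(Attribute('"A"'), '"B"')
print(analyze(identity))
print(analyze(renamed))
print(column_state(analyze(identity), '"A"'))
```

Under these assumptions, an identity alias analyzes to the bare column name, which is why a string-based state check would start reporting the column as unchanged once the alias text disappears.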

I could not track down the root cause of the aliasing issue, as it appeared even with simpler fixes like modifying _alias_if_needed in dataframe.py, and replacing analyzer calls with parse_local_name as suggested by comments within select_statement.py. Based on running through the codebase (and asking Cursor), it is likely that a subquery alias mapping is not being populated somewhere during analysis, but it is not clear which step of the analysis process would be responsible for this.

The fix I chose provides the largest benefit (removing redundant aliasing for all queries, not just joins) while adhering as closely as possible to previous behavior when analyzing column change states.

Why do we need to change both the analyzer and join disambiguation?

In cases where frames have no overlapping column names, the fix in #4044 to emit SELECT * instead of individual column names is sufficient. This can also lead to a reduction in issued DESCRIBE queries. Example:

def without_common_columns():
    # Assumes `session` is an already-created snowflake.snowpark.Session.
    session.create_dataframe([(1,2,3,4)], schema=["a", "b", "c", "d"]).write.save_as_table(table_name="temptable1", table_type="temporary")
    session.create_dataframe([(5,6,7,8)], schema=["e", "f", "g", "h"]).write.save_as_table(table_name="temptable2", table_type="temporary")

    df1 = session.table("temptable1")
    df2 = session.table("temptable2")
    dfm = df1.join(df2)
    dfm.show()
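
The decision behind the SELECT * shortcut can be sketched as follows. This is an illustration only, not the logic in dataframe.py: `join_projection` and `_project` are hypothetical helpers, and the `l_` / `r_` prefixes are placeholders for whatever disambiguating aliases the real code generates.

```python
def _project(cols: list[str], common: set[str], prefix: str) -> list[str]:
    """Alias only the columns whose names collide across both frames."""
    out = []
    for col in cols:
        if col in common:
            bare = col.strip('"')
            out.append(f'{col} AS "{prefix}{bare}"')
        else:
            out.append(col)
    return out


def join_projection(lhs_cols: list[str], rhs_cols: list[str]) -> str:
    common = set(lhs_cols) & set(rhs_cols)
    if not common:
        # No shared names, so nothing is ambiguous and no aliases are needed.
        return "SELECT *"
    parts = _project(lhs_cols, common, "l_") + _project(rhs_cols, common, "r_")
    return "SELECT " + ", ".join(parts)


print(join_projection(['"A"', '"B"'], ['"E"', '"F"']))
print(join_projection(['"ID"', '"X"'], ['"ID"', '"Y"']))
```

With disjoint column sets the projection collapses to SELECT *; as soon as one name is shared, explicit aliases are still required for the colliding columns.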

However, if frames do have overlapping column names, explicit removal of aliasing is necessary to reduce query text size. This is a very realistic scenario, as JOIN operations frequently combine tables based on a common ID column or other key. Example:

def with_common_columns():
    # Assumes `session` is an already-created snowflake.snowpark.Session.
    df1 = session.create_dataframe([[0, 1, 3, 4, 5]], schema=["b", "a", "c", "d", "x"])
    df2 = session.create_dataframe([[0, 2, 3, 4, 5]], schema=["b", "a", "c", "e", "y"])
    df3 = session.create_dataframe([[0, 3, 3, 4, 5]], schema=["b", "a", "c", "f", "z"])
    result = (
        df1
            .join(df2, df1["b"] == df2["b"])
            .join(df3, df1["c"] == df3["c"])
            .filter(df1["a"] > 0)
            .select(df1["a"], df2["a"], df1["b"])
    )
    result.show()

The changes in this PR do not reduce the number of queries issued for this case, but can lead to a substantial reduction in text size when a joined DF is used repeatedly in a sub-query.

sfc-gh-aalam · 2025-12-16T21:50:50Z

src/snowflake/snowpark/_internal/analyzer/analyzer.py

+                isinstance(expr.child, (Attribute, UnresolvedAttribute))
+                and origin == quoted_name


why not move this check here?

snowpark-python/src/snowflake/snowpark/_internal/analyzer/analyzer_utils.py

Lines 364 to 365 in 266334b

def alias_expression(origin: str, alias: str) -> str:
    return origin + AS + alias

I felt it would be clearer to put this check outside the call to a direct SQL generation function. IMO it would be very unintuitive if a call to alias_expression had control logic that could emit SQL that did not represent an actual alias expression.

sfc-gh-aling · 2026-02-03T19:07:00Z

src/snowflake/snowpark/_internal/analyzer/analyzer.py

+                expr.child, df_aliased_col_name_to_real_col_name, parse_local_name
+            )
+            if (
+                isinstance(expr.child, (Attribute, UnresolvedAttribute))


we are comparing origin against quoted_name, from analyzer code:

snowpark-python/src/snowflake/snowpark/_internal/analyzer/analyzer.py

Line 416 in 729fd3e

if isinstance(expr, UnresolvedAttribute):

I don't see that UnresolvedAttribute is guaranteed to have a quoted name

My understanding is that all UnresolvedAttribute constructor calls are expected to quote the name of the attribute. You can verify this is the case with rg 'UnresolvedAttribute\(' -A5 src; every instance of the constructor call comes with quoting.
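
To make the quoting concern concrete: a plain string comparison can only spot an identity alias when both sides use the same quoting. The sketch below uses a hypothetical `normalize_identifier` helper, a deliberately simplified stand-in that relies on Snowflake resolving unquoted identifiers as upper case; it is not Snowpark's own quoting routine and skips validation and escaping.

```python
def normalize_identifier(name: str) -> str:
    """Simplified quoting: keep quoted names as-is, upper-case and quote the rest."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name
    return '"' + name.upper() + '"'


origin = "a"          # an unquoted name, as it might reach the comparison
quoted_name = '"A"'   # the alias name after quoting

# Without normalization the identity alias goes undetected.
print(origin == quoted_name)
# After normalizing both sides, the comparison sees the same column.
print(normalize_identifier(origin) == normalize_identifier(quoted_name))
```

In this simplified model an unquoted name merely causes a missed optimization (the alias is kept), which is consistent with the reply that constructor calls are expected to quote names up front.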

sfc-gh-aling · 2026-02-03T19:09:57Z

src/snowflake/snowpark/_internal/analyzer/select_statement.py

-                c, from_.df_aliased_col_name_to_real_col_name, parse_local_name=True
-            ).strip(" "):
+            # SNOW-2895675: Always treat Aliases as "changed", even if it is an identity.
+            # The fact this check is needed may be a bug in column state analysis, and we should revisit it later.


do you have a case that would fail if we don't treat aliases as changed?
and what's the generated sql

tests/integ/scala/test_dataframe_join_suite.py::test_name_alias_on_multiple_join was previously failing without this, but it looks like it no longer fails after your disambiguation change. I think the underlying bug still exists though, so I'll see if I can find another reproducer.

It looks like this is no longer necessary because of your changes to populate the alias map in dataframe.py. I've reverted this file.

Never mind, CI revealed some failures:

FAILED tests/integ/test_cte.py::test_sql_simplifier - assert 2 == 1
FAILED tests/integ/scala/test_dataframe_suite.py::test_rename_join_dataframe - snowflake.snowpark.exceptions.SnowparkSQLException: (1304): 01c22c14-0813-e...
FAILED tests/integ/test_dataframe.py::test_dataframe_alias - snowflake.snowpark.exceptions.SnowparkSQLAmbiguousJoinException: (1303): Th...

I'll add a comment mentioning them.

sfc-gh-aling · 2026-02-03T19:10:14Z

src/snowflake/snowpark/mock/_analyzer.py

+                expr.child, df_aliased_col_name_to_real_col_name, parse_local_name
            )
+            if (
+                isinstance(expr.child, (Attribute, UnresolvedAttribute))


same question on quotes

sfc-gh-joshi requested review from a team as code owners December 16, 2025 20:30

sfc-gh-joshi requested review from sfc-gh-jdu, sfc-gh-jrose and sfc-gh-mayliu December 16, 2025 20:30

github-actions bot added the local testing Local Testing issues/PRs label Dec 16, 2025

sfc-gh-aalam reviewed Dec 16, 2025

sfc-gh-joshi and others added 13 commits January 28, 2026 16:51

commit with wip debugging

d8a0809

cleanup

c733274

add back newline

d546694

fix test_query_line_intervals

84f272e

try removing redundant alias

cdfd5a9

fix alias map

4118eba

fix test

ef9c6d7

fix local testing

c58903f

fix local test

8ce0d02

one more optimization

05847ae

Apply suggestions from code review

bdea3fa

Merge branch 'aling-test-removing-alias' into joshi-SNOW-2895675-remove-join-alias

01c64e3

minor comments

3b25773

sfc-gh-joshi force-pushed the joshi-SNOW-2895675-remove-join-alias branch from 4c19d78 to 3b25773 Compare January 30, 2026 18:46

fix removed alias

7201f80

sfc-gh-aling reviewed Feb 3, 2026

sfc-gh-joshi added 4 commits February 3, 2026 11:55

Revert analyzer change

7bbb3e0

Merge remote-tracking branch 'origin/main' into joshi-SNOW-2895675-remove-join-alias

27ffdf9

changelog

55b64c6

add back sql simplifier check

58abbe6
