20426: fix: Prefer numeric in type coercion for comparisons #294

Open
martin-augment wants to merge 12 commits into main from pr-20426-2026-03-25-21-13-54
Conversation

@martin-augment
Owner

20426: To review by AI

@coderabbitai

coderabbitai bot commented Mar 25, 2026

Walkthrough

This pull request refactors type coercion logic in DataFusion to distinguish between comparison and type-union contexts. A new type_union_coercion function handles type unification for operations like UNION, CASE expressions, and NVL2, preferring string types. The existing comparison_coercion function is updated to prefer numeric types when comparing mixed numeric and string operands. Related coercion helpers like dictionary_coercion and ree_coercion are generalized to accept a coercion strategy parameter. Callers throughout the codebase are updated to invoke the appropriate coercion function for their context. Test expectations are modified to reflect numeric-preferred comparison semantics, and new test coverage is added for the updated coercion behavior.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines DataFusion's type coercion system, particularly for operations involving mixed numeric and string data types. The primary goal is to ensure that comparison operations (like filtering or joining) correctly interpret string representations of numbers as numeric values, preventing unexpected lexicographical comparisons. Concurrently, it clarifies and separates the coercion logic for type unification scenarios (like UNION or CASE result branches), where string preference is generally more appropriate. This change leads to more intuitive and SQL-compliant behavior, while also introducing stricter type checking that will now surface errors for invalid string-to-numeric casts.

Highlights

  • Type Coercion Logic Refinement: The type coercion rules for comparison operations (e.g., =, >, IN, BETWEEN, CASE WHEN) have been updated to prefer numeric types over string types when comparing mixed numeric and string values. This ensures comparisons are performed numerically rather than lexicographically.
  • Introduction of type_union_coercion: A new coercion function, type_union_coercion, has been introduced for contexts where type unification is needed (e.g., UNION, CASE THEN/ELSE, NVL2). This function prefers string types when unifying mixed numeric and string values, aligning with standard SQL behavior for these operations.
  • Refactored Coercion Functions: The comparison_coercion_numeric function has been removed, and its logic is now integrated into the main comparison_coercion function. Generic coercion functions for struct, map, dictionary, and RunEndEncoded types were refactored to accept a coercion strategy, allowing them to be used with both comparison-based and union-based coercion.
  • Enhanced Test Coverage and Documentation: Extensive new SQL logic tests (string_numeric_coercion.slt) have been added to validate the new coercion behavior across various scenarios, including error cases for invalid casts. The upgrade guide (53.0.0.md) has been updated with a detailed explanation of the changes, affected queries, and migration advice.

@claude

claude bot commented Mar 25, 2026

PR Review: fix: Prefer numeric in type coercion for comparisons

This is a well-executed fix for a genuine semantic bug. The core issue — string-based lexicographic comparison causing 5 > '100' to evaluate as true — is a real correctness problem worth fixing. Overall the PR is in good shape. Below are my observations.


What the PR Does

The PR introduces a distinction between two coercion contexts:

  • Comparison contexts (=, <, >, IN, BETWEEN, CASE WHEN): now prefer numeric types
  • Type unification contexts (UNION, CASE THEN/ELSE, NVL2): continue to prefer string types

This fixes incorrect lexicographic comparison behavior while preserving sensible behavior for type unification.
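
The difference between the two comparison semantics can be illustrated with a small standalone sketch. It does not use DataFusion: `greater_than_as_strings` and `greater_than_as_numbers` are hypothetical helper names that only model "compare after casting to string" versus "compare after casting to a number".

```rust
use std::num::ParseIntError;

/// Old behavior, modeled: cast the integer to text and compare the two
/// strings lexicographically (byte by byte).
fn greater_than_as_strings(lhs: i64, rhs: &str) -> bool {
    lhs.to_string().as_str() > rhs
}

/// New behavior, modeled: parse the string as an integer and compare numbers.
/// A string that is not a valid integer produces an error instead of a result.
fn greater_than_as_numbers(lhs: i64, rhs: &str) -> Result<bool, ParseIntError> {
    let rhs_num: i64 = rhs.parse()?;
    Ok(lhs > rhs_num)
}
```

With these helpers, `greater_than_as_strings(5, "100")` is `true` because the character `'5'` sorts after `'1'`, while `greater_than_as_numbers(5, "100")` is `Ok(false)`.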


Correctness / Potential Issues

1. Intentional breaking changes are significant but clearly documented.

Queries like str_col IN ('a', 1) and int_col = 'hello' that previously succeeded (with silent coercion to string) will now error. This is the correct behavior, but users should be explicitly warned in the PR description and migration notes. ✅ The migration guide in upgrading/53.0.0.md covers this well.

2. Dictionary and RunEndEncoded type coercion flip.

Dictionary(Int32, Utf8) compared to Int16 now coerces to Int16 rather than Utf8. This is semantically correct but is a silent behavior change — not an error, just different results. Consider whether a warning or stricter check is warranted here, since dictionary-encoded string columns compared to integers could indicate a schema design issue.

3. Removed tests in push_down_filter_parquet.slt.

Several tests for non-castable string literals were removed rather than updated to assert the new error behavior. It would be better to convert them to error-assertion tests to maintain coverage of those code paths.

4. in_list_types test rename and behavior change.

The old test in_list_types tested a successful plan; the new in_list_types_mixed_string_int_error tests that it errors. This is correct, but consider adding a new passing test that confirms int_col IN (1, 2, 3) (pure numeric IN list) still works as expected after the refactor.


Code Quality

  • The refactoring is clean. The old pair (comparison_coercion / comparison_coercion_numeric) has been reorganized into a clear intent-driven pair (comparison_coercion for numeric-preferring, type_union_coercion for string-preferring).
  • Using function pointers in dictionary_coercion, struct_coercion, map_coercion, ree_coercion to avoid duplicating logic is a good design choice.
  • Comments explaining why different coercion strategies are used in different contexts are clear and helpful.
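
As a rough illustration of the function-pointer design, here is a toy model that is not the DataFusion code. The `Ty` enum, the `CoerceFn` alias, and the simplified signatures are assumptions made for this sketch; the real helpers operate on Arrow `DataType` values and take additional parameters.

```rust
/// Toy type model; the real Arrow `DataType` has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    Utf8,
    /// Dictionary with the given value type (the key type is omitted here).
    Dictionary(Box<Ty>),
}

/// A coercion strategy: given two types, pick a common type if one exists.
type CoerceFn = fn(&Ty, &Ty) -> Option<Ty>;

/// Comparison-style strategy (toy): mixed Int64/Utf8 resolves to Int64.
fn comparison_coercion(l: &Ty, r: &Ty) -> Option<Ty> {
    match (l, r) {
        (a, b) if a == b => Some(a.clone()),
        (Ty::Int64, Ty::Utf8) | (Ty::Utf8, Ty::Int64) => Some(Ty::Int64),
        _ => None,
    }
}

/// Union-style strategy (toy): mixed Int64/Utf8 resolves to Utf8.
fn type_union_coercion(l: &Ty, r: &Ty) -> Option<Ty> {
    match (l, r) {
        (a, b) if a == b => Some(a.clone()),
        (Ty::Int64, Ty::Utf8) | (Ty::Utf8, Ty::Int64) => Some(Ty::Utf8),
        _ => None,
    }
}

/// Shared dictionary logic: unwrap the dictionary value type, then delegate
/// the actual decision to whichever strategy the caller passed in.
fn dictionary_coercion(l: &Ty, r: &Ty, coerce_fn: CoerceFn) -> Option<Ty> {
    match (l, r) {
        (Ty::Dictionary(lv), Ty::Dictionary(rv)) => coerce_fn(&**lv, &**rv),
        (Ty::Dictionary(v), other) | (other, Ty::Dictionary(v)) => coerce_fn(&**v, other),
        _ => None,
    }
}
```

In this model the same `dictionary_coercion` body yields `Int64` for a string dictionary versus `Int64` when called with `comparison_coercion`, and `Utf8` when called with `type_union_coercion`, so the unwrapping logic is written only once.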

One minor nit: the naming type_union_coercion could be confused with SQL UNION type coercion specifically, but it's used for any context that prefers string widening. A name like widening_coercion or string_preferring_coercion might be clearer — though this is a minor style point.


Test Coverage

The new string_numeric_coercion.slt file (496 lines) is comprehensive:

  • Covers <, >, <=, >=, =, !=
  • Includes error cases (non-numeric strings, overflow, decimal vs. int)
  • Tests UNION, BETWEEN, IN, CASE WHEN contexts
  • Tests complex types: structs, lists, maps, dictionaries, run-end-encoded

The unit tests in comparison.rs (test_comparison_coercion_prefers_numeric, test_type_union_coercion_prefers_string, test_binary_comparison_string_numeric_coercion) round this out well.

Gap: There are no tests for NULL-handling edge cases (e.g., NULL IN (1, 'a') or NULL > '5') in the new coercion contexts. These are worth verifying.


Performance

No concerns. Type coercion happens at plan time, and the changes don't affect execution kernels. The refactored code is if anything slightly simpler.


Documentation

The upgrading/53.0.0.md migration guide is clear and actionable. Good work including a before/after behavior table.


Summary

| Area | Assessment |
| --- | --- |
| Correctness | ✅ Fixes a real semantic bug |
| Breaking changes | ⚠️ Intentional and documented, but significant |
| Code quality | ✅ Clean refactor with good separation of concerns |
| Test coverage | ✅ Comprehensive, minor gap in NULL edge cases |
| Documentation | ✅ Migration guide is clear |
| Performance | ✅ No concerns |

Suggested action items before merge:

  1. Convert removed push_down_filter_parquet.slt tests to error-assertion form instead of deleting them.
  2. Add a NULL + mixed-type IN/comparison test case to string_numeric_coercion.slt.
  3. Consider whether Dictionary(Utf8) vs Int silent behavior change needs a deprecation note or log warning.

Otherwise the change is solid.

@augmentcode

augmentcode bot commented Mar 25, 2026

🤖 Augment PR Summary

Summary: This PR changes type coercion so comparisons between numeric and string values prefer numeric semantics, fixing incorrect lexicographic ordering/equality.

Changes:

  • Updates comparison coercion to cast the string side to the numeric type for =, <, >, <=, >=, <>, IN, BETWEEN, and CASE expr WHEN contexts
  • Introduces type_union_coercion for type-unification contexts (e.g., UNION, CASE THEN/ELSE results, NVL2) where strings remain preferred
  • Refactors struct/map/dictionary/REE coercion to be parameterized over the chosen coercion function
  • Updates optimizer/analyzer coercion paths for CASE and UNION schema coercion to use the appropriate rule
  • Adjusts/expands tests, including new sqllogictest coverage for string-numeric comparisons and nested types
  • Updates Clickbench unparser test setup to register raw parquet then create a view with corrected EventDate typing
  • Updates docs (53.0.0 upgrade guide) describing the behavioral change and the removal of comparison_coercion_numeric

Technical Notes: Mixed string/numeric comparisons may now raise cast errors at runtime when strings are not parseable as the target numeric type; UNION/branch unification continues to prefer strings via type_union_coercion.
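
To make the runtime-error note concrete, the following is a minimal sketch of what casting a string column to `Int64` implies. `cast_strings_to_int64` is a hypothetical stand-in, not the Arrow cast kernel; the message format mirrors the error text that appears in this PR's test expectations, and the rule that the whole cast fails on the first bad value is an assumption of the model.

```rust
/// Toy model of casting a string column to Int64: the whole cast fails on
/// the first value that does not parse as a 64-bit integer.
fn cast_strings_to_int64(values: &[&str]) -> Result<Vec<i64>, String> {
    values
        .iter()
        .map(|s| {
            s.parse::<i64>()
                .map_err(|_| format!("Cannot cast string '{}' to value of Int64 type", s))
        })
        .collect()
}
```

In this model a column such as `["1", "202410"]` casts cleanly, while a single non-numeric value like `"Andrew"`, or a value that overflows `i64`, turns the whole comparison into an error.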


@augmentcode augmentcode bot left a comment

Review completed. 3 suggestions posted.

.or_else(|| ree_coercion(lhs_type, rhs_type, true, type_union_coercion))
.or_else(|| temporal_coercion_nonstrict_timezone(lhs_type, rhs_type))
.or_else(|| string_coercion(lhs_type, rhs_type))
.or_else(|| list_coercion(lhs_type, rhs_type))
In type_union_coercion, list types are handled via list_coercion, whose child-type resolution uses type_union_resolution (numeric-preferring for string/numeric). This may cause UNION/CASE-THEN coercion for List(Int64) vs List(Utf8) to pick List(Int64) and error on non-numeric strings, which seems to contradict the string-preferring intent described above.

Severity: medium

Reply (Owner, Author):

value:useful; category:bug; feedback: The Augment AI reviewer is correct! type_union_resolution() prefers numbers over strings, and this affects how list_coercion() behaves too. Fixing this prevents wrong behavior for list-related operations.

----
[20, 30]

# List in UNION uses type union coercion (string preferred).
This UNION test for list types doesn't assert the resulting element type, and [5, 10] formatting could pass even if the union coerces to List(Int64) by casting strings. Consider asserting via arrow_typeof on the UNION output (or adding a non-numeric string case) to ensure list UNION truly uses string-preferred coercion.

Severity: low

Reply (Owner, Author):

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The SQL logic test claims that the result is List(Utf8) but does not assert it, so there is no way to confirm this or to prevent regressions.

// int column vs string literal: the string literal is cast to Int64
create_expr_test(col("i").eq(lit("202410")), "i@1 = CAST(202410 AS Int64)");
create_expr_test(lit("202410").eq(col("i")), "CAST(202410 AS Int64) = i@1");
// when simplified, the literal cast is constant-folded
The comment says the literal cast is constant-folded in the simplified expression, but the expected simplified string still includes CAST(202410 AS Int64). If folding is intended, this test may not currently validate it (or the comment may be outdated).

Severity: low

Reply (Owner, Author):

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The comment claims that the result is folded into a constant, but the expected unit test output still contains CAST(...), so the comment is incorrect and needs to be fixed.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant change to DataFusion's type coercion rules, distinguishing between 'comparison' contexts (e.g., =, >, IN, BETWEEN, CASE WHEN) where numeric types are now preferred, and 'type unification' contexts (e.g., UNION, CASE THEN/ELSE, NVL2) where string types are preferred. This refactoring ensures more intuitive and correct numeric semantics for mixed-type comparisons, raising errors when string literals cannot be cast to the numeric type. The changes involve updating core coercion functions, adapting their usage across the expression and physical planning layers, and updating extensive tests and documentation to reflect the new behavior.


@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


/// Coercion rules for list types.
/// Coerce two list types by coercing their element types via
/// [`type_union_resolution`].
List coercion not parameterized unlike struct and map

Medium Severity

struct_coercion and map_coercion were correctly parameterized with a coerce_fn to distinguish comparison vs. union contexts, but list_coercion was not. It always delegates to type_union_resolution, which internally calls string_numeric_coercion (numeric-preferring). This means list element coercion in type_union_coercion contexts (UNION, CASE THEN/ELSE) incorrectly uses numeric preference instead of string preference — opposite to how struct and map behave in the same context.
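
The issue can be sketched with a toy model. Everything below is illustrative: `Ty`, `numeric_preferring`, `string_preferring`, and `list_coercion_with` are hypothetical names, and the parameterized variant is only one possible shape for a fix, not the code in this PR.

```rust
/// Toy type model; the real code works on Arrow `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    Utf8,
    List(Box<Ty>),
}

type CoerceFn = fn(&Ty, &Ty) -> Option<Ty>;

/// Toy numeric-preferring element rule.
fn numeric_preferring(l: &Ty, r: &Ty) -> Option<Ty> {
    match (l, r) {
        (a, b) if a == b => Some(a.clone()),
        (Ty::Int64, Ty::Utf8) | (Ty::Utf8, Ty::Int64) => Some(Ty::Int64),
        _ => None,
    }
}

/// Toy string-preferring element rule.
fn string_preferring(l: &Ty, r: &Ty) -> Option<Ty> {
    match (l, r) {
        (a, b) if a == b => Some(a.clone()),
        (Ty::Int64, Ty::Utf8) | (Ty::Utf8, Ty::Int64) => Some(Ty::Utf8),
        _ => None,
    }
}

/// Models the reported behavior: the element rule is fixed, so every caller
/// gets numeric preference regardless of context.
fn list_coercion(l: &Ty, r: &Ty) -> Option<Ty> {
    list_coercion_with(l, r, numeric_preferring)
}

/// Hypothetical parameterized variant: the caller chooses the element rule,
/// as struct and map coercion are described as doing.
fn list_coercion_with(l: &Ty, r: &Ty, coerce_fn: CoerceFn) -> Option<Ty> {
    match (l, r) {
        (Ty::List(le), Ty::List(re)) => {
            let elem = coerce_fn(&**le, &**re)?;
            Some(Ty::List(Box::new(elem)))
        }
        _ => None,
    }
}
```

In this model, `List(Int64)` against `List(Utf8)` resolves to `List(Int64)` through the fixed-rule function, but to `List(Utf8)` when a union-context caller passes `string_preferring`.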



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
datafusion/sqllogictest/test_files/string/string_query.slt.part (1)

50-54: Consider loosening error-text matching to reduce test brittleness.

Matching the exact bad value ('Andrew') can make this test fragile if evaluation order changes. A regex that asserts cast-failure class/message is usually more stable.

♻️ Suggested test expectation hardening
-statement error Arrow error: Cast error: Cannot cast string 'Andrew' to value of Int64 type
+statement error .*Cannot cast string '.*' to value of Int64 type
 select ascii_1 = 1 as col1, 1 = ascii_1 as col2 from test_basic_operator;
 
-statement error Arrow error: Cast error: Cannot cast string 'Andrew' to value of Int64 type
+statement error .*Cannot cast string '.*' to value of Int64 type
 select ascii_1 <> 1 as col1, 1 <> ascii_1 as col2 from test_basic_operator;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/sqllogictest/test_files/string/string_query.slt.part` around lines
50 - 54, The test currently matches the full error text including the specific
bad value "'Andrew'", which is brittle; update the expectations in
string_query.slt.part for the two failing statements (the ones starting with
"select ascii_1 = 1 as col1, 1 = ascii_1 as col2 from test_basic_operator;" and
"select ascii_1 <> 1 as col1, 1 <> ascii_1 as col2 from test_basic_operator;")
to use a looser regex or substring that asserts the cast-failure class/message
(e.g. match "Arrow error: Cast error" and/or "Cannot cast .* to value of Int64"
without the concrete literal) so the test no longer depends on the exact value
text.
datafusion/sqllogictest/test_files/string_numeric_coercion.slt (1)

491-496: Make these two negative cases assert the coercion failure, not just any failure.

Bare query error will pass on unrelated parser or planner errors too. Since this file is pinning coercion behavior, please match a stable substring here like the array case above.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/sqllogictest/test_files/string_numeric_coercion.slt` around lines
491 - 496, Replace the bare "query error" lines before the two MAP tests with a
targeted failure assertion that matches the coercion failure (not any
parser/planner error); for the queries SELECT MAP {[1,2,3]:1, ['a', 'b']:2}; and
SELECT MAP {'a':[1,2,3], 'b':['a', 'b']}; change the preceding marker to
something like "query error: cannot coerce" (or another stable substring used
elsewhere for array/type coercion failures) so the test asserts a
coercion-specific error.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between cebfbf1 and b73e41a.

📒 Files selected for processing (22)
  • datafusion/core/src/physical_planner.rs
  • datafusion/core/tests/expr_api/mod.rs
  • datafusion/core/tests/sql/unparser.rs
  • datafusion/expr-common/src/interval_arithmetic.rs
  • datafusion/expr-common/src/signature.rs
  • datafusion/expr-common/src/type_coercion/binary.rs
  • datafusion/expr-common/src/type_coercion/binary/tests/comparison.rs
  • datafusion/expr-common/src/type_coercion/binary/tests/dictionary.rs
  • datafusion/expr-common/src/type_coercion/binary/tests/run_end_encoded.rs
  • datafusion/expr/src/type_coercion/functions.rs
  • datafusion/expr/src/type_coercion/other.rs
  • datafusion/functions/src/core/nvl2.rs
  • datafusion/optimizer/src/analyzer/type_coercion.rs
  • datafusion/physical-expr/src/expressions/case.rs
  • datafusion/sqllogictest/test_files/delete.slt
  • datafusion/sqllogictest/test_files/dictionary.slt
  • datafusion/sqllogictest/test_files/expr.slt
  • datafusion/sqllogictest/test_files/push_down_filter_parquet.slt
  • datafusion/sqllogictest/test_files/string/string_query.slt.part
  • datafusion/sqllogictest/test_files/string_numeric_coercion.slt
  • datafusion/substrait/tests/cases/roundtrip_logical_plan.rs
  • docs/source/library-user-guide/upgrading/53.0.0.md
💤 Files with no reviewable changes (1)
  • datafusion/sqllogictest/test_files/push_down_filter_parquet.slt

 .or_else(|| temporal_coercion_nonstrict_timezone(lhs_type, rhs_type))
 .or_else(|| string_coercion(lhs_type, rhs_type))
-.or_else(|| numeric_string_coercion(lhs_type, rhs_type))
+.or_else(|| string_numeric_coercion(lhs_type, rhs_type))
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the actual usage of type_union_resolution vs type_union_coercion
# to determine if the inconsistency affects real scenarios

echo "=== Usages of type_union_resolution ==="
rg -n "type_union_resolution\b" --type rust -C 2

echo ""
echo "=== Usages of type_union_resolution_coercion ==="
rg -n "type_union_resolution_coercion\b" --type rust -C 2

Repository: martin-augment/datafusion

Length of output: 13558


🏁 Script executed:

cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '650,750p'

Repository: martin-augment/datafusion

Length of output: 4653


🏁 Script executed:

cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '860,900p'

Repository: martin-augment/datafusion

Length of output: 2171


🏁 Script executed:

rg -n "fn string_numeric_coercion|fn string_numeric_union_coercion" --type rust -A 10 datafusion/expr-common/src/type_coercion/binary.rs

Repository: martin-augment/datafusion

Length of output: 985


🏁 Script executed:

cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '590,610p'

Repository: martin-augment/datafusion

Length of output: 1075


🏁 Script executed:

cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '850,875p'

Repository: martin-augment/datafusion

Length of output: 1811


Replace string_numeric_coercion with string_numeric_union_coercion at line 750.

type_union_resolution_coercion is used by type_union_resolution, which handles multi-type unification in UNION, CASE, COALESCE, MAKE_ARRAY, CONCAT, and similar operations. However, it uses string_numeric_coercion (which prefers numeric types) instead of string_numeric_union_coercion (which prefers string types).

This creates an inconsistency: the two-type unification function type_union_coercion (line 868) explicitly documents and implements string preference for numeric/string pairs, but the multi-type version does the opposite. This causes operations like UNION to produce different results depending on the number of branches, and violates the stated design principle that type unification should prefer strings because every number can be represented as text.
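
A toy model of the reported inconsistency follows. It assumes, purely for illustration, that multi-type resolution folds a pairwise rule over all branch types; `prefer_numeric`, `prefer_string`, and `resolve_all` are hypothetical stand-ins for `string_numeric_coercion`, `string_numeric_union_coercion`, and `type_union_resolution`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int64,
    Utf8,
}

/// Numeric-preferring pair rule (toy stand-in for `string_numeric_coercion`).
fn prefer_numeric(l: Ty, r: Ty) -> Ty {
    if l == r { l } else { Ty::Int64 }
}

/// String-preferring pair rule (toy stand-in for
/// `string_numeric_union_coercion`).
fn prefer_string(l: Ty, r: Ty) -> Ty {
    if l == r { l } else { Ty::Utf8 }
}

/// Resolve a common type for any number of branches by folding a pair rule
/// over them. Returns `None` for an empty list of branches.
fn resolve_all(types: &[Ty], pair_rule: fn(Ty, Ty) -> Ty) -> Option<Ty> {
    let (first, rest) = types.split_first()?;
    Some(rest.iter().fold(*first, |acc, t| pair_rule(acc, *t)))
}
```

Under these assumptions, a two-branch union resolved with `prefer_string` yields `Utf8`, while the same types resolved through a multi-branch path that uses `prefer_numeric` yield `Int64`. Passing `prefer_string` to `resolve_all` makes the two paths agree.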

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/expr-common/src/type_coercion/binary.rs` at line 750, The call to
string_numeric_coercion in type_union_resolution_coercion should be replaced
with string_numeric_union_coercion so multi-type union resolution prefers string
types for numeric/string pairs; update the expression chain that currently calls
string_numeric_coercion (the .or_else(...) branch) to call
string_numeric_union_coercion instead, ensuring behavior matches
type_union_coercion's documented string-preference policy.

Comment on lines +44 to +50
 use datafusion_expr::type_coercion::binary::{
 comparison_coercion, like_coercion, type_union_coercion,
 };
 use datafusion_expr::type_coercion::functions::{UDFCoercionExt, fields_with_udf};
 use datafusion_expr::type_coercion::other::{
-get_coerce_type_for_case_expression, get_coerce_type_for_list,
+get_coerce_type_for_case_expression, get_coerce_type_for_case_when,
+get_coerce_type_for_list,
⚠️ Potential issue | 🟠 Major

Carry the new comparison semantics through SetComparison.

This file now adopts the comparison-vs-union split, but the Expr::SetComparison branch still has the old numeric/string early return at Lines 521-527. = ANY / < ALL will therefore keep erroring on pairs that plain comparisons, IN, and BETWEEN now coerce successfully. Please drop that guard and add a regression test for a mixed numeric/string subquery comparison.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/optimizer/src/analyzer/type_coercion.rs` around lines 44 - 50, The
SetComparison branch (Expr::SetComparison) still applies the old numeric/string
early-return guard and must instead use the new comparison-vs-union semantics:
remove the numeric/string early-return in the Expr::SetComparison handling and
invoke the same comparison_coercion (or type_union_coercion where appropriate)
logic used by plain comparisons/IN/BETWEEN so = ANY / < ALL follow the new
coercion paths; update the branch to call the comparison coercion helpers (e.g.,
comparison_coercion or type_union_coercion) and handle UDF/field coercions
consistently (UDFCoercionExt, fields_with_udf) and add a regression test that
constructs a mixed numeric/string subquery comparison (e.g., value =
ANY(subquery with mixed numeric and string types) ) verifying it now coerces
instead of erroring.
