20426: fix: Prefer numeric in type coercion for comparisons#294
20426: fix: Prefer numeric in type coercion for comparisons#294martin-augment wants to merge 12 commits intomainfrom
Conversation
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
WalkthroughThis pull request refactors type coercion logic in DataFusion to distinguish between comparison and type-union contexts. A new ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refines DataFusion's type coercion system, particularly for operations involving mixed numeric and string data types. The primary goal is to ensure that comparison operations (like filtering or joining) correctly interpret string representations of numbers as numeric values, preventing unexpected lexicographical comparisons. Concurrently, it clarifies and separates the coercion logic for type unification scenarios (like Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
PR Review:
|
| Area | Assessment |
|---|---|
| Correctness | ✅ Fixes a real semantic bug |
| Breaking changes | |
| Code quality | ✅ Clean refactor with good separation of concerns |
| Test coverage | ✅ Comprehensive, minor gap in NULL edge cases |
| Documentation | ✅ Migration guide is clear |
| Performance | ✅ No concerns |
Suggested action items before merge:
- Convert removed
push_down_filter_parquet.slttests to error-assertion form instead of deleting them. - Add a NULL + mixed-type IN/comparison test case to
string_numeric_coercion.slt. - Consider whether
Dictionary(Utf8) vs Intsilent behavior change needs a deprecation note or log warning.
Otherwise the change is solid.
🤖 Augment PR SummarySummary: This PR changes type coercion so comparisons between numeric and string values prefer numeric semantics, fixing incorrect lexicographic ordering/equality. Changes:
Technical Notes: Mixed string/numeric comparisons may now raise cast errors at runtime when strings are not parseable as the target numeric type; UNION/branch unification continues to prefer strings via 🤖 Was this summary useful? React with 👍 or 👎 |
| .or_else(|| ree_coercion(lhs_type, rhs_type, true, type_union_coercion)) | ||
| .or_else(|| temporal_coercion_nonstrict_timezone(lhs_type, rhs_type)) | ||
| .or_else(|| string_coercion(lhs_type, rhs_type)) | ||
| .or_else(|| list_coercion(lhs_type, rhs_type)) |
There was a problem hiding this comment.
In type_union_coercion, list types are handled via list_coercion, whose child-type resolution uses type_union_resolution (numeric-preferring for string/numeric). This may cause UNION/CASE-THEN coercion for List(Int64) vs List(Utf8) to pick List(Int64) and error on non-numeric strings, which seems to contradict the string-preferring intent described above.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! type_union_resolution() prefers numbers than strings and this affects how the list_coercion() behaves too. Prevents wrong behavior for list related operations.
| ---- | ||
| [20, 30] | ||
|
|
||
| # List in UNION uses type union coercion (string preferred). |
There was a problem hiding this comment.
This UNION test for list types doesn't assert the resulting element type, and [5, 10] formatting could pass even if the union coerces to List(Int64) by casting strings. Consider asserting via arrow_typeof on the UNION output (or adding a non-numeric string case) to ensure list UNION truly uses string-preferred coercion.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The SQL Logic test claims that the result is List(Utf8) but does no assert that and there is no way to confirm this and to prevent any regressions.
| // int column vs string literal: the string literal is cast to Int64 | ||
| create_expr_test(col("i").eq(lit("202410")), "i@1 = CAST(202410 AS Int64)"); | ||
| create_expr_test(lit("202410").eq(col("i")), "CAST(202410 AS Int64) = i@1"); | ||
| // when simplified, the literal cast is constant-folded |
There was a problem hiding this comment.
The comment says the literal cast is constant-folded in the simplified expression, but the expected simplified string still includes CAST(202410 AS Int64). If folding is intended, this test may not currently validate it (or the comment may be outdated).
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The comment claims that the result is folded into a constant but the unit test result still uses CAST(...), so the comment is not correct and needs to be improved.
There was a problem hiding this comment.
Code Review
This pull request introduces a significant change to DataFusion's type coercion rules, distinguishing between 'comparison' contexts (e.g., =, >, IN, BETWEEN, CASE WHEN) where numeric types are now preferred, and 'type unification' contexts (e.g., UNION, CASE THEN/ELSE, NVL2) where string types are preferred. This refactoring ensures more intuitive and correct numeric semantics for mixed-type comparisons, raising errors when string literals cannot be cast to the numeric type. The changes involve updating core coercion functions, adapting their usage across the expression and physical planning layers, and updating extensive tests and documentation to reflect the new behavior.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
|
|
||
| /// Coercion rules for list types. | ||
| /// Coerce two list types by coercing their element types via | ||
| /// [`type_union_resolution`]. |
There was a problem hiding this comment.
List coercion not parameterized unlike struct and map
Medium Severity
struct_coercion and map_coercion were correctly parameterized with a coerce_fn to distinguish comparison vs. union contexts, but list_coercion was not. It always delegates to type_union_resolution, which internally calls string_numeric_coercion (numeric-preferring). This means list element coercion in type_union_coercion contexts (UNION, CASE THEN/ELSE) incorrectly uses numeric preference instead of string preference — opposite to how struct and map behave in the same context.
Additional Locations (1)
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
datafusion/sqllogictest/test_files/string/string_query.slt.part (1)
50-54: Consider loosening error-text matching to reduce test brittleness.Matching the exact bad value (
'Andrew') can make this test fragile if evaluation order changes. A regex that asserts cast-failure class/message is usually more stable.♻️ Suggested test expectation hardening
-statement error Arrow error: Cast error: Cannot cast string 'Andrew' to value of Int64 type +statement error .*Cannot cast string '.*' to value of Int64 type select ascii_1 = 1 as col1, 1 = ascii_1 as col2 from test_basic_operator; -statement error Arrow error: Cast error: Cannot cast string 'Andrew' to value of Int64 type +statement error .*Cannot cast string '.*' to value of Int64 type select ascii_1 <> 1 as col1, 1 <> ascii_1 as col2 from test_basic_operator;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/sqllogictest/test_files/string/string_query.slt.part` around lines 50 - 54, The test currently matches the full error text including the specific bad value "'Andrew'", which is brittle; update the expectations in string_query.slt.part for the two failing statements (the ones starting with "select ascii_1 = 1 as col1, 1 = ascii_1 as col2 from test_basic_operator;" and "select ascii_1 <> 1 as col1, 1 <> ascii_1 as col2 from test_basic_operator;") to use a looser regex or substring that asserts the cast-failure class/message (e.g. match "Arrow error: Cast error" and/or "Cannot cast .* to value of Int64" without the concrete literal) so the test no longer depends on the exact value text.datafusion/sqllogictest/test_files/string_numeric_coercion.slt (1)
491-496: Make these two negative cases assert the coercion failure, not just any failure.Bare
query errorwill pass on unrelated parser or planner errors too. Since this file is pinning coercion behavior, please match a stable substring here like the array case above.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/sqllogictest/test_files/string_numeric_coercion.slt` around lines 491 - 496, Replace the bare "query error" lines before the two MAP tests with a targeted failure assertion that matches the coercion failure (not any parser/planner error); for the queries SELECT MAP {[1,2,3]:1, ['a', 'b']:2}; and SELECT MAP {'a':[1,2,3], 'b':['a', 'b']}; change the preceding marker to something like "query error: cannot coerce" (or another stable substring used elsewhere for array/type coercion failures) so the test asserts a coercion-specific error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@datafusion/expr-common/src/type_coercion/binary.rs`:
- Line 750: The call to string_numeric_coercion in
type_union_resolution_coercion should be replaced with
string_numeric_union_coercion so multi-type union resolution prefers string
types for numeric/string pairs; update the expression chain that currently calls
string_numeric_coercion (the .or_else(...) branch) to call
string_numeric_union_coercion instead, ensuring behavior matches
type_union_coercion's documented string-preference policy.
In `@datafusion/optimizer/src/analyzer/type_coercion.rs`:
- Around line 44-50: The SetComparison branch (Expr::SetComparison) still
applies the old numeric/string early-return guard and must instead use the new
comparison-vs-union semantics: remove the numeric/string early-return in the
Expr::SetComparison handling and invoke the same comparison_coercion (or
type_union_coercion where appropriate) logic used by plain
comparisons/IN/BETWEEN so = ANY / < ALL follow the new coercion paths; update
the branch to call the comparison coercion helpers (e.g., comparison_coercion or
type_union_coercion) and handle UDF/field coercions consistently
(UDFCoercionExt, fields_with_udf) and add a regression test that constructs a
mixed numeric/string subquery comparison (e.g., value = ANY(subquery with mixed
numeric and string types) ) verifying it now coerces instead of erroring.
---
Nitpick comments:
In `@datafusion/sqllogictest/test_files/string_numeric_coercion.slt`:
- Around line 491-496: Replace the bare "query error" lines before the two MAP
tests with a targeted failure assertion that matches the coercion failure (not
any parser/planner error); for the queries SELECT MAP {[1,2,3]:1, ['a', 'b']:2};
and SELECT MAP {'a':[1,2,3], 'b':['a', 'b']}; change the preceding marker to
something like "query error: cannot coerce" (or another stable substring used
elsewhere for array/type coercion failures) so the test asserts a
coercion-specific error.
In `@datafusion/sqllogictest/test_files/string/string_query.slt.part`:
- Around line 50-54: The test currently matches the full error text including
the specific bad value "'Andrew'", which is brittle; update the expectations in
string_query.slt.part for the two failing statements (the ones starting with
"select ascii_1 = 1 as col1, 1 = ascii_1 as col2 from test_basic_operator;" and
"select ascii_1 <> 1 as col1, 1 <> ascii_1 as col2 from test_basic_operator;")
to use a looser regex or substring that asserts the cast-failure class/message
(e.g. match "Arrow error: Cast error" and/or "Cannot cast .* to value of Int64"
without the concrete literal) so the test no longer depends on the exact value
text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7f2fe6eb-5b52-4385-84b7-15352b532733
📒 Files selected for processing (22)
datafusion/core/src/physical_planner.rsdatafusion/core/tests/expr_api/mod.rsdatafusion/core/tests/sql/unparser.rsdatafusion/expr-common/src/interval_arithmetic.rsdatafusion/expr-common/src/signature.rsdatafusion/expr-common/src/type_coercion/binary.rsdatafusion/expr-common/src/type_coercion/binary/tests/comparison.rsdatafusion/expr-common/src/type_coercion/binary/tests/dictionary.rsdatafusion/expr-common/src/type_coercion/binary/tests/run_end_encoded.rsdatafusion/expr/src/type_coercion/functions.rsdatafusion/expr/src/type_coercion/other.rsdatafusion/functions/src/core/nvl2.rsdatafusion/optimizer/src/analyzer/type_coercion.rsdatafusion/physical-expr/src/expressions/case.rsdatafusion/sqllogictest/test_files/delete.sltdatafusion/sqllogictest/test_files/dictionary.sltdatafusion/sqllogictest/test_files/expr.sltdatafusion/sqllogictest/test_files/push_down_filter_parquet.sltdatafusion/sqllogictest/test_files/string/string_query.slt.partdatafusion/sqllogictest/test_files/string_numeric_coercion.sltdatafusion/substrait/tests/cases/roundtrip_logical_plan.rsdocs/source/library-user-guide/upgrading/53.0.0.md
💤 Files with no reviewable changes (1)
- datafusion/sqllogictest/test_files/push_down_filter_parquet.slt
| .or_else(|| temporal_coercion_nonstrict_timezone(lhs_type, rhs_type)) | ||
| .or_else(|| string_coercion(lhs_type, rhs_type)) | ||
| .or_else(|| numeric_string_coercion(lhs_type, rhs_type)) | ||
| .or_else(|| string_numeric_coercion(lhs_type, rhs_type)) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the actual usage of type_union_resolution vs type_union_coercion
# to determine if the inconsistency affects real scenarios
echo "=== Usages of type_union_resolution ==="
rg -n "type_union_resolution\b" --type rust -C 2
echo ""
echo "=== Usages of type_union_resolution_coercion ==="
rg -n "type_union_resolution_coercion\b" --type rust -C 2Repository: martin-augment/datafusion
Length of output: 13558
🏁 Script executed:
cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '650,750p'Repository: martin-augment/datafusion
Length of output: 4653
🏁 Script executed:
cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '860,900p'Repository: martin-augment/datafusion
Length of output: 2171
🏁 Script executed:
rg -n "fn string_numeric_coercion|fn string_numeric_union_coercion" --type rust -A 10 datafusion/expr-common/src/type_coercion/binary.rsRepository: martin-augment/datafusion
Length of output: 985
🏁 Script executed:
cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '590,610p'Repository: martin-augment/datafusion
Length of output: 1075
🏁 Script executed:
cat -n datafusion/expr-common/src/type_coercion/binary.rs | sed -n '850,875p'Repository: martin-augment/datafusion
Length of output: 1811
Replace string_numeric_coercion with string_numeric_union_coercion at line 750.
type_union_resolution_coercion is used by type_union_resolution, which handles multi-type unification in UNION, CASE, COALESCE, MAKE_ARRAY, CONCAT, and similar operations. However, it uses string_numeric_coercion (which prefers numeric types) instead of string_numeric_union_coercion (which prefers string types).
This creates an inconsistency: the two-type unification function type_union_coercion (line 868) explicitly documents and implements string preference for numeric/string pairs, but the multi-type version does the opposite. This causes operations like UNION to produce different results depending on the number of branches, and violates the stated design principle that type unification should prefer strings because every number can be represented as text.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@datafusion/expr-common/src/type_coercion/binary.rs` at line 750, The call to
string_numeric_coercion in type_union_resolution_coercion should be replaced
with string_numeric_union_coercion so multi-type union resolution prefers string
types for numeric/string pairs; update the expression chain that currently calls
string_numeric_coercion (the .or_else(...) branch) to call
string_numeric_union_coercion instead, ensuring behavior matches
type_union_coercion's documented string-preference policy.
| use datafusion_expr::type_coercion::binary::{ | ||
| comparison_coercion, like_coercion, type_union_coercion, | ||
| }; | ||
| use datafusion_expr::type_coercion::functions::{UDFCoercionExt, fields_with_udf}; | ||
| use datafusion_expr::type_coercion::other::{ | ||
| get_coerce_type_for_case_expression, get_coerce_type_for_list, | ||
| get_coerce_type_for_case_expression, get_coerce_type_for_case_when, | ||
| get_coerce_type_for_list, |
There was a problem hiding this comment.
Carry the new comparison semantics through SetComparison.
This file now adopts the comparison-vs-union split, but the Expr::SetComparison branch still has the old numeric/string early return at Lines 521-527. = ANY / < ALL will therefore keep erroring on pairs that plain comparisons, IN, and BETWEEN now coerce successfully. Please drop that guard and add a regression test for a mixed numeric/string subquery comparison.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@datafusion/optimizer/src/analyzer/type_coercion.rs` around lines 44 - 50, The
SetComparison branch (Expr::SetComparison) still applies the old numeric/string
early-return guard and must instead use the new comparison-vs-union semantics:
remove the numeric/string early-return in the Expr::SetComparison handling and
invoke the same comparison_coercion (or type_union_coercion where appropriate)
logic used by plain comparisons/IN/BETWEEN so = ANY / < ALL follow the new
coercion paths; update the branch to call the comparison coercion helpers (e.g.,
comparison_coercion or type_union_coercion) and handle UDF/field coercions
consistently (UDFCoercionExt, fields_with_udf) and add a regression test that
constructs a mixed numeric/string subquery comparison (e.g., value =
ANY(subquery with mixed numeric and string types) ) verifying it now coerces
instead of erroring.


20426: To review by AI