
2817: feat: Support ANSI mode avg expr #29

Open
martin-augment wants to merge 4 commits into main from pr-2817-2025-11-24-16-25-30

Conversation

@martin-augment
Owner

2817: To review by AI

@coderabbitai

coderabbitai bot commented Nov 24, 2025

Walkthrough

The changes integrate evaluation mode handling into the AVG aggregation function across the entire stack. The protobuf definition for AVG is updated, replacing a boolean fail_on_error field with an EvalMode enum field at tag 4. The planner, Spark expression implementation, and accumulator structures are modified to accept and thread the evaluation mode parameter through the aggregation pipeline. The serialization logic converts Spark's EvalMode to the protobuf EvalMode format. A new test case validates AVG and try_avg functionality across ANSI and non-ANSI modes with and without GROUP BY operations.
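The protobuf change described above replaces a two-state boolean with a three-state enum. A minimal sketch of why that matters, using illustrative stub types rather than Comet's actual generated code:

```rust
// Hypothetical sketch, not Comet's real API: an EvalMode enum can represent
// all three Spark evaluation modes, which a bool fail_on_error cannot.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvalMode {
    Legacy, // Spark default semantics
    Ansi,   // ANSI SQL: raise an error on invalid input/overflow
    Try,    // try_avg and friends: return NULL instead of erroring
}

// The old boolean field could only distinguish two of the three modes.
fn from_fail_on_error(fail_on_error: bool) -> EvalMode {
    if fail_on_error { EvalMode::Ansi } else { EvalMode::Legacy }
}

// Stub mirroring the idea that the UDF now stores the mode.
struct Avg {
    name: String,
    eval_mode: EvalMode,
}

impl Avg {
    fn new(name: &str, eval_mode: EvalMode) -> Self {
        Avg { name: name.to_string(), eval_mode }
    }
}

fn main() {
    let avg = Avg::new("try_avg", EvalMode::Try);
    // Try cannot round-trip through a bool, which motivates the enum field.
    assert_eq!(avg.eval_mode, EvalMode::Try);
    assert_eq!(from_fail_on_error(false), EvalMode::Legacy);
    println!("{} stores {:?}", avg.name, avg.eval_mode);
}
```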


// cast to the result data type of AVG if the result data type is different
// from the input type, e.g. AVG(Int32). We should not expect a cast
// failure since it should have already been checked at Spark side.
// For all other numeric types (Int8/16/32/64, Float32/64):

Bug: Decimal AVG ignores eval_mode parameter

The eval_mode is extracted from the protobuf at line 1896 but not passed to AvgDecimal::new for Decimal128 types. This means decimal averages won't respect ANSI mode settings for overflow handling, while non-decimal averages correctly receive the eval_mode parameter. The AvgDecimal implementation needs to be updated to accept and use eval_mode for consistent behavior across all numeric types.
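The suggested fix can be sketched with stub types (these are not Comet's real signatures; the actual AvgDecimal constructor and overflow logic live in native/spark-expr):

```rust
// Hypothetical sketch of the fix: extend AvgDecimal's constructor so the
// eval_mode the planner already extracts is no longer dropped on the
// decimal path. Stub types only, for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvalMode { Legacy, Ansi, Try }

struct AvgDecimal {
    eval_mode: EvalMode, // consulted on overflow, mirroring Spark's Average
}

impl AvgDecimal {
    // Before: new(result_type, input_type); after: also accepts eval_mode.
    fn new(eval_mode: EvalMode) -> Self {
        AvgDecimal { eval_mode }
    }
}

fn main() {
    // The planner would forward the mode it read from the protobuf.
    let agg = AvgDecimal::new(EvalMode::Ansi);
    assert_eq!(agg.eval_mode, EvalMode::Ansi);
}
```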


Owner Author


value:useful; category:bug; feedback: The Bugbot AI reviewer is correct that the eval_mode is ignored for the AvgDecimal implementation. https://github.com/apache/spark/blob/211dd995b221f135340375159672dcb77ef90ef4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala#L113 shows that it is used in the Spark implementation. Prevents wrong behavior in DataFusion Comet compared to Spark.

@claude

claude bot commented Nov 24, 2025

Code Review: Support ANSI mode avg expr

Summary

This PR adds support for ANSI evaluation mode to the AVG aggregate expression. The changes look good overall and follow the established patterns in the codebase.


✅ Strengths

  1. Good alignment with established patterns: The implementation follows the same approach used for other expressions with eval_mode support (e.g., arithmetic operations, cast).

  2. Comprehensive test coverage: The test in CometAggregateSuite.scala:1474-1507 covers both ANSI and LEGACY modes, with and without GROUP BY, try_avg functionality, and NULL handling.

  3. Proper use of shims: Correctly uses CometEvalModeUtil.fromSparkEvalMode() for cross-version compatibility.

  4. Protobuf update: Changed from unused fail_on_error boolean to proper eval_mode enum.


🔍 Key Observations & Questions

1. Eval mode is stored but not used in Float64 path

In native/spark-expr/src/agg_funcs/avg.rs lines 143, 225: eval_mode field is marked with #[allow(dead_code)]

Question: Is INFINITY handling different between ANSI and LEGACY modes? Should ANSI mode throw an error on overflow instead of returning INFINITY? If the behavior should differ, the eval_mode needs to be used in the accumulation logic.

2. Decimal AVG doesn't receive eval_mode

In native/core/src/execution/planner.rs:1899-1901, the decimal path doesn't pass eval_mode to AvgDecimal::new(), while the Float64 path does.

Question: Should decimal AVG also support eval_mode for consistency? If decimals can overflow during averaging, ANSI mode should handle this differently than LEGACY mode.
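Both questions hinge on what each mode should do when an accumulation overflows. A rough sketch of the semantics under discussion, using i64 checked arithmetic as a stand-in for the real Float64/decimal state (names illustrative; Spark's actual LEGACY behavior is type-dependent, e.g. decimal overflow yields NULL, so the wrapping branch below is purely for illustration):

```rust
// Hypothetical per-mode overflow handling for a bounded sum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvalMode { Legacy, Ansi, Try }

fn add_checked(sum: i64, v: i64, mode: EvalMode) -> Result<Option<i64>, String> {
    match sum.checked_add(v) {
        Some(s) => Ok(Some(s)),
        None => match mode {
            // ANSI: surface an error to the query.
            EvalMode::Ansi => Err("ARITHMETIC_OVERFLOW".to_string()),
            // TRY: poison the state so the final result is NULL.
            EvalMode::Try => Ok(None),
            // LEGACY: shown as wrapping here for illustration only;
            // the real behavior depends on the data type.
            EvalMode::Legacy => Ok(Some(sum.wrapping_add(v))),
        },
    }
}

fn main() {
    assert_eq!(add_checked(1, 2, EvalMode::Ansi), Ok(Some(3)));
    assert!(add_checked(i64::MAX, 1, EvalMode::Ansi).is_err());
    assert_eq!(add_checked(i64::MAX, 1, EvalMode::Try), Ok(None));
}
```

If the accumulators never branch like this, storing eval_mode only prepares the hook without changing behavior, which is the gap the review comments point at.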


🐛 Potential Issues

1. Overflow behavior not implemented

The PR removes the restriction on ANSI mode (getSupportLevel check removed from aggregates.scala:153-162) but doesn't implement ANSI-specific overflow checking in the Rust code.

Recommendation: Either add overflow checking in ANSI mode, document that Float64 AVG treats overflow as INFINITY in all modes, or add integration tests specifically for overflow cases.

2. Missing test for overflow/error cases

Current tests don't cover what happens with extreme values, error propagation in ANSI mode vs NULL in LEGACY mode.


📝 Code Quality Suggestions

  1. Restore removed comments: In avg.rs, several helpful comments were removed (e.g., line 136: "An accumulator to compute the average"). Consider restoring them.

  2. Consistent use of eval_mode: If eval_mode is truly unused for Float64, add a comment explaining why or remove the parameter if it will never be used.

  3. Documentation: The commit message should clarify what specific ANSI behaviors are now supported and any known limitations.


🔒 Security Considerations

No security concerns identified. The changes don't introduce new attack vectors.


⚡ Performance Considerations

No performance impact expected - the eval_mode parameter is stored but doesn't add runtime branches.


📋 Recommendations Summary

Before merging:

  1. Verify whether Float64 overflow should behave differently in ANSI vs LEGACY mode
  2. Add test coverage for overflow/extreme value cases
  3. Consider adding eval_mode to AvgDecimal for consistency
  4. Document why eval_mode is unused in Float64 implementation (if intentional)

Nice to have:

  • Restore helpful comments that were removed
  • Add inline documentation about ANSI vs LEGACY behavior differences

🎯 Overall Assessment

This is a solid PR that makes good progress on ANSI mode support. The main concern is ensuring that the removal of the ANSI mode restriction is backed by proper behavior implementation or clear documentation of current limitations.

Great work on following the project's conventions and adding comprehensive tests! 🚀


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala (1)

1474-1508: New AVG/try_avg test covers basic modes; consider adding an error-path case later

Exercising avg/try_avg with and without GROUP BY under both ANSI_ENABLED settings is a good sanity check that eval_mode is plumbed through correctly and that Comet still executes the query. Once stricter ANSI/TRY semantics for AVG are implemented, it may be worth adding a follow-up test that triggers an actual eval-mode-sensitive condition (e.g., overflow) to distinguish the behaviors.

native/core/src/execution/planner.rs (1)

1892-1924: Planner AVG path correctly rebinds to Float64 and carries eval_mode

The AggExprStruct::Avg branch now (a) reads eval_mode from the proto, (b) casts non-decimal inputs to Float64, and (c) constructs Avg::new("avg", DataType::Float64, eval_mode). This keeps the partial/final state types coherent ((Float64 sum, Int64 count) for non-decimal AVG) while threading the evaluation mode into the UDF, and leaves the decimal AvgDecimal path untouched.
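The branch shape described above can be pictured with stub types (illustrative only; the real planner builds DataFusion UDAFs, not these enums):

```rust
// Stub sketch of the planner's AVG dispatch: decimal vs. everything else.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvalMode { Legacy, Ansi, Try }

#[derive(Debug, PartialEq)]
enum PlannedAgg {
    AvgDecimal,                  // decimal path: eval_mode currently dropped
    Avg { eval_mode: EvalMode }, // non-decimal path: input rebound to Float64
}

fn plan_avg(is_decimal: bool, eval_mode: EvalMode) -> PlannedAgg {
    if is_decimal {
        PlannedAgg::AvgDecimal
    } else {
        // Non-decimal inputs are cast to Float64 before this point, keeping
        // the (Float64 sum, Int64 count) partial state coherent.
        PlannedAgg::Avg { eval_mode }
    }
}

fn main() {
    assert_eq!(
        plan_avg(false, EvalMode::Ansi),
        PlannedAgg::Avg { eval_mode: EvalMode::Ansi }
    );
    // The asymmetry flagged elsewhere in this review: the decimal arm
    // has nowhere to put the mode.
    assert_eq!(plan_avg(true, EvalMode::Ansi), PlannedAgg::AvgDecimal);
}
```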

native/spark-expr/src/agg_funcs/avg.rs (1)

18-24: EvalMode is correctly threaded through AVG UDF and accumulators (currently stored only)

Avg now accepts and stores an EvalMode, and both the scalar AvgAccumulator and grouped AvgGroupsAccumulator are constructed with that mode, even though it’s not yet used in their logic. The Float64-based accumulation, state layout (sum as Float64, count as Int64), and null handling via values() + is_null remain unchanged and correct, so behavior matches the prior implementation while giving you a clean hook to implement ANSI/TRY-specific semantics later. Based on learnings, the array access pattern also correctly avoids reading values from null slots.

Also applies to: 39-137, 139-245, 251-347

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4f9377 and 5192933.

📒 Files selected for processing (5)
  • native/core/src/execution/planner.rs (1 hunks)
  • native/proto/src/proto/expr.proto (1 hunks)
  • native/spark-expr/src/agg_funcs/avg.rs (10 hunks)
  • spark/src/main/scala/org/apache/comet/serde/aggregates.scala (2 hunks)
  • spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.

Applied to files:

  • native/spark-expr/src/agg_funcs/avg.rs
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().

Applied to files:

  • native/spark-expr/src/agg_funcs/avg.rs
🧬 Code graph analysis (2)
native/core/src/execution/planner.rs (2)
native/spark-expr/src/agg_funcs/avg.rs (2)
  • new (50-60)
  • new (148-154)
native/spark-expr/src/conversion_funcs/cast.rs (2)
  • new (791-801)
  • new (827-837)
spark/src/main/scala/org/apache/comet/serde/aggregates.scala (1)
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (2)
  • QueryPlanSerde (45-717)
  • evalModeToProto (439-446)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
🔇 Additional comments (2)
spark/src/main/scala/org/apache/comet/serde/aggregates.scala (1)

32-34: AVG serde now correctly threads eval_mode into the proto

The new imports and builder.setEvalMode(evalModeToProto(CometEvalModeUtil.fromSparkEvalMode(avg.evalMode))) cleanly propagate Spark’s evalMode for AVG/try_avg into the wire format, aligning with how other expressions handle EvalMode. No further serde changes seem necessary here.

Also applies to: 181-187

native/proto/src/proto/expr.proto (1)

136-141: AVG proto now uses EvalMode at tag 4, consistent with other expressions

Switching Avg’s field 4 from bool fail_on_error to EvalMode eval_mode brings AVG in line with MathExpr and Cast, matching the new Scala serde and Rust planner paths that pass an explicit evaluation mode. This is a wire-incompatible change across versions, but is fine as long as all Comet components are upgraded together, as in this repo.

Also applies to: 206-210, 212-224


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.

Comment augment review to trigger a new review at any time.

let builder = match datatype {
    DataType::Decimal128(_, _) => {
        let func =
            AggregateUDF::new_from_impl(AvgDecimal::new(datatype, input_datatype));

eval_mode is computed for AVG but not used in the Decimal branch (AvgDecimal::new), so ANSI/TRY mode won’t affect Decimal AVG (e.g., overflow behavior). Consider propagating eval_mode to the decimal implementation to ensure semantics match Spark’s modes (also applies given getSupportLevel removal in Scala).


Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct that the eval_mode is ignored for the AvgDecimal implementation. https://github.com/apache/spark/blob/211dd995b221f135340375159672dcb77ef90ef4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala#L113 shows that it is used in the Spark implementation. Prevents wrong behavior in DataFusion Comet compared to Spark.

@martin-augment
Owner Author

2. Decimal AVG doesn't receive eval_mode

In native/core/src/execution/planner.rs:1899-1901, the decimal path doesn't pass eval_mode to AvgDecimal::new(), while the Float64 path does.

Question: Should decimal AVG also support eval_mode for consistency? If decimals can overflow during averaging, ANSI mode should handle this differently than LEGACY mode.

value:useful; category:bug; feedback: The Claude AI reviewer is correct that the eval_mode is ignored for the AvgDecimal implementation. https://github.com/apache/spark/blob/211dd995b221f135340375159672dcb77ef90ef4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala#L113 shows that it is used in the Spark implementation. Prevents wrong behavior in DataFusion Comet compared to Spark.

