
2826: feat: Support ANSI mode SUM (Decimal types) #33

Open
martin-augment wants to merge 8 commits into main from pr-2826-2025-12-04-17-40-21

Conversation

@martin-augment Owner

2826: To review by AI

@coderabbitai

coderabbitai bot commented Dec 4, 2025

Walkthrough

The changes introduce EvalMode support to the SUM aggregation function for Decimal128 types. The protobuf Sum message's boolean fail_on_error field is replaced with an EvalMode enum field. The SumDecimal UDF is updated to accept and store an eval_mode parameter throughout its API, including constructors, accumulators, and aggregator implementations. Overflow handling logic is refactored to respect the EvalMode: in ANSI mode, overflows raise errors; in non-ANSI modes, they set the sum to None. The Scala serialization layer is updated to map Spark's EvalMode to its protobuf representation. New test cases verify decimal SUM and try_sum behavior under ANSI and non-ANSI modes with null values, group-by operations, and overflow scenarios.


@claude

claude bot commented Dec 4, 2025

Code Review for PR #33: ANSI Mode Support for Decimal SUM

Summary

This PR adds ANSI mode support to the SUM aggregate function for decimal types, implementing proper overflow handling that either throws errors (ANSI/TRY modes) or returns NULL (Legacy mode).

✅ Strengths

Code Quality

  • Clean refactoring: The transition from tracking nullability with boolean buffers to using Option<i128> is elegant and more idiomatic Rust
  • Consistent implementation: Both Accumulator and GroupsAccumulator implementations follow the same logic
  • Good documentation: Inline comments explain Spark compatibility decisions (e.g., shouldTrackIsEmpty behavior)

Test Coverage

  • Comprehensive testing: Tests cover null handling, overflow scenarios, GROUP BY operations, and try_sum variants
  • Both modes tested: ANSI and Legacy modes are properly validated
  • Edge cases: Partial overflow scenarios with GROUP BY are tested (line 1623)

Architecture

  • Proper separation of concerns: EvalMode is threaded through all layers (Scala → Proto → Rust)
  • Maintains compatibility: State serialization format is consistent (always 2 arrays)

🔍 Issues Found

1. ⚠️ Critical: Missing TRY Mode Logic

Location: native/spark-expr/src/agg_funcs/sum_decimal.rs:166, 294, 344, 505

The implementation only checks for EvalMode::Ansi but doesn't handle EvalMode::Try. According to the Scala changes, TRY mode should return NULL on overflow (not throw), but your current code will fall through to the Legacy path.

// Current code (line 166)
if self.eval_mode == EvalMode::Ansi {
    return Err(DataFusionError::from(arithmetic_overflow_error("decimal")));
}

Recommendation: Update all overflow checks to:

if self.eval_mode == EvalMode::Ansi {
    return Err(DataFusionError::from(arithmetic_overflow_error("decimal")));
}
// TRY mode: continue with NULL assignment (existing behavior)

Or add an explicit comment if TRY and Legacy are intentionally identical:

if self.eval_mode == EvalMode::Ansi {
    return Err(DataFusionError::from(arithmetic_overflow_error("decimal")));
}
// TRY and Legacy modes: return NULL on overflow

2. ⚠️ Minor: Inconsistent Overflow State Management

Location: native/spark-expr/src/agg_funcs/sum_decimal.rs:189-192

The early return check in update_batch prevents processing values after overflow:

if !self.is_empty && self.sum.is_none() {
    return Ok(());
}

However, this bypasses the logic that updates is_empty based on incoming values. While functionally correct (overflow state is permanent), it's slightly inconsistent with the rest of the flow.

Impact: Low - The behavior is correct, but the logic flow could be clearer.

3. 📝 Documentation: Proto Field Comment Outdated

Location: native/proto/src/proto/expr.proto:123

The proto field changed from bool fail_on_error to EvalMode eval_mode, but there's no inline documentation explaining the new field or what the enum values mean.

Recommendation: Add a comment:

message Sum {
  Expr child = 1;
  DataType datatype = 2;
  EvalMode eval_mode = 3; // Controls overflow behavior: Legacy (NULL), Ansi (error), Try (NULL)
}

4. 🔧 Code Quality: Magic Number in Tests

Location: spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:1554-1556

val maxDec38_0 = new java.math.BigDecimal("99999999999999999999")
(1 to 50).flatMap(_ => Seq((maxDec38_0, 1)))

The decimal value and count (50) are magic numbers without explanation. Why this specific value? How does it guarantee overflow?

Recommendation: Add a comment explaining the overflow calculation or use named constants.

🚀 Performance Considerations

Positive

  • Optimal memory layout: Vec<Option<i128>> is more cache-friendly than separate vectors for values and null flags
  • Early exit on overflow: The check at the beginning of update functions prevents unnecessary computation

Consideration

  • Memory overhead: Option<i128> has no niche for the discriminant, so each value carries an explicit tag padded out to i128's alignment: 24 bytes on toolchains where i128 is 8-byte aligned, 32 bytes on recent Rust where it is 16-byte aligned, vs 17 bytes for separate value/flag tracking. For large GROUP BY cardinalities, this is roughly a 40–90% increase per group.
    • Verdict: This is acceptable given the code clarity benefits and Rust's optimization capabilities
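The size arithmetic above can be checked directly with std::mem. This is a standalone sketch; the exact Option<i128> size depends on the toolchain's i128 alignment, so the check accepts either layout:

```rust
use std::mem::{align_of, size_of};

fn main() {
    // Per-group accumulator state in the PR is Vec<Option<i128>>.
    let opt = size_of::<Option<i128>>();
    // Separate tracking would be one i128 plus one bool per group.
    let separate = size_of::<i128>() + size_of::<bool>();

    // i128 has no niche, so Option<i128> needs an explicit discriminant,
    // padded out to i128's alignment (16 on recent Rust, 8 on older toolchains).
    assert!(opt == 24 || opt == 32);
    assert_eq!(separate, 17);
    println!(
        "Option<i128>: {} bytes (align {}), separate: {} bytes",
        opt,
        align_of::<i128>(),
        separate
    );
}
```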

🔒 Security Concerns

No security issues identified

  • Overflow handling is proper and prevents undefined behavior
  • No unsafe code added beyond existing value_unchecked (which is appropriate given bounds checks)
  • No injection risks or unsafe memory access

📊 Test Coverage Assessment

Well Covered

  • ✅ Null handling (aggregates and group by)
  • ✅ Overflow with ANSI mode
  • ✅ Overflow with Legacy mode
  • ✅ try_sum variants
  • ✅ Partial overflow scenarios

Potentially Missing

  • ⚠️ Merge overflow scenarios: Tests don't explicitly verify overflow during the merge phase of distributed aggregation
  • ⚠️ State serialization: No explicit test for state() and merge_batch() round-tripping with overflow states
  • ℹ️ Edge case: Sum of empty groups is tested, but not mixed empty/overflow states in the same execution

Recommendation: Add a test that forces a multi-stage aggregation with overflow in partial phase.
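Such a merge-phase test could exercise logic along these lines. This is a self-contained sketch with a plain struct standing in for the accumulator's (sum, is_empty) state pair and non-ANSI semantics; it is not the real DataFusion accumulator API:

```rust
/// Simplified partial-aggregation state: `sum == None` with
/// `is_empty == false` encodes an overflowed (NULL) partial sum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PartialState {
    sum: Option<i128>,
    is_empty: bool,
}

/// Merge two partial states with non-ANSI (NULL-on-overflow) semantics.
fn merge(a: PartialState, b: PartialState) -> PartialState {
    if b.is_empty {
        return a;
    }
    if a.is_empty {
        return b;
    }
    // If either side already overflowed, the merged sum stays NULL;
    // checked_add also yields None if the merge itself overflows i128.
    let sum = match (a.sum, b.sum) {
        (Some(x), Some(y)) => x.checked_add(y),
        _ => None,
    };
    PartialState { sum, is_empty: false }
}

fn main() {
    let ok = PartialState { sum: Some(10), is_empty: false };
    let overflowed = PartialState { sum: None, is_empty: false };
    let empty = PartialState { sum: Some(0), is_empty: true };

    assert_eq!(merge(ok, overflowed).sum, None); // overflow propagates
    assert_eq!(merge(ok, empty), ok); // empty side is a no-op
    println!("ok");
}
```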

📋 Minor Observations

  1. Line 565-568: The benchmark still uses EvalMode::Legacy - consider adding benchmarks for ANSI mode to measure error path performance

  2. Removed imports: Good cleanup removing unused BooleanBufferBuilder and BitAnd imports

  3. Consistent naming: State field names ("sum", "is_empty") are well chosen and match Spark conventions

Final Verdict

Overall: Strong implementation with one critical fix needed

This is a well-architected change that properly extends decimal sum to support ANSI mode. The code is clean, well-tested, and maintains backward compatibility. The main issue is the missing TRY mode logic distinction, which should be addressed before merging.

Action Items Before Merge

  1. Must Fix: Clarify TRY mode vs Legacy mode handling in overflow checks
  2. 📝 Should Fix: Add proto documentation for eval_mode field
  3. 🔍 Consider: Add merge phase overflow test
  4. 📝 Nice to Have: Document magic numbers in test data generation

Great work on this feature! The implementation demonstrates strong understanding of both Spark semantics and DataFusion's architecture.

🤖 Generated with Claude Code


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala (1)

27-32: New decimal SUM / try_sum ANSI & overflow tests provide strong coverage

The added tests nicely cover:

  • NULL behaviour for sum and try_sum on decimals, with and without GROUP BY, under both ANSI on/off.
  • Decimal overflow semantics for SUM in ANSI vs non‑ANSI (including grouped and repartitioned cases), asserting both Spark and Comet throw ARITHMETIC_OVERFLOW in ANSI mode.
  • Decimal overflow handling for try_sum, including a partial‑overflow GROUP BY where one group overflows and another succeeds.
  • Correct wiring via CometConf.getExprAllowIncompatConfigKey(classOf[Sum]) and use of Sum/col imports.

If you want to make the last “partial overflow” test even more self‑documenting, you could also assert the concrete result (e.g., group 1 → NULL, group 2 → 500) instead of only relying on checkSparkAnswerAndOperator, but the current tests are already effective.

Also applies to: 1475-1635

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between caacad6 and ab437f5.

📒 Files selected for processing (6)
  • native/core/src/execution/planner.rs (1 hunks)
  • native/proto/src/proto/expr.proto (1 hunks)
  • native/spark-expr/benches/aggregate.rs (2 hunks)
  • native/spark-expr/src/agg_funcs/sum_decimal.rs (12 hunks)
  • spark/src/main/scala/org/apache/comet/serde/aggregates.scala (3 hunks)
  • spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().

Applied to files:

  • native/spark-expr/src/agg_funcs/sum_decimal.rs
🧬 Code graph analysis (4)
native/spark-expr/benches/aggregate.rs (1)
native/spark-expr/src/agg_funcs/sum_decimal.rs (1)
  • try_new (47-63)
native/spark-expr/src/agg_funcs/sum_decimal.rs (2)
native/spark-expr/src/utils.rs (1)
  • is_valid_decimal_precision (225-229)
native/spark-expr/src/lib.rs (1)
  • arithmetic_overflow_error (110-114)
spark/src/main/scala/org/apache/comet/serde/aggregates.scala (3)
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala (8)
  • comet (79-81)
  • comet (86-123)
  • comet (125-127)
  • comet (129-138)
  • comet (152-154)
  • comet (156-158)
  • comet (160-166)
  • comet (168-174)
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (2)
  • QueryPlanSerde (45-717)
  • evalModeToProto (439-446)
spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1)
  • Incompatible (39-39)
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala (1)
spark/src/main/scala/org/apache/spark/sql/comet/operators.scala (2)
  • CometHashAggregateExec (1104-1130)
  • CometHashAggregateExec (1168-1225)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
🔇 Additional comments (16)
native/proto/src/proto/expr.proto (1)

120-124: Sum.eval_mode proto change: check versioning/compat expectations

Switching field 3 from bool fail_on_error to EvalMode eval_mode keeps the same field number and varint wire type, but changes the semantic meaning of existing encoded values (e.g., old true now decoding as enum value 1). This is fine if plans are never persisted across mixed Comet versions, but if there is any cross-version interchange, you may need to document that Sum is not backward‑compatible across this change or gate it by version.
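The wire-format observation can be verified by hand: field 3 with the varint wire type has tag byte (3 << 3) | 0 = 0x18, and both bool true and an enum with numeric value 1 encode their payload as the single varint byte 0x01. A minimal encoder sketch (deliberately not using the real protobuf library, and limited to single-byte tags and values):

```rust
/// Encode a protobuf varint field as tag byte followed by the value.
/// Minimal sketch: handles only field numbers < 16 and values < 128,
/// which both fit in a single byte each.
fn encode_varint_field(field_number: u32, value: u64) -> Vec<u8> {
    assert!(field_number < 16 && value < 128);
    let tag = (field_number << 3) as u8; // wire type 0 (varint)
    vec![tag, value as u8]
}

fn main() {
    // Old schema: `bool fail_on_error = 3;` with value `true`.
    let old_bool = encode_varint_field(3, 1);
    // New schema: `EvalMode eval_mode = 3;` with enum numeric value 1.
    let new_enum = encode_varint_field(3, 1);
    // Identical bytes on the wire: old payloads still decode, but `true`
    // silently becomes whichever EvalMode variant has numeric value 1.
    assert_eq!(old_bool, new_enum);
    assert_eq!(old_bool, vec![0x18, 0x01]);
    println!("ok");
}
```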

native/core/src/execution/planner.rs (1)

1867-1893: Decimal SUM now wired to EvalMode-aware UDF correctly

For decimal inputs you now: (1) decode expr.eval_mode via from_protobuf_eval_mode, and (2) construct SumDecimal::try_new(datatype, eval_mode) and wrap it in an AggregateUDF. This cleanly propagates LEGACY/TRY/ANSI into the decimal SUM implementation while leaving the non-decimal path unchanged and still backed by sum_udaf. Given CometSum.getSupportLevel only allows ANSI/TRY for decimal types, this looks consistent and safe.

native/spark-expr/benches/aggregate.rs (1)

35-36: Benchmark update to new SumDecimal API looks good

Importing EvalMode and constructing SumDecimal with EvalMode::Legacy aligns the benchmark with the new try_new(data_type, eval_mode) signature without changing the workload semantics. No issues here.

Also applies to: 98-101

spark/src/main/scala/org/apache/comet/serde/aggregates.scala (1)

32-34: Sum serde changes correctly gate ANSI/TRY and propagate EvalMode

  • CometSum.getSupportLevel now treats ANSI/TRY as incompatible only when sum.dataType is non‑decimal, which matches the native support matrix and ensures non-decimal SUM in ANSI/TRY still falls back to Spark.
  • In convert, wiring builder.setEvalMode(evalModeToProto(CometEvalModeUtil.fromSparkEvalMode(sum.evalMode))) means the Rust side sees the exact eval mode for decimal SUM, consistent with how scalar/math casts already behave.
    Overall this is a clean extension of EvalMode into the SUM aggregate path.

Also applies to: 216-225, 242-247

native/spark-expr/src/agg_funcs/sum_decimal.rs (12)

18-19: LGTM: EvalMode integration in struct definition.

The addition of the eval_mode field and necessary imports properly extends the SumDecimal struct to support ANSI mode semantics.

Also applies to: 43-43


47-63: LGTM: Constructor properly parameterized with EvalMode.

The try_new method correctly accepts and stores the eval_mode parameter, enabling mode-specific behavior downstream.


71-86: LGTM: Consistent state field semantics.

The implementation correctly propagates eval_mode to accumulators and consistently tracks both sum and is_empty state fields to match Spark's behavior. The comments clarify this design decision effectively.


133-153: LGTM: Accumulator initialization is correct.

The accumulator properly initializes with sum = Some(0) and is_empty = true, which correctly represents an empty aggregation state. The eval_mode parameter is stored for overflow handling.


155-177: LGTM: Overflow handling is correctly implemented.

The overflow detection and handling logic is sound:

  • Early return for already-overflowed state prevents further processing
  • unwrap_or(0) at line 162 is safe due to prior overflow check at line 157
  • ANSI mode correctly raises errors on overflow
  • Non-ANSI mode correctly sets sum to None while marking is_empty as false

181-211: LGTM: Batch update correctly handles nulls and overflow.

The batch update logic correctly:

  • Detects and skips already-overflowed states
  • Updates is_empty status based on null counts
  • Skips null values during iteration (lines 205-206)
  • Delegates to update_single for per-value processing and error propagation

213-230: LGTM: Evaluation correctly distinguishes empty vs. overflow states.

The evaluation logic properly handles:

  • Empty aggregations (no non-null values) → None
  • Overflowed sums (non-ANSI mode) → None
  • Valid sums with precision checks → decimal value

The defensive precision check at line 221 is good practice even though it shouldn't fail during normal operation.


236-306: LGTM: State serialization and merging are correct.

The state and merge logic correctly handles:

  • Serialization of both sum and is_empty state fields
  • Detection and propagation of overflow states (line 270-277)
  • Safe unwrapping at lines 289-290 due to prior validation
  • EvalMode-aware overflow handling during merge
  • Proper short-circuiting for empty and overflowed states

308-353: LGTM: Groups accumulator initialization and per-group updates are correct.

The groups accumulator properly:

  • Maintains per-group state vectors for sum and is_empty
  • Initializes new groups to Some(0) and true in resize_helper
  • Implements overflow detection and EvalMode-aware handling per group
  • Safe use of unwrap_or(0) at line 340 due to prior overflow check at line 336

388-449: LGTM: Group evaluation and state extraction are correct.

The evaluation logic correctly:

  • Handles both EmitTo::All and EmitTo::First(n) cases consistently
  • Checks is_empty and precision before emitting values
  • Returns None for empty groups or overflowed sums
  • Serializes per-group state as two arrays (sum, is_empty)

451-517: LGTM: Groups merge batch correctly handles per-group state merging.

The merge logic correctly:

  • Processes each group independently with proper state extraction
  • Detects and propagates overflow states (lines 480-488)
  • Safely unwraps at lines 500-501 after validation
  • Applies EvalMode-aware overflow handling during merge (lines 504-513)
  • Handles empty state short-circuiting (lines 490-497)

544-595: Tests updated correctly for new API.

The existing tests are properly updated to pass EvalMode::Legacy to the constructors. While these Rust unit tests don't cover overflow or ANSI mode behavior, the AI summary indicates that comprehensive edge case testing (overflow, ANSI/non-ANSI modes, nulls) is provided in the Scala test suite (CometAggregateSuite.scala).

@martin-augment Owner Author

martin-augment commented Dec 4, 2025

1. ⚠️ Critical: Missing TRY Mode Logic

Location: native/spark-expr/src/agg_funcs/sum_decimal.rs:166, 294, 344, 505

The implementation only checks for EvalMode::Ansi but doesn't handle EvalMode::Try. According to the Scala changes, TRY mode should return NULL on overflow (not throw), but your current code will fall through to the Legacy path.

// Current code (line 166)
if self.eval_mode == EvalMode::Ansi {
    return Err(DataFusionError::from(arithmetic_overflow_error("decimal")));
}

Recommendation: Update all overflow checks to:

if self.eval_mode == EvalMode::Ansi {
    return Err(DataFusionError::from(arithmetic_overflow_error("decimal")));
}
// TRY mode: continue with NULL assignment (existing behavior)

Or add an explicit comment if TRY and Legacy are intentionally identical:

if self.eval_mode == EvalMode::Ansi {
    return Err(DataFusionError::from(arithmetic_overflow_error("decimal")));
}
// TRY and Legacy modes: return NULL on overflow

value:incorrect-but-reasonable; category:bug; feedback:The Claude AI reviewer is not correct! The implementation is ANSI vs non-ANSI, i.e. TRY and LEGACY are treated the same since both of them should return NULL (i.e. None) when the sum overflows. Only ANSI mode should return an error.
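The author's point — ANSI raises an error while TRY and LEGACY both yield NULL — can be sketched as one standalone Rust function. The EvalMode enum and error string here are illustrative stand-ins mirroring the PR, not the actual Comet definitions:

```rust
// Hypothetical stand-in for Comet's EvalMode; names mirror the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EvalMode {
    Legacy,
    Try,
    Ansi,
}

/// Spark's decimal-SUM overflow semantics:
/// ANSI raises an error; TRY and LEGACY both yield NULL (None).
fn on_overflow(eval_mode: EvalMode) -> Result<Option<i128>, String> {
    if eval_mode == EvalMode::Ansi {
        return Err("ARITHMETIC_OVERFLOW: decimal".to_string());
    }
    // TRY and LEGACY modes: return NULL on overflow
    Ok(None)
}

fn main() {
    assert!(on_overflow(EvalMode::Ansi).is_err());
    assert_eq!(on_overflow(EvalMode::Try), Ok(None));
    assert_eq!(on_overflow(EvalMode::Legacy), Ok(None));
    println!("ok");
}
```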

@martin-augment Owner Author

📝 Documentation: Proto Field Comment Outdated

Location: native/proto/src/proto/expr.proto:123

The proto field changed from bool fail_on_error to EvalMode eval_mode, but there's no inline documentation explaining the new field or what the enum values mean.

Recommendation: Add a comment:

message Sum {
  Expr child = 1;
  DataType datatype = 2;
  EvalMode eval_mode = 3; // Controls overflow behavior: Legacy (NULL), Ansi (error), Try (NULL)
}

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! Having a short comment about the possible values and their behavior would be helpful for the maintenance of the protobuf file. It would prevent consulting with the rest of the code to see the possible values.

@martin-augment Owner Author

🔧 Code Quality: Magic Number in Tests

Location: spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:1554-1556

val maxDec38_0 = new java.math.BigDecimal("99999999999999999999")
(1 to 50).flatMap(_ => Seq((maxDec38_0, 1)))

The decimal value and count (50) are magic numbers without explanation. Why this specific value? How does it guarantee overflow?

Recommendation: Add a comment explaining the overflow calculation or use named constants.

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! A doc comment for the method would be helpful. The method generates 50 tuples of 99999999999999999999 and 1, which are used as values for the 50 rows in the test cases. Each test case runs a SQL query like "SELECT SUM(col(1)) from table", and summing these values leads to overflow.

@github-actions

github-actions bot commented Mar 3, 2026

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 3, 2026
