Skip to content

18677: feat: Handle edge case with corr with single row and NaN#24

Open
martin-augment wants to merge 2 commits intomainfrom
pr-18677-2025-11-13-20-31-04
Open

18677: feat: Handle edge case with corr with single row and NaN#24
martin-augment wants to merge 2 commits intomainfrom
pr-18677-2025-11-13-20-31-04

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

18677: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 13, 2025

Walkthrough

This change modifies the correlation aggregation functionality in DataFusion. The CorrelationGroupsAccumulator::evaluate method now includes NaN-aware logic that treats NaN values in accumulated sums with precedence over group size checks. The merge_batch method has been refactored to explicitly handle merging of pre-aggregated partial states by delegating to accumulate_correlation_states. Additionally, new test cases have been introduced to verify NaN handling behavior in correlation computations for both float32 and float64 data types with various null and NaN combinations.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-18677-2025-11-13-20-31-04

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Nov 13, 2025

Code Review for PR #24: Handle edge case with corr with single row and NaN

Summary

This PR addresses NaN handling in the corr() aggregate function for the CorrelationGroupsAccumulator implementation. The changes aim to correctly handle edge cases where NaN values appear in correlation calculations.


Code Quality and Best Practices ✅

Positives:

  • Clear, well-documented logic with inline comments explaining the NaN handling behavior
  • Good test coverage with comprehensive test cases for both float32 and float64
  • Code reorganization (moving merge_batch after evaluate) improves readability

Suggestions:

  1. Code organization (correlation.rs:482-528): While moving merge_batch after evaluate is not wrong, it breaks the typical trait method ordering. Consider keeping the original order to maintain consistency with other accumulators in the codebase, unless there's a specific reason for this change.

  2. Comment clarity (correlation.rs:432-434): The comment logic could be clearer:

    // If BOTH sum_x AND sum_y are NaN, then both input values are NaN → return NaN
    // If only ONE of them is NaN, then only one input value is NaN → return NULL

    This is slightly misleading for multi-row groups. Consider:

    // If BOTH sum_x AND sum_y are NaN, all accumulated values for both columns are NaN → return NaN
    // If only ONE sum is NaN, or if count < 2, insufficient valid data → return NULL

Potential Bugs or Issues ⚠️

Critical Issue - Inconsistent Behavior:

  1. Missing NaN handling in CorrelationAccumulator (correlation.rs:175-222): The PR only updates CorrelationGroupsAccumulator but not the standard CorrelationAccumulator. This creates inconsistent behavior:

    • CorrelationGroupsAccumulator: Returns NaN when both inputs are NaN (after this PR)
    • CorrelationAccumulator: Still returns NULL for all NaN cases (lines 198-221)

    Impact: Queries might produce different results depending on whether they use group accumulator or regular accumulator.

    Recommendation: Update CorrelationAccumulator.evaluate() to match the NaN handling logic, OR document why the behavior should differ.

  2. Edge case with sum propagation (correlation.rs:435-440): The logic assumes that if sum_x.is_nan() && sum_y.is_nan(), then both input columns had NaN values. However, consider:

    • If row 1 has (1.0, NaN) and row 2 has (NaN, 1.0), then sum_x = NaN and sum_y = NaN
    • In this case, NOT all values are NaN, but the function returns NaN instead of NULL
    • Expected behavior: Should probably return NULL since we have mixed NaN/non-NaN values

    Recommendation: This might actually be correct behavior (NaN propagation), but it should be explicitly tested and documented if intentional.


Performance Considerations ⚡

Acceptable:

  • The additional is_nan() checks add minimal overhead (just floating-point comparisons)
  • These checks occur only once per group during evaluation, not during accumulation
  • Performance impact is negligible

Minor optimization opportunity:

  • Lines 440-445 check sum_x.is_nan() || sum_y.is_nan() after already checking both. Could be micro-optimized, but the current code prioritizes readability, which is preferred.

Security Concerns 🔒

No security issues identified. The code:

  • Handles NaN values correctly without potential for crashes
  • Uses standard library functions
  • No unsafe code blocks
  • No external input validation concerns

Test Coverage 🧪

Strengths:

  • Excellent test coverage for NaN scenarios
  • Tests both float32 and float64 types
  • Covers multiple edge cases:
    • One column NaN, other column valid
    • Both columns NaN
    • One column NaN, other column NULL
    • Multiple rows with all NaN
    • Mixed groups with both NaN rows and valid rows

Gaps:

  1. Missing test for mixed NaN propagation: Test the edge case mentioned above where different rows contribute NaN to different sums:

    SELECT corr(a, b) FROM (VALUES (1.0, 'nan'::float), ('nan'::float, 1.0)) AS t(a, b);
  2. No tests for CorrelationAccumulator consistency: Since there's an inconsistency with the non-groups accumulator, add tests that verify both code paths produce the same results.

  3. No tests for multi-row groups with partial NaN: Test groups that have some NaN rows and some valid rows (e.g., 3 valid rows + 1 NaN row).

  4. Missing edge case: What happens with (0.0, 0.0) vs (NaN, NaN)? Should be tested.


Recommendations

Must Fix:

  1. ⚠️ Address the inconsistency between CorrelationAccumulator and CorrelationGroupsAccumulator NaN handling

Should Fix:
2. Add test for mixed NaN propagation case
3. Clarify comments about NaN handling behavior
4. Add tests verifying both accumulator implementations produce consistent results

Nice to Have:
5. Consider keeping original method ordering (move merge_batch back before evaluate)
6. Add test for partial NaN in multi-row groups


Overall Assessment

Code Quality: 8/10 - Well-written with good documentation
Correctness: 6/10 - Logic appears sound but has consistency issues with the other accumulator
Test Coverage: 7/10 - Good coverage but missing some edge cases
Performance: 10/10 - No performance concerns

Recommendation: Request Changes - The inconsistency between the two accumulator implementations should be addressed before merging.


🤖 Generated with Claude Code

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Comment augment review to trigger a new review at any time.

// If BOTH sum_x AND sum_y are NaN, then both input values are NaN → return NaN
// If only ONE of them is NaN, then only one input value is NaN → return NULL
// This takes precedence over the count < 2 check
if sum_x.is_nan() && sum_y.is_nan() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider aligning CorrelationAccumulator.evaluate() (non-group path) with this NaN precedence: it currently returns NULL when n < 2, so a single row with both inputs NaN yields NULL, whereas grouped corr now yields NaN in the same scenario. This inconsistency can lead to different results for grouped vs non-grouped queries with NaNs.

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Augment AI reviewer is correct that the same check for NaN should be applied in CorrelationAccumulator::evaluate() if the count is less than 2. Otherwise it would be inconsistent - CorrelationGroupsAccumulator would return NaN, CorrelationAccumulator would return Null

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/functions-aggregate/src/correlation.rs (1)

414-445: Clarify NaN-handling comments to match actual semantics

The comment says “if any of the accumulated values contain NaN, the result should be NaN regardless of the count”, but the code does something more specific:

  • sum_x.is_nan() && sum_y.is_nan() → result NaN.
  • count < 2 || sum_x.is_nan() || sum_y.is_nan() (i.e., only one NaN or too few rows) → result NULL (0.0 with null bit).

This matches the new SQL tests (e.g., ids 1–4 NULL, 5–6 NaN) but contradicts the current wording. Also, the “Only one input is NaN → return NULL” inline comment is attached to a branch that also covers count < 2.

Consider tightening the comments to explicitly describe:

  • “both sums NaN ⇒ NaN”
  • “count < 2 or exactly one of the sums NaN ⇒ NULL”

to avoid confusion for future maintainers.

🧹 Nitpick comments (2)
datafusion/functions-aggregate/src/correlation.rs (2)

313-351: accumulate_correlation_states invariants and safety

The helper assumes:

  • All state arrays have null_count() == 0.
  • All state arrays share the same length as group_indices.

Both are reasonable for internal state, and you guard the first with assert_eq!(...). There is no explicit length check for the second, though a mismatch would lead to out‑of‑bounds reads.

Not strictly required, but you might:

  • Add a brief doc comment stating that group_indices.len() must equal the length of each state array, and
  • Optionally add a debug assertion on lengths to catch future misuse early.

This keeps the function clearly “internal‑only” while documenting expectations.


482-528: merge_batch wiring looks correct; minor notes on invariants

The new merge_batch correctly:

  • Resizes count/sum vectors to total_num_groups without zeroing already‑accumulated entries when lengths match.
  • Reuses the same state layout as state() ([count, sum_x, sum_y, sum_xy, sum_xx, sum_yy]).
  • Delegates the per‑row accumulation to accumulate_correlation_states.

The assert!(opt_filter.is_none(), ...) is a good guard that filters are applied only in the partial stage. Just be aware this will hard‑panic if planner behavior ever changes; if that’s a concern, you could convert this to a debug assertion or an early Err instead.

Functionally this looks sound and aligns with the partial‑state layout.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between baccbf0 and 2d344a5.

📒 Files selected for processing (2)
  • datafusion/functions-aggregate/src/correlation.rs (2 hunks)
  • datafusion/sqllogictest/test_files/aggregate.slt (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (1)
datafusion/sqllogictest/test_files/aggregate.slt (1)

600-657: NaN correlation tests align with implemented semantics

The f32/f64 correlation_query_with_nans_* blocks exercise:

  • Single-row groups with a single NaN in one column → NULL.
  • Mixed groups with (NaN, NaN) plus valid pairs → NaN.
  • Single-row (NaN, NaN)NaN.

These expectations match the new group correlation logic and nicely pin down the NaN vs NULL behavior across types. Looks good.

@martin-augment
Copy link
Copy Markdown
Owner Author

Missing NaN handling in CorrelationAccumulator (correlation.rs:175-222): The PR only updates CorrelationGroupsAccumulator but not the standard CorrelationAccumulator. This creates inconsistent behavior:

  • CorrelationGroupsAccumulator: Returns NaN when both inputs are NaN (after this PR)
  • CorrelationAccumulator: Still returns NULL for all NaN cases (lines 198-221)

Impact: Queries might produce different results depending on whether they use group accumulator or regular accumulator.

value:useful; category:bug; feedback:The Claude AI reviewer is correct that the same check for NaN should be applied in CorrelationAccumulator::evaluate() if the count is less than 2. Otherwise it would be inconsistent - CorrelationGroupsAccumulator would return NaN, CorrelationAccumulator would return Null

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants