18677: feat: Handle edge case with corr with single row and NaN #24
martin-augment wants to merge 2 commits into main
Walkthrough
This change modifies the correlation aggregation functionality in DataFusion.
Code Review for PR #24: Handle edge case with
    // If BOTH sum_x AND sum_y are NaN, then both input values are NaN → return NaN
    // If only ONE of them is NaN, then only one input value is NaN → return NULL
    // This takes precedence over the count < 2 check
    if sum_x.is_nan() && sum_y.is_nan() {
Consider aligning CorrelationAccumulator.evaluate() (non-group path) with this NaN precedence: it currently returns NULL when n < 2, so a single row with both inputs NaN yields NULL, whereas grouped corr now yields NaN in the same scenario. This inconsistency can lead to different results for grouped vs non-grouped queries with NaNs.
value:useful; category:bug; feedback:The Augment AI reviewer is correct that the same NaN check should be applied in CorrelationAccumulator::evaluate() when the count is less than 2. Otherwise the results would be inconsistent: CorrelationGroupsAccumulator would return NaN, while CorrelationAccumulator would return NULL.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/functions-aggregate/src/correlation.rs (1)
414-445: Clarify NaN-handling comments to match actual semantics
The comment says “if any of the accumulated values contain NaN, the result should be NaN regardless of the count”, but the code does something more specific:
- sum_x.is_nan() && sum_y.is_nan() → result NaN.
- count < 2 || sum_x.is_nan() || sum_y.is_nan() (i.e., only one NaN or too few rows) → result NULL (0.0 with null bit).
This matches the new SQL tests (e.g., ids 1–4 NULL, 5–6 NaN) but contradicts the current wording. Also, the “Only one input is NaN → return NULL” inline comment is attached to a branch that also covers count < 2.
Consider tightening the comments to explicitly describe:
- “both sums NaN ⇒ NaN”
- “count < 2 or exactly one of the sums NaN ⇒ NULL”
to avoid confusion for future maintainers.
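The branch ordering described above can be sketched as a small standalone function. This is only an illustration under assumptions, not the code in correlation.rs: the name corr_from_sums, the use of Option<f64> to stand in for SQL NULL, the textbook Pearson formula over running sums, and the zero-variance → NULL choice are all hypothetical.

```rust
/// Hypothetical helper (not the DataFusion implementation).
/// `None` stands in for SQL NULL, `Some(v)` for a non-null result.
fn corr_from_sums(
    count: u64,
    sum_x: f64,
    sum_y: f64,
    sum_xy: f64,
    sum_xx: f64,
    sum_yy: f64,
) -> Option<f64> {
    // Both sums NaN => NaN. This check comes first, so it wins over `count < 2`.
    if sum_x.is_nan() && sum_y.is_nan() {
        return Some(f64::NAN);
    }
    // Too few rows, or exactly one of the sums NaN => NULL.
    if count < 2 || sum_x.is_nan() || sum_y.is_nan() {
        return None;
    }
    // Textbook Pearson correlation computed from running sums.
    let n = count as f64;
    let cov = n * sum_xy - sum_x * sum_y;
    let var_x = n * sum_xx - sum_x * sum_x;
    let var_y = n * sum_yy - sum_y * sum_y;
    let denom = (var_x * var_y).sqrt();
    // Assumption: treat zero variance as NULL rather than dividing by zero.
    if denom == 0.0 {
        None
    } else {
        Some(cov / denom)
    }
}
```

With this ordering, a single (NaN, NaN) row yields NaN, a single row with one NaN yields NULL, and a single fully valid row also yields NULL because count < 2.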
🧹 Nitpick comments (2)
datafusion/functions-aggregate/src/correlation.rs (2)
313-351: accumulate_correlation_states invariants and safety
The helper assumes:
- All state arrays have null_count() == 0.
- All state arrays share the same length as group_indices.
Both are reasonable for internal state, and you guard the first with assert_eq!(...). There is no explicit length check for the second, though a mismatch would lead to out‑of‑bounds reads.
Not strictly required, but you might:
- Add a brief doc comment stating that group_indices.len() must equal the length of each state array, and
- Optionally add a debug assertion on lengths to catch future misuse early.
This keeps the function clearly “internal‑only” while documenting expectations.
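A minimal sketch of the suggested doc comment and debug assertion, using plain slices instead of Arrow arrays so it stays self-contained. The function name accumulate_states and its reduced parameter list (only count and sum_x) are hypothetical; the real helper carries more state columns.

```rust
/// Hypothetical stand-in for the per-row state accumulation.
///
/// Invariant: `group_indices.len()` must equal the length of every
/// incoming state slice (`state_counts`, `state_sum_x`).
fn accumulate_states(
    group_indices: &[usize],
    state_counts: &[u64],
    state_sum_x: &[f64],
    counts: &mut [u64],
    sum_x: &mut [f64],
) {
    // Checked in debug builds only; compiled out in release builds.
    debug_assert_eq!(
        group_indices.len(),
        state_counts.len(),
        "state_counts must have one entry per group index"
    );
    debug_assert_eq!(
        group_indices.len(),
        state_sum_x.len(),
        "state_sum_x must have one entry per group index"
    );

    // Fold each incoming partial state into its target group.
    for (row, &group) in group_indices.iter().enumerate() {
        counts[group] += state_counts[row];
        sum_x[group] += state_sum_x[row];
    }
}
```

Using debug_assert_eq! rather than assert_eq! keeps the check out of release builds, which fits the "not strictly required" tone of the suggestion.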
482-528: merge_batch wiring looks correct; minor notes on invariants
The new merge_batch correctly:
- Resizes count/sum vectors to total_num_groups without zeroing already‑accumulated entries when lengths match.
- Reuses the same state layout as state() ([count, sum_x, sum_y, sum_xy, sum_xx, sum_yy]).
- Delegates the per‑row accumulation to accumulate_correlation_states.
The assert!(opt_filter.is_none(), ...) is a good guard that filters are applied only in the partial stage. Just be aware this will hard‑panic if planner behavior ever changes; if that’s a concern, you could convert this to a debug assertion or an early Err instead.
Functionally this looks sound and aligns with the partial‑state layout.
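The "early Err instead of panic" option and the resize behavior can be sketched roughly as follows. Everything here is a simplified assumption: the struct CorrState, the String error type, the slice-based signature, and the reduction to two state columns are hypothetical and do not mirror the actual GroupsAccumulator trait.

```rust
/// Hypothetical, reduced accumulator state (the real one has six columns).
struct CorrState {
    count: Vec<u64>,
    sum_x: Vec<f64>,
}

impl CorrState {
    /// Sketch of a merge step that rejects filters with an error
    /// instead of panicking.
    fn merge_batch(
        &mut self,
        group_indices: &[usize],
        state_counts: &[u64],
        state_sum_x: &[f64],
        opt_filter: Option<&[bool]>,
        total_num_groups: usize,
    ) -> Result<(), String> {
        // Early Err: filters are only expected in the partial stage.
        if opt_filter.is_some() {
            return Err("filter is not supported when merging partial states".to_string());
        }

        // `Vec::resize` only appends default values for new groups;
        // entries that were already accumulated are left untouched.
        self.count.resize(total_num_groups, 0);
        self.sum_x.resize(total_num_groups, 0.0);

        // Fold each incoming partial state into its group.
        for (row, &group) in group_indices.iter().enumerate() {
            self.count[group] += state_counts[row];
            self.sum_x[group] += state_sum_x[row];
        }
        Ok(())
    }
}
```

Returning an error lets the caller surface a query failure rather than aborting the process, at the cost of one extra branch per batch.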
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- datafusion/functions-aggregate/src/correlation.rs (2 hunks)
- datafusion/sqllogictest/test_files/aggregate.slt (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: claude-review
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
datafusion/sqllogictest/test_files/aggregate.slt (1)
600-657: NaN correlation tests align with implemented semantics
The f32/f64 correlation_query_with_nans_* blocks exercise:
- Single-row groups with a single NaN in one column → NULL.
- Mixed groups with (NaN, NaN) plus valid pairs → NaN.
- Single-row (NaN, NaN) → NaN.
These expectations match the new group correlation logic and nicely pin down the NaN vs NULL behavior across types. Looks good.
value:useful; category:bug; feedback:The Claude AI reviewer is correct that the same check for NaN should be applied in |
18677: To review by AI