18647: Refactor approx_median signature & support f16 #23
martin-augment wants to merge 2 commits into main
Walkthrough
This change refactors aggregate function implementations in DataFusion by removing the TryIntoF64 conversion trait from tdigest.rs and updating the approx_median and approx_percentile_cont functions. The functions now handle each type directly instead of going through trait-based conversions. Float16 support is introduced in approx_percentile_cont, with corresponding conversion logic. The approx_median function is restructured to use signature variants and adds special handling for null input types. Debug derivations are added to multiple structs. Test coverage is expanded to include Float16, null, and decimal type scenarios for the median and approx_median functions.
Pull Request Review: Refactor approx_median signature & support f16
Summary
This PR refactors the
Code Quality & Best Practices
Positive Changes
✅ Improved type signature: The new
Observations
Potential Issues & Concerns
Minor Issues
Test Coverage
✅ Good test additions in
Suggestions for additional tests:
Performance Considerations
✅ Performance improvements:
Security Concerns
✅ No security issues identified:
Minor consideration: The casts like
Architecture & Design
✅ Well-aligned with codebase patterns:
✅ Dead code removal: The
Recommendations
High Priority
Medium Priority
Low Priority
Overall Assessment
This is a solid refactoring that improves code maintainability while adding useful Float16 support. The changes are well thought out and demonstrate good engineering practices. The test additions cover the core functionality, though additional edge case testing would strengthen confidence.
Recommendation: ✅ Approve with minor suggestions
Great work on modernizing the signature system and removing technical debt!
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)
179-197: Add Float16 to NUMERICS array in type coercion.
The concern is valid. Float16 was added to runtime code paths (create_accumulator lines 179-197, evaluate, and return type handling) but is missing from the NUMERICS array used for function signature validation. Since the function signature is built by iterating over NUMERICS, Float16 queries will fail at the planning stage before reaching the runtime implementations.
Fix required: Add DataType::Float16 to the NUMERICS array at datafusion/expr-common/src/type_coercion/aggregates.rs lines 36-47, between Float32 and Float64 (or after Float64). This ensures the function signature accepts Float16 and allows queries to proceed to the runtime handlers that already support it.
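A minimal std-only sketch of why a type missing from the list would be rejected before any runtime code runs. The DataType enum, the two constant lists, and the build_signature and plans functions are all hypothetical stand-ins, not DataFusion's actual types or API, and the sketch ignores any implicit coercion the real planner may attempt.

```rust
// Simplified stand-in for arrow's DataType; only the variants needed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Int64,
    Float16,
    Float32,
    Float64,
}

// Hypothetical analogue of the NUMERICS list with Float16 missing.
const NUMERICS_WITHOUT_F16: &[DataType] = &[
    DataType::Int32,
    DataType::Int64,
    DataType::Float32,
    DataType::Float64,
];

// The same list with Float16 added between Float32 and Float64.
const NUMERICS_WITH_F16: &[DataType] = &[
    DataType::Int32,
    DataType::Int64,
    DataType::Float32,
    DataType::Float16,
    DataType::Float64,
];

/// Builds the accepted (value type, percentile type) pairs by iterating
/// over the numeric list, as the review describes for the real signature.
fn build_signature(numerics: &[DataType]) -> Vec<(DataType, DataType)> {
    numerics.iter().map(|t| (*t, DataType::Float64)).collect()
}

/// Planning-time check: a call is accepted only if some pair matches
/// the value type exactly (no coercion in this sketch).
fn plans(signature: &[(DataType, DataType)], value: DataType) -> bool {
    signature.iter().any(|(v, _)| *v == value)
}
```

With the first list, plans(..., DataType::Float16) is false even though a runtime handler for Float16 may exist; with the second list it is true.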
🧹 Nitpick comments (3)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)
360-419: Reduce per-type conversion boilerplate in convert_to_float.
Optional: cast the (already null-filtered) input once to Float64 with arrow::compute::cast, then copy values() into Vec<f64>. This trims branching and keeps SIMD benefits in one place.
- match values.data_type() {
-     DataType::Float64 => { ... }
-     DataType::Float32 => { ... }
-     DataType::Float16 => { ... }
-     DataType::Int64 => { ... }
-     ...
- }
+ // Values are non-null; a single cast simplifies conversion paths.
+ let f64_arr = arrow::compute::cast(values, &DataType::Float64)?;
+ let f64_arr = datafusion_common::downcast_value!(&f64_arr, Float64Array);
+ Ok(f64_arr.values().to_vec())
Note: keep the explicit integer/uint acceptance in create_accumulator; this change only centralizes the numeric→f64 path post-filter.
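The idea of funnelling every numeric type through one widening path can be illustrated without Arrow. This is only a sketch: the Column enum and the widen helper are hypothetical, and the actual suggestion delegates the whole conversion to arrow::compute::cast on an ArrayRef rather than matching at all.

```rust
// Hypothetical stand-in for a null-filtered Arrow array of one numeric type.
enum Column {
    Float64(Vec<f64>),
    Float32(Vec<f32>),
    Int64(Vec<i64>),
    UInt8(Vec<u8>),
}

/// Single widening helper: applies one element-wise conversion to f64.
fn widen<T: Copy>(values: &[T], to_f64: impl Fn(T) -> f64) -> Vec<f64> {
    values.iter().map(|v| to_f64(*v)).collect()
}

/// Dispatches on the type once; every arm is a one-line call into `widen`.
fn convert_to_float(col: &Column) -> Vec<f64> {
    match col {
        Column::Float64(v) => v.clone(),
        Column::Float32(v) => widen(v, |x: f32| f64::from(x)),
        // i64 -> f64 is lossy above 2^53, so `as` is used instead of `From`.
        Column::Int64(v) => widen(v, |x: i64| x as f64),
        Column::UInt8(v) => widen(v, |x: u8| f64::from(x)),
    }
}
```

Keeping the conversion in one helper means a new input type only adds a single arm, which is the maintenance benefit the nitpick is after.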
datafusion/functions-aggregate/src/approx_median.rs (1)
99-123: Minor dedup: consider sharing TDigest state_fields builder.
Both approx_median and approx_percentile_cont build identical TDigest fields; a small helper would reduce drift.
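One possible shape for such a helper, sketched with std only. The Field struct, the tdigest_state_fields function, and the six component names and types are assumptions made for illustration; format_state_name is assumed to produce "name[state]", which should be checked against datafusion/expr/src/utils.rs.

```rust
// Hypothetical, simplified stand-in for an Arrow field: a name plus a type label.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Field {
    name: String,
    data_type: &'static str,
}

/// Assumed behaviour of format_state_name: "<function name>[<state name>]".
fn format_state_name(name: &str, state_name: &str) -> String {
    format!("{name}[{state_name}]")
}

/// Hypothetical shared builder. The component names and types listed here
/// are assumptions, not confirmed against the TDigest implementation.
fn tdigest_state_fields(fn_name: &str) -> Vec<Field> {
    const PARTS: [(&str, &str); 6] = [
        ("max_size", "UInt64"),
        ("sum", "Float64"),
        ("count", "UInt64"),
        ("max", "Float64"),
        ("min", "Float64"),
        ("centroids", "List<Float64>"),
    ];
    PARTS
        .iter()
        .map(|(part, ty)| Field {
            name: format_state_name(fn_name, part),
            data_type: *ty,
        })
        .collect()
}
```

Both aggregate functions would then call the one helper with their own name, so the state layout cannot diverge between them.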
datafusion/sqllogictest/test_files/aggregate.slt (1)
913-922: Add a symmetric Float16 test for approx_percentile_cont.
To fully exercise the new Float16 path in the shared accumulator, add:
+query RT
+select
+  approx_percentile_cont(0.5) within group (order by arrow_cast(col_f32, 'Float16')),
+  arrow_typeof(approx_percentile_cont(0.5) within group (order by arrow_cast(col_f32, 'Float16')))
+from median_table;
+----
+2.75 Float16
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- datafusion/functions-aggregate-common/src/tdigest.rs (0 hunks)
- datafusion/functions-aggregate/src/approx_median.rs (4 hunks)
- datafusion/functions-aggregate/src/approx_percentile_cont.rs (6 hunks)
- datafusion/sqllogictest/test_files/aggregate.slt (1 hunks)
💤 Files with no reviewable changes (1)
- datafusion/functions-aggregate-common/src/tdigest.rs
🧰 Additional context used
🧬 Code graph analysis (2)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (2)
datafusion/functions-aggregate/src/percentile_cont.rs (4)
std (730-733), fmt (127-132), fmt (388-394), fmt (674-680)
datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs (1)
fmt (121-125)
datafusion/functions-aggregate/src/approx_median.rs (4)
datafusion/functions-aggregate/src/approx_distinct.rs (3)
state_fields (329-345), name (317-319), default (280-282)
datafusion/expr/src/utils.rs (1)
format_state_name (1253-1255)
datafusion/expr-common/src/signature.rs (3)
one_of (1182-1188), new_exact (914-916), new_implicit (922-934)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (5)
state_fields (245-281), new (135-152), new (329-335), name (283-285), default (128-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (5)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)
456-459: LGTM: Float16 result construction.
half::f16::from_f64(q) is the right way to produce ScalarValue::Float16.
datafusion/functions-aggregate/src/approx_median.rs (2)
75-91: Signature refactor is clear and matches desired coercions.
- Integers preserved; Floats accept Decimal via implicit cast with default Float64.
- Aligns with tests expecting approx_median(Decimal) → Float64 and approx_median(Float16) → Float16.
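The coercion rule described in these bullets can be written down as a small mapping. This is a sketch under stated assumptions: the DataType enum is a reduced stand-in (a real Decimal128 carries precision and scale), and coerced_type is a hypothetical function, not part of DataFusion's signature API.

```rust
// Reduced stand-in for arrow's DataType; Decimal128 omits precision and scale.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int64,
    Float16,
    Float64,
    Decimal128,
}

/// Hypothetical summary of the rule: integers and floats keep their own
/// type, while decimals are implicitly cast to the default float, Float64.
fn coerced_type(input: DataType) -> DataType {
    match input {
        DataType::Decimal128 => DataType::Float64,
        other => other,
    }
}
```

Under this rule approx_median(Decimal) is evaluated as Float64 while approx_median(Float16) stays Float16, matching the expectations quoted above.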
99-123: Null input handling is consistent and efficient.
Returning a Null state field and using NoopAccumulator avoids unnecessary TDigest state for all-null inputs. Good alignment with approx_distinct’s pattern.
Also applies to: 145-152
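The null-type shortcut can be sketched as a dispatch in the accumulator factory. Everything here is a simplified assumption: the Accumulator trait is a two-method stand-in for DataFusion's trait, MedianAccumulator computes an exact median instead of a TDigest approximation, and the input_is_null_type flag replaces the real check on the argument's data type.

```rust
// Two-method stand-in for an aggregate accumulator.
trait Accumulator {
    fn update(&mut self, values: &[Option<f64>]);
    fn evaluate(&self) -> Option<f64>;
}

/// Keeps no state and always yields NULL, as used for Null-typed input.
struct NoopAccumulator;

impl Accumulator for NoopAccumulator {
    fn update(&mut self, _values: &[Option<f64>]) {}
    fn evaluate(&self) -> Option<f64> {
        None
    }
}

/// Exact median, standing in for the TDigest-based approximation.
#[derive(Default)]
struct MedianAccumulator {
    values: Vec<f64>,
}

impl Accumulator for MedianAccumulator {
    fn update(&mut self, values: &[Option<f64>]) {
        // Nulls are skipped; only present values are stored.
        self.values.extend(values.iter().flatten());
    }
    fn evaluate(&self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        // Upper median for even counts; enough for this illustration.
        Some(sorted[sorted.len() / 2])
    }
}

/// Picks the cheap no-op accumulator when the input column has Null type.
fn create_accumulator(input_is_null_type: bool) -> Box<dyn Accumulator> {
    if input_is_null_type {
        Box::new(NoopAccumulator)
    } else {
        Box::new(MedianAccumulator::default())
    }
}
```

The no-op path allocates nothing per batch, which is the efficiency point made in the comment above.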
datafusion/sqllogictest/test_files/aggregate.slt (2)
913-917: Good coverage: approx_median on Float16 returns Float16.
This validates planner + runtime behavior post-refactor.
918-922: Good coverage: approx_median(NULL) yields NULL with Null type.
Matches the NoopAccumulator/state_fields changes.
value: delightful; category: bug; feedback: The CodeRabbit AI reviewer is correct that Float16 is missing from the NUMERICS array. NUMERICS is planned for removal (https://github.com/Jefffrey/datafusion/blob/d3630a45272a604f912cbd05ab80de25b7f7c8bc/datafusion/expr-common/src/type_coercion/aggregates.rs#L23), but Float16 should be added until then.
18647: To review by AI