18647: Refactor approx_median signature & support f16 by martin-augment · Pull Request #23 · martin-augment/datafusion

martin-augment · 2025-11-13T13:52:43Z

18647: To review by AI

coderabbitai · 2025-11-13T13:53:32Z

Walkthrough

This change refactors aggregate function implementations in DataFusion by removing the TryIntoF64 conversion trait from tdigest.rs and updating approx_median and approx_percentile_cont functions. The functions now use direct type handling instead of trait-based conversions. Float16 support is introduced in approx_percentile_cont with corresponding conversion logic. The approx_median function is restructured to use signature variants and adds special handling for null input types. Debug derivations are added to multiple structs. Test coverage is expanded to include Float16, null, and decimal type scenarios for median and approx_median functions.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pr-18647-2025-11-13-13-52-39

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2025-11-13T13:54:26Z

Pull Request Review: Refactor approx_median signature & support f16

Summary

This PR refactors the APPROX_MEDIAN aggregate function signature to use the new coercion system and adds support for Float16 data types. The changes are well-structured and demonstrate good understanding of the codebase.

Code Quality & Best Practices

Positive Changes

✅ Improved type signature: The new Signature::one_of approach with TypeSignatureClass coercions is more maintainable and follows modern patterns
✅ Consistent Debug implementation: Deriving Debug instead of manual implementation reduces boilerplate
✅ NULL handling: Added proper null input handling using NoopAccumulator
✅ Code cleanup: Removed unused TryIntoF64 trait and related code from tdigest.rs (confirmed no other usages exist)
✅ Simplified conversions: Replaced fallible try_as_f64() pattern with direct casting, which is cleaner

Observations

File: datafusion/functions-aggregate/src/approx_median.rs
- Lines 77-90: The signature definition is more verbose but more explicit about intent
- Lines 100-106: Good null-type handling in state_fields()
- Lines 145-146: Proper use of NoopAccumulator for null types
File: datafusion/functions-aggregate/src/approx_percentile_cont.rs
- Lines 371-377: Float16 support properly implemented with v.to_f64()
- Line 452: Correct Float16 result conversion using half::f16::from_f64(q)
- Lines 362-415: Simplified conversion logic - much cleaner

Potential Issues & Concerns

Minor Issues

Missing import in approx_median.rs:23
```
use datafusion_functions_aggregate_common::noop_accumulator::NoopAccumulator;
```
⚠️ Need to verify this import path exists and is public. The code compiles, so this should be fine, but worth confirming.
Return type validation removed (line 134-135 in approx_median.rs)
- The old code validated numeric types in return_type()
- Now relies on signature validation alone
- This is actually correct - signature validation happens first and is more appropriate
- However, ensure comprehensive tests exist for edge cases
Integer to Float16 precision loss
- approx_median.rs:79-81: The signature accepts integers but they'll be converted to Float64 internally, then potentially back to original type
- For Float16 results from integer inputs, there could be precision loss
- Consider documenting this behavior or adding validation

Test Coverage

✅ Good test additions in aggregate.slt:

Line 913-916: Tests Float16 casting and type verification
Line 918-921: Tests NULL input handling

Suggestions for additional tests:

Test APPROX_PERCENTILE_CONT directly with Float16 (not just through APPROX_MEDIAN)
Test edge cases: very large integers converted to Float16
Test with infinity and NaN for Float16
Test APPROX_MEDIAN with pre-existing Float16 columns (not just casts)

Performance Considerations

✅ Performance improvements:

Removed the filter_map + transpose pattern in favor of direct mapping
Eliminated unnecessary Result wrapping in conversion functions
The new coercion system may be slightly more efficient at type checking

⚠️ Potential concern:

Float16 conversions use to_f64() which may have overhead
This is acceptable for approximate aggregates but worth monitoring
Consider profiling if Float16 becomes heavily used

Security Concerns

✅ No security issues identified:

No unsafe code introduced
No external input validation issues
Proper null handling prevents potential panics
Type conversions are bounded (e.g., q as i8 might truncate but won't cause UB)

Minor consideration: The casts like q as i8 on line 444 can silently truncate. While this is expected behavior for approximate aggregates, it could surprise users if the estimated percentile value exceeds the type bounds. Consider documenting this behavior.

Architecture & Design

✅ Well-aligned with codebase patterns:

Follows the new coercion signature pattern being adopted across aggregate functions
Consistent with how other functions handle null inputs
Maintains separation of concerns (signature vs accumulator logic)

✅ Dead code removal: The TryIntoF64 trait removal is appropriate - verified no other usages exist

Recommendations

High Priority

✅ Code looks ready to merge pending CI

Medium Priority

Add the suggested test cases for more comprehensive Float16 coverage
Consider adding a comment in return_type() explaining why type validation was removed
Document the precision loss behavior for Float16 in user-facing docs

Low Priority

Consider profiling Float16 performance if it becomes a bottleneck in production workloads

Overall Assessment

This is a solid refactoring that improves code maintainability while adding useful Float16 support. The changes are well-thought-out and demonstrate good engineering practices. The test additions cover the core functionality, though additional edge case testing would strengthen confidence.

Recommendation: ✅ Approve with minor suggestions

Great work on modernizing the signature system and removing technical debt!

augmentcode

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)

179-197: Add Float16 to NUMERICS array in type coercion.

The concern is valid. Float16 was added to runtime code paths (create_accumulator lines 179-197, evaluate, and return type handling) but is missing from the NUMERICS array used for function signature validation. Since the function signature is built by iterating over NUMERICS, Float16 queries will fail at the planning stage before reaching the runtime implementations.

Fix required: Add DataType::Float16 to the NUMERICS array at datafusion/expr-common/src/type_coercion/aggregates.rs line 36-47, between Float32 and Float64 (or after Float64). This ensures the function signature accepts Float16 and allows queries to proceed to the runtime handlers that already support it.

🧹 Nitpick comments (3)

datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)
360-419: Reduce per-type conversion boilerplate in convert_to_float.

Optional: cast the (already null-filtered) input once to Float64 with arrow::compute::cast, then copy values() into Vec. This trims branching and keeps SIMD benefits in one place.
-    match values.data_type() {
-        DataType::Float64 => { ... }
-        DataType::Float32 => { ... }
-        DataType::Float16 => { ... }
-        DataType::Int64 => { ... }
-        ...
-    }
+    // Values are non-null; a single cast simplifies conversion paths.
+    let f64_arr = arrow::compute::cast(values, &DataType::Float64)?;
+    let f64_arr = datafusion_common::downcast_value!(&f64_arr, Float64Array);
+    Ok(f64_arr.values().to_vec())
Note: keep the explicit integer/uint acceptance in create_accumulator; this change only centralizes the numeric→f64 path post-filter.
datafusion/functions-aggregate/src/approx_median.rs (1)

99-123: Minor dedup: consider sharing TDigest state_fields builder.

Both approx_median and approx_percentile_cont build identical TDigest fields; a small helper would reduce drift.
datafusion/sqllogictest/test_files/aggregate.slt (1)
913-922: Add a symmetric Float16 test for approx_percentile_cont.

To fully exercise the new Float16 path in the shared accumulator, add:
+query RT
+select
+  approx_percentile_cont(0.5) within group (order by arrow_cast(col_f32, 'Float16')),
+  arrow_typeof(approx_percentile_cont(0.5) within group (order by arrow_cast(col_f32, 'Float16')))
+from median_table;
+----
+2.75 Float16

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47a826f and d3630a4.

📒 Files selected for processing (4)

datafusion/functions-aggregate-common/src/tdigest.rs (0 hunks)
datafusion/functions-aggregate/src/approx_median.rs (4 hunks)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (6 hunks)
datafusion/sqllogictest/test_files/aggregate.slt (1 hunks)

💤 Files with no reviewable changes (1)

datafusion/functions-aggregate-common/src/tdigest.rs

🧰 Additional context used

🧬 Code graph analysis (2)

datafusion/functions-aggregate/src/approx_percentile_cont.rs (2)

datafusion/functions-aggregate/src/percentile_cont.rs (4)

std (730-733)

fmt (127-132)

fmt (388-394)

fmt (674-680)

datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs (1)

fmt (121-125)

datafusion/functions-aggregate/src/approx_median.rs (4)

datafusion/functions-aggregate/src/approx_distinct.rs (3)

state_fields (329-345)

name (317-319)

default (280-282)

datafusion/expr/src/utils.rs (1)

format_state_name (1253-1255)

datafusion/expr-common/src/signature.rs (3)

one_of (1182-1188)

new_exact (914-916)

new_implicit (922-934)

datafusion/functions-aggregate/src/approx_percentile_cont.rs (5)

state_fields (245-281)

new (135-152)

new (329-335)

name (283-285)

default (128-130)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Analyze (rust)

🔇 Additional comments (5)

datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)

456-459: LGTM: Float16 result construction.

half::f16::from_f64(q) is the right way to produce ScalarValue::Float16.

datafusion/functions-aggregate/src/approx_median.rs (2)

75-91: Signature refactor is clear and matches desired coercions.

Integers preserved; Floats accept Decimal via implicit cast with default Float64.

Aligns with tests expecting approx_median(Decimal) → Float64 and approx_median(Float16) → Float16.

99-123: Null input handling is consistent and efficient.

Returning a Null state field and using NoopAccumulator avoids unnecessary TDigest state for all-null inputs. Good alignment with approx_distinct’s pattern.

Also applies to: 145-152

datafusion/sqllogictest/test_files/aggregate.slt (2)

913-917: Good coverage: approx_median on Float16 returns Float16.

This validates planner + runtime behavior post-refactor.

918-922: Good coverage: approx_median(NULL) yields NULL with Null type.

Matches the NoopAccumulator/state_fields changes.

martin-augment · 2025-11-13T14:24:13Z

Add Float16 to NUMERICS array in type coercion.

The concern is valid. Float16 was added to runtime code paths (create_accumulator lines 179-197, evaluate, and return type handling) but is missing from the NUMERICS array used for function signature validation. Since the function signature is built by iterating over NUMERICS, Float16 queries will fail at the planning stage before reaching the runtime implementations.

Fix required: Add DataType::Float16 to the NUMERICS array at datafusion/expr-common/src/type_coercion/aggregates.rs line 36-47, between Float32 and Float64 (or after Float64). This ensures the function signature accepts Float16 and allows queries to proceed to the runtime handlers that already support it.

value:delightful; category:bug; feedback:The CodeRabbit AI reviewer is totally correct that Float16 is missing in the NUMERICS array. The NUMERICS are planned for removal (https://github.com/Jefffrey/datafusion/blob/d3630a45272a604f912cbd05ab80de25b7f7c8bc/datafusion/expr-common/src/type_coercion/aggregates.rs#L23) but Float16 should be added until then.

Jefffrey added 2 commits November 12, 2025 21:38

Refactor approx_median signature & support f16

8e3676e

clippy

d3630a4

augmentcode bot reviewed Nov 13, 2025

View reviewed changes

coderabbitai bot reviewed Nov 13, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

18647: Refactor approx_median signature & support f16#23

18647: Refactor approx_median signature & support f16#23
martin-augment wants to merge 2 commits intomainfrom
pr-18647-2025-11-13-13-52-39

martin-augment commented Nov 13, 2025

Uh oh!

coderabbitai bot commented Nov 13, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 13, 2025

Uh oh!

augmentcode bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

martin-augment commented Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

martin-augment commented Nov 13, 2025

Uh oh!

coderabbitai bot commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

claude bot commented Nov 13, 2025

Pull Request Review: Refactor approx_median signature & support f16

Summary

Code Quality & Best Practices

Positive Changes

Observations

Potential Issues & Concerns

Minor Issues

Test Coverage

Performance Considerations

Security Concerns

Architecture & Design

Recommendations

High Priority

Medium Priority

Low Priority

Overall Assessment

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

martin-augment commented Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Nov 13, 2025 •

edited

Loading