Skip to content

18647: Refactor approx_median signature & support f16#23

Open
martin-augment wants to merge 2 commits intomainfrom
pr-18647-2025-11-13-13-52-39
Open

18647: Refactor approx_median signature & support f16#23
martin-augment wants to merge 2 commits intomainfrom
pr-18647-2025-11-13-13-52-39

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

18647: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 13, 2025

Walkthrough

This change refactors aggregate function implementations in DataFusion by removing the TryIntoF64 conversion trait from tdigest.rs and updating approx_median and approx_percentile_cont functions. The functions now use direct type handling instead of trait-based conversions. Float16 support is introduced in approx_percentile_cont with corresponding conversion logic. The approx_median function is restructured to use signature variants and adds special handling for null input types. Debug derivations are added to multiple structs. Test coverage is expanded to include Float16, null, and decimal type scenarios for median and approx_median functions.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-18647-2025-11-13-13-52-39

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Nov 13, 2025

Pull Request Review: Refactor approx_median signature & support f16

Summary

This PR refactors the APPROX_MEDIAN aggregate function signature to use the new coercion system and adds support for Float16 data types. The changes are well-structured and demonstrate good understanding of the codebase.

Code Quality & Best Practices

Positive Changes

Improved type signature: The new Signature::one_of approach with TypeSignatureClass coercions is more maintainable and follows modern patterns
Consistent Debug implementation: Deriving Debug instead of manual implementation reduces boilerplate
NULL handling: Added proper null input handling using NoopAccumulator
Code cleanup: Removed unused TryIntoF64 trait and related code from tdigest.rs (confirmed no other usages exist)
Simplified conversions: Replaced fallible try_as_f64() pattern with direct casting, which is cleaner

Observations

  1. File: datafusion/functions-aggregate/src/approx_median.rs

    • Lines 77-90: The signature definition is more verbose but more explicit about intent
    • Lines 100-106: Good null-type handling in state_fields()
    • Lines 145-146: Proper use of NoopAccumulator for null types
  2. File: datafusion/functions-aggregate/src/approx_percentile_cont.rs

    • Lines 371-377: Float16 support properly implemented with v.to_f64()
    • Line 452: Correct Float16 result conversion using half::f16::from_f64(q)
    • Lines 362-415: Simplified conversion logic - much cleaner

Potential Issues & Concerns

Minor Issues

  1. Missing import in approx_median.rs:23

    use datafusion_functions_aggregate_common::noop_accumulator::NoopAccumulator;

    ⚠️ Need to verify this import path exists and is public. The code compiles, so this should be fine, but worth confirming.

  2. Return type validation removed (line 134-135 in approx_median.rs)

    • The old code validated numeric types in return_type()
    • Now relies on signature validation alone
    • This is actually correct - signature validation happens first and is more appropriate
    • However, ensure comprehensive tests exist for edge cases
  3. Integer to Float16 precision loss

    • approx_median.rs:79-81: The signature accepts integers but they'll be converted to Float64 internally, then potentially back to original type
    • For Float16 results from integer inputs, there could be precision loss
    • Consider documenting this behavior or adding validation

Test Coverage

Good test additions in aggregate.slt:

  • Line 913-916: Tests Float16 casting and type verification
  • Line 918-921: Tests NULL input handling

Suggestions for additional tests:

  1. Test APPROX_PERCENTILE_CONT directly with Float16 (not just through APPROX_MEDIAN)
  2. Test edge cases: very large integers converted to Float16
  3. Test with infinity and NaN for Float16
  4. Test APPROX_MEDIAN with pre-existing Float16 columns (not just casts)

Performance Considerations

Performance improvements:

  1. Removed the filter_map + transpose pattern in favor of direct mapping
  2. Eliminated unnecessary Result wrapping in conversion functions
  3. The new coercion system may be slightly more efficient at type checking

⚠️ Potential concern:

  • Float16 conversions use to_f64() which may have overhead
  • This is acceptable for approximate aggregates but worth monitoring
  • Consider profiling if Float16 becomes heavily used

Security Concerns

✅ No security issues identified:

  • No unsafe code introduced
  • No external input validation issues
  • Proper null handling prevents potential panics
  • Type conversions are bounded (e.g., q as i8 might truncate but won't cause UB)

Minor consideration: The casts like q as i8 on line 444 can silently truncate. While this is expected behavior for approximate aggregates, it could surprise users if the estimated percentile value exceeds the type bounds. Consider documenting this behavior.

Architecture & Design

Well-aligned with codebase patterns:

  1. Follows the new coercion signature pattern being adopted across aggregate functions
  2. Consistent with how other functions handle null inputs
  3. Maintains separation of concerns (signature vs accumulator logic)

Dead code removal: The TryIntoF64 trait removal is appropriate - verified no other usages exist

Recommendations

High Priority

  1. ✅ Code looks ready to merge pending CI

Medium Priority

  1. Add the suggested test cases for more comprehensive Float16 coverage
  2. Consider adding a comment in return_type() explaining why type validation was removed
  3. Document the precision loss behavior for Float16 in user-facing docs

Low Priority

  1. Consider profiling Float16 performance if it becomes a bottleneck in production workloads

Overall Assessment

This is a solid refactoring that improves code maintainability while adding useful Float16 support. The changes are well-thought-out and demonstrate good engineering practices. The test additions cover the core functionality, though additional edge case testing would strengthen confidence.

Recommendation: ✅ Approve with minor suggestions

Great work on modernizing the signature system and removing technical debt!

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)

179-197: Add Float16 to NUMERICS array in type coercion.

The concern is valid. Float16 was added to runtime code paths (create_accumulator lines 179-197, evaluate, and return type handling) but is missing from the NUMERICS array used for function signature validation. Since the function signature is built by iterating over NUMERICS, Float16 queries will fail at the planning stage before reaching the runtime implementations.

Fix required: Add DataType::Float16 to the NUMERICS array at datafusion/expr-common/src/type_coercion/aggregates.rs line 36-47, between Float32 and Float64 (or after Float64). This ensures the function signature accepts Float16 and allows queries to proceed to the runtime handlers that already support it.

🧹 Nitpick comments (3)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)

360-419: Reduce per-type conversion boilerplate in convert_to_float.

Optional: cast the (already null-filtered) input once to Float64 with arrow::compute::cast, then copy values() into Vec. This trims branching and keeps SIMD benefits in one place.

-    match values.data_type() {
-        DataType::Float64 => { ... }
-        DataType::Float32 => { ... }
-        DataType::Float16 => { ... }
-        DataType::Int64 => { ... }
-        ...
-    }
+    // Values are non-null; a single cast simplifies conversion paths.
+    let f64_arr = arrow::compute::cast(values, &DataType::Float64)?;
+    let f64_arr = datafusion_common::downcast_value!(&f64_arr, Float64Array);
+    Ok(f64_arr.values().to_vec())

Note: keep the explicit integer/uint acceptance in create_accumulator; this change only centralizes the numeric→f64 path post-filter.

datafusion/functions-aggregate/src/approx_median.rs (1)

99-123: Minor dedup: consider sharing TDigest state_fields builder.

Both approx_median and approx_percentile_cont build identical TDigest fields; a small helper would reduce drift.

datafusion/sqllogictest/test_files/aggregate.slt (1)

913-922: Add a symmetric Float16 test for approx_percentile_cont.

To fully exercise the new Float16 path in the shared accumulator, add:

+query RT
+select
+  approx_percentile_cont(0.5) within group (order by arrow_cast(col_f32, 'Float16')),
+  arrow_typeof(approx_percentile_cont(0.5) within group (order by arrow_cast(col_f32, 'Float16')))
+from median_table;
+----
+2.75 Float16
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47a826f and d3630a4.

📒 Files selected for processing (4)
  • datafusion/functions-aggregate-common/src/tdigest.rs (0 hunks)
  • datafusion/functions-aggregate/src/approx_median.rs (4 hunks)
  • datafusion/functions-aggregate/src/approx_percentile_cont.rs (6 hunks)
  • datafusion/sqllogictest/test_files/aggregate.slt (1 hunks)
💤 Files with no reviewable changes (1)
  • datafusion/functions-aggregate-common/src/tdigest.rs
🧰 Additional context used
🧬 Code graph analysis (2)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (2)
datafusion/functions-aggregate/src/percentile_cont.rs (4)
  • std (730-733)
  • fmt (127-132)
  • fmt (388-394)
  • fmt (674-680)
datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs (1)
  • fmt (121-125)
datafusion/functions-aggregate/src/approx_median.rs (4)
datafusion/functions-aggregate/src/approx_distinct.rs (3)
  • state_fields (329-345)
  • name (317-319)
  • default (280-282)
datafusion/expr/src/utils.rs (1)
  • format_state_name (1253-1255)
datafusion/expr-common/src/signature.rs (3)
  • one_of (1182-1188)
  • new_exact (914-916)
  • new_implicit (922-934)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (5)
  • state_fields (245-281)
  • new (135-152)
  • new (329-335)
  • name (283-285)
  • default (128-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (5)
datafusion/functions-aggregate/src/approx_percentile_cont.rs (1)

456-459: LGTM: Float16 result construction.

half::f16::from_f64(q) is the right way to produce ScalarValue::Float16.

datafusion/functions-aggregate/src/approx_median.rs (2)

75-91: Signature refactor is clear and matches desired coercions.

  • Integers preserved; Floats accept Decimal via implicit cast with default Float64.
  • Aligns with tests expecting approx_median(Decimal) → Float64 and approx_median(Float16) → Float16.

99-123: Null input handling is consistent and efficient.

Returning a Null state field and using NoopAccumulator avoids unnecessary TDigest state for all-null inputs. Good alignment with approx_distinct’s pattern.

Also applies to: 145-152

datafusion/sqllogictest/test_files/aggregate.slt (2)

913-917: Good coverage: approx_median on Float16 returns Float16.

This validates planner + runtime behavior post-refactor.


918-922: Good coverage: approx_median(NULL) yields NULL with Null type.

Matches the NoopAccumulator/state_fields changes.

@martin-augment
Copy link
Copy Markdown
Owner Author

Add Float16 to NUMERICS array in type coercion.

The concern is valid. Float16 was added to runtime code paths (create_accumulator lines 179-197, evaluate, and return type handling) but is missing from the NUMERICS array used for function signature validation. Since the function signature is built by iterating over NUMERICS, Float16 queries will fail at the planning stage before reaching the runtime implementations.

Fix required: Add DataType::Float16 to the NUMERICS array at datafusion/expr-common/src/type_coercion/aggregates.rs line 36-47, between Float32 and Float64 (or after Float64). This ensures the function signature accepts Float16 and allows queries to proceed to the runtime handlers that already support it.

value:delightful; category:bug; feedback:The CodeRabbit AI reviewer is totally correct that Float16 is missing in the NUMERICS array. The NUMERICS are planned for removal (https://github.com/Jefffrey/datafusion/blob/d3630a45272a604f912cbd05ab80de25b7f7c8bc/datafusion/expr-common/src/type_coercion/aggregates.rs#L23) but Float16 should be added until then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants