18649: Refactor Spark bitshift signature#18

Open
martin-augment wants to merge 1 commit into main from pr-18649-2025-11-12-21-32-06

Conversation

@martin-augment
Owner

18649: To review by AI

@coderabbitai

coderabbitai bot commented Nov 12, 2025

Walkthrough

This pull request extends the make_udf_function macro to support an optional custom constructor parameter, allowing callers to supply a custom constructor path instead of always defaulting to <$UDF>::new(). The bitwise shift UDF implementations are consolidated from three separate types (SparkShiftLeft, SparkShiftRight, SparkShiftRightUnsigned) into a unified SparkBitShift implementation with variant constructors (left, right, right_unsigned). The bitwise module is updated to use the new macro overload with the consolidated implementation. Test coverage for bitwise shift operations is expanded to include null handling, boundary conditions, and multiple numeric types.
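The optional-constructor pattern described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual make_udf_function! from datafusion/functions/src/macros.rs (which additionally caches the UDF instance); MyUdf, my_udf, and my_udf_left are illustrative names only.

```rust
// Simplified sketch: a macro arm that accepts an optional constructor
// expression, falling back to <$udf>::new() when none is given.
macro_rules! make_udf_function {
    ($udf:ty, $fn_name:ident) => {
        // Default arm delegates to the custom-constructor arm.
        make_udf_function!($udf, $fn_name, <$udf>::new());
    };
    ($udf:ty, $fn_name:ident, $ctor:expr) => {
        pub fn $fn_name() -> std::sync::Arc<$udf> {
            std::sync::Arc::new($ctor)
        }
    };
}

#[derive(Debug)]
pub struct MyUdf {
    pub variant: &'static str,
}

impl MyUdf {
    pub fn new() -> Self { Self { variant: "default" } }
    pub fn left() -> Self { Self { variant: "left" } }
}

make_udf_function!(MyUdf, my_udf);                     // uses MyUdf::new()
make_udf_function!(MyUdf, my_udf_left, MyUdf::left()); // custom constructor
```

The two-arm shape keeps existing call sites unchanged while letting the bitwise module pass SparkBitShift::left() and friends as the constructor expression.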


@claude

claude bot commented Nov 12, 2025

PR Review: Refactor Spark bitshift signature

Summary

This PR refactors the Spark bitshift functions by consolidating three separate implementations (SparkShiftLeft, SparkShiftRight, SparkShiftRightUnsigned) into a single unified SparkBitShift struct. The refactoring achieves significant code reduction (589 deletions vs 447 additions) while maintaining functionality and improving maintainability.


✅ Code Quality & Best Practices

Excellent refactoring work! The consolidation is well-executed:

  1. DRY Principle: The unified SparkBitShift struct eliminates code duplication across three nearly identical implementations
  2. Clean Architecture: The BitShiftType enum provides a clear, type-safe way to distinguish between shift operations
  3. Factory Methods: The left(), right(), and right_unsigned() constructors provide a clean API
  4. Macro Extension: The addition of a custom constructor parameter to make_udf_function! is a thoughtful enhancement that enables this refactoring
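As a rough sketch of the pattern praised above (field and method names here are assumptions, not the exact bit_shift.rs code), the unified struct with an enum discriminator looks like:

```rust
// One struct with an enum discriminator replacing three near-identical
// UDF types. apply_i32 stands in for the real Arrow-kernel dispatch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BitShiftType {
    Left,
    Right,
    RightUnsigned,
}

#[derive(Debug)]
struct SparkBitShift {
    shift_type: BitShiftType,
}

impl SparkBitShift {
    fn left() -> Self { Self { shift_type: BitShiftType::Left } }
    fn right() -> Self { Self { shift_type: BitShiftType::Right } }
    fn right_unsigned() -> Self { Self { shift_type: BitShiftType::RightUnsigned } }

    // Single dispatch point: one enum match instead of three implementations.
    fn apply_i32(&self, value: i32, shift: u32) -> i32 {
        match self.shift_type {
            BitShiftType::Left => value.wrapping_shl(shift),
            // Arithmetic shift preserves the sign bit.
            BitShiftType::Right => value.wrapping_shr(shift),
            // Logical shift fills with zeros, as shiftrightunsigned requires.
            BitShiftType::RightUnsigned => ((value as u32).wrapping_shr(shift)) as i32,
        }
    }
}
```

The enum match is resolved once per kernel invocation, which is why the consolidation costs essentially nothing at runtime.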

Code organization is logical:

  • Shift functions (shift_left, shift_right, shift_right_unsigned) remain separate and focused
  • shift_inner() provides a unified dispatch mechanism
  • Type coercion is handled elegantly through the new Signature approach

🐛 Potential Issues

1. Missing UShr implementations for smaller integer types

The UShr trait only implements for i32, i64, u32, and u64. However, the signature allows Int8, Int16, UInt8, and UInt16 to be coerced to Int32/UInt32. While the coercion handles this at the type level, the trait bounds in shift_inner() should be verified to work correctly with the coerced types.

Location: bit_shift.rs:88-114
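For reference, an unsigned-right-shift helper trait of the kind mentioned here could look like the following sketch; the real UShr in bit_shift.rs may differ in shape. Since Int8/Int16 (and UInt8/UInt16) arguments are coerced to 32-bit types before execution, impls for the four wider types suffice:

```rust
// Hypothetical sketch of an unsigned (logical) right-shift trait.
// Casting through the unsigned twin forces zero-fill instead of sign-extension.
trait UShr {
    fn ushr(self, shift: u32) -> Self;
}

impl UShr for i32 {
    fn ushr(self, shift: u32) -> Self {
        ((self as u32) >> shift) as i32
    }
}

impl UShr for i64 {
    fn ushr(self, shift: u32) -> Self {
        ((self as u64) >> shift) as i64
    }
}

// For unsigned types, >> is already a logical shift.
impl UShr for u32 {
    fn ushr(self, shift: u32) -> Self { self >> shift }
}

impl UShr for u64 {
    fn ushr(self, shift: u32) -> Self { self >> shift }
}
```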

2. Potential type mismatch in null handling

In shift_inner() at line 147, when both inputs are null, the function returns Int32Array::new_null() regardless of the input type. This might not match the expected return type for Int64/UInt64 inputs.

    if value_array.data_type().is_null() || shift_array.data_type().is_null() {
        return Ok(Arc::new(Int32Array::new_null(value_array.len())));
    }

Recommendation: Return a null array of the same type as arg_types[0] from return_type(), or use a type-specific null array constructor.

3. Unused import: ArrowNativeType

The import arrow::datatypes::ArrowNativeType at line 24 appears to be unused after the refactoring.


⚡ Performance Considerations

Positive:

  • No performance regressions expected - the core shift logic remains identical
  • The unified dispatch through shift_inner() adds minimal overhead (single enum match)
  • The use of compute::binary() for vectorized operations is appropriate

Note:

  • The shift normalization ((shift % bit_num) + bit_num) % bit_num is computed for every element. This is correct for Spark/Java semantics but does involve two modulo operations per element. This is acceptable given the requirement to match Spark behavior.
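The normalization can be seen in isolation with plain integers (a standalone sketch; bit_num would be 32 or 64 depending on the value type):

```rust
// pmod-style normalization: maps any shift amount into [0, bit_num),
// matching Java/Spark semantics for negative and out-of-range shifts.
fn normalize_shift(shift: i32, bit_num: i32) -> i32 {
    ((shift % bit_num) + bit_num) % bit_num
}
```

For example, normalize_shift(33, 32) yields 1 and normalize_shift(-1, 32) yields 31, so a 33-bit left shift of a 32-bit value behaves like a 1-bit shift, exactly as Java's << does with its low-five-bits rule.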

🔒 Security Concerns

No security issues identified. The code properly handles:

  • Integer overflow through shift normalization
  • Null values
  • Edge cases (negative shifts, large shifts)

The bit manipulation operations are safe and follow established patterns.


🧪 Test Coverage

Excellent test coverage! The PR adds comprehensive test files:

  1. shiftleft.slt (84 lines): Tests basic operations, data types, large shifts, negative shifts, null handling, and edge cases
  2. shiftright.slt (99 lines): Similar comprehensive coverage including sign preservation for signed integers
  3. shiftrightunsigned.slt (87 lines): Tests unsigned shift behavior including negative value handling

Test scenarios covered:

  • ✅ Basic shift operations
  • ✅ Different integer types (Int32, Int64, UInt32, UInt64)
  • ✅ Type coercion between different integer sizes
  • ✅ Large shifts (> bit width)
  • ✅ Negative shifts
  • ✅ Zero shifts
  • ✅ Null handling
  • ✅ Edge cases (max values, negative values)
  • ✅ Sign preservation (signed vs unsigned shifts)

Minor observation: shiftleft.slt would logically live alongside the other bitwise shift tests in spark/bitwise/, but the file list shows it under spark/math/ - this might be intentional or worth verifying.


📝 Suggestions

  1. Fix the null handling type issue (see Potential Issue 2 above) - ensure the return type matches arg_types[0]

  2. Remove unused import:

    use arrow::datatypes::{
    -    ArrowNativeType, DataType, Int32Type, Int64Type, UInt32Type, UInt64Type,
    +    DataType, Int32Type, Int64Type, UInt32Type, UInt64Type,
    };
  3. Consider adding a comment explaining the shift normalization formula for future maintainers:

    // Normalize shift amount using pmod: ((shift % bit_num) + bit_num) % bit_num
    // This matches Spark/Java semantics for negative and out-of-range shifts
    let shift = ((shift % bit_num) + bit_num) % bit_num;
  4. Documentation: The removal of function-level docs from the individual shift functions is fine since they're now private, but consider ensuring the public API (via mod.rs export_functions!) has clear documentation - which it does! ✅


🎯 Conclusion

This is a high-quality refactoring that improves code maintainability without sacrificing functionality or performance. The consolidation of three separate UDF implementations into one is a significant improvement.

Recommendation: Approve with minor fixes

The only actionable item is addressing the null handling type consistency issue. The other suggestions are minor improvements that can be addressed at the maintainer's discretion.

Great work on this refactoring! The approach of using a unified struct with an enum discriminator is a clean pattern that could be applied to similar function families in the codebase.


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestions posted.


    &arg_types[1],
    ));
    if value_array.data_type().is_null() || shift_array.data_type().is_null() {
        return Ok(Arc::new(Int32Array::new_null(value_array.len())));


Null handling here can return an Int32Array even when the first (value) argument is non-null and e.g. Int64/UInt64. Per return_type, the result type should follow the first arg; only when the first arg is Null should Int32 be produced. (Also applies if shift is Null but value is not.)


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! It should return the type of the value_array and fallback to Int32Array only when value_array.data_type().is_null(). The finding prevents problems in eventual usage of the returned value in more kernels which expect the correct type.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6b7e1 and a2f260b.

📒 Files selected for processing (6)
  • datafusion/functions/src/macros.rs (2 hunks)
  • datafusion/spark/src/function/bitwise/bit_shift.rs (4 hunks)
  • datafusion/spark/src/function/bitwise/mod.rs (1 hunks)
  • datafusion/sqllogictest/test_files/spark/bitwise/shiftright.slt (1 hunks)
  • datafusion/sqllogictest/test_files/spark/bitwise/shiftrightunsigned.slt (1 hunks)
  • datafusion/sqllogictest/test_files/spark/math/shiftleft.slt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.

Applied to files:

  • datafusion/functions/src/macros.rs
  • datafusion/spark/src/function/bitwise/mod.rs
🧬 Code graph analysis (3)
datafusion/functions/src/macros.rs (1)
datafusion/expr/src/udf.rs (2)
  • new (855-865)
  • new_from_impl (110-115)
datafusion/spark/src/function/bitwise/bit_shift.rs (3)
datafusion/common/src/utils/mod.rs (1)
  • take_function_args (926-940)
datafusion/functions-nested/src/utils.rs (3)
  • make_scalar_function (53-83)
  • args (42-42)
  • args (88-91)
datafusion/expr-common/src/signature.rs (1)
  • new_implicit (922-934)
datafusion/spark/src/function/bitwise/mod.rs (1)
datafusion/spark/src/function/bitwise/bit_shift.rs (3)
  • left (257-259)
  • right (261-263)
  • right_unsigned (265-267)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)

Comment on lines +146 to 148
if value_array.data_type().is_null() || shift_array.data_type().is_null() {
return Ok(Arc::new(Int32Array::new_null(value_array.len())));
}


⚠️ Potential issue | 🔴 Critical

Fix null-shift output type to match planner expectations

When the shift column is the NULL type (e.g., SELECT shiftleft(1::bigint, CAST(NULL AS NULL))), this branch returns an Int32Array. However, return_type reports whatever type the value column resolved to (e.g., Int64), so downstream consumers see an Int64 UDF but get an Int32 array. That mismatched dtype will trigger panics in later kernels once the planner trusts the advertised schema.

Return a null array whose datatype mirrors the resolved output instead of hard-coding Int32. Reusing the existing logic for value_array.is_null() keeps the legacy behavior for all-null inputs.

-    if value_array.data_type().is_null() || shift_array.data_type().is_null() {
-        return Ok(Arc::new(Int32Array::new_null(value_array.len())));
-    }
+    if shift_array.data_type().is_null() {
+        let resolved_type = if value_array.data_type().is_null() {
+            DataType::Int32
+        } else {
+            value_array.data_type().clone()
+        };
+        return Ok(arrow::array::new_null_array(&resolved_type, value_array.len()));
+    }
+
+    if value_array.data_type().is_null() {
+        return Ok(arrow::array::new_null_array(&DataType::Int32, value_array.len()));
+    }

Remember to add new_null_array to the arrow::array imports.


Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! It should return the type of the value_array and fallback to Int32Array only when value_array.data_type().is_null(). The finding prevents problems in eventual usage of the returned value in more kernels which expect the correct type.

@martin-augment
Owner Author

2. Potential type mismatch in null handling

In shift_inner() at line 147, when both inputs are null, the function returns Int32Array::new_null() regardless of the input type. This might not match the expected return type for Int64/UInt64 inputs.

if value_array.data_type().is_null() || shift_array.data_type().is_null() {
    return Ok(Arc::new(Int32Array::new_null(value_array.len())));
}

Recommendation: Return a null array of the same type as arg_types[0] from return_type(), or use a type-specific null array constructor.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! It should return the type of the value_array and fallback to Int32Array only when value_array.data_type().is_null(). The finding prevents problems in eventual usage of the returned value in more kernels which expect the correct type.

@martin-augment
Owner Author

3. Unused import: ArrowNativeType

The import arrow::datatypes::ArrowNativeType at line 24 appears to be unused after the refactoring.

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that the import is no longer needed and can be removed! The finding prevents a warning by the linter (clippy).



2 participants