18649: Refactor Spark bitshift signature#18

Open
martin-augment wants to merge 1 commit into main from pr-18649-2025-11-12-21-32-06

Conversation

@martin-augment
Owner

18649: To review by AI

@coderabbitai

coderabbitai bot commented Nov 12, 2025

Walkthrough

This pull request extends the make_udf_function macro to support an optional custom constructor parameter, allowing callers to supply a custom constructor path instead of always defaulting to <$UDF>::new(). The bitwise shift UDF implementations are consolidated from three separate types (SparkShiftLeft, SparkShiftRight, SparkShiftRightUnsigned) into a unified SparkBitShift implementation with variant constructors (left, right, right_unsigned). The bitwise module is updated to use the new macro overload with the consolidated implementation. Test coverage for bitwise shift operations is expanded to include null handling, boundary conditions, and multiple numeric types.
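The optional-constructor pattern described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual make_udf_function! from datafusion/functions/src/macros.rs (which additionally caches the UDF instance); MyUdf, my_udf, and my_udf_left are illustrative names only.

```rust
// Simplified sketch: a macro arm that accepts an optional constructor
// expression, falling back to <$udf>::new() when none is given.
macro_rules! make_udf_function {
    ($udf:ty, $fn_name:ident) => {
        // Default arm delegates to the custom-constructor arm.
        make_udf_function!($udf, $fn_name, <$udf>::new());
    };
    ($udf:ty, $fn_name:ident, $ctor:expr) => {
        pub fn $fn_name() -> std::sync::Arc<$udf> {
            std::sync::Arc::new($ctor)
        }
    };
}

#[derive(Debug)]
pub struct MyUdf {
    pub variant: &'static str,
}

impl MyUdf {
    pub fn new() -> Self { Self { variant: "default" } }
    pub fn left() -> Self { Self { variant: "left" } }
}

make_udf_function!(MyUdf, my_udf);                     // uses MyUdf::new()
make_udf_function!(MyUdf, my_udf_left, MyUdf::left()); // custom constructor
```

The two-arm shape keeps existing call sites unchanged while letting the bitwise module pass SparkBitShift::left() and friends as the constructor expression.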


@claude

claude bot commented Nov 12, 2025

PR Review: Refactor Spark bitshift signature

Summary

This PR refactors the Spark bitshift functions by consolidating three separate implementations (SparkShiftLeft, SparkShiftRight, SparkShiftRightUnsigned) into a single unified SparkBitShift struct. The refactoring achieves significant code reduction (589 deletions vs 447 additions) while maintaining functionality and improving maintainability.


✅ Code Quality & Best Practices

Excellent refactoring work! The consolidation is well-executed:

  1. DRY Principle: The unified SparkBitShift struct eliminates code duplication across three nearly identical implementations
  2. Clean Architecture: The BitShiftType enum provides a clear, type-safe way to distinguish between shift operations
  3. Factory Methods: The left(), right(), and right_unsigned() constructors provide a clean API
  4. Macro Extension: The addition of a custom constructor parameter to make_udf_function! is a thoughtful enhancement that enables this refactoring
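As a rough sketch of the pattern praised above (field and method names here are assumptions, not the exact bit_shift.rs code), the unified struct with an enum discriminator looks like:

```rust
// One struct with an enum discriminator replacing three near-identical
// UDF types. apply_i32 stands in for the real Arrow-kernel dispatch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BitShiftType {
    Left,
    Right,
    RightUnsigned,
}

#[derive(Debug)]
struct SparkBitShift {
    shift_type: BitShiftType,
}

impl SparkBitShift {
    fn left() -> Self { Self { shift_type: BitShiftType::Left } }
    fn right() -> Self { Self { shift_type: BitShiftType::Right } }
    fn right_unsigned() -> Self { Self { shift_type: BitShiftType::RightUnsigned } }

    // Single dispatch point: one enum match instead of three implementations.
    fn apply_i32(&self, value: i32, shift: u32) -> i32 {
        match self.shift_type {
            BitShiftType::Left => value.wrapping_shl(shift),
            // Arithmetic shift preserves the sign bit.
            BitShiftType::Right => value.wrapping_shr(shift),
            // Logical shift fills with zeros, as shiftrightunsigned requires.
            BitShiftType::RightUnsigned => ((value as u32).wrapping_shr(shift)) as i32,
        }
    }
}
```

The enum match is resolved once per kernel invocation, which is why the consolidation costs essentially nothing at runtime.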

Code organization is logical:

  • Shift functions (shift_left, shift_right, shift_right_unsigned) remain separate and focused
  • shift_inner() provides a unified dispatch mechanism
  • Type coercion is handled elegantly through the new Signature approach

🐛 Potential Issues

1. Missing UShr implementations for smaller integer types

The UShr trait only implements for i32, i64, u32, and u64. However, the signature allows Int8, Int16, UInt8, and UInt16 to be coerced to Int32/UInt32. While the coercion handles this at the type level, the trait bounds in shift_inner() should be verified to work correctly with the coerced types.

Location: bit_shift.rs:88-114
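For reference, an unsigned-right-shift helper trait of the kind mentioned here could look like the following sketch; the real UShr in bit_shift.rs may differ in shape. Since Int8/Int16 (and UInt8/UInt16) arguments are coerced to 32-bit types before execution, impls for the four wider types suffice:

```rust
// Hypothetical sketch of an unsigned (logical) right-shift trait.
// Casting through the unsigned twin forces zero-fill instead of sign-extension.
trait UShr {
    fn ushr(self, shift: u32) -> Self;
}

impl UShr for i32 {
    fn ushr(self, shift: u32) -> Self {
        ((self as u32) >> shift) as i32
    }
}

impl UShr for i64 {
    fn ushr(self, shift: u32) -> Self {
        ((self as u64) >> shift) as i64
    }
}

// For unsigned types, >> is already a logical shift.
impl UShr for u32 {
    fn ushr(self, shift: u32) -> Self { self >> shift }
}

impl UShr for u64 {
    fn ushr(self, shift: u32) -> Self { self >> shift }
}
```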

2. Potential type mismatch in null handling

In shift_inner() at line 147, when both inputs are null, the function returns Int32Array::new_null() regardless of the input type. This might not match the expected return type for Int64/UInt64 inputs.

    if value_array.data_type().is_null() || shift_array.data_type().is_null() {
        return Ok(Arc::new(Int32Array::new_null(value_array.len())));
    }

Recommendation: Return a null array of the same type as arg_types[0] from return_type(), or use a type-specific null array constructor.

3. Unused import: ArrowNativeType

The import arrow::datatypes::ArrowNativeType at line 24 appears to be unused after the refactoring.


⚡ Performance Considerations

Positive:

  • No performance regressions expected - the core shift logic remains identical
  • The unified dispatch through shift_inner() adds minimal overhead (single enum match)
  • The use of compute::binary() for vectorized operations is appropriate

Note:

  • The shift normalization ((shift % bit_num) + bit_num) % bit_num is computed for every element. This is correct for Spark/Java semantics but does involve two modulo operations per element. This is acceptable given the requirement to match Spark behavior.
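The normalization can be seen in isolation with plain integers (a standalone sketch; bit_num would be 32 or 64 depending on the value type):

```rust
// pmod-style normalization: maps any shift amount into [0, bit_num),
// matching Java/Spark semantics for negative and out-of-range shifts.
fn normalize_shift(shift: i32, bit_num: i32) -> i32 {
    ((shift % bit_num) + bit_num) % bit_num
}
```

For example, normalize_shift(33, 32) yields 1 and normalize_shift(-1, 32) yields 31, so a 33-bit left shift of a 32-bit value behaves like a 1-bit shift, exactly as Java's << does with its low-five-bits rule.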

🔒 Security Concerns

No security issues identified. The code properly handles:

  • Integer overflow through shift normalization
  • Null values
  • Edge cases (negative shifts, large shifts)

The bit manipulation operations are safe and follow established patterns.


🧪 Test Coverage

Excellent test coverage! The PR adds comprehensive test files:

  1. shiftleft.slt (84 lines): Tests basic operations, data types, large shifts, negative shifts, null handling, and edge cases
  2. shiftright.slt (99 lines): Similar comprehensive coverage including sign preservation for signed integers
  3. shiftrightunsigned.slt (87 lines): Tests unsigned shift behavior including negative value handling

Test scenarios covered:

  • ✅ Basic shift operations
  • ✅ Different integer types (Int32, Int64, UInt32, UInt64)
  • ✅ Type coercion between different integer sizes
  • ✅ Large shifts (> bit width)
  • ✅ Negative shifts
  • ✅ Zero shifts
  • ✅ Null handling
  • ✅ Edge cases (max values, negative values)
  • ✅ Sign preservation (signed vs unsigned shifts)

Minor observation: shiftleft.slt would logically live alongside the other bitwise shift tests in spark/bitwise/, but the file list shows it under spark/math/ - this might be intentional or worth verifying.


📝 Suggestions

  1. Fix the null handling type issue (see Potential Issue 2 above) - ensure the return type matches arg_types[0]

  2. Remove unused import:

    use arrow::datatypes::{
    -    ArrowNativeType, DataType, Int32Type, Int64Type, UInt32Type, UInt64Type,
    +    DataType, Int32Type, Int64Type, UInt32Type, UInt64Type,
    };
  3. Consider adding a comment explaining the shift normalization formula for future maintainers:

    // Normalize shift amount using pmod: ((shift % bit_num) + bit_num) % bit_num
    // This matches Spark/Java semantics for negative and out-of-range shifts
    let shift = ((shift % bit_num) + bit_num) % bit_num;
  4. Documentation: The removal of function-level docs from the individual shift functions is fine since they're now private, but consider ensuring the public API (via mod.rs export_functions!) has clear documentation - which it does! ✅


🎯 Conclusion

This is a high-quality refactoring that improves code maintainability without sacrificing functionality or performance. The consolidation of three separate UDF implementations into one is a significant improvement.

Recommendation: Approve with minor fixes

The only actionable item is addressing the null handling type consistency issue. The other suggestions are minor improvements that can be addressed at the maintainer's discretion.

Great work on this refactoring! The approach of using a unified struct with an enum discriminator is a clean pattern that could be applied to similar function families in the codebase.


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestions posted.


    &arg_types[1],
    ));
    if value_array.data_type().is_null() || shift_array.data_type().is_null() {
        return Ok(Arc::new(Int32Array::new_null(value_array.len())));


Null handling here can return an Int32Array even when the first (value) argument is non-null and e.g. Int64/UInt64. Per return_type, the result type should follow the first arg; only when the first arg is Null should Int32 be produced. (Also applies if shift is Null but value is not.)


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! It should return the type of the value_array and fallback to Int32Array only when value_array.data_type().is_null(). The finding prevents problems in eventual usage of the returned value in more kernels which expect the correct type.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6b7e1 and a2f260b.

📒 Files selected for processing (6)
  • datafusion/functions/src/macros.rs (2 hunks)
  • datafusion/spark/src/function/bitwise/bit_shift.rs (4 hunks)
  • datafusion/spark/src/function/bitwise/mod.rs (1 hunks)
  • datafusion/sqllogictest/test_files/spark/bitwise/shiftright.slt (1 hunks)
  • datafusion/sqllogictest/test_files/spark/bitwise/shiftrightunsigned.slt (1 hunks)
  • datafusion/sqllogictest/test_files/spark/math/shiftleft.slt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.

Applied to files:

  • datafusion/functions/src/macros.rs
  • datafusion/spark/src/function/bitwise/mod.rs
🧬 Code graph analysis (3)
datafusion/functions/src/macros.rs (1)
datafusion/expr/src/udf.rs (2)
  • new (855-865)
  • new_from_impl (110-115)
datafusion/spark/src/function/bitwise/bit_shift.rs (3)
datafusion/common/src/utils/mod.rs (1)
  • take_function_args (926-940)
datafusion/functions-nested/src/utils.rs (3)
  • make_scalar_function (53-83)
  • args (42-42)
  • args (88-91)
datafusion/expr-common/src/signature.rs (1)
  • new_implicit (922-934)
datafusion/spark/src/function/bitwise/mod.rs (1)
datafusion/spark/src/function/bitwise/bit_shift.rs (3)
  • left (257-259)
  • right (261-263)
  • right_unsigned (265-267)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)

Comment on lines +146 to 148
if value_array.data_type().is_null() || shift_array.data_type().is_null() {
return Ok(Arc::new(Int32Array::new_null(value_array.len())));
}


⚠️ Potential issue | 🔴 Critical

Fix null-shift output type to match planner expectations

When the shift column is the NULL type (e.g., SELECT shiftleft(1::bigint, CAST(NULL AS NULL))), this branch returns an Int32Array. However, return_type reports whatever type the value column resolved to (e.g., Int64), so downstream consumers see an Int64 UDF but get an Int32 array. That mismatched dtype will trigger panics in later kernels once the planner trusts the advertised schema.

Return a null array whose datatype mirrors the resolved output instead of hard-coding Int32. Reusing the existing logic for value_array.is_null() keeps the legacy behavior for all-null inputs.

-    if value_array.data_type().is_null() || shift_array.data_type().is_null() {
-        return Ok(Arc::new(Int32Array::new_null(value_array.len())));
-    }
+    if shift_array.data_type().is_null() {
+        let resolved_type = if value_array.data_type().is_null() {
+            DataType::Int32
+        } else {
+            value_array.data_type().clone()
+        };
+        return Ok(arrow::array::new_null_array(&resolved_type, value_array.len()));
+    }
+
+    if value_array.data_type().is_null() {
+        return Ok(arrow::array::new_null_array(&DataType::Int32, value_array.len()));
+    }

Remember to add new_null_array to the arrow::array imports.


Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! It should return the type of the value_array and fallback to Int32Array only when value_array.data_type().is_null(). The finding prevents problems in eventual usage of the returned value in more kernels which expect the correct type.

@martin-augment
Owner Author

2. Potential type mismatch in null handling

In shift_inner() at line 147, when both inputs are null, the function returns Int32Array::new_null() regardless of the input type. This might not match the expected return type for Int64/UInt64 inputs.

if value_array.data_type().is_null() || shift_array.data_type().is_null() {
    return Ok(Arc::new(Int32Array::new_null(value_array.len())));
}

Recommendation: Return a null array of the same type as arg_types[0] from return_type(), or use a type-specific null array constructor.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! It should return the type of the value_array and fallback to Int32Array only when value_array.data_type().is_null(). The finding prevents problems in eventual usage of the returned value in more kernels which expect the correct type.

@martin-augment
Owner Author

3. Unused import: ArrowNativeType

The import arrow::datatypes::ArrowNativeType at line 24 appears to be unused after the refactoring.

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that the import is no longer needed and can be removed! The finding prevents a warning by the linter (clippy).



2 participants