Skip to content

21389: Add arrow_field(expr) scalar UDF#306

Open
martin-augment wants to merge 6 commits intomainfrom
pr-21389-2026-04-13-06-47-28
Open

21389: Add arrow_field(expr) scalar UDF#306
martin-augment wants to merge 6 commits intomainfrom
pr-21389-2026-04-13-06-47-28

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

21389: To review by AI

adriangb and others added 6 commits April 5, 2026 17:05
Adds a new introspection function that returns a struct containing the
complete Arrow Field information for any expression: name, data_type,
nullable, and metadata. This unifies what `arrow_typeof`,
`arrow_metadata`, and `is_nullable` provide individually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 24451ba6-3e79-4402-b26d-11c409450221

📥 Commits

Reviewing files that changed from the base of the PR and between c68fbab and acb2842.

📒 Files selected for processing (6)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/functions/src/core/arrow_field.rs
  • datafusion/functions/src/core/mod.rs
  • datafusion/sqllogictest/test_files/arrow_field.slt
  • docs/source/user-guide/sql/scalar_functions.md

Walkthrough

This pull request adds a new DataFusion scalar UDF named arrow_field that returns Arrow field metadata as a struct containing name, data type, nullability, and metadata information. The UDF is implemented in a new file datafusion/functions/src/core/arrow_field.rs, registered in the module exports, and includes comprehensive test cases and user documentation. Additionally, AI configuration files (.cursor/rules.md and AGENTS.md) were updated with new guidelines restricting AI agents from linking GitHub issues or pull requests during code reviews and from reviewing other AI configuration files.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pr-21389-2026-04-13-06-47-28

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Code Review: arrow_field(expr) scalar UDF

Overall this is a well-structured addition that follows existing conventions in the codebase. Here are my observations:

Potential Bug: Map field name mismatch

In arrow_field.rs, return_struct_type() declares the metadata map inner struct fields as "keys" and "values" (plural):

Field::new("keys", DataType::Utf8, false),
Field::new("values", DataType::Utf8, true),

But MapBuilder::new(None, StringBuilder::new(), StringBuilder::new()) produces field names "key" and "value" (singular) by default. This means the declared return type does not match the actual runtime type of metadata_array, resulting in an internally inconsistent ScalarValue.

This is copied from the same pattern in arrow_metadata.rs (lines 106-109), so it is a pre-existing inconsistency, but worth flagging here. The discrepancy likely does not surface in current tests because all test cases have empty metadata. A test with actual field metadata would expose whether this causes downstream issues. Consider either using MapFieldNames to configure the builder with matching names, or aligning return_struct_type() to use "key"/"value" to match Arrow's defaults.

Missing test: non-empty metadata

All test cases in arrow_field.slt show metadata: {}. Since metadata handling is a stated feature of this function, it would be good to add at least one test with a column that carries actual metadata.

Minor: invoke_with_args silently ignores argument values

The implementation reads args.arg_fields (schema-level info) and never touches args.args (actual data). This is the correct behavior since Arrow field info is schema-level and does not vary per row, but a brief comment would help future maintainers understand this is intentional:

// This function returns schema-level metadata about the argument's Arrow field.
// The actual runtime values in args.args are intentionally not used.
let [field] = take_function_args(self.name(), args.arg_fields)?;

Minor: documentation example shows truncated output

The doc comment and scalar_functions.md both show {name: lit, data_type: Int64, ...} with an ellipsis, which does not reflect the actual output. The .slt tests show the full format ({name: lit, data_type: Int64, nullable: false, metadata: {}}). Showing the real output in examples is clearer.

Positive notes

  • Clean implementation following the arrow_metadata.rs pattern
  • Correct use of Volatility::Immutable (output depends only on schema, not data)
  • Good breadth of type coverage in the .slt tests (Int, Bool, String, Float, List, Map, Struct, Dictionary, nullable vs non-nullable columns)
  • Proper registration in mod.rs across all necessary locations (module declaration, make_udf_function!, expr_fn, and functions() list)
  • Sorted metadata output ensures deterministic test results

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Apr 13, 2026

🤖 Augment PR Summary

Summary: Adds a new scalar UDF, arrow_field(expr), to introspect Arrow field properties of an expression.

Changes:

  • Introduced ArrowFieldFunc (new file datafusion/functions/src/core/arrow_field.rs) that returns a Struct with name, data_type, nullable, and metadata.
  • Implemented deterministic metadata output by sorting metadata entries before building the returned Map.
  • Registered the new function in the core functions module so it is available in the built-in function registry and expression helpers.
  • Added sqllogictest coverage for literals and column inputs, including struct/map/list/dictionary examples and struct-field access via ['field'].
  • Documented arrow_field in the SQL scalar functions user guide and added it to the “Other Functions” index.

Technical Notes: The function returns a scalar Struct value (constant per expression schema) and represents the Arrow data type as its Display string.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the arrow_field scalar function, which provides detailed Arrow field information for expressions, including name, data type, nullability, and metadata. The PR includes comprehensive documentation and SQL logic tests. A critical bug was identified in the implementation of the metadata map field names, which must be singular ('key' and 'value') to avoid a runtime panic due to type mismatch with the MapBuilder output.

Comment on lines +83 to +84
Field::new("keys", DataType::Utf8, false),
Field::new("values", DataType::Utf8, true),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The field names for the internal struct of an Arrow Map type should be key and value (singular) to be consistent with the standard Arrow Map representation and the output of MapBuilder. Using keys and values (plural) will cause a data type mismatch and a panic when StructArray::new is called in invoke_with_args (line 147), as the MapBuilder used there produces a MapArray with the standard singular field names.

Suggested change
Field::new("keys", DataType::Utf8, false),
Field::new("values", DataType::Utf8, true),
Field::new("key", DataType::Utf8, false),
Field::new("value", DataType::Utf8, true),

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback: The Gemini AI reviewer is not correct! The arrow-rs MapBuilder actually uses "keys" and "values" by default - https://github.com/apache/arrow-rs/blob/68851ef953fd771cc310203c446e54145d4407e1/arrow-array/src/builder/map_builder.rs#L83-L84

@martin-augment
Copy link
Copy Markdown
Owner Author

Potential Bug: Map field name mismatch

In arrow_field.rs, return_struct_type() declares the metadata map inner struct fields as "keys" and "values" (plural):

Field::new("keys", DataType::Utf8, false),
Field::new("values", DataType::Utf8, true),

But MapBuilder::new(None, StringBuilder::new(), StringBuilder::new()) produces field names "key" and "value" (singular) by default. This means the declared return type does not match the actual runtime type of metadata_array, resulting in an internally inconsistent ScalarValue.

This is copied from the same pattern in arrow_metadata.rs (lines 106-109), so it is a pre-existing inconsistency, but worth flagging here. The discrepancy likely does not surface in current tests because all test cases have empty metadata. A test with actual field metadata would expose whether this causes downstream issues. Consider either using MapFieldNames to configure the builder with matching names, or aligning return_struct_type() to use "key"/"value" to match Arrow's defaults.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The arrow-rs MapBuilder actually uses "keys" and "values" by default - https://github.com/apache/arrow-rs/blob/68851ef953fd771cc310203c446e54145d4407e1/arrow-array/src/builder/map_builder.rs#L83-L84

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants