21389: Add arrow_field(expr) scalar UDF#302

Open
martin-augment wants to merge 2 commits into main from pr-21389-2026-04-06-12-40-29

Conversation

@martin-augment
Owner

21389: To review by AI

adriangb and others added 2 commits April 5, 2026 17:05
Adds a new introspection function that returns a struct containing the
complete Arrow Field information for any expression: name, data_type,
nullable, and metadata. This unifies what `arrow_typeof`,
`arrow_metadata`, and `is_nullable` provide individually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Apr 6, 2026

Walkthrough

This pull request introduces a new arrow_field scalar UDF to DataFusion that extracts Arrow field metadata including name, data type, nullability, and metadata entries. The implementation adds a new file arrow_field.rs in the core functions directory, registers the UDF through the module system, and includes a test suite validating behavior across various literal and column types. Additionally, configuration files are updated with new guidelines restricting AI agents from linking to GitHub issues or pull requests during code reviews and from reviewing AI configuration files.


@claude

claude bot commented Apr 6, 2026

test review


@gemini-code-assist bot left a comment


Code Review

This pull request adds the arrow_field scalar function to DataFusion, which returns a struct containing metadata about an expression's Arrow field, including its name, data type, nullability, and metadata. The PR also includes corresponding SQL logic tests and updates AI agent guidelines in .cursor/rules.md and AGENTS.md. Review feedback recommends refining the function's signature for stricter argument validation, optimizing the invoke_with_args implementation by using ScalarValue directly to avoid unnecessary array allocations, and removing duplicate rule entries in AGENTS.md.

Comment on lines +67 to +69
Self {
    signature: Signature::any(1, Volatility::Immutable),
}

high

The signature should be Signature::uniform(1, vec![], Volatility::Immutable) to enforce exactly one argument. Signature::any(1, ...) allows one or more arguments. While the planner might not allow more than one argument due to other constraints, being explicit with uniform is safer and clearer, preventing potential panics in invoke_with_args.

Suggested change
Self {
    signature: Signature::any(1, Volatility::Immutable),
}
Self {
    signature: Signature::uniform(1, vec![], Volatility::Immutable),
}

Owner Author


value:annoying; category:bug; feedback: The Gemini AI reviewer is not correct! Signature::any(1) is the correct way to say that the function takes exactly one parameter of any type. uniform(1, vec![]) means the same thing but is longer to type; uniform() is useful when the argument types should be coerced to one of several allowed types.

Comment on lines +47 to +51
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!


medium

These lines are a duplicate of lines 42-45 and can be removed to avoid redundancy.

Comment on lines +109 to +157
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [_arg] = take_function_args(self.name(), args.args)?;
    let field = &args.arg_fields[0];

    // Build the name array
    let name_array =
        Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;

    // Build the data_type array
    let data_type_str = format!("{}", field.data_type());
    let data_type_array =
        Arc::new(StringArray::from(vec![data_type_str.as_str()])) as Arc<dyn Array>;

    // Build the nullable array
    let nullable_array =
        Arc::new(BooleanArray::from(vec![field.is_nullable()])) as Arc<dyn Array>;

    // Build the metadata map array (same pattern as arrow_metadata.rs)
    let metadata = field.metadata();
    let mut map_builder =
        MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

    let mut entries: Vec<_> = metadata.iter().collect();
    entries.sort_by_key(|(k, _)| *k);

    for (k, v) in entries {
        map_builder.keys().append_value(k);
        map_builder.values().append_value(v);
    }
    map_builder.append(true)?;

    let metadata_array = Arc::new(map_builder.finish()) as Arc<dyn Array>;

    // Build the struct
    let DataType::Struct(fields) = Self::return_struct_type() else {
        unreachable!()
    };

    let struct_array = StructArray::new(
        fields,
        vec![name_array, data_type_array, nullable_array, metadata_array],
        None,
    );

    Ok(ColumnarValue::Scalar(ScalarValue::try_from_array(
        &struct_array,
        0,
    )?))
}

medium

This implementation can be simplified and made more robust.

Instead of building single-element Arrays for each field, then a StructArray, and finally extracting a ScalarValue, you can construct the ScalarValue for each field directly and then create the final ScalarValue::Struct. This is more direct and likely more efficient.

Additionally, the argument handling let [_arg] = take_function_args(...) is fragile. A defensive check on args.args.len() is more robust, especially if the signature is not perfectly specified.

    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        if args.args.len() != 1 {
            return internal_err!(
                "{} expected 1 argument but got {}",
                self.name(),
                args.args.len()
            );
        }
        let field = &args.arg_fields[0];

        let name_scalar = ScalarValue::Utf8(Some(field.name().clone()));
        let data_type_scalar =
            ScalarValue::Utf8(Some(format!("{}", field.data_type())));
        let nullable_scalar = ScalarValue::Boolean(Some(field.is_nullable()));

        // Build the metadata map scalar
        let metadata = field.metadata();
        let mut map_builder =
            MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

        let mut entries: Vec<_> = metadata.iter().collect();
        entries.sort_by_key(|(k, _)| *k);

        for (k, v) in entries {
            map_builder.keys().append_value(k);
            map_builder.values().append_value(v);
        }
        map_builder.append(true)?;

        let metadata_array = Arc::new(map_builder.finish());
        let metadata_scalar = ScalarValue::try_from_array(&metadata_array, 0)?;

        let struct_scalar = ScalarValue::Struct(Arc::new([
            name_scalar,
            data_type_scalar,
            nullable_scalar,
            metadata_scalar,
        ]));

        Ok(ColumnarValue::Scalar(struct_scalar))
    }

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The result could be constructed without intermediate arrays. This would reduce complexity and improve performance by allocating and deallocating less memory.
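Both the original and the suggested implementations sort the metadata entries before appending them to the map builder. That deterministic-ordering step can be illustrated with a std-only sketch (the `sorted_entries` helper and the plain `HashMap<String, String>` standing in for Arrow field metadata are hypothetical, not DataFusion API):

```rust
use std::collections::HashMap;

// Hypothetical helper mirroring the `entries.sort_by_key(|(k, _)| *k)` step:
// HashMap iteration order is arbitrary, so sorting by key makes the
// rendered metadata deterministic across runs.
fn sorted_entries(metadata: &HashMap<String, String>) -> Vec<(String, String)> {
    let mut entries: Vec<(String, String)> = metadata
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    entries.sort_by_key(|(k, _)| k.clone());
    entries
}

fn main() {
    let mut md = HashMap::new();
    md.insert("b_key".to_string(), "2".to_string());
    md.insert("a_key".to_string(), "1".to_string());
    // Regardless of insertion order, keys come out sorted.
    let entries = sorted_entries(&md);
    assert_eq!(entries[0].0, "a_key");
    assert_eq!(entries[1].0, "b_key");
    println!("{:?}", entries);
}
```

Cloning the key inside `sort_by_key` is fine at metadata scale; `sort_by` comparing references would avoid the clones if that ever mattered.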

@claude

claude bot commented Apr 6, 2026

Code Review: arrow_field(expr) Scalar UDF

The overall approach is sound and follows the existing patterns from arrow_typeof and arrow_metadata. Issues to address:

BUG 1 - Duplicated content in AGENTS.md: The same two-rule block is appended twice. One copy should be removed.

BUG 2 - Inaccurate SQL example in user_doc: The doc shows the literal name as 'Int64(1)' but the .slt tests confirm that literals use the name 'lit'. The example should show: {name: lit, data_type: Int64, nullable: false, metadata: {}}.

CODE 3 - Missing Clone derive (arrow_field.rs:57): ArrowFieldFunc derives Debug, PartialEq, Eq, Hash but not Clone. The sibling ArrowMetadataFunc derives Clone. Add Clone for consistency.

CODE 4 - return_struct_type() called twice in invoke_with_args: The function calls Self::return_struct_type() to destructure fields, re-creating the full DataType struct already computed during planning in return_type. Not on a hot path (always returns a Scalar), but unnecessary duplication.

CODE 5 - Minor: format! can be simplified (arrow_field.rs:118): field.data_type().to_string() is cleaner than format!("{}", field.data_type()).

TEST 6 - No test for non-empty metadata: All .slt tests show metadata: {}. No test exercises a column with actual field metadata, which would verify the key/value output and sorted-key ordering logic.

TEST 7 - Consider testing nested/complex types: Only basic scalars and List are tested. A test for Struct or Map types would verify data_type string formatting for complex nested types.

UNRELATED - .cursor/rules.md included accidentally: This Cursor IDE AI config file is unrelated to the arrow_field UDF and should be removed or submitted in a separate PR.


@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
datafusion/functions/src/core/arrow_field.rs (1)

109-112: Use checked extraction for arg_fields instead of direct indexing.

Line 111 currently uses unchecked indexing (args.arg_fields[0]). Prefer checked extraction to keep failures as structured Result errors instead of panics.

🛡️ Suggested change
 fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
-    let [_arg] = take_function_args(self.name(), args.args)?;
-    let field = &args.arg_fields[0];
+    let [_arg] = take_function_args(self.name(), args.args)?;
+    let [field] = take_function_args(self.name(), args.arg_fields)?;
 
     // Build the name array
     let name_array =
         Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;
datafusion/sqllogictest/test_files/arrow_field.slt (1)

54-70: Add a direct assertion for the metadata subfield.

Current tests validate name, data_type, and nullable via struct indexing, but not metadata. A direct arrow_field(... )['metadata'] check would better protect this new API surface.

📌 Suggested test addition
 # arrow_field struct field access - name
 query T
 SELECT arrow_field(1)['name']
 ----
 lit
+
+# arrow_field struct field access - metadata
+query ?
+SELECT arrow_field(1)['metadata']
+----
+{}

Also applies to: 76-101


📥 Commits

Reviewing files that changed from the base of the PR and between c990ba0 and 3a158e3.

📒 Files selected for processing (5)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/functions/src/core/arrow_field.rs
  • datafusion/functions/src/core/mod.rs
  • datafusion/sqllogictest/test_files/arrow_field.slt

Comment on lines +35 to +40
> select arrow_field(1);
+----------------------------------------------+
| arrow_field(Int64(1))                        |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...}      |
+----------------------------------------------+

⚠️ Potential issue | 🟡 Minor

Fix the documented example output for name.

The SQL example currently shows name: Int64(1), but the behavior validated by tests returns name: lit for literals. Please align docs with runtime output.

📝 Suggested doc fix
-| {name: Int64(1), data_type: Int64, ...}      |
+| {name: lit, data_type: Int64, ...}           |

Owner Author


value:useful; category:documentation; feedback: The CodeRabbit AI reviewer is correct! The .slt file confirms that the name of literal columns is lit. The documentation has to be updated to match reality.

@augmentcode

augmentcode bot commented Apr 6, 2026

🤖 Augment PR Summary

Summary: Adds a new arrow_field(expr) scalar UDF that exposes Arrow field-level metadata for an expression as a struct.

Changes:

  • Introduces `ArrowFieldFunc` (`datafusion/functions/src/core/arrow_field.rs`) returning a Struct with name, data_type, nullable, and metadata.
  • Builds the returned value as a scalar StructArray, including a Map for metadata with deterministic key ordering.
  • Registers the new function in the core functions module (`datafusion/functions/src/core/mod.rs`) and exports it through the expr function helpers.
  • Adds a new SQL logic test file (`datafusion/sqllogictest/test_files/arrow_field.slt`) covering literals, arrays, and table columns, plus struct field access.

Technical Notes: The UDF derives output from the planned argument field (arg_fields) rather than runtime values, so the result is constant per expression and returned as a scalar.



@augmentcode bot left a comment


Review completed. 2 suggestions posted.


+----------------------------------------------+
| arrow_field(Int64(1))                        |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...}      |

In datafusion/functions/src/core/arrow_field.rs (line 39), the SQL example output shows {name: Int64(1), ...} but the implementation (and the new SLT) uses the argument field name (e.g. lit for literals). This example looks misleading for users trying to understand what name represents.

Severity: low


Owner Author


value:useful; category:documentation; feedback: The Augment AI reviewer is correct! The .slt file confirms that the name of literal columns is lit. The documentation has to be updated to match reality.


fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [_arg] = take_function_args(self.name(), args.args)?;
    let field = &args.arg_fields[0];

In datafusion/functions/src/core/arrow_field.rs (line 111), args.arg_fields[0] will panic if arg_fields is unexpectedly empty/mismatched with args.args. Since this is a built-in UDF, it may be safer to fail with a Result error rather than potentially aborting execution via panic.

Severity: medium


Owner Author


value:annoying; category:bug; feedback: The Augment AI reviewer is correct! However, the function won't be executed at all if the parameters do not match the expected signature, i.e. this method will not be called unless exactly one argument is passed to the function.

@martin-augment
Owner Author

BUG 2 - Inaccurate SQL example in user_doc: The doc shows the literal name as 'Int64(1)' but the .slt tests confirm that literals use the name 'lit'. The example should show: {name: lit, data_type: Int64, nullable: false, metadata: {}}.

value:useful; category:documentation; feedback: The Claude AI reviewer is correct! The .slt file confirms that the name of literal columns is lit. The documentation has to be updated to match reality.

@martin-augment
Owner Author

CODE 4 - return_struct_type() called twice in invoke_with_args: The function calls Self::return_struct_type() to destructure fields, re-creating the full DataType struct already computed during planning in return_type. Not on a hot path (always returns a Scalar), but unnecessary duplication.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The return struct type is already constructed and available through the arguments passed to the function. There is no need to construct a second instance just to deconstruct it.

@martin-augment
Owner Author

CODE 5 - Minor: format! can be simplified (arrow_field.rs:118): field.data_type().to_string() is cleaner than format!("{}", field.data_type()).

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! Using .to_string() is both shorter to write and read and faster to execute, because the format!() macro carries extra formatting machinery that is not needed in this case.
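The feedback relies on the fact that any type implementing `Display` gets `.to_string()` for free through the blanket `ToString` impl, so `format!("{}", x)` and `x.to_string()` produce identical output. A std-only toy (this `DataType` struct is a hypothetical stand-in, not Arrow's type):

```rust
use std::fmt;

// Hypothetical stand-in for an Arrow DataType; only Display matters here.
struct DataType(&'static str);

impl fmt::Display for DataType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

fn main() {
    let dt = DataType("Int64");
    // The blanket `impl<T: Display> ToString for T` makes these equivalent;
    // `.to_string()` just skips the format-string parsing machinery.
    assert_eq!(dt.to_string(), format!("{}", dt));
    println!("{}", dt.to_string()); // prints "Int64"
}
```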

@martin-augment
Owner Author

TEST 7 - Consider testing nested/complex types: Only basic scalars and List are tested. A test for Struct or Map types would verify data_type string formatting for complex nested types.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! SQL Logic tests for more complex data types like Struct, Map, Dictionary would be useful to confirm that their rendering works as expected.
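A hedged sketch of what such a case might look like in the .slt file (the struct constructor call and especially the expected rendering are assumptions to be confirmed by actually running the test, not verified output):

```sql
# Hypothetical: exercise data_type rendering for a nested Struct type.
# The expected-output line below is a placeholder; run the test to
# capture the real rendering before committing.
query T
SELECT arrow_field(struct(1, 'a'))['data_type']
----
<actual Struct(...) rendering here>
```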
