21389: Add arrow_field(expr) scalar UDF#302

Open
martin-augment wants to merge 2 commits into main from pr-21389-2026-04-06-12-40-29

Conversation

@martin-augment
Owner

21389: To review by AI

adriangb and others added 2 commits April 5, 2026 17:05
Adds a new introspection function that returns a struct containing the
complete Arrow Field information for any expression: name, data_type,
nullable, and metadata. This unifies what `arrow_typeof`,
`arrow_metadata`, and `is_nullable` provide individually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Apr 6, 2026

Walkthrough

This pull request introduces a new arrow_field scalar UDF to DataFusion that extracts Arrow field metadata including name, data type, nullability, and metadata entries. The implementation adds a new file arrow_field.rs in the core functions directory, registers the UDF through the module system, and includes a test suite validating behavior across various literal and column types. Additionally, configuration files are updated with new guidelines restricting AI agents from linking to GitHub issues or pull requests during code reviews and from reviewing AI configuration files.


@claude

claude bot commented Apr 6, 2026

test review


@gemini-code-assist bot left a comment


Code Review

This pull request adds the arrow_field scalar function to DataFusion, which returns a struct containing metadata about an expression's Arrow field, including its name, data type, nullability, and metadata. The PR also includes corresponding SQL logic tests and updates AI agent guidelines in .cursor/rules.md and AGENTS.md. Review feedback recommends refining the function's signature for stricter argument validation, optimizing the invoke_with_args implementation by using ScalarValue directly to avoid unnecessary array allocations, and removing duplicate rule entries in AGENTS.md.

Comment on lines +67 to +69
Self {
    signature: Signature::any(1, Volatility::Immutable),
}

high

The signature should be Signature::uniform(1, vec![], Volatility::Immutable) to enforce exactly one argument. Signature::any(1, ...) allows one or more arguments. While the planner might not allow more than one argument due to other constraints, being explicit with uniform is safer and clearer, preventing potential panics in invoke_with_args.

Suggested change
Self {
    signature: Signature::any(1, Volatility::Immutable),
}
Self {
    signature: Signature::uniform(1, vec![], Volatility::Immutable),
}

Owner Author


value:annoying; category:bug; feedback: The Gemini AI reviewer is not correct! Signature::any(1) is the correct way to say that the function takes exactly one parameter of any type. uniform(1, vec![]) means the same thing but is longer to type; uniform() is useful when the argument types should be coerced to one of several allowed types.

Comment on lines +47 to +51
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!


medium

These lines are a duplicate of lines 42-45 and can be removed to avoid redundancy.

Comment on lines +109 to +157
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [_arg] = take_function_args(self.name(), args.args)?;
    let field = &args.arg_fields[0];

    // Build the name array
    let name_array =
        Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;

    // Build the data_type array
    let data_type_str = format!("{}", field.data_type());
    let data_type_array =
        Arc::new(StringArray::from(vec![data_type_str.as_str()])) as Arc<dyn Array>;

    // Build the nullable array
    let nullable_array =
        Arc::new(BooleanArray::from(vec![field.is_nullable()])) as Arc<dyn Array>;

    // Build the metadata map array (same pattern as arrow_metadata.rs)
    let metadata = field.metadata();
    let mut map_builder =
        MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

    let mut entries: Vec<_> = metadata.iter().collect();
    entries.sort_by_key(|(k, _)| *k);

    for (k, v) in entries {
        map_builder.keys().append_value(k);
        map_builder.values().append_value(v);
    }
    map_builder.append(true)?;

    let metadata_array = Arc::new(map_builder.finish()) as Arc<dyn Array>;

    // Build the struct
    let DataType::Struct(fields) = Self::return_struct_type() else {
        unreachable!()
    };

    let struct_array = StructArray::new(
        fields,
        vec![name_array, data_type_array, nullable_array, metadata_array],
        None,
    );

    Ok(ColumnarValue::Scalar(ScalarValue::try_from_array(
        &struct_array,
        0,
    )?))
}

medium

This implementation can be simplified and made more robust.

Instead of building single-element Arrays for each field, then a StructArray, and finally extracting a ScalarValue, you can construct the ScalarValue for each field directly and then create the final ScalarValue::Struct. This is more direct and likely more efficient.

Additionally, the argument handling let [_arg] = take_function_args(...) is fragile. A defensive check on args.args.len() is more robust, especially if the signature is not perfectly specified.

    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        if args.args.len() != 1 {
            return internal_err!(
                "{} expected 1 argument but got {}",
                self.name(),
                args.args.len()
            );
        }
        let field = &args.arg_fields[0];

        let name_scalar = ScalarValue::Utf8(Some(field.name().clone()));
        let data_type_scalar =
            ScalarValue::Utf8(Some(format!("{}", field.data_type())));
        let nullable_scalar = ScalarValue::Boolean(Some(field.is_nullable()));

        // Build the metadata map scalar
        let metadata = field.metadata();
        let mut map_builder =
            MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

        let mut entries: Vec<_> = metadata.iter().collect();
        entries.sort_by_key(|(k, _)| *k);

        for (k, v) in entries {
            map_builder.keys().append_value(k);
            map_builder.values().append_value(v);
        }
        map_builder.append(true)?;

        let metadata_array = Arc::new(map_builder.finish());
        let metadata_scalar = ScalarValue::try_from_array(&metadata_array, 0)?;

        let struct_scalar = ScalarValue::Struct(Arc::new([
            name_scalar,
            data_type_scalar,
            nullable_scalar,
            metadata_scalar,
        ]));

        Ok(ColumnarValue::Scalar(struct_scalar))
    }

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The result could be constructed without intermediate arrays. This would reduce complexity and improve performance by allocating and deallocating less memory.
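Both the original and the suggested implementations sort the metadata entries before appending them to the map builder. That deterministic-ordering step can be illustrated with a std-only sketch (the `sorted_entries` helper and the plain `HashMap<String, String>` standing in for Arrow field metadata are hypothetical, not DataFusion API):

```rust
use std::collections::HashMap;

// Hypothetical helper mirroring the `entries.sort_by_key(|(k, _)| *k)` step:
// HashMap iteration order is arbitrary, so sorting by key makes the
// rendered metadata deterministic across runs.
fn sorted_entries(metadata: &HashMap<String, String>) -> Vec<(String, String)> {
    let mut entries: Vec<(String, String)> = metadata
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    entries.sort_by_key(|(k, _)| k.clone());
    entries
}

fn main() {
    let mut md = HashMap::new();
    md.insert("b_key".to_string(), "2".to_string());
    md.insert("a_key".to_string(), "1".to_string());
    // Regardless of insertion order, keys come out sorted.
    let entries = sorted_entries(&md);
    assert_eq!(entries[0].0, "a_key");
    assert_eq!(entries[1].0, "b_key");
    println!("{:?}", entries);
}
```

Cloning the key inside `sort_by_key` is fine at metadata scale; `sort_by` comparing references would avoid the clones if that ever mattered.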

@claude

claude bot commented Apr 6, 2026

Code Review: arrow_field(expr) Scalar UDF

The overall approach is sound and follows the existing patterns from arrow_typeof and arrow_metadata. Issues to address:

BUG 1 - Duplicated content in AGENTS.md: The same two-rule block is appended twice. One copy should be removed.

BUG 2 - Inaccurate SQL example in user_doc: The doc shows the literal name as 'Int64(1)' but the .slt tests confirm that literals use the name 'lit'. The example should show: {name: lit, data_type: Int64, nullable: false, metadata: {}}.

CODE 3 - Missing Clone derive (arrow_field.rs:57): ArrowFieldFunc derives Debug, PartialEq, Eq, Hash but not Clone. The sibling ArrowMetadataFunc derives Clone. Add Clone for consistency.

CODE 4 - return_struct_type() called twice in invoke_with_args: The function calls Self::return_struct_type() to destructure fields, re-creating the full DataType struct already computed during planning in return_type. Not on a hot path (always returns a Scalar), but unnecessary duplication.

CODE 5 - Minor: format! can be simplified (arrow_field.rs:118): field.data_type().to_string() is cleaner than format!("{}", field.data_type()).

TEST 6 - No test for non-empty metadata: All .slt tests show metadata: {}. No test exercises a column with actual field metadata, which would verify the key/value output and sorted-key ordering logic.

TEST 7 - Consider testing nested/complex types: Only basic scalars and List are tested. A test for Struct or Map types would verify data_type string formatting for complex nested types.

UNRELATED - .cursor/rules.md included accidentally: This Cursor IDE AI config file is unrelated to the arrow_field UDF and should be removed or submitted in a separate PR.


@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
datafusion/functions/src/core/arrow_field.rs (1)

109-112: Use checked extraction for arg_fields instead of direct indexing.

Line 111 currently uses unchecked indexing (args.arg_fields[0]). Prefer checked extraction to keep failures as structured Result errors instead of panics.

🛡️ Suggested change
 fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
-    let [_arg] = take_function_args(self.name(), args.args)?;
-    let field = &args.arg_fields[0];
+    let [_arg] = take_function_args(self.name(), args.args)?;
+    let [field] = take_function_args(self.name(), args.arg_fields)?;
 
     // Build the name array
     let name_array =
         Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;
datafusion/sqllogictest/test_files/arrow_field.slt (1)

54-70: Add a direct assertion for the metadata subfield.

Current tests validate name, data_type, and nullable via struct indexing, but not metadata. A direct arrow_field(... )['metadata'] check would better protect this new API surface.

📌 Suggested test addition
 # arrow_field struct field access - name
 query T
 SELECT arrow_field(1)['name']
 ----
 lit
+
+# arrow_field struct field access - metadata
+query ?
+SELECT arrow_field(1)['metadata']
+----
+{}

Also applies to: 76-101


📥 Commits

Reviewing files that changed from the base of the PR and between c990ba0 and 3a158e3.

📒 Files selected for processing (5)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/functions/src/core/arrow_field.rs
  • datafusion/functions/src/core/mod.rs
  • datafusion/sqllogictest/test_files/arrow_field.slt

Comment on lines +35 to +40
> select arrow_field(1);
+----------------------------------------------+
| arrow_field(Int64(1))                        |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...}      |
+----------------------------------------------+

⚠️ Potential issue | 🟡 Minor

Fix the documented example output for name.

The SQL example currently shows name: Int64(1), but the behavior validated by tests returns name: lit for literals. Please align docs with runtime output.

📝 Suggested doc fix
-| {name: Int64(1), data_type: Int64, ...}      |
+| {name: lit, data_type: Int64, ...}           |

Owner Author


value:useful; category:documentation; feedback: The CodeRabbit AI reviewer is correct! The .slt file confirms that the name of literal columns is lit. The documentation has to be updated to match reality.

@augmentcode

augmentcode bot commented Apr 6, 2026

🤖 Augment PR Summary

Summary: Adds a new arrow_field(expr) scalar UDF that exposes Arrow field-level metadata for an expression as a struct.

Changes:

  • Introduces `ArrowFieldFunc` (`datafusion/functions/src/core/arrow_field.rs`) returning a Struct with name, data_type, nullable, and metadata.
  • Builds the returned value as a scalar StructArray, including a Map for metadata with deterministic key ordering.
  • Registers the new function in the core functions module (`datafusion/functions/src/core/mod.rs`) and exports it through the expr function helpers.
  • Adds a new SQL logic test file (`datafusion/sqllogictest/test_files/arrow_field.slt`) covering literals, arrays, and table columns, plus struct field access.

Technical Notes: The UDF derives output from the planned argument field (arg_fields) rather than runtime values, so the result is constant per expression and returned as a scalar.



@augmentcode bot left a comment


Review completed. 2 suggestions posted.


+----------------------------------------------+
| arrow_field(Int64(1))                        |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...}      |

In datafusion/functions/src/core/arrow_field.rs (line 39), the SQL example output shows {name: Int64(1), ...} but the implementation (and the new SLT) uses the argument field name (e.g. lit for literals). This example looks misleading for users trying to understand what name represents.

Severity: low


Owner Author


value:useful; category:documentation; feedback: The Augment AI reviewer is correct! The .slt file confirms that the name of literal columns is lit. The documentation has to be updated to match reality.


fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [_arg] = take_function_args(self.name(), args.args)?;
    let field = &args.arg_fields[0];

In datafusion/functions/src/core/arrow_field.rs (line 111), args.arg_fields[0] will panic if arg_fields is unexpectedly empty/mismatched with args.args. Since this is a built-in UDF, it may be safer to fail with a Result error rather than potentially aborting execution via panic.

Severity: medium


Owner Author


value:annoying; category:bug; feedback: The Augment AI reviewer is correct! However, the function won't be executed at all if the parameters do not match the expected signature, i.e. this method will not be called unless exactly one argument is passed to the function.

@martin-augment
Owner Author

BUG 2 - Inaccurate SQL example in user_doc: The doc shows the literal name as 'Int64(1)' but the .slt tests confirm that literals use the name 'lit'. The example should show: {name: lit, data_type: Int64, nullable: false, metadata: {}}.

value:useful; category:documentation; feedback: The Claude AI reviewer is correct! The .slt file confirms that the name of literal columns is lit. The documentation has to be updated to match reality.

@martin-augment
Owner Author

CODE 4 - return_struct_type() called twice in invoke_with_args: The function calls Self::return_struct_type() to destructure fields, re-creating the full DataType struct already computed during planning in return_type. Not on a hot path (always returns a Scalar), but unnecessary duplication.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The return struct type is already constructed and available through the arguments passed to the function. There is no need to construct a second instance just to deconstruct it.

@martin-augment
Owner Author

CODE 5 - Minor: format! can be simplified (arrow_field.rs:118): field.data_type().to_string() is cleaner than format!("{}", field.data_type()).

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! Using .to_string() is both shorter to write and read and faster to execute, because the format!() macro carries extra formatting machinery that is not needed in this case.
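The feedback relies on the fact that any type implementing `Display` gets `.to_string()` for free through the blanket `ToString` impl, so `format!("{}", x)` and `x.to_string()` produce identical output. A std-only toy (this `DataType` struct is a hypothetical stand-in, not Arrow's type):

```rust
use std::fmt;

// Hypothetical stand-in for an Arrow DataType; only Display matters here.
struct DataType(&'static str);

impl fmt::Display for DataType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

fn main() {
    let dt = DataType("Int64");
    // The blanket `impl<T: Display> ToString for T` makes these equivalent;
    // `.to_string()` just skips the format-string parsing machinery.
    assert_eq!(dt.to_string(), format!("{}", dt));
    println!("{}", dt.to_string()); // prints "Int64"
}
```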

@martin-augment
Owner Author

TEST 7 - Consider testing nested/complex types: Only basic scalars and List are tested. A test for Struct or Map types would verify data_type string formatting for complex nested types.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! SQL Logic tests for more complex data types like Struct, Map, Dictionary would be useful to confirm that their rendering works as expected.
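A hedged sketch of what such a case might look like in the .slt file (the struct constructor call and especially the expected rendering are assumptions to be confirmed by actually running the test, not verified output):

```sql
# Hypothetical: exercise data_type rendering for a nested Struct type.
# The expected-output line below is a placeholder; run the test to
# capture the real rendering before committing.
query T
SELECT arrow_field(struct(1, 'a'))['data_type']
----
<actual Struct(...) rendering here>
```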
