Conversation
Adds a new introspection function that returns a struct containing the complete Arrow Field information for any expression: name, data_type, nullable, and metadata. This unifies what `arrow_typeof`, `arrow_metadata`, and `is_nullable` provide individually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test review
Code Review
This pull request adds the arrow_field scalar function to DataFusion, which returns a struct containing metadata about an expression's Arrow field, including its name, data type, nullability, and metadata. The PR also includes corresponding SQL logic tests and updates AI agent guidelines in .cursor/rules.md and AGENTS.md. Review feedback recommends refining the function's signature for stricter argument validation, optimizing the invoke_with_args implementation by using ScalarValue directly to avoid unnecessary array allocations, and removing duplicate rule entries in AGENTS.md.
```rust
Self {
    signature: Signature::any(1, Volatility::Immutable),
}
```
The signature should be Signature::uniform(1, vec![], Volatility::Immutable) to enforce exactly one argument. Signature::any(1, ...) allows one or more arguments. While the planner might not allow more than one argument due to other constraints, being explicit with uniform is safer and clearer, preventing potential panics in invoke_with_args.
```diff
 Self {
-    signature: Signature::any(1, Volatility::Immutable),
+    signature: Signature::uniform(1, vec![], Volatility::Immutable),
 }
```
value:annoying; category:bug; feedback: The Gemini AI reviewer is not correct! `Signature::any(1)` is the correct way to say that the function receives exactly one parameter of any type. `uniform(1, vec![])` means the same but is longer to type; `uniform()` is useful when the argument types should be constrained to a specific set of types.
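The arity semantics being debated here can be illustrated without the DataFusion crate at all. The sketch below is a toy model, not the real `Signature` API; it encodes the feedback's reading (both forms accept exactly `arity` arguments, and an empty `valid_types` list means "any type"):

```rust
// Toy model of the two signature variants discussed above. This is not the
// DataFusion API, just an illustration of the arity semantics under the
// assumption (per the feedback) that both forms require exactly `arity`
// arguments, and that an empty `valid_types` list means "any type".
#[derive(Debug)]
enum Sig {
    Any { arity: usize },
    Uniform { arity: usize, valid_types: Vec<String> },
}

impl Sig {
    fn accepts(&self, arg_types: &[String]) -> bool {
        match self {
            // Exactly `arity` arguments, each of any type.
            Sig::Any { arity } => arg_types.len() == *arity,
            // Exactly `arity` arguments, each drawn from `valid_types`
            // (or any type when the list is empty).
            Sig::Uniform { arity, valid_types } => {
                arg_types.len() == *arity
                    && (valid_types.is_empty()
                        || arg_types.iter().all(|t| valid_types.contains(t)))
            }
        }
    }
}
```

Under this model `Any { arity: 1 }` and `Uniform { arity: 1, valid_types: vec![] }` behave identically, which is the feedback's point.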
> The AI agents should never link to any issue or a pull request in any GitHub repository in the code reviews!

> The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!
```rust
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [_arg] = take_function_args(self.name(), args.args)?;
    let field = &args.arg_fields[0];

    // Build the name array
    let name_array =
        Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;

    // Build the data_type array
    let data_type_str = format!("{}", field.data_type());
    let data_type_array =
        Arc::new(StringArray::from(vec![data_type_str.as_str()])) as Arc<dyn Array>;

    // Build the nullable array
    let nullable_array =
        Arc::new(BooleanArray::from(vec![field.is_nullable()])) as Arc<dyn Array>;

    // Build the metadata map array (same pattern as arrow_metadata.rs)
    let metadata = field.metadata();
    let mut map_builder =
        MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

    let mut entries: Vec<_> = metadata.iter().collect();
    entries.sort_by_key(|(k, _)| *k);

    for (k, v) in entries {
        map_builder.keys().append_value(k);
        map_builder.values().append_value(v);
    }
    map_builder.append(true)?;

    let metadata_array = Arc::new(map_builder.finish()) as Arc<dyn Array>;

    // Build the struct
    let DataType::Struct(fields) = Self::return_struct_type() else {
        unreachable!()
    };

    let struct_array = StructArray::new(
        fields,
        vec![name_array, data_type_array, nullable_array, metadata_array],
        None,
    );

    Ok(ColumnarValue::Scalar(ScalarValue::try_from_array(
        &struct_array,
        0,
    )?))
}
```
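The sort before appending to the map builder is what makes the metadata output deterministic: `HashMap` iteration order is unspecified in Rust. That ordering step can be sketched in isolation with plain standard-library types (a stand-in for the Arrow builders, not the real code):

```rust
use std::collections::HashMap;

// Collect metadata entries and sort them by key so that the rendered map is
// deterministic, mirroring the `entries.sort_by_key(|(k, _)| *k)` step above.
fn sorted_entries(metadata: &HashMap<String, String>) -> Vec<(String, String)> {
    let mut entries: Vec<(String, String)> = metadata
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    entries
}
```

Without the sort, two runs over the same field metadata could render `{b: 2, a: 1}` and `{a: 1, b: 2}` interchangeably, which would make the `.slt` expected output flaky.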
This implementation can be simplified and made more robust.
Instead of building single-element Arrays for each field, then a StructArray, and finally extracting a ScalarValue, you can construct the ScalarValue for each field directly and then create the final ScalarValue::Struct. This is more direct and likely more efficient.
Additionally, the argument handling let [_arg] = take_function_args(...) is fragile. A defensive check on args.args.len() is more robust, especially if the signature is not perfectly specified.
```rust
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    if args.args.len() != 1 {
        return internal_err!(
            "{} expected 1 argument but got {}",
            self.name(),
            args.args.len()
        );
    }
    let field = &args.arg_fields[0];

    let name_scalar = ScalarValue::Utf8(Some(field.name().clone()));
    let data_type_scalar =
        ScalarValue::Utf8(Some(format!("{}", field.data_type())));
    let nullable_scalar = ScalarValue::Boolean(Some(field.is_nullable()));

    // Build the metadata map scalar
    let metadata = field.metadata();
    let mut map_builder =
        MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());
    let mut entries: Vec<_> = metadata.iter().collect();
    entries.sort_by_key(|(k, _)| *k);
    for (k, v) in entries {
        map_builder.keys().append_value(k);
        map_builder.values().append_value(v);
    }
    map_builder.append(true)?;
    let metadata_array = Arc::new(map_builder.finish());
    let metadata_scalar = ScalarValue::try_from_array(&metadata_array, 0)?;

    let struct_scalar = ScalarValue::Struct(Arc::new([
        name_scalar,
        data_type_scalar,
        nullable_scalar,
        metadata_scalar,
    ]));
    Ok(ColumnarValue::Scalar(struct_scalar))
}
```
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The result could be constructed without using intermediate arrays. This would reduce the complexity and improve the performance by allocating+deallocating less memory.
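The allocation point in this feedback can be illustrated without the Arrow types. The sketch below uses a toy `Scalar` enum (an assumption for illustration, not DataFusion's `ScalarValue`) to contrast building a one-element collection and extracting slot 0 against constructing the value directly:

```rust
// Toy stand-in for ScalarValue; hypothetical, not the DataFusion type.
#[derive(Debug, PartialEq)]
enum Scalar {
    Utf8(Option<String>),
}

// Indirect route: allocate a single-element Vec (standing in for a
// single-row Arrow array), then pull the value back out of slot 0.
fn via_array(name: &str) -> Scalar {
    let array = vec![Scalar::Utf8(Some(name.to_string()))];
    array.into_iter().next().expect("one element")
}

// Direct route: build the scalar in place, no intermediate collection.
fn direct(name: &str) -> Scalar {
    Scalar::Utf8(Some(name.to_string()))
}
```

Both routes yield the same value; the direct one simply skips the throwaway container, which is the reviewers' efficiency argument in miniature.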
Code Review: arrow_field(expr) Scalar UDF

The overall approach is sound and follows the existing patterns from arrow_typeof and arrow_metadata. Issues to address:

- BUG 1 - Duplicated content in AGENTS.md: The same two-rule block is appended twice. One copy should be removed.
- BUG 2 - Inaccurate SQL example in user_doc: The doc shows the literal name as 'Int64(1)' but the .slt tests confirm that literals use the name 'lit'. The example should show: {name: lit, data_type: Int64, nullable: false, metadata: {}}.
- CODE 3 - Missing Clone derive (arrow_field.rs:57): ArrowFieldFunc derives Debug, PartialEq, Eq, Hash but not Clone. The sibling ArrowMetadataFunc derives Clone. Add Clone for consistency.
- CODE 4 - return_struct_type() called twice in invoke_with_args: The function calls Self::return_struct_type() to destructure fields, re-creating the full DataType struct already computed during planning in return_type. Not on a hot path (always returns a Scalar), but unnecessary duplication.
- CODE 5 - Minor: format! can be simplified (arrow_field.rs:118): field.data_type().to_string() is cleaner than format!("{}", field.data_type()).
- TEST 6 - No test for non-empty metadata: All .slt tests show metadata: {}. No test exercises a column with actual field metadata, which would verify the key/value output and sorted-key ordering logic.
- TEST 7 - Consider testing nested/complex types: Only basic scalars and List are tested. A test for Struct or Map types would verify data_type string formatting for complex nested types.
- UNRELATED - .cursor/rules.md included accidentally: This Cursor IDE AI config file is unrelated to the arrow_field UDF and should be removed or submitted in a separate PR.
Actionable comments posted: 1
🧹 Nitpick comments (2)
datafusion/functions/src/core/arrow_field.rs (1)
109-112: Use checked extraction for `arg_fields` instead of direct indexing.

Line 111 currently uses unchecked indexing (`args.arg_fields[0]`). Prefer checked extraction to keep failures as structured `Result` errors instead of panics.

🛡️ Suggested change

```diff
 fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
-    let [_arg] = take_function_args(self.name(), args.args)?;
-    let field = &args.arg_fields[0];
+    let [_arg] = take_function_args(self.name(), args.args)?;
+    let [field] = take_function_args(self.name(), args.arg_fields)?;

     // Build the name array
     let name_array =
-        Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;
+        Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@datafusion/functions/src/core/arrow_field.rs` around lines 109 - 112, the code in invoke_with_args uses unchecked indexing args.arg_fields[0], which can panic; replace that with a checked extraction (e.g., match or .get) of args.arg_fields to return a proper Result error on missing fields instead of panicking. Update the invoke_with_args implementation to call take_function_args(self.name(), args.args) as before, then retrieve the first field with a checked access (e.g., args.arg_fields.get(0).ok_or_else(|| /* create an appropriate Error with context referencing self.name() or the function */)?) and propagate that Result so failures are returned as structured errors rather than causing an index panic.

datafusion/sqllogictest/test_files/arrow_field.slt (1)

54-70: Add a direct assertion for the `metadata` subfield.

Current tests validate `name`, `data_type`, and `nullable` via struct indexing, but not `metadata`. A direct `arrow_field(...)['metadata']` check would better protect this new API surface.

📌 Suggested test addition

```diff
 # arrow_field struct field access - name
 query T
 SELECT arrow_field(1)['name']
 ----
 lit
+
+# arrow_field struct field access - metadata
+query ?
+SELECT arrow_field(1)['metadata']
+----
+{}
```

Also applies to: 76-101
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/sqllogictest/test_files/arrow_field.slt` around lines 54 - 70, Add a direct assertion that exercises the metadata subfield of the arrow_field struct: add a query like SELECT arrow_field(1)['metadata'] and an expected result for that metadata value alongside the existing checks for name, data_type, and nullable; update both places where arrow_field is tested (the current block around the first occurrence and the second block noted for lines 76-101) so the test suite explicitly validates arrow_field(... )['metadata'] in addition to arrow_field(... )['name'], arrow_field(... )['data_type'], and arrow_field(... )['nullable'].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@datafusion/functions/src/core/arrow_field.rs`:
- Around line 35-40: The documented SQL example for the arrow_field function
shows the literal's name as "Int64(1)" but runtime/tests return "lit" for
literals; update the example output in arrow_field.rs (the example under select
arrow_field(1);) so the returned row displays "name: lit" (e.g., {name: lit,
data_type: Int64, ...}) to match actual behavior of the arrow_field function and
tests.
---
Nitpick comments:
In `@datafusion/functions/src/core/arrow_field.rs`:
- Around line 109-112: The code in invoke_with_args uses unchecked indexing
args.arg_fields[0], which can panic; replace that with a checked extraction
(e.g., match or .get) of args.arg_fields to return a proper Result error on
missing fields instead of panicking. Update the invoke_with_args implementation
to call take_function_args(self.name(), args.args) as before, then retrieve the
first field with a checked access (e.g., args.arg_fields.get(0).ok_or_else(|| /*
create an appropriate Error with context referencing self.name() or the function
*/)?) and propagate that Result so failures are returned as structured errors
rather than causing an index panic.
In `@datafusion/sqllogictest/test_files/arrow_field.slt`:
- Around line 54-70: Add a direct assertion that exercises the metadata subfield
of the arrow_field struct: add a query like SELECT arrow_field(1)['metadata']
and an expected result for that metadata value alongside the existing checks for
name, data_type, and nullable; update both places where arrow_field is tested
(the current block around the first occurrence and the second block noted for
lines 76-101) so the test suite explicitly validates arrow_field(...
)['metadata'] in addition to arrow_field(... )['name'], arrow_field(...
)['data_type'], and arrow_field(... )['nullable'].
📒 Files selected for processing (5)
- .cursor/rules.md
- AGENTS.md
- datafusion/functions/src/core/arrow_field.rs
- datafusion/functions/src/core/mod.rs
- datafusion/sqllogictest/test_files/arrow_field.slt
```
> select arrow_field(1);
+----------------------------------------------+
| arrow_field(Int64(1))                        |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...}      |
+----------------------------------------------+
```
Fix the documented example output for name.
The SQL example currently shows name: Int64(1), but the behavior validated by tests returns name: lit for literals. Please align docs with runtime output.
📝 Suggested doc fix
```diff
-| {name: Int64(1), data_type: Int64, ...}     |
+| {name: lit, data_type: Int64, ...}          |
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```
> select arrow_field(1);
+----------------------------------------------+
| arrow_field(Int64(1))                        |
+----------------------------------------------+
| {name: lit, data_type: Int64, ...}           |
+----------------------------------------------+
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@datafusion/functions/src/core/arrow_field.rs` around lines 35 - 40, The
documented SQL example for the arrow_field function shows the literal's name as
"Int64(1)" but runtime/tests return "lit" for literals; update the example
output in arrow_field.rs (the example under select arrow_field(1);) so the
returned row displays "name: lit" (e.g., {name: lit, data_type: Int64, ...}) to
match actual behavior of the arrow_field function and tests.
value:useful; category:documentation; feedback: The CodeRabbit AI reviewer is correct! The .slt file confirms that the name of the literal columns is lit. The documentation has to be updated to match reality.
🤖 Augment PR Summary

Summary/Changes: (collapsed; not captured in this export)

Technical Notes: The UDF derives output from the planned argument field (…)
```
+----------------------------------------------+
| arrow_field(Int64(1))                        |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...}      |
```
In datafusion/functions/src/core/arrow_field.rs (line 39), the SQL example output shows {name: Int64(1), ...} but the implementation (and the new SLT) uses the argument field name (e.g. lit for literals). This example looks misleading for users trying to understand what name represents.
Severity: low
value:useful; category:documentation; feedback: The Augment AI reviewer is correct! The .slt file confirms that the name of the literal columns is lit. The documentation has to be updated to match reality.
```rust
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [_arg] = take_function_args(self.name(), args.args)?;
    let field = &args.arg_fields[0];
```
In datafusion/functions/src/core/arrow_field.rs (line 111), args.arg_fields[0] will panic if arg_fields is unexpectedly empty/mismatched with args.args. Since this is a built-in UDF, it may be safer to fail with a Result error rather than potentially aborting execution via panic.
Severity: medium
value:annoying; category:bug; feedback: The Augment AI reviewer is not correct! The function won't be executed at all if the parameters do not match the expected signature, i.e. this method won't be called unless exactly one argument is passed to the function.
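For reference, the checked-extraction pattern the reviewers keep suggesting is easy to sketch standalone. The helper below uses plain strings in place of DataFusion's field types; the name `first_field` and the error format are assumptions for illustration, not the project's API:

```rust
// Return the single expected field, or a structured error instead of a panic.
// Plain-String stand-ins for the real arg_fields; hypothetical helper.
fn first_field<'a>(arg_fields: &'a [String], fn_name: &str) -> Result<&'a str, String> {
    arg_fields
        .first()
        .map(|f| f.as_str())
        .ok_or_else(|| format!("{fn_name} expected 1 argument but got {}", arg_fields.len()))
}
```

Whether this extra check is worth having is exactly the disagreement above: it converts an impossible-by-construction panic into a `Result` error at the cost of a little noise.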
value:useful; category:documentation; feedback: The Claude AI reviewer is correct! The .slt file confirms that the name of the literal columns is lit.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The return struct type is already constructed and available through the arguments passed to the function. There is no need to construct a second instance just to deconstruct it.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! Using .to_string() is both shorter to write/read and faster to execute, because the format!() macro carries extra formatting machinery that is not needed in this case.
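The equivalence behind this feedback holds for any type implementing Display; the `Mock` type below is a stand-in (an assumption for illustration) for Arrow's `DataType`, which also implements Display:

```rust
use std::fmt;

// Toy Display type standing in for DataType's "Int64" rendering.
struct Mock;

impl fmt::Display for Mock {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Int64")
    }
}
```

`Mock.to_string()` (via the blanket `ToString` impl for `Display` types) produces the same string as `format!("{}", Mock)`, so the shorter spelling loses nothing.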
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! SQL logic tests for more complex data types like Struct, Map, and Dictionary would be useful to confirm that their rendering works as expected.
21389: To review by AI