Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .cursor/rules.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

10 changes: 10 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,13 @@ When creating a PR, you MUST follow the [PR template](.github/pull_request_templ

See the [Testing Quick Start](docs/source/contributor-guide/testing.md#testing-quick-start)
for the recommended pre-PR test commands.
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

Comment on lines +47 to +51
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These lines are a duplicate of lines 42-45 and can be removed to avoid redundancy.

162 changes: 162 additions & 0 deletions datafusion/functions/src/core/arrow_field.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

use arrow::array::{
Array, BooleanArray, MapBuilder, StringArray, StringBuilder, StructArray,
};
use arrow::datatypes::{DataType, Field, Fields};
use datafusion_common::{Result, ScalarValue, utils::take_function_args};
use datafusion_expr::{
ColumnarValue, Documentation, ScalarFunctionArgs, ScalarUDFImpl, Signature,
Volatility,
};
use datafusion_macros::user_doc;
use std::sync::Arc;

#[user_doc(
doc_section(label = "Other Functions"),
description = "Returns a struct containing the Arrow field information of the expression, including name, data type, nullability, and metadata.",
syntax_example = "arrow_field(expression)",
sql_example = r#"```sql
> select arrow_field(1);
+----------------------------------------------+
| arrow_field(Int64(1)) |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...} |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In datafusion/functions/src/core/arrow_field.rs (line 39), the SQL example output shows {name: Int64(1), ...} but the implementation (and the new SLT) uses the argument field name (e.g. lit for literals). This example looks misleading for users trying to understand what name represents.

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:documentation; feedback: The Augment AI reviewer is correct! The .slt file confirms that the name of the literal columns is lit. The documentation has to be updated to match with the reality.

+----------------------------------------------+
Comment on lines +35 to +40
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix the documented example output for name.

The SQL example currently shows name: Int64(1), but the behavior validated by tests returns name: lit for literals. Please align docs with runtime output.

📝 Suggested doc fix
-| {name: Int64(1), data_type: Int64, ...}      |
+| {name: lit, data_type: Int64, ...}           |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
> select arrow_field(1);
+----------------------------------------------+
| arrow_field(Int64(1)) |
+----------------------------------------------+
| {name: Int64(1), data_type: Int64, ...} |
+----------------------------------------------+
> select arrow_field(1);
----------------------------------------------+
| arrow_field(Int64(1)) |
----------------------------------------------+
| {name: lit, data_type: Int64, ...} |
----------------------------------------------+
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/core/arrow_field.rs` around lines 35 - 40, The
documented SQL example for the arrow_field function shows the literal's name as
"Int64(1)" but runtime/tests return "lit" for literals; update the example
output in arrow_field.rs (the example under select arrow_field(1);) so the
returned row displays "name: lit" (e.g., {name: lit, data_type: Int64, ...}) to
match actual behavior of the arrow_field function and tests.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:documentation; feedback: The CodeRabbit AI reviewer is correct! The .slt file confirms that the name of the literal columns is lit. The documentation has to be updated to match with the reality.


> select arrow_field(1)['data_type'];
+-----------------------------------+
| arrow_field(Int64(1))[data_type] |
+-----------------------------------+
| Int64 |
+-----------------------------------+
```"#,
argument(
name = "expression",
description = "Expression to evaluate. The expression can be a constant, column, or function, and any combination of operators."
)
)]
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct ArrowFieldFunc {
signature: Signature,
}

impl Default for ArrowFieldFunc {
fn default() -> Self {
Self::new()
}
}

impl ArrowFieldFunc {
pub fn new() -> Self {
Self {
signature: Signature::any(1, Volatility::Immutable),
}
Comment on lines +67 to +69
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The signature should be Signature::uniform(1, vec![], Volatility::Immutable) to enforce exactly one argument. Signature::any(1, ...) allows one or more arguments. While the planner might not allow more than one argument due to other constraints, being explicit with uniform is safer and clearer, preventing potential panics in invoke_with_args.

Suggested change
Self {
signature: Signature::any(1, Volatility::Immutable),
}
Self {
signature: Signature::uniform(1, vec![], Volatility::Immutable),
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback: The Gemini AI reviewer is not correct! Signature::any(1) is the correct way to say that the function receives just 1 parameter of any type. uniform(1, vec![]) is the same but longer to type. uniform() is useful when the types should be reduced to several types.

}

fn return_struct_type() -> DataType {
DataType::Struct(Fields::from(vec![
Field::new("name", DataType::Utf8, false),
Field::new("data_type", DataType::Utf8, false),
Field::new("nullable", DataType::Boolean, false),
Field::new(
"metadata",
DataType::Map(
Arc::new(Field::new(
"entries",
DataType::Struct(Fields::from(vec![
Field::new("keys", DataType::Utf8, false),
Field::new("values", DataType::Utf8, true),
])),
false,
)),
false,
),
false,
),
]))
}
}

impl ScalarUDFImpl for ArrowFieldFunc {
fn name(&self) -> &str {
"arrow_field"
}

fn signature(&self) -> &Signature {
&self.signature
}

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
Ok(Self::return_struct_type())
}

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
let [_arg] = take_function_args(self.name(), args.args)?;
let field = &args.arg_fields[0];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In datafusion/functions/src/core/arrow_field.rs (line 111), args.arg_fields[0] will panic if arg_fields is unexpectedly empty/mismatched with args.args. Since this is a built-in UDF, it may be safer to fail with a Result error rather than potentially aborting execution via panic.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback: The Augment AI reviewer is correct! The function won't be executed at all if the parameters do not match the expected signature, i.e. this method won't be called at all if there is no exactly one argument passed to the function.


// Build the name array
let name_array =
Arc::new(StringArray::from(vec![field.name().as_str()])) as Arc<dyn Array>;

// Build the data_type array
let data_type_str = format!("{}", field.data_type());
let data_type_array =
Arc::new(StringArray::from(vec![data_type_str.as_str()])) as Arc<dyn Array>;

// Build the nullable array
let nullable_array =
Arc::new(BooleanArray::from(vec![field.is_nullable()])) as Arc<dyn Array>;

// Build the metadata map array (same pattern as arrow_metadata.rs)
let metadata = field.metadata();
let mut map_builder =
MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

let mut entries: Vec<_> = metadata.iter().collect();
entries.sort_by_key(|(k, _)| *k);

for (k, v) in entries {
map_builder.keys().append_value(k);
map_builder.values().append_value(v);
}
map_builder.append(true)?;

let metadata_array = Arc::new(map_builder.finish()) as Arc<dyn Array>;

// Build the struct
let DataType::Struct(fields) = Self::return_struct_type() else {
unreachable!()
};

let struct_array = StructArray::new(
fields,
vec![name_array, data_type_array, nullable_array, metadata_array],
None,
);

Ok(ColumnarValue::Scalar(ScalarValue::try_from_array(
&struct_array,
0,
)?))
}
Comment on lines +109 to +157
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This implementation can be simplified and made more robust.

Instead of building single-element Arrays for each field, then a StructArray, and finally extracting a ScalarValue, you can construct the ScalarValue for each field directly and then create the final ScalarValue::Struct. This is more direct and likely more efficient.

Additionally, the argument handling let [_arg] = take_function_args(...) is fragile. A defensive check on args.args.len() is more robust, especially if the signature is not perfectly specified.

    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        if args.args.len() != 1 {
            return internal_err!(
                "{} expected 1 argument but got {}",
                self.name(),
                args.args.len()
            );
        }
        let field = &args.arg_fields[0];

        let name_scalar = ScalarValue::Utf8(Some(field.name().clone()));
        let data_type_scalar =
            ScalarValue::Utf8(Some(format!("{}", field.data_type())));
        let nullable_scalar = ScalarValue::Boolean(Some(field.is_nullable()));

        // Build the metadata map scalar
        let metadata = field.metadata();
        let mut map_builder =
            MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());

        let mut entries: Vec<_> = metadata.iter().collect();
        entries.sort_by_key(|(k, _)| *k);

        for (k, v) in entries {
            map_builder.keys().append_value(k);
            map_builder.values().append_value(v);
        }
        map_builder.append(true)?;

        let metadata_array = Arc::new(map_builder.finish());
        let metadata_scalar = ScalarValue::try_from_array(&metadata_array, 0)?;

        let struct_scalar = ScalarValue::Struct(Arc::new([
            name_scalar,
            data_type_scalar,
            nullable_scalar,
            metadata_scalar,
        ]));

        Ok(ColumnarValue::Scalar(struct_scalar))
    }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The result could be constructed without using intermediate arrays. This would reduce the complexity and improve the performance by allocating+deallocating less memory.


fn documentation(&self) -> Option<&Documentation> {
self.doc()
}
}
7 changes: 7 additions & 0 deletions datafusion/functions/src/core/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ use datafusion_expr::ScalarUDF;
use std::sync::Arc;

pub mod arrow_cast;
pub mod arrow_field;
pub mod arrow_metadata;
pub mod arrow_try_cast;
pub mod arrowtypeof;
Expand Down Expand Up @@ -59,6 +60,7 @@ make_udf_function!(union_extract::UnionExtractFun, union_extract);
make_udf_function!(union_tag::UnionTagFunc, union_tag);
make_udf_function!(version::VersionFunc, version);
make_udf_function!(arrow_metadata::ArrowMetadataFunc, arrow_metadata);
make_udf_function!(arrow_field::ArrowFieldFunc, arrow_field);

pub mod expr_fn {
use datafusion_expr::{Expr, Literal};
Expand Down Expand Up @@ -91,6 +93,10 @@ pub mod expr_fn {
arrow_typeof,
"Returns the Arrow type of the input expression.",
arg1
),(
arrow_field,
"Returns the Arrow field info (name, data_type, nullable, metadata) of the input expression.",
arg1
),(
arrow_metadata,
"Returns the metadata of the input expression",
Expand Down Expand Up @@ -147,6 +153,7 @@ pub fn functions() -> Vec<Arc<ScalarUDF>> {
nullif(),
arrow_cast(),
arrow_try_cast(),
arrow_field(),
arrow_metadata(),
nvl(),
nvl2(),
Expand Down
104 changes: 104 additions & 0 deletions datafusion/sqllogictest/test_files/arrow_field.slt
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# arrow_field on integer literal
query ?
SELECT arrow_field(1)
----
{name: lit, data_type: Int64, nullable: false, metadata: {}}

# arrow_field on null literal
query ?
SELECT arrow_field(null)
----
{name: lit, data_type: Null, nullable: true, metadata: {}}

# arrow_field on boolean literal
query ?
SELECT arrow_field(true)
----
{name: lit, data_type: Boolean, nullable: false, metadata: {}}

# arrow_field on string literal
query ?
SELECT arrow_field('foo')
----
{name: lit, data_type: Utf8, nullable: false, metadata: {}}

# arrow_field on float literal
query ?
SELECT arrow_field(1.0)
----
{name: lit, data_type: Float64, nullable: false, metadata: {}}

# arrow_field on list
query ?
SELECT arrow_field(ARRAY[1,2,3])
----
{name: lit, data_type: List(Int64), nullable: false, metadata: {}}

# arrow_field struct field access - data_type
query T
SELECT arrow_field(1)['data_type']
----
Int64

# arrow_field struct field access - nullable
query B
SELECT arrow_field(1)['nullable']
----
false

# arrow_field struct field access - name
query T
SELECT arrow_field(1)['name']
----
lit

# arrow_field with table columns
statement ok
CREATE TABLE arrow_field_test(x INT NOT NULL, y TEXT) AS VALUES (1, 'a');

query ?
SELECT arrow_field(x) FROM arrow_field_test
----
{name: x, data_type: Int32, nullable: false, metadata: {}}

query ?
SELECT arrow_field(y) FROM arrow_field_test
----
{name: y, data_type: Utf8View, nullable: true, metadata: {}}

# arrow_field column access - name reflects column name
query T
SELECT arrow_field(x)['name'] FROM arrow_field_test
----
x

# arrow_field column access - nullability
query B
SELECT arrow_field(x)['nullable'] FROM arrow_field_test
----
false

query B
SELECT arrow_field(y)['nullable'] FROM arrow_field_test
----
true

statement ok
DROP TABLE arrow_field_test;
Loading