21115: functions: Add dict support for get field#293
21115: functions: Add dict support for get field#293martin-augment wants to merge 1 commit intomainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
WalkthroughThis change extends ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code ReviewGood overall direction — preserving dictionary encoding during field extraction is an important optimization. Here are my observations: Correctness
When you extract a field column from the struct values and rebuild the
After resolving one field access on a Code Quality / StyleUse This codebase already imports and uses Arrow's use arrow::downcast_dictionary_array;
// ...
downcast_dictionary_array! {
array => {
let values_struct = as_struct_array(array.values())?;
// ... extract and rebuild
},
_ => exec_err!("Unsupported dictionary key type: {}", array.data_type())
}Use The downcast error at line ~216 uses // Instead of:
datafusion_common::DataFusionError::Internal(format!("Failed to downcast ..."))
// Use:
internal_err!("Failed to downcast dictionary with key type {}", key_type)?Macro defined inside a match arm The Fully-qualified The macro body uses Test CoverageNo test through the full UDF interface Both new tests call No test for nullable keys or null values The tests use dense dictionaries with no nulls in keys or struct fields. A test with null keys and/or nullable struct fields would cover the nullability propagation logic in No test for The existing test Minor
SummaryThe core algorithm is correct and the performance intent (avoiding materialization) is sound. The main asks before merging:
|
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
🤖 Augment PR SummarySummary: This PR extends DataFusion’s Changes:
Technical Notes: The implementation keeps dictionary keys intact to avoid expanding to a dense array, aiming to preserve performance and encoding semantics when selecting child fields. 🤖 Was this summary useful? React with 👍 or 👎 |
| Box::new(child_field.data_type().clone()), | ||
| ); | ||
| current_field = | ||
| Arc::new(Field::new(child_field.name(), dict_type, nullable)); |
There was a problem hiding this comment.
return_field_from_args builds the result with Field::new(...), which drops any metadata/dictionary info present on child_field (unlike the Struct branch which clones the existing field). Consider preserving child_field’s metadata when wrapping it in a DataType::Dictionary so schema properties aren’t lost.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! Instead of constructing a new Field it would be better to clone the existing one and call setters only for the properties that need to be updated. This way any metadata/properties which are the same will be preserved.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
datafusion/functions/src/core/getfield.rs (1)
655-703: Good test coverage for the happy path.The test correctly validates that dictionary encoding is preserved (3 unique values, 5 total entries) rather than materialized.
Consider adding a test with nulls in the dictionary keys or struct field values to ensure null propagation works correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/functions/src/core/getfield.rs` around lines 655 - 703, Add a new unit test similar to test_get_field_dict_encoded_struct that introduces nulls to exercise null propagation: create a StructArray where one or more field values are null (e.g., names or ids include nulls) and/or use a DictionaryArray with some null keys (e.g., keys containing None/Null entries), then call extract_single_field with the same key workflow (ColumnarValue::Array of the DictionaryArray and ScalarValue::Utf8 for the field name) and assert that the resulting DictionaryArray preserves dictionary encoding and that null positions are preserved in the output (check result_dict.len(), result_dict.values().len(), and the resolved Utf8 array values for nulls at the expected indices); mirror naming to test_get_field_dict_encoded_struct and keep assertions parallel to the existing test to ensure coverage of null propagation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@datafusion/functions/src/core/getfield.rs`:
- Around line 655-703: Add a new unit test similar to
test_get_field_dict_encoded_struct that introduces nulls to exercise null
propagation: create a StructArray where one or more field values are null (e.g.,
names or ids include nulls) and/or use a DictionaryArray with some null keys
(e.g., keys containing None/Null entries), then call extract_single_field with
the same key workflow (ColumnarValue::Array of the DictionaryArray and
ScalarValue::Utf8 for the field name) and assert that the resulting
DictionaryArray preserves dictionary encoding and that null positions are
preserved in the output (check result_dict.len(), result_dict.values().len(),
and the resolved Utf8 array values for nulls at the expected indices); mirror
naming to test_get_field_dict_encoded_struct and keep assertions parallel to the
existing test to ensure coverage of null propagation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a4632a2c-544d-40a3-b9f5-ec324b54c917
📒 Files selected for processing (1)
datafusion/functions/src/core/getfield.rs
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds support for extracting fields from dictionary-encoded struct arrays within the getfield function. This includes new logic for both the execution phase in extract_single_field and the type resolution phase in ScalarUDFImpl. New test cases have been added to validate this functionality, including scenarios with nested dictionary-encoded structs. A review comment suggests refactoring duplicated logic for extracting field names into a helper function to improve maintainability.
| let field_name = sv | ||
| .as_ref() | ||
| .and_then(|sv| { | ||
| sv.try_as_str().flatten().filter(|s| !s.is_empty()) | ||
| }) | ||
| .ok_or_else(|| { | ||
| datafusion_common::DataFusionError::Execution( | ||
| "Field name must be a non-empty string".to_string(), | ||
| ) | ||
| })?; |
There was a problem hiding this comment.
This logic to extract field_name from the ScalarValue is duplicated from the DataType::Struct match arm below (lines 425-434). To improve maintainability and reduce code duplication, you could extract this into a helper function.
For example:
fn get_field_name_from_scalar(sv: &Option<&ScalarValue>) -> Result<&str, datafusion_common::DataFusionError> {
sv.as_ref()
.and_then(|sv| sv.try_as_str().flatten().filter(|s| !s.is_empty()))
.ok_or_else(|| {
datafusion_common::DataFusionError::Execution(
"Field name must be a non-empty string".to_string(),
)
})
}Then both places could be simplified to:
let field_name = get_field_name_from_scalar(sv)?;There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The logic is duplicated and it would be good to extract it to a helper method and reuse it. This would prevent double maintenance and any regressions only in one of the copies.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The macro helpers are already imported and used in this module, so it would be better to use them instead of constructing the errors via their constructors. |
21115: To review by AI