21115: functions: Add dict support for get field #293

Open
martin-augment wants to merge 1 commit into main from pr-21115-2026-03-25-10-24-48

@martin-augment
Owner

21115: To review by AI

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the get_field function in DataFusion by adding robust support for dictionary-encoded struct arrays. This improvement is vital for maintaining performance and data integrity, as it allows for efficient field extraction from these specialized array types without requiring them to be fully decoded, thereby leveraging the benefits of dictionary encoding.

Highlights

  • Dictionary-Encoded Struct Support: Added functionality to the get_field function to correctly extract fields from dictionary-encoded struct arrays.
  • Encoding Preservation: Ensured that dictionary encoding is preserved during field extraction from dictionary-encoded structs, avoiding unnecessary data materialization.
  • Type Inference Update: Updated the return type inference for get_field to accurately reflect the output type when processing dictionary-encoded struct inputs.
  • New Test Cases: Introduced comprehensive unit tests to validate the get_field function's behavior with both simple and nested dictionary-encoded struct arrays.

@coderabbitai

coderabbitai bot commented Mar 25, 2026

Walkthrough

This change extends get_field functionality to support dictionary-encoded struct columns without materializing the underlying dictionary. The modification handles DataType::Dictionary values in extract_single_field by downcasting to DictionaryArray, extracting the requested child field from the struct values, and rebuilding the result as a DictionaryArray with original keys and extracted field as new values. In return_field_from_args, type-resolution logic was added to compute output field types as dictionaries of the original key type and selected struct child type, with nullability propagation from both current and child fields. Two unit tests validate the behavior for direct and nested dictionary-encoded struct extraction.

@claude

claude bot commented Mar 25, 2026

Code Review

Good overall direction — preserving dictionary encoding during field extraction is an important optimization. Here are my observations:


Correctness

extract_single_field – key validity after extraction

When you extract a field column from the struct values and rebuild the DictionaryArray, you rely on DictionaryArray::try_new to validate the existing keys against the new values array. This is correct because the extracted field column has exactly the same number of rows as the original struct values array, so all existing key indices remain valid. Worth a brief comment though, since it's non-obvious.
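
The key-validity argument above can be illustrated with a small std-only model. This is a sketch under stated assumptions, not the Arrow API: `StructValues`, `Dict`, `try_new_dict`, and `get_names` are hypothetical names, and the bounds check only mimics the kind of validation the review attributes to `DictionaryArray::try_new`.

```rust
// Std-only model of a dictionary-encoded struct column; NOT the Arrow API.
// All type and function names here are hypothetical.

/// Values of a two-field struct, stored column-wise (one Vec per child).
#[derive(Debug, Clone, PartialEq)]
struct StructValues {
    ids: Vec<i32>,
    names: Vec<String>,
}

/// Dictionary array: each row holds a key (index into `values`) or null.
#[derive(Debug, Clone, PartialEq)]
struct Dict<V> {
    keys: Vec<Option<usize>>,
    values: V,
}

/// Build a dictionary, rejecting any non-null key that points past the values.
fn try_new_dict<T>(keys: Vec<Option<usize>>, values: Vec<T>) -> Result<Dict<Vec<T>>, String> {
    if let Some(bad) = keys.iter().flatten().find(|&&k| k >= values.len()) {
        return Err(format!("key {} out of bounds for {} values", bad, values.len()));
    }
    Ok(Dict { keys, values })
}

/// Extract the `names` child and reuse the original keys unchanged.
/// The child column has as many rows as the struct values, so every key
/// that was valid before is still valid afterwards.
fn get_names(dict: &Dict<StructValues>) -> Result<Dict<Vec<String>>, String> {
    try_new_dict(dict.keys.clone(), dict.values.names.clone())
}
```

In this model the output keeps five keys over three values, so nothing is expanded to one value per row.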

return_field_from_args – multi-step access

After resolving one field access on a Dictionary(key, Struct), current_field becomes Dictionary(key, child_type). On the next loop iteration, if child_type is itself a Struct, the dictionary branch correctly fires again. This works because the execution path also produces a DictionaryArray in that case, so type inference and execution stay in step across multi-step access.
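
A toy version of the multi-step resolution described above might look like the following. The `Ty` enum and both functions are hypothetical stand-ins for illustration; they are not arrow's `DataType` or the PR's actual `return_field_from_args`.

```rust
// Toy model of the type-resolution loop; these types are hypothetical
// stand-ins, not arrow's `DataType` or `Field`.

#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Utf8,
    Struct(Vec<(String, Ty)>),
    Dictionary(Box<Ty>, Box<Ty>), // (key type, value type)
}

/// Resolve one field access. For `Dictionary(key, Struct)` the child type is
/// wrapped in a dictionary with the same key type.
fn resolve_one(current: &Ty, name: &str) -> Result<Ty, String> {
    match current {
        Ty::Struct(fields) => fields
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, t)| t.clone())
            .ok_or_else(|| format!("no field named {}", name)),
        Ty::Dictionary(key, value) if matches!(**value, Ty::Struct(_)) => {
            let child = resolve_one(value, name)?;
            Ok(Ty::Dictionary(key.clone(), Box::new(child)))
        }
        other => Err(format!("cannot access field {} on {:?}", name, other)),
    }
}

/// Apply a path of field names left to right, one access per step.
fn resolve_path(start: &Ty, path: &[&str]) -> Result<Ty, String> {
    path.iter()
        .try_fold(start.clone(), |cur, name| resolve_one(&cur, name))
}
```

Because each step returns `Dictionary(key, child)`, a child that is itself a struct re-enters the dictionary arm on the next step, which is the behavior the comment describes.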


Code Quality / Style

Use downcast_dictionary_array! instead of a local macro

This codebase already imports and uses Arrow's downcast_dictionary_array! macro (e.g. datafusion/common/src/hash_utils.rs:988, datafusion/physical-expr-common/src/utils.rs:139) to handle all key types. The local extract_dict_field! macro duplicates that pattern. Using the Arrow-provided macro would be more idiomatic and eliminate the manual 8-arm dispatch:

use arrow::downcast_dictionary_array;
// ...
downcast_dictionary_array! {
    array => {
        let values_struct = as_struct_array(array.values())?;
        // ... extract and rebuild
    },
    _ => exec_err!("Unsupported dictionary key type: {}", array.data_type())
}

Use internal_err! macro consistently

The downcast error at line ~216 uses datafusion_common::DataFusionError::Internal(format!(...)) directly. The rest of the file (and codebase) uses the internal_err! macro. Please align:

// Instead of:
datafusion_common::DataFusionError::Internal(format!("Failed to downcast ..."))
// Use:
internal_err!("Failed to downcast dictionary with key type {}", key_type)?

Macro defined inside a match arm

The extract_dict_field! macro is scoped inside the match arm. This works but is unusual — macros defined in inner scopes can be surprising. Moving it outside the match or using a local generic function would improve readability.
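
As a rough illustration of the "local generic function" alternative, the sketch below writes the per-key logic once and lets the compiler instantiate it for each integer key type. It is std-only and hypothetical: `lookup` is not part of the PR, and Arrow's real key types are more involved than plain integers.

```rust
// Hypothetical sketch: one generic function instead of a macro arm per key type.
// Plain integers stand in for Arrow's dictionary key types.

/// Resolve each key to its value. Keys that are negative or out of range
/// yield `None` instead of panicking.
fn lookup<K, V>(keys: &[K], values: &[V]) -> Vec<Option<V>>
where
    K: Copy,
    usize: std::convert::TryFrom<K>,
    V: Clone,
{
    keys.iter()
        .map(|&k| {
            <usize as std::convert::TryFrom<K>>::try_from(k)
                .ok()
                .and_then(|i| values.get(i).cloned())
        })
        .collect()
}
```

The same body serves `i8`, `u64`, and the other integer widths, so the dispatch over key types can stay a thin wrapper around a single function.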

Fully-qualified arrow:: paths inside the macro

The macro body uses arrow::array::DictionaryArray<$key_ty> and arrow::datatypes::Int8Type etc. via full paths, while the file already imports arrow::datatypes::DataType and others at the top. Prefer using the already-imported names or add explicit imports.


Test Coverage

No test through the full UDF interface

Both new tests call extract_single_field directly. There's no test that exercises return_field_from_args (type inference) or invoke_with_args for the Dictionary(key, Struct) path. A test using GetFieldFunc::new() end-to-end — as done in test_placement_literal_key and others below the new tests — would increase confidence that the execution type matches the declared return type.

No test for nullable keys or null values

The tests use dense dictionaries with no nulls in keys or struct fields. A test with null keys and/or nullable struct fields would cover the nullability propagation logic in return_field_from_args (current_field.is_nullable() || child_field.is_nullable()).
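
To make the null cases concrete, here is a std-only sketch of what such a test would need to check. The function names are hypothetical; only the `||` expression comes from the review.

```rust
// Std-only sketch of null handling; names are hypothetical, not Arrow APIs.

/// Declared nullability of the result, following the expression quoted above.
fn result_nullable(current_nullable: bool, child_nullable: bool) -> bool {
    current_nullable || child_nullable
}

/// Decode a dictionary row by row: a row is null if its key is null or if
/// the value it points at is null.
fn logical_values<T: Clone>(keys: &[Option<usize>], values: &[Option<T>]) -> Vec<Option<T>> {
    keys.iter()
        .map(|k| k.and_then(|i| values.get(i).cloned().flatten()))
        .collect()
}
```

A null can reach the output from either source, which is why the declared field has to be nullable when either the dictionary field or the child field is.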

No test for Utf8View field name

The existing test test_get_field_utf8view_key covers Utf8View for plain structs, but there is no equivalent for the new dictionary path. The code does use try_as_str().flatten(), which handles Utf8View correctly, but a test would make the intent explicit.


Minor

  • In return_field_from_args, the Dictionary branch duplicates the field-name lookup boilerplate from the Struct branch verbatim. If this grows further, consider extracting a resolve_field_name(sv) -> Result<&str> helper.

  • The unreachable!() in let DataType::Struct(fields) = value_type.as_ref() else { unreachable!() } is fine (the guard guarantees it), but a comment such as // safety: guard above ensures this is Struct would make the intent clearer for readers.


Summary

The core algorithm is correct and the performance intent (avoiding materialization) is sound. The main asks before merging:

  1. Replace the local extract_dict_field! macro with Arrow's downcast_dictionary_array! for consistency with the rest of the codebase.
  2. Use internal_err! instead of DataFusionError::Internal(format!(...)).
  3. Add an end-to-end test through invoke_with_args / return_field_from_args.

@gemini-code-assist

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@augmentcode

augmentcode bot commented Mar 25, 2026

🤖 Augment PR Summary

Summary: This PR extends DataFusion’s get_field scalar UDF to support extracting fields from dictionary-encoded Struct arrays while preserving dictionary encoding.

Changes:

  • Added an execution path in extract_single_field to extract a named struct field from the dictionary’s values array and rebuild a dictionary with the same keys.
  • Added return-type inference support so get_field on Dictionary<K, Struct> returns Dictionary<K, ChildType>, matching runtime behavior.
  • Included tests covering direct dictionary-encoded struct extraction and a nested struct → dictionary(struct) → field extraction chain.

Technical Notes: The implementation keeps dictionary keys intact to avoid expanding to a dense array, aiming to preserve performance and encoding semantics when selecting child fields.

@augmentcode augmentcode bot left a comment

Review completed. 1 suggestion posted.

        Box::new(child_field.data_type().clone()),
    );
    current_field =
        Arc::new(Field::new(child_field.name(), dict_type, nullable));

return_field_from_args builds the result with Field::new(...), which drops any metadata/dictionary info present on child_field (unlike the Struct branch which clones the existing field). Consider preserving child_field’s metadata when wrapping it in a DataType::Dictionary so schema properties aren’t lost.

Severity: medium

Owner Author

value:useful; category:bug; feedback: The Augment AI reviewer is correct! Instead of constructing a new Field it would be better to clone the existing one and call setters only for the properties that need to be updated. This way any metadata/properties which are the same will be preserved.
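
The clone-and-override approach can be sketched with a toy `Field` type. Everything here is a hypothetical stand-in. Arrow's own `Field` appears to offer similarly named builders (`with_data_type`, `with_nullable`), but that mapping is an assumption to verify against the arrow version in use.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for arrow's `Field`; only the shape matters here.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl Field {
    /// Replace the data type and keep every other property.
    fn with_data_type(mut self, data_type: &str) -> Self {
        self.data_type = data_type.to_string();
        self
    }

    /// Replace the nullability flag and keep every other property.
    fn with_nullable(mut self, nullable: bool) -> Self {
        self.nullable = nullable;
        self
    }
}

/// Wrap the child's type in a dictionary while keeping its name and metadata.
fn wrap_in_dictionary(child: &Field, key_type: &str, parent_nullable: bool) -> Field {
    let dict_type = format!("Dictionary({}, {})", key_type, child.data_type);
    child
        .clone()
        .with_data_type(&dict_type)
        .with_nullable(parent_nullable || child.nullable)
}
```

Starting from a clone means anything not explicitly overridden, such as metadata, carries over without having to be listed.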

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
datafusion/functions/src/core/getfield.rs (1)

655-703: Good test coverage for the happy path.

The test correctly validates that dictionary encoding is preserved (3 unique values, 5 total entries) rather than materialized.

Consider adding a test with nulls in the dictionary keys or struct field values to ensure null propagation works correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/core/getfield.rs` around lines 655 - 703, Add a new
unit test similar to test_get_field_dict_encoded_struct that introduces nulls to
exercise null propagation: create a StructArray where one or more field values
are null (e.g., names or ids include nulls) and/or use a DictionaryArray with
some null keys (e.g., keys containing None/Null entries), then call
extract_single_field with the same key workflow (ColumnarValue::Array of the
DictionaryArray and ScalarValue::Utf8 for the field name) and assert that the
resulting DictionaryArray preserves dictionary encoding and that null positions
are preserved in the output (check result_dict.len(),
result_dict.values().len(), and the resolved Utf8 array values for nulls at the
expected indices); mirror naming to test_get_field_dict_encoded_struct and keep
assertions parallel to the existing test to ensure coverage of null propagation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a4632a2c-544d-40a3-b9f5-ec324b54c917

📥 Commits

Reviewing files that changed from the base of the PR and between fbfabf9 and fd6ec12.

📒 Files selected for processing (1)
  • datafusion/functions/src/core/getfield.rs

@martin-augment
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for extracting fields from dictionary-encoded struct arrays within the getfield function. This includes new logic for both the execution phase in extract_single_field and the type resolution phase in ScalarUDFImpl. New test cases have been added to validate this functionality, including scenarios with nested dictionary-encoded structs. A review comment suggests refactoring duplicated logic for extracting field names into a helper function to improve maintainability.

Comment on lines +397 to +406
let field_name = sv
    .as_ref()
    .and_then(|sv| {
        sv.try_as_str().flatten().filter(|s| !s.is_empty())
    })
    .ok_or_else(|| {
        datafusion_common::DataFusionError::Execution(
            "Field name must be a non-empty string".to_string(),
        )
    })?;

medium

This logic to extract field_name from the ScalarValue is duplicated from the DataType::Struct match arm below (lines 425-434). To improve maintainability and reduce code duplication, you could extract this into a helper function.

For example:

fn get_field_name_from_scalar<'a>(
    sv: &Option<&'a ScalarValue>,
) -> Result<&'a str, datafusion_common::DataFusionError> {
    // The explicit lifetime ties the returned &str to the ScalarValue itself;
    // with two elided input lifetimes the signature would not compile.
    sv.and_then(|sv| sv.try_as_str().flatten().filter(|s| !s.is_empty()))
        .ok_or_else(|| {
            datafusion_common::DataFusionError::Execution(
                "Field name must be a non-empty string".to_string(),
            )
        })
}

Then both places could be simplified to:

let field_name = get_field_name_from_scalar(sv)?;

Owner Author

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The logic is duplicated and it would be good to extract it to a helper method and reuse it. This would avoid maintaining two copies and prevent regressions that affect only one of them.

@martin-augment
Owner Author

Use internal_err! macro consistently

The downcast error at line ~216 uses datafusion_common::DataFusionError::Internal(format!(...)) directly. The rest of the file (and codebase) uses the internal_err! macro. Please align:

// Instead of:
datafusion_common::DataFusionError::Internal(format!("Failed to downcast ..."))
// Use:
internal_err!("Failed to downcast dictionary with key type {}", key_type)?

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The macro helpers are already imported and used in this module, so it would be better to use them instead of constructing the errors via their constructors.
