21115: functions: Add dict support for get field by martin-augment · Pull Request #293 · martin-augment/datafusion

martin-augment · 2026-03-25T10:24:58Z

21115: To review by AI

gemini-code-assist · 2026-03-25T10:25:16Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the get_field function in DataFusion by adding robust support for dictionary-encoded struct arrays. This improvement is vital for maintaining performance and data integrity, as it allows for efficient field extraction from these specialized array types without requiring them to be fully decoded, thereby leveraging the benefits of dictionary encoding.

Highlights

Dictionary-Encoded Struct Support: Added functionality to the get_field function to correctly extract fields from dictionary-encoded struct arrays.
Encoding Preservation: Ensured that dictionary encoding is preserved during field extraction from dictionary-encoded structs, avoiding unnecessary data materialization.
Type Inference Update: Updated the return type inference for get_field to accurately reflect the output type when processing dictionary-encoded struct inputs.
New Test Cases: Introduced comprehensive unit tests to validate the get_field function's behavior with both simple and nested dictionary-encoded struct arrays.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-03-25T10:25:32Z

Walkthrough

This change extends get_field functionality to support dictionary-encoded struct columns without materializing the underlying dictionary. The modification handles DataType::Dictionary values in extract_single_field by downcasting to DictionaryArray, extracting the requested child field from the struct values, and rebuilding the result as a DictionaryArray with original keys and extracted field as new values. In return_field_from_args, type-resolution logic was added to compute output field types as dictionaries of the original key type and selected struct child type, with nullability propagation from both current and child fields. Two unit tests validate the behavior for direct and nested dictionary-encoded struct extraction.


claude · 2026-03-25T10:27:54Z

Code Review

Good overall direction — preserving dictionary encoding during field extraction is an important optimization. Here are my observations:

Correctness

extract_single_field – key validity after extraction

When you extract a field column from the struct values and rebuild the DictionaryArray, you rely on DictionaryArray::try_new to validate the existing keys against the new values array. This is correct because the extracted field column has exactly the same number of rows as the original struct values array, so all existing key indices remain valid. Worth a brief comment though, since it's non-obvious.
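The key-validity argument above can be illustrated with a minimal, std-only sketch. This is a toy model, not Arrow's API: `Row`, `Dict`, `Dict::try_new`, and `extract_name` are hypothetical names introduced only for illustration, and the bounds check merely imitates the kind of validation a checked dictionary constructor performs.

```rust
/// Toy stand-in for one struct row (hypothetical; not Arrow's StructArray).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i32,
    name: String,
}

/// Toy dictionary encoding: each non-null key indexes into `values`.
#[derive(Debug, Clone, PartialEq)]
struct Dict<V> {
    keys: Vec<Option<usize>>,
    values: Vec<V>,
}

impl<V> Dict<V> {
    /// Checked constructor: every non-null key must be in bounds.
    fn try_new(keys: Vec<Option<usize>>, values: Vec<V>) -> Result<Self, String> {
        if let Some(bad) = keys.iter().flatten().find(|&&k| k >= values.len()) {
            return Err(format!(
                "key {} out of bounds for {} values",
                bad,
                values.len()
            ));
        }
        Ok(Dict { keys, values })
    }
}

impl<V: Clone> Dict<V> {
    /// Expand to one logical value per row (null key => null row).
    fn decode(&self) -> Vec<Option<V>> {
        self.keys
            .iter()
            .map(|k| k.map(|i| self.values[i].clone()))
            .collect()
    }
}

/// Project the `name` field out of the dictionary values, reusing the keys as-is.
fn extract_name(dict: &Dict<Row>) -> Result<Dict<String>, String> {
    let names: Vec<String> = dict.values.iter().map(|r| r.name.clone()).collect();
    // `names` has exactly as many entries as `dict.values`,
    // so every key that was in bounds before is still in bounds.
    Dict::try_new(dict.keys.clone(), names)
}
```

In this model the projection touches only the (small) values side; the per-row keys are reused unchanged, which is the point of preserving the encoding.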

return_field_from_args – multi-step access

After resolving one field access on a Dictionary(key, Struct), current_field becomes Dictionary(key, child_type). On the next loop iteration, if child_type is itself a Struct, the dictionary branch correctly fires again. This works because the execution path also produces a DictionaryArray in that case, so the inferred type and the runtime type stay in step. The iteration is correct.
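The multi-step resolution described above can be sketched with a toy type model. Everything here is an assumption made for illustration: `Ty`, `resolve`, and `lookup` are hypothetical and only imitate the shape of the loop in return_field_from_args; they are not DataFusion or Arrow code.

```rust
/// Toy type model (hypothetical; not Arrow's DataType).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Utf8,
    Struct(Vec<(String, Ty)>),
    /// (key type, value type)
    Dictionary(Box<Ty>, Box<Ty>),
}

/// Find a child type by field name.
fn lookup(fields: &[(String, Ty)], name: &str) -> Result<Ty, String> {
    fields
        .iter()
        .find(|(n, _)| n == name)
        .map(|(_, t)| t.clone())
        .ok_or_else(|| format!("no field named '{}'", name))
}

/// Resolve a chain of field names. A dictionary of structs stays a
/// dictionary after each step, so the same branch can fire again.
fn resolve(mut current: Ty, path: &[&str]) -> Result<Ty, String> {
    for name in path {
        current = match current {
            Ty::Struct(fields) => lookup(&fields, name)?,
            Ty::Dictionary(key, value) => match *value {
                Ty::Struct(fields) => {
                    Ty::Dictionary(key, Box::new(lookup(&fields, name)?))
                }
                other => {
                    return Err(format!(
                        "cannot access '{}' on dictionary of {:?}",
                        name, other
                    ))
                }
            },
            other => return Err(format!("cannot access '{}' on {:?}", name, other)),
        };
    }
    Ok(current)
}
```

With a dictionary whose values are a struct containing a nested struct, resolving two names in sequence yields a dictionary with the same key type and the innermost child type as its value type.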

Code Quality / Style

Use downcast_dictionary_array! instead of a local macro

This codebase already imports and uses Arrow's downcast_dictionary_array! macro (e.g. datafusion/common/src/hash_utils.rs:988, datafusion/physical-expr-common/src/utils.rs:139) to handle all key types. The local extract_dict_field! macro duplicates that pattern. Using the Arrow-provided macro would be more idiomatic and eliminate the manual 8-arm dispatch:

use arrow::downcast_dictionary_array;
// ...
downcast_dictionary_array! {
    array => {
        let values_struct = as_struct_array(array.values())?;
        // ... extract and rebuild
    },
    _ => exec_err!("Unsupported dictionary key type: {}", array.data_type())
}

Use internal_err! macro consistently

The downcast error at line ~216 uses datafusion_common::DataFusionError::Internal(format!(...)) directly. The rest of the file (and codebase) uses the internal_err! macro. Please align:

// Instead of:
datafusion_common::DataFusionError::Internal(format!("Failed to downcast ..."))
// Use:
internal_err!("Failed to downcast dictionary with key type {}", key_type)?

Macro defined inside a match arm

The extract_dict_field! macro is scoped inside the match arm. This works but is unusual — macros defined in inner scopes can be surprising. Moving it outside the match or using a local generic function would improve readability.

Fully-qualified arrow:: paths inside the macro

The macro body uses arrow::array::DictionaryArray<$key_ty> and arrow::datatypes::Int8Type etc. via full paths, while the file already imports arrow::datatypes::DataType and others at the top. Prefer using the already-imported names or add explicit imports.

Test Coverage

No test through the full UDF interface

Both new tests call extract_single_field directly. There's no test that exercises return_field_from_args (type inference) or invoke_with_args for the Dictionary(key, Struct) path. A test using GetFieldFunc::new() end-to-end — as done in test_placement_literal_key and others below the new tests — would increase confidence that the execution type matches the declared return type.

No test for nullable keys or null values

The tests use dense dictionaries with no nulls in keys or struct fields. A test with null keys and/or nullable struct fields would cover the nullability propagation logic in return_field_from_args (current_field.is_nullable() || child_field.is_nullable()).
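To show why both sources of nulls matter, here is a small std-only sketch. It is a toy model under stated assumptions: `decode` and `output_nullable` are hypothetical helpers, and only the boolean rule itself is taken from the expression quoted above.

```rust
/// Decode a toy dictionary column: a row is null when its key is null
/// or when the value the key points at is null.
fn decode(keys: &[Option<usize>], values: &[Option<&str>]) -> Vec<Option<String>> {
    keys.iter()
        .map(|k| k.and_then(|i| values[i]).map(str::to_string))
        .collect()
}

/// The nullability rule quoted in the review: the output may contain nulls
/// if either the dictionary column or the selected child field may.
fn output_nullable(current_nullable: bool, child_nullable: bool) -> bool {
    current_nullable || child_nullable
}
```

A test along these lines would need at least one null key and one null child value to exercise both halves of the rule.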

No test for Utf8View field name

The existing test test_get_field_utf8view_key covers Utf8View for plain structs. There's no equivalent for the new dictionary path. The code does use try_as_str().flatten() which handles Utf8View correctly, but a test would make the intent explicit.

Minor

In return_field_from_args, the Dictionary branch duplicates the field-name lookup boilerplate from the Struct branch verbatim. If this grows further, consider extracting a resolve_field_name(sv) -> Result<&str> helper.
The unreachable!() in let DataType::Struct(fields) = value_type.as_ref() else { unreachable!() } is fine (the guard guarantees it), but a comment such as `// safety: guard above ensures this is Struct` would make the intent clearer for readers.

Summary

The core algorithm is correct and the performance intent (avoiding materialization) is sound. The main asks before merging:

Replace the local extract_dict_field! macro with Arrow's downcast_dictionary_array! for consistency with the rest of the codebase.
Use internal_err! instead of DataFusionError::Internal(format!(...)).
Add an end-to-end test through invoke_with_args / return_field_from_args.

gemini-code-assist · 2026-03-25T10:30:46Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

augmentcode · 2026-03-25T10:32:25Z

🤖 Augment PR Summary

Summary: This PR extends DataFusion’s get_field scalar UDF to support extracting fields from dictionary-encoded Struct arrays while preserving dictionary encoding.

Changes:

Added an execution path in extract_single_field to extract a named struct field from the dictionary’s values array and rebuild a dictionary with the same keys.
Added return-type inference support so get_field on Dictionary<K, Struct> returns Dictionary<K, ChildType>, matching runtime behavior.
Included tests covering direct dictionary-encoded struct extraction and a nested struct → dictionary(struct) → field extraction chain.

Technical Notes: The implementation keeps dictionary keys intact to avoid expanding to a dense array, aiming to preserve performance and encoding semantics when selecting child fields.

augmentcode

Review completed. 1 suggestion posted.

augmentcode · 2026-03-25T10:32:26Z

datafusion/functions/src/core/getfield.rs

+                        Box::new(child_field.data_type().clone()),
+                    );
+                    current_field =
+                        Arc::new(Field::new(child_field.name(), dict_type, nullable));


return_field_from_args builds the result with Field::new(...), which drops any metadata/dictionary info present on child_field (unlike the Struct branch which clones the existing field). Consider preserving child_field’s metadata when wrapping it in a DataType::Dictionary so schema properties aren’t lost.

Severity: medium

value:useful; category:bug; feedback: The Augment AI reviewer is correct! Instead of constructing a new Field it would be better to clone the existing one and call setters only for the properties that need to be updated. This way any metadata/properties which are the same will be preserved.
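The difference between constructing a fresh field and cloning the existing one can be sketched with a toy type. This is an illustration only: the `Field` struct below is a hypothetical stand-in with string-typed data types, and `wrap_lossy` / `wrap_preserving` are invented names, not code from the PR.

```rust
use std::collections::HashMap;

/// Toy field descriptor (hypothetical; not Arrow's Field).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl Field {
    /// Fresh field: starts with empty metadata.
    fn new(name: &str, data_type: &str, nullable: bool) -> Self {
        Field {
            name: name.to_string(),
            data_type: data_type.to_string(),
            nullable,
            metadata: HashMap::new(),
        }
    }

    fn with_data_type(mut self, data_type: &str) -> Self {
        self.data_type = data_type.to_string();
        self
    }

    fn with_nullable(mut self, nullable: bool) -> Self {
        self.nullable = nullable;
        self
    }
}

/// Builds a new field from scratch: the child's metadata is lost.
fn wrap_lossy(child: &Field, parent_nullable: bool) -> Field {
    let dict_type = format!("Dictionary(Int32, {})", child.data_type);
    Field::new(&child.name, &dict_type, parent_nullable || child.nullable)
}

/// Clones the child and overrides only what changes: metadata survives.
fn wrap_preserving(child: &Field, parent_nullable: bool) -> Field {
    let dict_type = format!("Dictionary(Int32, {})", child.data_type);
    child
        .clone()
        .with_data_type(&dict_type)
        .with_nullable(parent_nullable || child.nullable)
}
```

Both helpers produce the same name, type, and nullability; only the second carries the child's metadata through.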

coderabbitai

🧹 Nitpick comments (1)

datafusion/functions/src/core/getfield.rs (1)

655-703: Good test coverage for the happy path.

The test correctly validates that dictionary encoding is preserved (3 unique values, 5 total entries) rather than materialized.

Consider adding a test with nulls in the dictionary keys or struct field values to ensure null propagation works correctly.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/core/getfield.rs` around lines 655 - 703, Add a new
unit test similar to test_get_field_dict_encoded_struct that introduces nulls to
exercise null propagation: create a StructArray where one or more field values
are null (e.g., names or ids include nulls) and/or use a DictionaryArray with
some null keys (e.g., keys containing None/Null entries), then call
extract_single_field with the same key workflow (ColumnarValue::Array of the
DictionaryArray and ScalarValue::Utf8 for the field name) and assert that the
resulting DictionaryArray preserves dictionary encoding and that null positions
are preserved in the output (check result_dict.len(),
result_dict.values().len(), and the resolved Utf8 array values for nulls at the
expected indices); mirror naming to test_get_field_dict_encoded_struct and keep
assertions parallel to the existing test to ensure coverage of null propagation.

📥 Commits

Reviewing files that changed from the base of the PR and between fbfabf9 and fd6ec12.

📒 Files selected for processing (1)

datafusion/functions/src/core/getfield.rs

martin-augment · 2026-03-25T11:34:00Z

/gemini review

gemini-code-assist

Code Review

This pull request adds support for extracting fields from dictionary-encoded struct arrays within the getfield function. This includes new logic for both the execution phase in extract_single_field and the type resolution phase in ScalarUDFImpl. New test cases have been added to validate this functionality, including scenarios with nested dictionary-encoded structs. A review comment suggests refactoring duplicated logic for extracting field names into a helper function to improve maintainability.

gemini-code-assist · 2026-03-25T11:35:53Z

datafusion/functions/src/core/getfield.rs

+                    let field_name = sv
+                        .as_ref()
+                        .and_then(|sv| {
+                            sv.try_as_str().flatten().filter(|s| !s.is_empty())
+                        })
+                        .ok_or_else(|| {
+                            datafusion_common::DataFusionError::Execution(
+                                "Field name must be a non-empty string".to_string(),
+                            )
+                        })?;


This logic to extract field_name from the ScalarValue is duplicated from the DataType::Struct match arm below (lines 425-434). To improve maintainability and reduce code duplication, you could extract this into a helper function.

For example:

fn get_field_name_from_scalar<'a>(
    sv: &Option<&'a ScalarValue>,
) -> Result<&'a str, datafusion_common::DataFusionError> {
    // Explicit lifetime: with two reference lifetimes in the argument,
    // the returned &str cannot rely on lifetime elision.
    sv.as_ref()
        .and_then(|sv| sv.try_as_str().flatten().filter(|s| !s.is_empty()))
        .ok_or_else(|| {
            datafusion_common::DataFusionError::Execution(
                "Field name must be a non-empty string".to_string(),
            )
        })
}

Then both places could be simplified to:

let field_name = get_field_name_from_scalar(sv)?;

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The logic is duplicated, and it would be good to extract it into a helper function and reuse it. This would avoid maintaining two copies and prevent regressions that land in only one of them.

martin-augment · 2026-03-25T12:06:00Z

Use internal_err! macro consistently

The downcast error at line ~216 uses datafusion_common::DataFusionError::Internal(format!(...)) directly. The rest of the file (and codebase) uses the internal_err! macro. Please align:
// Instead of:
datafusion_common::DataFusionError::Internal(format!("Failed to downcast ..."))
// Use:
internal_err!("Failed to downcast dictionary with key type {}", key_type)?

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The macro helpers are already imported and used in this module, so it would be better to use them instead of constructing the errors via their constructors.

functions: Add dict support for get field

fd6ec12
