Add support to cast from UnionArray#9544
Conversation
friendlymatthew
left a comment
There was a problem hiding this comment.
self review
| (_, Union(_, _)) => Err(ArrowError::CastError(format!( | ||
| "Casting from {from_type} to {to_type} not supported" | ||
| ))), |
There was a problem hiding this comment.
This PR does not support casting scalar types into a Union type
It's not a feature that we currently need and will probably need more complex work to get it correct. Though, I think this is a cool/valid feature. I can file an issue if liked
| let type_ids = array.type_ids().clone(); | ||
| let offsets = array.offsets().cloned(); | ||
|
|
||
| let new_children: Vec<ArrayRef> = from_fields | ||
| .iter() | ||
| .map(|(from_id, _from_field)| { | ||
| let (_, to_field) = to_fields | ||
| .iter() | ||
| .find(|(to_id, _)| *to_id == from_id) | ||
| .ok_or_else(|| { | ||
| ArrowError::CastError(format!( | ||
| "Cannot cast union: type_id {from_id} not found in target union" | ||
| )) | ||
| })?; | ||
| let child = array.child(from_id); | ||
| cast_with_options(child.as_ref(), to_field.data_type(), cast_options) | ||
| }) | ||
| .collect::<Result<_, _>>()?; |
There was a problem hiding this comment.
look ups like this can go from n^2 to nlogn if we implement #8937, since sorting by type_id will give us logn searching
9c19cb1 to
8dbe39f
Compare
|
Thanks -- I hope to go through arrow prs more carefully starting tomorrow |
alamb
left a comment
There was a problem hiding this comment.
thank you @friendlymatthew
I like where this is headed.
I left some suggestions - let me know if it makes sense
| array: &UnionArray, | ||
| from_fields: &UnionFields, | ||
| to_fields: &UnionFields, | ||
| _to_mode: UnionMode, |
There was a problem hiding this comment.
why is the _to_mode parameter ignored?
|
|
||
| #[test] | ||
| fn test_cast_union_prefers_exact_type_match() { | ||
| // Union(Int32, Int64) → Int64: Int64 is an exact match, so only |
There was a problem hiding this comment.
Can you please document the expected union casting behavior in the documentation?
Specifically, somewhere here:
https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html#durations-and-intervals
That will make it easier to understand the intended semantics for union casting as well as verify that this code implements the semantics correctly
| /// Since union extraction inherently introduces nulls for non-matching rows, | ||
| /// the target type's inner fields are made nullable to avoid validation errors | ||
| /// (e.g., casting to `List(non-nullable Utf8)` becomes `List(nullable Utf8)`). | ||
| fn cast_union_to_type( |
There was a problem hiding this comment.
I do wonder on the semantics being introduced here; specifically about how we apparently allow casting if at least one field can be cast, but we null the other fields. It might not seem intuitive; I do wonder if we have some existing precedent in other systems we could follow 🤔
There was a problem hiding this comment.
Another potential option is to introduce the behavior into union_extract: https://docs.rs/arrow/latest/arrow/compute/kernels/union_extract/fn.union_extract.html
For example, maybe union_extract_with_option that takes some sort of options struct that allows better control over the extraction
096984a to
3056c06
Compare
3056c06 to
1cf2085
Compare
friendlymatthew
left a comment
There was a problem hiding this comment.
I restarted this PR to cut down on the scope.
This PR adds support for casting union arrays to scalar types only. The casting doesn't reimplement the extraction logic, but now builds on the existing union_extract kernel
The core addition is union_extract_by_type which selects the best-matching variant from a union for a given target type.
The variant selection uses a 3-pass heuristic: exact type match, same type family, then any castable variant
Curious to hear your thoughts!
friendlymatthew
left a comment
There was a problem hiding this comment.
self review
| // this is used during variant selection to prefer a "close" type over a distant cast | ||
| // for example: when targeting Utf8View, a Utf8 variant is preferred over Int32 despite both being castable | ||
| fn same_type_family(a: &DataType, b: &DataType) -> bool { | ||
| use DataType::*; | ||
| matches!( | ||
| (a, b), | ||
| (Utf8 | LargeUtf8 | Utf8View, Utf8 | LargeUtf8 | Utf8View) | ||
| | ( | ||
| Binary | LargeBinary | BinaryView, | ||
| Binary | LargeBinary | BinaryView | ||
| ) | ||
| | (Int8 | Int16 | Int32 | Int64, Int8 | Int16 | Int32 | Int64) | ||
| | ( | ||
| UInt8 | UInt16 | UInt32 | UInt64, | ||
| UInt8 | UInt16 | UInt32 | UInt64 | ||
| ) | ||
| | (Float16 | Float32 | Float64, Float16 | Float32 | Float64) | ||
| ) | ||
| } |
There was a problem hiding this comment.
I wonder if this is something that would be useful as a DataType api...
Equivalence classes let you answer "are these the same logical data" without caring about the physical repr
There was a problem hiding this comment.
I think the notion of "same logical type" gets a bit subtle -- for example LargeUtf8 and Utf8View might seem like the same logical type, but there are some values that LargeUTF8 can store that Utf8View can not (e.g. individual strings over 4GB)
| // variant selection heuristic — 3 passes with decreasing specificity: | ||
| // | ||
| // first pass: field type == target type | ||
| // second pass: field and target are in the same equivalence class | ||
| // (e.g., Utf8 and Utf8View are both strings) | ||
| // third pass: field can be cast to target | ||
| // note: this is the most permissive and may lose information | ||
| // also, the matching logic is greedy so it will pick the first 'castable' variant | ||
| // | ||
| // each pass picks the first matching variant by type_id order. |
There was a problem hiding this comment.
the main heuristic!
I think pass 2 is necessary because can_cast_types is permissive enough that without it, Union(Int32, Utf8) cast to a Utf8View would pick Int32 over Utf8...
Though we can fix this by defining the Utf8 variant first. But I wonder if we should leave it up to the user...
There was a problem hiding this comment.
I think the user can have direct control if they want and can use union_extract and cast
0dc9c7b to
b47a20a
Compare
UnionArray
alamb
left a comment
There was a problem hiding this comment.
I think this one is good to go from my perspective. Thank you @friendlymatthew
Some comments;
terminology
I found the use of the term "variant" with UnionArray confusing. I think The Arrow Spec refers to this concept as "child array" and I think the arrow-rs code refers to it astype_id
Maybe we can use the same terminology
| // this is used during variant selection to prefer a "close" type over a distant cast | ||
| // for example: when targeting Utf8View, a Utf8 variant is preferred over Int32 despite both being castable | ||
| fn same_type_family(a: &DataType, b: &DataType) -> bool { | ||
| use DataType::*; | ||
| matches!( | ||
| (a, b), | ||
| (Utf8 | LargeUtf8 | Utf8View, Utf8 | LargeUtf8 | Utf8View) | ||
| | ( | ||
| Binary | LargeBinary | BinaryView, | ||
| Binary | LargeBinary | BinaryView | ||
| ) | ||
| | (Int8 | Int16 | Int32 | Int64, Int8 | Int16 | Int32 | Int64) | ||
| | ( | ||
| UInt8 | UInt16 | UInt32 | UInt64, | ||
| UInt8 | UInt16 | UInt32 | UInt64 | ||
| ) | ||
| | (Float16 | Float32 | Float64, Float16 | Float32 | Float64) | ||
| ) | ||
| } |
There was a problem hiding this comment.
I think the notion of "same logical type" gets a bit subtle -- for example LargeUtf8 and Utf8View might seem like the same logical type, but there are some values that LargeUTF8 can store that Utf8View can not (e.g. individual strings over 4GB)
| // variant selection heuristic — 3 passes with decreasing specificity: | ||
| // | ||
| // first pass: field type == target type | ||
| // second pass: field and target are in the same equivalence class | ||
| // (e.g., Utf8 and Utf8View are both strings) | ||
| // third pass: field can be cast to target | ||
| // note: this is the most permissive and may lose information | ||
| // also, the matching logic is greedy so it will pick the first 'castable' variant | ||
| // | ||
| // each pass picks the first matching variant by type_id order. |
There was a problem hiding this comment.
I think the user can have direct control if they want and can use union_extract and cast
| ))); | ||
| }; | ||
|
|
||
| let extracted = union_extract(union_array, field.name())?; |
There was a problem hiding this comment.
technically this will not work if there are repeated field names in the Union (as resolve_variant will return a string and union_extract will return the first field with that name)
You could potentially avoid this by returning the type_id / index instead
We can also handle it in a follow on PR
ad3291c to
ecdefa8
Compare
|
Thank you @friendlymatthew |
Which issue does this PR close?
arrow-cast#9225Rationale for this change
This PR adds native support for casting Union arrays in the cast kernel. Previously,
can_cast_typesandcast_with_optionshad no handling for union types at all