perf: improve calculating length performance for GenericByteArray in row conversion#9078
Conversation
…n row conversion
|
run benchmark row_format |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark row_format |
|
🤖 |
|
🤖: Benchmark completed Details
|
…ow conversion (#9080) # Which issue does this PR close? N/A # Rationale for this change Making the row length calculation faster which result in faster row conversion # What changes are included in this PR? 1. Instead of iterating over the bytes and getting the length from the byte slice, we use the offsets directly, this is faster as it saves us going to the buffer 2. Added new API for `GenericByteViewArray` (explained below) # Are these changes tested? Yes # Are there any user-facing changes? Yes, added `lengths` function to `GenericByteViewArray` to get an iterator over the lengths of the items in the array ----- Related to: - #9078 - #9079 --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Conflicts: # arrow-row/src/lib.rs
|
@alamb can you please review and hopefully merge if approved with no comments |
| .lengths() | ||
| .zip(nulls.iter()) | ||
| .map(|(length, is_valid)| if is_valid { Some(length) } else { None }) | ||
| .map(variable::padded_length), |
There was a problem hiding this comment.
I verified that variable::encoded_len is calling padded_len
| array | ||
| .offsets() | ||
| .lengths() | ||
| .map(Some) |
There was a problem hiding this comment.
You could potentially avoid another branch by making a version of padded_length that takes usize instead of Option<usize>
There was a problem hiding this comment.
I hoped that it would be optimized away, but I see it is not.
fixed
|
Very nice! |
… conversion (#9079) # Which issue does this PR close? N/A # Rationale for this change Making the row length calculation faster which result in faster row conversion # What changes are included in this PR? 1. Instead of iterating over the rows and getting the length from the byte slice, we use the offsets directly, this 2. Added 3 new APIs for `Rows` (explained below) # Are these changes tested? Yes # Are there any user-facing changes? Yes, added 3 functions to `Rows`: - `row_len` - get the row length at index - `row_len_unchecked` - get the row length at index without bound checks - `lengths` - get iterator over the lengths of the rows ----- Related to: - #9078 - #9080 --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
| pub fn padded_length(a: Option<usize>) -> usize { | ||
| match a { | ||
| Some(a) if a <= BLOCK_SIZE => 1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1), | ||
| Some(a) => non_null_padded_length(a), |
…ow conversion (apache#9080) # Which issue does this PR close? N/A # Rationale for this change Making the row length calculation faster which result in faster row conversion # What changes are included in this PR? 1. Instead of iterating over the bytes and getting the length from the byte slice, we use the offsets directly, this is faster as it saves us going to the buffer 2. Added new API for `GenericByteViewArray` (explained below) # Are these changes tested? Yes # Are there any user-facing changes? Yes, added `lengths` function to `GenericByteViewArray` to get an iterator over the lengths of the items in the array ----- Related to: - apache#9078 - apache#9079 --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…n row conversion (apache#9078) # Which issue does this PR close? N/A # Rationale for this change Making the row length calculation faster which result in faster row conversion # What changes are included in this PR? Instead of iterating over the items in the array and getting the length from the byte slice, we use the offsets directly and zip with nulls if necessary # Are these changes tested? Existing tests # Are there any user-facing changes? Faster encoding ------ Split to 2 more PRs as the other 2 add a change to the public API Related to: - apache#9079 - apache#9080 --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
… conversion (apache#9079) # Which issue does this PR close? N/A # Rationale for this change Making the row length calculation faster which result in faster row conversion # What changes are included in this PR? 1. Instead of iterating over the rows and getting the length from the byte slice, we use the offsets directly, this 2. Added 3 new APIs for `Rows` (explained below) # Are these changes tested? Yes # Are there any user-facing changes? Yes, added 3 functions to `Rows`: - `row_len` - get the row length at index - `row_len_unchecked` - get the row length at index without bound checks - `lengths` - get iterator over the lengths of the rows ----- Related to: - apache#9078 - apache#9080 --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
N/A
Rationale for this change
Making the row length calculation faster which result in faster row conversion
What changes are included in this PR?
Instead of iterating over the items in the array and getting the length from the byte slice, we use the offsets directly and zip with nulls if necessary
Are these changes tested?
Existing tests
Are there any user-facing changes?
Faster encoding
Split to 2 more PRs as the other 2 add a change to the public API
Related to: