Spark, Arrow, Parquet: Add vectorized read support for parquet RLE encoded data pages #14853
Open: jbewing wants to merge 5 commits into apache:main from jbewing:vectorized-parquet-rle-reads
+273
−12
5 commits
dbebc5d Add vectorized read support for parquet RLE encoded data pages (jbewing)
51cbe97 Fix dictionary-encoded vectorized read tests (jbewing)
dda0185 Add test to verify fall through for reading non-boolean rle encoded d… (jbewing)
41286c4 Run spotless for Spark 3.5 & 3.4 (jbewing)
7e75604 Merge remote-tracking branch 'upstream/main' into vectorized-parquet-… (jbewing)
111 changes: 111 additions & 0 deletions
...pache/iceberg/arrow/vectorized/parquet/VectorizedRunLengthEncodedParquetValuesReader.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
| package org.apache.iceberg.arrow.vectorized.parquet; | ||
| | ||
| import org.apache.arrow.vector.FieldVector; | ||
| import org.apache.parquet.io.api.Binary; | ||
| | ||
| /** | ||
| * A {@link VectorizedValuesReader} implementation for the encoding type Run Length Encoding / RLE. | ||
| * | ||
| * @see <a | ||
| * href="https://parquet.apache.org/docs/file-format/data-pages/encodings/#run-length-encoding--bit-packing-hybrid-rle--3"> | ||
| * Parquet format encodings: RLE</a> | ||
| */ | ||
| public class VectorizedRunLengthEncodedParquetValuesReader extends BaseVectorizedParquetValuesReader | ||
| implements VectorizedValuesReader { | ||
| | ||
| // Since we can only read booleans, bit-width is always 1 | ||
| private static final int BOOLEAN_BIT_WIDTH = 1; | ||
| // Since this can only be used in the context of a data page, the definition level can be set to | ||
| // anything, and it doesn't really matter | ||
| private static final int IRRELEVANT_MAX_DEFINITION_LEVEL = 1; | ||
| // For boolean values in data page v1 & v2, length is always prepended to the encoded data | ||
| // See | ||
| // https://parquet.apache.org/docs/file-format/data-pages/encodings/#run-length-encoding--bit-packing-hybrid-rle--3 | ||
| private static final boolean ALWAYS_READ_LENGTH = true; | ||
| | ||
| public VectorizedRunLengthEncodedParquetValuesReader(boolean setArrowValidityVector) { | ||
| super( | ||
| BOOLEAN_BIT_WIDTH, | ||
| IRRELEVANT_MAX_DEFINITION_LEVEL, | ||
| ALWAYS_READ_LENGTH, | ||
| setArrowValidityVector); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public byte readByte() { | ||
| throw new UnsupportedOperationException("readByte is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public short readShort() { | ||
| throw new UnsupportedOperationException("readShort is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public long readLong() { | ||
| throw new UnsupportedOperationException("readLong is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public float readFloat() { | ||
| throw new UnsupportedOperationException("readFloat is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public double readDouble() { | ||
| throw new UnsupportedOperationException("readDouble is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public Binary readBinary(int len) { | ||
| throw new UnsupportedOperationException("readBinary is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public void readIntegers(int total, FieldVector vec, int rowId) { | ||
| throw new UnsupportedOperationException("readIntegers is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public void readLongs(int total, FieldVector vec, int rowId) { | ||
| throw new UnsupportedOperationException("readLongs is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public void readFloats(int total, FieldVector vec, int rowId) { | ||
| throw new UnsupportedOperationException("readFloats is not supported"); | ||
| } | ||
| | ||
| /** RLE only supports BOOLEAN as a data page encoding */ | ||
| @Override | ||
| public void readDoubles(int total, FieldVector vec, int rowId) { | ||
| throw new UnsupportedOperationException("readDoubles is not supported"); | ||
| } | ||
| } |
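
For context on the constants in the class above (bit width 1, length always read), here is a minimal sketch of how a length-prefixed RLE / bit-packing hybrid stream of booleans can be decoded, following the Parquet encoding page linked in the Javadoc. It is not the Iceberg implementation, which lives in `BaseVectorizedParquetValuesReader` and is not shown in this diff; the class name `BooleanRleSketch` and its methods are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not part of this PR or of Iceberg.
final class BooleanRleSketch {

  /** Decodes numValues booleans from a length-prefixed RLE/bit-packed hybrid buffer. */
  static boolean[] decode(byte[] page, int numValues) {
    ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
    int length = buf.getInt(); // 4-byte little-endian length prefix
    int end = buf.position() + length;
    boolean[] out = new boolean[numValues];
    int read = 0;
    while (read < numValues && buf.position() < end) {
      int header = readUnsignedVarInt(buf);
      if ((header & 1) == 0) {
        // RLE run: run length is header >>> 1, followed by the repeated
        // value stored in a single byte (bit width 1 rounds up to 1 byte).
        int count = header >>> 1;
        boolean value = (buf.get() & 1) == 1;
        for (int i = 0; i < count && read < numValues; i++) {
          out[read++] = value;
        }
      } else {
        // Bit-packed run: header >>> 1 groups of 8 values; at bit width 1
        // each group is one byte, packed from the least significant bit.
        int groups = header >>> 1;
        for (int g = 0; g < groups; g++) {
          int packed = buf.get() & 0xFF;
          for (int bit = 0; bit < 8 && read < numValues; bit++) {
            out[read++] = ((packed >>> bit) & 1) == 1;
          }
        }
      }
    }
    if (read < numValues) {
      throw new IllegalStateException("Expected " + numValues + " values, decoded " + read);
    }
    return out;
  }

  /** Reads an unsigned LEB128 varint, as used for run headers. */
  static int readUnsignedVarInt(ByteBuffer buf) {
    int value = 0;
    int shift = 0;
    int b;
    do {
      b = buf.get() & 0xFF;
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}
```

As a usage example, the payload bytes `0x0A 0x01 0x03 0x05` preceded by the length prefix `04 00 00 00` hold an RLE run of five `true` values followed by one bit-packed group; asking for eight values yields the five from the run and the first three bits of the group.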
Binary file added (+618 Bytes): parquet/src/testFixtures/resources/encodings/PLAIN/boolean_with_nulls.parquet
Binary file added (+623 Bytes): parquet/src/testFixtures/resources/encodings/RLE/boolean_with_nulls.parquet
This is quite a serious fall-through, given that the Parquet spec limits RLE to booleans, repetition and definition levels, and dictionary indices. Is it likely to occur in the wild?
If so, it probably merits a test case verifying that a column whose type != BOOLEAN cannot be init()ed with RLE data encoding.
Yeah, theoretically a malformed Parquet writer could produce this in the wild. That said, it wouldn't be to spec, given that boolean is the only data page type that can be RLE encoded; we handle the dictionary RLE in VectorizedDictionaryEncodedParquetValuesReader (directly above here), and the repetition levels are handled via VectorizedParquetDefinitionLevelReader.
All that to say, I think this is impossible. If a malformed writer does in fact write a file with an RLE-encoded non-boolean data page, it wouldn't be to spec, so we'd be correctly throwing here. I can add a negative test case for this, although I'd have to write a corrupt Parquet writer implementation to do so. Happy to do it if you think it adds value.
Also, FWIW, the full Parquet v2 vectorized impl PR (that this PR was split out from) has quite a few production PBs read under its belt at this point and hasn't hit anything like this in the wild.
Well, nothing is going to handle it, so it's not much of a source of data, is it? The key thing is not to cause damage to the system beyond the specific query failing.
Done in dda0185
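
The fall-through discussed in this thread can be sketched roughly as follows. This is a simplified, self-contained illustration under stated assumptions, not the code or the test added in dda0185: the class `RleDispatchSketch`, its enums, and `valuesReaderFor` are hypothetical stand-ins for the real page-reader dispatch, which is not shown on this page.

```java
// Hypothetical illustration only; names do not correspond to Iceberg classes.
final class RleDispatchSketch {

  enum Encoding { PLAIN, RLE }

  enum PhysicalType { BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BINARY }

  /**
   * Returns a label for the values reader that would be chosen, and throws for
   * combinations the Parquet spec does not allow for data pages.
   */
  static String valuesReaderFor(Encoding encoding, PhysicalType type) {
    switch (encoding) {
      case PLAIN:
        return "plain";
      case RLE:
        // Per the spec, BOOLEAN is the only data page type that may be RLE encoded.
        if (type == PhysicalType.BOOLEAN) {
          return "rle-boolean";
        }
        // Fall-through for anything else: fail this read rather than guess.
        throw new UnsupportedOperationException(
            "Cannot read column of type " + type + " with data page encoding " + encoding);
      default:
        throw new UnsupportedOperationException("Unknown encoding " + encoding);
    }
  }
}
```

A negative test along the lines requested above would then assert that `valuesReaderFor(Encoding.RLE, PhysicalType.INT32)` throws `UnsupportedOperationException` while the `BOOLEAN` case succeeds, so a malformed file fails only the query that reads it.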