Added capability to fetch dictionary values #9011
DarkWanderer wants to merge 3 commits into apache:main from …
Conversation
Unfortunately dictionary encoding is best effort, and writers will fall back to different encodings if the dictionary gets too large. The result is you need to know if all the pages are dictionary encoded in order to be able to make this optimisation - iirc this information is not encoded anywhere but the page header itself...

Putting this aside, there are likely some challenges around typing with this approach. IMO bloom filters are the recommended way to handle this sort of thing; dictionaries are more of an encoding optimisation.

Edit: I could see a world where users could opt in to have ArrowFilter passed the dictionary in a pre-pass, and for this to then allow the reader to skip decoding dictionary-encoded pages if there are no matches, but unless I'm remembering incorrectly this wouldn't allow skipping the IO...
That is exactly what I am hoping for - to perform a multiple-range fetch of a few MB from
This is why the page encoding stats exist in the column metadata. This will tell you if all pages in a given chunk are dictionary encoded. (See `parquet/src/file/metadata/mod.rs` lines 1072 to 1083 at 116ae12.)

That said, I'm not nuts about duplicating the logic to decode the dictionary page, and just for the async reader. If we're going down this path, then I think all readers should have access. Perhaps this could be in
Thanks for confirming, that was my intuition as well
I would note that there is already a disparity in the available APIs. Specifically, I hear your argument about a consistent API and code duplication; however, I will think of a better place to insert it.
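To make the encoding-stats check discussed above concrete, here is a minimal standalone sketch. The types below (`PageType`, `Encoding`, `PageEncodingStats`) are simplified stand-ins for the real definitions in the `parquet` crate, and the function name is illustrative, not an actual API:

```rust
// Simplified stand-ins for the parquet crate's page metadata types;
// the real ones live in parquet::basic and parquet::file::metadata.
#[derive(PartialEq)]
enum PageType {
    DataPage,
    DictionaryPage,
}

#[derive(PartialEq)]
enum Encoding {
    Plain,
    RleDictionary,
}

struct PageEncodingStats {
    page_type: PageType,
    encoding: Encoding,
    count: i32,
}

/// Returns true only if every *data* page in a column chunk is
/// dictionary encoded, i.e. the writer never fell back to PLAIN.
/// Returns false when stats are absent, since we cannot prove it.
fn all_data_pages_dictionary_encoded(stats: Option<&[PageEncodingStats]>) -> bool {
    match stats {
        None => false,
        Some(s) => s
            .iter()
            .filter(|p| p.page_type == PageType::DataPage)
            .all(|p| p.encoding == Encoding::RleDictionary),
    }
}

fn main() {
    let stats = vec![
        PageEncodingStats {
            page_type: PageType::DictionaryPage,
            encoding: Encoding::Plain,
            count: 1,
        },
        PageEncodingStats {
            page_type: PageType::DataPage,
            encoding: Encoding::RleDictionary,
            count: 10,
        },
    ];
    // All data pages are dictionary encoded, so the optimisation is safe.
    assert!(all_data_pages_dictionary_encoded(Some(&stats)));
    // Without stats we must conservatively assume a fallback happened.
    assert!(!all_data_pages_dictionary_encoded(None));
}
```

The key point of the comment above is the `None => false` branch: if the writer did not record encoding stats, the reader cannot safely treat the dictionary as a complete index of the chunk's values.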
Which issue does this PR close?
Rationale for this change
Some databases, one example being Grafana Tempo, use column dictionaries as makeshift column indexes to improve ad-hoc filtering speed. Checking whether a low-cardinality value is present in the dictionary makes it possible to pre-filter data effectively by skipping whole row groups. This PR adds this capability.
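The row-group pre-filtering idea can be sketched independently of this PR's API. In the sketch below, each row group's decoded dictionary is modelled as a plain `HashSet<String>` (a stand-in for decoded dictionary pages; all names are illustrative), and a row group without a dictionary must be kept because nothing rules it out:

```rust
use std::collections::HashSet;

/// Given the decoded dictionary of each row group, return the indices
/// of row groups that could contain `needle`. Row groups whose
/// dictionary lacks the value are skipped entirely, avoiding their IO.
/// A row group with no dictionary (None) must be kept, since we cannot
/// rule it out without reading its pages.
fn row_groups_to_read(dictionaries: &[Option<HashSet<String>>], needle: &str) -> Vec<usize> {
    dictionaries
        .iter()
        .enumerate()
        .filter(|(_, dict)| match dict {
            Some(d) => d.contains(needle),
            None => true, // not (fully) dictionary encoded: cannot skip
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let dicts = vec![
        Some(HashSet::from(["GET".to_string(), "POST".to_string()])),
        Some(HashSet::from(["PUT".to_string()])),
        None, // e.g. the writer fell back to plain encoding
    ];
    // Only row groups 0 and 2 can possibly contain "GET";
    // row group 1 is skipped without touching its data pages.
    assert_eq!(row_groups_to_read(&dicts, "GET"), vec![0, 2]);
}
```

This is only sound when the dictionary is known to be complete for the chunk, which is why the page encoding stats discussed in the review matter.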
What changes are included in this PR?
Add public `get_row_group_column_dictionary` function to `ParquetRecordBatchStreamBuilder`

Are these changes tested?
Tests have been added
Are there any user-facing changes?
Public API extension for `ParquetRecordBatchStreamBuilder`