feat(parquet): add content defined chunking for arrow writer #9450
alamb merged 21 commits into apache:main
Conversation
We don't necessarily need to store the codegen script in the repository. Alternatively we could just reference https://github.com/apache/arrow/blob/main/cpp/src/parquet/chunker_internal_generated.h as a source for cdc_generated.rs. Likely it won't be regenerated at all.
I think it is fine to check this in
Hi @kszucs. 👋 Apologies, I have been unusually bandwidth constrained lately. I will try to give this a good look in the next few days. Thank you for your patience 🙏 (and for adding this to arrow-rs).

Hi @etseidl! No worries, I really appreciate you taking the time to review!
etseidl left a comment
There was a problem hiding this comment.
Flushing a few early observations/questions. Still need to do the deep dive.
etseidl left a comment
There was a problem hiding this comment.
Just a few more random comments...I'll get into the meat of the chunking tomorrow. Looking good so far!
…` and use it in content defined chunking
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
As it adds a new opt-in option with minimal impact on the existing code, I'd say slip it in if possible.

I think there is something wrong with the tests in this PR. Specifically, I did an ablation study (what a fancy word!) to verify the test coverage in this PR: I disconnected all the CDC code from the writer. Here is how I ran the tests:

cargo test -p parquet --features=arrow -- cdc

Almost all of them still pass (only 3 fail), which suggests they aren't actually testing the CDC code.

For this PR I think we need an "end to end" test that shows the use case that the CDC code is intended to solve. For example, such a test could write two parquet files with the same data except for some chosen rows in the middle, and verify that most of the pages are the same. It is not entirely clear to me how a "content addressable filesystem" works (i.e. how does it know where the parquet pages start/end), so having that documented / mocked out would also be nice.
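The suggested end-to-end test could be sketched roughly like this. It is a hedged illustration, not code from this PR: it uses a toy per-byte boundary rule as a stand-in for the real gearhash chunker, and all names (`cdc_chunks`, `shared_fraction`, `lcg_bytes`) are hypothetical. A real test would chunk the written Parquet bytes at page boundaries instead of raw bytes.

```rust
use std::collections::HashSet;

/// Toy content-defined chunker: declare a boundary after any byte whose
/// low 6 bits are zero (~1/64 of random bytes, so ~64-byte chunks on
/// average). A real implementation uses a rolling gearhash, but the
/// resynchronization property demonstrated here is the same.
fn cdc_chunks(data: &[u8]) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    for (i, &b) in data.iter().enumerate() {
        if b & 0x3f == 0 {
            chunks.push(&data[start..=i]);
            start = i + 1;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]);
    }
    chunks
}

/// Fraction of `b`'s chunks that also appear in `a` (a proxy for how many
/// Parquet pages a content-addressable store could deduplicate).
fn shared_fraction(a: &[u8], b: &[u8]) -> f64 {
    let seen: HashSet<&[u8]> = cdc_chunks(a).into_iter().collect();
    let chunks_b = cdc_chunks(b);
    let shared = chunks_b.iter().filter(|c| seen.contains(*c)).count();
    shared as f64 / chunks_b.len() as f64
}

/// Deterministic pseudo-random bytes (simple LCG) standing in for data.
fn lcg_bytes(n: usize, mut seed: u64) -> Vec<u8> {
    (0..n)
        .map(|_| {
            seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1);
            (seed >> 33) as u8
        })
        .collect()
}
```

Because the boundary rule here depends only on local content, an insertion in the middle perturbs only the chunks overlapping the edit, so the shared fraction stays close to 1 even after the edit.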
Ok(())
}

fn write_with_chunkers(
Having two code paths (write and write_with_chunkers) is kind of weird to me and seems inconsistent with other optional features like encoding or compression.
I wonder if it would encapsulate the code more if we extended write with an Option<&[ContentDefinedChunker]> rather than have two separate external functions.
You tell me :)
I merely tried to avoid any breaking changes.
I can tinker as a follow-on PR -- I don't think any changes are needed here
I think it is fine to check this in
The CDC feature in parquet essentially splits pages according to the columns' content, resulting in fairly stable pages even if there are inserted or deleted records. The HF xet filesystem is format agnostic (similar to, for example, a deduplicating backup solution like restic) and chunks the byte stream directly. The main issue with parquet is the page level compression, which breaks the deduplication if the page values change - this CDC feature makes the pages more or less stable depending on their content. BTW using this feature anyone could implement a "parquet page store" storing only unique parquet pages and some metadata to reassemble the parquet files.

Is this easy to show? I realize this is an important use case for hugging face, but it would be nice to have some example of how this could be used by others that are not using the xet filesystem.
I have been thinking of a page store prototype for a while actually, that would kinda look like:
A format agnostic CAS is different since it does the chunking on the byte stream directly. I have a naive and very simple implementation for that here https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs
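The "parquet page store" idea discussed above could be mocked roughly like this. This is a hedged sketch, not the linked store.rs implementation: the `PageStore` name is hypothetical, and std's `DefaultHasher` stands in for the cryptographic hash a real store would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Minimal content-addressable store: each page is stored once, keyed by
/// a hash of its bytes, and a "file" is just an ordered manifest of keys.
#[derive(Default)]
struct PageStore {
    pages: HashMap<u64, Vec<u8>>,
}

impl PageStore {
    /// Store one page, returning its key; duplicate pages cost nothing.
    fn put(&mut self, page: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        page.hash(&mut h);
        let key = h.finish();
        self.pages.entry(key).or_insert_with(|| page.to_vec());
        key
    }

    /// Store a whole file as a manifest of page keys.
    fn put_file(&mut self, pages: &[&[u8]]) -> Vec<u64> {
        pages.iter().map(|p| self.put(p)).collect()
    }

    /// Reassemble a file from its manifest.
    fn get_file(&self, manifest: &[u64]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for key in manifest {
            out.extend_from_slice(self.pages.get(key)?);
        }
        Some(out)
    }
}
```

With CDC-stable pages, two versions of a file that differ in a few rows would share most manifest entries, so only the changed pages occupy new storage.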
I'm updating the testing suite to closely match the C++ one.

@alamb with CDC disabled at the writer level, 85 of the 101 tests now fail (the other 16 test the testing utilities and empty tables).

Thanks!
I filed a ticket to track this idea so it doesn't get lost on an old PR.

Thank you @alamb and @etseidl! I will soon be adding a config option to datafusion as well, and combined with apache/opendal#7185 (and object_store_opendal), datafusion on huggingface will provide pretty good performance!

FWIW I believe I have found a bug in this PR (writing nested lists!). It is not a regression, but just FYI.

Thanks @alamb for catching it, looking at it tomorrow!
## Rationale for this change

Closes #21110. Expose the new Content-Defined Chunking feature from parquet-rs apache/arrow-rs#9450.

## What changes are included in this PR?

New parquet writer options for enabling CDC.

## Are these changes tested?

In progress.

## Are there any user-facing changes?

New config options. Depends on the 58.1 arrow-rs release.
Which issue does this PR close?
Rationale for this change
Rust implementation of apache/arrow#45360
Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row causes all subsequent pages to shift, resulting in nearly every byte being re-uploaded to content-addressable storage (CAS) systems. CDC determines page boundaries via a rolling gearhash over column values, so unchanged data produces identical pages across different writes, enabling storage cost reductions and faster upload times.
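The rolling-gearhash boundary rule described above can be sketched as follows. This is an illustrative sketch only: the gear table, mask, and size parameters here are made up, not the constants shipped in this PR, and `GearChunker` is a hypothetical name.

```rust
/// Sketch of gearhash-based content-defined boundary detection.
struct GearChunker {
    gear: [u64; 256],
    mask: u64,
    min_size: usize,
    max_size: usize,
}

impl GearChunker {
    fn new(min_size: usize, max_size: usize, mask_bits: u32) -> Self {
        // Fill the gear table with deterministic pseudo-random values
        // (real implementations ship a precomputed table).
        let mut gear = [0u64; 256];
        let mut seed = 0x9e3779b97f4a7c15u64;
        for g in gear.iter_mut() {
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            *g = seed;
        }
        Self { gear, mask: (1u64 << mask_bits) - 1, min_size, max_size }
    }

    /// Split `data` at content-defined boundaries. The shift in the hash
    /// update means the hash depends only on the last ~64 bytes, so an
    /// insertion upstream perturbs at most a couple of chunks before the
    /// boundaries resynchronize.
    fn chunk_sizes(&self, data: &[u8]) -> Vec<usize> {
        let mut sizes = Vec::new();
        let (mut hash, mut len) = (0u64, 0usize);
        for &b in data {
            hash = (hash << 1).wrapping_add(self.gear[b as usize]);
            len += 1;
            if (len >= self.min_size && hash & self.mask == 0) || len >= self.max_size {
                sizes.push(len);
                hash = 0;
                len = 0;
            }
        }
        if len > 0 {
            sizes.push(len);
        }
        sizes
    }
}
```

The `min_size`/`max_size` clamps mirror the role of the `min_chunk_size`/`max_chunk_size` options mentioned below: they bound chunk sizes while the mask controls the average size (`mask_bits = 9` gives roughly 512-byte chunks on uniform data).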
See more details in https://huggingface.co/blog/parquet-cdc
The original C++ implementation: apache/arrow#45360
Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, where I already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better):
What changes are included in this PR?
- New chunker module under parquet/src/column/chunker/
- ArrowColumnWriterCdcOptions struct (min_chunk_size, max_chunk_size, norm_level)
- repeated_ancestor_def_level field for nested field values iteration

Are these changes tested?
Yes, unit tests are located in cdc.rs and ported from the C++ implementation.

Are there any user-facing changes?
New experimental API, disabled by default — no behavior change for existing code: