
feat(parquet): add content defined chunking for arrow writer #9450

Merged
alamb merged 21 commits into apache:main from kszucs:content-defined-chunking
Mar 20, 2026
Conversation

Member

@kszucs kszucs commented Feb 20, 2026

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

Rust implementation of apache/arrow#45360

Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row causes all subsequent pages to shift, resulting in nearly every byte being re-uploaded to content-addressable storage (CAS) systems. CDC determines page boundaries via a rolling gearhash over column values, so unchanged data produces identical pages across different writes, enabling storage cost reductions and faster uploads.

See more details in https://huggingface.co/blog/parquet-cdc
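To illustrate the rolling gearhash boundary rule, here is a minimal sketch. It is illustrative only: the gear table below is generated with a splitmix64-style mixer, whereas the real implementation ships a fixed generated table and additionally enforces min/max chunk sizes and a `norm_level`-adjusted mask via `CdcOptions`.

```rust
/// Deterministic pseudo-random gear table (an illustrative stand-in for
/// the fixed generated table used by the actual implementation).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut x: u64 = 0;
    for entry in table.iter_mut() {
        // splitmix64 step: cheap, well-distributed 64-bit values
        x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = x;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Return chunk end offsets. A boundary is declared whenever the low bits
/// of the rolling hash match the mask, so identical content (once the hash
/// has resynchronized) yields identical boundaries.
fn chunk_boundaries(data: &[u8], gear: &[u64; 256], mask: u64) -> Vec<usize> {
    let mut hash: u64 = 0;
    let mut boundaries = Vec::new();
    for (i, &byte) in data.iter().enumerate() {
        // Rolling gear hash: shift out old influence, mix in the new byte.
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        if hash & mask == 0 {
            boundaries.push(i + 1);
            hash = 0; // restart the hash at each chunk boundary
        }
    }
    boundaries
}
```

Because earlier bytes cannot affect later chunks once a boundary resets the hash, appending data leaves all previously found boundaries unchanged, which is exactly the property that keeps unchanged pages byte-identical across writes.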

The original C++ implementation: apache/arrow#45360

Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, where I have already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better):

[image: deduplication comparison results]

What changes are included in this PR?

  • Content-defined chunker at parquet/src/column/chunker/
  • Arrow writer integration in ArrowColumnWriter
  • Writer properties via CdcOptions struct (min_chunk_size, max_chunk_size, norm_level)
  • ColumnDescriptor: added repeated_ancestor_def_level field for iterating over nested field values

Are these changes tested?

Yes — unit tests are located in cdc.rs and ported from the C++ implementation.

Are there any user-facing changes?

New experimental API, disabled by default — no behavior change for existing code:

// Simple toggle (256 KiB min, 1 MiB max, norm_level 0)
let props = WriterProperties::builder()
    .set_content_defined_chunking(true)
    .build();

// Explicit CDC parameters
let props = WriterProperties::builder()
    .set_cdc_options(CdcOptions { min_chunk_size: 128 * 1024, max_chunk_size: 512 * 1024, norm_level: 1 })
    .build();

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 20, 2026
@kszucs kszucs marked this pull request as ready for review February 25, 2026 08:12
@kszucs kszucs requested review from alamb and etseidl February 25, 2026 11:19
Comment thread parquet/src/schema/types.rs Outdated
Member Author

@kszucs kszucs Feb 25, 2026


We don't necessarily need to store the codegen script in the repository. Alternatively we could just reference https://github.com/apache/arrow/blob/main/cpp/src/parquet/chunker_internal_generated.h as a source for cdc_generated.rs. Likely it won't be regenerated at all.

Contributor


I think it is fine to check this in

Member Author

kszucs commented Mar 6, 2026

@alamb @etseidl could you please take a look? Let me know if you need extra context or if you have limited bandwidth.

Contributor

etseidl commented Mar 6, 2026

Hi @kszucs. 👋 Apologies, I have been unusually bandwidth constrained lately. I will try to give this a good look in the next few days. Thank you for your patience 🙏 (and for adding this to arrow-rs).

Member Author

kszucs commented Mar 6, 2026

Hi @etseidl! No worries, I really appreciate you taking the time to review!

Contributor

@etseidl etseidl left a comment


Flushing a few early observations/questions. Still need to do the deep dive.

Comment thread parquet/src/schema/types.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
Comment thread parquet/src/column/chunker/mod.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
Contributor

@etseidl etseidl left a comment


Just a few more random comments...I'll get into the meat of the chunking tomorrow. Looking good so far!

Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
Comment thread parquet/src/file/properties.rs Outdated
Comment thread parquet/src/file/properties.rs Outdated
Comment thread parquet/src/file/properties.rs
@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4077459897-386-d9ldr 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing content-defined-chunking (3b45dc8) to fcab5d2 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Details

group                                     content-defined-chunking               main
-----                                     ------------------------               ----
bool/bloom_filter                         1.00     55.9±0.36µs    19.0 MB/sec    1.05     58.6±0.45µs    18.1 MB/sec
bool/cdc                                  1.00     41.8±0.34µs    25.3 MB/sec  
bool/default                              1.00     42.1±0.33µs    25.2 MB/sec    1.06     44.5±0.39µs    23.8 MB/sec
bool/parquet_2                            1.00     48.0±0.37µs    22.1 MB/sec    1.05     50.4±0.51µs    21.0 MB/sec
bool/zstd                                 1.00     46.9±0.36µs    22.6 MB/sec    1.05     49.4±0.39µs    21.5 MB/sec
bool/zstd_parquet_2                       1.00     52.2±0.42µs    20.3 MB/sec    1.04     54.6±0.51µs    19.4 MB/sec
bool_non_null/bloom_filter                1.00     34.7±0.08µs    16.5 MB/sec    1.06     36.8±0.19µs    15.5 MB/sec
bool_non_null/cdc                         1.00     17.6±0.08µs    32.4 MB/sec  
bool_non_null/default                     1.00     17.7±0.07µs    32.4 MB/sec    1.14     20.2±0.17µs    28.4 MB/sec
bool_non_null/parquet_2                   1.00     25.9±0.13µs    22.1 MB/sec    1.11     28.8±0.23µs    19.9 MB/sec
bool_non_null/zstd                        1.00     21.9±0.08µs    26.1 MB/sec    1.12     24.6±0.20µs    23.2 MB/sec
bool_non_null/zstd_parquet_2              1.00     30.2±0.10µs    19.0 MB/sec    1.09     32.8±0.33µs    17.5 MB/sec
float_with_nans/bloom_filter              1.00    397.7±3.01µs   138.2 MB/sec    1.05    416.4±3.84µs   132.0 MB/sec
float_with_nans/cdc                       1.00    312.2±0.75µs   176.0 MB/sec  
float_with_nans/default                   1.00    311.1±0.79µs   176.7 MB/sec    1.03    319.4±0.77µs   172.1 MB/sec
float_with_nans/parquet_2                 1.00    486.2±0.89µs   113.0 MB/sec    1.02    496.5±1.30µs   110.7 MB/sec
float_with_nans/zstd                      1.00    434.6±0.91µs   126.5 MB/sec    1.03    445.8±0.97µs   123.3 MB/sec
float_with_nans/zstd_parquet_2            1.00    613.6±1.10µs    89.6 MB/sec    1.02    623.8±1.36µs    88.1 MB/sec
list_primitive/bloom_filter               1.03      2.9±0.12ms   740.3 MB/sec    1.00      2.8±0.01ms   765.7 MB/sec
list_primitive/cdc                        1.00   1224.9±7.93µs  1740.5 MB/sec  
list_primitive/default                    1.00   1224.8±8.33µs  1740.7 MB/sec    1.01   1241.6±6.59µs  1717.0 MB/sec
list_primitive/parquet_2                  1.09  1799.4±39.88µs  1184.8 MB/sec    1.00  1658.2±19.75µs  1285.7 MB/sec
list_primitive/zstd                       1.00      3.2±0.01ms   670.5 MB/sec    1.01      3.2±0.01ms   666.0 MB/sec
list_primitive/zstd_parquet_2             1.00      3.0±0.01ms   712.1 MB/sec    1.00      3.0±0.03ms   715.5 MB/sec
list_primitive_non_null/bloom_filter      1.00      3.4±0.01ms   626.1 MB/sec    1.00      3.4±0.01ms   627.9 MB/sec
list_primitive_non_null/cdc               1.00   1313.3±7.02µs  1619.8 MB/sec  
list_primitive_non_null/default           1.00   1312.2±6.97µs  1621.2 MB/sec    1.01   1324.5±7.93µs  1606.2 MB/sec
list_primitive_non_null/parquet_2         1.00      2.1±0.01ms  1001.9 MB/sec    1.11      2.3±0.06ms   905.7 MB/sec
list_primitive_non_null/zstd              1.00      4.0±0.01ms   536.8 MB/sec    1.11      4.4±0.01ms   485.2 MB/sec
list_primitive_non_null/zstd_parquet_2    1.00      4.3±0.01ms   489.6 MB/sec    1.06      4.6±0.01ms   462.2 MB/sec
primitive/bloom_filter                    1.08      3.0±0.32ms    57.7 MB/sec    1.00      2.8±0.06ms    62.6 MB/sec
primitive/cdc                             1.00    573.3±1.57µs   306.9 MB/sec  
primitive/default                         1.00    572.7±2.59µs   307.2 MB/sec    1.01    579.6±3.05µs   303.5 MB/sec
primitive/parquet_2                       1.04    675.4±4.20µs   260.5 MB/sec    1.00    650.9±2.73µs   270.3 MB/sec
primitive/zstd                            1.00    775.7±1.63µs   226.8 MB/sec    1.00    778.9±1.83µs   225.9 MB/sec
primitive/zstd_parquet_2                  1.13   1035.3±1.86µs   169.9 MB/sec    1.00   915.3±10.56µs   192.2 MB/sec
primitive_non_null/bloom_filter           1.14      3.1±0.01ms    56.3 MB/sec    1.00      2.7±0.01ms    64.1 MB/sec
primitive_non_null/cdc                    1.00    485.5±1.74µs   355.3 MB/sec  
primitive_non_null/default                1.00    484.8±2.04µs   355.9 MB/sec    1.03    497.7±4.73µs   346.7 MB/sec
primitive_non_null/parquet_2              1.04    598.8±2.21µs   288.1 MB/sec    1.00    574.4±1.30µs   300.4 MB/sec
primitive_non_null/zstd                   1.00    679.6±3.57µs   253.8 MB/sec    1.00    678.8±1.33µs   254.1 MB/sec
primitive_non_null/zstd_parquet_2         1.08   930.7±17.17µs   185.4 MB/sec    1.00    865.2±1.44µs   199.4 MB/sec
string/bloom_filter                       1.01   1838.6±2.22µs  1113.9 MB/sec    1.00   1820.6±2.31µs  1125.0 MB/sec
string/cdc                                1.00    591.7±1.07µs     3.4 GB/sec  
string/default                            1.00    592.1±1.14µs     3.4 GB/sec    1.01    599.8±0.97µs     3.3 GB/sec
string/parquet_2                          1.02   966.6±15.29µs     2.1 GB/sec    1.00   946.7±19.62µs     2.1 GB/sec
string/zstd                               1.00   1827.9±1.67µs  1120.5 MB/sec    1.42      2.6±0.00ms   790.9 MB/sec
string/zstd_parquet_2                     1.00      2.7±0.00ms   746.5 MB/sec    1.00      2.7±0.00ms   749.0 MB/sec
string_and_binary_view/bloom_filter       1.00    366.8±1.61µs   344.1 MB/sec    1.02    373.0±1.31µs   338.3 MB/sec
string_and_binary_view/cdc                1.00    306.4±0.69µs   411.8 MB/sec  
string_and_binary_view/default            1.00    306.6±0.74µs   411.6 MB/sec    1.00    306.7±0.75µs   411.5 MB/sec
string_and_binary_view/parquet_2          1.00    318.9±0.95µs   395.7 MB/sec    1.00    319.4±0.96µs   395.0 MB/sec
string_and_binary_view/zstd               1.01    478.9±0.78µs   263.5 MB/sec    1.00    475.0±0.97µs   265.7 MB/sec
string_and_binary_view/zstd_parquet_2     1.00    513.8±1.12µs   245.6 MB/sec    1.00    512.8±1.19µs   246.1 MB/sec
string_dictionary/bloom_filter            1.03    434.1±1.62µs     2.3 GB/sec    1.00    423.1±1.06µs     2.4 GB/sec
string_dictionary/cdc                     1.00    296.2±0.70µs     3.4 GB/sec  
string_dictionary/default                 1.01    297.8±0.52µs     3.4 GB/sec    1.00    296.2±0.55µs     3.4 GB/sec
string_dictionary/parquet_2               1.02    299.7±0.98µs     3.4 GB/sec    1.00    293.1±0.59µs     3.4 GB/sec
string_dictionary/zstd                    1.00    917.7±1.00µs  1124.5 MB/sec    1.01    925.6±1.28µs  1115.0 MB/sec
string_dictionary/zstd_parquet_2          1.00   1440.7±1.81µs   716.3 MB/sec    1.02   1465.0±5.11µs   704.4 MB/sec
string_non_null/bloom_filter              1.01      2.4±0.00ms   867.6 MB/sec    1.00      2.3±0.00ms   873.8 MB/sec
string_non_null/cdc                       1.00    869.0±1.33µs     2.3 GB/sec  
string_non_null/default                   1.00    868.1±1.53µs     2.3 GB/sec    1.01    873.9±4.36µs     2.3 GB/sec
string_non_null/parquet_2                 1.01   1373.7±1.93µs  1490.3 MB/sec    1.00   1362.3±3.68µs  1502.6 MB/sec
string_non_null/zstd                      1.01      3.6±0.01ms   576.3 MB/sec    1.00      3.5±0.03ms   584.0 MB/sec
string_non_null/zstd_parquet_2            1.02      3.9±0.00ms   528.8 MB/sec    1.00      3.8±0.00ms   536.9 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 592.2s
Peak memory 2.7 GiB
Avg memory 2.7 GiB
CPU user 508.9s
CPU sys 83.0s
Disk read 0 B
Disk write 700.3 MiB

branch

Metric Value
Wall time 712.3s
Peak memory 2.7 GiB
Avg memory 2.7 GiB
CPU user 634.0s
CPU sys 78.3s
Disk read 0 B
Disk write 4.1 MiB

Contributor

@etseidl etseidl left a comment


Thanks @kszucs. Benchmarks look great...no additional overhead ❤️. I think this is a really cool addition.

Contributor

alamb commented Mar 17, 2026

Thanks @kszucs. Benchmarks look great...no additional overhead ❤️. I think this is a really cool addition.

Thanks @etseidl -- do you think we should add this in the 58.1.0 (minor release) in a few days or does it need to wait for a larger major release ?

Contributor

etseidl commented Mar 17, 2026

Thanks @kszucs. Benchmarks look great...no additional overhead ❤️. I think this is a really cool addition.

Thanks @etseidl -- do you think we should add this in the 58.1.0 (minor release) in a few days or does it need to wait for a larger major release ?

As it adds a new opt-in option with minimal impact on the existing code I'd say slip it in if possible.

Contributor

alamb commented Mar 18, 2026

I think there is something wrong with the tests in this PR.

Specifically, I did an ablation study (what a fancy word!) to verify the test coverage in this PR.

I disconnected all the CDC code from the writer like this

Here is how I ran the tests

cargo test -p parquet --features=arrow -- cdc

Almost all of them still pass (only 3 fail), which suggests they aren't actually testing the CDC code:

running 35 tests
test column::chunker::cdc::arrow_tests::test_cdc_find_differences ... ok
test column::chunker::cdc::arrow_tests::test_cdc_empty_table ... ok
test column::chunker::cdc::arrow_tests::test_cdc_f64_column ... ok
test column::chunker::cdc::arrow_tests::test_cdc_array_offsets ... FAILED
test column::chunker::cdc::arrow_tests::test_cdc_prepend ... ok
test column::chunker::cdc::arrow_tests::test_cdc_append ... ok
test column::chunker::cdc::arrow_tests::test_cdc_insert_once ... ok
test column::chunker::cdc::arrow_tests::test_cdc_delete_once ... ok
test column::chunker::cdc::arrow_tests::test_cdc_delete_twice ... FAILED
test column::chunker::cdc::arrow_tests::test_cdc_insert_twice ... FAILED
test column::chunker::cdc::arrow_tests::test_cdc_multiple_row_groups_insert ... ok
test column::chunker::cdc::arrow_tests::test_cdc_roundtrip_dictionary ... ok
test column::chunker::cdc::arrow_tests::test_cdc_multiple_row_groups_append ... ok
test column::chunker::cdc::arrow_tests::test_cdc_roundtrip_large_binary ... ok
test column::chunker::cdc::arrow_tests::test_cdc_deterministic ... ok
test column::chunker::cdc::tests::test_calculate_mask_defaults ... ok
test column::chunker::cdc::tests::test_calculate_mask_invalid ... ok
test column::chunker::cdc::tests::test_calculate_mask_with_norm_level ... ok
test column::chunker::cdc::arrow_tests::test_cdc_roundtrip_string ... ok
test column::chunker::cdc::tests::test_non_nested_non_null_single_chunk ... ok
test column::chunker::cdc::tests::test_nullable_non_nested ... ok
test column::chunker::cdc::arrow_tests::test_cdc_roundtrip_nullable ... ok
test column::chunker::cdc::tests::test_deterministic_chunks ... ok
test column::chunker::cdc::tests::test_max_chunk_size_forces_boundary ... ok
test column::chunker::cdc::arrow_tests::test_cdc_roundtrip_i32 ... ok
test column::chunker::cdc::arrow_tests::test_cdc_roundtrip_list ... ok
test column::chunker::cdc::arrow_tests::test_cdc_update_once ... ok
test column::chunker::cdc::arrow_tests::test_cdc_nullable_column ... ok
test column::chunker::cdc::arrow_tests::test_cdc_string_column ... ok
test column::chunker::cdc::arrow_tests::test_cdc_update_twice ... ok
test column::chunker::cdc::arrow_tests::test_cdc_array_offsets_direct ... ok
test column::chunker::cdc::arrow_tests::test_cdc_roundtrip_multiple_columns ... ok
test column::chunker::cdc::arrow_tests::test_cdc_produces_multiple_pages ... ok
test column::chunker::cdc::arrow_tests::test_cdc_page_reuse_on_append ... ok
test column::chunker::cdc::arrow_tests::test_cdc_state_persists_across_row_groups ... ok

failures:

---- column::chunker::cdc::arrow_tests::test_cdc_array_offsets stdout ----

thread 'column::chunker::cdc::arrow_tests::test_cdc_array_offsets' (9668022) panicked at parquet/src/column/chunker/cdc.rs:1668:9:
assertion `left == right` failed: sliced first page should be 10 values shorter: full=[20480, 12288] sliced=[20480, 12278]
  left: 0
 right: 10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- column::chunker::cdc::arrow_tests::test_cdc_delete_twice stdout ----

thread 'column::chunker::cdc::arrow_tests::test_cdc_delete_twice' (9668025) panicked at parquet/src/column/chunker/cdc.rs:1574:9:
assertion `left == right` failed: Expected 2 diffs for double delete, got [([16448], [16384])]
  left: 1
 right: 2

---- column::chunker::cdc::arrow_tests::test_cdc_insert_twice stdout ----

thread 'column::chunker::cdc::arrow_tests::test_cdc_insert_twice' (9668031) panicked at parquet/src/column/chunker/cdc.rs:1613:9:
assertion `left == right` failed: Expected 2 diffs for double insert, got [([16384], [16448])]
  left: 1
 right: 2


failures:
    column::chunker::cdc::arrow_tests::test_cdc_array_offsets
    column::chunker::cdc::arrow_tests::test_cdc_delete_twice
    column::chunker::cdc::arrow_tests::test_cdc_insert_twice

test result: FAILED. 32 passed; 3 failed; 0 ignored; 0 measured; 885 filtered out; finished in 0.66s

Contributor

alamb commented Mar 18, 2026

For this PR I think we need an "end to end" test that shows the use case that the CDC code is intended to solve.

For example, perhaps such a test could write two parquet files with the same data except for some chosen rows in the middle, and verify that most of the pages are the same.

It is not entirely clear to me how a "content addressable filesystem" works (aka how does it know where the parquet pages start/end) so having that documented / mocked out would also be nice

Contributor

alamb commented Mar 18, 2026

Thank you for this work @kszucs and @etseidl

Contributor

@alamb alamb left a comment


Thank you

Ok(())
}

fn write_with_chunkers(
Contributor


Having two code paths (write and write_with_chunkers) is kind of weird to me and seems inconsistent with other optional features like encoding or compression.

I wonder if it would encapsulate the code more if we extended write with an Option<&[ContentDefinedChunker]> rather than have two separate external functions.

Member Author


You tell me :)
I merely tried to avoid any breaking changes.

Contributor


I can tinker as a follow on PR -- I don't think any changes are needed here

Contributor


I think it is fine to check this in

Comment thread parquet/src/lib.rs Outdated
Member Author

kszucs commented Mar 18, 2026

It is not entirely clear to me how a "content addressable filesystem" works (aka how does it know where the parquet pages start/end) so having that documented / mocked out would also be nice

The CDC feature in parquet essentially splits pages according to the columns' content, resulting in fairly stable pages even when records are inserted or deleted.

The HF xet filesystem is format agnostic (similar to, for example, a deduplicating backup solution like restic) and chunks the byte stream directly. The main issue with parquet is the page-level compression, which breaks deduplication when page values change; this CDC feature makes the pages more or less stable depending on their content.

BTW using this feature anyone could implement a "parquet page store" storing only unique parquet pages and some metadata to reassemble the parquet files.

Contributor

alamb commented Mar 18, 2026

BTW using this feature anyone could implement a "parquet page store" storing only unique parquet pages and some metadata to reassemble the parquet files.

Is this easy to show? I realize this is an important use case for hugging face, but it would be nice to have some example of how this could be used by others who are not using the xet filesystem.

Member Author

kszucs commented Mar 18, 2026

BTW using this feature anyone could implement a "parquet page store" storing only unique parquet pages and some metadata to reassemble the parquet files.

Is this easy to show? I realize this is an important use case for hugging face, but it would be nice to have some example of how this could be used by others who are not using the xet filesystem.

I have been thinking of a page store prototype for a while actually, that would kinda look like:

  1. iterate over the parquet pages using a page reader
  2. use a hash function to assign a unique key to the page based on its content, e.g. xxhash, sha, blake (this is different from the gearhash, since chunking is already done by the parquet writer)
  3. write out the page to a hashtable-like storage system, e.g. a KV store or object store, but it really depends on the use case
  4. maintain the necessary metadata to reassemble the original parquet file from the stored pages

A format agnostic CAS is different since it does the chunking on the byte stream directly. I have a naive and very simple implementation for that here https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs
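The four steps above can be sketched as a toy in-memory page store. The types and names here are hypothetical; a real store would use a strong content hash (xxhash/sha/blake, as noted above) and durable storage rather than in-memory maps:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed page store: unique pages keyed by a content hash,
/// plus per-file metadata (the ordered key list) to reassemble files.
#[derive(Default)]
struct PageStore {
    pages: HashMap<u64, Vec<u8>>,     // content key -> page bytes (stored once)
    files: HashMap<String, Vec<u64>>, // file name -> ordered page keys
}

/// Stand-in content hash (a real store would use a cryptographic hash).
fn page_key(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

impl PageStore {
    /// Store a file's pages, deduplicating any page already seen.
    fn put_file(&mut self, name: &str, pages: Vec<Vec<u8>>) {
        let keys: Vec<u64> = pages
            .into_iter()
            .map(|page| {
                let key = page_key(&page);
                self.pages.entry(key).or_insert(page);
                key
            })
            .collect();
        self.files.insert(name.to_string(), keys);
    }

    /// Reassemble the original byte stream from the stored pages.
    fn reassemble(&self, name: &str) -> Option<Vec<u8>> {
        let keys = self.files.get(name)?;
        Some(keys.iter().flat_map(|k| self.pages[k].iter().copied()).collect())
    }
}
```

With CDC-stable pages, two writes of mostly identical data share most page keys, so only the changed pages consume new storage.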

Member Author

kszucs commented Mar 18, 2026

I'm updating the testing suite to closely match the C++ ones.

Member Author

kszucs commented Mar 18, 2026

@alamb with CDC disabled at the writer level, 85 of 101 tests now fail (the remaining 16 are testing the testing utilities and empty tables).

Contributor

alamb commented Mar 20, 2026

@alamb with CDC disabled at the writer level now 85 tests fail from 101 (16 is testing the testing utilities and empty tables).

Thanks!

Contributor

alamb commented Mar 20, 2026

BTW using this feature anyone could implement a "parquet page store" storing only unique parquet pages and some metadata to reassemble the parquet files.

Is this easy to show? I realize this is an important use case for hugging face, but it would be nice to have some example of how this could be used by others who are not using the xet filesystem.

I have been thinking of a page store prototype for a while actually, that would kinda look like:

  1. iterate over the parquet pages using a page reader
  2. use a hash function to assign a unique key to the page based on its content, e.g. xxhash, sha, blake (this is different from the gearhash, since chunking is already done by the parquet writer)
  3. write out the page to a hashtable-like storage system, e.g. a KV store or object store, but it really depends on the use case
  4. maintain the necessary metadata to reassemble the original parquet file from the stored pages

A format agnostic CAS is different since it does the chunking on the byte stream directly. I have a naive and very simple implementation for that here https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs

I filed a ticket to track this idea so it doesn't get lost on an old PR.

@alamb alamb merged commit bc74c71 into apache:main Mar 20, 2026
16 checks passed
Contributor

alamb commented Mar 20, 2026

Thank you very much @kszucs and @etseidl 🙏

Member Author

kszucs commented Mar 20, 2026

Thank you @alamb and @etseidl! I'll soon be adding a config option to datafusion as well; combined with apache/opendal#7185 (and object_store_opendal), datafusion on huggingface will provide pretty good performance!

Contributor

alamb commented Mar 31, 2026

FWIW I believe I have found a bug in this PR (writing nested lists!):

(It is not a regression, but just FYI)

Member Author

kszucs commented Mar 31, 2026

Thanks @alamb for catching it, looking at it tomorrow!

github-merge-queue bot pushed a commit to apache/datafusion that referenced this pull request Apr 6, 2026
## Rationale for this change

- closes #21110

Expose the new Content-Defined Chunking feature from parquet-rs
apache/arrow-rs#9450

## What changes are included in this PR?

New parquet writer options for enabling CDC.

## Are these changes tested?

In-progress.

## Are there any user-facing changes?

New config options.


Depends on the 58.1 arrow-rs release.
Dandandan pushed a commit to Dandandan/arrow-datafusion that referenced this pull request Apr 8, 2026
