feat(parquet): add content defined chunking for arrow writer #9450
alamb merged 21 commits into apache:main
Conversation
We don't necessarily need to store the codegen script in the repository. Alternatively we could just reference https://github.com/apache/arrow/blob/main/cpp/src/parquet/chunker_internal_generated.h as a source for cdc_generated.rs. Likely it won't be regenerated at all.
I think it is fine to check this in
Hi @kszucs. 👋 Apologies, I have been unusually bandwidth constrained lately. I will try to give this a good look in the next few days. Thank you for your patience 🙏 (and for adding this to arrow-rs).

Hi @etseidl! No worries, I really appreciate you taking the time to review!
etseidl left a comment
There was a problem hiding this comment.
Flushing a few early observations/questions. Still need to do the deep dive.
etseidl left a comment
There was a problem hiding this comment.
Just a few more random comments...I'll get into the meat of the chunking tomorrow. Looking good so far!
…` and use it in content defined chunking
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
As it adds a new opt-in option with minimal impact on the existing code, I'd say slip it in if possible.

I think there is something wrong with the tests in this PR. Specifically, I did an ablation study (what a fancy word!) to verify the test coverage in this PR: I disconnected all the CDC code from the writer. Here is how I ran the tests:

cargo test -p parquet --features=arrow -- cdc

Almost all of them still pass (only 3 fail), which suggests they aren't actually testing the CDC code.

For this PR I think we need an "end to end" test that shows the use case that the CDC code is intended to solve. For example, such a test could write two parquet files with the same data except for some chosen rows in the middle, and verify that most of the pages are the same. It is not entirely clear to me how a "content addressable filesystem" works (i.e. how does it know where the parquet pages start/end), so having that documented / mocked out would also be nice.
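The suggested end-to-end test could be sketched roughly like this. It is a hedged illustration, not code from this PR: it uses a toy per-byte boundary rule as a stand-in for the real gearhash chunker, and all names (`cdc_chunks`, `shared_fraction`, `lcg_bytes`) are hypothetical. A real test would chunk the written Parquet bytes at page boundaries instead of raw bytes.

```rust
use std::collections::HashSet;

/// Toy content-defined chunker: declare a boundary after any byte whose
/// low 6 bits are zero (~1/64 of random bytes, so ~64-byte chunks on
/// average). A real implementation uses a rolling gearhash, but the
/// resynchronization property demonstrated here is the same.
fn cdc_chunks(data: &[u8]) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    for (i, &b) in data.iter().enumerate() {
        if b & 0x3f == 0 {
            chunks.push(&data[start..=i]);
            start = i + 1;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]);
    }
    chunks
}

/// Fraction of `b`'s chunks that also appear in `a` (a proxy for how many
/// Parquet pages a content-addressable store could deduplicate).
fn shared_fraction(a: &[u8], b: &[u8]) -> f64 {
    let seen: HashSet<&[u8]> = cdc_chunks(a).into_iter().collect();
    let chunks_b = cdc_chunks(b);
    let shared = chunks_b.iter().filter(|c| seen.contains(*c)).count();
    shared as f64 / chunks_b.len() as f64
}

/// Deterministic pseudo-random bytes (simple LCG) standing in for data.
fn lcg_bytes(n: usize, mut seed: u64) -> Vec<u8> {
    (0..n)
        .map(|_| {
            seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1);
            (seed >> 33) as u8
        })
        .collect()
}
```

Because the boundary rule here depends only on local content, an insertion in the middle perturbs only the chunks overlapping the edit, so the shared fraction stays close to 1 even after the edit.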
Ok(())
}

fn write_with_chunkers(
Having two code paths (write and write_with_chunkers) is kind of weird to me and seems inconsistent with other optional features like encoding or compression.
I wonder if it would encapsulate the code more if we extended write with an Option<&[ContentDefinedChunker]> rather than have two separate external functions.
You tell me :)
I merely tried to avoid any breaking changes.
I can tinker as a follow-on PR -- I don't think any changes are needed here
I think it is fine to check this in
The CDC feature in parquet essentially splits pages according to the columns' content, resulting in fairly stable pages even if there are inserted or deleted records. The HF xet filesystem is format agnostic (similar to, for example, a deduplicating backup solution like restic) and chunks the byte stream directly. The main issue with parquet is the page level compression, which breaks the deduplication if the page values change - this CDC feature makes the pages more or less stable depending on their content. BTW using this feature anyone could implement a "parquet page store" storing only unique parquet pages and some metadata to reassemble the parquet files.

Is this easy to show? I realize this is an important use case for hugging face, but it would be nice to have some example of how this could be used by others that are not using the xet filesystem.
I have been thinking of a page store prototype for a while actually, that would kinda look like:
A format agnostic CAS is different since it does the chunking on the byte stream directly. I have a naive and very simple implementation for that here https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs
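The "parquet page store" idea discussed above could be mocked roughly like this. This is a hedged sketch, not the linked store.rs implementation: the `PageStore` name is hypothetical, and std's `DefaultHasher` stands in for the cryptographic hash a real store would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Minimal content-addressable store: each page is stored once, keyed by
/// a hash of its bytes, and a "file" is just an ordered manifest of keys.
#[derive(Default)]
struct PageStore {
    pages: HashMap<u64, Vec<u8>>,
}

impl PageStore {
    /// Store one page, returning its key; duplicate pages cost nothing.
    fn put(&mut self, page: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        page.hash(&mut h);
        let key = h.finish();
        self.pages.entry(key).or_insert_with(|| page.to_vec());
        key
    }

    /// Store a whole file as a manifest of page keys.
    fn put_file(&mut self, pages: &[&[u8]]) -> Vec<u64> {
        pages.iter().map(|p| self.put(p)).collect()
    }

    /// Reassemble a file from its manifest.
    fn get_file(&self, manifest: &[u64]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for key in manifest {
            out.extend_from_slice(self.pages.get(key)?);
        }
        Some(out)
    }
}
```

With CDC-stable pages, two versions of a file that differ in a few rows would share most manifest entries, so only the changed pages occupy new storage.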
I'm updating the testing suite to closely match the C++ one.

@alamb with CDC disabled at the writer level, 85 of the 101 tests now fail (the other 16 test the testing utilities and empty tables).

Thanks!
I filed a ticket to track this idea so it doesn't get lost on an old PR.

Thank you @alamb and @etseidl! I will soon be adding a config option to datafusion as well, and combined with apache/opendal#7185 (and object_store_opendal), datafusion on huggingface will provide pretty good performance!

FWIW I believe I have found a bug in this PR (writing nested lists!). It is not a regression, but just FYI.

Thanks @alamb for catching it, looking at it tomorrow!
## Rationale for this change

Closes #21110. Expose the new Content-Defined Chunking feature from parquet-rs apache/arrow-rs#9450.

## What changes are included in this PR?

New parquet writer options for enabling CDC.

## Are these changes tested?

In progress.

## Are there any user-facing changes?

New config options. Depends on the 58.1 arrow-rs release.
Which issue does this PR close?
Rationale for this change
Rust implementation of apache/arrow#45360
Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row causes all subsequent pages to shift, resulting in nearly every byte being re-uploaded to content-addressable storage (CAS) systems. CDC determines page boundaries via a rolling gearhash over column values, so unchanged data produces identical pages across different writes, enabling storage cost reductions and faster upload times.
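The rolling-gearhash boundary rule described above can be sketched as follows. This is an illustrative sketch only: the gear table, mask, and size parameters here are made up, not the constants shipped in this PR, and `GearChunker` is a hypothetical name.

```rust
/// Sketch of gearhash-based content-defined boundary detection.
struct GearChunker {
    gear: [u64; 256],
    mask: u64,
    min_size: usize,
    max_size: usize,
}

impl GearChunker {
    fn new(min_size: usize, max_size: usize, mask_bits: u32) -> Self {
        // Fill the gear table with deterministic pseudo-random values
        // (real implementations ship a precomputed table).
        let mut gear = [0u64; 256];
        let mut seed = 0x9e3779b97f4a7c15u64;
        for g in gear.iter_mut() {
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            *g = seed;
        }
        Self { gear, mask: (1u64 << mask_bits) - 1, min_size, max_size }
    }

    /// Split `data` at content-defined boundaries. The shift in the hash
    /// update means the hash depends only on the last ~64 bytes, so an
    /// insertion upstream perturbs at most a couple of chunks before the
    /// boundaries resynchronize.
    fn chunk_sizes(&self, data: &[u8]) -> Vec<usize> {
        let mut sizes = Vec::new();
        let (mut hash, mut len) = (0u64, 0usize);
        for &b in data {
            hash = (hash << 1).wrapping_add(self.gear[b as usize]);
            len += 1;
            if (len >= self.min_size && hash & self.mask == 0) || len >= self.max_size {
                sizes.push(len);
                hash = 0;
                len = 0;
            }
        }
        if len > 0 {
            sizes.push(len);
        }
        sizes
    }
}
```

The `min_size`/`max_size` clamps mirror the role of the `min_chunk_size`/`max_chunk_size` options mentioned below: they bound chunk sizes while the mask controls the average size (`mask_bits = 9` gives roughly 512-byte chunks on uniform data).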
See more details in https://huggingface.co/blog/parquet-cdc
The original C++ implementation: apache/arrow#45360
Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, where I already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better):
What changes are included in this PR?
- New chunker module under parquet/src/column/chunker/
- ArrowColumnWriterCdcOptions struct (min_chunk_size, max_chunk_size, norm_level)
- repeated_ancestor_def_level field for nested field values iteration

Are these changes tested?
Yes, unit tests are located in cdc.rs and ported from the C++ implementation.

Are there any user-facing changes?
New experimental API, disabled by default — no behavior change for existing code: