<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied. See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Parquet Page Store: Deduplication Demo

> **Prototype**: This is an experimental feature exploring content-defined
> chunking for Parquet. APIs and file formats may change.

This demo shows how Content-Defined Chunking (CDC) enables efficient deduplication
across multiple versions of a dataset using the Parquet page store writer in
Apache Arrow Rust. The deduplication is self-contained in the Parquet writer:
no special storage system is required.

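
As a rough illustration of the idea behind CDC, the sketch below cuts a byte stream wherever a rolling hash of the most recent bytes matches a pattern, so chunk boundaries follow the content rather than fixed offsets. This is a toy byte-level chunker, not the algorithm used by the Arrow Rust writer (which this README does not show); the names `cdc_chunks` and `reuse_fraction`, the gear table, and the size parameters are all assumptions made for the example.

```python
import random

# Fixed pseudo-random "gear" table: one 64-bit value per possible byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_bits: int = 8,
               min_size: int = 64, max_size: int = 1024) -> list[bytes]:
    """Split data at content-defined boundaries (hypothetical toy chunker)."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Only the low avg_bits bits are tested below, and those depend on
        # just the last avg_bits bytes, so the cut decision is local.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if size < min_size:
            continue
        if (h & mask) == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def reuse_fraction(old: bytes, new: bytes) -> float:
    """Fraction of the new version's chunks that already exist in the old one."""
    old_chunks = set(cdc_chunks(old))
    new_chunks = cdc_chunks(new)
    return sum(c in old_chunks for c in new_chunks) / len(new_chunks)


if __name__ == "__main__":
    rng = random.Random(1)
    original = bytes(rng.getrandbits(8) for _ in range(50_000))
    edited = original[:20_000] + original[20_010:]  # delete 10 bytes mid-stream
    print(f"chunks reused after the edit: {reuse_fraction(original, edited):.0%}")
```

With fixed-size chunks, the same 10-byte deletion would shift every later chunk; with content-defined boundaries, chunks before the edit are untouched and later ones typically realign soon after it.
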
## What this demo shows

Four dataset variants are produced from a real-world dataset
([OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)
conversational data, ~800 MB per file): a baseline plus three common dataset
operations. Each variant is written as a separate Parquet file. Without a page
store, storing all four files costs the full sum of their sizes. With the CDC
page store, identical pages are stored **exactly once**, indexed by their
BLAKE3 hash, so the four files share most of their bytes. The resulting files
can be stored anywhere.

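
The "stored exactly once, indexed by hash" idea can be sketched as a small content-addressed store. This is a minimal illustration under stated assumptions, not the page store's real layout: the `PageStore` class and its methods are hypothetical, and `hashlib.blake2b` from the Python standard library stands in for BLAKE3, which the standard library does not provide.

```python
import hashlib


class PageStore:
    """Toy content-addressed store: each distinct page is kept once (hypothetical)."""

    def __init__(self) -> None:
        self.pages: dict[str, bytes] = {}

    def put(self, page: bytes) -> str:
        # ASSUMPTION: blake2b is a stand-in for the BLAKE3 hash named in the text.
        digest = hashlib.blake2b(page, digest_size=32).hexdigest()
        self.pages.setdefault(digest, page)  # no-op if the page is already stored
        return digest

    def write_file(self, pages: list[bytes]) -> list[str]:
        """Store a file's pages and return its manifest: an ordered list of digests."""
        return [self.put(p) for p in pages]

    def read_file(self, manifest: list[str]) -> bytes:
        """Rebuild a file's page bytes from its manifest."""
        return b"".join(self.pages[d] for d in manifest)

    def stored_bytes(self) -> int:
        return sum(len(p) for p in self.pages.values())


if __name__ == "__main__":
    store = PageStore()
    base = [b"page-a", b"page-b", b"page-c"]
    m_original = store.write_file(base)
    m_appended = store.write_file(base + [b"page-d"])
    refs = len(m_original) + len(m_appended)
    print(f"{refs} page references, {len(store.pages)} unique pages stored")
```

Each file is then just a list of references, so a second version that shares most pages with the first adds only its new pages to the store.
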
| File                | Operation                              |
| ------------------- | -------------------------------------- |
| `original.parquet`  | Baseline dataset (~996k rows)          |
| `filtered.parquet`  | Keep rows where `num_turns ≤ 3`        |
| `augmented.parquet` | Original + computed column `num_turns` |
| `appended.parquet`  | Original + 5,000 new rows appended     |

## Prerequisites

@@ -52,6 +71,7 @@ python pipeline.py --skip-prepare --skip-build --skip-ingest # stats only
```

Outputs:

- `page_store_concept.png`: architectural overview of how shared pages work
- `page_store_savings.png`: side-by-side storage comparison with real numbers

@@ -62,42 +82,42 @@ python pipeline.py --file /path/to/your.parquet
```

The script requires a `conversations` list column for the filtered and augmented
variants. Adapt `pipeline.py` to your own schema as needed.

## Results

Dataset: **OpenHermes-2.5** (short conversations, `num_turns < 10`)

### Dataset variants

| File                | Operation                                   | Rows      | Size           |
| ------------------- | ------------------------------------------- | --------- | -------------- |
| `original.parquet`  | Baseline                                    | 996,009   | 782.1 MB       |
| `filtered.parquet`  | Keep `num_turns ≤ 3` (removes 0.2% of rows) | 993,862   | 776.8 MB       |
| `augmented.parquet` | Add column `num_turns`                      | 996,009   | 782.2 MB       |
| `appended.parquet`  | Append 5,000 rows                           | 1,001,009 | 788.6 MB       |
| **Total**           |                                             |           | **3,129.7 MB** |

### Page store results

| Metric                    | Value                |
| ------------------------- | -------------------- |
| Unique pages stored       | 3,400                |
| Total page references     | 15,179               |
| Page store size           | 559.0 MB             |
| Metadata files size       | 4.4 MB               |
| **Page store + metadata** | **563.4 MB**         |
| **Storage saved**         | **2,566.3 MB (82%)** |
| **Deduplication ratio**   | **5.6×**             |

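
The summary rows follow directly from the reported sizes. The short script below reproduces that arithmetic from the figures in the tables above; the variable names are chosen for this example only.

```python
# Figures copied from the tables above (sizes in MB).
file_sizes_mb = {
    "original.parquet": 782.1,
    "filtered.parquet": 776.8,
    "augmented.parquet": 782.2,
    "appended.parquet": 788.6,
}
page_store_mb = 559.0
metadata_mb = 4.4

total_mb = sum(file_sizes_mb.values())   # cost of storing the four files separately
stored_mb = page_store_mb + metadata_mb  # cost with the page store
saved_mb = total_mb - stored_mb
saved_pct = 100 * saved_mb / total_mb
dedup_ratio = total_mb / stored_mb

print(f"total:  {total_mb:,.1f} MB")                     # total:  3,129.7 MB
print(f"stored: {stored_mb:,.1f} MB")                    # stored: 563.4 MB
print(f"saved:  {saved_mb:,.1f} MB ({saved_pct:.0f}%)")  # saved:  2,566.3 MB (82%)
print(f"ratio:  {dedup_ratio:.1f}x")                     # ratio:  5.6x
```

Here the deduplication ratio is read as total size of the separate files divided by page store plus metadata, which is the interpretation that matches the reported 5.6×.
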
### Per-file page breakdown

| File                | Page refs | Unique hashes | New pages | Reused pages |
| ------------------- | --------- | ------------- | --------- | ------------ |
| `original.parquet`  | 3,782     | 3,100         | 3,100     | 0            |
| `filtered.parquet`  | 3,755     | 3,075         | 222       | 2,853 (92%)  |
| `augmented.parquet` | 3,834     | 3,136         | 36        | 3,100 (98%)  |
| `appended.parquet`  | 3,808     | 3,125         | 42        | 3,083 (98%)  |

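
The per-file rows can be cross-checked against the store totals. The sketch below assumes the files were ingested in the order listed, so that "new pages" means pages not already contributed by an earlier file; under that reading the new-page counts should add up to the unique pages stored.

```python
# (file, page_refs, unique_hashes, new_pages, reused_pages), copied from the table.
rows = [
    ("original.parquet", 3782, 3100, 3100, 0),
    ("filtered.parquet", 3755, 3075, 222, 2853),
    ("augmented.parquet", 3834, 3136, 36, 3100),
    ("appended.parquet", 3808, 3125, 42, 3083),
]

reuse_pct = []
for name, refs, unique, new, reused in rows:
    # Every unique hash in a file is either new to the store or reused.
    assert new + reused == unique, name
    reuse_pct.append(100 * reused // unique)  # rounded down, as in the table

total_refs = sum(r[1] for r in rows)  # matches "Total page references"
total_new = sum(r[3] for r in rows)   # matches "Unique pages stored"

print(total_refs, total_new, reuse_pct)  # 15179 3400 [0, 92, 98, 98]
```

Page references exceed unique hashes within each file, which suggests that some pages repeat inside a single file and are already deduplicated there.
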
### Key insights

@@ -110,7 +130,7 @@ Dataset: **OpenHermes-2.5** (short conversations, `num_turns < 10`)
3. **Filtering rows** (`filtered`): 92% of pages reused despite row removal.
   Removing just 0.2% of rows barely shifts CDC boundaries, so most pages are
   unchanged. Heavier filtering (removing 20–50% of rows) would produce more new
   pages, as CDC boundaries shift further throughout the file.

4. **Net result**: 4 dataset versions stored for **563 MB instead of 3.1 GB**, an