Commit 2e9ee36

chore: add ASF license headers and fix RAT/Prettier CI failures
1 parent 0d34a46 commit 2e9ee36

4 files changed: +88 -35 lines changed

dev/release/rat_exclude_files.txt

Lines changed: 1 addition & 0 deletions
@@ -20,3 +20,4 @@ arrow-flight/src/sql/arrow.flight.protocol.sql.rs
 .github/*
 parquet/src/bin/parquet-fromcsv-help.txt
 arrow-flight/examples/data/*
+parquet/examples/page_store_dedup/page_store_concept.svg
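
The exclusion above and the headers added elsewhere in this commit both address the same license audit. As a rough illustration of that kind of check, here is a minimal Python sketch. It is not Apache RAT's actual implementation: the function names, the marker string match, and the 20-line search window are all assumptions made for this example, and the glob patterns are the ones shown in this hunk.

```python
from fnmatch import fnmatchcase

# Hypothetical marker: the first line of the ASF header added in this commit.
LICENSE_MARKER = "Licensed to the Apache Software Foundation (ASF)"

# Exclusion globs copied from the hunk above.
EXCLUDE_PATTERNS = [
    ".github/*",
    "parquet/src/bin/parquet-fromcsv-help.txt",
    "arrow-flight/examples/data/*",
    "parquet/examples/page_store_dedup/page_store_concept.svg",
]


def is_excluded(path, patterns=EXCLUDE_PATTERNS):
    """Return True if the path matches any exclusion glob."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def has_license_header(text, max_lines=20):
    """Return True if the marker appears within the first max_lines lines."""
    head = "\n".join(text.splitlines()[:max_lines])
    return LICENSE_MARKER in head


def needs_header(path, text, patterns=EXCLUDE_PATTERNS):
    """A file is flagged when it is not excluded and lacks the header."""
    return not is_excluded(path, patterns) and not has_license_header(text)
```

Under these assumptions, the SVG is skipped because it is listed in the exclusion file, while `pipeline.py` and `concept.py` pass only once the commented header is present near the top of the file.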

parquet/examples/page_store_dedup/README.md

Lines changed: 55 additions & 35 deletions
@@ -1,29 +1,48 @@
+<!---
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
 # Parquet Page Store — Deduplication Demo

 > **Prototype**: This is an experimental feature exploring content-defined
-> chunking for Parquet. APIs and file formats may change.
+> chunking for Parquet. APIs and file formats may change.

 Demonstrates how Content-Defined Chunking (CDC) enables efficient deduplication
 across multiple versions of a dataset using the Parquet page store writer in
-Apache Arrow Rust. The deduplication is self-contained in the Parquet writer —
+Apache Arrow Rust. The deduplication is self-contained in the Parquet writer —
 no special storage system is required.

 ## What this demo shows

 Four common dataset operations are applied to a real-world dataset
 ([OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)
-conversational data, ~800 MB per file). Each operation produces a separate
-Parquet file. Without a page store, storing all four files costs the full sum
-of their sizes. With the CDC page store, identical pages are stored **exactly
+conversational data, ~800 MB per file). Each operation produces a separate
+Parquet file. Without a page store, storing all four files costs the full sum
+of their sizes. With the CDC page store, identical pages are stored **exactly
 once** — indexed by their BLAKE3 hash — so the four files share most of their
-bytes. The resulting files can be stored anywhere.
+bytes. The resulting files can be stored anywhere.

-| File | Operation |
-|------|-----------|
-| `original.parquet` | Baseline dataset (~996k rows) |
-| `filtered.parquet` | Keep rows where `num_turns ≤ 3` |
+| File | Operation |
+| ------------------- | -------------------------------------- |
+| `original.parquet` | Baseline dataset (~996k rows) |
+| `filtered.parquet` | Keep rows where `num_turns ≤ 3` |
 | `augmented.parquet` | Original + computed column `num_turns` |
-| `appended.parquet` | Original + 5 000 new rows appended |
+| `appended.parquet` | Original + 5 000 new rows appended |

 ## Prerequisites

@@ -52,6 +71,7 @@ python pipeline.py --skip-prepare --skip-build --skip-ingest # stats only
 ```

 Outputs:
+
 - `page_store_concept.png` — architectural overview of how shared pages work
 - `page_store_savings.png` — side-by-side storage comparison with real numbers

@@ -62,42 +82,42 @@ python pipeline.py --file /path/to/your.parquet
 ```

 The script requires a `conversations` list column for the filtered and augmented
-variants. Adapt `pipeline.py` to your own schema as needed.
+variants. Adapt `pipeline.py` to your own schema as needed.

 ## Results

 Dataset: **OpenHermes-2.5** (short conversations, `num_turns < 10`)

 ### Dataset variants

-| File | Operation | Rows | Size |
-|------|-----------|------|------|
-| `original.parquet` | Baseline | 996,009 | 782.1 MB |
-| `filtered.parquet` | Keep `num_turns ≤ 3` (removes 0.2% of rows) | 993,862 | 776.8 MB |
-| `augmented.parquet` | Add column `num_turns` | 996,009 | 782.2 MB |
-| `appended.parquet` | Append 5,000 rows | 1,001,009 | 788.6 MB |
-| **Total** | | | **3,129.7 MB** |
+| File | Operation | Rows | Size |
+| ------------------- | ------------------------------------------- | --------- | -------------- |
+| `original.parquet` | Baseline | 996,009 | 782.1 MB |
+| `filtered.parquet` | Keep `num_turns ≤ 3` (removes 0.2% of rows) | 993,862 | 776.8 MB |
+| `augmented.parquet` | Add column `num_turns` | 996,009 | 782.2 MB |
+| `appended.parquet` | Append 5,000 rows | 1,001,009 | 788.6 MB |
+| **Total** | | | **3,129.7 MB** |

 ### Page store results

-| Metric | Value |
-|--------|-------|
-| Unique pages stored | 3,400 |
-| Total page references | 15,179 |
-| Page store size | 559.0 MB |
-| Metadata files size | 4.4 MB |
-| **Page store + metadata** | **563.4 MB** |
-| **Storage saved** | **2,566.3 MB (82%)** |
-| **Deduplication ratio** | **5.6×** |
+| Metric | Value |
+| ------------------------- | -------------------- |
+| Unique pages stored | 3,400 |
+| Total page references | 15,179 |
+| Page store size | 559.0 MB |
+| Metadata files size | 4.4 MB |
+| **Page store + metadata** | **563.4 MB** |
+| **Storage saved** | **2,566.3 MB (82%)** |
+| **Deduplication ratio** | **5.6×** |

 ### Per-file page breakdown

-| File | Page refs | Unique hashes | New pages | Reused pages |
-|------|-----------|---------------|-----------|--------------|
-| `original.parquet` | 3,782 | 3,100 | 3,100 | 0 |
-| `filtered.parquet` | 3,755 | 3,075 | 222 | 2,853 (92%) |
-| `augmented.parquet` | 3,834 | 3,136 | 36 | 3,100 (98%) |
-| `appended.parquet` | 3,808 | 3,125 | 42 | 3,083 (98%) |
+| File | Page refs | Unique hashes | New pages | Reused pages |
+| ------------------- | --------- | ------------- | --------- | ------------ |
+| `original.parquet` | 3,782 | 3,100 | 3,100 | 0 |
+| `filtered.parquet` | 3,755 | 3,075 | 222 | 2,853 (92%) |
+| `augmented.parquet` | 3,834 | 3,136 | 36 | 3,100 (98%) |
+| `appended.parquet` | 3,808 | 3,125 | 42 | 3,083 (98%) |

 ### Key insights

@@ -110,7 +130,7 @@ Dataset: **OpenHermes-2.5** (short conversations, `num_turns < 10`)

 3. **Filtering rows** (`filtered`): 92% of pages reused despite row removal.
 Removing just 0.2% of rows barely shifts CDC boundaries — most pages are
-unchanged. Heavier filtering (removing 20–50% of rows) would produce more new
+unchanged. Heavier filtering (removing 20–50% of rows) would produce more new
 pages, as CDC boundaries shift further throughout the file.

 4. **Net result**: 4 dataset versions stored for **563 MB instead of 3.1 GB** — an

parquet/examples/page_store_dedup/concept.py

Lines changed: 16 additions & 0 deletions
@@ -1,4 +1,20 @@
 #!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
 """
 Generate the Parquet Page Store concept diagram.

parquet/examples/page_store_dedup/pipeline.py

Lines changed: 16 additions & 0 deletions
@@ -1,4 +1,20 @@
 #!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
 """
 Full pipeline for the Parquet Page Store deduplication demo.

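
The README tables touched by this commit are only re-aligned, not re-measured, so their figures should still agree with one another. The short Python sketch below cross-checks them. The variable and function names are illustrative only and do not come from `pipeline.py`; the numbers are copied from the "Dataset variants", "Page store results", and "Per-file page breakdown" tables.

```python
# Sizes in MB, copied from the "Dataset variants" table.
file_sizes_mb = {
    "original.parquet": 782.1,
    "filtered.parquet": 776.8,
    "augmented.parquet": 782.2,
    "appended.parquet": 788.6,
}

# From the "Page store results" table.
page_store_mb = 559.0
metadata_mb = 4.4

# (page refs, new pages) per file, from the "Per-file page breakdown" table.
per_file_pages = {
    "original.parquet": (3782, 3100),
    "filtered.parquet": (3755, 222),
    "augmented.parquet": (3834, 36),
    "appended.parquet": (3808, 42),
}


def summarize():
    """Recompute the headline metrics from the per-file figures."""
    total = round(sum(file_sizes_mb.values()), 1)
    stored = round(page_store_mb + metadata_mb, 1)
    saved = round(total - stored, 1)
    return {
        "total_mb": total,
        "stored_mb": stored,
        "saved_mb": saved,
        "saved_pct": round(100 * saved / total),
        "dedup_ratio": round(total / stored, 1),
        "page_refs": sum(refs for refs, _ in per_file_pages.values()),
        "unique_pages": sum(new for _, new in per_file_pages.values()),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Run as a script, this reproduces the reported totals: 3,129.7 MB before deduplication, 563.4 MB stored, 2,566.3 MB (82%) saved, a 5.6× ratio, 15,179 page references, and 3,400 unique pages. The unique-page total equals the sum of the "New pages" column, since each file contributes only the pages that no earlier file had already stored.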
