Commit 0d34a46

feat(parquet): add page store demo, reconstruct CLI command, and roundtrip verification

1 parent 3533fd8 commit 0d34a46

12 files changed: +1571 −98 lines

parquet/Cargo.toml

Lines changed: 2 additions & 1 deletion

```diff
@@ -66,6 +66,7 @@ num-integer = { version = "0.1.46", default-features = false, features = ["std"]
 num-traits = { version = "0.2.19", default-features = false, features = ["std"] }
 base64 = { version = "0.22", default-features = false, features = ["std", ], optional = true }
 clap = { version = "4.1", default-features = false, features = ["std", "derive", "env", "help", "error-context", "usage"], optional = true }
+glob = { version = "0.3", default-features = false, optional = true }
 serde = { version = "1.0", default-features = false, features = ["derive"], optional = true }
 serde_json = { version = "1.0", default-features = false, features = ["std"], optional = true }
 seq-macro = { version = "0.3", default-features = false }
@@ -110,7 +111,7 @@ arrow = ["base64", "arrow-array", "arrow-buffer", "arrow-data", "arrow-schema",
 # Enable support for arrow canonical extension types
 arrow_canonical_extension_types = ["arrow-schema?/canonical_extension_types"]
 # Enable CLI tools
-cli = ["json", "base64", "clap", "arrow-csv", "serde"]
+cli = ["json", "base64", "clap", "arrow-csv", "serde", "dep:glob"]
 # Enable JSON APIs
 json = ["serde_json", "base64"]
 # Enable internal testing APIs
```

parquet/examples/page_store.rs

Lines changed: 5 additions & 1 deletion

```diff
@@ -87,7 +87,11 @@ fn main() -> parquet::errors::Result<()> {
     let batches = reader.read_batches()?;

     let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
-    println!("Read {} batch(es), {} total rows", batches.len(), total_rows);
+    println!(
+        "Read {} batch(es), {} total rows",
+        batches.len(),
+        total_rows
+    );

     // Display
     let formatted = pretty_format_batches(&batches).unwrap();
```
Lines changed: 6 additions & 0 deletions

```diff
@@ -0,0 +1,6 @@
+data/
+meta/
+pages/
+verify/
+.venv/
+.cache/
```
Lines changed: 159 additions & 0 deletions

# Parquet Page Store — Deduplication Demo

> **Prototype**: This is an experimental feature exploring content-defined
> chunking for Parquet. APIs and file formats may change.

Demonstrates how content-defined chunking (CDC) enables efficient deduplication
across multiple versions of a dataset using the Parquet page store writer in
Apache Arrow Rust. The deduplication is self-contained in the Parquet writer —
no special storage system is required.

## What this demo shows

Four common dataset operations are applied to a real-world dataset
([OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)
conversational data, ~800 MB per file). Each operation produces a separate
Parquet file. Without a page store, storing all four files costs the full sum
of their sizes. With the CDC page store, identical pages are stored **exactly
once** — indexed by their BLAKE3 hash — so the four files share most of their
bytes. The resulting files can be stored anywhere.

| File | Operation |
|------|-----------|
| `original.parquet` | Baseline dataset (~996k rows) |
| `filtered.parquet` | Keep rows where `num_turns ≤ 3` |
| `augmented.parquet` | Original + computed column `num_turns` |
| `appended.parquet` | Original + 5,000 new rows appended |
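The "stored exactly once, indexed by hash" idea can be sketched as a tiny content-addressed store. This is an illustrative sketch only, not the demo's actual on-disk format: stdlib SHA-256 stands in for BLAKE3, and the names `store`, `manifests`, and `put_page` are hypothetical.

```python
import hashlib

# Toy content-addressed page store: pages live under their hash, so a page
# shared by several files occupies space exactly once.
# (SHA-256 stands in for the BLAKE3 hash used by the real writer.)
store: dict[str, bytes] = {}          # hash -> page bytes (the "pages/" dir)
manifests: dict[str, list[str]] = {}  # file -> ordered page hashes ("meta/")

def put_page(file_name: str, page: bytes) -> None:
    digest = hashlib.sha256(page).hexdigest()
    store.setdefault(digest, page)    # duplicate pages are not re-stored
    manifests.setdefault(file_name, []).append(digest)

for page in [b"page-1", b"page-2", b"page-3"]:
    put_page("original.parquet", page)
for page in [b"page-1x", b"page-2", b"page-3"]:  # only the first page changed
    put_page("filtered.parquet", page)

# 6 page references in total, but only 4 unique pages actually stored.
```

Reading a file back is just resolving its manifest in order, which is conceptually what any reconstruct step over such a store has to do.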
## Prerequisites

```bash
pip install pyarrow matplotlib huggingface_hub
cargo build --release -p parquet --features page_store,cli
```

## Running the demo

```bash
cd parquet/examples/page_store_dedup

# Run the full pipeline: prepare data, build binary, ingest into page store, show stats
python pipeline.py

# Then generate diagrams
python diagram.py
```

Individual steps can be skipped if they've already run:

```bash
python pipeline.py --skip-prepare --skip-build               # re-run ingest + stats only
python pipeline.py --skip-prepare --skip-build --skip-ingest # stats only
```

Outputs:

- `page_store_concept.png` — architectural overview of how shared pages work
- `page_store_savings.png` — side-by-side storage comparison with real numbers

## Using your own dataset

```bash
python pipeline.py --file /path/to/your.parquet
```

The script requires a `conversations` list column for the filtered and augmented
variants. Adapt `pipeline.py` to your own schema as needed.
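When adapting the script, the only schema-specific piece the filtered and augmented variants need is a per-row turn count. A minimal sketch of that derivation, assuming each `conversations` value is a list of turn objects (the `"from"`/`"value"` keys here are illustrative, not requirements):

```python
# Illustrative rows mimicking a `conversations` list column.
rows = [
    {"conversations": [{"from": "human", "value": "hi"},
                       {"from": "gpt", "value": "hello"}]},
    {"conversations": [{"from": "human", "value": "q1"},
                       {"from": "gpt", "value": "a1"},
                       {"from": "human", "value": "q2"},
                       {"from": "gpt", "value": "a2"}]},
]

num_turns = [len(r["conversations"]) for r in rows]    # augmented column
kept = [r for r, n in zip(rows, num_turns) if n <= 3]  # filter predicate
```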
## Results

Dataset: **OpenHermes-2.5** (short conversations, `num_turns < 10`)

### Dataset variants

| File | Operation | Rows | Size |
|------|-----------|------|------|
| `original.parquet` | Baseline | 996,009 | 782.1 MB |
| `filtered.parquet` | Keep `num_turns ≤ 3` (removes 0.2% of rows) | 993,862 | 776.8 MB |
| `augmented.parquet` | Add column `num_turns` | 996,009 | 782.2 MB |
| `appended.parquet` | Append 5,000 rows | 1,001,009 | 788.6 MB |
| **Total** | | | **3,129.7 MB** |

### Page store results

| Metric | Value |
|--------|-------|
| Unique pages stored | 3,400 |
| Total page references | 15,179 |
| Page store size | 559.0 MB |
| Metadata files size | 4.4 MB |
| **Page store + metadata** | **563.4 MB** |
| **Storage saved** | **2,566.3 MB (82%)** |
| **Deduplication ratio** | **5.6×** |
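The headline numbers can be re-derived from the two tables (sizes in MB, rounded to the precision shown; `original.parquet` is taken as the single-version baseline):

```python
total = 782.1 + 776.8 + 782.2 + 788.6  # four standalone files
dedup = 559.0 + 4.4                    # page store + metadata
saved = total - dedup

assert round(total, 1) == 3129.7
assert round(saved, 1) == 2566.3
assert round(saved / total * 100) == 82   # "82% saved"
assert round(total / dedup, 1) == 5.6     # "5.6x deduplication ratio"
assert round(dedup / 782.1, 2) == 0.72    # 4 versions for 0.72x of one
```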
### Per-file page breakdown

| File | Page refs | Unique hashes | New pages | Reused pages |
|------|-----------|---------------|-----------|--------------|
| `original.parquet` | 3,782 | 3,100 | 3,100 | 0 |
| `filtered.parquet` | 3,755 | 3,075 | 222 | 2,853 (92%) |
| `augmented.parquet` | 3,834 | 3,136 | 36 | 3,100 (98%) |
| `appended.parquet` | 3,808 | 3,125 | 42 | 3,083 (98%) |
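A quick consistency check on the table: for each file, new plus reused pages equals its unique hash count, the four files' new pages sum to the store's 3,400 unique pages, and their refs sum to the 15,179 total references:

```python
# (file, page_refs, unique_hashes, new_pages, reused_pages) from the table
breakdown = [
    ("original.parquet",  3782, 3100, 3100,    0),
    ("filtered.parquet",  3755, 3075,  222, 2853),
    ("augmented.parquet", 3834, 3125 - 3125 + 3136, 36, 3100),
    ("appended.parquet",  3808, 3125,   42, 3083),
]

for _, refs, unique, new, reused in breakdown:
    assert new + reused == unique  # every unique hash is either new or reused
    assert refs >= unique          # refs can repeat hashes within a file

assert sum(b[3] for b in breakdown) == 3400   # unique pages stored
assert sum(b[1] for b in breakdown) == 15179  # total page references
```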
### Key insights

1. **Adding a column** (`augmented`): only 36 new pages out of 3,136 (1.1%).
   The existing 17 columns produce identical CDC pages — only the new
   `num_turns` column contributes new pages.

2. **Appending rows** (`appended`): only 42 new pages out of 3,125 (1.3%).
   The original 996k rows' pages are unchanged; only the 5k new rows create
   new pages.

3. **Filtering rows** (`filtered`): 92% of pages reused despite row removal.
   Removing just 0.2% of rows barely shifts CDC boundaries — most pages are
   unchanged. Heavier filtering (removing 20–50% of rows) would produce more
   new pages, as CDC boundaries shift further throughout the file.

4. **Net result**: 4 dataset versions stored for **563 MB instead of 3.1 GB** —
   an **82% reduction**, or equivalently, 4 versions for the cost of **0.72×**
   a single version.
## How it works

```
Standard Parquet — each file stored independently:

  original.parquet  ──► [ page 1 ][ page 2 ][ page 3 ]...[ page N ]
  filtered.parquet  ──► [ page 1'][ page 2 ][ page 3 ]...[ page M ]
  augmented.parquet ──► [ page 1 ][ page 2 ][ page 3 ]...[ page N ][ extra ]
  appended.parquet  ──► [ page 1 ][ page 2 ][ page 3 ]...[ page N ][ new ]

  Total: sum of all four file sizes

CDC Page Store — content-addressed, deduplicated:

  pages/
    <hash-of-page-1>.page   ← shared by original, augmented, appended
    <hash-of-page-2>.page   ← shared by original, filtered, augmented, appended
    <hash-of-page-3>.page   ← shared by filtered only (boundary page)
    ...                       (only UNIQUE pages stored)

  meta/
    original.meta.parquet   ← tiny manifest referencing page hashes
    filtered.meta.parquet
    augmented.meta.parquet
    appended.meta.parquet

  Total: ~18% of the combined file sizes
```

CDC ensures that page boundaries are **content-defined** (not fixed row
counts), so adding columns or appending rows only requires storing the small
number of new pages — the rest remain identical and are reused.
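To see why boundaries realign at all, here is a toy content-defined chunker. It is purely illustrative: the real writer's CDC algorithm, window, and cut condition are not shown in this commit, and the hash and parameters below are arbitrary choices for the sketch.

```python
import hashlib, random

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x1F) -> list[bytes]:
    """Toy content-defined chunker: cut wherever the hash of the last
    `window` bytes has its low bits zero. Boundaries depend only on local
    content, never on absolute byte offsets."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:4], "big")
        if (h & mask) == 0 and i - start >= window:  # enforce min chunk size
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
base = bytes(random.randrange(256) for _ in range(4096))
edited = base[:100] + b"INSERTED" + base[100:]  # small local edit

hashes_base = {hashlib.sha256(c).digest() for c in cdc_chunks(base)}
hashes_edit = {hashlib.sha256(c).digest() for c in cdc_chunks(edited)}
# Chunking realigns shortly after the edit, so most chunk hashes survive;
# a fixed-size chunker would instead shift every chunk after byte 100.
```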
## Further reading

- [`parquet::arrow::page_store`][api] API docs
- [`parquet-page-store` CLI][cli] source

[api]: https://docs.rs/parquet/latest/parquet/arrow/page_store/index.html
[cli]: ../../src/bin/parquet-page-store.rs
