# Parquet Page Store — Deduplication Demo

> **Prototype**: This is an experimental feature exploring content-defined
> chunking for Parquet. APIs and file formats may change.

Demonstrates how Content-Defined Chunking (CDC) enables efficient deduplication
across multiple versions of a dataset using the Parquet page store writer in
Apache Arrow Rust. The deduplication is self-contained in the Parquet writer —
no special storage system is required.

## What this demo shows

Four common dataset operations are applied to a real-world dataset
([OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)
conversational data, ~800 MB per file). Each operation produces a separate
Parquet file. Without a page store, storing all four files costs the full sum
of their sizes. With the CDC page store, identical pages are stored **exactly
once** — indexed by their BLAKE3 hash — so the four files share most of their
bytes. The resulting files can be stored anywhere.

| File | Operation |
|------|-----------|
| `original.parquet` | Baseline dataset (~996k rows) |
| `filtered.parquet` | Keep rows where `num_turns ≤ 3` |
| `augmented.parquet` | Original + computed column `num_turns` |
| `appended.parquet` | Original + 5,000 new rows appended |

## Prerequisites

```bash
pip install pyarrow matplotlib huggingface_hub
cargo build --release -p parquet --features page_store,cli
```

## Running the demo

```bash
cd parquet/examples/page_store_dedup

# Run the full pipeline: prepare data, build binary, ingest into page store, show stats
python pipeline.py

# Then generate diagrams
python diagram.py
```

Individual steps can be skipped if they've already run:

```bash
python pipeline.py --skip-prepare --skip-build                # re-run ingest + stats only
python pipeline.py --skip-prepare --skip-build --skip-ingest  # stats only
```

Outputs:
- `page_store_concept.png` — architectural overview of how shared pages work
- `page_store_savings.png` — side-by-side storage comparison with real numbers

## Using your own dataset

```bash
python pipeline.py --file /path/to/your.parquet
```

The script requires a `conversations` list column for the filtered and augmented
variants. Adapt `pipeline.py` to your own schema as needed.

## Results

Dataset: **OpenHermes-2.5** (short conversations, `num_turns < 10`)

### Dataset variants

| File | Operation | Rows | Size |
|------|-----------|------|------|
| `original.parquet` | Baseline | 996,009 | 782.1 MB |
| `filtered.parquet` | Keep `num_turns ≤ 3` (removes 0.2% of rows) | 993,862 | 776.8 MB |
| `augmented.parquet` | Add column `num_turns` | 996,009 | 782.2 MB |
| `appended.parquet` | Append 5,000 rows | 1,001,009 | 788.6 MB |
| **Total** | | | **3,129.7 MB** |

### Page store results

| Metric | Value |
|--------|-------|
| Unique pages stored | 3,400 |
| Total page references | 15,179 |
| Page store size | 559.0 MB |
| Metadata files size | 4.4 MB |
| **Page store + metadata** | **563.4 MB** |
| **Storage saved** | **2,566.3 MB (82%)** |
| **Deduplication ratio** | **5.6×** |

### Per-file page breakdown

| File | Page refs | Unique hashes | New pages | Reused pages |
|------|-----------|---------------|-----------|--------------|
| `original.parquet` | 3,782 | 3,100 | 3,100 | 0 |
| `filtered.parquet` | 3,755 | 3,075 | 222 | 2,853 (92%) |
| `augmented.parquet` | 3,834 | 3,136 | 36 | 3,100 (98%) |
| `appended.parquet` | 3,808 | 3,125 | 42 | 3,083 (98%) |

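The new/reused split falls out of plain set arithmetic at ingest time: a page is "new" if its hash is not yet in the store. A toy sketch (short strings standing in for BLAKE3 digests):

```python
store = set()  # hashes already present in the page store

def ingest(page_hashes):
    """Return (new, reused) counts for one file's pages."""
    unique = set(page_hashes)
    new_hashes = unique - store
    reused = unique & store
    store.update(new_hashes)
    return len(new_hashes), len(reused)

# First file: everything is new. A later file mostly reuses.
first = ingest(["p1", "p2", "p3", "p1"])   # (3, 0)
second = ingest(["p1", "p2", "p4"])        # (1, 2)
```
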
### Key insights

1. **Adding a column** (`augmented`): only 36 new pages out of 3,136 (1.1%).
   The existing 17 columns produce identical CDC pages — only the new `num_turns`
   column contributes new pages.

2. **Appending rows** (`appended`): only 42 new pages out of 3,125 (1.3%).
   The original 996k rows' pages are unchanged; only the 5k new rows create new pages.

3. **Filtering rows** (`filtered`): 92% of pages reused despite row removal.
   Removing just 0.2% of rows barely shifts CDC boundaries — most pages are
   unchanged. Heavier filtering (removing 20–50% of rows) would produce more new
   pages, as CDC boundaries shift further throughout the file.

4. **Net result**: 4 dataset versions stored for **563 MB instead of 3.1 GB** — an
   **82% reduction**, or equivalently, 4 versions for the cost of **0.72×** a single
   version.
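
The headline figures follow directly from the two tables above; a quick sanity check:

```python
# Recompute the summary metrics from the tables above.
total_mb = 782.1 + 776.8 + 782.2 + 788.6   # four standalone files
stored_mb = 559.0 + 4.4                    # page store + metadata

saved_mb = total_mb - stored_mb            # 2,566.3 MB
saved_pct = 100 * saved_mb / total_mb      # ~82%
dedup_ratio = total_mb / stored_mb         # ~5.6x
cost_vs_one = stored_mb / 782.1            # ~0.72x a single version
```
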

## How it works

```
Standard Parquet — each file stored independently:

  original.parquet  ──► [ page 1 ][ page 2 ][ page 3 ]...[ page N ]
  filtered.parquet  ──► [ page 1'][ page 2 ][ page 3 ]...[ page M ]
  augmented.parquet ──► [ page 1 ][ page 2 ][ page 3 ]...[ page N ][ extra ]
  appended.parquet  ──► [ page 1 ][ page 2 ][ page 3 ]...[ page N ][ new ]

  Total: sum of all four file sizes

CDC Page Store — content-addressed, deduplicated:

  pages/
    <hash-of-page-1>.page   ← shared by original, augmented, appended
    <hash-of-page-2>.page   ← shared by original, filtered, augmented, appended
    <hash-of-page-3>.page   ← shared by filtered only (boundary page)
    ...                     (only UNIQUE pages stored)

  meta/
    original.meta.parquet   ← tiny manifest referencing page hashes
    filtered.meta.parquet
    augmented.meta.parquet
    appended.meta.parquet

  Total: ~18% of the combined file sizes
```
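
The `pages/` + `meta/` layout can be sketched in a few lines. This is an illustration only, not the writer's implementation: `hashlib.blake2b` stands in for BLAKE3 (which is not in the Python stdlib), and the manifest is plain text here, whereas the real writer stores it as a tiny Parquet file:

```python
import hashlib
import tempfile
from pathlib import Path

# Illustrative content-addressed page store mirroring the diagram above.
root = Path(tempfile.mkdtemp())
(root / "pages").mkdir()
(root / "meta").mkdir()

def store_file(name: str, pages: list[bytes]) -> None:
    manifest = []
    for page in pages:
        digest = hashlib.blake2b(page, digest_size=16).hexdigest()
        path = root / "pages" / f"{digest}.page"
        if not path.exists():            # identical pages stored exactly once
            path.write_bytes(page)
        manifest.append(digest)
    # Tiny manifest: the file is just an ordered list of page hashes.
    (root / "meta" / f"{name}.meta.txt").write_text("\n".join(manifest))

store_file("original", [b"page-A", b"page-B", b"page-C"])
store_file("filtered", [b"page-A", b"page-B", b"page-X"])   # reuses A and B
unique_pages = len(list((root / "pages").glob("*.page")))   # 4 pages, not 6
```

Reconstructing a file is the reverse walk: read its manifest, then concatenate the referenced pages in order.
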

CDC ensures that page boundaries are **content-defined** (not fixed row
counts), so adding columns or appending rows only requires storing the small
number of new pages — the rest remain identical and are reused.
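
A toy byte-level chunker makes that locality concrete. This is not the writer's actual algorithm (which chunks at the Parquet value level, and real CDC uses a sliding-window Gear/Rabin hash), but the boundary idea is the same:

```python
def cdc_chunks(data: bytes, mask: int = 0x3F) -> list[bytes]:
    """Cut wherever the hash's low bits match `mask` (~64 B avg on random data)."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # toy hash, reset at each boundary
        if h & mask == mask:                # content-defined boundary
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

base = bytes(range(256)) * 20
edited = base + b"appended tail"            # simulate appending rows

base_chunks = cdc_chunks(base)
edited_chunks = cdc_chunks(edited)
# Every chunk of `base` reappears unchanged in `edited`; only the tail is
# new, so a content-addressed store only has to keep the new chunks.
```

With fixed-size chunking, by contrast, an insertion near the start would shift every later boundary and invalidate every later chunk.
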

## Further reading

- [`parquet::arrow::page_store`][api] API docs
- [`parquet-page-store` CLI][cli] source

[api]: https://docs.rs/parquet/latest/parquet/arrow/page_store/index.html
[cli]: ../../src/bin/parquet-page-store.rs