Avoid unnecessary buffer zero-fill in Snappy decompression #9583

Open
Dandandan wants to merge 2 commits into apache:main from Dandandan:pr/snappy-zero-fill

Dandandan (Contributor) commented Mar 19, 2026

Which issue does this PR close?

Closes #9579

Rationale

Currently, Snappy decompression uses resize(len, 0) which zero-fills the buffer before writing. Since Snappy will overwrite the entire region on success, this memset is wasted work.
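
For context, the zero-filling pattern described above can be sketched roughly as follows. This is a minimal illustration and not the parquet crate's code: `copy_decoder` is a hypothetical stand-in for a decompressor that writes into an already initialized `&mut [u8]`, and `String` stands in for the real error type.

```rust
/// Hypothetical stand-in for a decompressor (not the `snap` API):
/// copies `input` into `out` and returns the number of bytes written.
fn copy_decoder(input: &[u8], out: &mut [u8]) -> Result<usize, String> {
    if out.len() < input.len() {
        return Err("output buffer too small".to_string());
    }
    out[..input.len()].copy_from_slice(input);
    Ok(input.len())
}

/// Appends up to `len` bytes to `output`, zero-filling the region first.
fn append_with_zero_fill(
    input: &[u8],
    output: &mut Vec<u8>,
    len: usize,
) -> Result<usize, String> {
    let start = output.len();
    // resize() writes `len` zeros (a memset) that the decoder then overwrites.
    output.resize(start + len, 0);
    let n = copy_decoder(input, &mut output[start..])?;
    // Drop any tail the decoder did not fill.
    output.truncate(start + n);
    Ok(n)
}
```

The `resize` call is the step whose cost is in question: every byte it zeroes is written a second time by the decoder.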

This yields a 1-2% win on end-to-end decoding of Snappy-encoded Parquet data.

What changes are included in this PR?

Write directly into spare capacity using reserve() + spare_capacity_mut() + set_len(), eliminating the unnecessary zero-fill.
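
A minimal sketch of the reserve() + spare_capacity_mut() + set_len() sequence, under stated assumptions: `fill_output` is a hypothetical stand-in for the decompressor (it is not the `snap` API and, unlike `snap`, it accepts `MaybeUninit<u8>` slots), and `String` stands in for the real error type. It only illustrates the `Vec` bookkeeping, not the PR's actual diff.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for a decompressor: writes `input` into the first
/// `input.len()` slots of `out` and returns the number of bytes written.
fn fill_output(input: &[u8], out: &mut [MaybeUninit<u8>]) -> Result<usize, String> {
    if out.len() < input.len() {
        return Err("output buffer too small".to_string());
    }
    for (slot, byte) in out.iter_mut().zip(input) {
        slot.write(*byte);
    }
    Ok(input.len())
}

/// Appends up to `len` bytes to `output` without zero-filling first.
fn append_without_zero_fill(
    input: &[u8],
    output: &mut Vec<u8>,
    len: usize,
) -> Result<usize, String> {
    let start = output.len();
    // Make room for `len` more bytes; the new capacity stays uninitialized.
    output.reserve(len);
    let spare = output.spare_capacity_mut();
    // On error we return here, before set_len, so `output` keeps its old length.
    let n = fill_output(input, &mut spare[..len])?;
    // SAFETY: fill_output initialized the first `n` spare bytes, and
    // n <= len <= capacity - start.
    unsafe { output.set_len(start + n) };
    Ok(n)
}
```

Because the length is only updated after a successful write, a failed call leaves the vector's contents and length as they were.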

Are there any user-facing changes?

No.

🤖 Generated with Claude Code

Write directly into spare capacity instead of resize+zero-fill,
eliminating unnecessary memset for the decompression output buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4093137898-468-pw7st 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing pr/snappy-zero-fill (eaa3ae4) to 88422cb (merge-base) diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Details

group                                             main                                   pr_snappy-zero-fill
-----                                             ----                                   -------------------
arrow_reader_clickbench/async/Q1                  1.01   1093.5±5.62µs        ? ?/sec    1.00   1087.7±3.63µs        ? ?/sec
arrow_reader_clickbench/async/Q10                 1.01      6.7±0.05ms        ? ?/sec    1.00      6.7±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q11                 1.02      7.8±0.07ms        ? ?/sec    1.00      7.7±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q12                 1.00     14.4±0.07ms        ? ?/sec    1.00     14.4±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q13                 1.01     17.1±0.09ms        ? ?/sec    1.00     16.9±0.10ms        ? ?/sec
arrow_reader_clickbench/async/Q14                 1.00     15.9±0.07ms        ? ?/sec    1.00     15.9±0.09ms        ? ?/sec
arrow_reader_clickbench/async/Q19                 1.01      3.1±0.03ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
arrow_reader_clickbench/async/Q20                 1.00     78.7±0.37ms        ? ?/sec    1.13     88.9±9.93ms        ? ?/sec
arrow_reader_clickbench/async/Q21                 1.22     97.0±0.55ms        ? ?/sec    1.00     79.4±0.20ms        ? ?/sec
arrow_reader_clickbench/async/Q22                 1.11    131.6±5.00ms        ? ?/sec    1.00    118.2±6.31ms        ? ?/sec
arrow_reader_clickbench/async/Q23                 1.02    245.9±0.84ms        ? ?/sec    1.00    240.6±1.16ms        ? ?/sec
arrow_reader_clickbench/async/Q24                 1.04     20.0±0.14ms        ? ?/sec    1.00     19.2±0.14ms        ? ?/sec
arrow_reader_clickbench/async/Q27                 1.04     58.7±0.55ms        ? ?/sec    1.00     56.3±0.21ms        ? ?/sec
arrow_reader_clickbench/async/Q28                 1.03     57.9±0.36ms        ? ?/sec    1.00     56.3±0.16ms        ? ?/sec
arrow_reader_clickbench/async/Q30                 1.01     18.6±0.07ms        ? ?/sec    1.00     18.4±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q36                 1.02     15.3±0.28ms        ? ?/sec    1.00     14.9±0.12ms        ? ?/sec
arrow_reader_clickbench/async/Q37                 1.00      5.4±0.03ms        ? ?/sec    1.00      5.4±0.02ms        ? ?/sec
arrow_reader_clickbench/async/Q38                 1.03     13.6±0.26ms        ? ?/sec    1.00     13.1±0.16ms        ? ?/sec
arrow_reader_clickbench/async/Q39                 1.03     24.4±0.31ms        ? ?/sec    1.00     23.8±0.18ms        ? ?/sec
arrow_reader_clickbench/async/Q40                 1.01      5.8±0.06ms        ? ?/sec    1.00      5.7±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q41                 1.01      5.0±0.03ms        ? ?/sec    1.00      4.9±0.02ms        ? ?/sec
arrow_reader_clickbench/async/Q42                 1.00      3.5±0.02ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q1     1.00   1062.7±2.48µs        ? ?/sec    1.00   1067.4±2.72µs        ? ?/sec
arrow_reader_clickbench/async_object_store/Q10    1.02      6.6±0.06ms        ? ?/sec    1.00      6.5±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q11    1.01      7.6±0.06ms        ? ?/sec    1.00      7.5±0.06ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q12    1.01     14.3±0.08ms        ? ?/sec    1.00     14.2±0.08ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q13    1.02     17.1±0.24ms        ? ?/sec    1.00     16.8±0.15ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q14    1.00     15.9±0.11ms        ? ?/sec    1.00     15.8±0.12ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q19    1.01      3.0±0.03ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q20    1.03     72.2±0.64ms        ? ?/sec    1.00     70.0±0.28ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q21    1.03     80.8±0.54ms        ? ?/sec    1.00     78.5±0.24ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q22    1.04     99.1±0.77ms        ? ?/sec    1.00     95.4±0.26ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q23    1.00    213.3±0.80ms        ? ?/sec    1.12    238.7±1.23ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q24    1.01     19.4±0.14ms        ? ?/sec    1.00     19.2±0.09ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q27    1.03     57.2±0.63ms        ? ?/sec    1.00     55.4±0.27ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q28    1.03     56.9±0.45ms        ? ?/sec    1.00     55.5±0.24ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q30    1.00     18.3±0.08ms        ? ?/sec    1.00     18.3±0.06ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q36    1.01     14.5±0.23ms        ? ?/sec    1.00     14.4±0.20ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q37    1.00      5.3±0.03ms        ? ?/sec    1.00      5.4±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q38    1.01     12.8±0.20ms        ? ?/sec    1.00     12.6±0.20ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q39    1.02     23.3±0.28ms        ? ?/sec    1.00     22.7±0.19ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q40    1.01      5.5±0.04ms        ? ?/sec    1.00      5.4±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q41    1.01      4.8±0.02ms        ? ?/sec    1.00      4.8±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q42    1.01      3.5±0.02ms        ? ?/sec    1.00      3.4±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q1                   1.00    868.7±1.80µs        ? ?/sec    1.01    873.1±1.90µs        ? ?/sec
arrow_reader_clickbench/sync/Q10                  1.00      5.2±0.04ms        ? ?/sec    1.00      5.1±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q11                  1.00      6.1±0.04ms        ? ?/sec    1.00      6.1±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q12                  1.02     22.1±0.67ms        ? ?/sec    1.00     21.6±0.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q13                  1.00     28.7±0.88ms        ? ?/sec    1.05     30.2±0.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q14                  1.00     23.1±0.12ms        ? ?/sec    1.19     27.4±0.48ms        ? ?/sec
arrow_reader_clickbench/sync/Q19                  1.04      2.8±0.03ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20                  1.03    125.7±0.35ms        ? ?/sec    1.00    122.0±0.34ms        ? ?/sec
arrow_reader_clickbench/sync/Q21                  1.03     99.3±0.19ms        ? ?/sec    1.00     96.4±0.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q22                  1.01    145.7±0.50ms        ? ?/sec    1.00    144.6±0.80ms        ? ?/sec
arrow_reader_clickbench/sync/Q23                  1.01   282.2±14.62ms        ? ?/sec    1.00   280.6±16.88ms        ? ?/sec
arrow_reader_clickbench/sync/Q24                  1.02     27.4±0.13ms        ? ?/sec    1.00     26.9±0.08ms        ? ?/sec
arrow_reader_clickbench/sync/Q27                  1.05    109.9±0.24ms        ? ?/sec    1.00    104.6±0.13ms        ? ?/sec
arrow_reader_clickbench/sync/Q28                  1.04    105.7±0.18ms        ? ?/sec    1.00    101.9±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q30                  1.02     18.9±0.08ms        ? ?/sec    1.00     18.5±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q36                  1.02     22.3±0.13ms        ? ?/sec    1.00     21.9±0.11ms        ? ?/sec
arrow_reader_clickbench/sync/Q37                  1.00      6.9±0.01ms        ? ?/sec    1.00      6.9±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q38                  1.02     11.5±0.08ms        ? ?/sec    1.00     11.2±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q39                  1.03     21.1±0.12ms        ? ?/sec    1.00     20.5±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q40                  1.00      5.2±0.02ms        ? ?/sec    1.00      5.2±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q41                  1.00      5.6±0.02ms        ? ?/sec    1.00      5.7±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q42                  1.01      4.4±0.02ms        ? ?/sec    1.00      4.3±0.02ms        ? ?/sec

Resource Usage

base (merge-base)

Metric        Value
Wall time     784.1s
Peak memory   3.1 GiB
Avg memory    2.9 GiB
CPU user      707.4s
CPU sys       76.4s
Disk read     0 B
Disk write    758.4 MiB

branch

Metric        Value
Wall time     781.9s
Peak memory   3.2 GiB
Avg memory    3.1 GiB
CPU user      707.9s
CPU sys       74.1s
Disk read     0 B
Disk write    171.3 MiB

Contributor

alamb left a comment

Very exciting @Dandandan

let n = self
    .decoder
    .decompress(input_buf, &mut spare_bytes[..len])
    .map_err(|e| -> ParquetError { e.into() })?;
Contributor

If this returns on error before setting len, will the buffer be left in an inconsistent state?

I think the use of the mut slice ensures that the call to decompress won't overwrite the newly allocated bytes.

However, this also basically passes in uninitialized bytes to decompress -- how do we know that the decompress doesn't read them? Maybe we should add a SAFETY warning to the decompress API that says it can't rely on initialized bytes 🤔

Contributor Author

Effectively we rely on this:

https://docs.rs/snap/latest/snap/raw/struct.Decoder.html#errors

  • output has length less than decompress_len(input).

To not use unsafe we would need to have this feature:
BurntSushi/rust-snappy#62

Contributor

Maybe we could improve the documentation around Decoder::decompress to mention that it can receive non-zero bytes and should not make any assumptions about their contents. I think that would be adequate.

Contributor

This does seem to be an appropriate use of spare_capacity_mut/set_len.

Contributor

I spent some more time exploring this (and arguing with Codex about it)

The main issue is that the contract of the Rust snappy implementation takes an output buffer and does not say it can handle uninitialized bytes.

That being said I can't think of how passing uninitialized bytes as an output location could cause problems (even if snappy changes how it internally works)

Contributor Author

Yes, technically I think that because the snappy function is not marked unsafe, this breaks the contract (i.e. the function might read the buffer). In practice it doesn't need to read anything.

A MaybeUninit API would solve that.
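
To illustrate what a MaybeUninit-based API could look like, here is a rough sketch. Everything in it is an assumption for illustration: `decompress_into_uninit` is a hypothetical function, not something the `snap` crate provides, and it merely copies bytes instead of decompressing. The point is the signature: the output is typed as possibly uninitialized, so the callee can only read it through an explicit unsafe step, and the caller gets back the initialized prefix.

```rust
use std::mem::MaybeUninit;

/// Hypothetical signature (not part of the `snap` crate): writes into
/// possibly-uninitialized output and returns the initialized prefix.
fn decompress_into_uninit<'a>(
    input: &[u8],
    out: &'a mut [MaybeUninit<u8>],
) -> Result<&'a mut [u8], String> {
    if out.len() < input.len() {
        return Err("output buffer too small".to_string());
    }
    let filled = &mut out[..input.len()];
    // Stand-in for real decompression: copy each input byte into its slot.
    for (slot, byte) in filled.iter_mut().zip(input) {
        slot.write(*byte);
    }
    // SAFETY: every element of `filled` was written in the loop above, and
    // MaybeUninit<u8> has the same layout as u8.
    Ok(unsafe {
        std::slice::from_raw_parts_mut(filled.as_mut_ptr().cast::<u8>(), filled.len())
    })
}
```

With a signature like this, the caller no longer needs to assume anything about whether the callee reads the output buffer before writing it.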

Contributor Author

There is a PR now:
BurntSushi/rust-snappy#65

alamb (Contributor) commented Mar 20, 2026

run benchmark arrow_reader

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4098640020-479-gzgft 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing pr/snappy-zero-fill (eaa3ae4) to 88422cb (merge-base) diff
BENCH_NAME=arrow_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader
BENCH_FILTER=
Results will be posted here when complete

alamb (Contributor) commented Mar 31, 2026

run benchmark arrow_reader

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4165328642-639-s2g89 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing pr/snappy-zero-fill (4149d2b) to 51bf8a4 (merge-base) diff
BENCH_NAME=arrow_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader
BENCH_FILTER=
Results will be posted here when complete


Contributor

alamb left a comment

I am torn on this one. Let's see if we can get a measurable perf win and then I can hem and haw about it more

@Dandandan
Contributor Author

run benchmark arrow_reader

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4186819387-783-wl9px 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing pr/snappy-zero-fill (4149d2b) to 51bf8a4 (merge-base) diff
BENCH_NAME=arrow_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader
BENCH_FILTER=
Results will be posted here when complete

Labels

parquet Changes to the parquet crate

4 participants