Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 18 additions & 5 deletions parquet/src/compression.rs
Original file line number Diff line number Diff line change
Expand Up @@ -201,7 +201,7 @@ mod snappy_codec {
use snap::raw::{Decoder, Encoder, decompress_len, max_compress_len};

use crate::compression::Codec;
use crate::errors::Result;
use crate::errors::{ParquetError, Result};

/// Codec for Snappy compression format.
pub struct SnappyCodec {
Expand Down Expand Up @@ -231,10 +231,23 @@ mod snappy_codec {
None => decompress_len(input_buf)?,
};
let offset = output_buf.len();
output_buf.resize(offset + len, 0);
self.decoder
.decompress(input_buf, &mut output_buf[offset..])
.map_err(|e| e.into())
output_buf.reserve(len);
// SAFETY: we pass the spare capacity to snappy which will write exactly
// `len` bytes on success. The `set_len` below is only reached when
// decompression succeeds. `MaybeUninit<u8>` has the same layout as `u8`.
let spare = output_buf.spare_capacity_mut();
let spare_bytes = unsafe {
std::slice::from_raw_parts_mut(spare.as_mut_ptr().cast::<u8>(), spare.len())
};
let n = self
.decoder
.decompress(input_buf, &mut spare_bytes[..len])
.map_err(|e| -> ParquetError { e.into() })?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this returns on error before setting len, will the buffer be left in an inconsistent state?

I think the use of the mut slice ensures that the call to decompress won't overwrite the newly allocated bytes.

However, this also basically passes in uninitialized bytes to decompress -- how do we know that the decompress doesn't read them? Maybe we should add a SAFETY warning to the decompress API that says it can't rely on initialized bytes 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effectively we rely on this:

https://docs.rs/snap/latest/snap/raw/struct.Decoder.html#errors

  • output has length less than decompress_len(input).

To not use unsafe we would need to have this feature:
BurntSushi/rust-snappy#62

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could improve the documentation around Decoder::decompress to mention it can receive non zero bytes and should not make any assumptions about their contents. I think that would be adequate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does seem to be an appropriate use of spare_capacity_mut/set_len.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some more time exploring this (and arguing with Codex about it)

The main issue is the Rust snappy implementation's contract takes an output buffer and doesn't say it can handle uninitialized bytes

That being said I can't think of how passing uninitialized bytes as an output location could cause problems (even if snappy changes how it internally works)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, technically I think because snappy function is not marked unsafe, it breaks the contract (i.e. it might read the buffer). In practice it doesn't need to read anything.

A MaybeUninit API would solve that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a PR now:
BurntSushi/rust-snappy#65

// SAFETY: snappy wrote exactly `n` bytes into the spare capacity
unsafe {
output_buf.set_len(offset + n);
}
Ok(n)
}

fn compress(&mut self, input_buf: &[u8], output_buf: &mut Vec<u8>) -> Result<()> {
Expand Down
Loading