Conversation

@williamhbaker (Contributor) commented Jan 2, 2026

Each active gzip writer introduces a small but significant amount of memory overhead, on the order of several hundred KB. When many journals are being written concurrently, this overhead adds up to a substantial amount of total memory usage.
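
For illustration only (this is not code from the PR), a minimal Go sketch of how that per-writer overhead can be observed: it allocates many live gzip writers, forces each one's flate compressor to initialize with a first write, and compares heap usage before and after. The journal count and the exact per-writer number are assumptions; the result lands in the several-hundred-KB range described above on recent Go versions.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"runtime"
)

// heapAlloc reports live heap bytes after a forced GC.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	const n = 1000 // stand-in for 1000 journals, each holding a live gzip writer.

	before := heapAlloc()

	writers := make([]*gzip.Writer, n)
	for i := range writers {
		writers[i] = gzip.NewWriter(io.Discard)
		// The underlying flate compressor allocates its state lazily,
		// on the first write.
		writers[i].Write([]byte("x"))
	}

	after := heapAlloc()
	fmt.Printf("~%d KB per live gzip writer\n", (after-before)/uint64(n)/1024)

	runtime.KeepAlive(writers)
}
```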

This change adds a threshold so that incremental compression occurs only once there is at least 1 MB of data to compress, and introduces a new gzip writing mechanism that can close one gzip member and start another, with the members concatenated into the same output file. The spool logic uses this mechanism to write a new gzip member for every batch of incremental compression, eliminating the need to hold a gzip writer in memory for the entire lifetime of the fragment file.
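
A minimal sketch of the idea, not the actual spool implementation (the writeMember helper, the batching, and the threshold constant here are simplified stand-ins): each batch is compressed as its own gzip member and the writer is closed immediately, so no flate state survives between batches, while the concatenated members still read back as one continuous stream.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// writeMember compresses one batch as a self-contained gzip member and
// appends it to out. The writer is closed right away, so no gzip/flate
// state is held between batches.
func writeMember(out *bytes.Buffer, batch []byte) error {
	zw := gzip.NewWriter(out)
	if _, err := zw.Write(batch); err != nil {
		return err
	}
	return zw.Close()
}

func main() {
	const flushThreshold = 1 << 20 // illustrative 1 MB threshold for incremental compression.

	var file bytes.Buffer // stands in for the fragment file.
	var pending []byte    // uncompressed spool content awaiting compression.

	for _, chunk := range [][]byte{
		bytes.Repeat([]byte("a"), 1<<20),
		bytes.Repeat([]byte("b"), 1<<20),
	} {
		pending = append(pending, chunk...)
		if len(pending) >= flushThreshold {
			if err := writeMember(&file, pending); err != nil {
				log.Fatal(err)
			}
			pending = pending[:0]
		}
	}

	// The concatenated members decompress as one continuous stream.
	zr, err := gzip.NewReader(&file)
	if err != nil {
		log.Fatal(err)
	}
	n, err := io.Copy(io.Discard, zr)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("decompressed bytes:", n) // 2097152
}
```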

This change only applies to standard gzip compression with client-side decompression. If decompression offloading is used, gzip files will continue to be written as a single stream, since some object stores truncate multi-member gzip content after the first member.
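
The distinction matters because a decoder that is not multistream-aware stops at the end of the first gzip member. Go's compress/gzip reader handles concatenated members by default, which is why client-side decompression is unaffected. A small sketch of both behaviors (the member helper is a stand-in, not from this PR):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"log"
)

// member returns a single self-contained gzip member for s.
func member(s string) []byte {
	var b bytes.Buffer
	zw := gzip.NewWriter(&b)
	zw.Write([]byte(s))
	zw.Close()
	return b.Bytes()
}

func main() {
	// Two gzip members concatenated into one byte stream.
	data := append(member("first "), member("second")...)

	// A multistream-aware reader (Go's default) decodes all members.
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	var all bytes.Buffer
	if _, err := all.ReadFrom(zr); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", all.String()) // "first second"

	// A decoder that treats the input as a single member stops early,
	// analogous to the truncation behavior described above.
	zr2, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	zr2.Multistream(false)
	var one bytes.Buffer
	if _, err := one.ReadFrom(zr2); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", one.String()) // "first "
}
```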

Manual Testing:

  • For 1,000 journals being actively written, observed a ~1 GiB drop in RSS memory usage post-change; profiling confirmed the reduction came from flate memory overhead.
  • Basic writes and reads of journals covering no compression, gzip, and snappy, against file / S3 / GCS / Azure stores.
  • E2E testing with a Flow local stack using AWS, GCS, and Azure storage mappings.
  • Quick throughput testing: writing to a modest number of journals as fast as possible on my laptop showed no difference in attainable throughput, and CPU usage looked about the same. Theoretically I'd expect some increased CPU from re-initializing gzip writers, but this crude test showed nothing major.

@williamhbaker marked this pull request as draft January 2, 2026 22:26
@williamhbaker force-pushed the wb/compression branch 6 times, most recently from 6754c4f to 019b816 on January 7, 2026 17:14
@williamhbaker marked this pull request as ready for review January 7, 2026 18:10
@jgraettinger (Contributor) left a comment


LGTM
