[zipsync] Add new tool to efficiently pack and unpack cache entries #5361
I've removed the integration into the build cache. We would need to re-design some things to use a worker pool. Using zipsync without a worker pool will end up being slower than tar+gzip. This is because of the overhead of booting up a worker and the node require calls.
dmichon-msft
left a comment
Would love to see more JSDoc and code comments around the binary-heavy parts, especially.
dmichon-msft
left a comment
Couple minor notes left.
Summary
zipsync is a tool to pack and unpack zip archives. It is designed as a single-purpose tool to pack and unpack build cache entries.
Details
Unpack
Pack
Supported compression types are store (no compression), deflate (level 9), auto (switches between store/deflate based on file extension).
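As a rough illustration of how the auto mode could choose between store and deflate, here is a minimal sketch. The extension list and the names resolveMethod and encodeEntry are assumptions made for this example and are not taken from the zipsync source; only the three mode names and deflate level 9 come from the description above.

```typescript
import * as path from "node:path";
import * as zlib from "node:zlib";

type CompressionMode = "store" | "deflate" | "auto";
type EntryMethod = "store" | "deflate";

// Hypothetical list of extensions whose content is usually already compressed.
// The list zipsync actually uses may differ.
const ALREADY_COMPRESSED: ReadonlySet<string> = new Set([".png", ".jpg", ".gz", ".zip", ".woff2"]);

// Pick the per-entry method: explicit modes pass through, auto decides by extension.
function resolveMethod(mode: CompressionMode, fileName: string): EntryMethod {
  if (mode !== "auto") {
    return mode;
  }
  const ext: string = path.extname(fileName).toLowerCase();
  return ALREADY_COMPRESSED.has(ext) ? "store" : "deflate";
}

// Produce the bytes that would be stored for one entry.
function encodeEntry(
  mode: CompressionMode,
  fileName: string,
  data: Buffer
): { method: EntryMethod; payload: Buffer } {
  const method: EntryMethod = resolveMethod(mode, fileName);
  // Zip entries hold raw deflate streams (no zlib header), hence deflateRawSync.
  const payload: Buffer = method === "deflate" ? zlib.deflateRawSync(data, { level: 9 }) : data;
  return { method, payload };
}

const sample = encodeEntry("auto", "lib/index.js", Buffer.from("console.log('hello');\n".repeat(50)));
console.log(sample.method, sample.payload.length);
```

Skipping deflate for content that is already compressed avoids spending CPU time on data that would not shrink anyway.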
Constraints
Though archives created by zipsync can be read by other zip-compatible programs, the reverse is not true: zipsync implements only a subset of zip features in order to achieve greater performance.
What's wrong with the current setup?
The current setup cleans the target directories and then unpacks the build cache entry. This ends up deleting and rewriting many files whose content has not changed.
Pros
With tar + gzip, files are archived first and compressed second. This allows the compression to work across file boundaries, so duplicate content across files can be compressed efficiently.
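The cross-file effect can be seen with a small experiment. This is a sketch under simplifying assumptions: the two "files" are identical random buffers, and the tar headers that a real archive would add are left out.

```typescript
import * as crypto from "node:crypto";
import * as zlib from "node:zlib";

// Two hypothetical files with identical content that is incompressible on its own.
const shared: Buffer = crypto.randomBytes(4096);
const files: Buffer[] = [shared, Buffer.from(shared)];

// zip-style: every entry is deflated independently, so the duplicate is paid for twice.
const perFileBytes: number = files.reduce(
  (sum: number, file: Buffer) => sum + zlib.deflateRawSync(file, { level: 9 }).length,
  0
);

// tar+gzip-style: entries are concatenated first, then compressed as a single stream,
// so the second copy can be encoded as back-references to the first.
const wholeStreamBytes: number = zlib.gzipSync(Buffer.concat(files), { level: 9 }).length;

console.log({ perFileBytes, wholeStreamBytes });
```

Deflate can only reference data within a 32 KiB window, so this advantage applies to duplicates that sit close together in the stream.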
Cons
Since compression is the last step, the archive must be decompressed before its contents can be enumerated or inspected.
It does not clean the target directory, so an rm -rf step is required.
Requirements
zipsync was created with the following constraints in mind:
Optimize for partial unpack scenario
Optimize for unpack performance. Most of the files in a build cache entry already exist on disk, and there is a good chance that they are already in the expected state.
Only write files when needed
This will minimize the number of write syscalls. Also, if the kernel has already cached the file from a recent read, the cache remains intact if we don't needlessly delete and rewrite the file.
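A minimal sketch of the write-only-when-needed idea follows. The helper name writeIfChanged is hypothetical, and comparing full file contents is an assumption made for simplicity; the actual tool may compare sizes, checksums, or hashes instead.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Writes `expected` to `filePath` only if the file is missing or differs.
// Returns true when a write happened, false when the existing file was left alone.
function writeIfChanged(filePath: string, expected: Buffer): boolean {
  try {
    const current: Buffer = fs.readFileSync(filePath);
    if (current.equals(expected)) {
      return false; // Already in the expected state: no delete, no rewrite.
    }
  } catch (err) {
    // A missing file just means we need to write it; anything else is a real error.
    if ((err as { code?: string }).code !== "ENOENT") {
      throw err;
    }
  }
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, expected);
  return true;
}

const demoDir: string = fs.mkdtempSync(path.join(os.tmpdir(), "zipsync-sketch-"));
const demoFile: string = path.join(demoDir, "lib", "index.js");
console.log(writeIfChanged(demoFile, Buffer.from("module.exports = 1;\n"))); // file is created
console.log(writeIfChanged(demoFile, Buffer.from("module.exports = 1;\n"))); // file is untouched
```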
Clean extra files and directories
This will remove the need to run rm -rf on the target directories, saving more time.
Disallow symlinks
Symlinks in build cache entries are not supported. This will remove the need to scan the target directories for symlinks before running tar.
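The cleanup and symlink requirements above could be combined into a single pass over the target directory, sketched below. The function name cleanExtras and the shape of the expected set (relative paths with forward slashes) are assumptions for illustration. Treating any symlink found in the target as an extra entry to remove is also an assumption, not confirmed behavior of zipsync.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Removes everything under `targetDir` that is not listed in `expected`.
// Returns the relative paths that were removed.
function cleanExtras(targetDir: string, expected: ReadonlySet<string>): string[] {
  const removed: string[] = [];

  // Returns true if `dir` still contains at least one expected file afterwards.
  const walk = (dir: string): boolean => {
    let kept = false;
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full: string = path.join(dir, entry.name);
      const rel: string = path.relative(targetDir, full).split(path.sep).join("/");
      if (entry.isDirectory()) {
        if (walk(full)) {
          kept = true;
        } else {
          fs.rmdirSync(full); // Directory ended up empty, so it is an extra too.
          removed.push(rel);
        }
      } else if (expected.has(rel) && !entry.isSymbolicLink()) {
        kept = true;
      } else {
        fs.unlinkSync(full); // Extra file, or a symlink (assumed never valid here).
        removed.push(rel);
      }
    }
    return kept;
  };

  walk(targetDir);
  return removed;
}

const cleanDemoDir: string = fs.mkdtempSync(path.join(os.tmpdir(), "zipsync-clean-"));
fs.writeFileSync(path.join(cleanDemoDir, "keep.txt"), "kept");
fs.writeFileSync(path.join(cleanDemoDir, "stale.txt"), "stale");
console.log(cleanExtras(cleanDemoDir, new Set(["keep.txt"])));
```

Because the walk already visits every entry, the same pass could also record which expected files are present, so that a later write step only touches files that are missing or changed.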
Why zip
zip was picked because:
How it was tested
node apps/rush/lib/start-dev.js --debug build --verbose -t module-minifier
Benchmark Results
This document contains performance measurements for packing and unpacking a synthetic dataset using tar, zip, and zipsync.
The dataset consists of two directory trees (subdir1, subdir2) populated with 1000 text files each.
zipsync scenarios
zip and tar scenarios clean the unpack directory before unpacking. This time is included in the measurements because zipsync internally handles cleaning as part of its operation.
System
Iterations: 100
Compressed (baseline: tar-gz)
Unpack Phase
Pack Phase
Uncompressed (baseline: tar)
Unpack Phase
Pack Phase