LibZip and Cloning (APFS and Btrfs) #519
Hello! Could someone kindly explain how LibZip takes advantage of cloning on Apple APFS and Linux Btrfs? I know that on systems that support file cloning, LibZip doesn't have to rewrite the whole file on zip_close(), and so is much faster. But the only explicit information about this that I can find is the following:
However, in my tests, LibZip seems to be even faster than I expect on APFS given the above:
This is where I get confused. I would expect this save to take around 15 seconds as well. The updated text file appears above the 800MB movie file in the zip's entries, and given that the zip is supposed to be "rewritten starting with the first changed entry", I would expect this text file, the movie file, and the text file after it all to be rewritten, making the save just as slow as adding the 800MB movie file in the first place.

However, this is not the case. In my tests, when I update the text file above the movie file, the save only takes around 1-2 seconds: slower than updating the entry after the large movie file, but much faster than rewriting the entire movie file. This is great of course, but I'd love to understand how it works. Presumably cloning means that LibZip doesn't necessarily have to rewrite everything below an updated entry after all? Thanks!
Replies: 4 comments
Cloning does work the way I described. So if you change the entry before the video, the video is rewritten.

Your timing measurements are probably thrown off by the file system cache: the first time you add the video file, it has to be read in from disk. When you update the zip archive, it is already in RAM, so writing it back out is much faster.

Getting repeatable measurements of file access is hard. Making sure everything is in the cache first helps. You can usually get pretty close by repeating the measurement multiple times and discarding the first runs until the timings settle into similar values.
Great, thank you very much for the reply and the information. That all makes sense. I'm using LibZip to read and write the zip-based file format of my app, so there's always a copy of the archive around in memory for reading, which must be why I always see the faster times after the movie file is initially added.

The reason I wanted to clarify this, by the way, is that understanding it is really helpful for optimising save times in my app. It's a project-based app where projects are by default stored as zip files and users can import all sorts of files - PDF files, movie files and so on. They can also create text files, which are editable (whereas research files such as PDFs and movie files cannot be edited). Given how LibZip works on systems that support cloning, we can achieve much faster saves if we ensure text files are lower down in the zip's entries than research files.

When a user makes changes to a text file and I update its entry in the zip file, I was previously using zip_file_add with the "overwrite" flag, but this can make saves slower if the text file is above some big research files in the zip. On APFS it's therefore faster to use zip_delete and then zip_file_add to recreate the text file as the last entry in the zip file. (That won't speed up the save after the first edit, but as the user continues to make changes to the text file, subsequent saves will be blazing fast because it's now the final entry in the zip.)

In fact, I'm now thinking that whenever a user imports a research file such as a movie, I should probably move all text files below it in the zip at the same time. Given that there are no move or insert functions, I guess I will need to delete all of the text files and then re-add them below the imported research file.

Anyway, thanks again, this was really helpful!
Keeping all the small, changeable files at the end is probably best, yes. If you just want to move the files to the end, you can create a source with

However, I'm not sure using a zip archive to store your documents is the best option. libzip does not support cloning on Windows, which would make saving there inefficient. I would probably keep the files separate and use a directory to store the document. macOS has support for bundles, which basically treat directories as files in the UI.
This is what I'm going to do, thanks.
Unfortunately bundles bring their own problems. On Windows they appear as regular folders, tempting users to move things out of them (we have an existing cross-platform app that uses a bundle file format where this is occasionally a problem). The bigger problem, though, is iOS. On iOS, Files app and

Our solution is therefore to use a zip-based file format by default, since that causes less friction for users, and most projects are going to be quite small (importing an 800MB video file would be an unusual but not impossible use case). For this we use LibZip on macOS and iOS; on Windows we have to unzip into a temp folder, which is clunkier and slower. But (similar to the Pages app) we allow users to switch to a package-based format if they want, and we recommend doing so if saving gets too slow or they need to work with projects of hundreds of megabytes, especially on Windows. And LibZip is working brilliantly for our macOS and iOS versions.

Anyway, thanks again, knowing the details is really helpful.