Conversation
Could you take some cues from the tar loader on handling compression (and magic detection), as well as supporting a few more compression formats besides gzip? Possibly look into fsutil.open_decompress too.
If we want to incorporate this nicely, it would be nice to remove any compression handling that other loaders currently do themselves, since this loader would take care of that.
Could a very dire warning be added to this loader that most gzipped files cannot be randomly read, which is why this will generally be very slow on large archives?
Similar warnings already exist in all the other loaders that deal with compression, so it makes sense to centralize all that logic into one big warning here.
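To illustrate why: Python's `gzip.GzipFile` does support `seek()`, but a backward seek rewinds to the start of the stream and re-decompresses everything up to the target offset, so random reads on a large archive degrade badly. A small stdlib-only illustration:

```python
import gzip
import io

# Compress some data and wrap it in a GzipFile for reading.
raw = b"x" * 1_000_000
compressed = io.BytesIO(gzip.compress(raw))
fh = gzip.GzipFile(fileobj=compressed)

fh.seek(999_000)  # forward seek: decompresses everything up to the offset
tail = fh.read()

fh.seek(0)        # backward seek: rewinds and re-decompresses from the start
head = fh.read(10)
```

Both seeks succeed, but neither is O(1): every seek costs decompression work proportional to the target offset.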
I just created #1589 as an idea for facilitating this kind of loader mechanism a bit more cleanly. @Matthijsy if you want, you can try to base this PR on top of that one.
5e8eccc to c479dc3
Thank you @Schamper! I reworked the PR to use the MiddelwareLoader. It now supports gz, lzma, bz2, and zst. These are all formats currently accepted by
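Of those formats, three are covered by the standard library and can be exercised uniformly; zstd needs the third-party `zstandard` package, so it is left out of this stdlib-only sketch:

```python
import bz2
import gzip
import lzma

data = b"some loader payload"

# Each stdlib module offers symmetric compress/decompress helpers,
# so a middleware-style loader can treat them uniformly.
codecs = {
    "gz": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),  # lzma covers .xz and .lzma
}

# Round-trip the payload through every codec.
results = {name: dec(enc(data)) for name, (enc, dec) in codecs.items()}
```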
Schamper left a comment
Can you replace all existing compression support in the other loaders, to see if it could be handled with this?
I have removed the compression handling from the TarLoader and added a benchmark. However, that shows a performance issue. Before: After: This shows that the median goes from 233 to 476. I guess this is because we need to map the file within a VFS, which is slower than using the native tar compression handling. Does anyone have an idea how we can still do this in a generic way (e.g. for all kinds of compressed files, not only tar) without losing this performance?
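A stdlib-only way to reproduce the shape of this comparison (the numbers will differ from the benchmark above, and dissect itself is not involved; `tarfile` stands in for the TarLoader):

```python
import gzip
import io
import tarfile
import time

# Build a small .tar.gz in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("member.bin")
    info.size = 100_000
    tar.addfile(info, io.BytesIO(b"a" * 100_000))
data = buf.getvalue()


def read_native():
    # tarfile handles the gzip layer itself ("r:gz").
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        return tar.extractfile("member.bin").read()


def read_predecompressed():
    # Strip the gzip layer first, then read a plain tar.
    plain = io.BytesIO(gzip.decompress(data))
    with tarfile.open(fileobj=plain, mode="r:") as tar:
        return tar.extractfile("member.bin").read()


t0 = time.perf_counter()
native = read_native()
t1 = time.perf_counter()
pre = read_predecompressed()
t2 = time.perf_counter()
```

Both paths must yield identical member contents; only the timings differ.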
This PR adds a loader that allows reading gzipped files. It packs the gzip file into a
VirtualFileSystem, so we can access the content without needing to decompress it first. Then we can pass the new path (which no longer has the gzip layer) to all loaders again. This way you can compress any kind of file currently supported by dissect without having to adapt each loader for it. From a performance perspective this is not ideal: it is considerably slower than processing a decompressed version. If someone has a better-performing idea than using the VirtualFileSystem, let me know!
closes #1455
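The core idea can be sketched with stdlib stand-ins: strip the compression layer, then hand the result to the normal loader. Here `tarfile` plays the part of the inner loader and an in-memory buffer stands in for the VirtualFileSystem; none of this is dissect's actual API:

```python
import gzip
import io
import tarfile


def strip_compression(fileobj):
    """Return a file-like with the gzip layer removed (sketch only).

    A real implementation would map this into a VirtualFileSystem so
    downstream loaders see an ordinary path instead of a .gz file.
    """
    return io.BytesIO(gzip.decompress(fileobj.read()))


# Build example.tar.gz in memory, then run it through the sketch.
inner = io.BytesIO()
with tarfile.open(fileobj=inner, mode="w") as tar:
    info = tarfile.TarInfo("hello.txt")
    info.size = 5
    tar.addfile(info, io.BytesIO(b"hello"))
compressed = io.BytesIO(gzip.compress(inner.getvalue()))

plain = strip_compression(compressed)
with tarfile.open(fileobj=plain, mode="r:") as tar:  # the inner "loader"
    content = tar.extractfile("hello.txt").read()
```

The inner loader never sees the gzip layer, which is exactly why no per-loader changes are needed.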