Add generic Compression Loader#1456

Open
Matthijsy wants to merge 8 commits into fox-it:main from
Matthijsy:feature/gzip-files
Conversation

@Matthijsy
Contributor

This PR adds a loader that allows reading gzipped files. It maps the gzip file into a VirtualFileSystem, so we can access the content without the need to decompress it beforehand. We can then pass the new Path (which is no longer gzip-compressed) to all loaders again. This way you can compress any kind of file currently supported by dissect without the need to adapt every loader for it.

From a performance perspective this is not ideal: it is considerably slower than processing a decompressed version. If someone has a better idea than using the VirtualFileSystem, one that might perform better, let me know!

closes #1455

Member

@Schamper Schamper left a comment

Could you take some cues from the tar loader on dealing with the compression (and magic), as well as dealing with a few more compression formats besides gzip? Possibly look into fsutil.open_decompress too.

If we want to incorporate this nicely, it would be nice to remove any dealing with compression that other loaders currently do themselves, as this loader would be dealing with that.

@JSCU-CNI
Contributor

Could a very dire warning be added to this loader that most gzip'ed files cannot be random-read from, which is why this will generally be very slow on large archives?
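
To illustrate the random-read concern, here is a minimal standard-library sketch. It is not code from this PR or from dissect; `CountingReader` is a hypothetical helper introduced only to count how many compressed bytes the decompressor pulls in. It shows that seeking near the end of a gzip stream, rewinding, and seeking again causes the stream to be decompressed from the start a second time.

```python
import gzip
import io
import random


class CountingReader(io.BytesIO):
    """Hypothetical helper: a BytesIO that counts compressed bytes read from it."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


# Deterministic, poorly compressible payload so the compressed size is comparable.
payload = random.Random(0).randbytes(1 << 20)
compressed = gzip.compress(payload)

raw = CountingReader(compressed)
with gzip.GzipFile(fileobj=raw) as gz:
    # Forward seek: the decompressor has to work through everything before the target.
    gz.seek(len(payload) - 16)
    tail = gz.read(16)
    first_pass = raw.bytes_read

    # Backward seek: the stream is rewound and decompression restarts from offset 0.
    gz.seek(0)
    gz.seek(len(payload) - 16)
    tail_again = gz.read(16)
    second_pass = raw.bytes_read - first_pass

print(f"compressed size:       {len(compressed)}")
print(f"read for first seek:   {first_pass}")
print(f"read after rewinding:  {second_pass}")
```

Any access pattern that jumps around inside the archive pays this cost repeatedly, which is why a warning on the loader seems reasonable.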

@Schamper
Member

> Could a very dire warning be added to this loader that most gzip'ed files cannot be random-read from, which is why this will generally be very slow on large archives?

Similar warnings already exist in all the other loaders that deal with compression, so it makes sense to centralize all that logic into one super big warning here.

@Schamper
Member

I just created #1589 as an idea for facilitating this kind of loader mechanism a bit more cleanly. @Matthijsy if you want, you can try to base this PR on top of that one.

@Matthijsy Matthijsy force-pushed the feature/gzip-files branch from 5e8eccc to c479dc3 on March 2, 2026 19:51
@Matthijsy
Contributor Author

Thank you @Schamper! I reworked the PR to use the MiddelwareLoader. It now supports gz, lzma, bz2 and zst. These are all formats currently accepted by fsutil.open_decompress, which we now also use to open the file itself.
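
As a rough illustration of magic-based detection for these formats, here is a standard-library sketch. It is not the implementation of `fsutil.open_decompress` or of this loader; `detect_compression` and `open_decompressed` are hypothetical names. Zstandard is only detected, not opened, because that would need a third-party package on most Python versions.

```python
import bz2
import gzip
import io
import lzma
from typing import BinaryIO, Optional

# Leading magic bytes of each format. Legacy .lzma ("alone") files have no
# reliable magic, so only the .xz container is detected in this sketch.
MAGICS = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def detect_compression(fh: BinaryIO) -> Optional[str]:
    """Peek at the first bytes of fh and return a format name, or None."""
    pos = fh.tell()
    head = fh.read(6)
    fh.seek(pos)  # restore the position so the caller can still read from the start
    for magic, name in MAGICS.items():
        if head.startswith(magic):
            return name
    return None


def open_decompressed(fh: BinaryIO) -> BinaryIO:
    """Wrap fh in a decompressing file object if a known magic is found."""
    kind = detect_compression(fh)
    if kind == "gzip":
        return gzip.GzipFile(fileobj=fh)
    if kind == "bz2":
        return bz2.BZ2File(fh)
    if kind == "xz":
        return lzma.LZMAFile(fh)
    if kind == "zstd":
        raise NotImplementedError("zstd needs an extra dependency in this sketch")
    return fh  # not compressed (or unknown): hand back the original stream


sample = b"hello dissect " * 10
print(detect_compression(io.BytesIO(gzip.compress(sample))))
print(open_decompressed(io.BytesIO(bz2.compress(sample))).read() == sample)
```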

@Matthijsy Matthijsy changed the title from "Add generic GZIP Loader" to "Add generic Compression Loader" on Mar 3, 2026
Member

@Schamper Schamper left a comment

Can you replace all existing compression support from other loaders, to see if it could be handled with this?

@codspeed-hq

codspeed-hq bot commented Mar 16, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing Matthijsy:feature/gzip-files (7f2b247) with main (c9b1e1d)

@Matthijsy
Contributor Author

I have removed the handling of compression from the TarLoader, and added a benchmark. However, that shows an issue with performance.

Before:

---------------------------------------------------------------------------------------------------------- benchmark: 2 tests ----------------------------------------------------------------------------------------------------------
Name (time in us)                                              Min                    Max                Mean              StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark[_data/loaders/tar/test-archive.tar.gz]     223.7920 (1.0)       2,060.1670 (1.0)      257.6979 (1.0)      164.0039 (1.0)      233.4580 (1.0)      12.8954 (2.06)         4;10        3.8805 (1.0)         136           1
test_benchmark[_data/loaders/tar/test-archive.tar]        238.4580 (1.07)     21,046.2080 (10.22)    283.0146 (1.10)     739.2065 (4.51)     246.6665 (1.06)      6.2504 (1.0)          2;61        3.5334 (0.91)        796           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After:

--------------------------------------------------------------------- benchmark: 1 tests ---------------------------------------------------------------------
Name (time in us)                                              Min       Max      Mean   StdDev    Median      IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark[_data/loaders/tar/test-archive.tar.gz]     460.5000  716.9580  509.5390  82.8613  476.7920  36.6880       2;2        1.9626      15           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------

This shows that the median goes from roughly 233 µs to 476 µs, about twice as slow. I guess this is because we need to map the file within a VFS, which is slower than using the native tar compression handling. Does anyone have an idea how we can still do this in a generic way (e.g. for all kinds of compressed files, not only tar) without losing this performance?
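
One possible way to narrow down where the time goes is to measure the decompression path on its own, outside of dissect. The sketch below uses only the standard library and does not involve the VFS or any loader, so it can only show whether layering a generic gzip stream under `tarfile` is itself costly. The helper names and the synthetic archive are assumptions made for illustration.

```python
import gzip
import io
import tarfile
import timeit


def build_tar_gz(n_files: int = 50, size: int = 4096) -> bytes:
    """Build a small synthetic .tar.gz archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for i in range(n_files):
            info = tarfile.TarInfo(name=f"file_{i}.bin")
            info.size = size
            tar.addfile(info, io.BytesIO(bytes([i % 256]) * size))
    return buf.getvalue()


def list_native(blob: bytes) -> list:
    """Let tarfile handle the gzip layer itself."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        return tar.getnames()


def list_layered(blob: bytes) -> list:
    """Decompress with a generic gzip stream, then open it as a plain tar."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as plain:
        with tarfile.open(fileobj=plain, mode="r:") as tar:
            return tar.getnames()


blob = build_tar_gz()
for func in (list_native, list_layered):
    seconds = timeit.timeit(lambda: func(blob), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e6:.1f} us per run")
```

If both variants land close together, the extra time in the benchmark above is more likely to come from the VFS mapping or the second loader pass than from decompression, but that would need to be confirmed inside dissect itself.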

Development

Successfully merging this pull request may close these issues.

Add generic support for compression format in loaders

3 participants