Add optional extraction of single ZIP input (slow $MFT parsing workaround)#21
einarssonm wants to merge 1 commit into openrelik:main from
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Force-pushed from c062317 to 08e9722.
Hi Markus, thanks for the contribution. Have you tried using the extraction worker to extract, e.g., the MFT from the ZIP or 7z file and then running the log2timeline task against that? See screenshot.
Valid point! The extraction worker is a good option when you want to extract a subset of the input data and/or keep the extracted artifacts. I use that in the following workflow, where .evtx files are extracted and sent to Hayabusa.
A common use case for me (and others) is to process ZIP-compressed triage images created with a Velociraptor offline collector. Such triage images often contain 3,000-8,000 files and 1,000-2,000 directories, and there is rarely a need to add all those files to the OpenRelik repository. Here is a similar workflow where the extraction worker has been added before the log2timeline task. This results in 3,700 files being extracted and added to the OpenRelik repository.
Here are a few issues with the extraction worker + log2timeline approach:
Timesketch output for triage image processed with the extraction worker + log2timeline approach:
Timesketch output for triage image processed with the modified log2timeline worker approach (this PR):
A recent update to my log2timeline/plaso issue shows that there are known performance issues when processing large archives with Plaso. Related GitHub issues have been open since 2016 and 2021. If we proceed with this PR, we could consider a more general approach: optionally extracting ZIP input even when the log2timeline task receives multiple ZIP input files, not just a single one. Your call. 😉
Solid arguments, thank you. Let me discuss this next week with my team and get back to you!
Summary
The changes add a workaround for slow Plaso parsing of $MFT files which reside in a ZIP archive.
Technical details
- Add the p7zip-full package to the Dockerfile, to be able to use the 7z command for ZIP file extraction.
- Add an "Extract single ZIP input" configuration option to the log2timeline task.
- Update the log2timeline function to extract single ZIP input to a tempdir using the extract_archive function, and to pass the tempdir to the log2timeline.py command.
Background
Plaso parsing of $MFT files is very slow when the $MFT resides in a ZIP archive. Parsing of an $MFT test file (87 MB) takes ~3 minutes, but parsing an $MFT.zip file which contains the same $MFT (nothing else) takes ~73 minutes.
This issue was initially seen when Plaso parsing of triage images created by a Velociraptor offline collector sometimes took up to 4-5 days(!). When the same triage image was extracted to a tempdir and the tempdir was passed to log2timeline.py, the parsing took ~3 hours. When examining the parsing process closely, it was clear that the long tail was always caused by the $MFT file(s) while the other artifacts were parsed reasonably fast.
See also log2timeline/plaso#4999 for more details.
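The workaround described under Technical details can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses Python's standard zipfile module as a stand-in for the 7z-based extract_archive function, and the helper names extract_single_zip and build_command are hypothetical. The only log2timeline.py option assumed is --storage-file.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List, Optional


def extract_single_zip(input_paths: List[str], enabled: bool, workdir: str) -> Optional[Path]:
    """Extract a lone ZIP input into a fresh temp dir and return that dir.

    Hypothetical helper: returns None when the option is disabled, when
    there is not exactly one input, or when the input is not a ZIP archive.
    """
    if not enabled or len(input_paths) != 1:
        return None
    archive = Path(input_paths[0])
    if not zipfile.is_zipfile(archive):
        return None
    # Stand-in for the 7z-based extraction used by the worker.
    target = Path(tempfile.mkdtemp(prefix="extracted_", dir=workdir))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


def build_command(source: Path, storage_file: str) -> List[str]:
    """Build a log2timeline.py invocation that parses the given source path."""
    return ["log2timeline.py", "--storage-file", str(storage_file), str(source)]
```

With this shape, the task would pass the returned directory to build_command when extraction applies, and fall back to the original input path otherwise, so Plaso reads the $MFT from disk instead of from inside the archive.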