Add optional extraction of single ZIP input (slow $MFT parsing workaround) #21

Open
einarssonm wants to merge 1 commit into openrelik:main from einarssonm:fix-mft-parsing
@einarssonm

Summary

This change adds a workaround for slow Plaso parsing of $MFT files that reside in a ZIP archive.

Technical details

  • Adds the p7zip-full package to the Dockerfile, to be able to use the 7z command for ZIP file extraction.
  • Adds an "Extract single ZIP input" configuration option to the log2timeline task.
  • Adds logic to the log2timeline function for extracting single ZIP input to a tempdir using the extract_archive function, and passing the tempdir to the log2timeline.py command.
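For readers who want to see the shape of this logic, here is a minimal sketch of the "extract a single ZIP input to a tempdir, then hand the tempdir to log2timeline.py" idea. It is not the code from this PR: it uses Python's standard `zipfile` module as a stand-in for the `extract_archive` function and the 7z command, and the helper names `maybe_extract_single_zip` and `build_command` are hypothetical. The `--storage-file` argument is used as described in the Plaso documentation; treat the exact command line as an assumption.

```python
import tempfile
import zipfile
from pathlib import Path


def maybe_extract_single_zip(input_paths, enabled):
    """Extract the input to a temporary directory if it is a single ZIP file.

    Returns a TemporaryDirectory (the caller must clean it up), or None when
    the option is disabled, there is not exactly one input, or the input is
    not a ZIP archive. Hypothetical helper, not part of the worker.
    """
    if not enabled or len(input_paths) != 1:
        return None
    archive_path = input_paths[0]
    if not zipfile.is_zipfile(archive_path):
        return None
    tempdir = tempfile.TemporaryDirectory(prefix="l2t_zip_")
    # Stand-in for the 7z based extraction used in the PR.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(tempdir.name)
    return tempdir


def build_command(source, storage_file):
    """Build the log2timeline.py argument list for a file or directory source."""
    return ["log2timeline.py", "--storage-file", str(storage_file), str(source)]
```

A caller would pass `tempdir.name` to `build_command` when a tempdir is returned, fall back to the original input path otherwise, and call `tempdir.cleanup()` once log2timeline.py has finished.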

Background

Plaso parsing of $MFT files is very slow when the $MFT resides in a ZIP archive. Parsing of an $MFT test file (87 MB) takes ~3 minutes, but parsing an $MFT.zip file which contains the same $MFT (nothing else) takes ~73 minutes.

This issue was initially seen when Plaso parsing of triage images created by a Velociraptor offline collector sometimes took up to 4-5 days(!). When the same triage image was extracted to a tempdir and the tempdir was passed to log2timeline.py, the parsing took ~3 hours. When examining the parsing process closely, it was clear that the long tail was always caused by the $MFT file(s) while the other artifacts were parsed reasonably fast.
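To put the timings above side by side, the following snippet computes the rough slowdown factors. The inputs are the approximate figures quoted in this description, so the results are only indicative.

```python
def slowdown(slow_minutes, fast_minutes):
    """Return how many times slower the first duration is than the second."""
    return slow_minutes / fast_minutes


# $MFT test file: ~73 minutes inside a ZIP vs ~3 minutes extracted.
mft_test = slowdown(73, 3)

# Triage image: up to 4-5 days as a ZIP vs ~3 hours from a tempdir.
triage_low = slowdown(4 * 24 * 60, 3 * 60)
triage_high = slowdown(5 * 24 * 60, 3 * 60)

print(f"$MFT test file: ~{mft_test:.0f}x slower inside a ZIP")
print(f"Triage image: ~{triage_low:.0f}x to ~{triage_high:.0f}x slower as a ZIP")
```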

See also log2timeline/plaso#4999 for more details.

@google-cla

google-cla bot commented Oct 10, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@hacktobeer
Contributor

Hi Markus, thanks for the contribution.

I was wondering whether you have tried using the extraction worker to extract, e.g., the $MFT from the (7-)Zip file and then running the log2timeline task against that?

See screenshot.

Screenshot 2025-10-11 at 08 24 23

@einarssonm
Author

Valid point! The extraction worker is a good option when you want to extract a subset of the input data and/or keep the extracted artifacts. I use that in the following workflow, where .evtx files are extracted and sent to Hayabusa.

image

A common use case for me (and others) is to process ZIP-compressed triage images created with a Velociraptor offline collector. Such triage images often contain 3,000-8,000 files and 1,000-2,000 directories, and there is rarely a need to add all those files to the OpenRelik repository.

Here is a similar workflow where the extraction worker has been added before the log2timeline task. This results in 3,700 files being extracted and added to the OpenRelik repository.

image

Here are a few issues with the extraction worker + log2timeline approach:

  • increased disk usage, because all triage artifacts are extracted and kept even though you only want the log2timeline output
  • OpenRelik processing overhead when adding metadata for 3,000-8,000 files per triage image
  • lost filename and file hierarchy context (see below)

Timesketch output for triage image processed with the extraction worker + log2timeline approach:
image

Timesketch output for triage image processed with the modified log2timeline worker approach (this PR):
image

@einarssonm
Author

einarssonm commented Oct 11, 2025

A recent update to my log2timeline/plaso issue shows that there are known performance issues when processing large archives with Plaso. Related GitHub issues have been open since 2016 and 2021.

If we proceed with this PR, we could consider a more general approach: optionally extracting ZIP input even when the log2timeline task receives multiple ZIP input files, not just a single one.
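A rough sketch of what that more general approach could look like, again using the standard `zipfile` module as a stand-in for the 7z based extraction. The function name `extract_zip_inputs` and the per-archive subdirectory naming are hypothetical, and how the resulting sources would be handed to log2timeline.py is left open here.

```python
import zipfile
from pathlib import Path


def extract_zip_inputs(input_paths, dest_root):
    """Extract every ZIP input into its own subdirectory of dest_root.

    Returns one source per input, in order: the extraction directory for ZIP
    inputs, the unchanged path for everything else. Hypothetical helper.
    """
    dest_root = Path(dest_root)
    sources = []
    for index, path in enumerate(input_paths):
        path = Path(path)
        if zipfile.is_zipfile(path):
            # Prefix with the input index so archives with the same name
            # do not overwrite each other.
            target = dest_root / f"{index:03d}_{path.stem}"
            target.mkdir(parents=True)
            with zipfile.ZipFile(path) as archive:
                archive.extractall(target)
            sources.append(target)
        else:
            sources.append(path)
    return sources
```

Keeping one subdirectory per archive preserves each archive's internal file hierarchy, which matters for the filename context discussed earlier in this thread.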

Your call. 😉

@hacktobeer
Contributor

Solid arguments, thank you. Let me discuss this next week with my team and get back to you!
