Add optional extraction of single ZIP input (slow $MFT parsing workaround) by einarssonm · Pull Request #21 · openrelik/openrelik-worker-plaso

einarssonm · 2025-10-10T22:19:30Z

Summary

This change adds a workaround for slow Plaso parsing of $MFT files that reside in a ZIP archive.

Technical details

Adds the p7zip-full package to the Dockerfile, so that the 7z command can be used for ZIP file extraction.
Adds an "Extract single ZIP input" configuration option to the log2timeline task.
Adds logic to the log2timeline function that extracts a single ZIP input to a tempdir using the extract_archive function and passes the tempdir to the log2timeline.py command.
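
The flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper names (`is_single_zip_input`, `build_extract_command`, `build_log2timeline_command`, `run_log2timeline`) are hypothetical, and the signature of the worker's `extract_archive` function is not shown in this thread, so the sketch calls `7z` directly instead.

```python
import subprocess
import tempfile
import zipfile


def is_single_zip_input(input_paths):
    """Return True when exactly one input is given and it is a ZIP archive."""
    return len(input_paths) == 1 and zipfile.is_zipfile(input_paths[0])


def build_extract_command(archive_path, dest_dir):
    """Build a 7z command line that extracts archive_path into dest_dir."""
    # x = extract with full paths, -o<dir> = output directory, -y = assume yes.
    return ["7z", "x", f"-o{dest_dir}", "-y", str(archive_path)]


def build_log2timeline_command(source, storage_file):
    """Build a log2timeline.py command line for a file or directory source."""
    return ["log2timeline.py", "--storage-file", str(storage_file), str(source)]


def run_log2timeline(input_path, storage_file, extract_zip=False):
    """Run log2timeline.py on input_path, extracting a ZIP input first if asked.

    Hypothetical wrapper: the real task has more options and error handling.
    """
    if extract_zip and is_single_zip_input([input_path]):
        # The tempdir is removed once parsing finishes, so the extracted
        # files are never added to the OpenRelik repository.
        with tempfile.TemporaryDirectory() as tmpdir:
            subprocess.run(build_extract_command(input_path, tmpdir), check=True)
            subprocess.run(build_log2timeline_command(tmpdir, storage_file), check=True)
    else:
        subprocess.run(build_log2timeline_command(input_path, storage_file), check=True)
```

Keeping the command construction in small functions makes the extraction step easy to test without 7z or Plaso installed.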

Background

Plaso parsing of $MFT files is very slow when the $MFT resides in a ZIP archive. Parsing of an $MFT test file (87 MB) takes ~3 minutes, but parsing an $MFT.zip file which contains the same $MFT (nothing else) takes ~73 minutes.

This issue was initially seen when Plaso parsing of triage images created by a Velociraptor offline collector sometimes took up to 4-5 days(!). When the same triage image was extracted to a tempdir and the tempdir was passed to log2timeline.py, the parsing took ~3 hours. When examining the parsing process closely, it was clear that the long tail was always caused by the $MFT file(s) while the other artifacts were parsed reasonably fast.
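
To put the numbers above side by side, here is a small back-of-the-envelope calculation. It only uses the approximate timings quoted in this description; the function name is made up for illustration.

```python
def slowdown(slow_minutes, fast_minutes):
    """Return how many times longer the slow run took than the fast run."""
    return slow_minutes / fast_minutes


# Single $MFT test file: ~73 minutes inside a ZIP vs ~3 minutes extracted.
mft_ratio = slowdown(73, 3)

# Triage image: 4-5 days as a ZIP vs ~3 hours when extracted to a tempdir.
triage_low = slowdown(4 * 24 * 60, 3 * 60)
triage_high = slowdown(5 * 24 * 60, 3 * 60)

print(f"$MFT only: ~{mft_ratio:.0f}x slower inside the ZIP")
print(f"Triage image: ~{triage_low:.0f}x to ~{triage_high:.0f}x slower as a ZIP")
```

Both cases land in the same rough range (about 24x for the lone $MFT, about 32x to 40x for the full triage image), which is consistent with the $MFT being the dominant cost.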

See also log2timeline/plaso#4999 for more details.

google-cla · 2025-10-10T22:19:34Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

hacktobeer · 2025-10-11T06:27:29Z

Hi Markus, thanks for the contribution.

I was wondering if you have tried using the extraction worker to extract e.g. the MFT from the (7-)Zip file and then running the log2timeline task against that?

See screenshot.

einarssonm · 2025-10-11T12:43:19Z

Valid point! The extraction worker is a good option when you want to extract a subset of the input data and/or keep the extracted artifacts. I use that in the following workflow, where .evtx files are extracted and sent to Hayabusa.

A common use case for me (and others) is to process ZIP-compressed triage images created with a Velociraptor offline collector. Such triage images often contain 3,000-8,000 files and 1,000-2,000 directories, and there is rarely a need to add all those files to the OpenRelik repository.

Here is a similar workflow where the extraction worker has been added before the log2timeline task. This results in 3,700 files being extracted and added to the OpenRelik repository.

Here are a few issues with the extraction worker + log2timeline approach:

increased disk usage, because all triage artifacts are extracted and kept even though you only want the log2timeline output
OpenRelik processing overhead when adding metadata for 3,000-8,000 files per triage image
lost filename and file hierarchy context (see below)

Timesketch output for triage image processed with the extraction worker + log2timeline approach:

Timesketch output for triage image processed with the modified log2timeline worker approach (this PR):

einarssonm · 2025-10-11T12:45:48Z

A recent update to my log2timeline/plaso issue shows that there are known performance issues when processing large archives with Plaso. Related GitHub issues from 2016 and 2021 are still open.

If we proceed with this PR, we could consider a more general approach: optionally extracting ZIP input even when the log2timeline task receives multiple ZIP input files, not just a single one.
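
One possible shape for that more general approach is sketched below. It is only an illustration of the idea, not part of this PR: the function name `extract_zip_inputs` is hypothetical, and it uses Python's standard `zipfile` module in place of the 7z command so that it runs without extra packages. How the resulting sources would then be handed to log2timeline.py is left open.

```python
import zipfile
from pathlib import Path


def extract_zip_inputs(input_paths, work_dir):
    """Extract every ZIP input into its own subdirectory of work_dir.

    Returns one source path per input: the extraction directory for ZIP
    inputs, and the unchanged path for everything else.
    """
    sources = []
    for index, raw_path in enumerate(input_paths):
        path = Path(raw_path)
        if zipfile.is_zipfile(path):
            # Prefix with the input index so archives sharing a name
            # do not overwrite each other.
            target = Path(work_dir) / f"{index}_{path.stem}"
            target.mkdir(parents=True)
            with zipfile.ZipFile(path) as archive:
                archive.extractall(target)
            sources.append(target)
        else:
            sources.append(path)
    return sources
```

Extracting each archive into its own directory keeps the file hierarchy of every triage image intact, which is the context that gets lost with the extraction worker approach.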

Your call. 😉

hacktobeer · 2025-10-11T12:56:05Z

Solid arguments, thank you. Let me discuss this next week with my team and get back to you!

Add extraction of single ZIP input (slow parsing workaround)

08e9722

einarssonm force-pushed the fix-mft-parsing branch from c062317 to 08e9722 on October 10, 2025 22:37

einarssonm wants to merge 1 commit into openrelik:main from einarssonm:fix-mft-parsing