vmray: support parsing flog.txt (Download Function Log) #2878
devs6186 wants to merge 3 commits into mandiant:master
Code Review
The pull request introduces support for parsing VMRay's flog.txt format, which is a great addition for users who don't have access to the full analysis ZIP. The implementation is generally sound and integrates well with the existing VMRay extractor. I have identified a few areas for improvement regarding robustness against malformed input and the completeness of the extracted features (specifically API arguments).
def _parse_hex_or_decimal(s: str) -> int:
    s = s.strip().strip('"')
    if s.startswith("0x") or s.startswith("0X"):
        return int(s, 16)
    return int(s, 10)
The _parse_hex_or_decimal function is not robust against empty strings. If a property in the flog.txt file is present but has no value (e.g., os_pid = ), int(s, 10) will raise a ValueError, causing the parser to crash. It would be safer to handle empty strings by returning a default value (like 0) or skipping the property.
Suggested change
-def _parse_hex_or_decimal(s: str) -> int:
-    s = s.strip().strip('"')
-    if s.startswith("0x") or s.startswith("0X"):
-        return int(s, 16)
-    return int(s, 10)
+def _parse_hex_or_decimal(s: str) -> int:
+    s = s.strip().strip('"')
+    if not s:
+        return 0
+    if s.lower().startswith("0x"):
+        return int(s, 16)
+    return int(s, 10)
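The comment above also mentions skipping the property instead of defaulting to 0. A minimal sketch of that alternative, assuming a hypothetical helper name (`parse_hex_or_decimal_opt`) and illustrative property names other than `os_pid`; this is not the code in the PR:

```python
from typing import Optional


def parse_hex_or_decimal_opt(s: str) -> Optional[int]:
    """Hypothetical variant: return None for empty or malformed values."""
    s = s.strip().strip('"')
    if not s:
        return None
    try:
        # base 16 for 0x/0X-prefixed values, base 10 otherwise
        return int(s, 16) if s.lower().startswith("0x") else int(s, 10)
    except ValueError:
        return None


# Illustrative "key = value" properties; only os_pid comes from the review comment.
raw_props = {"os_pid": "", "os_parent_pid": "0x4", "monitor_reason": '"child_process"'}

parsed = {}
for key, raw in raw_props.items():
    value = parse_hex_or_decimal_opt(raw)
    if value is not None:  # skip properties that carry no usable number
        parsed[key] = value
```

With this variant a missing PID stays distinguishable from a real PID of 0, at the cost of callers having to handle `None`.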
thread_blocks = [p.strip() for p in parts[1:] if p.strip()]

# First part: Process properties then Region: blocks
process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])
The split by \nRegion:\n is less flexible than the regex-based splits used for Process: and Thread: blocks. If the log file contains trailing spaces after Region:, the split will fail to isolate the process properties. For consistency and robustness, consider using a regex similar to the ones used in lines 122 and 200.
Suggested change
-process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])
+process_props = _parse_properties(re.split(r"\n\s*Region:\s*\n", header_and_regions)[0])
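To illustrate the difference, here is a small self-contained comparison of the two splits. The input text is a made-up excerpt (the real flog.txt layout may differ); the only point is the trailing space after `Region:`:

```python
import re

# Hypothetical excerpt; note the trailing space after "Region:".
header_and_regions = (
    "Process:\n"
    "os_pid = 0x4d8\n"
    "\n"
    "Region: \n"
    "id = 1\n"
)

# Literal split: the separator "\nRegion:\n" never occurs, so nothing is split off.
literal = header_and_regions.split("\nRegion:\n")

# Regex split: tolerates whitespace before and after "Region:".
flexible = re.split(r"\n\s*Region:\s*\n", header_and_regions)
```

Here `literal` has a single element (the whole text, region lines included), while `flexible[0]` contains only the process properties.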
params_in=None,  # flog.txt args could be parsed later into Param list
params_out=None,
Currently, API call arguments are not being parsed and are set to None. Since many capa rules rely on specific argument values (e.g., registry keys, file paths, or flags), this significantly limits the effectiveness of the extractor when using flog.txt. While the comment acknowledges this as a future improvement, implementing even a basic parser for the args_str extracted in _parse_event would greatly enhance the utility of this new feature.
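As a rough idea of what a basic parser for `args_str` could look like, here is a sketch. It assumes arguments appear as comma-separated `name=value` pairs where values are either double-quoted strings (with backslashes doubled) or hex/decimal numbers; that format, the function name `parse_args`, and the sample line are assumptions, and the result is a plain list of tuples rather than capa's Param objects:

```python
import re
from typing import List, Tuple, Union

# Assumed argument syntax: name="quoted string" or name=0x1f / name=31
ARG_RE = re.compile(
    r'''
    (?P<name>\w+)\s*=\s*                  # argument name
    (?:
        "(?P<str>(?:[^"\\]|\\.)*)"        # quoted string, backslash escapes allowed
      | (?P<num>0[xX][0-9a-fA-F]+|\d+)    # hex or decimal number
    )
    ''',
    re.VERBOSE,
)


def parse_args(args_str: str) -> List[Tuple[str, Union[str, int]]]:
    """Hypothetical helper: extract (name, value) pairs from an argument string."""
    out: List[Tuple[str, Union[str, int]]] = []
    for m in ARG_RE.finditer(args_str):
        if m.group("str") is not None:
            # Assumption: the log doubles backslashes, so collapse them.
            out.append((m.group("name"), m.group("str").replace("\\\\", "\\")))
        else:
            num = m.group("num")
            base = 16 if num.lower().startswith("0x") else 10
            out.append((m.group("name"), int(num, base)))
    return out


# Made-up sample; the comma inside the quoted path must not split the argument.
sample = r'lpFileName="C:\\Temp\\a, b.txt", dwDesiredAccess=0x80000000, nCount=10'
args = parse_args(sample)
```

Values that fit neither form (for example pointers with nested structures) are silently skipped here; a real implementation would need to decide how to represent them.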
Thanks for the review! I've addressed all three suggestions in eca9286.
Please re-review.
williballenthin left a comment
- what are the pros/cons of using flog versus the full archive? we should document that clearly somewhere
- aside: could we provide a helper script that, given a sample hash, automatically retrieves the flog from vmray and shows the results?
- we need a reasonable collection of flog files committed to testfiles and run during CI. especially with the string parsing, which tends to be brittle, we need infrastructure in place to find regressions and bugs.
thanks @devs6186
… flog.txt

Addresses reviewer feedback on mandiant#2878:

1. Document flog.txt vs full archive trade-offs in doc/usage.md with a comparison table (available features, how to obtain, file size).
2. Add scripts/fetch-vmray-flog.py: given a VMRay instance URL, API key, and sample SHA-256, downloads flog.txt via the REST API and optionally runs capa against it.
3. Add fixture-based regression tests (tests/fixtures/vmray/flog_txt/) with three representative flog.txt files:
   - windows_apis.flog.txt: Win32 APIs, string args with backslash paths, numeric args, multi-process
   - linux_syscalls.flog.txt: Linux sys_-prefixed calls (all stripped)
   - string_edge_cases.flog.txt: paths with spaces, UNC paths, URLs, empty

tests/test_vmray_flog_txt.py gains 14 new feature-presence tests covering API, String, and Number extraction at the call scope, plus negative checks (double-backslash must not appear; sys_ prefix must not appear).

Fixes mandiant#2878
hey @williballenthin, I have addressed everything in the latest commit:
- for the docs, I added a comparison section in usage.md with a table laying out exactly what you get from […]
- for the fetch script […]
- for the fixtures, I added three flog.txt files under […]

On adding real samples to testfiles, I am totally happy to do that as a follow-up; I just wanted to point out that it needs a separate PR to the submodule. If you have particular samples in mind, let me know and I'll set it up.
Adds a parser for the VMRay flog.txt format (the free "Download Function Log" available from VMRay Threat Feed -> Full Report). Users no longer need the full ZIP archive to run capa against VMRay output.

- capa/features/extractors/vmray/flog_txt.py: new parser for flog.txt header validation, Process/Thread/Region block splitting, API trace line parsing, sys_ prefix stripping
- VMRayAnalysis.from_flog_txt() and VMRayExtractor.from_flog_txt() for constructing the extractor from a standalone flog.txt
- helpers.py: detect flog.txt by filename + header magic; update unsupported-format error message to mention flog.txt
- loader.py: route flog.txt inputs through VMRayExtractor.from_flog_txt
- tests/test_vmray_flog_txt.py: 5 unit tests covering parse, header rejection, sys_ stripping, analysis and extractor construction

Fixes mandiant#2452
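For readers unfamiliar with the format, API trace line parsing with sys_ prefix stripping might look roughly like the sketch below. The line layout (`[timestamp] ApiName (args) returned value`), the `Event` type, and the `parse_event` name are assumptions for illustration and do not reproduce the `_parse_event` function in this PR:

```python
import re
from typing import NamedTuple, Optional


class Event(NamedTuple):
    timestamp: str
    api: str
    args_str: str
    return_value: Optional[str]


# Assumed trace line shape: "[0071.184] ApiName (arg=..., ...) returned 0x..."
EVENT_RE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\]\s+"              # timestamp in brackets
    r"(?P<api>[\w:.]+)\s*"                   # API or syscall name
    r"\((?P<args>.*)\)"                      # raw argument text
    r"(?:\s+returned\s+(?P<ret>\S+))?\s*$"   # optional return value
)


def parse_event(line: str) -> Optional[Event]:
    """Hypothetical helper: parse one trace line, or return None if it is not one."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None
    api = m.group("api")
    if api.startswith("sys_"):
        # Linux syscalls are logged with a sys_ prefix; strip it to get the bare name.
        api = api[len("sys_"):]
    return Event(m.group("ts"), api, m.group("args"), m.group("ret"))


win = parse_event('[0071.184] RegisterClipboardFormatW (lpszFormat="WM_GETCONTROLTYPE") returned 0xc1dc')
lin = parse_event('[0001.002] sys_open (pathname="/etc/passwd", flags=0x0) returned 0x3')
```

The greedy `.*` keeps everything up to the last closing parenthesis, so parentheses inside string arguments stay within `args_str`; a string that itself contains `) returned` would still confuse this sketch.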
- Handle empty strings in _parse_hex_or_decimal (return 0 instead of crash)
- Use regex for Region: block splitting (consistent with Process:/Thread:)
- Parse API call arguments into Param objects so String/Number features are extracted (string args use void_ptr+str deref to match XML convention)
- Use FunctionCall.model_validate instead of __init__ to work around Pydantic alias "in" clashing with Python keyword
- Add test_parse_flog_txt_args_parsed covering string, numeric, and no-arg API calls
b58fbeb to 548d814
closes #2452
Adds support for parsing VMRay's flog.txt format — the free "Download Function Log" available from VMRay Threat Feed → Full Report → Download Function Log. Users no longer need the full analysis ZIP archive to run capa against VMRay output.
What changed
- capa/features/extractors/vmray/flog_txt.py: … sys_ prefix stripping
- capa/features/extractors/vmray/__init__.py: VMRayAnalysis.from_flog_txt() builds analysis object from standalone flog.txt (no ZIP)
- capa/features/extractors/vmray/extractor.py: VMRayExtractor.from_flog_txt(), a convenience classmethod
- capa/helpers.py: … get_format_from_extension; updated unsupported-format error message to mention flog.txt
- capa/loader.py: route flog.txt inputs through VMRayExtractor.from_flog_txt in both get_extractor and get_file_extractors
- tests/test_vmray_flog_txt.py: … sys_ stripping, VMRayAnalysis construction, VMRayExtractor construction
- doc/usage.md

Usage
Notes
… tests/test_vmray_features.py are pre-existing and unrelated: they require the large ZIP test fixture (tests/data/dynamic/vmray/...) which is not part of this repo

Checklist