fixtures: add dynamic sample discovery and reverse MD5 lookup#2831
kami922 wants to merge 5 commits into mandiant:master
Summary of Changes
Hello @kami922, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant improvement to how test fixtures are managed by transitioning from a brittle, manually updated system to a dynamic, automated discovery and indexing mechanism. By building a comprehensive cache of test samples at module import, it enables efficient lookups by various identifiers, including a new reverse MD5-to-name lookup. This enhancement not only streamlines the process of adding new test data but also boosts the reliability and scalability of the testing infrastructure.
Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed
Code Review
This pull request introduces dynamic sample discovery and a reverse MD5 lookup function, significantly improving the maintainability and scalability of test fixtures. The new _build_sample_cache function efficiently indexes samples by various attributes, and get_sample_short_name_by_md5 provides a convenient way to retrieve sample names from their MD5 hashes. Comprehensive tests have been added to validate the new functionality, including forward, reverse, and round-trip lookups. Overall, this is a well-implemented solution that addresses the maintenance burden of hardcoded lists.
CHANGELOG updated or no update needed, thanks! 😄
@mike-hunhoff @mr-tz Hello, I am working on the failing CI tests. In the meantime, can you please tell me whether running the CI tests locally via "act" is a good approach, so I can see the failing tests and fix them before committing and pushing? I am not currently doing this. If it is, could you also share a few guidelines and best practices for this project?
using "act" is fine. sometimes i do this to avoid waiting for the round trip of committing, pushing, and waiting for GH Actions to spawn and complete. i'm not quite sure what hints would help. i'd suggest running the linters, formatters, and tests before submitting a PR. what else?
OK, a lot of tests are failing currently; I will fix them. One thing I'm unsure about with running act locally is that it might take up a lot of disk space on my laptop, even though the Docker image is already pulled. As far as I can tell, the currently failing tests might be OS-related. Anyway, I will experiment with it, fix the tests, and commit again.
d9b662c to 98227bf
833644f to d6481ef
Hello @mike-hunhoff @williballenthin, the CI is showing one failing test (test_fix262 in tests/test_main.py), but this failure is unrelated to the fixture discovery changes in this PR. Here's what's happening: the failure is caused by upstream rules changes (capa-rules#1099) that were merged after I created this branch. The "send HTTP request" rule was made stricter to fix false positives, but this broke the test expectations.
I've verified this locally: the test fails with the same error after rebasing onto the latest master with updated submodules.

@mike-hunhoff @mr-tz Hello, can I please get an update on this one? Also, please tell me what I can do about the failing tests.
401f7ad to d37a178
@kami922 the tests are passing now. Please address any open feedback and we'll give this another review.
8e1856b to f3869c8
Hello, I believe everything is done on my side; awaiting your review.
@kami922 it looks like there are a couple of feedback comments listed above that need to be addressed. Please explicitly respond/resolve these comments if/when they are completed. Generally, this is a good habit for you to build for future code reviews. It makes the process smoother for you and the reviewers.
87cbe02 to 8e5e7e3
@mike-hunhoff Hello, the 1-2 changes that were skipped are now done, and I have responded to the comments with appropriate messages.

@mike-hunhoff Hello, any update on this one? I have updated this branch to merge with the main one. Can you please run the tests?
0bf16dd to 59083ca
@kami922 CI is failing. Please ensure all tests are passing locally before you request another review, thank you!
Walk the tests/data directory at module import time to build a cache indexed by MD5, SHA256, filename, and stem. Add get_sample_short_name_by_md5() for reverse lookups, eliminating the need for hardcoded hash mappings. Closes mandiant#1743
Address maintainer feedback by adding return type annotation.
- Fix expected values in test_reverse_lookup: al-khaser files use underscores on disk (al-khaser_x86, al-khaser_x64), not spaces
- Replace broad isinstance/len assertions with exact equality checks
- Restrict roundtrip test to samples that genuinely round-trip: mimikatz, kernel32, kernel32-64 (pma* and al-khaser names differ from their on-disk stems so they cannot round-trip exactly)
…cache
When two files share the same content (e.g. mimikatz.exe_ and its hash-named copy 5f66b82558ca92e54e77f216ef4c066c.exe_), Linux ext4 rglob order is hash-based rather than alphabetical, so the hash-named entry was overwriting the friendly-named entry in the cache, causing test_reverse_lookup to see '5f66b825...' instead of 'mimikatz'.
Fix: only overwrite an existing cache entry when the current entry's stem looks like a hex hash (32 or 64 chars). This makes the result deterministic regardless of OS filesystem ordering.
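The cache-building and tie-breaking behaviour described in these commit messages can be sketched roughly as follows. This is a minimal illustration and not the code in tests/fixtures.py: the names build_sample_cache, looks_like_hash, and short_name_by_md5 are hypothetical, and the sketch assumes that a "short name" is simply the file's stem.

```python
import hashlib
import string
import tempfile
from pathlib import Path


def looks_like_hash(stem: str) -> bool:
    """Return True when a stem is 32 or 64 hex characters (MD5- or SHA256-like)."""
    return len(stem) in (32, 64) and all(c in string.hexdigits for c in stem)


def build_sample_cache(root: Path) -> dict[str, Path]:
    """Index every file under root by MD5, SHA256, filename, and stem.

    Hypothetical helper: it only mirrors the behaviour the commit messages describe.
    """
    cache: dict[str, Path] = {}
    for path in root.rglob("*"):  # iteration order depends on the filesystem
        if not path.is_file():
            continue
        data = path.read_bytes()
        keys = (
            hashlib.md5(data).hexdigest(),
            hashlib.sha256(data).hexdigest(),
            path.name,
            path.stem,
        )
        for key in keys:
            existing = cache.get(key)
            # Store the entry if the key is new, or replace a hash-named entry.
            # A friendly-named entry already in the cache is never overwritten.
            if existing is None or looks_like_hash(existing.stem):
                cache[key] = path
    return cache


def short_name_by_md5(cache: dict[str, Path], md5: str) -> str:
    """Reverse lookup: MD5 hex digest to the stem of the cached file."""
    return cache[md5.lower()].stem


if __name__ == "__main__":
    # Two files with identical content: one friendly-named, one hash-named.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = b"placeholder bytes, not a real sample"
        md5 = hashlib.md5(data).hexdigest()
        (root / "mimikatz.exe_").write_bytes(data)
        (root / f"{md5}.exe_").write_bytes(data)
        print(short_name_by_md5(build_sample_cache(root), md5))
```

With this rule the friendly name wins whichever file rglob yields first. If two friendly-named files shared the same content, the first one seen would still win, so sorting the rglob results would be one way to make that case deterministic as well.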
59083ca to 452609f
@mike-hunhoff Local tests: 992 passed, 179 xfailed, 566 skipped, 3 failed. The 2 failures in test_segfault_fix.py are from a local debug file (not part of the repo). test_fix262 fails due to rules PR #1099 (8caf489d) changing string: /HTTP/i to a stricter HTTP request regex; this failure is pre-existing on upstream/master and unrelated to this PR. See the screenshot.
The skipped tests relate to PyGhidra, IDA Pro, and Binary Ninja, as I don't have them installed locally.
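To illustrate why tightening a string pattern can break an existing test expectation, here is a small sketch. The loose pattern is the /HTTP/i quoted above; the stricter pattern is invented purely for illustration and is not the regex from rules PR #1099, which is not shown in this thread. The sample strings are also made up.

```python
import re

# Loose pattern quoted in the comment above: any case-insensitive "HTTP".
loose = re.compile(r"HTTP", re.IGNORECASE)

# HYPOTHETICAL stricter pattern (an assumption, not the real rule):
# only matches something shaped like an HTTP request line.
strict = re.compile(r"^(GET|POST|HEAD|PUT|DELETE) \S+ HTTP/\d\.\d")

# Made-up strings of the kind that might appear in a binary.
samples = [
    "http://example.com/update",
    "User-Agent: HTTPClient",
    "GET /index.html HTTP/1.1",
]


def matches(pattern: re.Pattern, strings: list[str]) -> list[str]:
    """Return the strings that the pattern matches anywhere."""
    return [s for s in strings if pattern.search(s)]


loose_hits = matches(loose, samples)
strict_hits = matches(strict, samples)
```

Here the loose pattern matches all three strings while the hypothetical strict one matches only the request line, so a test that expected a match on a sample containing only URL-like strings would start failing once the rule is tightened.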
@mike-hunhoff can you please review?


Summary
Implements dynamic file discovery for test fixtures, addressing issue #1743.
This is a cleaned-up resubmission of #2802, incorporating maintainer feedback.
Background
Issue #1743 requested adding get_sample_short_name_by_md5() as the inverse of the existing get_sample_md5_by_name() function. The previous PR #2802 was closed with feedback to follow the original issue's vision of using dynamic discovery instead of maintaining hardcoded lists.
Problem
The codebase maintains hardcoded lists of file names and MD5 hashes in get_sample_md5_by_name(), requiring manual updates whenever test samples are added. This creates maintenance burden and potential for errors.
Solution
This PR implements dynamic sample discovery as requested by @williballenthin in #1743:
- _build_sample_cache() - Walks tests/data/ directory once at module import time
- get_sample_short_name_by_md5() - Reverse MD5→name lookup using the dynamic cache
Benefits
Implementation Details
Follows the approach suggested by @williballenthin in #1743:
Implementation is similar to collect_samples() in scripts/lint.py, as referenced in the issue discussion.
Changes
- tests/fixtures.py: Add cache building and reverse lookup function (+90 lines)
- tests/test_fixtures.py: Add parametrized tests for lookups and roundtrips (+126 lines)
Testing