loader: fix recycled TID overwriting matched calls in dynamic layout #2882

Draft
devs6186 wants to merge 1 commit into mandiant:master from devs6186:fix/2619-recycled-tid-dynamic-layout

Conversation

@devs6186
Contributor

closes #2619

Root cause

When an OS recycles a thread ID (TID) within the same process, compute_dynamic_layout() iterated the same ThreadAddress twice. On the second encounter it executed:

calls_by_thread[t.address] = []   # overwrites the first instance's calls

The matched calls from the first thread instance were still present in matched_calls (added by the rule engine), but had been erased from the layout, causing _get_call_name() to raise:

ValueError: ('name not found for call', ...)

Additionally, if both the first and the second instance had matched calls, t.address was appended to threads_by_process[p.address] twice, creating a duplicate thread entry in the layout.
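
The overwrite can be reproduced with a simplified model of the loop. This is only a sketch: the tuples, call names, and variable layout below are hypothetical stand-ins, not capa's real address classes or the actual body of compute_dynamic_layout().

```python
from collections import defaultdict

# Hypothetical stand-ins: plain tuples instead of capa's address classes.
process = ("proc", 1000)
thread = ("thread", 1000, 42)  # the same TID is reported twice below

# Each entry is one thread instance and the calls it made.
thread_instances = [
    (thread, ["call-0", "call-1"]),  # first lifetime of TID 42
    (thread, ["call-2"]),            # TID 42 recycled by the OS
]
matched_calls = {"call-0"}  # matched by the rule engine in the first lifetime

calls_by_thread = {}
threads_by_process = defaultdict(list)
matched_threads = set()

for t, calls in thread_instances:
    calls_by_thread[t] = []  # bug: resets the list on the second encounter
    for c in calls:
        if c in matched_calls:
            calls_by_thread[t].append(c)
    if calls_by_thread[t]:
        matched_threads.add(t)
        threads_by_process[process].append(t)

# Matched calls that no longer appear anywhere in the layout.
laid_out = {c for cs in calls_by_thread.values() for c in cs}
missing = matched_calls - laid_out
print(missing)
```

In this model the thread is still registered in threads_by_process, but its call list is empty, which is the inconsistent state that later fails name lookup.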

Fix

Two targeted changes in compute_dynamic_layout() (capa/loader.py):

  1. Replace the overwriting assignment with setdefault() so calls from all instances of a recycled TID accumulate under the same key:

    # before
    calls_by_thread[t.address] = []
    # after
    calls_by_thread.setdefault(t.address, [])
  2. Guard the thread-registration block with a membership check so the thread address is only added to matched_threads / threads_by_process once, regardless of how many times the TID is recycled:

    # before
    if calls_by_thread[t.address]:
        matched_threads.add(t.address)
        threads_by_process[p.address].append(t.address)
    # after
    if calls_by_thread[t.address] and t.address not in matched_threads:
        matched_threads.add(t.address)
        threads_by_process[p.address].append(t.address)

This is the source-level fix requested by the maintainer: the data is now consistent before it reaches the rendering stage, so no exception handling in the renderer is needed.
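
For illustration, both changes can be applied to a simplified model of the loop. The function name layout(), the tuple addresses, and the call names are assumptions made for this sketch; only the setdefault() call and the membership check mirror the actual patch.

```python
from collections import defaultdict


def layout(process, thread_instances, matched_calls):
    """Simplified model of the patched loop; not capa's actual function."""
    calls_by_thread = {}
    threads_by_process = defaultdict(list)
    matched_threads = set()

    for t, calls in thread_instances:
        # setdefault() keeps the existing list when the address was seen before
        calls_by_thread.setdefault(t, [])
        for c in calls:
            if c in matched_calls:
                calls_by_thread[t].append(c)
        # register the thread once, even if several instances have matches
        if calls_by_thread[t] and t not in matched_threads:
            matched_threads.add(t)
            threads_by_process[process].append(t)

    return calls_by_thread, threads_by_process


process = ("proc", 1000)
thread = ("thread", 1000, 42)
instances = [(thread, ["call-0", "call-1"]), (thread, ["call-2"])]

# Both lifetimes of the recycled TID contain a matched call.
calls_by_thread, threads_by_process = layout(process, instances, {"call-0", "call-2"})
print(calls_by_thread[thread])
print(threads_by_process[process])
```

Note that the calls from both lifetimes end up in one list under the shared key, so this approach keeps the data consistent but does not distinguish the two thread instances.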

Test

tests/test_loader.py::test_compute_dynamic_layout_recycled_tid constructs a minimal mock DynamicFeatureExtractor that yields the same ThreadAddress twice (simulating TID recycling), with a matched call only in the first instance. It asserts that compute_dynamic_layout() returns a layout containing exactly one thread entry with the matched call intact.
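
The shape of such a test might look roughly like the following. This is not the test from the PR: FakeThread, FakeExtractor, and layout_threads() are hypothetical stand-ins so the example runs without capa installed, and the real test targets compute_dynamic_layout() directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeThread:
    """Hypothetical stand-in for a thread handle with a hashable address."""
    address: tuple
    calls: tuple


class FakeExtractor:
    """Hypothetical mock that yields the same thread address twice."""

    def get_threads(self):
        addr = ("pid", 1000, "tid", 42)
        yield FakeThread(addr, calls=("call-0",))  # first lifetime, matched call
        yield FakeThread(addr, calls=("call-1",))  # recycled TID, no match


def layout_threads(extractor, matched_calls):
    """Simplified stand-in for the patched loop, not capa's real function."""
    calls_by_thread, threads, seen = {}, [], set()
    for t in extractor.get_threads():
        calls_by_thread.setdefault(t.address, [])
        calls_by_thread[t.address].extend(c for c in t.calls if c in matched_calls)
        if calls_by_thread[t.address] and t.address not in seen:
            seen.add(t.address)
            threads.append(t.address)
    return calls_by_thread, threads


def test_recycled_tid_keeps_matched_call():
    calls_by_thread, threads = layout_threads(FakeExtractor(), {"call-0"})
    assert len(threads) == 1                          # exactly one thread entry
    assert calls_by_thread[threads[0]] == ["call-0"]  # matched call intact


test_recycled_tid_keeps_matched_call()
```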

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed
  • This submission includes AI-generated code and I have provided details in the description.

@gemini-code-assist
Contributor

Summary of Changes

Hello @devs6186, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the dynamic analysis loader where the recycling of thread IDs by the operating system could lead to data loss and runtime errors. Specifically, it prevents the overwriting of matched calls associated with a recycled TID and ensures that such TIDs are not registered multiple times in the layout. These changes enhance the robustness and accuracy of dynamic analysis results, particularly in environments where TIDs are frequently reused. Additionally, the pull request includes a new test to cover this specific scenario and updates the documentation to provide a clearer overview of capa's output consumption methods.

Highlights

  • Dynamic Layout Bug Fix: Resolved an issue where recycled Thread IDs (TIDs) could overwrite previously matched calls in the dynamic layout, leading to ValueError during rendering. The fix ensures that calls from all instances of a recycled TID are accumulated correctly.
  • Duplicate Thread Entry Prevention: Implemented a check to prevent duplicate thread entries in the dynamic layout when an OS recycles a TID. This ensures each unique thread address is registered only once, maintaining data consistency.
  • New Test Case: Added a dedicated test case (test_compute_dynamic_layout_recycled_tid) to validate the fix for recycled TIDs, simulating the scenario and asserting correct behavior.
  • Documentation Update: Updated the usage documentation to include a table comparing various methods for consuming capa output, such as CLI, IDA Pro, Ghidra, Binary Ninja, dynamic sandboxes, and the web UI.

Changelog
  • CHANGELOG.md
    • Added bug fix entry for recycled TID issue.
    • Added documentation entry for capa output consumption table.
  • capa/loader.py
    • Fixed recycled TID overwriting matched calls in dynamic layout.
    • Prevented duplicate thread entries for recycled TIDs.
    • Reformatted several function calls and import statements for improved readability.
  • doc/usage.md
    • Added a table outlining different ways to consume capa output.
  • tests/test_loader.py
    • Added a new test case to verify the fix for recycled TIDs in dynamic layout.
Activity
  • No human activity has occurred on this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug where recycled thread IDs in dynamic analysis could lead to overwritten data and subsequent errors. The fix correctly uses setdefault to accumulate calls from all instances of a recycled TID and adds a check to prevent duplicate thread registration. A comprehensive test case has been added to validate the fix. The overall implementation is correct and of high quality. The PR also includes significant code formatting changes, which improve readability. For future PRs, it would be beneficial to separate such large-scale formatting changes from functional bug fixes to make reviewing easier. The documentation updates are also a welcome addition.

When an OS recycles a thread ID (TID) within the same process,
compute_dynamic_layout() was resetting calls_by_thread[t.address] to
an empty list on the second encounter of that ThreadAddress.  This
erased the matched calls accumulated from the first thread instance.

Those calls were still present in matched_calls (added by the rule
engine), so the renderer could not locate them in the layout and
raised ValueError("name not found for call").

Fix the overwrite by using setdefault() instead of direct assignment,
and guard the threads_by_process.append() with a membership check so
a recycled TID does not produce a duplicate thread entry in the layout.

A dedicated unit test covering the recycled-TID scenario is added in
tests/test_loader.py.

Fixes mandiant#2619

@devs6186 devs6186 force-pushed the fix/2619-recycled-tid-dynamic-layout branch from 72dbf97 to 29bb083 (February 24, 2026 20:24)
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Hi @devs6186, thanks for looking into the data overwriting issue. While this fix prevents matched calls from being lost when a TID is recycled, it doesn't quite address the requirement for unique tracking of distinct thread and process lifecycles. By accumulating all calls under the same ThreadAddress, the layout still fuses separate executions into a single entry, which doesn't solve the core problem described in #2361.

Because we want to avoid merging partial solutions for this component, we'd like to see a more comprehensive fix that addresses the uniqueness and de-duplication problem entirely for both thread and process lifecycles (for example, by incorporating sequence IDs or timestamps into the Address classes).

If you're interested in fully addressing these requirements, we'd love to see those updates in this PR. Alternatively, if you don't have the time or interest to take on the broader scope right now, let us know and we can close this for now.
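
As a rough sketch of the sequence-ID idea mentioned above: ThreadKey and LifetimeTracker are hypothetical names invented for this example and are not part of capa's Address classes. The point is only that a per-lifetime counter makes two uses of the same TID hash to different keys.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ThreadKey:
    """Hypothetical key: (pid, tid) plus a per-lifetime sequence number."""
    pid: int
    tid: int
    seq: int


class LifetimeTracker:
    """Hypothetical helper that hands out a new sequence number per lifetime."""

    def __init__(self):
        self._next = {}  # (pid, tid) -> itertools.count iterator

    def new_lifetime(self, pid, tid):
        counter = self._next.setdefault((pid, tid), count())
        return ThreadKey(pid, tid, next(counter))


tracker = LifetimeTracker()
first = tracker.new_lifetime(1000, 42)
second = tracker.new_lifetime(1000, 42)  # same TID, recycled by the OS

# Each lifetime keeps its own call list instead of sharing one entry.
calls_by_thread = {first: ["call-0"], second: ["call-2"]}
print(first, second)
```

How a real extractor would detect the start of a new lifetime (for example, from thread creation events or timestamps in the sandbox report) is left open here.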

Collaborator

This file contains many format-based changes. Please revert these if they are not necessary to pass linting checks.

@devs6186
Contributor Author

I appreciate your review. I will look into it more broadly; I want to solve the problem completely instead of just submitting a PR. Give me a day or two at most and I'll come back with a better solution. Thank you!

Development

Successfully merging this pull request may close these issues.

dynamic: render: ValueError "name not found for call"

2 participants