Repeated TF-IDF recomputation causing scan slowdown #124

@DhruvrajSinhZala24

Description

While running the evaluation pipeline and reviewing the agents, I noticed that some computations are repeated for every file processed, which impacts performance.

Observations

  • In atarashi/agents/tfidf.py, the TfidfVectorizer is fitted on the full license corpus inside the scan flow, rather than being reused across files.
  • In wordFrequencySimilarity, tokenization is applied repeatedly to the same license texts inside the loop.
  • Some agents perform linear comparisons across all licenses for each file, which may not scale well as the dataset grows.
  • Temporary files created in CommentPreprocessor.extract do not appear to be consistently cleaned up.
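For the last point, one way to make cleanup unconditional is to wrap the temp-file lifetime in a `try/finally`. This is a minimal sketch, not `CommentPreprocessor.extract`'s real signature; the helper and its names are illustrative:

```python
import os
import tempfile

def run_with_temp(contents, process):
    """Illustrative helper (not atarashi's actual API): write `contents`
    to a temp file, run `process(path)`, and always delete the file."""
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(contents)
        return process(tmp_path)
    finally:
        # Cleanup runs on success *and* when process() raises, so stray
        # temp files cannot accumulate across scans.
        os.remove(tmp_path)

seen = []
result = run_with_temp("# comment text", lambda p: seen.append(p) or open(p).read())
```

`tempfile.TemporaryDirectory` or `NamedTemporaryFile` as a context manager would achieve the same guarantee with less code, if the extraction step can work inside a `with` block.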

Impact

These patterns introduce additional overhead and increase scan time as the number of licenses and files grows.

In local testing, avoiding repeated TF-IDF fitting led to noticeable improvements in execution time.
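To illustrate the kind of change tested above, here is a minimal fit-once sketch (class and method names are hypothetical, not atarashi's current structure): the vectorizer is fitted on the license corpus a single time at construction, and each scanned file only goes through `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class LicenseMatcher:
    """Illustrative sketch of fit-once TF-IDF reuse across scanned files."""

    def __init__(self, license_texts):
        self.vectorizer = TfidfVectorizer()
        # Fit on the full license corpus exactly once, at construction
        # time, instead of refitting inside the per-file scan loop.
        self.license_matrix = self.vectorizer.fit_transform(license_texts)

    def best_match(self, file_text):
        # Per file we only transform() the incoming text; the vocabulary
        # and the precomputed corpus matrix are reused unchanged.
        query = self.vectorizer.transform([file_text])
        scores = cosine_similarity(query, self.license_matrix)[0]
        return int(scores.argmax()), float(scores.max())

licenses = [
    "Permission is hereby granted, free of charge, to any person",
    "You may copy and distribute verbatim copies of the source code",
]
matcher = LicenseMatcher(licenses)
best_idx, best_score = matcher.best_match("permission is granted free of charge")
```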

Question

Would it make sense to reuse precomputed representations (e.g., TF-IDF vectors or tokenized data) across scans to reduce repeated work?
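For the tokenized-data half of that question, a lightweight option is memoizing the tokenizer so each license text is tokenized at most once per process. A sketch, assuming a simple whitespace tokenizer rather than whatever wordFrequencySimilarity actually uses:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tokenize(text):
    # Repeated calls with the same license text return the cached token
    # tuple instead of re-splitting. Returning a tuple keeps the cached
    # value immutable, so callers cannot corrupt it for later hits.
    return tuple(text.lower().split())

first = tokenize("MIT License text")
second = tokenize("MIT License text")  # cache hit, no recomputation
```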

Happy to explore this further if this aligns with the intended design.
