Repeated TF-IDF recomputation causing scan slowdown #124

@DhruvrajSinhZala24

Description

While running the evaluation pipeline and reviewing the agents, I noticed that some computations are repeated for every file processed, which impacts performance.

Observations

  • In atarashi/agents/tfidf.py, the TfidfVectorizer is fitted on the full license corpus inside the scan flow, rather than being reused across files.
  • In wordFrequencySimilarity, tokenization is applied repeatedly to the same license texts inside the loop.
  • Some agents perform linear comparisons across all licenses for each file, which may not scale well as the dataset grows.
  • Temporary files created in CommentPreprocessor.extract do not appear to be consistently cleaned up.
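For the last point, one way to make cleanup unconditional is to wrap the temp-file lifetime in a `try/finally`. This is a minimal sketch, not `CommentPreprocessor.extract`'s real signature; the helper and its names are illustrative:

```python
import os
import tempfile

def run_with_temp(contents, process):
    """Illustrative helper (not atarashi's actual API): write `contents`
    to a temp file, run `process(path)`, and always delete the file."""
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(contents)
        return process(tmp_path)
    finally:
        # Cleanup runs on success *and* when process() raises, so stray
        # temp files cannot accumulate across scans.
        os.remove(tmp_path)

seen = []
result = run_with_temp("# comment text", lambda p: seen.append(p) or open(p).read())
```

`tempfile.TemporaryDirectory` or `NamedTemporaryFile` as a context manager would achieve the same guarantee with less code, if the extraction step can work inside a `with` block.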

Impact

These patterns introduce additional overhead and increase scan time as the number of licenses and files grows.

In local testing, avoiding repeated TF-IDF fitting led to noticeable improvements in execution time.
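To illustrate the kind of change tested above, here is a minimal fit-once sketch (class and method names are hypothetical, not atarashi's current structure): the vectorizer is fitted on the license corpus a single time at construction, and each scanned file only goes through `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class LicenseMatcher:
    """Illustrative sketch of fit-once TF-IDF reuse across scanned files."""

    def __init__(self, license_texts):
        self.vectorizer = TfidfVectorizer()
        # Fit on the full license corpus exactly once, at construction
        # time, instead of refitting inside the per-file scan loop.
        self.license_matrix = self.vectorizer.fit_transform(license_texts)

    def best_match(self, file_text):
        # Per file we only transform() the incoming text; the vocabulary
        # and the precomputed corpus matrix are reused unchanged.
        query = self.vectorizer.transform([file_text])
        scores = cosine_similarity(query, self.license_matrix)[0]
        return int(scores.argmax()), float(scores.max())

licenses = [
    "Permission is hereby granted, free of charge, to any person",
    "You may copy and distribute verbatim copies of the source code",
]
matcher = LicenseMatcher(licenses)
best_idx, best_score = matcher.best_match("permission is granted free of charge")
```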

Question

Would it make sense to reuse precomputed representations (e.g., TF-IDF vectors or tokenized data) across scans to reduce repeated work?
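For the tokenized-data half of that question, a lightweight option is memoizing the tokenizer so each license text is tokenized at most once per process. A sketch, assuming a simple whitespace tokenizer rather than whatever wordFrequencySimilarity actually uses:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tokenize(text):
    # Repeated calls with the same license text return the cached token
    # tuple instead of re-splitting. Returning a tuple keeps the cached
    # value immutable, so callers cannot corrupt it for later hits.
    return tuple(text.lower().split())

first = tokenize("MIT License text")
second = tokenize("MIT License text")  # cache hit, no recomputation
```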

Happy to explore this further if this aligns with the intended design.
