Description
While running the evaluation pipeline and reviewing the agents, I noticed that some computations are repeated for every file processed, which impacts performance.
Observations
- In `atarashi/agents/tfidf.py`, the `TfidfVectorizer` is fitted on the full license corpus inside the scan flow, rather than being fitted once and reused across files.
- In `wordFrequencySimilarity`, tokenization is repeatedly applied to the same license texts within the loop.
- Some agents perform linear comparisons across all licenses for each file, which may not scale well as the dataset grows.
- Temporary files created in `CommentPreprocessor.extract` do not appear to be consistently cleaned up.
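To make the first observation concrete, here is a minimal sketch of the fit-once pattern, assuming scikit-learn's `TfidfVectorizer`; the license strings, file text, and `best_match` helper are hypothetical placeholders, not Atarashi's actual API:

```python
# Sketch: fit the vectorizer on the license corpus once, outside the
# per-file loop, so each scanned file only pays for a transform.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical license corpus (stand-in for the real license list).
licenses = [
    "MIT License permission is hereby granted",
    "GNU General Public License version 3",
    "Apache License version 2.0",
]

# Fitted exactly once, then reused for every file.
vectorizer = TfidfVectorizer()
license_matrix = vectorizer.fit_transform(licenses)

def best_match(file_text):
    # Per file, only a cheap transform + similarity lookup remains.
    vec = vectorizer.transform([file_text])
    scores = cosine_similarity(vec, license_matrix)[0]
    return scores.argmax(), scores.max()

idx, score = best_match("permission is hereby granted under the MIT License")
```

The same shape applies to the tokenization case: tokenize each license text once, keep the results, and look them up inside the loop.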
Impact
These patterns introduce additional overhead and increase scan time as the number of licenses and files grows.
In local testing, avoiding repeated TF-IDF fitting led to noticeable improvements in execution time.
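The kind of improvement seen locally can be reproduced with a rough micro-benchmark; the corpus sizes below are arbitrary and this is illustrative only, not Atarashi code:

```python
# Illustrative comparison: refitting the vectorizer for every file
# versus fitting it once and reusing it across files.
import timeit
from sklearn.feature_extraction.text import TfidfVectorizer

licenses = [f"license text number {i} with shared boilerplate terms"
            for i in range(200)]
files = ["some scanned file content with boilerplate terms"] * 20

def refit_per_file():
    for f in files:
        v = TfidfVectorizer()
        v.fit(licenses)          # repeated work on every file
        v.transform([f])

def fit_once():
    v = TfidfVectorizer()
    v.fit(licenses)              # done a single time
    for f in files:
        v.transform([f])

t_refit = timeit.timeit(refit_per_file, number=3)
t_once = timeit.timeit(fit_once, number=3)
```

The gap widens with the number of files scanned, since the refit variant does O(files) fits while the other does one.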
Question
Would it make sense to reuse precomputed representations (e.g., TF-IDF vectors or tokenized data) across scans to reduce repeated work?
Happy to explore this further if this aligns with the intended design.
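If cross-scan reuse is of interest, one possible direction is persisting the fitted vectorizer and license matrix so later scans can skip refitting entirely. A sketch using `pickle`; the cache filename and layout are assumptions, not an existing Atarashi convention:

```python
# Sketch: cache the fitted TF-IDF artifacts on disk between scans.
import os
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

CACHE = "tfidf_cache.pkl"  # hypothetical cache location

def load_or_build(licenses):
    # Reuse the cached vectorizer + matrix if present.
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as fh:
            return pickle.load(fh)
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(licenses)
    with open(CACHE, "wb") as fh:
        pickle.dump((vectorizer, matrix), fh)
    return vectorizer, matrix

licenses = ["MIT License text", "Apache License text"]
vec1, m1 = load_or_build(licenses)   # first call builds and writes the cache
vec2, m2 = load_or_build(licenses)   # later calls reuse the cached artifacts
```

A real implementation would also need cache invalidation when the license corpus changes (e.g., keyed on a hash of the license texts).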