perf(agents): avoid per-file recomputation and clean up temp files by DhruvrajSinhZala24 · Pull Request #126 · fossology/atarashi

DhruvrajSinhZala24 · 2026-03-27T13:07:51Z

Overview

This PR eliminates repeated per-file recomputation in the Atarashi similarity agents and fixes the temporary comment-file leaks reported in Issue #124. Expensive license-corpus preprocessing is now performed once per agent instance, and all scan paths now clean up their temp files.

Key Changes

Performance (precomputation / hoisted work)
- TF-IDF (CosineSim): precomputes the license TF-IDF matrix once; per-file scan only transforms the input and computes vectorized cosine similarity.
- TF-IDF (ScoreSim): preserves baseline semantics (per-file vocabulary + L2 normalization) while avoiding per-file vectorizer refit via precomputed license-side statistics.
- WordFrequencySimilarity / DLD / N-gram: precomputes license-side tokenization/frequencies once per agent instance.
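
The hoisting pattern shared by these agents can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `PrecomputedFrequencyAgent`, its `scan_text` method, and the whitespace `tokenize` helper are hypothetical names, and plain term-frequency cosine similarity stands in for the agents' actual scoring.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class PrecomputedFrequencyAgent:
    """Illustrative agent that prepares license-side data once per instance."""

    def __init__(self, licenses):
        # licenses: mapping of license shortname -> license text.
        # Token counts and vector norms are computed once here,
        # not on every scan.
        self._license_vectors = {}
        for name, text in licenses.items():
            counts = Counter(tokenize(text))
            norm = math.sqrt(sum(c * c for c in counts.values()))
            self._license_vectors[name] = (counts, norm)

    def scan_text(self, text):
        # Per-file work: tokenize the input only, then compare it
        # against the precomputed license vectors.
        counts = Counter(tokenize(text))
        norm = math.sqrt(sum(c * c for c in counts.values()))
        results = []
        for name, (lic_counts, lic_norm) in self._license_vectors.items():
            if norm == 0 or lic_norm == 0:
                score = 0.0
            else:
                dot = sum(c * lic_counts[t] for t, c in counts.items())
                score = dot / (norm * lic_norm)
            results.append((name, score))
        # Highest similarity first.
        return sorted(results, key=lambda item: item[1], reverse=True)
```

With one instance reused across many files, the loop in `__init__` runs once, while `scan_text` only pays for tokenizing the input and the per-license comparison.
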
Resource management (temp-file cleanup)
- Standardizes AtarashiAgent.cleanup() and ensures it runs via finally in all agents so temp files created by CommentPreprocessor.extract() are removed even on exact matches / early returns / exceptions.
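
A minimal sketch of the `finally`-based cleanup pattern, under stated assumptions: `SketchAgent`, `_extract_comments`, and the `exact_match` / `fail` flags are hypothetical stand-ins used only to exercise the early-return and exception paths, and the temp file is created with `tempfile.mkstemp` rather than by `CommentPreprocessor.extract()`.

```python
import os
import tempfile


class SketchAgent:
    """Illustrative agent; not the AtarashiAgent implementation."""

    def __init__(self):
        self._temp_path = None

    def _extract_comments(self, file_path):
        # Stand-in for comment extraction: writes a temp file
        # and remembers its path so cleanup() can remove it.
        fd, temp_path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as handle:
            handle.write("extracted comments for " + file_path)
        self._temp_path = temp_path
        return temp_path

    def cleanup(self):
        # Remove the temp file if one exists; safe to call repeatedly.
        if self._temp_path and os.path.exists(self._temp_path):
            os.remove(self._temp_path)
        self._temp_path = None

    def scan(self, file_path, exact_match=False, fail=False):
        try:
            self._extract_comments(file_path)
            if exact_match:
                # Early return: the finally block below still runs.
                return ["exact"]
            if fail:
                raise ValueError("simulated scan failure")
            return ["similarity"]
        finally:
            self.cleanup()
```

Because `cleanup()` sits in `finally`, the temp file is removed on the normal path, on the early return, and when the scan raises.
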
Bug fixes
- N-gram state corruption: avoids destructive mutation of shared license data between scans.
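
One way to keep shared license data intact between scans is to treat it as read-only and build fresh per-scan results. The sketch below is only an illustration of that idea: `SharedDataScanner`, its data layout, and the overlap score are assumptions, not the NgramAgent code changed in this PR.

```python
class SharedDataScanner:
    """Illustrative scanner that never mutates its license data."""

    def __init__(self, licenses):
        # licenses: iterable of (shortname, list of n-grams).
        # Stored as immutable tuples and frozensets so a scan
        # cannot modify them by accident.
        self._licenses = tuple(
            (name, frozenset(ngrams)) for name, ngrams in licenses
        )

    def scan(self, input_ngrams):
        input_set = set(input_ngrams)
        # Build a fresh result list per scan instead of filtering
        # or annotating self._licenses in place.
        matches = []
        for name, ngrams in self._licenses:
            if not ngrams:
                continue
            overlap = len(ngrams & input_set) / len(ngrams)
            if overlap > 0:
                matches.append((name, overlap))
        return sorted(matches, key=lambda item: item[1], reverse=True)
```

Running the same scan twice on one instance then returns the same result, which is the property a destructive in-place update would break.
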

Benchmarks (local, illustrative)

20 .py files from atarashi/ (sorted), 774-license corpus (processedLicenses.csv), single agent instance:

| Agent | `origin/master` | This PR | Speedup |
| --- | --- | --- | --- |
| TF-IDF (CosineSim) | 4.22s | 1.77s | 2.39x |
| TF-IDF (ScoreSim) | 3.84s | 2.08s | 1.85x |
| WordFrequencySimilarity | 3.03s | 0.69s | 4.42x |
| DLD | 42.69s | 39.23s | 1.09x |

Note: timings vary by machine; speedup grows with larger multi-file scans since one-time initialization is amortized.
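
To see how one-time initialization is amortized, a timing harness along these lines could separate construction cost from per-file scan cost. This is a sketch, not the script used for the numbers above: `time_scans` and the `make_agent` factory are hypothetical, and it assumes the agent exposes a `scan(path)` method.

```python
import time


def time_scans(make_agent, paths):
    """Return (init_seconds, scan_seconds, results) for one agent instance."""
    start = time.perf_counter()
    agent = make_agent()  # one-time setup, e.g. license preprocessing
    init_seconds = time.perf_counter() - start

    start = time.perf_counter()
    # Reuse the same instance for every file so the setup cost is paid once.
    results = [agent.scan(path) for path in paths]
    scan_seconds = time.perf_counter() - start
    return init_seconds, scan_seconds, results
```

The average cost per file is `(init_seconds + scan_seconds) / len(paths)`, so the share contributed by initialization shrinks as the number of files grows.
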

Verification

Ran the CLI smoke suite locally (same coverage as .github/workflows/build-test.yml).
Compared results vs origin/master on representative files; top matches unchanged for TF-IDF (Cosine/Score), WordFrequency, DLD, and N-gram.

Fixes #124

- Transitioned TF-IDF, Word Frequency, Damerau-Levenshtein, and N-gram agents to a precomputed vector search model.
- Hoisted heavy initialization (fitting, tokenization) to __init__ to improve scan performance from O(N*F) to O(F).
- Implemented cleanup() mechanism in AtarashiAgent to resolve temporary file leaks in /tmp.
- Fixed a shared-state mutation bug in NgramAgent that corrupted datasets across multiple scans.
- Verified semantic parity with baseline implementation.

DhruvrajSinhZala24 added 2 commits March 27, 2026 17:37

fix: ensure temp cleanup + TF-IDF ScoreSim parity

bb40444

DhruvrajSinhZala24 wants to merge 2 commits into fossology:master from DhruvrajSinhZala24:fix/issue-124
