perf(agents): avoid per-file recomputation and clean up temp files #126

Open
DhruvrajSinhZala24 wants to merge 2 commits into fossology:master from DhruvrajSinhZala24:fix/issue-124
@DhruvrajSinhZala24
Overview

This PR resolves repeated recomputation in the Atarashi similarity agents and fixes temporary comment-file leaks noted in Issue #124. Expensive license-corpus preprocessing is performed once per agent instance, and scan paths now consistently clean up temp files.

Key Changes

  1. Performance (precomputation / hoisted work)

    • TF-IDF (CosineSim): precomputes the license TF-IDF matrix once; per-file scan only transforms the input and computes vectorized cosine similarity.
    • TF-IDF (ScoreSim): preserves baseline semantics (per-file vocabulary + L2 normalization) while avoiding per-file vectorizer refit via precomputed license-side statistics.
    • WordFrequencySimilarity / DLD / N-gram: precomputes license-side tokenization/frequencies once per agent instance.
  2. Resource management (temp-file cleanup)

    • Standardizes AtarashiAgent.cleanup() and ensures it runs via finally in all agents so temp files created by CommentPreprocessor.extract() are removed even on exact matches / early returns / exceptions.
  3. Bug fixes

    • N-gram state corruption: avoids destructive mutation of shared license data between scans.
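
The two patterns above (license-side work hoisted into the constructor, and cleanup guaranteed by `finally`) can be sketched as follows. This is a minimal illustration, not the code in this PR: `WordFrequencySketch`, `_extract()`, and the token-overlap score are hypothetical stand-ins, and the real `CommentPreprocessor.extract()` and `AtarashiAgent.cleanup()` may behave differently.

```python
import os
import tempfile
from collections import Counter


class WordFrequencySketch:
    """Hypothetical agent: license-side work happens once in __init__."""

    def __init__(self, licenses):
        # licenses: dict mapping shortname -> license text (assumed shape).
        # Tokenize and count every license once, not once per scanned file.
        self.license_counts = {
            name: Counter(text.lower().split()) for name, text in licenses.items()
        }
        self._temp_files = []

    def _extract(self, text):
        # Stand-in for CommentPreprocessor.extract(): writes a temp file
        # and returns its path.
        fd, path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        self._temp_files.append(path)
        return path

    def cleanup(self):
        # Remove every temp file created so far; tolerate missing files.
        while self._temp_files:
            path = self._temp_files.pop()
            try:
                os.remove(path)
            except FileNotFoundError:
                pass

    def scan(self, text):
        try:
            path = self._extract(text)
            with open(path) as handle:
                tokens = Counter(handle.read().lower().split())
            if not tokens:
                return None  # early return still reaches the finally block
            # Score = number of token occurrences shared with each license.
            scores = {
                name: sum((tokens & counts).values())
                for name, counts in self.license_counts.items()
            }
            return max(scores, key=scores.get)
        finally:
            self.cleanup()
```

Because `cleanup()` sits in `finally`, the temp file is removed on the normal path, on the early return, and when an exception propagates out of `scan()`.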

Benchmarks (local, illustrative)

20 .py files from atarashi/ (sorted), 774-license corpus (processedLicenses.csv), single agent instance:

| Agent | origin/master | This PR | Speedup |
| --- | --- | --- | --- |
| TF-IDF (CosineSim) | 4.22s | 1.77s | 2.39x |
| TF-IDF (ScoreSim) | 3.84s | 2.08s | 1.85x |
| WordFrequencySimilarity | 3.03s | 0.69s | 4.42x |
| DLD | 42.69s | 39.23s | 1.09x |

Note: timings vary by machine; speedup grows with larger multi-file scans since one-time initialization is amortized.
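
The amortization effect can be illustrated with a toy cost model. It assumes, purely for illustration, that each file costs a fixed `scan_cost` and that license-side preprocessing costs a fixed `init_cost`; both parameters are hypothetical and are not derived from the table above.

```python
def modeled_speedup(files, init_cost, scan_cost):
    """Speedup of 'initialize once' over 'reinitialize per file' (toy model)."""
    before = files * (init_cost + scan_cost)  # preprocessing repeated per file
    after = init_cost + files * scan_cost     # preprocessing paid once
    return before / after
```

Under this model the speedup is 1 for a single file and approaches `1 + init_cost / scan_cost` as the file count grows, so an agent whose per-file comparison dominates its preprocessing would be expected to gain little.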

Verification

  • Ran the CLI smoke suite locally (same coverage as .github/workflows/build-test.yml).
  • Compared results vs origin/master on representative files; top matches unchanged for TF-IDF (Cosine/Score), WordFrequency, DLD, and N-gram.
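
A parity check of this kind could look roughly like the sketch below. The helper name and the two scan callables are hypothetical; it only assumes each callable takes a file path and returns that file's top match.

```python
def top_match_mismatches(files, baseline_scan, candidate_scan):
    """Return (path, expected, actual) for files whose top match differs."""
    mismatches = []
    for path in files:
        expected = baseline_scan(path)
        actual = candidate_scan(path)
        if expected != actual:
            mismatches.append((path, expected, actual))
    return mismatches
```

An empty result means the top matches agree on every file checked; it says nothing about files outside the sample.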

Fixes #124

- Transitioned TF-IDF, Word Frequency, Damerau-Levenshtein, and N-gram agents to a precomputed vector search model.
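- Hoisted heavy initialization (fitting, tokenization) to __init__ so license-side preprocessing runs once per agent instance instead of once per scanned file (O(N) instead of O(N*F) preprocessing passes for N licenses and F files).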
- Implemented cleanup() mechanism in AtarashiAgent to resolve temporary file leaks in /tmp.
- Fixed a shared-state mutation bug in NgramAgent that corrupted datasets across multiple scans.
- Verified semantic parity with baseline implementation.
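
The class of bug behind the shared-state fix can be illustrated with a small sketch. The functions and the `"ngrams"` list layout are hypothetical and are not taken from NgramAgent; they only contrast a scan that consumes shared data with one that reads it.

```python
def scan_destructive(shared_licenses, needle):
    # Anti-pattern: pops n-grams off the shared corpus, so the next scan
    # sees an emptied dataset.
    hits = 0
    for entry in shared_licenses:
        while entry["ngrams"]:
            if entry["ngrams"].pop() == needle:
                hits += 1
    return hits


def scan_non_destructive(shared_licenses, needle):
    # Read-only pass: the shared corpus is identical before and after.
    return sum(entry["ngrams"].count(needle) for entry in shared_licenses)
```

Calling the destructive version twice on the same corpus returns a correct count the first time and zero afterwards, which is the kind of cross-scan corruption a long-lived agent instance has to avoid.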
Development

Successfully merging this pull request may close these issues.

Repeated TF-IDF recomputation causing scan slowdown