Date: 2026-02-04 Auditor: Antigravity
| Rank | Priority Score | Item | Category | Effort | Risk | Impact |
|---|---|---|---|---|---|---|
| 1 | 23 | Fix Search Indexing | Correctness | S | Low | High |
| 2 | 18 | Fix Non-English Keywords | Correctness | S | Low | High |
| 3 | 12 | Singleton/Reuse ContentAnalyzer | Performance | M | Med | High |
| 4 | 10 | Remove Dead Code (Search Vectorizer) | Maintainability | S | Low | Low |
| 5 | 9 | Refactor _process_pdf Nesting | Maintainability | M | Med | Med |
| 6 | 8 | Extract Encryption Validation | Robustness | S | Low | Med |
| 7 | 7 | Implement Result Streaming | Scalability | L | High | High |
| 8 | 7 | Explicit NLTK Data Check | Robustness | S | Low | Med |
| 9 | 6 | Type Strictness for PdfBatch | Maintainability | S | Low | Low |
| 10 | 5 | Switch to ProcessPoolExecutor | Performance | L | High | High |
Priority = (Impact[1-5] × Confidence[1-5]) - (Risk[1-5] + Effort[1-5])
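
As a quick check of the formula, the sketch below computes the score from four 1-5 ratings. The function name `priority_score` and the numeric ratings in the usage line are assumptions: the table only gives Low/Med/High and S/M/L labels, not the numbers behind them.

```python
def priority_score(impact: int, confidence: int, risk: int, effort: int) -> int:
    """Priority = (Impact x Confidence) - (Risk + Effort), each rated 1-5."""
    for name, value in (("impact", impact), ("confidence", confidence),
                        ("risk", risk), ("effort", effort)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return impact * confidence - (risk + effort)


# Assumed ratings: maximum impact and confidence, minimum risk and effort.
print(priority_score(impact=5, confidence=5, risk=1, effort=1))  # 23
```

Under those assumed ratings the result matches the score of 23 listed for rank 1.
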
- Description: Change `PdfSearchEngine.add_document` so that it indexes the full text. Currently, it pulls `analysis_results['text_preview']`, so only the preview is searchable.
- Files: `search.py`, `pdf_processor.py` (caller).
- Change:
  - Update the `PdfSearchEngine.add_document` signature to accept an optional `full_text` argument? No, better to fix the data passed.
  - Or better: the caller `PdfProcessor.main` calls `search_engine.add_document(..., results['analysis'], ...)`. `results['analysis']` typically doesn't contain the full text (by design, to save memory).
  - Recommendation: `PdfSearchEngine` should probably accept `full_text` explicitly at index time if available, or we need to rethink the "no full text in result" policy if search is required locally.
- Acceptance: Searching for a unique word in the middle of a PDF (past 500 chars) returns the result.
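
One way the recommendation could look is sketched below. This is not the code in `search.py`: the inverted index, the `search` method, and the `doc_id` parameter are hypothetical stand-ins, and only `add_document`, `text_preview`, and `full_text` come from the notes above.

```python
import re
from collections import defaultdict
from typing import Optional


class PdfSearchEngine:
    """Toy inverted index; stands in for the real class in search.py."""

    def __init__(self) -> None:
        self._index = defaultdict(set)

    def add_document(self, doc_id: str, analysis_results: dict,
                     full_text: Optional[str] = None) -> None:
        # Prefer the full text when the caller still has it; fall back to
        # the truncated preview so existing callers keep working.
        if full_text is not None:
            text = full_text
        else:
            text = analysis_results.get("text_preview", "")
        for token in re.findall(r"\w+", text.lower()):
            self._index[token].add(doc_id)

    def search(self, word: str) -> set:
        # Return a copy so callers cannot mutate the index.
        return set(self._index.get(word.lower(), set()))


engine = PdfSearchEngine()
body = "intro " * 200 + "zyzzyva appears far beyond the preview window"
engine.add_document("doc-1", {"text_preview": body[:500]}, full_text=body)
print(engine.search("zyzzyva"))  # {'doc-1'}
```

Because `full_text` is optional, the truncated preview is still indexed when the caller passes nothing, which keeps the memory-saving policy for `results` intact.
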
- Description: `ContentAnalyzer` hardcodes `stop_words='english'`.
- Files: `text_analysis.py`.
- Change:
  - Update `__init__` to map `self.language` to an sklearn-compatible stop-word list, or pass `stop_words=self.language` if supported (sklearn supports `'english'`, but other languages usually require an explicit list).
  - Reuse the mapping logic from `pdf_processor` (or move it to a shared `utils/languages` module).
- Acceptance: Processing a Spanish PDF filters out "de", "la", "que" from keywords.
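
A minimal sketch of such a mapping follows. The helper name `resolve_stop_words` and the short word lists are illustrative assumptions; a real mapping would load complete lists, for example from NLTK's stopwords corpus. The return value is meant to be passed as `stop_words=` to `TfidfVectorizer`, which accepts `'english'`, an explicit list, or `None`.

```python
from typing import List, Union

# Tiny illustrative lists only, not complete stop-word sets.
_STOP_WORDS = {
    "spanish": ["de", "la", "que", "el", "en", "y", "los", "del"],
    "french": ["de", "la", "le", "et", "les", "des", "que", "en"],
}


def resolve_stop_words(language: str) -> Union[str, List[str], None]:
    """Return a value usable as TfidfVectorizer(stop_words=...)."""
    key = language.lower()
    if key == "english":
        return "english"         # built-in list in scikit-learn
    return _STOP_WORDS.get(key)  # explicit list, or None if unknown


stops = resolve_stop_words("spanish")
tokens = "la casa de la que hablamos".split()
print([t for t in tokens if t not in stops])  # ['casa', 'hablamos']
```

Returning `None` for an unknown language disables stop-word filtering instead of silently applying the English list.
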
- Description: `PdfProcessor` creates a new `ContentAnalyzer` (and `TfidfVectorizer`) for every call.
- Files: `pdf_processor.py`.
- Change:
  - Create a cache/registry of analyzers by language key: `self.analyzers = {}`, then `if language not in self.analyzers: self.analyzers[language] = ContentAnalyzer(language)`.
- Acceptance: Batch processing time drops measurably, and `ContentAnalyzer` is constructed at most once per language rather than once per call.
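
The cache described above can be sketched as follows. The `ContentAnalyzer` here is a stand-in that only counts constructions, and the `_get_analyzer` method name is an assumption. Only `self.analyzers` and the membership check come from the notes.

```python
class ContentAnalyzer:
    """Stand-in for the real analyzer; counts how often it is built."""
    instances_created = 0

    def __init__(self, language: str) -> None:
        ContentAnalyzer.instances_created += 1
        self.language = language


class PdfProcessor:
    def __init__(self) -> None:
        self.analyzers = {}  # language key -> ContentAnalyzer

    def _get_analyzer(self, language: str) -> ContentAnalyzer:
        # Build the analyzer on first use, then reuse it for later calls.
        if language not in self.analyzers:
            self.analyzers[language] = ContentAnalyzer(language)
        return self.analyzers[language]


processor = PdfProcessor()
for lang in ["english", "spanish", "english", "english"]:
    processor._get_analyzer(lang)
print(ContentAnalyzer.instances_created)  # 2
```

If analyzers are requested from several worker threads at once, the check-then-insert step may need a lock. The sketch leaves that out.
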
- Description: Remove `self.vectorizer` from `PdfSearchEngine`.
- Files: `search.py`.
- Change: Delete lines 16-20 and the corresponding import.
- Acceptance: Code is cleaner, imports fewer libs.
- Description: Move valid pure functions out of the method scope.
- Files: `pdf_processor.py`.
- Change: Move the inner logic of `process_in_thread` to `_extract_pdf_data(content: bytes) -> tuple`.
- Acceptance: `_process_pdf` is readable and calls a helper. Unit tests can test extraction without an async loop.
- API Contract: The `process_url` signature must stay identical.
- Output Format: The JSON structure of `results` must remain stable for consumers.
- Dependencies: Do not add new heavy libs (e.g., spaCy) to replace NLTK unless approved.