Summary
The document processing jobs (Documents::AnalyzePdfJob and Documents::OcrJob) have no test coverage. These jobs handle PDF text extraction and OCR — the foundation for all downstream AI analysis.
Jobs Needing Tests
Documents::AnalyzePdfJob (app/jobs/documents/analyze_pdf_job.rb)
- Extracts text from PDF documents using pdftotext
- Performs page-level analysis (creates Extraction rows)
- Classifies text_quality (good, poor, no_text)
- Triggers OcrJob for scanned/image-based documents
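
To make the classification step concrete for testing, here is a minimal sketch of how text_quality might be derived from per-page text. The method names, the 100-characters-per-page cutoff, and the rule that anything other than good triggers OCR are all assumptions for illustration; the real job's thresholds are not stated in this issue.

```ruby
# Hypothetical threshold: the real job's cutoff is not stated in this issue.
MIN_GOOD_CHARS_PER_PAGE = 100

# Classifies extracted text as "good", "poor", or "no_text" from per-page text.
def classify_text_quality(page_texts, min_good: MIN_GOOD_CHARS_PER_PAGE)
  total_chars = page_texts.sum { |text| text.strip.length }
  return "no_text" if total_chars.zero?

  avg_chars_per_page = total_chars.to_f / page_texts.length
  avg_chars_per_page >= min_good ? "good" : "poor"
end

# Assumption: OCR is enqueued for anything other than "good".
def needs_ocr?(quality)
  quality != "good"
end
```

Keeping the classification in a pure function like this would let the boundary cases (empty pages, whitespace-only pages, values right at the threshold) be tested without any PDF fixture.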
Test scenarios:
- Text-based PDF → extracts text, sets text_quality: good, creates Extraction rows per page
- Image-based/scanned PDF → detects low text quality, enqueues OcrJob
- Already-processed document (idempotent re-run) → clears and rebuilds extractions
- Handles corrupt/unreadable PDFs gracefully (doesn't crash)
- Updates MeetingDocument fields: extracted_text, text_chars, avg_chars_per_page, page_count
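
The field updates and the idempotent re-run can be illustrated without Rails. The sketch below uses a plain Struct as a hypothetical stand-in for MeetingDocument and hashes in place of Extraction rows; how the real job joins pages and counts characters is an assumption here.

```ruby
# Plain-Ruby stand-in (hypothetical) for MeetingDocument and its Extraction rows.
FakeDocument = Struct.new(
  :extracted_text, :text_chars, :avg_chars_per_page, :page_count, :extractions,
  keyword_init: true
)

# Rebuilds extractions and summary fields from per-page text.
# Assigning (rather than appending) is what makes a second run idempotent.
def apply_analysis(doc, page_texts)
  doc.extractions = page_texts.each_with_index.map do |text, index|
    { page_number: index + 1, text: text }
  end
  doc.extracted_text = page_texts.join("\n") # assumption: pages joined by newline
  doc.page_count = page_texts.length
  doc.text_chars = page_texts.sum(&:length)  # assumption: separators not counted
  doc.avg_chars_per_page =
    doc.page_count.zero? ? 0.0 : doc.text_chars.to_f / doc.page_count
  doc
end
```

A test for the re-run scenario would call the job twice on the same document and assert that the number of Extraction rows equals page_count rather than double it.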
Documents::OcrJob (app/jobs/documents/ocr_job.rb)
- Runs Tesseract OCR on image-based PDFs
- Updates MeetingDocument.ocr_status and extracted text
Test scenarios:
- Scanned PDF → OCR produces text, updates extracted_text and ocr_status
- PDF with mixed text/image pages → handles correctly
- Tesseract unavailable → graceful failure with appropriate status
- Idempotent re-run
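
For the "Tesseract unavailable" scenario, one possible shape is sketched below. It assumes the job shells out through Open3 and runs tesseract on a rasterized page image; the method name, the runner injection point, and the "completed"/"failed" status strings are hypothetical, not taken from the existing job.

```ruby
require "open3"

# Runs tesseract on one page image and returns a status plus any text.
# `runner` is an injection point (an assumption of this sketch) so tests
# can avoid calling the real binary.
def ocr_page(image_path, runner: Open3.method(:capture3))
  # "stdout" as the output base tells tesseract to print the text.
  stdout, stderr, status = runner.call("tesseract", image_path, "stdout")
  return { ocr_status: "failed", text: nil, error: stderr.strip } unless status.success?

  { ocr_status: "completed", text: stdout }
rescue Errno::ENOENT
  # Raised when the tesseract executable cannot be found on PATH.
  { ocr_status: "failed", text: nil, error: "tesseract not installed" }
end
```

In a test, passing a lambda that raises Errno::ENOENT as the runner simulates a missing binary, so the failure path can be asserted on machines that do have Tesseract installed.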
Approach
- Create small test PDF fixtures (test/fixtures/files/): one text-based, one image-based
- Stub system calls to pdftotext and tesseract where appropriate
- Test the full flow: download → analyze → OCR → extraction rows
- Verify text_quality classification logic
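
One way to stub the pdftotext call is to pass the command runner in as a parameter, so a test can substitute canned output. This is only a sketch: extract_pages, PdfExtractionError, and fake_pdftotext are hypothetical names, and the existing job may shell out differently, in which case a stubbing helper from the test framework would be the closer fit.

```ruby
require "open3"

PdfExtractionError = Class.new(StandardError)

# Hypothetical seam: returns one string per page from pdftotext output.
def extract_pages(pdf_path, runner: Open3.method(:capture3))
  stdout, stderr, status = runner.call("pdftotext", "-layout", pdf_path, "-")
  raise PdfExtractionError, "pdftotext failed: #{stderr.strip}" unless status.success?

  # pdftotext inserts a form feed at each page break; keep empty pages,
  # but drop the empty remainder after the final form feed.
  pages = stdout.split("\f", -1)
  pages.pop if pages.last == ""
  pages
end

# Minimal stand-in for Process::Status.
FakeStatus = Struct.new(:ok) do
  def success?
    ok
  end
end

# Builds a fake runner that returns canned output without spawning a process.
def fake_pdftotext(output, ok: true, stderr: "")
  ->(*_command) { [output, stderr, FakeStatus.new(ok)] }
end
```

A test could then call `extract_pages("agenda.pdf", runner: fake_pdftotext("Page one\fPage two\f"))` and assert on the returned pages, while the small fixtures in test/fixtures/files/ cover one or two runs against the real binaries.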
Dependencies
Documents::DownloadJob already has tests (test/jobs/documents/download_job_test.rb); use them as a pattern