Test coverage: Document processing jobs (AnalyzePdf, OCR) #55

@AndreRobitaille

Summary

The document processing jobs (Documents::AnalyzePdfJob and Documents::OcrJob) have no test coverage. These jobs handle PDF text extraction and OCR — the foundation for all downstream AI analysis.

Jobs Needing Tests

Documents::AnalyzePdfJob (app/jobs/documents/analyze_pdf_job.rb)

  • Extracts text from PDF documents using pdftotext
  • Performs page-level analysis (creates Extraction rows)
  • Classifies text_quality (good, poor, no_text)
  • Triggers OcrJob for scanned/image-based documents

Test scenarios:

  • Text-based PDF → extracts text, sets text_quality: good, creates Extraction rows per page
  • Image-based/scanned PDF → detects low text quality, enqueues OcrJob
  • Already-processed document (idempotent re-run) → clears and rebuilds extractions
  • Corrupt/unreadable PDF → fails gracefully (no unhandled exception)
  • Updates MeetingDocument fields: extracted_text, text_chars, avg_chars_per_page, page_count
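
The classification and per-page statistics in these scenarios could be sketched roughly as follows. This is only an illustration, not the job's actual implementation: the module name TextQualitySketch, the 100-characters-per-page cutoff, and the returned hash are all assumptions. It relies on pdftotext separating pages with form feed characters in its output.

```ruby
# Hypothetical helper; names and the threshold are illustrative assumptions.
module TextQualitySketch
  # Assumed cutoff: fewer average characters per page than this counts as poor.
  POOR_THRESHOLD = 100

  # pdftotext ends each page with a form feed ("\f"), so splitting on it
  # yields one string per page. The -1 limit keeps blank pages in the middle;
  # the single trailing empty piece after the last form feed is dropped.
  def self.split_pages(raw_text)
    pages = raw_text.split("\f", -1)
    pages.pop if pages.any? && pages.last.strip.empty?
    pages
  end

  # Returns the figures the issue lists for MeetingDocument, plus a quality label.
  def self.analyze(raw_text)
    pages = split_pages(raw_text)
    page_count = pages.length
    text_chars = pages.sum { |page| page.strip.length }
    avg = page_count.zero? ? 0.0 : text_chars.to_f / page_count

    quality =
      if text_chars.zero?
        :no_text
      elsif avg < POOR_THRESHOLD
        :poor
      else
        :good
      end

    {
      page_count: page_count,
      text_chars: text_chars,
      avg_chars_per_page: avg,
      text_quality: quality
    }
  end
end
```

Keeping the classification in a pure function like this would let the threshold logic be tested with plain strings, leaving the PDF fixtures for the end-to-end cases.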

Documents::OcrJob (app/jobs/documents/ocr_job.rb)

  • Runs Tesseract OCR on image-based PDFs
  • Updates MeetingDocument.ocr_status and extracted text

Test scenarios:

  • Scanned PDF → OCR produces text, updates extracted_text and ocr_status
  • PDF with mixed text/image pages → handles correctly
  • Tesseract unavailable → graceful failure with appropriate status
  • Idempotent re-run → running the job again leaves extracted_text and ocr_status unchanged
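
One way to make the "Tesseract unavailable" scenario testable without the binary installed is to inject the command runner. The sketch below is an assumption about how that could look, not the existing OcrJob code: OcrSketch, Result, the runner: keyword, and the "completed"/"failed" status strings are hypothetical. It also assumes the page has already been rasterized to an image, since the tesseract CLI takes image input.

```ruby
require "open3"

# Hypothetical wrapper; names and status strings are illustrative assumptions.
module OcrSketch
  Result = Struct.new(:ocr_status, :text, :error, keyword_init: true)

  # Default runner shells out for real; tests can pass a lambda instead.
  DEFAULT_RUNNER = ->(*cmd) { Open3.capture3(*cmd) }

  # Runs `tesseract <image> stdout`, which prints recognized text to stdout.
  def self.run(image_path, runner: DEFAULT_RUNNER)
    stdout, stderr, status = runner.call("tesseract", image_path, "stdout")
    if status.success?
      Result.new(ocr_status: "completed", text: stdout.strip, error: nil)
    else
      Result.new(ocr_status: "failed", text: nil, error: stderr.strip)
    end
  rescue Errno::ENOENT => e
    # Open3 raises Errno::ENOENT when the executable is not on PATH.
    Result.new(ocr_status: "failed", text: nil,
               error: "tesseract not available: #{e.message}")
  end
end
```

In a test, passing a runner such as `->(*_cmd) { raise Errno::ENOENT, "tesseract" }` simulates a missing binary, and the result can be asserted on without rescuing anything in the test itself.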

Approach

  • Create small test PDF fixtures (test/fixtures/files/): one text-based, one image-based
  • Stub system calls to pdftotext and tesseract where appropriate
  • Test the full flow: download → analyze → OCR → extraction rows
  • Verify text_quality classification logic
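
For the text-based fixture, a tiny one-page PDF can be generated from a script rather than committed from an external tool. The following is a rough sketch under stated assumptions: PdfFixtureSketch and the output filename are hypothetical, the text is assumed to be plain ASCII, and the resulting file should be checked against pdftotext before the tests depend on it.

```ruby
# Hypothetical fixture generator; module name and file path are assumptions.
module PdfFixtureSketch
  # Builds a minimal single-page PDF with one line of Helvetica text.
  def self.build(text)
    # Parentheses and backslashes must be escaped inside PDF string literals.
    escaped = text.gsub(/[\\()]/) { |ch| "\\#{ch}" }
    stream = "BT /F1 12 Tf 72 720 Td (#{escaped}) Tj ET"

    objects = [
      "<< /Type /Catalog /Pages 2 0 R >>",
      "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
      "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] " \
        "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
      "<< /Length #{stream.bytesize} >>\nstream\n#{stream}\nendstream",
      "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"
    ]

    pdf = String.new("%PDF-1.4\n")
    offsets = []
    objects.each_with_index do |body, index|
      offsets << pdf.bytesize # byte offset of this object, for the xref table
      pdf << "#{index + 1} 0 obj\n#{body}\nendobj\n"
    end

    xref_position = pdf.bytesize
    pdf << "xref\n0 #{objects.size + 1}\n"
    pdf << "0000000000 65535 f \n" # each xref entry is exactly 20 bytes
    offsets.each { |offset| pdf << format("%010d 00000 n \n", offset) }
    pdf << "trailer\n<< /Size #{objects.size + 1} /Root 1 0 R >>\n"
    pdf << "startxref\n#{xref_position}\n%%EOF\n"
    pdf
  end
end

# Example (path is illustrative):
# File.binwrite("test/fixtures/files/text_based.pdf",
#               PdfFixtureSketch.build("Sample meeting agenda"))
```

The image-based fixture is harder to hand-write, so a small scanned page exported from a real tool is probably the simpler choice there.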

Dependencies

  • Documents::DownloadJob already has tests (test/jobs/documents/download_job_test.rb) — use as a pattern
