Orange County EMS Protocol Ingestion - OCR Requirement

Problem Summary

Orange County EMS Agency (agency_id: 2520) protocols cannot be ingested because all 93 protocol PDFs are scanned images without embedded text (no OCR layer).

Impact

  • Population affected: 3.2 million people (third most populous county in California)
  • Coverage gap: Critical Tier 1 LEMSA missing from database
  • User impact: Orange County EMS providers cannot search their local protocols

Technical Details

Discovery Process

  1. Successfully extracted all 93 PDF URLs from Orange County EMS website
  2. Downloaded 92 of the 93 PDFs (one returned HTTP 404: SO-FR-003, the AED protocol)
  3. PDF text extraction yielded 0-12 characters per document (only whitespace)
  4. Sample verification:
    • SO-M-55_Sepsis_9-2024.pdf: 2 pages, 4 chars
    • SO-T-25_Taser_6-2024.pdf: 2 pages, 4 chars
    • SO-C-40_VT_with_Pulse_10-2025.pdf: 2 pages, 4 chars
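The sample check above could be run across the whole cache. The following is a minimal sketch, assuming the per-file page and character counts have already been collected; the `ExtractionResult` shape and the `findScannedPdfs` name are hypothetical, and the 50-character default mirrors the threshold used in the extractor snippet later in this document.

```typescript
// Hypothetical record of one text-extraction attempt on a cached PDF.
interface ExtractionResult {
  file: string;
  pages: number;
  chars: number; // characters returned by text extraction
}

// Assumption: a document with fewer than `minChars` extracted characters
// has no usable text layer and needs OCR.
export function findScannedPdfs(
  results: ExtractionResult[],
  minChars = 50
): string[] {
  return results.filter((r) => r.chars < minChars).map((r) => r.file);
}

// The three samples listed above, each with 4 extracted characters.
const samples: ExtractionResult[] = [
  { file: 'SO-M-55_Sepsis_9-2024.pdf', pages: 2, chars: 4 },
  { file: 'SO-T-25_Taser_6-2024.pdf', pages: 2, chars: 4 },
  { file: 'SO-C-40_VT_with_Pulse_10-2025.pdf', pages: 2, chars: 4 },
];
console.log(findScannedPdfs(samples).length); // 3
```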

Current Status

  • lemsa-configs.ts: Orange County configuration complete with all 93 directPdfUrls
  • PDFs cached: 92 files downloaded to scripts/.cache/pdfs/orange-county-ems-agency/
  • Parsing rules: Ready (matches SO-X-### format)
  • Blocker: No OCR capability in ingestion pipeline
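For reference, the SO-X-### match could look something like the sketch below. This is not the rule in lemsa-configs.ts; the pattern, the `ProtocolId` shape, and the `parseProtocolId` name are assumptions inferred from the filenames listed above (SO-M-55, SO-T-25, SO-C-40, SO-FR-003), which show one- or two-letter categories and two- or three-digit numbers.

```typescript
// Hypothetical parsed form of an Orange County protocol identifier.
interface ProtocolId {
  category: string; // e.g. "M", "T", "C", "FR"
  number: string;   // kept as a string so leading zeros ("003") survive
  id: string;       // e.g. "SO-M-55"
}

// Assumed pattern: "SO-", 1-3 uppercase letters, "-", 1-3 digits,
// followed by a separator or the end of the string.
const PROTOCOL_ID_PATTERN = /^SO-([A-Z]{1,3})-(\d{1,3})(?=[_.\s-]|$)/;

export function parseProtocolId(filename: string): ProtocolId | null {
  const match = PROTOCOL_ID_PATTERN.exec(filename);
  if (!match) return null;
  const [, category, number] = match;
  return { category, number, id: `SO-${category}-${number}` };
}

console.log(parseProtocolId('SO-M-55_Sepsis_9-2024.pdf')?.id); // SO-M-55
```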

Solutions

Option 1: Add Tesseract.js OCR (Recommended)

Pros:

  • Open source, free
  • Runs locally, no API costs
  • Good accuracy for English medical text
  • Works offline

Cons:

  • Slower processing (~5-15 seconds per page)
  • Requires additional disk space for language models (~50MB)
  • May struggle with poor quality scans

Implementation:

pnpm add tesseract.js

Update scripts/lib/protocol-extractor.ts:

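import { readFileSync } from 'node:fs';
// Assumes the pdf-parse package, which matches the pdfParse call and numpages field below
import pdfParse from 'pdf-parse';
import Tesseract from 'tesseract.js';

// Below this many extracted characters, treat the PDF as a scan with no text layer
const MIN_TEXT_CHARS = 50;

export async function extractPdfText(
  localPath: string
): Promise<{ text: string; pages: number }> {
  const buffer = readFileSync(localPath);
  const data = await pdfParse(buffer, { max: 0 }); // max: 0 parses all pages

  // Fall back to OCR when the PDF has little or no embedded text
  if (data.text.trim().length < MIN_TEXT_CHARS) {
    console.log('    PDF has no text layer, running OCR...');
    return await ocrPdf(localPath);
  }

  return { text: data.text, pages: data.numpages };
}

async function ocrPdf(pdfPath: string): Promise<{ text: string; pages: number }> {
  // Convert PDF to images, then OCR each page
  // Implementation details TBD
  throw new Error(`OCR not implemented yet: ${pdfPath}`);
}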
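One possible shape for the second half of `ocrPdf` is sketched below. It assumes the PDF pages have already been rendered to image files (that step is still TBD) and takes the recognizer as a parameter, so the page loop can be exercised without an OCR engine. The `RecognizePage` type and the `ocrPageImages` name are hypothetical. With tesseract.js, the recognizer would presumably wrap a worker's `recognize` call and return its `data.text`.

```typescript
// Hypothetical recognizer signature: takes the path of one rendered
// page image and resolves to the text found on that page.
type RecognizePage = (imagePath: string) => Promise<string>;

export async function ocrPageImages(
  imagePaths: string[],
  recognize: RecognizePage
): Promise<{ text: string; pages: number }> {
  const pageTexts: string[] = [];

  // Process pages one at a time so the output follows page order
  for (const imagePath of imagePaths) {
    const pageText = await recognize(imagePath);
    pageTexts.push(pageText.trim());
  }

  // Separate pages with a blank line, matching the return shape of extractPdfText
  return { text: pageTexts.join('\n\n'), pages: imagePaths.length };
}
```

Injecting the recognizer also keeps the choice between Tesseract.js and a cloud OCR service out of the page loop, so either option could reuse the same function.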

Estimated effort: 4-6 hours

Option 2: Cloud OCR (Google Vision API, AWS Textract)

Pros:

  • Faster than Tesseract (~1-2 seconds per page)
  • Better accuracy, especially for poor scans
  • Handles handwriting and complex layouts

Cons:

  • API costs (~$1.50 per 1000 pages)
  • Requires internet connection
  • Adds external dependency

Cost estimate for Orange County:

  • 93 PDFs × average 2 pages = 186 pages
  • 186 pages × $0.0015 per page ≈ $0.28 for initial ingestion
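The same arithmetic can be kept as a small helper for sizing other LEMSAs. This is a sketch only: `estimateOcrRun` and `OcrEstimate` are hypothetical names, and the inputs are the approximate figures quoted in this document (~$1.50 per 1000 pages, ~1-2 seconds per page), not measured values.

```typescript
interface OcrEstimate {
  pages: number;
  costUsd: number;
  minSeconds: number;
  maxSeconds: number;
}

// Estimate page count, API cost, and processing time for one ingestion run.
export function estimateOcrRun(
  pdfCount: number,
  avgPagesPerPdf: number,
  usdPerThousandPages: number,
  secondsPerPage: [number, number] // [fastest, slowest] per-page time
): OcrEstimate {
  const pages = pdfCount * avgPagesPerPdf;
  return {
    pages,
    costUsd: (pages / 1000) * usdPerThousandPages,
    minSeconds: pages * secondsPerPage[0],
    maxSeconds: pages * secondsPerPage[1],
  };
}

// Orange County with the cloud OCR figures quoted above
const cloud = estimateOcrRun(93, 2, 1.5, [1, 2]);
console.log(cloud.pages, cloud.costUsd.toFixed(2)); // 186 0.28
```

Calling it with a price of 0 and the Tesseract range of 5-15 seconds per page gives 930-2790 seconds, or roughly 15-47 minutes for the same 186 pages.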

Estimated effort: 2-3 hours

Option 3: Contact Orange County for Text-Based PDFs

Pros:

  • No OCR needed
  • Highest accuracy
  • Free

Cons:

  • Requires external coordination
  • May not be available
  • Timeline uncertain

Contact:

Recommendation

Implement Option 1 (Tesseract.js) for these reasons:

  1. One-time investment: Benefits all future LEMSAs with scanned PDFs
  2. No ongoing costs: Unlike cloud OCR
  3. Complete control: No external service dependencies or rate limits
  4. California coverage: Many smaller LEMSAs likely have similar scanned PDFs

File References

  • Config: /Users/tanner-osterkamp/Protocol-Guide/scripts/archive/parsers/lemsa-configs.ts (lines 95-194)
  • PDFs: /Users/tanner-osterkamp/Protocol-Guide/scripts/.cache/pdfs/orange-county-ems-agency/
  • Extractor: /Users/tanner-osterkamp/Protocol-Guide/scripts/lib/protocol-extractor.ts

Next Steps

  1. Decide on OCR approach (Tesseract vs. Cloud)
  2. Implement OCR in protocol-extractor.ts
  3. Re-run ingestion: npx tsx scripts/ingest-ca-protocols.ts --lemsa "Orange"
  4. Verify chunks in database:
    SELECT COUNT(*) FROM manus_protocol_chunks WHERE agency_id = 2520;
    Expected: 600-800 chunks

Related Issues

This same OCR requirement may affect:

  • Smaller California LEMSAs (Tier 3)
  • Other state EMS systems
  • Historical protocol versions

Created: 2026-02-18
Status: Blocked on OCR implementation
Priority: High (Tier 1 LEMSA, 3.2M population)