Orange County EMS Protocol Ingestion - OCR Requirement

Problem Summary

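Orange County EMS Agency (agency_id: 2520) protocols cannot be ingested because the protocol PDFs are scanned images without embedded text (no OCR layer). Text extraction returned nothing usable for any of the 92 PDFs (of 93 listed) that could be downloaded.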

Impact

Population affected: 3.2 million people (3rd most populous county in California)
Coverage gap: Critical Tier 1 LEMSA missing from database
User impact: Orange County EMS providers cannot search their local protocols

Technical Details

Discovery Process

Successfully extracted all 93 PDF URLs from the Orange County EMS website
Downloaded 92 PDFs (the SO-FR-003 AED protocol returned a 404 error)
PDF text extraction yielded 0-12 characters per document (whitespace only)
Sample verification:
- SO-M-55_Sepsis_9-2024.pdf: 2 pages, 4 chars
- SO-T-25_Taser_6-2024.pdf: 2 pages, 4 chars
- SO-C-40_VT_with_Pulse_10-2025.pdf: 2 pages, 4 chars
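
The sample figures above suggest a simple per-page check for a missing text layer. The sketch below is one way to express it; the function name `looksScanned` and the 25-characters-per-page cutoff are illustrative assumptions, not values taken from the pipeline.

```typescript
// Hypothetical helper: flags a PDF as scanned when the extracted text is
// far too short for its page count. The cutoff is an assumed value.
const MIN_CHARS_PER_PAGE = 25;

function looksScanned(text: string, pages: number): boolean {
  if (pages <= 0) return true; // nothing parsed, so treat as needing OCR
  const meaningful = text.replace(/\s+/g, '').length; // ignore whitespace
  return meaningful / pages < MIN_CHARS_PER_PAGE;
}

// A 2-page scan that yields 4 whitespace characters, like the samples above
console.log(looksScanned(' \n\n ', 2)); // true
// A 2-page PDF with a real text layer
console.log(looksScanned('Sepsis protocol text '.repeat(20), 2)); // false
```

Counting only non-whitespace characters avoids treating a page of blank lines as real content, and scaling by page count keeps long scanned documents from slipping past a fixed threshold.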

Current Status

lemsa-configs.ts: Orange County configuration complete with all 93 directPdfUrls
PDFs cached: 92 files downloaded to scripts/.cache/pdfs/orange-county-ems-agency/
Parsing rules: Ready (matches SO-X-### format)
Blocker: No OCR capability in ingestion pipeline
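
For reference, the SO-X-### naming seen in the sample files can be matched with a regular expression like the one below. This is a sketch only: the actual parsing rules live in lemsa-configs.ts and may differ. The function name, the field names, and the assumed shape (one to three capital letters, then two or three digits, then a title and a month-year suffix) are inferred from the file names mentioned in this document.

```typescript
interface ProtocolRef {
  id: string;       // e.g. "SO-M-55"
  category: string; // e.g. "M"
  title: string;    // e.g. "Sepsis"
  revised: string;  // e.g. "9-2024", month-year as written in the file name
}

// Assumed pattern: SO-<category>-<number>_<Title_Words>_<M-YYYY>.pdf
const FILE_NAME = /^(SO-([A-Z]{1,3})-(\d{2,3}))_(.+)_(\d{1,2}-\d{4})\.pdf$/;

function parseProtocolFileName(fileName: string): ProtocolRef | null {
  const m = FILE_NAME.exec(fileName);
  if (!m) return null; // not a recognized protocol file name
  return {
    id: m[1],
    category: m[2],
    title: m[4].replace(/_/g, ' '), // underscores separate title words
    revised: m[5],
  };
}

console.log(parseProtocolFileName('SO-M-55_Sepsis_9-2024.pdf'));
// { id: 'SO-M-55', category: 'M', title: 'Sepsis', revised: '9-2024' }
```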

Solutions

Option 1: Add Tesseract.js OCR (Recommended)

Pros:

Open source, free
Runs locally, no API costs
Good accuracy for English medical text
Works offline once the language data has been downloaded and cached

Cons:

Slower processing (~5-15 seconds per page)
Requires additional disk space for language models (~50MB)
May struggle with poor quality scans

Implementation:

pnpm add tesseract.js

Update scripts/lib/protocol-extractor.ts:

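// readFileSync and pdfParse are assumed to be imported already in
// protocol-extractor.ts; they are shown here for completeness.
import { readFileSync } from 'node:fs';
import pdfParse from 'pdf-parse';
import Tesseract from 'tesseract.js';

// Below this many characters, treat the PDF as having no text layer
const MIN_TEXT_LENGTH = 50;

export async function extractPdfText(
  localPath: string
): Promise<{ text: string; pages: number }> {
  const buffer = readFileSync(localPath);
  const data = await pdfParse(buffer, { max: 0 }); // max: 0 parses every page

  // Scanned PDFs yield only a few whitespace characters, so fall back to OCR
  if (data.text.trim().length < MIN_TEXT_LENGTH) {
    console.log('    PDF has no text layer, running OCR...');
    return await ocrPdf(localPath);
  }

  return { text: data.text, pages: data.numpages };
}

async function ocrPdf(pdfPath: string): Promise<{ text: string; pages: number }> {
  // Convert PDF to images, then OCR each page with Tesseract
  // Implementation details TBD
  throw new Error(`ocrPdf is not implemented yet (called for ${pdfPath})`);
}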

Estimated effort: 4-6 hours
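
One way to structure the OCR fallback is to keep the page loop independent of the libraries that rasterize and recognize pages. The sketch below assumes the PDF has already been converted to one image per page (this document does not choose a rasterizer) and takes the recognizer as a parameter; with tesseract.js, that parameter could wrap a worker's `recognize` call. The names `ocrPages`, `Recognize`, and `PAGE_BREAK` are hypothetical.

```typescript
// Recognizer signature: takes one page image, resolves to its text.
type Recognize = (pageImage: Uint8Array) => Promise<string>;

const PAGE_BREAK = '\n\n'; // assumed separator between pages

async function ocrPages(
  pageImages: Uint8Array[],
  recognize: Recognize
): Promise<{ text: string; pages: number }> {
  const pageTexts: string[] = [];
  // Pages are processed sequentially so a single OCR worker can be reused.
  for (const image of pageImages) {
    const text = await recognize(image);
    pageTexts.push(text.trim());
  }
  return { text: pageTexts.join(PAGE_BREAK), pages: pageImages.length };
}

// Usage with a stand-in recognizer; a real one would call the OCR engine.
const fakeRecognize: Recognize = async (img) => ` page of ${img.length} bytes \n`;
ocrPages([new Uint8Array(3), new Uint8Array(5)], fakeRecognize).then((result) => {
  console.log(result.pages); // 2
  console.log(JSON.stringify(result.text)); // "page of 3 bytes\n\npage of 5 bytes"
});
```

The return shape matches `extractPdfText`, so the result could be passed straight through from `ocrPdf`. Injecting the recognizer also lets the loop be tested without running OCR.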

Option 2: Cloud OCR (Google Vision API, AWS Textract)

Pros:

Faster than Tesseract (~1-2 seconds per page)
Better accuracy, especially for poor scans
Handles handwriting and complex layouts

Cons:

API costs (~$1.50 per 1000 pages)
Requires internet connection
Adds external dependency

Cost estimate for Orange County:

93 PDFs × average 2 pages = 186 pages
186 pages × $0.0015 = $0.28 for initial ingestion
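
The arithmetic above, together with the per-page timing quoted for Tesseract, can be reproduced with a small calculation. This is a sketch: the function and field names are hypothetical, and the rates are the approximate figures from this document rather than current vendor pricing.

```typescript
interface OcrEstimate {
  pages: number;
  cloudCostUsd: number;
  localMinutes: [number, number]; // [best case, worst case]
}

function estimateOcr(pdfCount: number, avgPagesPerPdf: number): OcrEstimate {
  const pages = pdfCount * avgPagesPerPdf;
  // ~$1.50 per 1000 pages, rounded to the nearest cent
  const cloudCostUsd = Math.round(pages * 0.0015 * 100) / 100;
  // ~5-15 seconds per page for local OCR, converted to minutes
  const localMinutes: [number, number] = [(pages * 5) / 60, (pages * 15) / 60];
  return { pages, cloudCostUsd, localMinutes };
}

const estimate = estimateOcr(93, 2);
console.log(estimate.pages);        // 186
console.log(estimate.cloudCostUsd); // 0.28
console.log(estimate.localMinutes); // [ 15.5, 46.5 ]
```

At these rates, a local run would take roughly 15 to 47 minutes for the initial ingestion, against about $0.28 for the cloud option.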

Estimated effort: 2-3 hours

Option 3: Contact Orange County for Text-Based PDFs

Pros:

No OCR needed
Highest accuracy
Free

Cons:

Requires external coordination
May not be available
Timeline uncertain

Contact:

Orange County EMS Agency: (714) 834-7200
Website: https://ochealthinfo.com/about-hca/medical-health-services/emergency-medical-services

Recommendation

Implement Option 1 (Tesseract.js) for these reasons:

One-time investment: Benefits all future LEMSAs with scanned PDFs
No ongoing costs: Unlike cloud OCR
Complete control: No external service dependencies or rate limits
California coverage: Many smaller LEMSAs likely have similar scanned PDFs

File References

Config: /Users/tanner-osterkamp/Protocol-Guide/scripts/archive/parsers/lemsa-configs.ts (lines 95-194)
PDFs: /Users/tanner-osterkamp/Protocol-Guide/scripts/.cache/pdfs/orange-county-ems-agency/
Extractor: /Users/tanner-osterkamp/Protocol-Guide/scripts/lib/protocol-extractor.ts

Next Steps

Decide on OCR approach (Tesseract vs. Cloud)
Implement OCR in protocol-extractor.ts
Re-run ingestion: npx tsx scripts/ingest-ca-protocols.ts --lemsa "Orange"

Verify chunks in database:

SELECT COUNT(*) FROM manus_protocol_chunks WHERE agency_id = 2520;

Expected: 600-800 chunks

Related Issues

This same OCR requirement may affect:

Smaller California LEMSAs (Tier 3)
Other state EMS systems
Historical protocol versions

Created: 2026-02-18
Status: Blocked on OCR implementation
Priority: High (Tier 1 LEMSA, 3.2M population)
