Orange County EMS Agency (agency_id: 2520) protocols cannot be ingested because the protocol PDFs are scanned images with no embedded text (no OCR layer). Of the 93 PDFs listed, all 92 that could be downloaded yielded no usable text.
- Population affected: 3.2 million people (3rd largest county in California)
- Coverage gap: Critical Tier 1 LEMSA missing from database
- User impact: Orange County EMS providers cannot search their local protocols
- Successfully extracted all 93 PDF URLs from Orange County EMS website
- Downloaded 92 of the 93 PDFs (the SO-FR-003 AED protocol URL returned HTTP 404)
- PDF text extraction yielded 0-12 characters per document (only whitespace)
- Sample verification:
  - SO-M-55_Sepsis_9-2024.pdf: 2 pages, 4 chars
  - SO-T-25_Taser_6-2024.pdf: 2 pages, 4 chars
  - SO-C-40_VT_with_Pulse_10-2025.pdf: 2 pages, 4 chars
- lemsa-configs.ts: Orange County configuration complete with all 93 directPdfUrls
- PDFs cached: 92 files downloaded to scripts/.cache/pdfs/orange-county-ems-agency/
- Parsing rules: Ready (matches SO-X-### format)
- Blocker: No OCR capability in ingestion pipeline
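
For illustration, the SO-X-### parsing rule could look roughly like the sketch below. It is inferred only from the sample filenames listed above; the `parseProtocolFilename` helper, its return shape, and the regular expression are hypothetical and are not taken from lemsa-configs.ts.

```typescript
// Hypothetical helper: splits a cached filename such as
// "SO-M-55_Sepsis_9-2024.pdf" into protocol ID, title, and revision.
interface ProtocolFileInfo {
  id: string;       // e.g. "SO-M-55"
  title: string;    // e.g. "VT with Pulse"
  revision: string; // e.g. "10-2025" (month-year, as written in the filename)
}

// Assumed pattern: SO-<1-3 letters>-<1-3 digits>_<title>_<M-YYYY>.pdf
const PROTOCOL_FILE_RE = /^(SO-[A-Z]{1,3}-\d{1,3})_(.+)_(\d{1,2}-\d{4})\.pdf$/;

function parseProtocolFilename(name: string): ProtocolFileInfo | null {
  const match = PROTOCOL_FILE_RE.exec(name);
  if (!match) return null; // filename does not follow the assumed convention
  const [, id, rawTitle, revision] = match;
  // Underscores inside the title stand in for spaces
  return { id, title: rawTitle.replace(/_/g, ' '), revision };
}
```

Filenames that do not follow the assumed convention return `null`, so they can be logged and reviewed instead of being ingested under a wrong protocol ID.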
Option 1: Tesseract.js (local OCR)
Pros:
- Open source, free
- Runs locally, no API costs
- Good accuracy for English medical text
- Works offline
Cons:
- Slower processing (~5-15 seconds per page)
- Requires additional disk space for language models (~50MB)
- May struggle with poor quality scans
Implementation:
    pnpm add tesseract.js

Update scripts/lib/protocol-extractor.ts:

    // The first two imports are assumed to exist in the file already
    import { readFileSync } from 'node:fs';
    import pdfParse from 'pdf-parse';
    import Tesseract from 'tesseract.js';

    export async function extractPdfText(
      localPath: string
    ): Promise<{ text: string; pages: number }> {
      const buffer = readFileSync(localPath);
      const data = await pdfParse(buffer, { max: 0 }); // max: 0 parses every page

      // Scanned PDFs yield little or no text, so fall back to OCR
      if (data.text.trim().length < 50) {
        console.log(' PDF has no text layer, running OCR...');
        return await ocrPdf(localPath);
      }
      return { text: data.text, pages: data.numpages };
    }

    async function ocrPdf(pdfPath: string): Promise<{ text: string; pages: number }> {
      // Convert PDF to images, then OCR each page
      // Implementation details TBD; fail loudly until this is written
      throw new Error(`OCR not implemented yet: ${pdfPath}`);
    }

Estimated effort: 4-6 hours
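
One possible shape for the second half of the still-unimplemented `ocrPdf` step (OCR each page image, then join the text) is sketched below. Rendering PDF pages to images is deliberately left out. The `ocrPageImages` function and the `PageRecognizer` type are hypothetical names, and the recognizer is injected so the loop does not depend on any particular OCR library.

```typescript
// Hypothetical type: turns one rendered page image into its text.
type PageRecognizer = (imagePath: string) => Promise<string>;

// OCR the page images in order and combine them into one document.
async function ocrPageImages(
  imagePaths: string[],
  recognize: PageRecognizer
): Promise<{ text: string; pages: number }> {
  const pageTexts: string[] = [];
  for (const imagePath of imagePaths) {
    // Sequential on purpose: OCR is CPU-heavy, so pages are not run in parallel
    const pageText = await recognize(imagePath);
    pageTexts.push(pageText.trim());
  }
  // A blank line between pages keeps page breaks visible to later chunking
  return { text: pageTexts.join('\n\n'), pages: imagePaths.length };
}
```

With tesseract.js, `recognize` could presumably wrap a worker created with `createWorker('eng')` and return `(await worker.recognize(imagePath)).data.text`, terminating the worker after the last page. That wiring is an assumption and has not been run against these PDFs.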
Option 2: Cloud OCR API
Pros:
- Faster than Tesseract (~1-2 seconds per page)
- Better accuracy, especially for poor scans
- Handles handwriting and complex layouts
Cons:
- API costs (~$1.50 per 1000 pages)
- Requires internet connection
- Adds external dependency
Cost estimate for Orange County:
- 93 PDFs × average 2 pages = 186 pages
- 186 pages × $0.0015 = $0.28 for initial ingestion
Estimated effort: 2-3 hours
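
The cost and run-time arithmetic for both OCR options can be checked with a small sketch. The `estimateOcrRun` helper is hypothetical; the inputs are the per-page figures quoted in this document, and the time range assumes pages are processed one after another.

```typescript
// Hypothetical estimator for a one-off OCR run.
interface OcrEstimate {
  pages: number;
  costUsd: number;
  minutesLow: number;
  minutesHigh: number;
}

function estimateOcrRun(
  pdfCount: number,
  avgPagesPerPdf: number,
  usdPer1000Pages: number,
  secondsPerPage: [number, number] // [fastest, slowest]
): OcrEstimate {
  const pages = pdfCount * avgPagesPerPdf;
  return {
    pages,
    costUsd: (pages * usdPer1000Pages) / 1000,
    minutesLow: (pages * secondsPerPage[0]) / 60,
    minutesHigh: (pages * secondsPerPage[1]) / 60,
  };
}

// Cloud OCR: $1.50 per 1000 pages, 1-2 seconds per page
const cloud = estimateOcrRun(93, 2, 1.5, [1, 2]);
// Tesseract.js: no API cost, 5-15 seconds per page
const local = estimateOcrRun(93, 2, 0, [5, 15]);

console.log(cloud.pages, cloud.costUsd.toFixed(2)); // 186 0.28
console.log(local.minutesLow, local.minutesHigh);   // 15.5 46.5
```

Under these figures the cloud cost for Orange County is negligible, and a local Tesseract run would take roughly 15 to 47 minutes, so the choice rests more on dependencies and reuse than on price or speed.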
Option 3: Request text-based PDFs from the agency
Pros:
- No OCR needed
- Highest accuracy
- Free
Cons:
- Requires external coordination
- May not be available
- Timeline uncertain
Contact:
- Orange County EMS Agency: (714) 834-7200
- Website: https://ochealthinfo.com/about-hca/medical-health-services/emergency-medical-services
Implement Option 1 (Tesseract.js) for these reasons:
- One-time investment: Benefits all future LEMSAs with scanned PDFs
- No ongoing costs: Unlike cloud OCR
- Complete control: No dependency on an external service, and no rate limits
- California coverage: Many smaller LEMSAs likely have similar scanned PDFs
- Config: /Users/tanner-osterkamp/Protocol-Guide/scripts/archive/parsers/lemsa-configs.ts (lines 95-194)
- PDFs: /Users/tanner-osterkamp/Protocol-Guide/scripts/.cache/pdfs/orange-county-ems-agency/
- Extractor: /Users/tanner-osterkamp/Protocol-Guide/scripts/lib/protocol-extractor.ts
- Decide on OCR approach (Tesseract vs. Cloud)
- Implement OCR in protocol-extractor.ts
- Re-run ingestion:

      npx tsx scripts/ingest-ca-protocols.ts --lemsa "Orange"

- Verify chunks in database (expected: 600-800 chunks):

      SELECT COUNT(*) FROM manus_protocol_chunks WHERE agency_id = 2520;

This same OCR requirement may affect:
- Smaller California LEMSAs (Tier 3)
- Other state EMS systems
- Historical protocol versions
Created: 2026-02-18
Status: Blocked on OCR implementation
Priority: High (Tier 1 LEMSA, 3.2M population)