Last Updated: 2026-02-07
Purpose: Single source of truth for how Protocol Guide models the U.S. EMS protocol ecosystem
Protocol Guide is a national EMS protocol library with jurisdiction-aware retrieval-augmented generation (RAG). It ingests, chunks, embeds, and serves the actual clinical protocols that govern prehospital care across the United States — scoped to the specific jurisdiction where a provider operates.
Every architectural decision flows from one constraint: EMS protocols are not national. They are set by local medical authorities (LEMSAs, state offices, regional agencies), and a paramedic in Los Angeles follows different standing orders than one in San Diego. The system must resolve a user's jurisdiction, scope search to that jurisdiction's protocols, and generate answers citing only those protocols.
U.S. prehospital protocol authority flows through a strict hierarchy. Protocol Guide models this as:
Nation (United States)
└─ State (e.g., California, Texas)
   └─ LEMSA / Agency (e.g., "Los Angeles County EMS Agency")
      └─ County/Counties (one LEMSA may cover multiple counties)
         └─ Protocols (stored as vector-embedded chunks)
| Term | Definition |
|---|---|
| LEMSA | Local Emergency Medical Services Agency. The regional authority that writes, approves, and publishes clinical protocols for prehospital providers. California has 33 LEMSAs. Other states use different structures (state EMS offices, regional medical directors). |
| Agency | The generalized term in the database for any protocol-issuing authority. Maps 1:1 with a LEMSA in California; will map to state offices or regional bodies in other states. |
| County | The geographic unit users select. A single LEMSA often covers multiple counties (e.g., Central California EMS Agency covers Fresno, Kings, Madera, Tulare). |
| Protocol | A clinical treatment guideline — cardiac arrest management, medication dosing, trauma assessment — authored by the LEMSA/agency. |
| Protocol Chunk | A semantically bounded segment of a protocol, optimized for vector search retrieval. Typically 400-1800 characters. |
| Standing Order | A subset of protocols that authorize specific treatments without real-time physician contact. Scope-of-practice dependent. |
Many LEMSAs cover more than one county. The county_agency_mapping table bridges this:
county_agency_mapping
┌─────────────┬────────────┬──────────────────────────┬────────────┐
│ county_id │ agency_id │ agency_name │ state_code │
├─────────────┼────────────┼──────────────────────────┼────────────┤
│ 1234 (Fresno)│ 42 │ Central California EMS │ CA │
│ 1235 (Kings) │ 42 │ Central California EMS │ CA │
│ 1236 (Madera)│ 42 │ Central California EMS │ CA │
│ 1237 (Tulare)│ 42 │ Central California EMS │ CA │
└─────────────┴────────────┴──────────────────────────┴────────────┘
When a user selects "Fresno County," the system resolves to agency_id=42 and searches only Central California EMS protocols.
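The lookup described above can be sketched as a simple in-memory resolution. This is a minimal illustration, not the project's actual query code: the `CountyAgencyRow` shape and the `resolveAgency` and `countiesForAgency` helpers are hypothetical names, and the rows mirror the example table rather than live data.

```typescript
// Hypothetical in-memory mirror of county_agency_mapping rows (values taken from the example table).
interface CountyAgencyRow {
  countyId: number;
  agencyId: number;
  agencyName: string;
  stateCode: string;
}

const countyAgencyMapping: CountyAgencyRow[] = [
  { countyId: 1234, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
  { countyId: 1235, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
  { countyId: 1236, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
  { countyId: 1237, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
];

// Returns the agency governing a county, or undefined when no mapping exists.
function resolveAgency(
  countyId: number,
  rows: CountyAgencyRow[] = countyAgencyMapping,
): CountyAgencyRow | undefined {
  return rows.find((row) => row.countyId === countyId);
}

// Reverse lookup: every county id served by one agency.
function countiesForAgency(
  agencyId: number,
  rows: CountyAgencyRow[] = countyAgencyMapping,
): number[] {
  return rows.filter((row) => row.agencyId === agencyId).map((row) => row.countyId);
}
```

Returning `undefined` for an unmapped county leaves the caller to decide what happens next, for example falling back to statewide protocols or asking the user to pick again.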
┌──────────────────────┐ ┌──────────────────────────┐
│ manus_agencies │ │ manus_protocol_chunks │
├──────────────────────┤ ├──────────────────────────┤
│ id (PK) │◄──────│ agency_id (FK) │
│ name │ │ agency_name │
│ state_code │ │ state_code │
│ state_name │ │ state_name │
│ protocol_count │ │ protocol_number │
│ integration_partner │ │ protocol_title │
└──────────────────────┘ │ section │
▲ │ content │
│ │ embedding (vector 1536) │
┌───────┴──────────────┐ │ protocol_year │
│ county_agency_mapping│ │ protocol_effective_date │
├──────────────────────┤ │ source_pdf_url │
│ county_id (FK) │ │ content_type │
│ agency_id (FK) │ │ embedding_version │
│ agency_name │ └──────────────────────────┘
│ state_code │
└──────────────────────┘
│
▼
┌──────────────────────┐
│ counties │
├──────────────────────┤
│ id (PK) │
│ name │
│ state │
│ uses_state_protocols │
│ protocol_version │
└──────────────────────┘
| Table | EMS Role |
|---|---|
| manus_agencies | Registry of all protocol-issuing authorities. 2,738 agencies across 53 states/territories. |
| manus_protocol_chunks | 58,000+ vector-embedded protocol segments. Each chunk belongs to one agency and one state. This is the search corpus. |
| county_agency_mapping | Bridges user county selection to the correct agency. Enables "select your county → get your LEMSA's protocols." |
| counties | 2,713 U.S. counties. The uses_state_protocols flag marks counties that follow statewide (not regional) protocols. |
| manus_users | User accounts with selectedAgencyId for persistent jurisdiction preference. |
Protocols flow from LEMSA websites into the search corpus through a seven-step pipeline:
LEMSA Website Local Cache Supabase
───────────── ─────────── ────────
1. Discover PDFs ──► 2. Download PDFs ──► 3. Extract Text
(crawl LEMSA (cache in │
website) .cache/pdfs/) ▼
4. Split into
Protocol Blocks
│
▼
5. Chunk Protocols
(400-1800 chars)
│
▼
6. Generate Embeddings
(Gemini Embedding 2 Preview, 1536 dim; Voyage removed 2026-03-24)
│
▼
7. Insert into
manus_protocol_chunks
+ update agency count
Step 1 — PDF Discovery (scripts/lib/pdf-url-discoverer.ts)
- Crawls the LEMSA's protocol webpage
- Finds PDF links via Cheerio HTML parsing
- Filters out non-protocol PDFs (forms, agendas, job postings)
- Supports three discovery strategies: pdf (direct links), web (sub-page crawling), acidremap (portal extraction)
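The filtering step in the list above might look roughly like the following. The excluded-token list and both function names are assumptions made for illustration; the real rules in the discoverer script are not shown in this document.

```typescript
// Hypothetical exclusion list; the actual filter rules are not documented here.
const NON_PROTOCOL_TOKENS = new Set(["form", "forms", "agenda", "agendas", "job", "jobs"]);

// True when the link targets a PDF whose file name contains no excluded token.
function isLikelyProtocolPdf(url: string): boolean {
  // Drop query string and fragment before inspecting the path.
  const path = (url.split(/[?#]/)[0] ?? "").toLowerCase();
  if (!path.endsWith(".pdf")) return false;
  const fileName = path.slice(path.lastIndexOf("/") + 1);
  // Compare whole tokens so "formulary.pdf" is not rejected for containing "form".
  const tokens = fileName.split(/[^a-z0-9]+/);
  return !tokens.some((token) => NON_PROTOCOL_TOKENS.has(token));
}

function filterProtocolPdfs(urls: string[]): string[] {
  return urls.filter(isLikelyProtocolPdf);
}
```

Matching on whole file-name tokens rather than substrings is a deliberate choice in this sketch, since clinical documents such as formularies would otherwise be dropped.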
Step 2 — PDF Download (scripts/lib/pdf-downloader.ts)
- Downloads with retry (3 attempts, exponential backoff)
- Validates PDF magic bytes (%PDF header)
- Caches locally at .cache/pdfs/{lemsa-slug}/{hash}_{filename}.pdf
- Rate-limited: 500ms between downloads, max 2 concurrent
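Two of the download safeguards, the magic-byte check and retry with exponential backoff, can be sketched as below. The helper names and the 1000 ms base delay are assumptions; the document specifies only three attempts and exponential backoff.

```typescript
// True when the buffer begins with the ASCII bytes "%PDF".
function hasPdfMagicBytes(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46]; // "%", "P", "D", "F"
  return bytes.length >= magic.length && magic.every((byte, i) => bytes[i] === byte);
}

// Delay before the next try after a failed attempt (0-based): base * 2^attempt.
// The 1000 ms base is an assumed value.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * Math.pow(2, attempt);
}

// Runs task up to maxAttempts times, sleeping between failures.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap the actual HTTP request in `withRetry` and then reject any response body for which `hasPdfMagicBytes` is false, since some sites return an HTML error page with a `.pdf` URL.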
Step 3 — Text Extraction (scripts/lib/protocol-extractor.ts)
- Extracts text via the pdf-parse library
- Per-LEMSA parsing rules handle different PDF formats
- Cleans: removes page numbers, null bytes, normalizes whitespace
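A minimal sketch of the cleaning pass follows. The page-number pattern is an assumption (bare numbers and "Page N of M" lines); real per-LEMSA rules are likely more specific.

```typescript
// Cleans raw extracted text: strips null bytes, drops page-number lines, normalizes whitespace.
function cleanExtractedText(raw: string): string {
  return raw
    .replace(/\u0000/g, "") // null bytes sometimes survive PDF extraction
    .split(/\r?\n/)
    // Assumed page-number formats: "12" or "Page 3 of 10" alone on a line.
    .filter((line) => !/^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$/i.test(line))
    .map((line) => line.replace(/[ \t]+/g, " ").trim())
    .join("\n")
    .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
    .trim();
}
```

One caveat with this approach: a line consisting only of a number is treated as a page number, so a dose value that sits alone on a line would be lost. A production rule set would need to guard against that.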
Step 4 — Protocol Splitting
- Splits raw text into individual protocol blocks
- Uses configurable regex patterns per LEMSA (e.g., Protocol #XXX, TP-1.2)
- Extracts: protocol number, title, section
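Header-driven splitting could be sketched like this. The `ProtocolBlock` shape, the `splitIntoProtocols` name, and the example header pattern are all hypothetical; each LEMSA would supply its own regex with the number in group 1 and the title in group 2.

```typescript
interface ProtocolBlock {
  protocolNumber: string;
  title: string;
  body: string;
}

// Hypothetical header pattern for lines such as "Protocol #101 Cardiac Arrest".
// Must not carry the "g" flag, because exec() would then keep state between lines.
const EXAMPLE_HEADER = /^Protocol #(\d+)\s+(.+)$/;

// Walks the text line by line and starts a new block at every header match.
function splitIntoProtocols(text: string, header: RegExp = EXAMPLE_HEADER): ProtocolBlock[] {
  const blocks: ProtocolBlock[] = [];
  let current: ProtocolBlock | undefined;
  for (const line of text.split("\n")) {
    const match = header.exec(line.trim());
    if (match) {
      if (current) blocks.push(current);
      current = { protocolNumber: match[1] ?? "", title: (match[2] ?? "").trim(), body: "" };
    } else if (current) {
      current.body += (current.body ? "\n" : "") + line;
    }
    // Lines before the first header (cover pages, tables of contents) are ignored.
  }
  if (current) blocks.push(current);
  return blocks;
}
```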
Step 5 — Semantic Chunking (server/_core/protocol-chunker.ts)
- Target chunk size: 1,200 characters (min 400, max 1,800)
- 150-character overlap between consecutive chunks
- Splits at semantic boundaries (paragraph breaks, section headers, sentence ends)
- Classifies content type: medication, procedure, assessment, general
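The size and overlap rules above can be illustrated with a greedy chunker. This is a simplified sketch, not the logic in protocol-chunker.ts: the boundary preference order (paragraph break, sentence end, line break) and the function names are assumptions, and content-type classification is left out.

```typescript
interface ChunkOptions {
  target: number;
  min: number;
  max: number;
  overlap: number;
}

const DEFAULT_CHUNK_OPTIONS: ChunkOptions = { target: 1200, min: 400, max: 1800, overlap: 150 };

// Picks a cut position for the chunk starting at `start`, preferring a boundary near the target size.
function findCut(text: string, start: number, opts: ChunkOptions): number {
  const hardEnd = Math.min(start + opts.max, text.length);
  if (hardEnd === text.length) return hardEnd; // the remainder fits in one chunk
  const earliest = start + opts.min;
  const ideal = start + opts.target;
  for (const boundary of ["\n\n", ". ", "\n"]) {
    const before = text.lastIndexOf(boundary, ideal);
    if (before >= earliest) return before + boundary.length;
    const after = text.indexOf(boundary, ideal);
    if (after !== -1 && after + boundary.length <= hardEnd) return after + boundary.length;
  }
  return hardEnd; // no boundary found: hard cut at the maximum size
}

function chunkText(text: string, opts: ChunkOptions = DEFAULT_CHUNK_OPTIONS): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = findCut(text, start, opts);
    chunks.push(text.slice(start, end).trim());
    if (end >= text.length) break;
    // Step back by the overlap so consecutive chunks share context; always move forward.
    start = Math.max(end - opts.overlap, start + 1);
  }
  return chunks;
}
```

In this sketch the final chunk may be shorter than the minimum, because the tail of a protocol is emitted as-is rather than merged into the previous chunk.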
Step 6 — Embedding Generation
- Model: Google gemini-embedding-2-preview (1536 dimensions; Voyage removed 2026-03-24)
- Batch size: 128 chunks per API call
- Rate-limited: 200ms between batches
- Generates context-enriched embedding text (prepends protocol title + section)
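The batching and context-enrichment behavior might be sketched as follows. The `embedBatch` callback is a hypothetical stand-in for the real embedding API call, and the exact enrichment format (title, separator, section, newline, content) is an assumption; the document says only that title and section are prepended.

```typescript
interface ChunkForEmbedding {
  protocolTitle: string;
  section: string;
  content: string;
}

// Prepends title and section so the vector carries the chunk's origin (format is assumed).
function buildEmbeddingText(chunk: ChunkForEmbedding): string {
  return `${chunk.protocolTitle} | ${chunk.section}\n${chunk.content}`;
}

// Splits items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("batch size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embeds all chunks batch by batch, pausing between calls to respect rate limits.
async function embedAll(
  chunks: ChunkForEmbedding[],
  embedBatch: (texts: string[]) => Promise<number[][]>, // hypothetical API wrapper
  batchSize = 128,
  pauseMs = 200,
): Promise<number[][]> {
  const vectors: number[][] = [];
  const batches = toBatches(chunks.map(buildEmbeddingText), batchSize);
  for (let i = 0; i < batches.length; i++) {
    if (i > 0) await new Promise<void>((resolve) => setTimeout(resolve, pauseMs));
    const result = await embedBatch(batches[i] ?? []);
    for (const vector of result) vectors.push(vector);
  }
  return vectors;
}
```

The same `toBatches` helper would also serve the database insert step, which writes in batches of 50.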
Step 7 — Database Insert
- Deletes old chunks for the agency (clean replace, not append)
- Inserts in batches of 50
- Updates manus_agencies.protocol_count
Each LEMSA has a configuration record defining how to ingest its protocols:
interface LEMSAConfig {
name: string; // "Los Angeles County EMS Agency"
counties: string[]; // ["Los Angeles"]
protocolUrl: string; // Base URL for protocol discovery
protocolType: 'pdf' | 'web' | 'acidremap';
population: number; // For priority ranking
priority: 1 | 2 | 3; // Tier 1 = 60% of CA population
parsingRules: ParsingRules;
}
Priority Tiers (California):
- Tier 1 (5 LEMSAs, 60% of CA population): LA, San Diego, Orange, Riverside, Inland Counties
- Tier 2 (5 LEMSAs, 25%): Santa Clara, Sacramento, Alameda, Contra Costa, SF
- Tier 3 (23 LEMSAs, 15%): Central CA, Northern CA, Sierra-Sacramento Valley, Kern, Ventura, etc.
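One way the tiers could drive ingestion order is sketched below. The `IngestionTarget` type is a trimmed-down stand-in for `LEMSAConfig`, and ordering by population within a tier is an assumption based on the "For priority ranking" comment.

```typescript
// Trimmed-down view of LEMSAConfig holding only the fields used for ordering.
interface IngestionTarget {
  name: string;
  population: number;
  priority: 1 | 2 | 3;
}

// Sorts by tier first, then by population (largest first) within a tier.
// Returns a new array and leaves the input untouched.
function orderForIngestion(configs: IngestionTarget[]): IngestionTarget[] {
  return configs
    .slice()
    .sort((a, b) => a.priority - b.priority || b.population - a.population);
}
```

For instance, two Tier 1 agencies would be processed before any Tier 2 agency, with the more populous of the two going first.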
Search is the core feature. Every query is scoped to the user's jurisdiction.
User types: "epi dose for cardiac arrest"
│
▼
┌──────────────────┐
│ Query Normalizer │ Expands "epi" → "epinephrine"
│ │ Detects intent: medication_dosing
│ │ Flags: isEmergent = true
└────────┬─────────┘
│
▼
┌──────────────────┐
│ County → Agency │ countyId → agency_id via
│ Resolution │ county_agency_mapping
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Vector Search │ Gemini Embedding 2 Preview + pgvector (Voyage removed 2026-03-24)
│ (Scoped) │ WHERE agency_id = X
│ │ AND state_code = 'CA'
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Re-ranking │ Term frequency, synonym matching,
│ │ exact phrase, title relevance,
│ │ context boost (+15 same agency)
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Claude RAG │ Retrieval-only (no training data)
│ │ Cites: [Protocol #XXX]
│ │ Model: Haiku (free) / Sonnet (pro)
└────────┬─────────┘
│
▼
"Epinephrine 1mg IV/IO every 3-5 minutes
per [Protocol 4.2 - Cardiac Arrest]"
The normalizer expands 150+ EMS abbreviations before embedding:
| Category | Examples |
|---|---|
| Cardiac | VF/VFib → ventricular fibrillation, SVT → supraventricular tachycardia, STEMI → ST-elevation myocardial infarction |
| Respiratory | SOB → shortness of breath, CPAP → continuous positive airway pressure, RSI → rapid sequence intubation |
| Neurological | CVA → cerebrovascular accident (stroke), TBI → traumatic brain injury, GCS → Glasgow Coma Scale |
| Trauma | MVC → motor vehicle collision, GSW → gunshot wound, C-spine → cervical spine |
| Medications | Epi → epinephrine, Narcan → naloxone, Versed → midazolam, Zofran → ondansetron |
| Vitals | BP → blood pressure, SpO2 → oxygen saturation, HR → heart rate |
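Abbreviation expansion can be sketched as a whole-token dictionary lookup. The map below holds only a handful of entries from the table, and `expandAbbreviations` is a hypothetical name; the real normalizer covers 150+ terms.

```typescript
// Small illustrative subset of the abbreviation dictionary (keys are lowercase).
const ABBREVIATIONS = new Map<string, string>([
  ["epi", "epinephrine"],
  ["narcan", "naloxone"],
  ["versed", "midazolam"],
  ["zofran", "ondansetron"],
  ["sob", "shortness of breath"],
  ["svt", "supraventricular tachycardia"],
  ["stemi", "ST-elevation myocardial infarction"],
  ["bp", "blood pressure"],
  ["hr", "heart rate"],
  ["spo2", "oxygen saturation"],
  ["mvc", "motor vehicle collision"],
  ["gsw", "gunshot wound"],
  ["c-spine", "cervical spine"],
]);

// Replaces whole tokens only, so "episode" is not rewritten as "epinephrinesode".
function expandAbbreviations(query: string): string {
  return query.replace(
    /[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*/g,
    (token) => ABBREVIATIONS.get(token.toLowerCase()) ?? token,
  );
}
```

A `Map` is used instead of a plain object so that tokens such as "constructor" cannot collide with inherited object properties.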
Queries are classified by intent to tune search behavior:
| Intent | Priority | Search Behavior |
|---|---|---|
| contraindication_check | 100 | Multi-query fusion, high precision threshold (0.38) |
| pediatric_specific | 90 | Enhanced accuracy, weight-based dosing alerts |
| medication_dosing | 50 | Multi-query fusion, medication-in-title boosting (+12) |
| protocol_lookup | 75 | Direct protocol number matching (+50 boost) |
| procedure_steps | 70 | Step-detection scoring |
| assessment_criteria | 60 | Assessment pattern matching |
| emergency | n/a | Always triggers enhanced accuracy mode |
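A rough sketch of priority-based intent selection follows. It assumes the Priority column decides which intent wins when a query matches several; the regex patterns and the `general` fallback are hypothetical and far simpler than a production classifier.

```typescript
type Intent =
  | "contraindication_check"
  | "pediatric_specific"
  | "protocol_lookup"
  | "procedure_steps"
  | "assessment_criteria"
  | "medication_dosing"
  | "general"; // hypothetical fallback, not part of the table

interface IntentRule {
  intent: Intent;
  priority: number; // values taken from the table
  pattern: RegExp; // illustrative patterns only
}

const INTENT_RULES: IntentRule[] = [
  { intent: "contraindication_check", priority: 100, pattern: /\bcontraindicat/i },
  { intent: "pediatric_specific", priority: 90, pattern: /\b(pediatric|peds|child|infant)\b/i },
  { intent: "protocol_lookup", priority: 75, pattern: /\bprotocol\s*#?\s*\d/i },
  { intent: "procedure_steps", priority: 70, pattern: /\b(how to|steps?|procedure)\b/i },
  { intent: "assessment_criteria", priority: 60, pattern: /\b(criteria|assessment|score)\b/i },
  { intent: "medication_dosing", priority: 50, pattern: /\b(dose|dosing|dosage|mg|mcg)\b/i },
];

// Returns the matching intent with the highest priority, or "general" when nothing matches.
function classifyIntent(query: string): Intent {
  let best: IntentRule | undefined;
  for (const rule of INTENT_RULES) {
    if (rule.pattern.test(query) && (!best || rule.priority > best.priority)) {
      best = rule;
    }
  }
  return best ? best.intent : "general";
}
```

Under these assumptions, "pediatric epinephrine dose" matches both the pediatric and dosing rules and resolves to `pediatric_specific` because 90 outranks 50.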
Certain query types get enhanced processing regardless of subscription tier:
Triggers:
- Medication dosing queries (any tier)
- Contraindication checks (any tier)
- Emergent patterns: cardiac arrest, anaphylaxis, status epilepticus, massive hemorrhage, tension pneumothorax, airway obstruction, unresponsive
- Pediatric + medication combinations (weight-based dosing)
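The trigger list above could be expressed as a single predicate. The `QuerySignals` shape and function name are hypothetical; note that subscription tier is deliberately not an input.

```typescript
// Emergent patterns copied from the trigger list.
const EMERGENT_PATTERNS = [
  "cardiac arrest",
  "anaphylaxis",
  "status epilepticus",
  "massive hemorrhage",
  "tension pneumothorax",
  "airway obstruction",
  "unresponsive",
];

// Hypothetical summary of what earlier pipeline stages learned about the query.
interface QuerySignals {
  intent: string;
  isPediatric: boolean;
  mentionsMedication: boolean;
}

// True when any trigger applies; the user's tier plays no part in the decision.
function needsEnhancedProcessing(query: string, signals: QuerySignals): boolean {
  if (signals.intent === "medication_dosing" || signals.intent === "contraindication_check") {
    return true;
  }
  if (signals.isPediatric && signals.mentionsMedication) return true;
  const lower = query.toLowerCase();
  return EMERGENT_PATTERNS.some((pattern) => lower.includes(pattern));
}
```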
Enhanced processing includes:
- Multi-query fusion (3 query variations searched in parallel, merged via Reciprocal Rank Fusion)
- Higher similarity thresholds (0.38 vs 0.30 default)
- Advanced re-ranking with intent-specific signal weighting
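Reciprocal Rank Fusion, named in the list above, scores each result as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below uses k = 60, a commonly used constant that is assumed here rather than taken from the codebase; result ids are plain strings for simplicity.

```typescript
// Merges several ranked lists of ids into one list ordered by fused score.
// rank is 1-based, so the top hit of each list contributes 1 / (k + 1).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}
```

Because only ranks are used, the method needs no score normalization across the query variations, and chunks that appear in several lists rise above chunks that rank highly in just one.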
Claude model selection is tier-gated and complexity-aware:
| User Tier | Simple Query | Complex Query |
|---|---|---|
| Free | Haiku 4.5 | Haiku 4.5 |
| Pro | Haiku 4.5 | Sonnet 4.6 |
Complexity triggers (routes Pro users to Sonnet):
- Differential diagnosis ("compare," "versus," "differential")
- Multi-condition queries ("multiple," "interaction")
- Explanation queries ("why," "explain," "mechanism")
- Pediatric and obstetric edge cases ("neonatal," "pregnancy")
- Atypical presentations ("unusual," "atypical," "complicated")
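The routing table and complexity triggers can be sketched as below. The keyword list comes from the bullets above; substring matching and the function names are assumptions, and the return values are the display names from the table rather than API model identifiers.

```typescript
type UserTier = "free" | "pro";

// Keywords copied from the complexity trigger list.
const COMPLEXITY_KEYWORDS = [
  "compare", "versus", "differential",
  "multiple", "interaction",
  "why", "explain", "mechanism",
  "neonatal", "pregnancy",
  "unusual", "atypical", "complicated",
];

// Substring match (assumed), so plurals such as "interactions" also count.
function isComplexQuery(query: string): boolean {
  const lower = query.toLowerCase();
  return COMPLEXITY_KEYWORDS.some((keyword) => lower.includes(keyword));
}

// Only Pro users with a complex query are routed to the larger model.
function selectModel(tier: UserTier, query: string): "Haiku 4.5" | "Sonnet 4.6" {
  return tier === "pro" && isComplexQuery(query) ? "Sonnet 4.6" : "Haiku 4.5";
}
```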
The system prompt enforces retrieval-only behavior:
- Retrieval-only: Never generate clinical content from training data
- Mandatory citations: Every clinical statement cites [Protocol #XXX]
- Concise: 3-10 sentences (paramedics need fast, actionable answers)
- No assumptions: If protocols don't cover the query: "Contact medical control"
- Pediatric alerts: Always flag weight-based dosing considerations
| Metric | Value |
|---|---|
| Protocol chunks | 58,000+ |
| Agencies/LEMSAs | 2,738 |
| States + territories | 53 |
| California LEMSAs | 33 of 33 (100%) |
| Counties | 2,713 |
California is the launch market with full LEMSA coverage:
- 33 LEMSAs covering all 58 counties
- Each LEMSA has custom parsing rules for its PDF format
- Priority-tiered ingestion: Tier 1 (60% population) → Tier 2 (25%) → Tier 3 (15%)
The architecture supports national expansion through:
- state_code field on all protocol data (already populated for 53 states/territories)
- manus_agencies table already contains 2,738 agencies nationwide
- Per-state ingestion scripts follow the same pipeline pattern as California
- uses_state_protocols flag on counties handles states with centralized (not regional) protocols
Common abbreviations encountered in the codebase and protocol data:
| Abbr | Meaning |
|---|---|
| GCS | Glasgow Coma Scale (3-15) |
| AVPU | Alert, Verbal, Pain, Unresponsive |
| PERRLA | Pupils Equal Round Reactive to Light and Accommodation |
| AMS | Altered Mental Status |
| LOC | Loss of Consciousness |
| Abbr | Meaning |
|---|---|
| VF / VFib | Ventricular Fibrillation |
| VT / VTach | Ventricular Tachycardia |
| SVT | Supraventricular Tachycardia |
| PEA | Pulseless Electrical Activity |
| ROSC | Return of Spontaneous Circulation |
| STEMI | ST-Elevation Myocardial Infarction |
| ACS | Acute Coronary Syndrome |
| Abbr | Meaning |
|---|---|
| RSI | Rapid Sequence Intubation |
| ETT | Endotracheal Tube |
| BVM | Bag-Valve Mask |
| CPAP | Continuous Positive Airway Pressure |
| SpO2 | Peripheral Oxygen Saturation |
| Abbr | Meaning |
|---|---|
| MVC | Motor Vehicle Collision |
| GSW | Gunshot Wound |
| TBI | Traumatic Brain Injury |
| C-spine | Cervical Spine |
| ICP | Intracranial Pressure |
| Level | Scope |
|---|---|
| EMR | Emergency Medical Responder (basic first aid) |
| EMT | Emergency Medical Technician (BLS) |
| AEMT | Advanced EMT (limited ALS) |
| Paramedic | Full ALS (medications, intubation, cardiac) |
The integration_partner field on manus_agencies tracks ePCR (electronic patient care report) vendors:
| Partner | Purpose |
|---|---|
| ImageTrend | ePCR and data analytics platform |
| ESOS | EMS operational software |
| Zoll | Cardiac monitor/defibrillator data integration |
| EMSCloud | Cloud-based EMS data management |
These integrations enable future features: protocol-aware ePCR autofill, post-call protocol review, QA/QI analytics.