EMS Domain Architecture

Last Updated: 2026-02-07
Purpose: Single source of truth for how Protocol Guide models the U.S. EMS protocol ecosystem


What Protocol Guide Is

Protocol Guide is a national EMS protocol library with jurisdiction-aware retrieval-augmented generation (RAG). It ingests, chunks, embeds, and serves the actual clinical protocols that govern prehospital care across the United States — scoped to the specific jurisdiction where a provider operates.

Every architectural decision flows from one constraint: EMS protocols are not national. They are set by local medical authorities (LEMSAs, state offices, regional agencies), and a paramedic in Los Angeles follows different standing orders than one in San Diego. The system must resolve a user's jurisdiction, scope search to that jurisdiction's protocols, and generate answers citing only those protocols.


Jurisdiction Hierarchy

U.S. prehospital protocol authority flows through a strict hierarchy. Protocol Guide models this as:

Nation (United States)
  └─ State (e.g., California, Texas)
       └─ LEMSA / Agency (e.g., "Los Angeles County EMS Agency")
            └─ County/Counties (one LEMSA may cover multiple counties)
                 └─ Protocols (stored as vector-embedded chunks)

Key Concepts

| Term | Definition |
| --- | --- |
| LEMSA | Local Emergency Medical Services Agency. The regional authority that writes, approves, and publishes clinical protocols for prehospital providers. California has 33 LEMSAs. Other states use different structures (state EMS offices, regional medical directors). |
| Agency | The generalized term in the database for any protocol-issuing authority. Maps 1:1 with a LEMSA in California; will map to state offices or regional bodies in other states. |
| County | The geographic unit users select. A single LEMSA often covers multiple counties (e.g., Central California EMS Agency covers Fresno, Kings, Madera, Tulare). |
| Protocol | A clinical treatment guideline (cardiac arrest management, medication dosing, trauma assessment) authored by the LEMSA/agency. |
| Protocol Chunk | A semantically bounded segment of a protocol, optimized for vector search retrieval. Typically 400-1800 characters. |
| Standing Order | A subset of protocols that authorize specific treatments without real-time physician contact. Scope-of-practice dependent. |

Multi-County Agencies

Many LEMSAs cover more than one county. The county_agency_mapping table bridges this:

county_agency_mapping
┌───────────────┬────────────┬──────────────────────────┬────────────┐
│ county_id     │ agency_id  │ agency_name              │ state_code │
├───────────────┼────────────┼──────────────────────────┼────────────┤
│ 1234 (Fresno) │ 42         │ Central California EMS   │ CA         │
│ 1235 (Kings)  │ 42         │ Central California EMS   │ CA         │
│ 1236 (Madera) │ 42         │ Central California EMS   │ CA         │
│ 1237 (Tulare) │ 42         │ Central California EMS   │ CA         │
└───────────────┴────────────┴──────────────────────────┴────────────┘

When a user selects "Fresno County," the system resolves to agency_id=42 and searches only Central California EMS protocols.
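
The county-to-agency resolution described above can be sketched with an in-memory stand-in for county_agency_mapping. This is a minimal illustration, not the project's actual query code: the `CountyAgencyRow` type, the `resolveAgency` helper, and the camelCase field names are assumptions made for the example, and the rows mirror the sample table.

```typescript
// Hypothetical in-memory stand-in for rows of county_agency_mapping.
interface CountyAgencyRow {
  countyId: number;
  agencyId: number;
  agencyName: string;
  stateCode: string;
}

const mappingRows: CountyAgencyRow[] = [
  { countyId: 1234, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
  { countyId: 1235, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
  { countyId: 1236, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
  { countyId: 1237, agencyId: 42, agencyName: "Central California EMS", stateCode: "CA" },
];

// Index rows by county so each lookup is a single Map access.
const rowsByCounty = new Map<number, CountyAgencyRow>(
  mappingRows.map((row): [number, CountyAgencyRow] => [row.countyId, row]),
);

// Returns the agency serving a county, or undefined when the county is unmapped.
function resolveAgency(countyId: number): CountyAgencyRow | undefined {
  return rowsByCounty.get(countyId);
}
```

Returning `undefined` for an unmapped county forces the caller to decide what happens next rather than silently searching an unscoped corpus.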


Data Model

Core Tables

┌──────────────────────┐       ┌──────────────────────────┐
│   manus_agencies     │       │  manus_protocol_chunks   │
├──────────────────────┤       ├──────────────────────────┤
│ id (PK)              │◄──────│ agency_id (FK)           │
│ name                 │       │ agency_name              │
│ state_code           │       │ state_code               │
│ state_name           │       │ state_name               │
│ protocol_count       │       │ protocol_number          │
│ integration_partner  │       │ protocol_title           │
└──────────────────────┘       │ section                  │
        ▲                      │ content                  │
        │                      │ embedding (vector 1536)  │
┌───────┴──────────────┐       │ protocol_year            │
│ county_agency_mapping│       │ protocol_effective_date  │
├──────────────────────┤       │ source_pdf_url           │
│ county_id (FK)       │       │ content_type             │
│ agency_id (FK)       │       │ embedding_version        │
│ agency_name          │       └──────────────────────────┘
│ state_code           │
└──────────────────────┘
        │
        ▼
┌──────────────────────┐
│     counties         │
├──────────────────────┤
│ id (PK)              │
│ name                 │
│ state                │
│ uses_state_protocols │
│ protocol_version     │
└──────────────────────┘

Table Purposes

| Table | EMS Role |
| --- | --- |
| manus_agencies | Registry of all protocol-issuing authorities. 2,738 agencies across 53 states/territories. |
| manus_protocol_chunks | 58,000+ vector-embedded protocol segments. Each chunk belongs to one agency and one state. This is the search corpus. |
| county_agency_mapping | Bridges user county selection to the correct agency. Enables "select your county → get your LEMSA's protocols." |
| counties | 2,713 U.S. counties. The uses_state_protocols flag marks counties that follow statewide (not regional) protocols. |
| manus_users | User accounts with selectedAgencyId for persistent jurisdiction preference. |

Protocol Ingestion Pipeline

Protocols flow from LEMSA websites into the search corpus through a seven-step pipeline:

LEMSA Website           Local Cache              Supabase
─────────────           ───────────              ────────
                                                 
1. Discover PDFs  ──►  2. Download PDFs  ──►  3. Extract Text
   (crawl LEMSA            (cache in                │
    website)               .cache/pdfs/)            ▼
                                              4. Split into
                                                 Protocol Blocks
                                                     │
                                                     ▼
                                              5. Chunk Protocols
                                                 (400-1800 chars)
                                                     │
                                                     ▼
                                              6. Generate Embeddings
                                                 (Gemini Embedding 2 Preview, 1536 dim; Voyage removed 2026-03-24)
                                                     │
                                                     ▼
                                              7. Insert into
                                                 manus_protocol_chunks
                                                 + update agency count

Pipeline Steps in Detail

Step 1 — PDF Discovery (scripts/lib/pdf-url-discoverer.ts)

  • Crawls the LEMSA's protocol webpage
  • Finds PDF links via Cheerio HTML parsing
  • Filters out non-protocol PDFs (forms, agendas, job postings)
  • Supports three discovery strategies: pdf (direct links), web (sub-page crawling), acidremap (portal extraction)

Step 2 — PDF Download (scripts/lib/pdf-downloader.ts)

  • Downloads with retry (3 attempts, exponential backoff)
  • Validates PDF magic bytes (%PDF header)
  • Caches locally at .cache/pdfs/{lemsa-slug}/{hash}_{filename}.pdf
  • Rate-limited: 500ms between downloads, max 2 concurrent
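
The validation and retry behavior in Step 2 can be sketched as follows. This is an illustrative sketch rather than the contents of pdf-downloader.ts: the helper names, the 1000 ms base delay, and the injectable `sleep` parameter are assumptions, and the 500ms pacing and concurrency cap are left out.

```typescript
// True when the buffer starts with the ASCII bytes "%PDF".
function hasPdfMagic(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46]; // "%", "P", "D", "F"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ...
// The 1000 ms base is an assumed value, not taken from the pipeline.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Runs `task` up to maxAttempts times, sleeping between failed attempts.
// `sleep` is injectable so the helper can be exercised without real delays.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; the error is rethrown below.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```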

Step 3 — Text Extraction (scripts/lib/protocol-extractor.ts)

  • Extracts text via pdf-parse library
  • Per-LEMSA parsing rules handle different PDF formats
  • Cleans: removes page numbers, null bytes, normalizes whitespace
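
The cleaning pass in Step 3 might look roughly like the sketch below. The function name and the page-number heuristic are assumptions for illustration; the real per-LEMSA rules in protocol-extractor.ts are not shown in this document and are likely more specific.

```typescript
// Removes null bytes, drops lines that look like page numbers, and
// normalizes whitespace. The page-number test is a heuristic: it matches
// lines such as "12" or "Page 3 of 10", so a bare number that is real
// content would also be dropped.
function cleanExtractedText(raw: string): string {
  const pageNumberLine = /^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$/i;
  return raw
    .replace(/\u0000/g, "") // strip null bytes left by PDF extraction
    .split(/\r?\n/)
    .filter((line) => !pageNumberLine.test(line))
    .map((line) => line.replace(/[ \t]+/g, " ").trim()) // collapse runs of spaces/tabs
    .join("\n")
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line between paragraphs
    .trim();
}
```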

Step 4 — Protocol Splitting

  • Splits raw text into individual protocol blocks
  • Uses configurable regex patterns per LEMSA (e.g., Protocol #XXX, TP-1.2)
  • Extracts: protocol number, title, section
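
A regex-driven split along these lines can be sketched as follows. The `ProtocolBlock` shape, the `splitIntoBlocks` helper, and the sample header pattern ("Protocol #101 - Title" on its own line) are hypothetical; actual patterns are configured per LEMSA.

```typescript
interface ProtocolBlock {
  protocolNumber: string;
  title: string;
  body: string;
}

// Assumed header shape for this example: "Protocol #101 - Cardiac Arrest".
const SAMPLE_HEADER = /^Protocol #(\S+)\s*[-:]\s*(.+)$/;

// Splits text into blocks, one per header match. The pattern must capture
// the protocol number in group 1 and the title in group 2, and must not
// match the empty string.
function splitIntoBlocks(text: string, headerPattern: RegExp = SAMPLE_HEADER): ProtocolBlock[] {
  const re = new RegExp(headerPattern.source, "gm"); // fresh copy, no shared lastIndex
  const headers: { num: string; title: string; start: number; end: number }[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    headers.push({ num: m[1], title: m[2].trim(), start: m.index, end: m.index + m[0].length });
  }
  // Each block's body runs from the end of its header to the next header.
  return headers.map((h, i) => ({
    protocolNumber: h.num,
    title: h.title,
    body: text.slice(h.end, i + 1 < headers.length ? headers[i + 1].start : text.length).trim(),
  }));
}
```

Text that appears before the first header is discarded in this sketch; whether the real splitter keeps it is not stated here.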

Step 5 — Semantic Chunking (server/_core/protocol-chunker.ts)

  • Target chunk size: 1,200 characters (min 400, max 1,800)
  • 150-character overlap between consecutive chunks
  • Splits at semantic boundaries (paragraph breaks, section headers, sentence ends)
  • Classifies content type: medication, procedure, assessment, general
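
The size and overlap rules above can be illustrated with a simplified greedy chunker. This is a sketch under stated assumptions, not the logic in protocol-chunker.ts: it considers only paragraph breaks and sentence ends (not section headers), skips content-type classification, and lets the final chunk fall below the minimum size.

```typescript
interface ChunkOptions { target: number; min: number; max: number; overlap: number; }
const CHUNK_DEFAULTS: ChunkOptions = { target: 1200, min: 400, max: 1800, overlap: 150 };

// Returns the boundary offset in (lo, hi] closest to `target`, preferring
// paragraph breaks over sentence ends. Falls back to a hard cut at `target`.
function pickBoundary(text: string, lo: number, hi: number, target: number): number {
  const patterns = [/\n\s*\n/g, /[.!?]\s+/g];
  for (const pattern of patterns) {
    let best = -1;
    pattern.lastIndex = lo; // only consider boundaries that start at or after lo
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(text)) !== null) {
      const end = m.index + m[0].length;
      if (end > hi) break;
      if (best === -1 || Math.abs(end - target) < Math.abs(best - target)) best = end;
    }
    if (best !== -1) return best;
  }
  return target;
}

function chunkText(text: string, opts: ChunkOptions = CHUNK_DEFAULTS): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    if (text.length - start <= opts.max) {
      chunks.push(text.slice(start)); // remainder fits in one chunk
      break;
    }
    const end = pickBoundary(text, start + opts.min, start + opts.max, start + opts.target);
    chunks.push(text.slice(start, end));
    start = end - opts.overlap; // step back so consecutive chunks share context
  }
  return chunks;
}
```

Because every cut lands at least `min` characters past the chunk start and `min` exceeds `overlap`, the start offset always advances and the loop terminates.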

Step 6 — Embedding Generation

  • Model: Google gemini-embedding-2-preview (1536 dimensions; Voyage removed 2026-03-24)
  • Batch size: 128 chunks per API call
  • Rate-limited: 200ms between batches
  • Generates context-enriched embedding text (prepends protocol title + section)
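
The batching and context enrichment in Step 6 can be sketched as below. The `embedBatch` callback is a hypothetical wrapper around whatever embedding API is in use, and the separator used when prepending the title and section is an assumption; only the batch size (128) and the 200ms pause come from the list above.

```typescript
interface ChunkForEmbedding {
  protocolTitle: string;
  section: string;
  content: string;
}

// Prepends protocol title and section so the vector carries document context.
// The " | " and newline separators are illustrative choices.
function buildEmbeddingText(chunk: ChunkForEmbedding): string {
  return `${chunk.protocolTitle} | ${chunk.section}\n${chunk.content}`;
}

// Splits items into consecutive batches of at most batchSize.
function toBatches<T>(items: T[], batchSize = 128): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embeds all chunks batch by batch, pausing between API calls.
async function embedAll(
  chunks: ChunkForEmbedding[],
  embedBatch: (texts: string[]) => Promise<number[][]>, // hypothetical API wrapper
  pauseMs = 200,
): Promise<number[][]> {
  const vectors: number[][] = [];
  const batches = toBatches(chunks.map(buildEmbeddingText));
  for (let i = 0; i < batches.length; i++) {
    if (i > 0) await new Promise<void>((resolve) => setTimeout(resolve, pauseMs));
    vectors.push(...(await embedBatch(batches[i])));
  }
  return vectors;
}
```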

Step 7 — Database Insert

  • Deletes old chunks for the agency (clean replace, not append)
  • Inserts in batches of 50
  • Updates manus_agencies.protocol_count

LEMSA Configuration

Each LEMSA has a configuration record defining how to ingest its protocols:

interface LEMSAConfig {
  name: string;              // "Los Angeles County EMS Agency"
  counties: string[];        // ["Los Angeles"]
  protocolUrl: string;       // Base URL for protocol discovery
  protocolType: 'pdf' | 'web' | 'acidremap';
  population: number;        // For priority ranking
  priority: 1 | 2 | 3;      // Tier 1 = 60% of CA population
  parsingRules: ParsingRules;
}

Priority Tiers (California):

  • Tier 1 (5 LEMSAs, 60% of CA population): LA, San Diego, Orange, Riverside, Inland Counties
  • Tier 2 (5 LEMSAs, 25%): Santa Clara, Sacramento, Alameda, Contra Costa, SF
  • Tier 3 (23 LEMSAs, 15%): Central CA, Northern CA, Sierra-Sacramento Valley, Kern, Ventura, etc.
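
One way the priority and population fields could drive ingestion order is sketched below. The `ParsingRules` shape, the `ingestionOrder` helper, and every value in the sample configs (names, URLs, populations) are placeholders invented for the example, not real agency data.

```typescript
// Hypothetical minimal shape; the real ParsingRules type is not shown in this document.
interface ParsingRules {
  protocolHeaderPattern: string;
}

interface LEMSAConfig {
  name: string;
  counties: string[];
  protocolUrl: string;
  protocolType: "pdf" | "web" | "acidremap";
  population: number;
  priority: 1 | 2 | 3;
  parsingRules: ParsingRules;
}

// Orders configs by tier first, then by population (largest first) within a tier.
function ingestionOrder(configs: LEMSAConfig[]): LEMSAConfig[] {
  return configs
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.priority - b.priority || b.population - a.population);
}

// Placeholder configs for illustration only.
const rules: ParsingRules = { protocolHeaderPattern: "^Protocol #(\\S+)" };
const sampleConfigs: LEMSAConfig[] = [
  { name: "Example LEMSA C", counties: ["Example C"], protocolUrl: "https://example.org/c", protocolType: "web", population: 100, priority: 3, parsingRules: rules },
  { name: "Example LEMSA A", counties: ["Example A"], protocolUrl: "https://example.org/a", protocolType: "pdf", population: 500, priority: 1, parsingRules: rules },
  { name: "Example LEMSA B", counties: ["Example B"], protocolUrl: "https://example.org/b", protocolType: "pdf", population: 900, priority: 1, parsingRules: rules },
];
```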

Jurisdiction-Aware Search Pipeline

Search is the core feature. Every query is scoped to the user's jurisdiction.

End-to-End Flow

User types: "epi dose for cardiac arrest"
                    │
                    ▼
         ┌──────────────────┐
         │ Query Normalizer │  Expands "epi" → "epinephrine"
         │                  │  Detects intent: medication_dosing
         │                  │  Flags: isEmergent = true
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │ County → Agency  │  countyId → agency_id via
         │ Resolution       │  county_agency_mapping
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │ Vector Search    │  Gemini Embedding 2 Preview + pgvector (Voyage removed 2026-03-24)
         │ (Scoped)         │  WHERE agency_id = X
         │                  │  AND state_code = 'CA'
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │ Re-ranking       │  Term frequency, synonym matching,
         │                  │  exact phrase, title relevance,
         │                  │  context boost (+15 same agency)
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │ Claude RAG       │  Retrieval-only (no training data)
         │                  │  Cites: [Protocol #XXX]
         │                  │  Model: Haiku (free) / Sonnet (pro)
         └────────┬─────────┘
                  │
                  ▼
         "Epinephrine 1mg IV/IO every 3-5 minutes
          per [Protocol 4.2 - Cardiac Arrest]"
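
The scoped vector search stage can be illustrated in memory. In production this filtering happens in pgvector with a WHERE clause; the sketch below only mirrors the idea. Cosine similarity is assumed as the metric (the document does not name one), and the `StoredChunk` type, `scopedSearch` helper, and `topK` parameter are hypothetical.

```typescript
interface StoredChunk {
  agencyId: number;
  stateCode: string;
  protocolTitle: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Filters to the user's jurisdiction first, then ranks by similarity.
// Chunks from other agencies never reach the ranking step.
function scopedSearch(
  queryEmbedding: number[],
  corpus: StoredChunk[],
  agencyId: number,
  stateCode: string,
  threshold = 0.3,
  topK = 5,
) {
  return corpus
    .filter((c) => c.agencyId === agencyId && c.stateCode === stateCode)
    .map((c) => ({ chunk: c, similarity: cosineSimilarity(queryEmbedding, c.embedding) }))
    .filter((r) => r.similarity >= threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}
```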

Query Normalization (server/_core/ems-query-normalizer.ts)

The normalizer expands 150+ EMS abbreviations before embedding:

| Category | Examples |
| --- | --- |
| Cardiac | VF/VFib → ventricular fibrillation, SVT → supraventricular tachycardia, STEMI → ST-elevation myocardial infarction |
| Respiratory | SOB → shortness of breath, CPAP → continuous positive airway pressure, RSI → rapid sequence intubation |
| Neurological | CVA → cerebrovascular accident (stroke), TBI → traumatic brain injury, GCS → Glasgow Coma Scale |
| Trauma | MVC → motor vehicle collision, GSW → gunshot wound, C-spine → cervical spine |
| Medications | Epi → epinephrine, Narcan → naloxone, Versed → midazolam, Zofran → ondansetron |
| Vitals | BP → blood pressure, SpO2 → oxygen saturation, HR → heart rate |
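
A whole-token expansion pass of this kind can be sketched as follows. The map holds only a handful of entries from the table above, and the tokenization rule is an assumption; the real normalizer covers 150+ abbreviations and may match differently.

```typescript
// A small subset of the abbreviation table, keyed by lowercase token.
const ABBREVIATIONS = new Map<string, string>([
  ["epi", "epinephrine"],
  ["narcan", "naloxone"],
  ["versed", "midazolam"],
  ["sob", "shortness of breath"],
  ["svt", "supraventricular tachycardia"],
  ["stemi", "ST-elevation myocardial infarction"],
  ["gsw", "gunshot wound"],
  ["c-spine", "cervical spine"],
  ["bp", "blood pressure"],
]);

// Replaces whole tokens only, so "epi" expands but "episode" is left alone.
function expandAbbreviations(query: string): string {
  return query.replace(
    /[A-Za-z0-9-]+/g,
    (token) => ABBREVIATIONS.get(token.toLowerCase()) ?? token,
  );
}
```

Matching on whole tokens rather than substrings avoids rewriting ordinary words that happen to contain an abbreviation.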

Intent Classification

Queries are classified by intent to tune search behavior:

| Intent | Priority | Search Behavior |
| --- | --- | --- |
| contraindication_check | 100 | Multi-query fusion, high precision threshold (0.38) |
| pediatric_specific | 90 | Enhanced accuracy, weight-based dosing alerts |
| medication_dosing | 50 | Multi-query fusion, medication-in-title boosting (+12) |
| protocol_lookup | 75 | Direct protocol number matching (+50 boost) |
| procedure_steps | 70 | Step-detection scoring |
| assessment_criteria | 60 | Assessment pattern matching |
| emergency | | Always triggers enhanced accuracy mode |

Safety-Critical Query Handling

Certain query types get enhanced processing regardless of subscription tier:

Triggers:

  • Medication dosing queries (any tier)
  • Contraindication checks (any tier)
  • Emergent patterns: cardiac arrest, anaphylaxis, status epilepticus, massive hemorrhage, tension pneumothorax, airway obstruction, unresponsive
  • Pediatric + medication combinations (weight-based dosing)

Enhanced processing includes:

  • Multi-query fusion (3 query variations searched in parallel, merged via Reciprocal Rank Fusion)
  • Higher similarity thresholds (0.38 vs 0.30 default)
  • Advanced re-ranking with intent-specific signal weighting
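
Reciprocal Rank Fusion, as used in the multi-query step above, scores each result by summing 1 / (k + rank) over every ranked list it appears in. The sketch below is a generic version of the technique; the constant k = 60 is the value commonly used in the literature and is an assumption here, as is the use of plain string IDs.

```typescript
// Merges several ranked lists of chunk IDs into one list.
// Items ranked highly in multiple lists accumulate the largest scores.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Three query variations, each returning its own ranking of chunk IDs.
const fused = reciprocalRankFusion([
  ["a", "b", "c"],
  ["b", "a", "d"],
  ["b", "c", "a"],
]);
```

RRF depends only on rank positions, so the three result lists can be merged without calibrating their raw similarity scores against each other.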

Model Routing

Claude model selection is tier-gated and complexity-aware:

| User Tier | Simple Query | Complex Query |
| --- | --- | --- |
| Free | Haiku 4.5 | Haiku 4.5 |
| Pro | Haiku 4.5 | Sonnet 4.6 |

Complexity triggers (routes Pro users to Sonnet):

  • Differential diagnosis ("compare," "versus," "differential")
  • Multi-condition queries ("multiple," "interaction")
  • Explanation queries ("why," "explain," "mechanism")
  • Pediatric and obstetric edge cases ("neonatal," "pregnancy")
  • Atypical presentations ("unusual," "atypical," "complicated")
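
The routing table and complexity triggers can be combined into a small selector. This is a sketch under assumptions: the keyword list is copied from the bullets above, whole-word matching is a guess at how triggers are detected, and the model strings are display labels rather than API model identifiers.

```typescript
type Tier = "free" | "pro";
type ModelLabel = "Haiku 4.5" | "Sonnet 4.6"; // labels only, not API model IDs

const COMPLEXITY_KEYWORDS = [
  "compare", "versus", "differential",
  "multiple", "interaction",
  "why", "explain", "mechanism",
  "neonatal", "pregnancy",
  "unusual", "atypical", "complicated",
];

// True when any trigger keyword appears as a whole word in the query.
function isComplexQuery(query: string): boolean {
  const words = new Set(query.toLowerCase().split(/[^a-z]+/));
  return COMPLEXITY_KEYWORDS.some((keyword) => words.has(keyword));
}

// Free users always get Haiku; Pro users get Sonnet only for complex queries.
function selectModel(tier: Tier, query: string): ModelLabel {
  return tier === "pro" && isComplexQuery(query) ? "Sonnet 4.6" : "Haiku 4.5";
}
```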

Claude System Prompt

The system prompt enforces retrieval-only behavior:

  1. Retrieval-only — Never generate clinical content from training data
  2. Mandatory citations — Every clinical statement cites [Protocol #XXX]
  3. Concise — 3-10 sentences (paramedics need fast, actionable answers)
  4. No assumptions — If protocols don't cover the query: "Contact medical control"
  5. Pediatric alerts — Always flag weight-based dosing considerations

Current Coverage

| Metric | Value |
| --- | --- |
| Protocol chunks | 58,000+ |
| Agencies/LEMSAs | 2,738 |
| States + territories | 53 |
| California LEMSAs | 33 of 33 (100%) |
| Counties | 2,713 |

California (Primary Market)

California is the launch market with full LEMSA coverage:

  • 33 LEMSAs covering all 58 counties
  • Each LEMSA has custom parsing rules for its PDF format
  • Priority-tiered ingestion: Tier 1 (60% population) → Tier 2 (25%) → Tier 3 (15%)

National Expansion

The architecture supports national expansion through:

  • state_code field on all protocol data (already populated for 53 states/territories)
  • manus_agencies table already contains 2,738 agencies nationwide
  • Per-state ingestion scripts follow the same pipeline pattern as California
  • uses_state_protocols flag on counties handles states with centralized (not regional) protocols

EMS Terminology Quick Reference

Common abbreviations encountered in the codebase and protocol data:

Patient Assessment

| Abbr | Meaning |
| --- | --- |
| GCS | Glasgow Coma Scale (3-15) |
| AVPU | Alert, Verbal, Pain, Unresponsive |
| PERRLA | Pupils Equal, Round, Reactive to Light and Accommodation |
| AMS | Altered Mental Status |
| LOC | Loss of Consciousness |

Cardiac

| Abbr | Meaning |
| --- | --- |
| VF / VFib | Ventricular Fibrillation |
| VT / VTach | Ventricular Tachycardia |
| SVT | Supraventricular Tachycardia |
| PEA | Pulseless Electrical Activity |
| ROSC | Return of Spontaneous Circulation |
| STEMI | ST-Elevation Myocardial Infarction |
| ACS | Acute Coronary Syndrome |

Respiratory

| Abbr | Meaning |
| --- | --- |
| RSI | Rapid Sequence Intubation |
| ETT | Endotracheal Tube |
| BVM | Bag-Valve Mask |
| CPAP | Continuous Positive Airway Pressure |
| SpO2 | Peripheral Oxygen Saturation |

Trauma

| Abbr | Meaning |
| --- | --- |
| MVC | Motor Vehicle Collision |
| GSW | Gunshot Wound |
| TBI | Traumatic Brain Injury |
| C-spine | Cervical Spine |
| ICP | Intracranial Pressure |

Provider Levels

| Level | Scope |
| --- | --- |
| EMR | Emergency Medical Responder (basic first aid) |
| EMT | Emergency Medical Technician (BLS) |
| AEMT | Advanced EMT (limited ALS) |
| Paramedic | Full ALS (medications, intubation, cardiac) |

Integration Partners

The integration_partner field on manus_agencies tracks the ePCR (electronic patient care report) or device-data vendor associated with an agency:

| Partner | Purpose |
| --- | --- |
| ImageTrend | ePCR and data analytics platform |
| ESOS | EMS operational software |
| Zoll | Cardiac monitor/defibrillator data integration |
| EMSCloud | Cloud-based EMS data management |

These integrations enable future features: protocol-aware ePCR autofill, post-call protocol review, QA/QI analytics.