iTREDS Project - Extract maternal health program data from U.S. county websites using LLMs.
The United States faces a significant maternal health crisis. Key statistics:
- Maternal mortality: The U.S. has the highest maternal mortality rate among developed nations, with rates rising over the past two decades
- Racial disparities: Black women are 3-4 times more likely to die from pregnancy-related causes than white women
- Maternity care deserts: Over 2 million women of childbearing age live in counties with no obstetric hospitals or birth centers, and no obstetric providers (March of Dimes, 2022)
- Declining birth rates: Birth rates have become a global concern, making maternal health support increasingly critical
Maternal health programs exist at federal, state, and county levels, but information about them is:
- Scattered across hundreds of county government websites
- Inconsistently structured - each county organizes information differently
- Hard to discover - programs are buried in complex website hierarchies
- Not centralized - no comprehensive database of county-level maternal health programs exists
This fragmentation makes it difficult for:
- Researchers to study program availability and coverage
- Policymakers to identify gaps in maternal health services
- Pregnant women and families to find available resources
This project aims to create a structured, searchable database of county-level maternal health programs by:
- Automated discovery - Using web scraping to find maternal health program pages on county websites
- Information extraction - Using LLMs to extract program details (eligibility, services, contacts)
- Standardized output - Producing structured data that can be analyzed and compared across counties
- Gap analysis - Detecting programs missing from the registry using TF-IDF semantic matching
Current scope: California (all 58 counties)
Target programs: WIC, Black Infant Health, Nurse-Family Partnership, Perinatal Care Networks, Home Visiting Programs, Breastfeeding Support, Teen Pregnancy Programs, and other maternal/child health services
This project's program taxonomy is grounded in established public health frameworks:
Based on the WHO Conceptual Framework (Solar & Irwin, 2010), we recognize that maternal health outcomes are shaped by:
- Healthcare Access - availability of prenatal, delivery, and postpartum care
- Quality of Care - patient voice, equity, culturally appropriate care
- Social Support - community health workers, doulas, home visiting
- Economic Stability - nutrition programs, workplace protections
- Neighborhood Environment - environmental exposures, housing
The Biden-Harris Administration's five-goal framework provides structure for our program categories:
- Healthcare Access & Coverage - comprehensive maternal health services
- Quality of Care & Patient Voice - accountable, equitable care systems
- Data Collection & Research - evidence-based practices
- Perinatal Workforce - doulas, midwives, community health workers
- Social & Economic Supports - WIC, housing, food security
SDOH Framework:
- Braveman, P., Egerter, S., & Williams, D. R. (2011). The social determinants of health: coming of age. Annual Review of Public Health, 32(1), 381-398.
- Braveman, P., & Gottlieb, L. (2014). The social determinants of health: it's time to consider the causes of the causes. Public Health Reports, 129(1_suppl2), 19-31.
- Marmot, M., Allen, J., Bell, R., Bloomer, E., & Goldblatt, P. (2012). WHO European review of social determinants of health and the health divide. The Lancet, 380(9846), 1011-1029.
- Solar, O., & Irwin, A. (2010). A conceptual framework for action on the social determinants of health. WHO Document Production Services.
Maternal Health Policy:
- White House. (2022). Blueprint for Addressing the Maternal Health Crisis. Link
- March of Dimes. (2022). Maternity Care Deserts Report. Link
- California Department of Public Health. Maternal, Child and Adolescent Health Division. Link
Methods:
- Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
This pipeline discovers and extracts maternal health programs (WIC, Black Infant Health, Nurse-Family Partnership, etc.) from all 58 California county government websites. It uses a 3-phase approach — Discovery → Extraction → Structuring — followed by automated Gap Analysis.
# 1. Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# 2. Configure API key
cp .env.example .env
# Edit .env and add your OpenAI API key
# 3. Run the full pipeline
python run_pipeline.pyPHASE 1: DISCOVERY PHASE 2: EXTRACTION PHASE 3: STRUCTURING
┌──────────────────┐ ┌────────────────────┐ ┌─────────────────────┐
│ DuckDuckGo search│ ──▶ │ Fetch Program │ ──▶ │ LLM → Structured │
│ + validated URLs │ │ Page Content │ │ CSV Output (vN) │
│ + fallback crawl │ │ (text, contacts, │ │ + registry match │
│ │ │ PDF links) │ │ + gap candidates │
└──────────────────┘ └────────────────────┘ └─────────────────────┘
│
▼
┌─────────────────────┐
│ GAP ANALYSIS │
│ TF-IDF similarity │
│ vs. 31-program │
│ federal registry │
└─────────────────────┘
3-tier strategy per county:
- Tier 1: Use advisor-validated MCH URLs directly (highest precision)
- Tier 2: DuckDuckGo search for county MCH page on the county's own domain
- Tier 3: Fall back to known health dept URL or county root
Link scoring uses a 2-layer taxonomy:
- Layer 1: Federal Program Registry aliases (31 known programs, +3.0 score) — high precision
- Layer 2: Maternal taxonomy keywords (
src/maternal_taxonomy.py, +2.0) — broader recall
- Fetches each discovered program page
- Extracts: full text content, phone/email contacts, PDF links, registry signals
- Sends page content to GPT-4o-mini with registry-grounded prompt
- Async batch LLM calls (up to 5 concurrent) — ~3x faster than sequential
- Each extracted program is matched to a
program_idfrom the Federal Program Registry - Unmatched programs flagged as gap candidates
- Output saved to a new versioned directory (
data/structured/vN) on every run
- Reads Phase 3 output CSVs and Phase 2 raw JSON files
- TF-IDF cosine similarity compares unmatched extractions against registry
- 3 signal types: novel programs, alias misses, LLM/alias disagreement
- Output:
data/gap_analysis/gap_report.txt
Latest gap report summary (35 counties):
- 577 total extractions, 423 matched to registry (73.3%)
- 40 gap candidates identified, including TAPP, LAMB, AAIMM, PEI
- 21 alias miss signals (e.g., FQHC, MEDICAID_PRENATAL)
Create .env from .env.example:
API_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here
DATA_COLLECTOR_NAME=Your Name├── run_pipeline.py # Main entry point — runs all 3 phases + gap analysis
├── scraper_discovery.py # Phase 1: Search-first discovery (DuckDuckGo + fallback)
├── scraper_extract.py # Phase 2: Page content extraction
├── scraper_structure.py # Phase 3: Async LLM structuring (registry-grounded)
│
├── src/
│ ├── config.py # County URLs, API settings, budget guardrails
│ ├── federal_program_registry.py # 31-program ground-truth registry (CA/IN/TX)
│ ├── maternal_taxonomy.py # Keyword taxonomy (25 types, 14 categories)
│ ├── llm_program_classifier.py # (legacy) LLM-based re-classification
│ └── utils.py # save_to_csv, get_next_structured_version_dir
│
├── eval/
│ ├── gap_detector.py # TF-IDF gap analysis vs. federal registry
│ ├── run_eval.py # Evaluation runner
│ ├── gold_maternal.jsonl # Gold dataset (validated counties)
│ ├── gold_schema.py
│ └── metrics.py
│
├── data/
│ ├── discovery_results.json # Phase 1 output (all 58 counties)
│ ├── raw/{county}/*.json # Phase 2 output (per-page JSON)
│ ├── structured/ # Phase 3 output (auto-versioned)
│ │ ├── v1/ … v4/ # Each run creates a new vN folder
│ │ └── vN/California_{County}_Healthcare_Data.csv
│ └── gap_analysis/
│ └── gap_report.txt # Latest gap detection report
│
└── docs/
├── QUICK_START.md
├── ARCHITECTURE.md
└── INSIGHTS_FATHERHOOD_MATERNAL_HEALTH.md
# Full pipeline — all 58 counties (recommended)
python run_pipeline.py
# Individual phases
python scraper_discovery.py # Phase 1 only
python scraper_discovery.py --county "Alameda" "Fresno" # specific counties
python scraper_extract.py # Phase 2 only
python scraper_structure.py # Phase 3 only
# Gap analysis only (uses existing data/structured and data/raw)
python eval/gap_detector.py
# Evaluation
python eval/run_eval.py| Path | Description |
|---|---|
data/discovery_results.json |
Discovered program links per county |
data/raw/{county}/*.json |
Raw page content (text, contacts, PDF links) |
data/structured/vN/ |
Structured CSVs — new folder created every run |
data/gap_analysis/gap_report.txt |
Gap detection report |
src/federal_program_registry.py defines 31 known maternal health programs used as ground truth for matching and gap detection:
| Tier | Description | Count |
|---|---|---|
| Tier 1 | Universal — every county should list these (WIC, NFP, FQHC, …) | 11 |
| Tier 2 | State-wide — CA-funded, most counties receive funding | 3 |
| Tier 3 | Selective — CA-specific or evidence-based models (BIH, LAMB, PEI, …) | 17 |
Key programs: WIC, Black Infant Health (BIH), Nurse-Family Partnership (NFP), MCAH/Title V, Perinatal Care Network (PCN), Home Visiting, FQHC, Medi-Cal Prenatal, Doula programs.
Per advisor feedback, the pipeline focuses exclusively on maternal health programs:
- Included: WIC, Black Infant Health, Nurse-Family Partnership, MCAH, Perinatal Care, Breastfeeding Support, Teen Pregnancy Programs, Doula Programs, Fatherhood & Partner Engagement Programs
- Excluded: Medi-Cal (general), CalFresh, Behavioral Health, Senior Services, and other non-maternal programs
The taxonomy in src/maternal_taxonomy.py defines 25 program types across 14 categories, aligned with the SDOH framework and White House Blueprint:
| Blueprint Goal | SDOH Domain | Categories |
|---|---|---|
| Goal 1: Access | Healthcare Access | Perinatal Care, Behavioral Health, Reproductive Health |
| Goal 2: Quality | Quality of Care | Health Equity, Quality Improvement |
| Goal 4: Workforce | Social Support | Home Visiting, Birth Support, Community Health, Breastfeeding |
| Goal 5: Social | Nutrition, Social Support | Nutrition, Adolescent Health, Early Childhood, Partner & Family Engagement |