Build a clinician-facing knowledge graph from depression/anxiety research papers.
Pipeline: Docling → GPT-4o extraction → validation → quality assessment → ontology normalization → Neo4j graph → Streamlit UI
Goal: Surface evidence-based treatment options with full provenance (paper citations, exact text spans, confidence scores)
Problem: Clinicians are overwhelmed by research literature. Finding evidence for treatment decisions requires hours of manual search across scattered papers.
Solution: Automatically extract clinical facts (drug-condition relationships, efficacy, side effects) from papers, normalize to medical ontologies (RxNorm/SNOMED), store in a queryable graph database, and present ranked treatment options with citations.
Key Features:
- Provenance-first: Every fact links back to source paper, section, and exact text span
- Multi-method validation: Heuristic rules + semantic similarity + optional LLM judge
- Human-in-the-loop: Manual approval gates between pipeline stages via Streamlit UI
- Ontology grounding: Maps drug/condition names to standard medical codes
- Transparent ranking: Configurable weights for efficacy, safety, acceptability
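The ranking is a plain weighted sum, so the trade-offs stay inspectable. A minimal sketch of the idea, assuming illustrative sub-score field names and weight keys matching `configs/weights.yaml` (the app's actual schema may differ):

```python
import yaml

# Hedged sketch: sub-score fields are illustrative, not the app's schema.
def rank_score(edge: dict, weights: dict) -> float:
    # Weighted sum over normalized (0-1) sub-scores.
    return (weights["efficacy"] * edge["efficacy_score"]
            + weights["safety"] * edge["safety_score"]
            + weights["acceptability"] * edge["acceptability_score"])

weights = yaml.safe_load(open("configs/weights.yaml"))  # e.g. {"efficacy": 0.5, "safety": 0.3, "acceptability": 0.2}
edges = [{"drug": "escitalopram", "efficacy_score": 0.8,
          "safety_score": 0.7, "acceptability_score": 0.9}]
ranked = sorted(edges, key=lambda e: rank_score(e, weights), reverse=True)
```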
```mermaid
flowchart TD
    A[PDF/URL Input] --> B[Docling Parser]
    B --> C[Structured JSON<br/>sections + spans]
    C --> D[GPT-4o Extraction]
    D --> E[Clinical Facts<br/>JSON triples]
    E --> F[Validation Layer<br/>schema + rules]
    F --> G[Quality Assessment<br/>heuristic/NLI/LLM]
    G --> H[Ontology Normalization<br/>RxNorm/SNOMED]
    H --> I[Neo4j Graph<br/>nodes + edges + provenance]
    I --> J[Streamlit UI<br/>search + rank + citations]
    J --> K[Clinician Review<br/>mark edges reviewed]
    K --> I
    style I fill:#4CAF50
    style J fill:#2196F3
```
Pipeline Stages (chained end to end; see the sketch after this list):
- Parse (Docling) → Extract structured text with section headers and sentence offsets
- Extract (LLM) → GPT-4o converts text into JSON triples: (drug, relation, condition/outcome)
- Validate → Pydantic schema + rule filters (required fields, entity types, span checks)
- Quality Check → Multi-method assessment (heuristic/NLI/LLM judge) estimates precision
- Normalize → Map drug/condition strings to RxNorm/SNOMED IDs for entity merging
- Load to Graph → Upsert nodes and edges into Neo4j with full provenance metadata
- UI (Streamlit) → Human-in-the-loop control panel for running and monitoring the pipeline
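For orientation, a hedged sketch of how the stages chain on the command line (`scripts/add_paper.py` wraps roughly this sequence; flags follow the per-stage documentation later in this README):

```python
import subprocess

stem = "sample"  # paper identifier used in output filenames
steps = [
    ["python", "scripts/parse_doc.py",
     "--source", f"data/raw_papers/{stem}.pdf",
     "--out", f"data/interim/{stem}_parsed.json"],
    ["python", "scripts/extract.py",
     "--input", f"data/interim/{stem}_parsed.json",
     "--output", f"data/processed/extracted/{stem}_extracted.json"],
    ["python", "scripts/validate.py",
     "--input", f"data/processed/extracted/{stem}_extracted.json",
     "--output", f"data/processed/validated/{stem}_validated.json"],
    ["python", "scripts/auto_validate_quality.py",
     "--input", f"data/processed/validated/{stem}_validated.json",
     "--output", f"data/eval/{stem}_quality_report.json",
     "--methods", "heuristic"],
    ["python", "scripts/normalize.py",
     "--input", f"data/processed/validated/{stem}_validated.json",
     "--output", f"data/processed/normalized/{stem}_normalized.json",
     "--config", "configs/mappings.yaml"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)  # abort the chain on the first failure
```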
| Component | Technology | Purpose |
|---|---|---|
| PDF Parsing | Docling | Extracts structured text with layouts preserved |
| LLM Extraction | OpenAI GPT-4o | Converts text to structured JSON facts |
| Validation | Pydantic + Custom Rules | Schema enforcement and quality filters |
| Quality Assessment | Heuristic + sentence-transformers + GPT-4o | Multi-method fact verification |
| Ontology Mapping | RxNorm / SNOMED CT / UMLS | Standardizes medical terminology |
| Graph Database | Neo4j 5.x | Stores entities and relationships with provenance |
| User Interface | Streamlit | Interactive pipeline control and visualization |
| Language | Python 3.11+ | Core runtime |
```
brightside-health-ai/
├── README.md                        # This file
├── .env.example                     # Environment template (API keys, DB config)
├── .gitignore
├── requirements.txt                 # Python dependencies
├── Makefile                         # macOS/Linux shortcuts
│
├── configs/                         # Configuration files
│   ├── app.yaml                     # Paths and feature flags
│   ├── mappings.yaml                # Ontology mapping rules (RxNorm/SNOMED)
│   ├── relations.yaml               # Valid relationship types
│   └── weights.yaml                 # Ranking weights (efficacy/safety/acceptability)
│
├── data/                            # Data directories (gitignored)
│   ├── raw_papers/                  # Input PDFs
│   ├── interim/                     # Parsed documents (JSON)
│   ├── processed/
│   │   ├── extracted/               # LLM-extracted facts
│   │   ├── validated/               # Schema-validated facts
│   │   └── normalized/              # Ontology-normalized facts
│   ├── eval/                        # Quality assessment reports
│   └── reports/                     # Batch processing summaries
│
├── scripts/                         # Command-line tools
│   ├── add_paper.py                 # End-to-end pipeline runner
│   ├── parse_doc.py                 # Parse PDF to JSON
│   ├── extract.py                   # Extract facts with LLM
│   ├── validate.py                  # Validate extracted facts
│   ├── auto_validate_quality.py     # Quality assessment
│   ├── normalize.py                 # Ontology normalization
│   ├── load_neo4j.py                # Load facts into Neo4j
│   ├── show_unmatched_normalized.py # Show unmapped entities
│   ├── neo4j_schema.py              # Neo4j schema inspector
│   ├── neo4j_validate.py            # Neo4j data quality checks
│   └── tasks.ps1                    # Windows PowerShell scripts
│
└── src/
    ├── app/
    │   └── streamlit_app.py         # Main UI application
    ├── core/
    │   ├── ingest_docling.py        # PDF parsing logic
    │   ├── extract_llm.py           # LLM extraction pipeline
    │   ├── validate.py              # Validation rules and logic
    │   └── normalize_ontology.py    # Ontology mapping
    ├── analytics/
    │   └── networkx_sidecar.py      # Graph analytics (future implementations)
    └── utils/                       # Shared utilities
```
A snapshot of the knowledge graph for demonstration purposes:
- View: https://graphxr.kineviz.com/share/69223a6664d44cc4b9506b1f/Brightside/693b2fb1729dbcca686fda9f/Knowledge%20Graph
- Note: This is a pre-run example to illustrate structure and relationships.
- Python 3.11+ (tested on 3.11, 3.12)
- Neo4j 5.x (Community or Enterprise)
  - Download: https://neo4j.com/download/
- OpenAI API Key (for GPT-4o extraction)
- Git
```bash
git clone https://github.com/your-org/brightside-health-ai.git
cd brightside-health-ai
```

Option A: Manual Setup (recommended)
```bash
# Create virtual environment
python -m venv .venv

# Activate (macOS/Linux)
source .venv/bin/activate

# Activate (Windows)
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env

# Create data directories
mkdir -p data/raw_papers data/interim data/processed/extracted data/processed/validated data/processed/normalized data/eval data/reports
```

Option B: Using PowerShell (Windows)
```powershell
./scripts/tasks.ps1 -Task setup
```

Option C: Using Make (macOS/Linux) (currently broken; don't use)
```bash
make setup
# Installs dependencies, creates .env file, creates data directories
```

Edit .env file:
```bash
# OpenAI API Configuration
OPENAI_API_KEY=sk-proj-your-api-key-here

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password-here

# Optional: Logging
LOG_LEVEL=INFO
```

Option A: Neo4j Desktop (Recommended for Development)
- Download Neo4j Desktop from https://neo4j.com/download/
- Install and create a new project
- Click "Add" → "Local DBMS"
- Set database name (e.g., `brightside-kg`)
- Set password (remember this!)
- Click "Create"
- Click "Start" on your database
- Note the Bolt URI (usually `bolt://localhost:7687`)
Option B: Neo4j Cloud (AuraDB - Production)
- Create free instance at https://console.neo4j.io/
- Copy connection URI (e.g., `neo4j+s://xxxxx.databases.neo4j.io`)
- Save credentials securely
Verify Connection:
- Browser UI: http://localhost:7474
- Or run: `curl http://localhost:7474`
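You can also check connectivity from Python with the official driver (a minimal sketch; credentials are the ones from your `.env`):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "your-password-here"))
driver.verify_connectivity()  # raises an exception if the DBMS is unreachable
print("Neo4j is reachable")
driver.close()
```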
Update .env file with your Neo4j credentials:
```bash
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password-here
```

The Streamlit UI provides an interactive, stage-by-stage interface with manual approval gates.
Launch:
```bash
# macOS/Linux
make ui

# Windows
./scripts/tasks.ps1 -Task ui

# Or direct command
streamlit run src/app/streamlit_app.py
```

Open browser to: http://localhost:8501
UI Features:
- Paper Input: Upload PDF, enter URL, or select from raw_papers
- Pipeline Settings: Quality thresholds, fuzzy matching scores, assessment methods
- Neo4j Connection: Test database connection before loading
- Stage-by-Stage Execution: Manual approval required between stages
- Preview & Statistics: View outputs, quality scores, and validation issues
- Progress Tracking: Visual pipeline progress indicator
Pipeline Stages in UI:
- Parse Document → Convert PDF to structured JSON
- Extract Facts → GPT-4o extraction (costs API calls)
- Validate Facts → Schema and rule checks
- Quality Assessment → Multi-method verification (select methods)
- Normalize → Map to RxNorm/SNOMED
- Load to Graph → Insert into Neo4j (with pre-flight checks)
After completion:
- View quality reports
- Check unmatched terms
- Open Neo4j Browser to query graph
For batch processing or automation, run scripts manually in sequence.
```bash
# Place a PDF in data/raw_papers/
cp path/to/paper.pdf data/raw_papers/sample.pdf

# Run full pipeline
make add_paper ARGS="--pdf data/raw_papers/sample.pdf"
```

Windows:

```powershell
./scripts/tasks.ps1 -Task add_paper -Args "--pdf data/raw_papers/sample.pdf"
```

What it does:
- Parses PDF → `data/interim/sample_parsed.json`
- Extracts facts → `data/processed/extracted/sample_extracted.json`
- Validates → `data/processed/validated/sample_validated.json`
- Assesses quality → `data/eval/sample_quality_report.json`
- Normalizes → `data/processed/normalized/sample_normalized.json`
Then manually load to Neo4j:
```bash
python scripts/load_neo4j.py \
  --input data/processed/normalized/sample_normalized.json \
  --clear  # Use --clear only for the first paper; it deletes existing data
```

Run each stage independently to understand or debug the pipeline:
Purpose: Extract structured text with sections and sentence spans
```bash
python scripts/parse_doc.py \
  --source data/raw_papers/sample.pdf \
  --out data/interim/sample_parsed.json
```

Output: `data/interim/sample_parsed.json`
What it does:
- Uses Docling to extract text while preserving structure
- Identifies sections (Abstract, Introduction, Methods, Results, Discussion)
- Splits text into sentences with character offsets
- Preserves metadata (source_id, title, year)
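As a rough sketch of the Docling call underneath `src/core/ingest_docling.py` (section and sentence splitting are app-specific and omitted here; API per current Docling releases):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("data/raw_papers/sample.pdf")
# Structure-preserving export; the app derives sections and offsets from this.
print(result.document.export_to_markdown()[:500])
```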
Example output structure:
```json
{
  "full_text": "...",
  "sections": [
    {
      "name": "results",
      "text": "...",
      "start_offset": 1234,
      "end_offset": 2345,
      "sentences": [{ "text": "...", "start": 1290, "end": 1320 }]
    }
  ],
  "metadata": { "source_id": "sample", "title": "...", "year": 2024 }
}
```

Purpose: Use GPT-4o to extract clinical facts as JSON triples
```bash
python scripts/extract.py \
  --input data/interim/sample_parsed.json \
  --output data/processed/extracted/sample_extracted.json
```

Output: `data/processed/extracted/sample_extracted.json`
What it does:
- Sends each section to GPT-4o with a structured prompt
- Extracts facts as `(drug_name, relation, condition_name)` triples
- Captures provenance: `span` (exact text), `section`, `confidence`
- Includes optional fields: `treatment_line`, `sample_size`, `study_design`, `dose`, `duration`, `p_value`
Example fact:
```json
{
  "drug_name": "escitalopram",
  "condition_name": "major depressive disorder",
  "relation": "FIRST_LINE_FOR",
  "outcome": "remission rate",
  "effect_size": "42% vs 28% placebo, NNT=7",
  "confidence": 0.95,
  "study_design": "RCT",
  "sample_size": 485,
  "duration": "8 weeks",
  "dose": "10-20mg daily",
  "treatment_line": "first",
  "p_value": 0.001,
  "span": "In an 8-week randomized trial (n=485)...",
  "source_id": "sample",
  "section": "Results"
}
```

Supported relations:
`TREATS`, `IMPROVES`, `FIRST_LINE_FOR`, `MAINTENANCE_FOR`, `PREVENTS_RELAPSE_IN`, `WELL_TOLERATED_IN`, `EFFECTIVE_IN_SUBGROUP`, `ASSOCIATED_WITH_SE`, `CONTRAINDICATED_FOR`, `AUGMENTS`, `SUPERIOR_TO`, `EQUIVALENT_TO`
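A hedged sketch of the kind of call `src/core/extract_llm.py` makes (the real prompt and output schema live there; `response_format` forces valid JSON):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract clinical facts from the text as a JSON object "
    '{"facts": [...]}. Each fact needs drug_name, condition_name, '
    "relation (e.g., TREATS, FIRST_LINE_FOR), confidence (0-1), and "
    "span (the exact supporting sentence).\n\nText:\n"
)

def extract_facts(section_text: str) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT + section_text}],
        response_format={"type": "json_object"},  # guarantees parseable JSON
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content).get("facts", [])
```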
Purpose: Check facts against schema and filter invalid entries
```bash
python scripts/validate.py \
  --input data/processed/extracted/sample_extracted.json \
  --output data/processed/validated/sample_validated.json \
  --issues data/processed/validated/sample_issues.json \
  --show-details
```

Output:
- `data/processed/validated/sample_validated.json` (clean facts)
- `data/processed/validated/sample_issues.json` (rejected facts with reasons)
What it validates:
- Required fields present: `drug_name`, `condition_name`, `relation`, `span`, `confidence`
- Confidence in valid range: 0.0 to 1.0
- Drug name not generic ("medication", "treatment")
- Condition name not an outcome ("remission", "response")
- Relation type in allowed list
- Drug mentioned in span (strict check)
- Condition mentioned in span (lenient check)
- Side effects specific (not "side effects")
- Span not too short (<15 chars) or too long (>500 chars)
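A simplified sketch of a few of these rules as a Pydantic model (the real schema is `src/schemas/triples.py`; error strings mirror the rejection reasons below):

```python
from pydantic import BaseModel, Field, field_validator

GENERIC_DRUG_NAMES = {"medication", "treatment", "drug"}

class ClinicalFact(BaseModel):
    drug_name: str
    condition_name: str
    relation: str
    span: str = Field(min_length=15, max_length=500)  # span length rule
    confidence: float = Field(ge=0.0, le=1.0)         # confidence range rule

    @field_validator("drug_name")
    @classmethod
    def reject_placeholder_drug(cls, v: str) -> str:
        if v.lower() in GENERIC_DRUG_NAMES:
            raise ValueError("placeholder_drug_name")
        return v

    @field_validator("span")
    @classmethod
    def drug_in_span(cls, v: str, info):
        # Strict check: the drug must appear in its supporting text.
        drug = info.data.get("drug_name", "")
        if drug and drug.lower() not in v.lower():
            raise ValueError("drug_not_in_span_strict")
        return v
```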
Common rejection reasons:
- `outcome_as_condition`: "remission" used as a condition
- `placeholder_drug_name`: "medication" instead of a specific drug
- `drug_not_in_span_strict`: drug name not found in supporting text
- `invalid_side_effects`: generic "side effects" instead of specific events
Purpose: Estimate precision using multiple validation methods
```bash
# Heuristic only (fast, free)
python scripts/auto_validate_quality.py \
  --input data/processed/validated/sample_validated.json \
  --output data/eval/sample_quality_report.json \
  --methods heuristic

# Multiple methods (more accurate)
python scripts/auto_validate_quality.py \
  --input data/processed/validated/sample_validated.json \
  --output data/eval/sample_quality_report.json \
  --methods heuristic nli llm_judge
```

Available methods:
| Method | Speed | Cost | Accuracy | Use Case |
|---|---|---|---|---|
| `heuristic` | Fast | FREE | 70-80% | Default quick check |
| `nli` | Slow | FREE | 80-90% | Semantic similarity (requires sentence-transformers) |
| `llm_judge` | Very slow | $$$ (~$0.001/fact) | 90-95% | High-stakes verification |
| `knowledge_base` | Fast | FREE | 100% on known facts | Cross-reference against curated list |
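For example, the `nli` method can be approximated as embedding similarity between a verbalized fact and its supporting span. A minimal sketch with sentence-transformers (model choice and threshold are assumptions; the actual logic lives in `scripts/auto_validate_quality.py`):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def nli_quality_score(fact: dict) -> dict:
    claim = (f"{fact['drug_name']} "
             f"{fact['relation'].lower().replace('_', ' ')} "
             f"{fact['condition_name']}")
    emb = model.encode([claim, fact["span"]], convert_to_tensor=True)
    sim = float(util.cos_sim(emb[0], emb[1]))  # cosine similarity in [-1, 1]
    return {"quality_score": round(sim * 100), "likely_correct": sim > 0.5}
```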
Output: `data/eval/sample_quality_report.json`
What it contains:
```json
{
  "total_facts": 50,
  "average_quality_score": 78.5,
  "estimated_precision": 0.92,
  "results": [
    {
      "fact": {...},
      "quality_score": 85,
      "likely_correct": true,
      "method_results": {
        "heuristic": {"quality_score": 80, "likely_correct": true},
        "nli": {"quality_score": 90, "likely_correct": true}
      }
    }
  ]
}
```

Purpose: Map drug/condition names to standard medical codes
```bash
python scripts/normalize.py \
  --input data/processed/validated/sample_validated.json \
  --output data/processed/normalized/sample_normalized.json \
  --config configs/mappings.yaml \
  --min-fuzzy-score 0.86
```

Output: `data/processed/normalized/sample_normalized.json`
What it does:
- Maps drug names to RxNorm concept IDs
- Maps condition names to SNOMED CT concept IDs
- Uses exact matching first, then fuzzy matching (RapidFuzz)
- Preserves original raw fact for reference
- Stores match type: `exact`, `fuzzy`, or `unmatched`
- Stores match score (0.0 to 1.0)
Example normalized fact:
```json
{
  "drug": {
    "text": "escitalopram",
    "concept_id": "RXNORM:321988",
    "label": "Escitalopram",
    "match_type": "exact",
    "score": 1.0
  },
  "condition": {
    "text": "major depressive disorder",
    "concept_id": "SNOMED:370143000",
    "label": "Major depressive disorder",
    "match_type": "fuzzy",
    "score": 0.91
  },
  "relation": {"text": "FIRST_LINE_FOR"},
  "raw_fact": {
    "drug_name": "escitalopram",
    "condition_name": "major depressive disorder",
    ...
  }
}
```

Check for unmapped entities:
```bash
python scripts/show_unmatched_normalized.py \
  --input data/processed/normalized/sample_normalized.json
```

This shows:
- Unmatched drugs with occurrence counts
- Unmatched conditions
- Unmatched outcomes
- Unmatched side effects
To improve matching:
- Add missing terms to mappings.yaml
- Run normalization again
- Goal: >95% match rate
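A quick way to compute the match rate (assuming the normalized file is a JSON list of facts shaped like the example above):

```python
import json

facts = json.load(open("data/processed/normalized/sample_normalized.json"))
matched = sum(1 for f in facts
              if f["drug"]["match_type"] != "unmatched"
              and f["condition"]["match_type"] != "unmatched")
print(f"match rate: {matched / len(facts):.1%}")  # aim for > 95%
```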
Purpose: Insert normalized facts as nodes and edges with full provenance
```bash
# First paper - clear existing data
python scripts/load_neo4j.py \
  --input data/processed/normalized/sample_normalized.json \
  --clear

# Additional papers - append without clearing
python scripts/load_neo4j.py \
  --input data/processed/normalized/paper2_normalized.json
```

What it creates:
Nodes:
- `Drug` nodes: `{id, name, normalized_name, match_type, match_score, category}`
- `Condition` nodes: same properties
- `Outcome` nodes (for `IMPROVES` relations)
- `SideEffect` nodes (for `ASSOCIATED_WITH_SE` relations)
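A hedged sketch of the kind of idempotent upsert `load_neo4j.py` performs, fixed to a single relation type for brevity (the real loader sets the relation dynamically and writes the full provenance properties shown below):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# MERGE makes reloading the same paper idempotent.
UPSERT = """
MERGE (d:Drug {name: $drug})
MERGE (c:Condition {name: $condition})
MERGE (d)-[r:TREATS {source_id: $source_id, evidence: $evidence}]->(c)
SET r.confidence = $confidence
"""

def load_fact(fact: dict) -> None:
    raw = fact["raw_fact"]
    with driver.session() as session:
        session.run(UPSERT,
                    drug=fact["drug"]["text"],
                    condition=fact["condition"]["text"],
                    source_id=raw["source_id"],
                    evidence=raw["span"],
                    confidence=raw["confidence"])
```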
Relationships with provenance:
```cypher
(:Drug)-[:FIRST_LINE_FOR {
  // Core provenance
  evidence: "In an 8-week randomized trial...",
  confidence: 0.95,
  source_id: "sample",
  section: "Results",

  // Clinical context
  treatment_line: "first",
  patient_subgroup: null,
  study_design: "RCT",
  sample_size: 485,
  duration: "8 weeks",
  dose: "10-20mg daily",

  // Quantitative evidence
  effect_size: "42% remission vs 28% placebo",
  confidence_interval: "95% CI: 1.2-2.4",
  p_value: 0.001,
  outcome: "remission rate"
}]->(:Condition)
```

Verify in Neo4j Browser:
```cypher
// Count nodes by type
MATCH (n) RETURN labels(n) AS type, count(n) AS count

// Show sample relationships
MATCH (d:Drug)-[r:FIRST_LINE_FOR]->(c:Condition)
RETURN d.name, c.name, r.evidence, r.confidence
LIMIT 5

// Find drug with most connections
MATCH (d:Drug)-[r]->()
RETURN d.name, count(r) AS connections
ORDER BY connections DESC
LIMIT 10
```

```cypher
MATCH (d:Drug)-[r:FIRST_LINE_FOR]->(c:Condition)
WHERE c.normalized_name CONTAINS "depressive disorder"
AND r.confidence > 0.8
RETURN d.name AS drug,
r.effect_size AS efficacy,
r.confidence AS confidence,
r.sample_size AS n,
r.source_id AS paper
ORDER BY r.confidence DESC
LIMIT 10
```

```cypher
MATCH (d1:Drug {name: "sertraline"})-[r1:ASSOCIATED_WITH_SE]->(se:SideEffect)
MATCH (d2:Drug {name: "escitalopram"})-[r2:ASSOCIATED_WITH_SE]->(se)
RETURN se.name AS side_effect,
COUNT(r1) AS sertraline_reports,
COUNT(r2) AS escitalopram_reports
```

```cypher
MATCH (d1:Drug)-[r:AUGMENTS]->(d2:Drug)
WHERE r.patient_subgroup = "treatment-resistant"
RETURN d1.name AS augmenting_drug,
d2.name AS base_treatment,
r.evidence AS evidence,
r.source_id AS paper
```

```cypher
MATCH (d:Drug {name: "fluoxetine"})-[r:TREATS]->(c:Condition)
WHERE c.normalized_name CONTAINS "depression"
RETURN r.source_id AS paper,
r.section AS section,
r.evidence AS exact_text,
r.confidence AS confidence,
r.study_design AS design,
r.sample_size AS n
```

configs/mappings.yaml:
```yaml
drugs:
  RxNorm:
    - "sertraline": "RXNORM:36437"
    - "escitalopram": "RXNORM:321988"
  synonyms:
    - ["zoloft", "sertraline"]
    - ["lexapro", "escitalopram"]
conditions:
  SNOMED:
    - "major depressive disorder": "SNOMED:370143000"
    - "generalized anxiety disorder": "SNOMED:21897009"
```

configs/relations.yaml:
```yaml
valid_relations:
- TREATS
- IMPROVES
- FIRST_LINE_FOR
- PREVENTS_RELAPSE_IN
- WELL_TOLERATED_IN
- EFFECTIVE_IN_SUBGROUP
- ASSOCIATED_WITH_SE
- AUGMENTS
- CONTRAINDICATED_FOR
- SUPERIOR_TO
- EQUIVALENT_TO
```

Problem: missing Python dependencies (import errors)
Solution:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Problem: OpenAI API key not detected
Solution:
- Check .env file exists in project root
- Verify `OPENAI_API_KEY=sk-...` line present
- Restart terminal/IDE to reload environment
Problem: cannot connect to Neo4j
Solution:
- Verify Neo4j is running: `curl http://localhost:7474`
- Check credentials in .env match Neo4j
- Ensure bolt port 7687 is not blocked by firewall
- Test connection in Streamlit UI sidebar
Problem: no facts extracted from a paper
Causes:
- Paper has no extractable clinical facts
- PDF parsing failed (check Docling debug output)
- LLM prompt needs tuning for paper type
Solution:
- Check parsed JSON has populated `sections`
- Review extraction prompts in `extract_llm.py`
- Try with a different paper
Problem: too many low-quality facts
Solution:
- Adjust extraction prompt for stricter output
- Increase `min_fuzzy_score` threshold
- Filter facts by `min_quality_score` (e.g., 70)
- Review and fix mappings.yaml for missing terms
Problem: many unmatched terms after normalization
Solution:
- Run: `python scripts/show_unmatched_normalized.py --input <normalized_file>`
- Add missing terms to mappings.yaml
- Re-run normalization
- Docling (PDF/HTML parsing)
- OpenAI Python SDK (LLM extraction, optional LLM judge)
- Pydantic (schemas for validation)
- sentence-transformers (NLI similarity for quality assessment)
- RapidFuzz (fuzzy ontology matching)
- NetworkX (graph analytics; future implementations)
- pandas (tabular previews in UI)
- python-dotenv (env management)
- requests (URL fetching for ingestion)
- Neo4j (graph database)
- Neo4j Python Driver
- Cypher (query language)
- Streamlit (human-in-the-loop pipeline UI)
- Mermaid (architecture diagrams in README)
- Make (CLI shortcuts; macOS/Linux)
- PowerShell (Windows tasks)
- Docker (containerization; optional)
- Python 3.11+ (core runtime)
- pip / uv (dependency management)
- RxNorm (drug normalization)
- SNOMED CT (condition normalization)
- UMLS (optional cross-references): https://www.nlm.nih.gov/research/umls/
- UI: `src/app/streamlit_app.py`
- Parsing: `src/core/ingest_docling.py`, `scripts/parse_doc.py`
- Extraction: `src/core/extract_llm.py`, `scripts/extract.py`
- Validation: `src/core/validate.py`, `src/schemas/triples.py`, `scripts/validate.py`
- Quality: `scripts/auto_validate_quality.py`
- Normalization: `src/core/normalize_ontology.py`, `scripts/normalize.py`
- Graph Load: `scripts/load_neo4j.py`
- Configs: `configs/mappings.yaml`, `configs/relations.yaml`, `configs/weights.yaml`
We are grateful to everyone who contributed to this project.
- Team
  - Aaron Don
  - Khushi Gauli
  - Guan Ying Goh
- Advisors
  - Andrew Norris
  - Diane Bernardoni
- Coach
  - Harshini Donepudi
MIT License - see LICENSE file for details