RAG-HPO

Updates — Sept. 30, 2025

Based on user feedback, we have uploaded a new file that provides clearer alignment between cases, manual annotations, and results. The file is organized into sheets that contain well-matched information, allowing users to more easily compare outputs and trace findings across sources. This update ensures that results can be understood in context, reduces confusion, and improves the overall usability of the dataset.

Our paper, Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval-Augmented Generation, has been accepted for publication in Genome Medicine!

https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-025-01521-w

If you use RAG-HPO in your work, please cite our publication!

RAG-HPO Updates

The HPO vector database has been updated for improved consistency and coverage.
We've added:
- More phrase-to-HPO ID matches
- An additional index with structured metadata for downstream analysis
- Matched SNOMED CT and UMLS IDs for many HPO terms
- Included alternate HPO IDs from the original ontology source

Benchmarking and Evaluation Data

We’ve released:

Test cases used in the manuscript
Input/output data from the analysis
Source code for calculating:
- True Positives / False Positives / False Negatives
- Precision / Recall / F1 Scores

We welcome any feedback or comments regarding the program or datasets.

Help Us Test the GUI!

We are nearing completion of a web-based GUI for RAG-HPO and are actively seeking beta testers to try out the interface and provide feedback.

If you are interested in participating, please contact:
📧 Jennifer Posey — jep2156@cumc.columbia.edu
📧 Brandon Garcia — brandon.garcia@bcm.edu

RAG-HPO

RAG-HPO is a Python-based tool designed to extract Human Phenotype Ontology (HPO) terms from clinical notes. It leverages large language models (LLMs) and Retrieval Augmented Generation (RAG) to provide standardized phenotypic descriptions critical for genomics and clinical research. RAG-HPO itself is not an LLM, but it utilizes LLMs provided by the user to process and annotate clinical text.

Note: Protecting patient information and ensuring compliance with institutional guidelines and HIPAA is the end user’s responsibility.

📄 View our article on medRxiv

📬 Interested in receiving updates? Join our Mailing List

How RAG-HPO Works:

Input Clinical Notes: You provide clinical notes either manually during runtime or by uploading a CSV file.
Extract Key Phrases via LLM: The tool uses a configured LLM to identify clinically relevant phrases from these notes.
Match Phrases to HPO Terms: It employs vector similarity search (FAISS) and fuzzy matching to map these phrases to appropriate HPO terms.
Output Structured Data: Finally, it produces a CSV with patient IDs, phenotypic descriptions, and matched HPO terms.

What You Need Before You Start:

LLM Configuration: An API key for accessing your chosen LLM (cloud-based or local). The base URL of the LLM API. The LLM model name.
HPO Data and Vectorization: A preprocessed HPO embeddings file (G2GHPO_metadata.npy), generated by vectorizing HPO terms. This file is used to match clinical phrases to HPO terms. Access to the HPO ontology and additional validated phrases (HPO_addons.csv) if you plan to update the vector database.
Python & Dependencies: Python installed. All required packages as listed in requirements.txt.

Note: RAG-HPO has been tested with Groq.com, which offers free/cheap API keys and access to cloud-based LLMS. However, any OpenAI-compatible LLM should work, including locally hosted ones like LM-studio.

WARNING: Do not submit sensitive or identifying information. ALWAYS de-identify your data

Setting Up the Environment:

Install Jupyter Notebook or Microsoft Visual Studio Code.
Clone this repository and navigate to its directory.
Create a virtual python environment(recommended).
Install dependencies with:

pip install -r requirements.txt

Or, using the provided script:

import os
import sys
import subprocess

def check_python_version(min_version=(3, 7)):
    """
    Ensure the Python version meets the minimum requirement.
    """
    if sys.version_info < min_version:
        print(f"Error: Python {'.'.join(map(str, min_version))} or higher is required.")
        print("Please update your Python installation and try again.")
        sys.exit(1)

def install_package(package):
    """
    Install an individual package, handling any installation errors.
    """
    try:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"Successfully installed {package}.")
    except subprocess.CalledProcessError as e:
        print(f"Failed to install {package}: {e}")

def install_requirements(requirements_file="requirements.txt"):
    """
    Install packages from a requirements file.
    """
    if not os.path.exists(requirements_file):
        print(f"Error: {requirements_file} not found.")
        sys.exit(1)

    print(f"Installing packages from {requirements_file}...")
    with open(requirements_file, "r") as f:
        for line in f:
            package = line.strip()
            if package and not package.startswith("#"):
                install_package(package)
    print("Finished processing requirements.")

if __name__ == "__main__":
    # Check Python version first
    check_python_version(min_version=(3, 7))
    # Install packages
    install_requirements()

Using RAG-HP0:

Prepare Your LLM Configuration: Have your API key, base URL, and model name ready.
Run the Jupyter Notebook: Open the .ipynb file in Jupyter or VS Code.
Follow the Prompts: Configure the LLM by providing the requested API and model details. Input your clinical notes (manually or via CSV).
Generate Results: Run the notebook cells sequentially. View the annotated HPO terms in the terminal or save them as a CSV.

Vectorization of the HPO Database:

RAG-HPO relies on a vectorized database of HPO terms. Before running the annotation tool, you may need to:

Obtain HPO Data: Download the HPO database from the HPO website.
Incorporate Additional Phrases: Add validated phrases to HPO_addons.csv to improve precision.
Vectorize the Database: Use the provided notebook to: Process HPO data and generate a .csv for inspection. Vectorize the database, producing the G2GHPO_metadata.npy file. This step takes about 10 minutes and can be repeated as needed when HPO is updated (usually monthly).

Planned Improvements:

Annotator App: An app that combines LLM phenotype extraction with concept recognition tools and RAG to quickly and efficiently extract and assign HPO terms. This app will also allow users to manually edit their results to reduce the time needed for phenotype analysis.
Containerization: A fully containerized version of RAG-HPO is under development, allowing use without manual command line interaction.
Enhanced Error Handling: We are continually improving error messages and recovery steps for a smoother user experience.

Feedback and Contributions:

If you have feedback, suggestions, or want to integrate RAG-HPO into existing pipelines, please contact Jennifer Posey (jennifer.posey@bcm.edu) or Brandon Garcia (brandon.garcia@bcm.edu) If you’d like to contribute additional validated phrases to the HPO ontology, send us a .csv file following the format of HPO_addons.csv.

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
HPO_Vectorization.ipynb		HPO_Vectorization.ipynb
HPO_addons.csv		HPO_addons.csv
LICENSE		LICENSE
RAG-HPO ASHG Poster.pptx		RAG-HPO ASHG Poster.pptx
RAG-HPO Tests and Data Analysis copy.xlsx		RAG-HPO Tests and Data Analysis copy.xlsx
RAG-HPO.ipynb		RAG-HPO.ipynb
README.md		README.md
Test_Cases.csv		Test_Cases.csv
requirements.txt		requirements.txt
setup_environment.py		setup_environment.py
system_prompts.json		system_prompts.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Updates — Sept. 30, 2025

RAG-HPO Updates

Benchmarking and Evaluation Data

Help Us Test the GUI!

RAG-HPO

Note: Protecting patient information and ensuring compliance with institutional guidelines and HIPAA is the end user’s responsibility.

How RAG-HPO Works:

What You Need Before You Start:

Setting Up the Environment:

Using RAG-HP0:

Vectorization of the HPO Database:

Planned Improvements:

Feedback and Contributions:

Here is an example of how RAG-HPO Compares to other programs!

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

Updates — Sept. 30, 2025

RAG-HPO Updates

Benchmarking and Evaluation Data

Help Us Test the GUI!

RAG-HPO

Note: Protecting patient information and ensuring compliance with institutional guidelines and HIPAA is the end user’s responsibility.

How RAG-HPO Works:

What You Need Before You Start:

Setting Up the Environment:

Using RAG-HP0:

Vectorization of the HPO Database:

Planned Improvements:

Feedback and Contributions:

Here is an example of how RAG-HPO Compares to other programs!

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors 1

Languages

Packages