Based on user feedback, we have uploaded a new file that provides clearer alignment between cases, manual annotations, and results. The file is organized into sheets that contain well-matched information, allowing users to more easily compare outputs and trace findings across sources. This update ensures that results can be understood in context, reduces confusion, and improves the overall usability of the dataset.
Our paper, Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval-Augmented Generation, has been accepted for publication in Genome Medicine!
https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-025-01521-w
If you use RAG-HPO in your work, please cite our publication!
- The HPO vector database has been updated for improved consistency and coverage.
- We've added:
- More phrase-to-HPO ID matches
- An additional index with structured metadata for downstream analysis
- Matched SNOMED CT and UMLS IDs for many HPO terms
- Included alternate HPO IDs from the original ontology source
We’ve released:
- Test cases used in the manuscript
- Input/output data from the analysis
- Source code for calculating:
- True Positives / False Positives / False Negatives
- Precision / Recall / F1 Scores
We welcome any feedback or comments regarding the program or datasets.
We are nearing completion of a web-based GUI for RAG-HPO and are actively seeking beta testers to try out the interface and provide feedback.
If you are interested in participating, please contact:
📧 Jennifer Posey — jep2156@cumc.columbia.edu
📧 Brandon Garcia — brandon.garcia@bcm.edu
RAG-HPO is a Python-based tool designed to extract Human Phenotype Ontology (HPO) terms from clinical notes. It leverages large language models (LLMs) and Retrieval Augmented Generation (RAG) to provide standardized phenotypic descriptions critical for genomics and clinical research. RAG-HPO itself is not an LLM, but it utilizes LLMs provided by the user to process and annotate clinical text.
Note: Protecting patient information and ensuring compliance with institutional guidelines and HIPAA is the end user’s responsibility.
📬 Interested in receiving updates? Join our Mailing List
- Input Clinical Notes: You provide clinical notes either manually during runtime or by uploading a CSV file.
- Extract Key Phrases via LLM: The tool uses a configured LLM to identify clinically relevant phrases from these notes.
- Match Phrases to HPO Terms: It employs vector similarity search (FAISS) and fuzzy matching to map these phrases to appropriate HPO terms.
- Output Structured Data: Finally, it produces a CSV with patient IDs, phenotypic descriptions, and matched HPO terms.
- LLM Configuration: An API key for accessing your chosen LLM (cloud-based or local). The base URL of the LLM API. The LLM model name.
- HPO Data and Vectorization: A preprocessed HPO embeddings file (G2GHPO_metadata.npy), generated by vectorizing HPO terms. This file is used to match clinical phrases to HPO terms. Access to the HPO ontology and additional validated phrases (HPO_addons.csv) if you plan to update the vector database.
- Python & Dependencies: Python installed. All required packages as listed in requirements.txt.
Note: RAG-HPO has been tested with Groq.com, which offers free/cheap API keys and access to cloud-based LLMS. However, any OpenAI-compatible LLM should work, including locally hosted ones like LM-studio.
WARNING: Do not submit sensitive or identifying information. ALWAYS de-identify your data
- Install Jupyter Notebook or Microsoft Visual Studio Code.
- Clone this repository and navigate to its directory.
- Create a virtual python environment(recommended).
- Install dependencies with:
pip install -r requirements.txtOr, using the provided script:
import os
import sys
import subprocess
def check_python_version(min_version=(3, 7)):
"""
Ensure the Python version meets the minimum requirement.
"""
if sys.version_info < min_version:
print(f"Error: Python {'.'.join(map(str, min_version))} or higher is required.")
print("Please update your Python installation and try again.")
sys.exit(1)
def install_package(package):
"""
Install an individual package, handling any installation errors.
"""
try:
print(f"Installing {package}...")
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
print(f"Successfully installed {package}.")
except subprocess.CalledProcessError as e:
print(f"Failed to install {package}: {e}")
def install_requirements(requirements_file="requirements.txt"):
"""
Install packages from a requirements file.
"""
if not os.path.exists(requirements_file):
print(f"Error: {requirements_file} not found.")
sys.exit(1)
print(f"Installing packages from {requirements_file}...")
with open(requirements_file, "r") as f:
for line in f:
package = line.strip()
if package and not package.startswith("#"):
install_package(package)
print("Finished processing requirements.")
if __name__ == "__main__":
# Check Python version first
check_python_version(min_version=(3, 7))
# Install packages
install_requirements()- Prepare Your LLM Configuration: Have your API key, base URL, and model name ready.
- Run the Jupyter Notebook: Open the .ipynb file in Jupyter or VS Code.
- Follow the Prompts: Configure the LLM by providing the requested API and model details. Input your clinical notes (manually or via CSV).
- Generate Results: Run the notebook cells sequentially. View the annotated HPO terms in the terminal or save them as a CSV.
RAG-HPO relies on a vectorized database of HPO terms. Before running the annotation tool, you may need to:
- Obtain HPO Data: Download the HPO database from the HPO website.
- Incorporate Additional Phrases: Add validated phrases to HPO_addons.csv to improve precision.
- Vectorize the Database: Use the provided notebook to: Process HPO data and generate a .csv for inspection. Vectorize the database, producing the G2GHPO_metadata.npy file. This step takes about 10 minutes and can be repeated as needed when HPO is updated (usually monthly).
- Annotator App: An app that combines LLM phenotype extraction with concept recognition tools and RAG to quickly and efficiently extract and assign HPO terms. This app will also allow users to manually edit their results to reduce the time needed for phenotype analysis.
- Containerization: A fully containerized version of RAG-HPO is under development, allowing use without manual command line interaction.
- Enhanced Error Handling: We are continually improving error messages and recovery steps for a smoother user experience.
If you have feedback, suggestions, or want to integrate RAG-HPO into existing pipelines, please contact Jennifer Posey (jennifer.posey@bcm.edu) or Brandon Garcia (brandon.garcia@bcm.edu) If you’d like to contribute additional validated phrases to the HPO ontology, send us a .csv file following the format of HPO_addons.csv.
