TDogaNazli/SecureRAG

🩺 SecureMed-RAG

SecureMed-RAG is a privacy-aware clinical question answering pipeline that uses synthetic EHR data and PrimeKG to evaluate the effectiveness of Retrieval-Augmented Generation (RAG) under varying levels of patient data anonymization.

The system generates patient-specific questions, retrieves relevant subgraphs from PrimeKG, and compares the answers of a base LLM with those of a RAG-enhanced model. It is evaluated at three privacy levels: no anonymization, PII removal, and k-anonymity combined with l-diversity.
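
For readers unfamiliar with the two privacy notions named above, the following is a minimal sketch of how k-anonymity and l-diversity can be checked on a table of records. It is not the code in preprocessing/anonymize_ehr.py; the function names, field names, and sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records into equivalence classes keyed by their quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class holds at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len({r[sensitive] for r in group}) >= l for group in groups.values())

# Hypothetical, already-generalized records (field names are illustrative only).
records = [
    {"age_range": "30-39", "zip_prefix": "021", "condition": "hypertension"},
    {"age_range": "30-39", "zip_prefix": "021", "condition": "diabetes"},
    {"age_range": "40-49", "zip_prefix": "021", "condition": "asthma"},
    {"age_range": "40-49", "zip_prefix": "021", "condition": "asthma"},
]
quasi_ids = ["age_range", "zip_prefix"]

print(is_k_anonymous(records, quasi_ids, k=2))
print(is_l_diverse(records, quasi_ids, "condition", l=2))
```

In this toy table every group has two records, so it is 2-anonymous, but the second group contains a single condition value, so it is not 2-diverse. That gap is the reason the two properties are usually enforced together.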


⚙️ Setup Instructions

1. Clone the Repository

git clone https://github.com/TDogaNazli/SecureRAG.git
cd SecureRAG

2. Python Setup

This project uses Python 3.9.22. Check your installed version with:

python3 --version

If needed, install it using pyenv:

pyenv install 3.9.22
pyenv local 3.9.22

Then create and activate a virtual environment:

python3.9 -m venv .myenv
source .myenv/bin/activate

Upgrade pip:

pip install --upgrade pip

Install dependencies:

pip install -r requirements.txt

Install the google-genai package, which requires Python 3.9 or later:

pip install -q -U google-genai

3. Get a Gemini API Key

Obtain a Gemini API key, then create a .env file in the repository root:

echo "GEMINI_API_KEY=your-key-here" > .env
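
To confirm that the key is readable before starting a long run, a small check like the one below can help. This is a sketch using only the standard library; how the repository itself loads the key is not shown in this README, and `load_env_file` and `get_gemini_api_key` are hypothetical helper names.

```python
import os
from pathlib import Path

def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def get_gemini_api_key(path=".env"):
    """Prefer an exported environment variable, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key or key == "your-key-here":
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder")
    return key
```

Calling `get_gemini_api_key()` from the repository root raises an error if the placeholder value was never replaced. The python-dotenv package offers the same parsing through `load_dotenv()`, if it is available in your environment.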

📦 Data Setup

PrimeKG Download

mkdir -p dataset/primekg/raw
wget -O dataset/primekg/raw/edges.csv https://dataverse.harvard.edu/api/access/datafile/6180616
wget -O dataset/primekg/raw/nodes.tsv https://dataverse.harvard.edu/api/access/datafile/6180617

Prepare Patient EHR Data (Synthetic)

mkdir -p dataset/synthea-dataset-100
wget -O dataset/synthea-dataset-100.zip https://github.com/lhs-open/synthetic-data/raw/main/record/synthea-dataset-100.zip
unzip dataset/synthea-dataset-100.zip -d dataset/synthea-dataset-100
python utils/preprocess_synthea_data.py
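
A failed download can leave an empty file or directory behind, so it may be worth checking the data layout before running the pipeline. The sketch below assumes only the paths used in the commands above; `missing_data_paths` is a hypothetical helper, and the preprocessed parquet file is not checked because its exact location depends on the preprocessing script.

```python
from pathlib import Path

# Paths taken from the download commands in this README.
EXPECTED_PATHS = [
    "dataset/primekg/raw/edges.csv",
    "dataset/primekg/raw/nodes.tsv",
    "dataset/synthea-dataset-100",
]

def missing_data_paths(root="."):
    """Return the expected paths that are absent or empty under root."""
    missing = []
    for rel in EXPECTED_PATHS:
        path = Path(root) / rel
        if not path.exists():
            missing.append(rel)
        elif path.is_file() and path.stat().st_size == 0:
            missing.append(rel)  # file exists but has no content
        elif path.is_dir() and not any(path.iterdir()):
            missing.append(rel)  # directory exists but nothing was unzipped into it
    return missing

if __name__ == "__main__":
    problems = missing_data_paths()
    print("All data paths present" if not problems else f"Missing or empty: {problems}")
```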

▶️ Run Evaluation

To run the evaluation pipeline:

python main.py

🔧 Optional Configuration

  • If you hit the Gemini API quota, increase the delay in this line of generation/answer_generator.py:
    time.sleep(10)
  • To change the number of patients processed, edit the slice patient_ids[:50] in this line of evaluation/evaluate.py:
    res = evaluate_dataset(patient_ids[:50], synthea_path, privacy_level=level, G=G)
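
As an alternative to a fixed pause, quota errors can be handled by retrying with a growing delay. The sketch below is a generic pattern and not what answer_generator.py does; `call_with_backoff` is a hypothetical helper, and catching every `Exception` is a simplification that real code should narrow to the rate-limit error raised by the client library.

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay=10.0, sleep=time.sleep):
    """Call fn(); after each failure wait base_delay, then double it, and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter exists so the waiting behavior can be replaced in tests. In normal use it can be left at its default.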

📁 Project Structure

SecureMed-RAG/
├── main.py                       # Entry point to run full evaluation
├── requirements.txt             # Python dependencies
├── .env                         # Your Gemini API key (not committed to Git)
│
├── dataset/
│   ├── primekg/                 # Contains downloaded PrimeKG edges/nodes
│   ├── synthea-dataset-100/     # Raw and preprocessed Synthea EHR data
│   └── synthea-unified.parquet  # Preprocessed parquet version
│
├── evaluation/
│   ├── evaluate.py              # Evaluation logic, scoring, and result generation
│   └── output/                  # Contains evaluation result files
│
├── generation/
│   └── answer_generator.py      # Calls Gemini to answer medical questions
│
├── preprocessing/
│   └── anonymize_ehr.py         # Implements k-anonymity, l-diversity, and PII removal
│
├── retrieval/
│   └── retrieve_subgraph.py     # Loads PrimeKG and extracts k-hop patient subgraphs
│
└── utils/
    ├── load_from_parquet.py     # Load and filter EHR data per patient
    └── preprocess_synthea_data.py # Groups and serializes Synthea data
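
The tree above describes retrieval/retrieve_subgraph.py as extracting k-hop subgraphs. For orientation, here is a minimal sketch of k-hop extraction by breadth-first search over a list of triples. It is not the repository's implementation: the function names are hypothetical, the graph is treated as undirected by assumption, and the toy triples are placeholders rather than PrimeKG content.

```python
from collections import defaultdict, deque

def build_adjacency(triples):
    """Build an undirected adjacency map from (source, relation, target) triples."""
    adjacency = defaultdict(set)
    for source, relation, target in triples:
        adjacency[source].add((relation, target))
        adjacency[target].add((relation, source))
    return adjacency

def k_hop_subgraph(adjacency, seeds, k):
    """Return (nodes, edges) within k hops of any seed node, using breadth-first search."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    edges = set()
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond the k-th hop
        for relation, neighbor in adjacency.get(node, ()):
            # Store each edge once, with its endpoints in sorted order.
            edges.add((relation,) + tuple(sorted((node, neighbor))))
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth), edges

# Toy triples with placeholder names (not taken from PrimeKG).
triples = [
    ("drug:B", "indication", "disease:A"),
    ("disease:A", "phenotype", "symptom:C"),
    ("symptom:C", "phenotype", "disease:D"),
]
adjacency = build_adjacency(triples)
nodes, edges = k_hop_subgraph(adjacency, ["drug:B"], k=2)
print(sorted(nodes))
```

In a patient-specific setting, the seeds would be the graph nodes matched to a patient's conditions or medications. How that matching is done in this project is not covered here.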

📊 Evaluation Output

After running main.py, results are saved to:

evaluation/output/
├── level0_output.json           # Accuracy metrics (no anonymization)
├── level1_output.json           # Accuracy metrics (PII removed)
├── level2_output.json           # Accuracy metrics (k-anonymity + l-diversity)
└── question_results.json        # Per-question breakdown (LLM vs RAG)

📈 Metrics Reported

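Each level's output file reports:

  • LLM_only_accuracy: accuracy of the base LLM without retrieval
  • RAG_accuracy: accuracy of the RAG-enhanced model
  • improvement: how much RAG raises or lowers accuracy relative to the LLM-only baseline
  • groundedness: proportion of RAG answers grounded in PrimeKG
  • Per-question-type breakdown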
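
To compare the privacy levels side by side, the per-level files can be read into one summary. This sketch assumes that each levelN_output.json is a JSON object with the metric names above as top-level keys, which this README does not confirm; `summarize_levels` is a hypothetical helper.

```python
import json
from pathlib import Path

METRICS = ["LLM_only_accuracy", "RAG_accuracy", "improvement", "groundedness"]

def summarize_levels(output_dir="evaluation/output", levels=(0, 1, 2)):
    """Collect the headline metrics from each levelN_output.json that exists."""
    summary = {}
    for level in levels:
        path = Path(output_dir) / f"level{level}_output.json"
        if not path.exists():
            continue
        data = json.loads(path.read_text())
        # .get() tolerates a missing key, since the exact file layout is assumed here.
        summary[level] = {name: data.get(name) for name in METRICS}
    return summary

if __name__ == "__main__":
    for level, metrics in summarize_levels().items():
        print(level, metrics)
```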

Each entry in question_results.json looks like:

{
  "question_id": 1,
  "patient_id": "a123...",
  "question_type": "RISK_ASSESSMENT",
  "question": "What risks should be considered...",
  "llm_answer": "Hypertension may increase risk of stroke.",
  "rag_answer": "Patients with hypertension taking ibuprofen have higher risk of stroke.",
  "base_correct": true,
  "rag_correct": true
}
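
The per-question entries are enough to recompute accuracy by question type. The sketch below assumes that question_results.json holds a list of entries shaped like the example above, and it defines improvement as RAG accuracy minus LLM-only accuracy, which is an assumption about how the project computes it. `accuracy_by_question_type` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_question_type(entries):
    """Aggregate base_correct and rag_correct flags into per-type accuracy figures."""
    counts = defaultdict(lambda: {"n": 0, "base": 0, "rag": 0})
    for entry in entries:
        c = counts[entry["question_type"]]
        c["n"] += 1
        c["base"] += bool(entry["base_correct"])
        c["rag"] += bool(entry["rag_correct"])
    return {
        qtype: {
            "LLM_only_accuracy": c["base"] / c["n"],
            "RAG_accuracy": c["rag"] / c["n"],
            # Assumed definition: difference between the two accuracies.
            "improvement": (c["rag"] - c["base"]) / c["n"],
        }
        for qtype, c in counts.items()
    }
```

To use it on real output, load the file with `json.load` and pass the resulting list to `accuracy_by_question_type`.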
