SecureMed-RAG is a privacy-aware clinical question answering pipeline that uses synthetic EHR data and PrimeKG to evaluate the effectiveness of Retrieval-Augmented Generation (RAG) under varying levels of patient data anonymization.
The system generates patient-specific questions, retrieves relevant subgraphs from PrimeKG, and compares responses from a base LLM and a RAG-enhanced model. It supports multiple privacy levels including k-anonymity and l-diversity.
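The two privacy mechanisms mentioned above can be illustrated with a toy check: a table is k-anonymous if every combination of quasi-identifiers occurs at least k times, and l-diverse if each such group contains at least l distinct sensitive values. This is only a sketch with made-up field names; the repository's `preprocessing/anonymize_ehr.py` may implement these checks differently.

```python
from collections import defaultdict

def is_k_anonymous_l_diverse(records, quasi_ids, sensitive, k, l):
    """Check k-anonymity and l-diversity over a list of dict records.

    records:   list of dicts (one per patient row)
    quasi_ids: keys treated as quasi-identifiers (e.g. age band, zip prefix)
    sensitive: key holding the sensitive attribute (e.g. diagnosis)
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_ids)
        groups[key].append(rec[sensitive])
    # Every quasi-identifier group must be large enough (k) and varied enough (l).
    return all(
        len(vals) >= k and len(set(vals)) >= l
        for vals in groups.values()
    )

rows = [
    {"age_band": "40-49", "zip3": "021", "diagnosis": "hypertension"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "diabetes"},
    {"age_band": "50-59", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "50-59", "zip3": "021", "diagnosis": "hypertension"},
]

print(is_k_anonymous_l_diverse(rows, ["age_band", "zip3"], "diagnosis", k=2, l=2))  # → True
```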
```bash
git clone https://github.com/TDogaNazli/SecureRAG.git
cd SecureRAG
```

This project uses Python 3.9.22. You can check with:

```bash
python3 --version
```

If needed, install it using pyenv:
```bash
pyenv install 3.9.22
pyenv local 3.9.22
```

Then create and activate a virtual environment:
```bash
python3.9 -m venv .myenv
source .myenv/bin/activate
```

Make sure your pip package is upgraded:

```bash
pip install --upgrade pip
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Install the google-genai package; make sure your project uses Python 3.9+:
```bash
pip install -q -U google-genai
```

- Visit Google AI Studio
- Generate your Gemini API key

Then create a `.env` file in the root directory:

```bash
echo "GEMINI_API_KEY=your-key-here" > .env
```
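If you want to read the key from Python without extra dependencies, a minimal `.env` parser looks like this (a sketch using only the standard library; the project itself may rely on a package such as python-dotenv instead):

```python
import os

def load_env(path=".env"):
    """Parse KEY=value lines from a .env file, skipping blanks and comments."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")  # split on the first '=' only
            env[key.strip()] = value.strip().strip('"')
    return env

# Usage (after creating .env as above):
# os.environ.setdefault("GEMINI_API_KEY", load_env()["GEMINI_API_KEY"])
```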
Download PrimeKG:

```bash
mkdir -p dataset/primekg/raw
wget -O dataset/primekg/raw/edges.csv https://dataverse.harvard.edu/api/access/datafile/6180616
wget -O dataset/primekg/raw/nodes.tsv https://dataverse.harvard.edu/api/access/datafile/6180617
```

Download the Synthea dataset:

```bash
mkdir -p dataset/synthea-dataset-100
```
```bash
wget -O dataset/synthea-dataset-100.zip https://github.com/lhs-open/synthetic-data/raw/main/record/synthea-dataset-100.zip
unzip dataset/synthea-dataset-100.zip -d dataset/synthea-dataset-100
python utils/preprocess_synthea_data.py
```

To run the evaluation pipeline:
```bash
python main.py
```

- If you are hitting the Gemini API quota, adjust the delay on this line in `generation/answer_generator.py`:

  ```python
  time.sleep(10)
  ```

- To change the number of patients processed, edit this line in `evaluation/evaluate.py`:

  ```python
  res = evaluate_dataset(patient_ids[:50], synthea_path, privacy_level=level, G=G)
  ```
```
SecureMed-RAG/
├── main.py                       # Entry point to run full evaluation
├── requirements.txt              # Python dependencies
├── .env                          # Your Gemini API key (not committed to Git)
│
├── dataset/
│   ├── primekg/                  # Contains downloaded PrimeKG edges/nodes
│   └── synthea-dataset-100/      # Raw and preprocessed Synthea EHR data
│       └── synthea-unified.parquet   # Preprocessed parquet version
│
├── evaluation/
│   ├── evaluate.py               # Evaluation logic, scoring, and result generation
│   └── output/                   # Contains evaluation result files
│
├── generation/
│   └── answer_generator.py       # Calls Gemini to answer medical questions
│
├── preprocessing/
│   └── anonymize_ehr.py          # Implements k-anonymity, l-diversity, and PII removal
│
├── retrieval/
│   └── retrieve_subgraph.py      # Loads PrimeKG and extracts k-hop patient subgraphs
│
└── utils/
    ├── load_from_parquet.py      # Load and filter EHR data per patient
    └── preprocess_synthea_data.py    # Groups and serializes Synthea data
```
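The k-hop extraction done by `retrieve_subgraph.py` can be sketched as a plain breadth-first expansion over an adjacency map (illustrative only, with made-up node names; the actual implementation may use a graph library such as networkx):

```python
from collections import deque

def k_hop_nodes(adjacency, seeds, k):
    """Return all nodes within k hops of any seed node.

    adjacency: dict mapping node -> iterable of neighbor nodes
    seeds:     starting nodes (e.g. a patient's conditions/drugs in PrimeKG)
    """
    seen = {s: 0 for s in seeds}  # node -> hop distance
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == k:            # do not expand beyond k hops
            continue
        for nbr in adjacency.get(node, ()):
            if nbr not in seen:
                seen[nbr] = depth + 1
                queue.append(nbr)
    return set(seen)

graph = {
    "hypertension": ["lisinopril", "stroke"],
    "stroke": ["ibuprofen"],
}
print(sorted(k_hop_nodes(graph, ["hypertension"], k=1)))  # → ['hypertension', 'lisinopril', 'stroke']
```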
After running `main.py`, results are saved to:

```
evaluation/output/
├── level0_output.json        # Accuracy metrics (no anonymization)
├── level1_output.json        # Accuracy metrics (PII removed)
├── level2_output.json        # Accuracy metrics (k-anonymity + l-diversity)
└── question_results.json     # Per-question breakdown (LLM vs RAG)
```
Each level includes:

- `LLM_only_accuracy`
- `RAG_accuracy`
- `improvement`: how much RAG improves or worsens accuracy
- `groundedness`: proportion of RAG answers grounded in PrimeKG
- Per-question-type breakdown
Each entry in `question_results.json` looks like:

```json
{
  "question_id": 1,
  "patient_id": "a123...",
  "question_type": "RISK_ASSESSMENT",
  "question": "What risks should be considered...",
  "llm_answer": "Hypertension may increase risk of stroke.",
  "rag_answer": "Patients with hypertension taking ibuprofen have higher risk of stroke.",
  "base_correct": true,
  "rag_correct": true
}
```
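Given entries of this shape, the per-model accuracies can be recomputed directly from `question_results.json`. This is a sketch based on the example fields above; the improvement figure is shown here as a simple difference, which may not match the repository's exact definition.

```python
import json

def summarize(path="evaluation/output/question_results.json"):
    """Recompute base-LLM and RAG accuracy from the per-question results."""
    with open(path) as f:
        results = json.load(f)
    n = len(results)
    base_acc = sum(r["base_correct"] for r in results) / n  # bools sum as 0/1
    rag_acc = sum(r["rag_correct"] for r in results) / n
    return {
        "LLM_only_accuracy": base_acc,
        "RAG_accuracy": rag_acc,
        "improvement": rag_acc - base_acc,
    }
```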