SecureMed-RAG is a privacy-aware clinical question answering pipeline that uses synthetic EHR data and PrimeKG to evaluate the effectiveness of Retrieval-Augmented Generation (RAG) under varying levels of patient data anonymization.
The system generates patient-specific questions, retrieves relevant subgraphs from PrimeKG, and compares responses from a base LLM and a RAG-enhanced model. It supports multiple privacy levels including k-anonymity and l-diversity.
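The two privacy mechanisms mentioned above can be illustrated with a toy check: a table is k-anonymous if every combination of quasi-identifiers occurs at least k times, and l-diverse if each such group contains at least l distinct sensitive values. This is only a sketch with made-up field names; the repository's `preprocessing/anonymize_ehr.py` may implement these checks differently.

```python
from collections import defaultdict

def is_k_anonymous_l_diverse(records, quasi_ids, sensitive, k, l):
    """Check k-anonymity and l-diversity over a list of dict records.

    records:   list of dicts (one per patient row)
    quasi_ids: keys treated as quasi-identifiers (e.g. age band, zip prefix)
    sensitive: key holding the sensitive attribute (e.g. diagnosis)
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_ids)
        groups[key].append(rec[sensitive])
    # Every quasi-identifier group must be large enough (k) and varied enough (l).
    return all(
        len(vals) >= k and len(set(vals)) >= l
        for vals in groups.values()
    )

rows = [
    {"age_band": "40-49", "zip3": "021", "diagnosis": "hypertension"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "diabetes"},
    {"age_band": "50-59", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "50-59", "zip3": "021", "diagnosis": "hypertension"},
]

print(is_k_anonymous_l_diverse(rows, ["age_band", "zip3"], "diagnosis", k=2, l=2))  # → True
```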
```bash
git clone https://github.com/TDogaNazli/SecureRAG.git
cd SecureRAG
```

This project uses Python 3.9.22. You can check with:

```bash
python3 --version
```

If needed, install it using pyenv:
```bash
pyenv install 3.9.22
pyenv local 3.9.22
```

Then create and activate a virtual environment:
```bash
python3.9 -m venv .myenv
source .myenv/bin/activate
```

Make sure your pip package is upgraded:

```bash
pip install --upgrade pip
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Install the google-genai package; make sure your project uses Python 3.9+:
```bash
pip install -q -U google-genai
```

- Visit Google AI Studio
- Generate your Gemini API key

Then create a `.env` file in the root directory:

```bash
echo "GEMINI_API_KEY=your-key-here" > .env
```
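If you want to read the key from Python without extra dependencies, a minimal `.env` parser looks like this (a sketch using only the standard library; the project itself may rely on a package such as python-dotenv instead):

```python
import os

def load_env(path=".env"):
    """Parse KEY=value lines from a .env file, skipping blanks and comments."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")  # split on the first '=' only
            env[key.strip()] = value.strip().strip('"')
    return env

# Usage (after creating .env as above):
# os.environ.setdefault("GEMINI_API_KEY", load_env()["GEMINI_API_KEY"])
```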
Download PrimeKG:

```bash
mkdir -p dataset/primekg/raw
wget -O dataset/primekg/raw/edges.csv https://dataverse.harvard.edu/api/access/datafile/6180616
wget -O dataset/primekg/raw/nodes.tsv https://dataverse.harvard.edu/api/access/datafile/6180617
```

Download the Synthea dataset:

```bash
mkdir -p dataset/synthea-dataset-100
```
```bash
wget -O dataset/synthea-dataset-100.zip https://github.com/lhs-open/synthetic-data/raw/main/record/synthea-dataset-100.zip
unzip dataset/synthea-dataset-100.zip -d dataset/synthea-dataset-100
python utils/preprocess_synthea_data.py
```

To run the evaluation pipeline:
```bash
python main.py
```

- If you are hitting the Gemini API quota, adjust the delay on this line in `generation/answer_generator.py`:

  ```python
  time.sleep(10)
  ```

- To change the number of patients processed, edit this line in `evaluation/evaluate.py`:

  ```python
  res = evaluate_dataset(patient_ids[:50], synthea_path, privacy_level=level, G=G)
  ```
```
SecureMed-RAG/
├── main.py                       # Entry point to run full evaluation
├── requirements.txt              # Python dependencies
├── .env                          # Your Gemini API key (not committed to Git)
│
├── dataset/
│   ├── primekg/                  # Contains downloaded PrimeKG edges/nodes
│   └── synthea-dataset-100/      # Raw and preprocessed Synthea EHR data
│       └── synthea-unified.parquet   # Preprocessed parquet version
│
├── evaluation/
│   ├── evaluate.py               # Evaluation logic, scoring, and result generation
│   └── output/                   # Contains evaluation result files
│
├── generation/
│   └── answer_generator.py       # Calls Gemini to answer medical questions
│
├── preprocessing/
│   └── anonymize_ehr.py          # Implements k-anonymity, l-diversity, and PII removal
│
├── retrieval/
│   └── retrieve_subgraph.py      # Loads PrimeKG and extracts k-hop patient subgraphs
│
└── utils/
    ├── load_from_parquet.py      # Load and filter EHR data per patient
    └── preprocess_synthea_data.py    # Groups and serializes Synthea data
```
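The k-hop extraction done by `retrieve_subgraph.py` can be sketched as a plain breadth-first expansion over an adjacency map (illustrative only, with made-up node names; the actual implementation may use a graph library such as networkx):

```python
from collections import deque

def k_hop_nodes(adjacency, seeds, k):
    """Return all nodes within k hops of any seed node.

    adjacency: dict mapping node -> iterable of neighbor nodes
    seeds:     starting nodes (e.g. a patient's conditions/drugs in PrimeKG)
    """
    seen = {s: 0 for s in seeds}  # node -> hop distance
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == k:            # do not expand beyond k hops
            continue
        for nbr in adjacency.get(node, ()):
            if nbr not in seen:
                seen[nbr] = depth + 1
                queue.append(nbr)
    return set(seen)

graph = {
    "hypertension": ["lisinopril", "stroke"],
    "stroke": ["ibuprofen"],
}
print(sorted(k_hop_nodes(graph, ["hypertension"], k=1)))  # → ['hypertension', 'lisinopril', 'stroke']
```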
After running `main.py`, results are saved to:

```
evaluation/output/
├── level0_output.json        # Accuracy metrics (no anonymization)
├── level1_output.json        # Accuracy metrics (PII removed)
├── level2_output.json        # Accuracy metrics (k-anonymity + l-diversity)
└── question_results.json     # Per-question breakdown (LLM vs RAG)
```
Each level includes:

- `LLM_only_accuracy`
- `RAG_accuracy`
- `improvement`: how much RAG improves or worsens accuracy
- `groundedness`: proportion of RAG answers grounded in PrimeKG
- Per-question-type breakdown
Each entry in `question_results.json` looks like:

```json
{
  "question_id": 1,
  "patient_id": "a123...",
  "question_type": "RISK_ASSESSMENT",
  "question": "What risks should be considered...",
  "llm_answer": "Hypertension may increase risk of stroke.",
  "rag_answer": "Patients with hypertension taking ibuprofen have higher risk of stroke.",
  "base_correct": true,
  "rag_correct": true
}
```
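Given entries of this shape, the per-model accuracies can be recomputed directly from `question_results.json`. This is a sketch based on the example fields above; the improvement figure is shown here as a simple difference, which may not match the repository's exact definition.

```python
import json

def summarize(path="evaluation/output/question_results.json"):
    """Recompute base-LLM and RAG accuracy from the per-question results."""
    with open(path) as f:
        results = json.load(f)
    n = len(results)
    base_acc = sum(r["base_correct"] for r in results) / n  # bools sum as 0/1
    rag_acc = sum(r["rag_correct"] for r in results) / n
    return {
        "LLM_only_accuracy": base_acc,
        "RAG_accuracy": rag_acc,
        "improvement": rag_acc - base_acc,
    }
```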