
PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing


Fine-Tuning

We show how to fine-tune a Hugging Face model on the ECHR dataset (i) without defenses, and (ii) with differentially private training.
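For orientation, the undefended setting is ordinary causal-LM fine-tuning. The sketch below uses the Hugging Face Trainer; the model name, data file, and hyperparameters are illustrative placeholders, not the values used by finetune.sh.

```python
# Minimal sketch of undefended fine-tuning with the Hugging Face Trainer.
# Model name, data file, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder; finetune.sh selects the actual model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# ECHR decisions as raw text; the repo's own data pipeline may differ.
dataset = load_dataset("text", data_files={"train": "echr_train.txt"})["train"]  # hypothetical file

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="echr_undefended", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```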

Build & Run

We recommend setting up a conda environment for this project.

$ conda create -n pii-leakage python=3.10
$ conda activate pii-leakage
$ pip install -e .

To install fastDP,

cd libs/fast-differential-privacy
python -m setup develop
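Conceptually, differentially private training wraps the optimizer so that per-sample gradients are clipped and noised before each update. The sketch below assumes fastDP exposes an Opacus-style PrivacyEngine with an attach method; check libs/fast-differential-privacy for the exact interface and the arguments used by finetune.sh.

```python
# Sketch of DP fine-tuning: attach a privacy engine to the optimizer so that
# gradients are clipped per sample and noised before each step.
# Assumes fastDP provides an Opacus-style PrivacyEngine; argument names and
# values below are illustrative, not the project's actual configuration.
import torch
from fastDP import PrivacyEngine
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

privacy_engine = PrivacyEngine(
    model,
    batch_size=32,          # logical batch size
    sample_size=50_000,     # number of training examples
    epochs=3,
    target_epsilon=8.0,     # privacy budget
    clipping_fn="automatic",
)
privacy_engine.attach(optimizer)  # subsequent optimizer.step() calls are DP
```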

Run fine-tuning

All fine-tuning commands are collected in finetune.sh. Add your wandb key to the export WANDB_API_KEY= line in ./finetune.sh, then run:

./finetune.sh

Mechanistic Interpretability Analysis

To install EAP-IG,

conda create -n py312 python=3.12
conda activate py312
cd libs
git clone https://github.com/hannamw/EAP-IG.git
cd EAP-IG
pip install .
pip install cmapy
pip install seaborn==0.13.2
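EAP-IG localizes circuits with edge attribution patching: the effect of swapping a corrupted activation into a clean run is approximated to first order by (corrupted activation minus clean activation) times the gradient of a task metric with respect to that activation. The self-contained PyTorch toy below illustrates this approximation; it does not use the EAP-IG library's actual API.

```python
# Toy illustration of the edge attribution patching idea: the effect of
# patching a corrupted activation into a clean run is approximated by
# (a_corrupt - a_clean) * d(metric)/d(a_clean). Standalone sketch, not EAP-IG.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean, corrupt = torch.randn(4, 8), torch.randn(4, 8)

acts = {}
handle = model[0].register_forward_hook(
    lambda mod, inp, out: acts.__setitem__("layer0", out)
)

# Clean pass: cache the activation and get the metric's gradient w.r.t. it.
metric = model(clean).mean()
a_clean = acts["layer0"]
grad = torch.autograd.grad(metric, a_clean)[0]

# Corrupted pass: cache the corresponding activation (no gradients needed).
with torch.no_grad():
    model(corrupt)
a_corrupt = acts["layer0"]
handle.remove()

# First-order estimate of how much patching this activation changes the metric.
attribution = ((a_corrupt - a_clean.detach()) * grad).sum().item()
print(f"attribution score for layer0: {attribution:.4f}")
```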

Analyzing Impact of DP on General Circuits

Run the shell script below to execute all the commands and generate the relevant CSV files:

./gencircuits.sh
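Once the CSVs are generated, a typical analysis is to measure how much the circuit recovered from the DP model overlaps with the one from the undefended model. The snippet below is a hypothetical example of such a comparison; the file names and the "edge"/"score" column names are assumptions about the CSV layout, not what gencircuits.sh actually emits.

```python
# Hypothetical comparison of circuits from an undefended and a DP model:
# Jaccard overlap of the top-k edges ranked by absolute attribution score.
# File names and column names ("edge", "score") are assumed, not guaranteed.
import pandas as pd

def top_k_edges(csv_path: str, k: int = 100) -> set[str]:
    """Return the k edges with the largest absolute attribution scores."""
    df = pd.read_csv(csv_path)
    df = df.reindex(df["score"].abs().sort_values(ascending=False).index)
    return set(df["edge"].head(k))

undefended = top_k_edges("circuits_undefended.csv")  # hypothetical file
dp = top_k_edges("circuits_dp.csv")                  # hypothetical file

jaccard = len(undefended & dp) / len(undefended | dp)
print(f"top-100 edge overlap (Jaccard): {jaccard:.2f}")
```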

Attack

Assuming your fine-tuned model is located at ../echr_undefended, run the following attacks. Otherwise, edit the model_ckpt attribute in ../configs/<ATTACK>/echr-gpt2-small-undefended.yml to point to the location of the model.

PII Extraction

This will extract PII from the model's generated text.

$ python extract_pii.py --config_path ../configs/pii-extraction-echr-qwen3-17-baseline-loc.yml
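Conceptually, the attack samples text from the fine-tuned model and tags any personally identifiable entities that appear in the generations. The sketch below illustrates this with generic Hugging Face pipelines (text generation plus NER); it is not the repo's extract_pii.py implementation, and the model names and sampling settings are placeholders.

```python
# Illustrative sketch of PII extraction: sample text from the fine-tuned model,
# then tag person entities in the generations with an off-the-shelf NER model.
# Not extract_pii.py; model names and sampling settings are placeholders.
from collections import Counter
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder: your fine-tuned checkpoint
ner = pipeline("ner", aggregation_strategy="simple")   # generic NER tagger

extracted = Counter()
for sample in generator("The court heard that", max_new_tokens=64,
                        num_return_sequences=8, do_sample=True):
    for entity in ner(sample["generated_text"]):
        if entity["entity_group"] == "PER":            # person names only
            extracted[entity["word"].strip()] += 1

print(extracted.most_common(10))                       # most frequently extracted names
```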

Credits
