A benchmark for evaluating PII-removal and text-anonymization tools, accompanying the RAT-Bench paper, with a focus on re-identification risk.
Our benchmark dataset is openly accessible. We also maintain an extendable leaderboard of text anonymization tools.
This project uses the uv package manager.
Run the following inside the project root:
```shell
uv sync
```

This will create a virtual environment inside a `.venv` folder. Run all subsequent commands within this virtual environment; activate it with:

```shell
source .venv/bin/activate
```
To install the correctmatch package from Rocher et al., which is used to compute re-identification risk, run the following:
```shell
python install_correctmatch.py
```

You'll also need pip inside the environment and the spaCy English model:

```shell
uv pip install pip
python -m spacy download en_core_web_lg
```
If you need anonymizers for other languages, repeat the same process to download `es_core_news_lg` (Spanish) and `zh_core_web_lg` (Chinese).
Copy pii_benchmark/credentials_example.py → pii_benchmark/credentials.py.
Add your API keys inside credentials.py.
(Note: credentials.py is already in .gitignore and won’t be committed.)
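A minimal sketch of what `credentials.py` might contain; the actual variable names are defined in `pii_benchmark/credentials_example.py`, and the key names below are assumptions for illustration:

```python
# pii_benchmark/credentials.py -- sketch only.
# Variable names here are assumptions; copy the real names from
# pii_benchmark/credentials_example.py and fill in your own keys.
OPENAI_API_KEY = "sk-..."
ANTHROPIC_API_KEY = "..."
```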
We have included the processed datasets needed to generate the benchmark and run our experiments in this repo:
data/100_profiles.csv contains the profiles, including direct and indirect identifiers, used to generate benchmark entries.
data/population_sample.csv contains a 3M-row sample of US census data, preprocessed to account for the weighting of each row.
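To illustrate what weight-aware sampling means here, a toy sketch: rows with higher census weights should appear proportionally more often in an expanded sample. The column names and weights below are invented for the example; the repo's actual preprocessing is not shown in this snippet.

```python
import random

# Toy population: two rows with a 3:1 weight ratio.
# (Columns and values are invented for illustration.)
rows = [
    {"zip": "10001", "age": 34, "weight": 3},
    {"zip": "94103", "age": 58, "weight": 1},
]

random.seed(0)
# Draw 1000 rows, each row's probability proportional to its weight.
sample = random.choices(rows, weights=[r["weight"] for r in rows], k=1000)
share = sum(r["zip"] == "10001" for r in sample) / len(sample)
print(round(share, 2))  # roughly 0.75, matching the 3:1 weight ratio
```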
We have included 100 benchmark entries per level in the benchmark folder in this repo. To run a new generation, run:

```shell
sh scripts/generate/level_{LEVEL}.sh
```

for the desired level of difficulty. You can also instead obtain our generated benchmark dataset here.
Anonymization scripts are included in scripts/anonymize, and attacker scripts are included in scripts/attack.
A full run of the pipeline for one difficulty level, from generation to re-identification rate computation, can be achieved by running the following:
```shell
sh scripts/generate/level_1.sh
sh scripts/anonymize/level_1.sh
sh scripts/attack/level_1.sh
```
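To run the full pipeline for every difficulty level, the three stages can be looped over. The level list below is an assumption; adjust it to match the scripts present in `scripts/generate/`:

```shell
# Dry-run sketch: print the full pipeline (generate -> anonymize -> attack)
# for each difficulty level. The "levels" list is an assumption; adjust it
# to the level_*.sh scripts in your checkout.
levels="1 2 3"
for level in $levels; do
  for stage in generate anonymize attack; do
    cmd="sh scripts/${stage}/level_${level}.sh"
    echo "$cmd"   # replace echo with: eval "$cmd" to actually run the stage
  done
done
```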
If you find our work useful for your research, please cite our paper:
```bibtex
@article{krvco2026rat,
  title={RAT-Bench: A Comprehensive Benchmark for Text Anonymization},
  author={Kr{\v{c}}o, Nata{\v{s}}a and Yao, Zexi and Meeus, Matthieu and de Montjoye, Yves-Alexandre},
  journal={arXiv preprint arXiv:2602.12806},
  year={2026}
}
```
