imperial-aisp/rat-bench

RAT-Bench

A benchmark for evaluating PII-removal and text-anonymization tools, introduced in the RAT-Bench paper, with a focus on re-identification risk.

Our benchmark dataset is openly accessible. We also maintain an extendable leaderboard of text anonymization tools.

⚙️ Installation

This project uses the uv package manager.

1. Set up the virtual environment

Run the following inside the project root:

uv sync

This creates a virtual environment in a .venv folder. Activate it by running source .venv/bin/activate, and run all subsequent commands inside it.

To install the correctmatch package from Rocher et al., which we use to compute re-identification risk, run:

python install_correctmatch.py
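As context for what correctmatch estimates: Rocher et al.'s method fits a statistical model to a sample to estimate how identifiable individuals are from combinations of attributes. As a simplified, illustrative stand-in (not the correctmatch algorithm itself), the sketch below computes the empirical uniqueness of quasi-identifier combinations in a sample:

```python
from collections import Counter

def empirical_uniqueness(records):
    """Fraction of records whose quasi-identifier combination appears
    exactly once in the sample. Illustrative only: correctmatch instead
    fits a model to estimate population-level uniqueness from a sample."""
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)

# Toy sample of (age bracket, ZIP prefix, gender) quasi-identifiers.
sample = [
    ("30-39", "941", "F"),
    ("30-39", "941", "F"),
    ("40-49", "100", "M"),
    ("20-29", "606", "F"),
]
print(empirical_uniqueness(sample))  # 0.5: two of the four records are unique
```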

You'll also need to download the spaCy English model by running:

uv pip install pip
python -m spacy download en_core_web_lg

If you also need anonymizers for other languages, download the corresponding models the same way:

python -m spacy download es_core_news_lg
python -m spacy download zh_core_web_lg

Copy pii_benchmark/credentials_example.py to pii_benchmark/credentials.py.

Add your API keys inside credentials.py.

(Note: credentials.py is already in .gitignore and won’t be committed.)
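The exact contents of credentials.py aren't documented here; a hypothetical sketch, assuming the file exposes API keys as module-level constants (the variable name below is illustrative, not necessarily the repo's actual name — check credentials_example.py for the real layout):

```python
# pii_benchmark/credentials.py (hypothetical layout; copy
# credentials_example.py and fill in your own keys)
OPENAI_API_KEY = "sk-..."  # placeholder, never commit real keys
```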

⚙️ Data

We have included the processed datasets needed to generate the benchmark and run our experiments in this repo:

data/100_profiles.csv contains the profiles, including direct and indirect identifiers, used to generate benchmark entries.

data/population_sample.csv contains a 3M-row sample of US census data, preprocessed to account for the weighting of each row.
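On the weighting point: census microdata rows typically carry survey weights indicating how many people each row represents. A minimal sketch of one way to account for them, drawing rows with probability proportional to weight via the standard library (illustrative only; the repo's actual preprocessing may differ):

```python
import random

def resample_by_weight(rows, weights, n, seed=0):
    """Draw n rows with probability proportional to survey weight,
    approximating a self-weighted population sample. Sketch only:
    not the repo's actual preprocessing code."""
    rng = random.Random(seed)
    return rng.choices(rows, weights=weights, k=n)

rows = ["row_a", "row_b", "row_c"]
weights = [5.0, 1.0, 1.0]  # hypothetical survey weights
sample = resample_by_weight(rows, weights, n=1000)
print(len(sample))  # 1000; row_a appears roughly 5x as often as the others
```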

Running experiments

We have included 100 benchmark entries per difficulty level in the benchmark folder of this repo. To generate a new set of entries, run:

sh scripts/generate/level_{LEVEL}.sh

for the desired level of difficulty. You can also instead obtain our generated benchmark dataset here.

Anonymization scripts are included in scripts/anonymize, and attacker scripts are included in scripts/attack.

A full run of the pipeline for one difficulty level, from generation to re-identification rate computation, can be achieved by running the following:

  1. sh scripts/generate/level_1.sh
  2. sh scripts/anonymize/level_1.sh
  3. sh scripts/attack/level_1.sh

References

If you find our work useful for your research, please cite our paper:

@article{krvco2026rat,
  title={RAT-Bench: A Comprehensive Benchmark for Text Anonymization},
  author={Kr{\v{c}}o, Nata{\v{s}}a and Yao, Zexi and Meeus, Matthieu and de Montjoye, Yves-Alexandre},
  journal={arXiv preprint arXiv:2602.12806},
  year={2026}
}
