imperial-aisp/rat-bench

RAT-Bench

A benchmark for evaluating PII-removal and text-anonymization tools, introduced in the RAT-Bench paper, with a focus on re-identification risk.

Our benchmark dataset is openly accessible. We also maintain an extendable leaderboard of text anonymization tools.

⚙️ Installation

This project uses the uv package manager.

1. Set up the virtual environment

Run the following inside the project root:

uv sync

This creates a virtual environment in a .venv folder. Activate it by running source .venv/bin/activate, and run all subsequent commands inside it.

To install the correctmatch package from Rocher et al., which we use to compute re-identification risk, run:

python install_correctmatch.py
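As context for what correctmatch estimates: Rocher et al.'s method fits a statistical model to a sample to estimate how identifiable individuals are from combinations of attributes. As a simplified, illustrative stand-in (not the correctmatch algorithm itself), the sketch below computes the empirical uniqueness of quasi-identifier combinations in a sample:

```python
from collections import Counter

def empirical_uniqueness(records):
    """Fraction of records whose quasi-identifier combination appears
    exactly once in the sample. Illustrative only: correctmatch instead
    fits a model to estimate population-level uniqueness from a sample."""
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)

# Toy sample of (age bracket, ZIP prefix, gender) quasi-identifiers.
sample = [
    ("30-39", "941", "F"),
    ("30-39", "941", "F"),
    ("40-49", "100", "M"),
    ("20-29", "606", "F"),
]
print(empirical_uniqueness(sample))  # 0.5: two of the four records are unique
```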

You'll also need to download the spaCy English model by running:

uv pip install pip
python -m spacy download en_core_web_lg

If you also need anonymizers for other languages, download the corresponding models the same way:

python -m spacy download es_core_news_lg
python -m spacy download zh_core_web_lg

Copy pii_benchmark/credentials_example.py to pii_benchmark/credentials.py.

Add your API keys inside credentials.py.

(Note: credentials.py is already in .gitignore and won’t be committed.)
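The exact contents of credentials.py aren't documented here; a hypothetical sketch, assuming the file exposes API keys as module-level constants (the variable name below is illustrative, not necessarily the repo's actual name — check credentials_example.py for the real layout):

```python
# pii_benchmark/credentials.py (hypothetical layout; copy
# credentials_example.py and fill in your own keys)
OPENAI_API_KEY = "sk-..."  # placeholder, never commit real keys
```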

⚙️ Data

We have included the processed datasets needed to generate the benchmark and run our experiments in this repo:

data/100_profiles.csv contains the profiles, including direct and indirect identifiers, used to generate benchmark entries.

data/population_sample.csv contains a 3M-row sample of US census data, preprocessed to account for the weighting of each row.
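On the weighting point: census microdata rows typically carry survey weights indicating how many people each row represents. A minimal sketch of one way to account for them, drawing rows with probability proportional to weight via the standard library (illustrative only; the repo's actual preprocessing may differ):

```python
import random

def resample_by_weight(rows, weights, n, seed=0):
    """Draw n rows with probability proportional to survey weight,
    approximating a self-weighted population sample. Sketch only:
    not the repo's actual preprocessing code."""
    rng = random.Random(seed)
    return rng.choices(rows, weights=weights, k=n)

rows = ["row_a", "row_b", "row_c"]
weights = [5.0, 1.0, 1.0]  # hypothetical survey weights
sample = resample_by_weight(rows, weights, n=1000)
print(len(sample))  # 1000; row_a appears roughly 5x as often as the others
```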

Running experiments

We have included 100 benchmark entries per difficulty level in the benchmark folder of this repo. To generate a new set of entries, run:

sh scripts/generate/level_{LEVEL}.sh

for the desired level of difficulty. You can also instead obtain our generated benchmark dataset here.

Anonymization scripts are included in scripts/anonymize, and attacker scripts are included in scripts/attack.

A full run of the pipeline for one difficulty level, from generation to re-identification rate computation, can be achieved by running the following:

  1. sh scripts/generate/level_1.sh
  2. sh scripts/anonymize/level_1.sh
  3. sh scripts/attack/level_1.sh

References

If you find our work useful for your research, please cite our paper:

@article{krvco2026rat,
  title={RAT-Bench: A Comprehensive Benchmark for Text Anonymization},
  author={Kr{\v{c}}o, Nata{\v{s}}a and Yao, Zexi and Meeus, Matthieu and de Montjoye, Yves-Alexandre},
  journal={arXiv preprint arXiv:2602.12806},
  year={2026}
}
