Disentangling the Roles of Representation and Selection in Data Pruning

This repository contains the code for our ACL 2025 paper "Disentangling the Roles of Representation and Selection in Data Pruning", by Yupei Du, Yingjin Song, Hugh Mee Wong, Daniil Ignatev, Albert Gatt, and Dong Nguyen.

Get Started

Install dependencies

python3.10 -m venv py310_venv
source py310_venv/bin/activate
pip install -r requirements.txt
# TRAK installation instructions: https://trak.readthedocs.io/en/latest/install.html
# (the quotes keep shells such as zsh from expanding the brackets)
pip install "traker[fast]"

Download the dataset

Download the dataset from Google Drive and extract it into the data directory.

Train models

Our training script reads its training parameters from a YAML configuration file. Example configurations are in the configs directory. You can also generate a YAML configuration file from a set of parameters with the construct_train_yaml function in exp_utils.py.

python train.py <YAML_CONFIG>
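
To train on several configurations in sequence, a small driver along the following lines may help. This is a minimal sketch and not part of the repository: the helper names build_train_commands and run_all, and the assumption that configuration files end in .yaml and sit directly inside the given directory, are ours. Only the train.py <YAML_CONFIG> invocation comes from the instructions above.

```python
import subprocess
import sys
from pathlib import Path


def build_train_commands(config_dir, script="train.py"):
    """Return one `python train.py <YAML_CONFIG>` command per config file.

    Hypothetical helper: assumes configs are stored as *.yaml files
    directly inside `config_dir`.
    """
    configs = sorted(Path(config_dir).glob("*.yaml"))
    # Use the current interpreter so the active virtual environment is reused.
    return [[sys.executable, script, str(cfg)] for cfg in configs]


def run_all(config_dir):
    """Run the training script once per config, stopping on the first failure."""
    for cmd in build_train_commands(config_dir):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling run_all("configs") from the repository root would then launch one training run per file. Building the command list separately from running it makes it easy to inspect what would be executed before starting any long-running job.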

Collect representations

To collect representations from different training runs, use the collect_grad_reps.py script. Like the training script, it takes a YAML configuration file that specifies its parameters. Example configurations are in the configs directory, and you can generate one with the construct_feature_yaml function in exp_utils.py.

python collect_grad_reps.py <YAML_CONFIG>

Infer selected data instances and retrain models

To infer the selected data instances, use the subset_inference.py script. To retrain the models on the selected data instances, run the train.py script again with the --selected_uid_path argument.
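
A retraining call could be assembled as in the sketch below. It rests on assumptions: the text above only states that train.py accepts --selected_uid_path. Placing the flag after the YAML path, the helper name build_retrain_command, and the example paths are all hypothetical, so check the argument parser in train.py for the exact form.

```python
import shlex
import sys


def build_retrain_command(yaml_config, selected_uid_path, script="train.py"):
    """Build the command that retrains a model on a selected subset.

    Assumption: the flag follows the YAML config path on the command line.
    """
    return [
        sys.executable,
        script,
        yaml_config,
        "--selected_uid_path",
        selected_uid_path,
    ]


# Hypothetical paths, for illustration only.
cmd = build_retrain_command("configs/example.yaml", "path/to/selected_uids")
print(shlex.join(cmd))  # shell-quoted form, safe to copy into a terminal
```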

Toy example

The code for the toy example in Figure 2a is in sampling_toy.ipynb.


