Suppression or Deletion: A Restoration-Based Representation-Level Analysis of Machine Unlearning (WWW '26)
As pretrained models are increasingly shared on the web, ensuring models can forget sensitive, copyrighted, or private information has become crucial. Current unlearning evaluations rely on output-based metrics, which cannot verify whether information is truly deleted or merely suppressed at the representation level.
This repository provides a restoration-based analysis framework using Sparse Autoencoders to:
- Identify class-specific expert features in intermediate layers
- Apply inference-time steering to restore unlearned information
- Quantitatively distinguish between suppression and deletion
Requirements:
- Python 3.9 or higher
- CUDA-capable GPU (recommended)
```bash
# Clone the repository
git clone https://github.com/Yurim990507/suppression-or-deletion.git
cd suppression-or-deletion

# Install dependencies
pip install -r requirements.txt
```

Download the pretrained SAE models, expert features, and original model from Hugging Face:
```bash
# Install the Hugging Face CLI
pip install huggingface_hub

# Download all pretrained files
huggingface-cli download Yurim0507/suppression-or-deletion --local-dir ./pretrained --repo-type=model
```

The files will be organized as:
```
pretrained/
├── cifar10/
│   ├── vit_base_16_original.pth        # Original ViT model
│   ├── sae_layer9_k16.pt               # SAE model (layer 9, k=16)
│   ├── activations_layer9_stats.npy    # Normalization statistics
│   └── expert_features_layer9_k16.pt   # Expert features per class
└── imagenette/
    ├── vit_base_16_original.pth        # Original ViT model
    ├── sae_layer9_k32.pt               # SAE model (layer 9, k=32)
    ├── activations_layer9_stats.npy    # Normalization statistics
    └── expert_features_layer9_k32.pt   # Expert features per class
```
Note: Dataset-specific SAE configurations:
- CIFAR-10: k=16 (TopK sparsity)
- Imagenette: k=32 (TopK sparsity)
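As a convenience, a small helper (our own illustration, not part of the repository) can build paths into this layout:

```python
from pathlib import Path

def pretrained_paths(root: str, dataset: str, layer: int, k: int) -> dict:
    """Build paths to the downloaded artifacts, following the layout above."""
    d = Path(root) / dataset
    return {
        'model': d / 'vit_base_16_original.pth',
        'sae': d / f'sae_layer{layer}_k{k}.pt',
        'stats': d / f'activations_layer{layer}_stats.npy',
        'experts': d / f'expert_features_layer{layer}_k{k}.pt',
    }

# CIFAR-10 uses k=16; Imagenette uses k=32.
paths = pretrained_paths('./pretrained', 'cifar10', layer=9, k=16)
```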
Train an unlearned model using your preferred method (CF-k, SALUN, SCRUB, etc.) and save it as a .pth checkpoint.
Example checkpoint format:
```python
{
    'model_state_dict': model.state_dict(),
    # ... other optional keys
}
```

Run the demo:

```bash
python demo.py \
    --dataset cifar10 \
    --unlearned_model path/to/your/unlearned_model.pth \
    --target_class 0
```

Run the full restoration sweep:

```bash
python recovery_test.py \
    --dataset cifar10 \
    --unlearned_model path/to/your/unlearned_model.pth \
    --target_class 0 \
    --layer 9 \
    --alpha 1.0 5.0 10.0 \
    --save_dir ./results
```

Results are saved in the `--save_dir` directory:

- `restoration_class{X}.png`: line plot of restoration performance
- `restoration_class{X}_results.json`: detailed numerical results
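A minimal sketch of saving and reloading a checkpoint in the expected format (illustrative only; `nn.Linear` stands in for your unlearned ViT):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for your unlearned ViT
torch.save({
    'model_state_dict': model.state_dict(),
    # ... other optional keys (e.g. optimizer state, epoch)
}, 'unlearned_model.pth')

# Load the state dict back into a freshly constructed model.
restored = nn.Linear(8, 2)
ckpt = torch.load('unlearned_model.pth', map_location='cpu')
restored.load_state_dict(ckpt['model_state_dict'])
```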
We train a sparse autoencoder on ViT layer activations with the following architecture:
- Input: ViT hidden states (768-dim for ViT-Base)
- Latent: 768-dim (hidden_mul=1)
- Activation: TopK sparsity (only top K features active per sample)
- Output: Reconstructed hidden states
Dataset-specific configurations:
| Dataset | K value (TopK sparsity) |
|---|---|
| CIFAR-10 | 16 |
| Imagenette | 32 |
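The architecture above can be sketched as a minimal TopK SAE in PyTorch (an illustration with our own names, not the repository's implementation):

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder over ViT hidden states."""
    def __init__(self, d_model=768, hidden_mul=1, k=16):
        super().__init__()
        d_latent = d_model * hidden_mul        # 768-dim latent for hidden_mul=1
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = self.encoder(h)                    # (batch, d_latent) pre-activations
        # TopK sparsity: keep the k largest features per sample, zero the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse  # reconstruction, sparse code

sae = TopKSAE(d_model=768, k=16)               # CIFAR-10 setting; k=32 for Imagenette
h = torch.randn(4, 768)                        # a batch of layer-9 hidden states
h_hat, z = sae(h)
```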
Restoration mode:
- `direct_injection`: gradual addition method (default, used in experiments)
- Only target-class samples are restored
- Non-target samples remain untouched
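In sketch form, direct injection adds the expert features' decoder directions to the unlearned model's hidden states, scaled by `alpha`, for target-class samples only. The code below is illustrative, with our own function and variable names; see `recovery_test.py` for the actual implementation.

```python
import torch

def direct_injection(h, decoder_weight, expert_idx, is_target, alpha=1.0):
    """Add expert-feature decoder directions to hidden states (gradual addition).

    h              : (batch, d_model) hidden states from the unlearned model
    decoder_weight : (d_model, d_latent) SAE decoder weight
    expert_idx     : indices of the class-specific expert features
    is_target      : (batch,) bool mask, True for target-class samples
    alpha          : steering strength (swept over e.g. 1.0, 5.0, 10.0)
    """
    direction = decoder_weight[:, expert_idx].sum(dim=1)   # (d_model,)
    steered = h + alpha * direction
    # Restore only target-class samples; leave the rest untouched.
    return torch.where(is_target.unsqueeze(-1), steered, h)

h = torch.randn(4, 768)
W_dec = torch.randn(768, 768)                              # stand-in SAE decoder weight
mask = torch.tensor([True, False, True, False])
h_restored = direct_injection(h, W_dec, expert_idx=[3, 17], is_target=mask, alpha=5.0)
```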
This repository supports the CIFAR-10 and Imagenette datasets used in our experiments.
If you find this work useful, please cite our paper:
```bibtex
@article{jang2026suppression,
  title={{Suppression or Deletion}: A Restoration-Based Representation-Level Analysis of Machine Unlearning},
  author={Jang, Yurim and Lee, Jaeung and Kim, Dohyun and Jo, Jaemin and Woo, Simon S},
  journal={arXiv preprint arXiv:2602.18505},
  year={2026}
}
```