This repository contains the official implementation of our paper:
“Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning”
The rapid growth of harmful online content—such as toxicity, spam, and negative sentiment—calls for robust, adaptive, and user-centered moderation systems.
However, existing moderation approaches are typically centralized, task-specific, and opaque, offering little flexibility for diverse user preferences or decentralized environments.
This project introduces a novel framework that leverages in-context learning (ICL) with foundation models to unify harmful content detection across binary, multi-class, and multi-label formulations.
Our approach enables lightweight personalization, allowing users to:
- Block new categories of harmful content
- Unblock existing ones
- Extend detection to semantically similar variations
— all through simple prompt-level customization, without any model retraining.
Experiments on TextDetox, UCI SMS, SST2, and our newly annotated Mastodon dataset demonstrate that:
- Foundation models achieve strong cross-task generalization, often matching or surpassing fine-tuned baselines.
- Personalization is effective with as few as one user-provided example or definition.
- Rationale-augmented prompts improve robustness to noisy, real-world data.
Overall, our study moves beyond one-size-fits-all moderation, establishing ICL as a practical, privacy-preserving, and highly adaptable foundation for user-centric content safety systems.
We use three public datasets and one newly collected multi-class, multi-label dataset from Mastodon.
All datasets are publicly available:
This module fine-tunes the BERT-base-uncased model on three benchmark datasets—TextDetox, UCI SMS, and SST2—to establish strong supervised baselines for harmful content detection. It includes scripts for training, evaluation, and configuration management, enabling reproducible baseline comparisons against in-context learning (ICL) approaches.
👉 Implementation: baselines/
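For reference, supervised fine-tuning of this kind typically boils down to the following minimal sketch. It is not the repo's exact training script: the file paths, column names, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a supervised baseline: fine-tune bert-base-uncased for binary
# harmful-content classification with Hugging Face Transformers.
# File paths, column names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Expects CSV files with "text" and integer "label" columns (hypothetical paths).
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-baseline",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
print(trainer.evaluate())
```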
This component investigates the capability of in-context learning (ICL) for harmful content detection across binary, multi-class, and multi-task setups. It provides Jupyter notebooks for experimenting with various retrieval strategies, prompt templates, and foundation models (Llama, Mistral, Qwen).
👉 Implementation: single-and-multi-task_icl/
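At its core, each notebook prepends a handful of labeled demonstrations (retrieved or sampled from the training set) to the query before sending it to the foundation model. A minimal sketch of this prompt construction follows; the template wording and helper name are illustrative, not the notebooks' exact prompts.

```python
# Illustrative sketch of few-shot ICL prompt construction for binary harmful-content
# detection. The instruction and template wording are assumptions, not the exact prompt.
def build_icl_prompt(query: str, demonstrations: list[tuple[str, str]]) -> str:
    lines = ["You are a content moderator. Classify each text as 'harmful' or 'harmless'.", ""]
    for text, label in demonstrations:      # k retrieved or randomly sampled examples
        lines.append(f"Text: {text}\nLabel: {label}\n")
    lines.append(f"Text: {query}\nLabel:")  # the model completes the final label
    return "\n".join(lines)

demos = [("Buy cheap meds now!!!", "harmful"),
         ("See you at the park at 5.", "harmless")]
print(build_icl_prompt("Click this link to claim your prize", demos))
```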
This module demonstrates user-specific moderation customization via ICL. It simulates three real-world personalization scenarios—blocking new harmful categories, unblocking acceptable ones, and blocking semantic variations—all achieved through prompt-level updates without retraining.
👉 Implementation: personalization/
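Conceptually, each personalization scenario is just an edit to the instruction block of the prompt. The rough illustration below uses hypothetical category names, definitions, and wording rather than the module's actual templates.

```python
# Illustrative sketch of prompt-level personalization: the blocked/allowed category
# lists and the user-provided definition are hypothetical examples.
def personalized_instruction(blocked: list[str], allowed: list[str],
                             definitions: dict[str, str]) -> str:
    parts = [f"Flag a post if it contains any of: {', '.join(blocked)}."]
    if allowed:
        parts.append(f"Do NOT flag posts that only contain: {', '.join(allowed)}.")
    for category, definition in definitions.items():
        # A single user-provided definition is enough to introduce a new category.
        parts.append(f"'{category}' means: {definition}")
    return " ".join(parts)

# Block a new category (gambling ads), unblock mild profanity, and define the new category.
print(personalized_instruction(
    blocked=["toxicity", "spam", "gambling ads"],
    allowed=["mild profanity"],
    definitions={"gambling ads": "promotions of betting sites or casino bonuses"},
))
```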
This section evaluates model robustness on Mastodon wild data, a noisy, real-world dataset covering multiple harmful content types. It explores ICL performance under binary, multi-class, and multi-label formulations, and introduces rationale-augmented prompts to enhance generalization and reduce false positives.
👉 Implementation: evalOnWild/
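Rationale augmentation simply attaches a short explanation to each demonstration so the model is nudged to reason before labeling. A hypothetical demonstration in that format (not the exact prompt used in evalOnWild/) might look like:

```python
# Hypothetical rationale-augmented demonstration: label paired with a short explanation.
demo = (
    "Text: Win a free iPhone!! Just send your bank details to claim it.\n"
    "Rationale: The post asks for bank details in exchange for a prize, a typical scam pattern.\n"
    "Label: spam"
)
print(demo)
```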
Create a new conda environment with:

```bash
conda create -n icl python=3.12
```

Activate the environment:

```bash
conda activate icl
```

Install all dependencies with:

```bash
pip install -r requirements.txt
```
Note:
You may encounter compatibility issues related to `retriv` and `textattack`.
We recommend two possible solutions:
- Create a separate virtual environment for `retriv` and pre-generate retrieved demonstrations from the training set (see the sketch below).
- Use a slightly older version of `vllm`, such as `vllm==0.6.1.post2`, to ensure compatibility (though inference may be slower).

A similar approach applies to resolving `textattack` issues.
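If you take the first route, the pre-generation step could look like the following minimal sketch, assuming `retriv`'s BM25 `SparseRetriever` interface; the file paths and JSONL field names are assumptions about your data layout.

```python
# Run this inside the separate `retriv` environment, then load the cached
# demonstrations from the main `icl` environment. Paths and fields are assumptions.
import json
from retriv import SparseRetriever

# Index the training set; each JSONL line is expected to look like {"id": ..., "text": ...}.
sr = SparseRetriever(index_name="icl-demos")
sr = sr.index_file(path="data/train.jsonl")

# Retrieve the top-k most similar training examples for every test query and cache
# them on disk, so the main environment never needs to import retriv.
with open("data/test.jsonl") as f, open("data/retrieved_demos.jsonl", "w") as out:
    for line in f:
        query = json.loads(line)
        hits = sr.search(query=query["text"], return_docs=True, cutoff=8)
        demos = [{"id": h["id"], "text": h["text"]} for h in hits]
        out.write(json.dumps({"id": query["id"], "demos": demos}) + "\n")
```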
To run the notebooks, you need to download the following models from Hugging Face:
Example: Download Qwen2-7B-Instruct

```bash
# 1. Install the Hugging Face CLI if not already installed
pip install -U "huggingface_hub[cli]"

# 2. Log in to your Hugging Face account
huggingface-cli login

# 3. Download the full model repository
huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir ./models/Qwen2-7B-Instruct
```
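Alternatively, you can fetch the same snapshot from Python with `huggingface_hub` (equivalent to the CLI command above):

```python
# Download the model repository programmatically (equivalent to the CLI call above).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2-7B-Instruct",
    local_dir="./models/Qwen2-7B-Instruct",
)
```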
You can reproduce the results reported in our paper by executing the provided notebooks cell by cell.
Taking multi-task binary classification as an example:
- Locate the implementation

  Open the notebook: `icl.ipynb`

- Run the notebook step by step

  - Cell 1: Import the required libraries (see Import libraries section)
  - Cell 2: Initialize the model (see Initialize the model section)
    - Note: you need to replace `model` with your downloaded model path.
  - Cell 3: Load the dataset (see multi-task/binary classification section)
    - Note: The notebook may include multiple dataset-loading cells for different tasks. Choose the one corresponding to your target task.
  - Cell 4: Run the ICL experiment (see Run experiment section; a rough sketch of this step is shown below)
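For orientation, the model-initialization and inference steps (Cells 2 and 4) reduce to something like the following vLLM sketch; the model path, sampling settings, and prompt are placeholders, not the notebook's exact configuration.

```python
# Rough sketch of Cells 2 and 4: load a local model with vLLM and run greedy inference.
# The model path, sampling settings, and prompt below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="./models/Qwen2-7B-Instruct")   # replace with your downloaded model path
params = SamplingParams(temperature=0.0, max_tokens=16)

prompts = ["Classify the following text as 'harmful' or 'harmless'.\n"
           "Text: Click here to claim your prize!\nLabel:"]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text.strip())
```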
You can also use the provided code `predict.py` to detect a single text or a batch of texts (CSV file) in different task scenarios. Some examples are shown below:

- Detect whether a text is spam or not:

  ```bash
  python predict.py --harmful_category spam --classification binary --model_path your_model_path --text "This is a spam text"
  ```

- Detect whether a text is harmful or not in binary classification:

  ```bash
  python predict.py --harmful_category all --classification binary --model_path your_model_path --text "This is a harmful text"
  ```

- Detect a text's harmful categories in multi-class classification:

  ```bash
  python predict.py --harmful_category all --classification multi-class --model_path your_model_path --text "This is a harmful text"
  ```

- Detect a text's harmful categories in multi-label classification:

  ```bash
  python predict.py --harmful_category all --classification multi-label --model_path your_model_path --text "This is a harmful text"
  ```

If you find this work useful, please cite our paper:
```bibtex
@misc{zhang2025onesizefitsallpersonalizedharmfulcontent,
      title={Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning},
      author={Rufan Zhang and Lin Zhang and Xianghang Mi},
      year={2025},
      eprint={2511.05532},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05532},
}
```