This repository contains a test dataset for the evaluation of the KIdeKu Toxicity Detection Pipeline (https://github.com/debatelab/toxicity-detector). It includes the test dataset, notebooks for data generation and analysis, and the detection data generated by the pipeline.
Important
This repository is part of ongoing research work. Due to its small size and construction, the provided test dataset can only offer limited, exploratory insights into the performance of the toxicity detection pipeline. Please refer to the project page of KIdeKu (https://compphil2mmae.github.io/research/kideku/) for more information about the project and its outcomes.
Warning
The test dataset contains user-generated content from social media which may include offensive language. Please exercise caution when accessing or using the dataset.
- `config/`: YAML configuration files and metadata for the evaluation.
- `data/`:
  - `kideku_tox_gold.csv`: The test dataset for toxicity detection.
  - `detection_outputs/`: Raw model outputs (YAML files), organized by date.
  - `evaluation/`: Aggregated model outputs used as the basis for the evaluation.
    - `eval_run_20260119.csv`: Example evaluation output file generated with `pipeline_config_01.yaml` (and 5 text inputs from the test dataset).
    - `eval_run_20260120.csv`: Example evaluation output file generated with `pipeline_config_02.yaml` (and 5 text inputs from the test dataset).
    - `eval_run_20260120_1.csv`: Evaluation output file generated with `pipeline_config_02.yaml` and the whole test dataset.
- `notebooks/`: Jupyter notebooks for data generation and analysis.
- `pyproject.toml`: Project metadata and dependency definitions.
This dataset is a merged and re-annotated collection of German social media comments used for evaluating the KIdeKu toxicity detector. It combines subsets from two prominent datasets: HASOC 2019 (Goldstandard) and GermEval 2018 (Test).
The dataset contains a total of 285 entries, including original labels and new annotations from two independent annotators.
The dataset was created to provide a high-quality "gold standard" for toxicity classification, specifically distinguishing between personalized and group-based toxicity. Additionally, the dataset addresses ambiguities and uncertainties in annotations by allowing annotators to flag uncertain cases.
- HASOC 2019 (https://hasocfire.github.io/hasoc/2019/): Subset of the German Goldstandard.
- GermEval 2018 (https://github.com/uds-lsv/GermEval-2018-Data): Subset of the Test dataset.
A subset of these datasets (61 from HASOC, 53 from GermEval) was re-annotated by two annotators. The annotation followed a common guideline:
- Personalized Toxicity (`PERS`): Insults, threats, or harassment directed at an individual without reference to group membership.
- Group-based Toxicity (`GRUP`): Hate speech directed at a group or at individuals as representatives of a group (based on religion, origin, sexual orientation, etc.).
- `BOTH`: Contains both types of toxicity.
- `NONE`: No toxicity detected.
The annotators worked independently and later aligned on specific edge cases (e.g., treatment of political groups as GRUP).
See also `config/eval_metadata.yaml` for details about the dataset structure.
The file `kideku_tox_gold.csv` contains the following columns:
| Column | Description |
|---|---|
| `text_id` | Unique identifier. GermEval IDs are prefixed with `germeval2018_`, HASOC IDs with `hasoc_de`. |
| `text` | The raw text of the comment. |
| `TOX_ANNO_1` | Toxicity label from Annotator 1. |
| `UNCERTAINTY_ANNO_1` | Flag (1) if Annotator 1 was uncertain. |
| `REMARKS_ANNO_1` | Optional remarks from Annotator 1. |
| `TOX_ANNO_2` | Toxicity label from Annotator 2. |
| `UNCERTAINTY_ANNO_2` | Flag (1) if Annotator 2 was uncertain. |
| `REMARKS_ANNO_2` | Optional remarks from Annotator 2. |
| `TOX_ANNO_3` | Mapped original label from the source dataset (HASOC/GermEval). |
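For a quick orientation, the table can be inspected programmatically. The following is a minimal sketch, assuming pandas and default CSV parsing; the column names are those listed above and the path is relative to the repository root:

```python
import pandas as pd

# Load the gold standard and inspect the annotation columns described above.
gold = pd.read_csv("data/kideku_tox_gold.csv")
print(gold[["text_id", "TOX_ANNO_1", "TOX_ANNO_2", "TOX_ANNO_3"]].head())

# Count how often each annotator set the uncertainty flag.
print(gold[["UNCERTAINTY_ANNO_1", "UNCERTAINTY_ANNO_2"]].sum())
```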
The original labels from the source datasets were mapped to our common format as follows:
HASOC 2019 (Subtask B):
- `OFFN` → `PERS`
- `HATE` → `GRUP`
- `PRFN` → `NONE`
- `NONE` → `NONE`
GermEval 2018 (Subtask 2):
- `INSULT` → `PERS`
- `ABUSE` → `GRUP`
- `PROFANITY` → `NONE`
- `OTHER` → `NONE`
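For illustration, the mapping can be expressed as plain Python dictionaries. This is a hedged sketch; the dictionary and function names below are illustrative and not part of the pipeline code:

```python
# Label mapping from the original datasets to the common annotation scheme (stored in TOX_ANNO_3).
HASOC_TO_GOLD = {"OFFN": "PERS", "HATE": "GRUP", "PRFN": "NONE", "NONE": "NONE"}
GERMEVAL_TO_GOLD = {"INSULT": "PERS", "ABUSE": "GRUP", "PROFANITY": "NONE", "OTHER": "NONE"}

def map_original_label(label: str, source: str) -> str:
    """Map an original HASOC/GermEval label to the common scheme."""
    mapping = HASOC_TO_GOLD if source == "hasoc" else GERMEVAL_TO_GOLD
    return mapping[label]
```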
We defined an aggregated label (column `AGGREGATE_LABEL`), which we use as the standard for evaluating the performance of the toxicity detection pipeline.
Rough idea:
- The original labels are not taken into account for the aggregated label, since they stem from different annotation guidelines and are not fully compatible with our current annotation scheme. This is corroborated by the low agreement between the original labels and the new annotations (~57% or 0.3 Krippendorff's alpha; see `notebooks/goldstandard_analysis.ipynb` for details).
- We construct the aggregate label as follows: We assign `UNCLEAR` if the two annotators disagree or if both set an uncertainty flag. Otherwise, we take the label both annotators used.
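The rule described above can be summarized in a few lines of Python. This is a sketch of the logic only; the function name and the handling of missing values are assumptions, and the actual construction of the column may differ in detail:

```python
def aggregate_label(row) -> str:
    """Derive AGGREGATE_LABEL from the two independent annotations."""
    # Disagreement between the annotators -> UNCLEAR.
    if row["TOX_ANNO_1"] != row["TOX_ANNO_2"]:
        return "UNCLEAR"
    # Both annotators flagged the case as uncertain -> UNCLEAR.
    if row["UNCERTAINTY_ANNO_1"] == 1 and row["UNCERTAINTY_ANNO_2"] == 1:
        return "UNCLEAR"
    # Otherwise both annotators agree: take their shared label.
    return row["TOX_ANNO_1"]

# Example usage: gold["AGGREGATE_LABEL"] = gold.apply(aggregate_label, axis=1)
```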
Raw detection outputs from various model runs are stored in the `data/detection_outputs/` directory, organized by date. These files contain everything needed to reproduce a run (including model parameters and prompt templates). Each run generates a CSV file in `data/evaluation` containing model predictions alongside the gold standard labels (files of the form `eval_run_YYYYMMDD.csv`).
These tables contain all relevant columns from `kideku_tox_gold.csv` along with model predictions in separate columns:

- `PERS_<model identifier>`: Predicted personalized toxicity label by the model.
- `HATE_<model identifier>`: Predicted hatespeech label by the model.
- `TOX_ANNO_<model identifier>`: Aggregated predicted toxicity label by the model (see below).
- `HATE_DETECTION_UID_<model identifier>` & `PERS_DETECTION_UID_<model identifier>`: Unique identifiers for the model run (these UIDs refer to the relevant YAML output files in `detection_outputs/`, which contain the specifics of each pipeline run, e.g., the full configuration, preliminary outcomes of the pipeline steps, etc.).
- `PIPELINE_CONFIG`: The pipeline configuration file used for the run.
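To give an idea of how such a table can be used, the following sketch computes the share of rows where a model's aggregated prediction matches the aggregate gold label. The model identifier is a placeholder and must be replaced by the identifier actually used in the column names:

```python
import pandas as pd

# Load one of the evaluation tables described above.
eval_df = pd.read_csv("data/evaluation/eval_run_20260120_1.csv")
model_id = "<model identifier>"  # placeholder: replace with the identifier used in the column names

# Share of rows where the model's aggregated prediction matches the aggregate gold label.
accuracy = (eval_df[f"TOX_ANNO_{model_id}"] == eval_df["AGGREGATE_LABEL"]).mean()
print(f"Agreement with AGGREGATE_LABEL: {accuracy:.2%}")
```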
The pipeline is designed to answer two separate detection tasks: detection of personalized toxicity and detection of group-based toxicity (hatespeech). Each task produces a separate output label (one of "true", "false" and "unclear"). The final toxicity label is then derived from these two outputs as follows:
| Personalized Toxicity | Hatespeech | Final Label |
|---|---|---|
| `false` | `false` | `NONE` |
| `true` | `false` | `PERS` |
| `false` | `true` | `GRUP` |
| `true` | `true` | `BOTH` |
| `unclear` | any | `UNCLEAR` |
| any | `unclear` | `UNCLEAR` |
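In code, the combination rule from the table above reads roughly as follows (a sketch; the function name is illustrative, not the pipeline's actual API):

```python
def combine_labels(personalized: str, hatespeech: str) -> str:
    """Derive the final toxicity label from the two task outputs ('true', 'false', 'unclear')."""
    if personalized == "unclear" or hatespeech == "unclear":
        return "UNCLEAR"
    if personalized == "true" and hatespeech == "true":
        return "BOTH"
    if personalized == "true":
        return "PERS"
    if hatespeech == "true":
        return "GRUP"
    return "NONE"
```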
The Toxicity Detector and its evaluation are part of the project "Opportunities of AI to Strengthen Our Deliberative Culture" (KIdeKu), which was funded by the Federal Ministry of Education, Family Affairs, Senior Citizens, Women and Youth (BMBFSFJ).