From 40bdf2081ce70941a4bce75df526eae1859276fc Mon Sep 17 00:00:00 2001 From: David Dale Date: Mon, 28 Jul 2025 16:31:44 +0000 Subject: [PATCH 1/2] add a notebook with finetuning example --- ...inetune_sonar_as_toxicity_classifier.ipynb | 3119 +++++++++++++++++ 1 file changed, 3119 insertions(+) create mode 100644 examples/finetune_sonar_as_toxicity_classifier.ipynb diff --git a/examples/finetune_sonar_as_toxicity_classifier.ipynb b/examples/finetune_sonar_as_toxicity_classifier.ipynb new file mode 100644 index 0000000..b2bc852 --- /dev/null +++ b/examples/finetune_sonar_as_toxicity_classifier.ipynb @@ -0,0 +1,3119 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0ee0665d-ab0b-4ebf-a6a5-613ac1072f93", + "metadata": {}, + "source": [ + "In this notebook, we train two SONAR-based toxicity classifiers, to illustrate two things:\n", + "- How a classification head on top of SONAR can be trained;\n", + "- How a SONAR encoder can be finetuned on a downstream task. \n", + "\n", + "The first model adds a classifier layer on top of the frozen SONAR text encoder (like the [MuTox](https://aclanthology.org/2024.findings-acl.340/) model). The second also unfreezes the encoder, allowing it to modify the underlying sentence representation to potentially better fit the task. \n", + "\n", + "In both cases, we will be using the unofficial HuggingFace port [cointegrated/SONAR_200_text_encoder](https://huggingface.co/cointegrated/SONAR_200_text_encoder) of the encoder, simply to avoid messing with fairseq2, in which the original SONAR is implemented. 
\n", + "\n", + "We will be using two datasets for training and evaluation:\n", + "- MuTox (text transcriptions; the original dataset is speech-based)\n", + " - see https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/cli/toxicity/mutox/README.md\n", + "- Jigsaw toxic comments from three competitions:\n", + " - https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data\n", + " - https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data\n", + " - https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data\n", + " \n", + "\n", + "Note: this notebook trains the models using 8 A100 GPUs. If you have a different hardware configuration, consider adjusting the training parameters accordingly (e.g. reducing the batch size if it does not fit into GPU memory, or even downsampling the dataset). Please note that some kind of hardware accelerator is indispensable for running this training within a reasonable time. If you do not have one, consider using a service such as Google Colab. \n", + "\n", + "**WARNING**: this notebook might contain examples of offensive or otherwise disturbing texts in various languages. 
\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "7bfdd63f-069c-4401-9581-3049ffbc8d34", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import torch\n", + "import datasets\n", + "\n", + "from typing import Optional, Union\n", + "from torch import nn\n", + "from torch.nn import CrossEntropyLoss\n", + "from transformers import AutoTokenizer\n", + "from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder, M2M100PreTrainedModel\n", + "from transformers.modeling_outputs import SequenceClassifierOutput\n", + "\n", + "from transformers import TrainingArguments, Trainer, DataCollatorWithPadding\n", + "from sklearn.metrics import roc_auc_score\n", + "from tqdm.auto import tqdm\n", + "import gc" + ] + }, + { + "cell_type": "markdown", + "id": "87e7e751-ba58-426e-97bd-aa7e10e579fc", + "metadata": {}, + "source": [ + "Setting up some directories" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "0f97515b-fe3b-480e-aebb-f0ca37ee8e86", + "metadata": {}, + "outputs": [], + "source": [ + "DATA_DIR = \"data\" # you will need to download the Jigsaw datasets there\n", + "MODELS_DIR = \"models\" # the trained models will be saved there\n", + "\n", + "# the two models will be saved here:\n", + "MODEL1_DIR = f\"{MODELS_DIR}/SONAR_toxicity_classifier_frozen\"\n", + "MODEL2_DIR = f\"{MODELS_DIR}/SONAR_toxicity_classifier_unfrozen\"" + ] + }, + { + "cell_type": "markdown", + "id": "6eb01105-4288-4161-8983-925e239a391f", + "metadata": {}, + "source": [ + "Point to the model that we will fine-tune: https://huggingface.co/cointegrated/SONAR_200_text_encoder" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "fcc9aa22-5ea7-4f93-b0bc-9bcbb40586b2", + "metadata": {}, + "outputs": [], + "source": [ + "BASE_MODEL_NAME = \"cointegrated/SONAR_200_text_encoder\"" + ] + }, + { + "cell_type": "markdown", + "id": "2b39a8d8-2a5d-429e-8d18-df8824af5d67", + "metadata": {}, + "source": [ + "Load 
the tokenizer in advance (we will need it while preparing the data). " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "9f31a2cd-f68d-41a8-bd2c-12858e982d05", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)" + ] + }, + { + "cell_type": "markdown", + "id": "9ca989c3-7743-4c43-be4b-64ee07fba110", + "metadata": {}, + "source": [ + "# 1. Compiling the datasets" + ] + }, + { + "cell_type": "markdown", + "id": "9cf437e5-16ac-4976-bd8f-acfb6845ce9f", + "metadata": {}, + "source": [ + "## Mutox" + ] + }, + { + "cell_type": "markdown", + "id": "b722089d-87b5-437a-95b9-da646cbb786a", + "metadata": {}, + "source": [ + "The Mutox text dataset can be downloaded directly:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "98a30e19-4393-4bfd-a18a-09d269ea1219", + "metadata": {}, + "outputs": [], + "source": [ + "mutox_raw = pd.read_csv(\"https://dl.fbaipublicfiles.com/seamless/datasets/mutox.tsv\", sep=\"\\t\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "6c80449d-b166-4af1-897c-2daf4d674ad6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idlangpartitionpublic_url_segmentaudio_file_transcriptcontains_toxicitytoxicity_typesperlocutionary_effectslabeletox_resultdetoxify_scoremutox_speech_scoremutox_text_scoremutox_zero_shot_speech_scoremutox_zero_shot_text_score
46427fra_1181276fratrainhttp://www.archive.org/download/notredameparis...élas, mon frère, c'est que vous aviez bien rai...NoNaNNaN0NaN0.001685-24.469976-28.979309-12.300150-18.465229
8575341677.0elltrainhttps://bauernordic-pods.sharp-stream.com/gr/1...Και φτάνεις σε ένα σημείο και λες δεν μπορώ ν...NoNaNNaN0NaN0.768185-8.917057-18.83962812.581656-13.409579
57571por_370751pordevtesthttps://feeds.soundcloud.com/stream/1121363227...so para quando algm sangra ou cae no chãoae pa...YesprofanitiesNone of the above1merdas0.976207-0.57349231.779409-12.019632-1.681559
\n", + "
" + ], + "text/plain": [ + " id lang partition \\\n", + "46427 fra_1181276 fra train \n", + "85753 41677.0 ell train \n", + "57571 por_370751 por devtest \n", + "\n", + " public_url_segment \\\n", + "46427 http://www.archive.org/download/notredameparis... \n", + "85753 https://bauernordic-pods.sharp-stream.com/gr/1... \n", + "57571 https://feeds.soundcloud.com/stream/1121363227... \n", + "\n", + " audio_file_transcript contains_toxicity \\\n", + "46427 élas, mon frère, c'est que vous aviez bien rai... No \n", + "85753 Και φτάνεις σε ένα σημείο και λες δεν μπορώ ν... No \n", + "57571 so para quando algm sangra ou cae no chãoae pa... Yes \n", + "\n", + " toxicity_types perlocutionary_effects label etox_result \\\n", + "46427 NaN NaN 0 NaN \n", + "85753 NaN NaN 0 NaN \n", + "57571 profanities None of the above 1 merdas \n", + "\n", + " detoxify_score mutox_speech_score mutox_text_score \\\n", + "46427 0.001685 -24.469976 -28.979309 \n", + "85753 0.768185 -8.917057 -18.839628 \n", + "57571 0.976207 -0.573492 31.779409 \n", + "\n", + " mutox_zero_shot_speech_score mutox_zero_shot_text_score \n", + "46427 -12.300150 -18.465229 \n", + "85753 12.581656 -13.409579 \n", + "57571 -12.019632 -1.681559 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw.sample(3)" + ] + }, + { + "cell_type": "markdown", + "id": "2017e345-f799-4a9c-aad0-830ee16d7978", + "metadata": {}, + "source": [ + "Note that Mutox contains a certain portion of the texts that are missing or inadequately short; we would like to filter them out. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "2c70bb38-87ee-488d-b5aa-f74ad0dcebf7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3441\n", + "122\n" + ] + } + ], + "source": [ + "print(mutox_raw['audio_file_transcript'].isnull().sum())\n", + "print((mutox_raw['audio_file_transcript'].str.len() < 5).sum())" + ] + }, + { + "cell_type": "markdown", + "id": "42766ecf-c253-4717-816b-bdbe7894e731", + "metadata": {}, + "source": [ + "The dataset has predefined train/dev/devtest partitions, and it covers plenty of languages: " + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "ddfb39e0-1445-40ca-9179-ecaa3b77b978", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "partition\n", + "train 67971\n", + "dev 22273\n", + "devtest 10691\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw[\"partition\"].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "e1ec505b-8f15-4d74-8b5b-0e355c282167", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "lang\n", + "eng 17082\n", + "spa 17059\n", + "cat 2787\n", + "bul 2650\n", + "est 2631\n", + "ita 2615\n", + "hin 2601\n", + "arb 2560\n", + "por 2551\n", + "heb 2550\n", + "swh 2542\n", + "nld 2542\n", + "tgl 2527\n", + "ben 2526\n", + "ell 2526\n", + "dan 2525\n", + "fin 2524\n", + "vie 2523\n", + "fra 2523\n", + "tur 2520\n", + "rus 2519\n", + "slk 2518\n", + "urd 2517\n", + "pol 2515\n", + "deu 2514\n", + "cmn 2513\n", + "pes 2513\n", + "ces 2512\n", + "ind 2510\n", + "hun 2501\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw[\"lang\"].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "ca9681f6-5f47-420a-8457-b84468255fcd", + "metadata": {}, + "outputs": [ + { + "data": + "text/plain": [ + "contains_toxicity\n", + "No 87887\n", + "Yes 13048\n", + "Cannot say 2623\n", + "Cannot Say 1927\n", + "yes 11\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw['contains_toxicity'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "id": "3ccdbdb0-dfdf-41e8-ae50-8b7a1039ba7a", + "metadata": {}, + "source": [ + "Cleaning the dataset and binarizing the label" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "ad3c46a3-aad1-4e04-92d4-689127895bbd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(105496, 16)\n", + "(97509, 17)\n" + ] + } + ], + "source": [ + "mutox_cleaned = mutox_raw[\n", + " mutox_raw['audio_file_transcript'].notnull()\n", + " & mutox_raw['audio_file_transcript'].str.len().ge(5)\n", + " & mutox_raw['contains_toxicity'].isin({'Yes', 'yes', 'No'})\n", + "].copy()\n", + "mutox_cleaned['toxic'] = mutox_cleaned['contains_toxicity'].apply({'Yes': 1, 'yes': 1, 'No': 0}.get)\n", + "print(mutox_raw.shape)\n", + "print(mutox_cleaned.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "f8cebda7-47e9-4f32-8af0-49170ec5b933", + "metadata": {}, + "source": [ + "## Jigsaw" + ] + }, + { + "cell_type": "markdown", + "id": "43734bae-35cb-4ab2-8183-0c286ad557f9", + "metadata": {}, + "source": [ + "For Jigsaw, you need to access the datasets manually. \n", + "\n", + "1. Please log into Kaggle, go to https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data, and \"download all\" the data into the `$DATA_DIR` directory (it is \"data\" for me, but you can use a different directory if you want). \n", + "2. Unzip the data. 
\n", + "\n", + "On Unix-like operating systems, the command to unzip them would go like this:\n", + "\n", + "```\n", + "!cd $DATA_DIR && unzip jigsaw-multilingual-toxic-comment-classification.zip\n", + "```\n", + "\n", + "The result will include the datasets from all three competitions. The directory contents could look as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "7363170e-ab23-48dc-ae44-067bd2ecb10f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "jigsaw-multilingual-toxic-comment-classification.zip\n", + "jigsaw-toxic-comment-train-processed-seqlen128.csv\n", + "jigsaw-toxic-comment-train.csv\n", + "jigsaw-unintended-bias-train-processed-seqlen128.csv\n", + "jigsaw-unintended-bias-train.csv\n", + "sample_submission.csv\n", + "test-processed-seqlen128.csv\n", + "test.csv\n", + "test_labels.csv\n", + "validation-processed-seqlen128.csv\n", + "validation.csv\n" + ] + } + ], + "source": [ + "!ls $DATA_DIR" + ] + }, + { + "cell_type": "markdown", + "id": "c7f3dfe1-dd80-4066-a411-1cafac6b9627", + "metadata": {}, + "source": [ + "Now we can read the datasets and add some missing columns:" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "74575c4e-5ed4-47c2-a9e9-dabe0093c54a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcomment_texttoxicsevere_toxicobscenethreatinsultidentity_hate
1707682cb5cd297da37b3a, but I don't know what DRV is, I'm afraid000000
123481948a7ca2686cccdeBy the way I will not appeal this block despit...000000
1987119c0a4a0364eb1792In my opinion, it is impossible to explain wha...000000
\n", + "
" + ], + "text/plain": [ + " id comment_text \\\n", + "170768 2cb5cd297da37b3a , but I don't know what DRV is, I'm afraid \n", + "123481 948a7ca2686cccde By the way I will not appeal this block despit... \n", + "198711 9c0a4a0364eb1792 In my opinion, it is impossible to explain wha... \n", + "\n", + " toxic severe_toxic obscene threat insult identity_hate \n", + "170768 0 0 0 0 0 0 \n", + "123481 0 0 0 0 0 0 \n", + "198711 0 0 0 0 0 0 " + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw1_train = pd.read_csv(f\"{DATA_DIR}/jigsaw-toxic-comment-train.csv\")\n", + "jigsaw1_train.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "88a13789-5e6e-4545-8806-d597f831fab1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcomment_texttoxicsevere_toxicityobsceneidentity_attackinsultthreatasianatheist...article_idratingfunnywowsadlikesdisagreesexual_explicitidentity_annotator_counttoxicity_annotator_count
12256155612740Why?0.0000000.0000000.00.00.00.0NaNNaN...356357approved000000.004
6960784973759On the contrary, I believe EVERY new condo in ...0.1666670.1666670.00.00.00.0NaNNaN...317444approved000000.006
11364015504345Money has ruined politics in this country. \\n...0.0000000.0000000.00.00.00.0NaNNaN...350258approved000500.004
\n", + "

3 rows × 45 columns

\n", + "
" + ], + "text/plain": [ + " id comment_text toxic \\\n", + "1225615 5612740 Why? 0.000000 \n", + "696078 4973759 On the contrary, I believe EVERY new condo in ... 0.166667 \n", + "1136401 5504345 Money has ruined politics in this country. \\n... 0.000000 \n", + "\n", + " severe_toxicity obscene identity_attack insult threat asian \\\n", + "1225615 0.000000 0.0 0.0 0.0 0.0 NaN \n", + "696078 0.166667 0.0 0.0 0.0 0.0 NaN \n", + "1136401 0.000000 0.0 0.0 0.0 0.0 NaN \n", + "\n", + " atheist ... article_id rating funny wow sad likes disagree \\\n", + "1225615 NaN ... 356357 approved 0 0 0 0 0 \n", + "696078 NaN ... 317444 approved 0 0 0 0 0 \n", + "1136401 NaN ... 350258 approved 0 0 0 5 0 \n", + "\n", + " sexual_explicit identity_annotator_count toxicity_annotator_count \n", + "1225615 0.0 0 4 \n", + "696078 0.0 0 6 \n", + "1136401 0.0 0 4 \n", + "\n", + "[3 rows x 45 columns]" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw2_train = pd.read_csv(f\"{DATA_DIR}/jigsaw-unintended-bias-train.csv\")\n", + "jigsaw2_train.sample(3)" + ] + }, + { + "cell_type": "markdown", + "id": "252f618e-8f94-4c13-abb6-4144af8e32a3", + "metadata": {}, + "source": [ + "The unintended bias dataset has toxicity labels that are non-binary. We will convert them to binary by applying a threshold of 0.5." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 209, + "id": "3bc054e7-69fb-43c0-879a-a9cba7879124", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAGsCAYAAAAPJKchAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAJ8BJREFUeJzt3X9Q1Pedx/EXPxe5hBhD+KFHipqYH1WR4skR40R7IDUOPeemVy/mlOOiuVS4se6kicQoUhMxRj06OVImJsY6jcGYSUxbGYWScDaRniPKnGnU1KChZwX1vAQC7bKy3/sjw/YoiHzJ7veTxedjxj/2w+fz+b737Sqv+X6/uxtmWZYlAAAAQ8JNFwAAAK5vhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgVEiFkYMHDyovL09jx45VWFiY9u7da3sPy7K0efNmTZo0SS6XS+PGjdMzzzwT+GIBAMCQRJouwI7Ozk6lpaXpn//5n/V3f/d3w9pjxYoVqqmp0ebNmzVlyhRdvnxZly9fDnClAABgqMJC9YvywsLC9NZbb2nBggX+MY/Ho9WrV+u1117Tp59+qsmTJ+vZZ5/V7NmzJUknTpzQ1KlT9cEHH+jOO+80UzgAAOgjpC7TXEtRUZEaGhpUVVWl//qv/9Lf//3f61vf+pZ++9vfSpJ+/vOfa8KECfrFL36h8ePHKzU1VUuXLuXMCAAABo2YMNLS0qJXXnlFe/bs0axZszRx4kQ99thjuu+++/TKK69Ikpqbm/XJJ59oz5492rlzp3bs2KHGxkZ95zvfMVw9AADXr5C6Z2Qwx48fV09PjyZNmtRn3OPx6JZbbpEk+Xw+eTwe7dy50z/v5ZdfVkZGhk6dOsWlGwAADBgxYeTzzz9XRESEGhsbFRER0ednN9xwgyQpOTlZkZGRfQLL3XffLemLMyuEEQAAnDdiwkh6erp6enp04cIFzZo1a8A5M2fO1JUrV/Txxx9r4sSJkqSPPvpIkvS1r33NsVoBAMCfhNS7aT7//HOdPn1a0hfhY+vWrZozZ47GjBmj2267Tf/4j/+o999/X1u2bFF6erouXryouro6TZ06VfPnz5fP59Nf/dVf6YYbblB5ebl8Pp8KCwsVFxenmpoaw88OAIDrU0iFkfr6es2ZM6ffeH5+vnbs2CGv16unn35aO3fu1Llz5xQfH6+//uu/VmlpqaZMmSJJ+v3vf69//dd/VU1Njf7iL/5C8+bN05YtWzRmzBinnw4AANAwwsjBgwf13HPPqbGxUefPn+/3WR+Def/993X//fdr8uTJampqGka5AABgpLH91t7eT0GtqKiwte7TTz/VkiVL9Dd/8zd2DwkAAEawL3WZZqBPQb2af/iHf9Add9yhiIgI7d27lzMjAABAkkPvpnnllVfU3Nysn/70p3r66aevOd/j8cjj8fgf+3w+Xb58WbfccovCwsKCWSoAAAgQy7LU0dGhsWPHKjz86hdjgh5Gfvvb32rVqlX61a9+pcjIoR2urKxMpaWlQa4MAAA44Xe/+53+8i//8qo/D2oY6enp0aJFi1RaWtrvk1EHU1xcLLfb7X/82Wef6bbbbtOZM2d04403Bqw+r9erd999V3PmzFFUVFTA9kVf9Nk59NoZ9NkZ9NkZwexzR0eHxo8ff83f3UENIx0dHTpy5IiOHTumoqIiSV9ccrEsS5GRkaqpqdE3v/nNfutcLpdcLle/8TFjxiguLi5g9Xm9XsXGxuqWW27hhR5E9Nk59NoZ9NkZ9NkZwexz737XusUiqGEkLi5Ox
48f7zP2wgsv6J133tEbb7yh8ePHB/PwAAAgBNgOI///U1Al6cyZM2pqavJ/CmpxcbHOnTunnTt3Kjw8XJMnT+6zPiEhQTExMf3GAQDA9cl2GDly5EifT0Htvbej91NQz58/r5aWlsBVCAAARjTbYWT27Nka7KNJduzYMej6devWad26dXYPCwAARijbn8AKAAAQSIQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYFRQv7U3VExed0CensG/3vir5OzG+aZLAAAgYDgzAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAo22Hk4MGDysvL09ixYxUWFqa9e/cOOv/NN99UTk6Obr31VsXFxSkrK0sHDhwYbr0AAGCEsR1GOjs7lZaWpoqKiiHNP3jwoHJyclRdXa3GxkbNmTNHeXl5OnbsmO1iAQDAyBNpd8G8efM0b968Ic8vLy/v83jDhg16++239fOf/1zp6el2Dw8AAEYY22Hky/L5fOro6NCYMWOuOsfj8cjj8fgft7e3S5K8Xq+8Xm/AaundyxVuBWxPJwSyB07orTfU6g5F9NoZ9NkZ9NkZwezzUPcMsyxr2L+Jw8LC9NZbb2nBggVDXrNp0yZt3LhRJ0+eVEJCwoBz1q1bp9LS0n7ju3btUmxs7HDLBQAADurq6tKiRYv02WefKS4u7qrzHA0ju3bt0rJly/T2228rOzv7qvMGOjOSkpKiS5cuDfpk7PJ6vaqtrdWaI+Hy+MICtm+wfbAu13QJtvT2OScnR1FRUabLGdHotTPoszPoszOC2ef29nbFx8dfM4w4dpmmqqpKS5cu1Z49ewYNIpLkcrnkcrn6jUdFRQXlBenxhcnTEzphJFT/UQbr7w/90Wtn0Gdn0GdnBKPPQ93Pkc8Zee2111RQUKDXXntN8+fPd+KQAAAgRNg+M/L555/r9OnT/sdnzpxRU1OTxowZo9tuu03FxcU6d+6cdu7cKemLSzP5+fn60Y9+pMzMTLW2tkqSRo0apZtuuilATwMAAIQq22dGjhw5ovT0dP/bct1ut9LT07V27VpJ0vnz59XS0uKf/+KLL+rKlSsqLCxUcnKy/8+KFSsC9BQAAEAos31mZPbs2RrsntcdO3b0eVxfX2/3EAAA4DrCd9MAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjbYeTgwYPKy8vT2LFjFRYWpr17915zTX19vb7xjW/I5XLp9ttv144dO4ZRKgAAGIlsh5HOzk6lpaWpoqJiSPPPnDmj+fPna86cOWpqatL3v/99LV26VAcOHLBdLAAAGHki7S6YN2+e5s2bN+T5lZWVGj9+vLZs2SJJuvvuu/Xee+/p3/7t35Sbm2v38AAAYISxHUbsamhoUHZ2dp+x3Nxcff/737/qGo/HI4/H43/c3t4uSfJ6vfJ6vQGrrXcvV7gVsD2dEMgeOKG33lCrOxTRa2fQZ2fQZ2cEs89D3TPoYaS1t
VWJiYl9xhITE9Xe3q4//OEPGjVqVL81ZWVlKi0t7TdeU1Oj2NjYgNe4frov4HsGU3V1tekShqW2ttZ0CdcNeu0M+uwM+uyMYPS5q6trSPOCHkaGo7i4WG632/+4vb1dKSkpmjt3ruLi4gJ2HK/Xq9raWq05Ei6PLyxg+wbbB+tC6/JWb59zcnIUFRVlupwRjV47gz47gz47I5h97r2ycS1BDyNJSUlqa2vrM9bW1qa4uLgBz4pIksvlksvl6jceFRUVlBekxxcmT0/ohJFQ/UcZrL8/9EevnUGfnUGfnRGMPg91v6B/zkhWVpbq6ur6jNXW1iorKyvYhwYAACHAdhj5/PPP1dTUpKamJklfvHW3qalJLS0tkr64xLJkyRL//EcffVTNzc16/PHHdfLkSb3wwgt6/fXXtXLlysA8AwAAENJsh5EjR44oPT1d6enpkiS326309HStXbtWknT+/Hl/MJGk8ePHa9++faqtrVVaWpq2bNmil156ibf1AgAAScO4Z2T27NmyrKu/FXagT1edPXu2jh07ZvdQAADgOsB30wAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAo4YVRioqKpSamqqYmBhlZmbq8OHDg84vLy/XnXfeqVGjRiklJUUrV67UH//4x2EVDAAARhbbYWT37t1yu90qKSnR0aNHlZaWptzcXF24cGHA+bt27dKqVatUUlKiEydO6OWXX9bu3bv15JNPfuniAQBA6LMdRrZu3aply5apoKBA99xzjyorKxUbG6vt27cPOP/QoUOaOXOmFi1apNTUVM2dO1cPPvjgNc+mAACA60Okncnd3d1qbGxUcXGxfyw8PFzZ2dlqaGgYcM29996rn/70pzp8+LBmzJih5uZmVVdXa/HixVc9jsfjkcfj8T9ub2+XJHm9Xnm9XjslD6p3L1e4FbA9nRDIHjiht95QqzsU0Wtn0Gdn0GdnBLPPQ93TVhi5dOmSenp6lJiY2Gc8MTFRJ0+eHHDNokWLdOnSJd13332yLEtXrlzRo48+OuhlmrKyMpWWlvYbr6mpUWxsrJ2Sh2T9dF/A9wym6upq0yUMS21trekSrhv02hn02Rn02RnB6HNXV9eQ5tkKI8NRX1+vDRs26IUXXlBmZqZOnz6tFStWaP369VqzZs2Aa4qLi+V2u/2P29vblZKSorlz5youLi5gtXm9XtXW1mrNkXB5fGEB2zfYPliXa7oEW3r7nJOTo6ioKNPljGj02hn02Rn02RnB7HPvlY1rsRVG4uPjFRERoba2tj7jbW1tSkpKGnDNmjVrtHjxYi1dulSSNGXKFHV2duqRRx7R6tWrFR7e/7YVl8sll8vVbzwqKiooL0iPL0yentAJI6H6jzJYf3/oj147gz47gz47Ixh9Hup+tm5gjY6OVkZGhurq6vxjPp9PdXV1ysrKGnBNV1dXv8AREREhSbKs0LpXAwAABJ7tyzRut1v5+fmaPn26ZsyYofLycnV2dqqgoECStGTJEo0bN05lZWWSpLy8PG3dulXp6en+yzRr1qxRXl6eP5QAAIDrl+0wsnDhQl28eFFr165Va2urpk2bpv379/tvam1paelzJuSpp55SWFiYnnrqKZ07d0633nqr8vLy9MwzzwTuWQAAgJA1rBtYi4qKVFRUNODP6uvr+x4gMlIlJSUqKSkZzqEAAMAIx3fTAAAAowgjAADAKMIIAAAwijACAACMI
owAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMCoYYWRiooKpaamKiYmRpmZmTp8+PCg8z/99FMVFhYqOTlZLpdLkyZNUnV19bAKBgAAI0uk3QW7d++W2+1WZWWlMjMzVV5ertzcXJ06dUoJCQn95nd3dysnJ0cJCQl64403NG7cOH3yyScaPXp0IOoHAAAhznYY2bp1q5YtW6aCggJJUmVlpfbt26ft27dr1apV/eZv375dly9f1qFDhxQVFSVJSk1N/XJVAwCAEcNWGOnu7lZjY6OKi4v9Y+Hh4crOzlZDQ8OAa372s58pKytLhYWFevvtt3Xrrbdq0aJFeuKJJxQRETHgGo/HI4/H43/c3t4uSfJ6vfJ6vXZKHlTvXq5wK2B7OiGQPXBCb72hVncootfOoM/OoM/OCGafh7qnrTBy6dIl9fT0KDExsc94YmKiTp48OeCa5uZmvfPOO3rooYdUXV2t06dPa/ny5fJ6vSopKRlwTVlZmUpLS/uN19TUKDY21k7JQ7J+ui/gewZTqN5vU1tba7qE6wa9dgZ9dgZ9dkYw+tzV1TWkebYv09jl8/mUkJCgF198UREREcrIyNC5c+f03HPPXTWMFBcXy+12+x+3t7crJSVFc+fOVVxcXMBq83q9qq2t1Zoj4fL4wgK2b7B9sC7XdAm29PY5JyfHf6kOwUGvnUGfnUGfnRHMPvde2bgWW2EkPj5eERERamtr6zPe1tampKSkAdckJycrKiqqzyWZu+++W62treru7lZ0dHS/NS6XSy6Xq994VFRUUF6QHl+YPD2hE0ZC9R9lsP7+0B+9dgZ9dgZ9dkYw+jzU/Wy9tTc6OloZGRmqq6vzj/l8PtXV1SkrK2vANTNnztTp06fl8/3pUshHH32k5OTkAYMIAAC4vtj+nBG3261t27bpJz/5iU6cOKHvfe976uzs9L+7ZsmSJX1ucP3e976ny5cva8WKFfroo4+0b98+bdiwQYWFhYF7FgAAIGTZvmdk4cKFunjxotauXavW1lZNmzZN+/fv99/U2tLSovDwP2WclJQUHThwQCtXrtTUqVM1btw4rVixQk888UTgngUAAAhZw7qBtaioSEVFRQP+rL6+vt9YVlaWfv3rXw/nUAAAYITju2kAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYNSwwkhFRYVSU1MVExOjzMxMHT58eEjrqqqqFBYWpgULFgznsAAAYASyHUZ2794tt9utkpISHT16VGlpacrNzdWFCxcGXXf27Fk99thjmjVr1rCLBQAAI4/tMLJ161YtW7ZMBQUFuueee1RZWanY2Fht3779qmt6enr00EMPqbS0VBMmTPhSBQMAgJEl0s7k7u5uNTY2qri42D8WHh6u7OxsNTQ0XHXdD3/4QyUkJOjhhx/Wr371q2sex+PxyOPx+B+3t7dLkrxer7xer52SB9W7lyvcCtieT
ghkD5zQW2+o1R2K6LUz6LMz6LMzgtnnoe5pK4xcunRJPT09SkxM7DOemJiokydPDrjmvffe08svv6ympqYhH6esrEylpaX9xmtqahQbG2un5CFZP90X8D2Dqbq62nQJw1JbW2u6hOsGvXYGfXYGfXZGMPrc1dU1pHm2wohdHR0dWrx4sbZt26b4+PghrysuLpbb7fY/bm9vV0pKiubOnau4uLiA1ef1elVbW6s1R8Ll8YUFbN9g+2BdrukSbOntc05OjqKiokyXM6LRa2fQZ2fQZ2cEs8+9VzauxVYYiY+PV0REhNra2vqMt7W1KSkpqd/8jz/+WGfPnlVeXp5/zOf74ixEZGSkTp06pYkTJ/Zb53K55HK5+o1HRUUF5QXp8YXJ0xM6YSRU/1EG6+8P/dFrZ9BnZ9BnZwSjz0Pdz9YNrNHR0crIyFBdXZ1/zOfzqa6uTllZWf3m33XXXTp+/Liampr8f7797W9rzpw5ampqUkpKip3DAwCAEcj2ZRq32638/HxNnz5dM2bMUHl5uTo7O1VQUCBJWrJkicaNG6eysjLFxMRo8uTJfdaPHj1akvqNAwCA65PtMLJw4UJdvHhRa9euVWtrq6ZNm6b9+/f7b2ptaWlReDgf7AoAAIZmWDewFhUVqaioaMCf1dfXD7p2x44dwzkkAAAYoTiFAQAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjhhVGKioqlJqaqpiYGGVmZurw4cNXnbtt2zbNmjVLN998s26++WZlZ2cPOh8AAFxfbIeR3bt3y+12q6SkREePHlVaWppyc3N14cKFAefX19frwQcf1LvvvquGhgalpKRo7ty5Onfu3JcuHgAAhD7bYWTr1q1atmyZCgoKdM8996iyslKxsbHavn37gPNfffVVLV++XNOmTdNdd92ll156ST6fT3V1dV+6eAAAEPoi7Uzu7u5WY2OjiouL/WPh4eHKzs5WQ0PDkPbo6uqS1+vVmDFjrjrH4/HI4/H4H7e3t0uSvF6vvF6vnZIH1buXK9wK2J5OCGQPnNBbb6jVHYrotTPoszPoszOC2eeh7hlmWdaQfxP//ve/17hx43To0CFlZWX5xx9//HH9x3/8h/7zP//zmnssX75cBw4c0G9+8xvFxMQMOGfdunUqLS3tN75r1y7FxsYOtVwAAGBQV1eXFi1apM8++0xxcXFXnWfrzMiXtXHjRlVVVam+vv6qQUSSiouL5Xa7/Y/b29v995oM9mTs8nq9qq2t1Zoj4fL4wgK2b7B9sC7XdAm29PY5JydHUVFRpssZ0ei1M+izM+izM4LZ594rG9diK4zEx8crIiJCbW1tfcbb2tqUlJQ06NrNmzdr48aN+uUvf6mpU6cOOtflcsnlcvUbj4qKCsoL0uMLk6cndMJIqP6jDNbfH/qj186gz86gz84IRp+Hup+tG1ijo6OVkZHR5+bT3ptR//9lmz+3adMmrV+/Xvv379f06dPtHBIAAIxwti/TuN1u5efna/r06ZoxY4bKy8vV2dmpgoICSdKSJUs0btw4lZWVSZKeffZZrV27Vrt27VJqaqpaW1slSTfccINuuOGGAD4VAAAQimyHkYULF+rixYtau3atWltbNW3aNO3fv1+JiYmSpJaWFoWH/+mEy49//GN1d3frO9/5Tp99SkpKtG7dui9XPULK5HUHQupymCSd3TjfdAkAMOIN6wbWoqIiF
RUVDfiz+vr6Po/Pnj07nEMAAIDrBN9NAwAAjHL0rb0IjNRV+0yXYIsrwtKmGaarAAB8VXFmBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYFSk6QIAIHXVPtMl2OKKsLRphukqgJGDMyMAAMAowggAADCKyzTACDR53QF5esJMlwEAQ8KZEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRfOgZAAxTqH243NmN802XAAyIMAIA+Eoj9I18XKYBAABGDevMSEVFhZ577jm1trYqLS1Nzz//vGbMuPr3ae/Zs0dr1qzR2bNndccdd+jZZ5/VAw88MOyiAQD2pa7aZ7oEW1wRljZd/VcLRhDbZ0Z2794tt9utkpISHT16VGlpacrNzdWFCxcGnH/o0CE9+OCDevjhh3Xs2DEtWLBACxYs0AcffPCliwcAAKHP9pmRrVu3atmyZSooKJAkVVZWat++fdq+fbtWrVrVb/6PfvQjfetb39IPfvADSdL69etVW1urf//3f1dlZeWXLB8AgK8WzkDZZyuMdHd3q7GxUcXFxf6x8PBwZWdnq6GhYcA1DQ0NcrvdfcZyc3O1d+/eqx7H4/HI4/H4H3/22WeSpMuXL8vr9dopeVBer1ddXV2K9Iarxxc6N0eFmkifpa4uX0j2+fbHXjddgi2ucEtPpYdmr0NJKL+mQwl9dkZvn//nf/5HUVFRAd27o6NDkmRZ1uA12Nn00qVL6unpUWJiYp/xxMREnTx5csA1ra2tA85vbW296nHKyspUWlrab3z8+PF2ysVXyCLTBVxH6LUz6LMz6LMzgt3njo4O3XTTTVf9+Vfyrb3FxcV9zqb4fD5dvnxZt9xyi8LCApeO29vblZKSot/97neKi4sL2L7oiz47h147gz47gz47I5h9tixLHR0dGjt27KDzbIWR+Ph4RUREqK2trc94W1ubkpKSBlyTlJRka74kuVwuuVyuPmOjR4+2U6otcXFxvNAdQJ+dQ6+dQZ+dQZ+dEaw+D3ZGpJetd9NER0crIyNDdXV1/jGfz6e6ujplZWUNuCYrK6vPfEmqra296nwAAHB9sX2Zxu12Kz8/X9OnT9eMGTNUXl6uzs5O/7trlixZonHjxqmsrEyStGLFCt1///3asmWL5s+fr6qqKh05ckQvvvhiYJ8JAAAISbbDyMKFC3Xx4kWtXbtWra2tmjZtmvbv3++/SbWlpUXh4X864XLvvfdq165deuqpp/Tkk0/qjjvu0N69ezV58uTAPYthcrlcKikp6XdJCIFFn51Dr51Bn51Bn53xVehzmHWt99sAAAAEEd9NAwAAjCKMAAAAowgjAADAKMIIAAAwasSHkYqKCqWmpiomJkaZmZk6fPjwoPP37Nmju+66SzExMZoyZYqqq6sdqjS02enztm3bNGvWLN188826+eablZ2dfc2/F/yJ3dd0r6qqKoWFhWnBggXBLXCEsNvnTz/9VIWFhUpOTpbL5dKkSZP4/2MI7Pa5vLxcd955p0aNGqWUlBStXLlSf/zjHx2qNjQdPHhQeXl5Gjt2rMLCwgb9brhe9fX1+sY3viGXy6Xbb79dO3bsCG6R1ghWVVVlRUdHW9u3b7d+85vfWMuWLbNGjx5ttbW1DTj//ffftyIiIqxNmzZZH374ofXUU09ZUVFR1vHjxx2uPLTY7fOiRYusiooK69ixY9aJEyesf/qnf7Juuukm67//+78drjz02O11rzNnzljjxo2zZs2aZf3t3/6tM8WGMLt99ng81vTp060HHnjAeu+996wzZ85Y9fX1VlNTk8OVhxa7fX711Vctl8tlvfrqq9aZM2esA
wcOWMnJydbKlSsdrjy0VFdXW6tXr7befPNNS5L11ltvDTq/ubnZio2Ntdxut/Xhhx9azz//vBUREWHt378/aDWO6DAyY8YMq7Cw0P+4p6fHGjt2rFVWVjbg/O9+97vW/Pnz+4xlZmZa//Iv/xLUOkOd3T7/uStXrlg33nij9ZOf/CRYJY4Yw+n1lStXrHvvvdd66aWXrPz8fMLIENjt849//GNrwoQJVnd3t1Mljgh2+1xYWGh985vf7DPmdrutmTNnBrXOkWQoYeTxxx+3vv71r/cZW7hwoZWbmxu0ukbsZZru7m41NjYqOzvbPxYeHq7s7Gw1NDQMuKahoaHPfEnKzc296nwMr89/rqurS16vV2PGjAlWmSPCcHv9wx/+UAkJCXr44YedKDPkDafPP/vZz5SVlaXCwkIlJiZq8uTJ2rBhg3p6epwqO+QMp8/33nuvGhsb/ZdympubVV1drQceeMCRmq8XJn4XfiW/tTcQLl26pJ6eHv8nw/ZKTEzUyZMnB1zT2to64PzW1tag1RnqhtPnP/fEE09o7Nix/V786Gs4vX7vvff08ssvq6mpyYEKR4bh9Lm5uVnvvPOOHnroIVVXV+v06dNavny5vF6vSkpKnCg75Aynz4sWLdKlS5d03333ybIsXblyRY8++qiefPJJJ0q+blztd2F7e7v+8Ic/aNSoUQE/5og9M4LQsHHjRlVVVemtt95STEyM6XJGlI6ODi1evFjbtm1TfHy86XJGNJ/Pp4SEBL344ovKyMjQwoULtXr1alVWVpoubUSpr6/Xhg0b9MILL+jo0aN68803tW/fPq1fv950afiSRuyZkfj4eEVERKitra3PeFtbm5KSkgZck5SUZGs+htfnXps3b9bGjRv1y1/+UlOnTg1mmSOC3V5//PHHOnv2rPLy8vxjPp9PkhQZGalTp05p4sSJwS06BA3nNZ2cnKyoqChFRET4x+6++261traqu7tb0dHRQa05FA2nz2vWrNHixYu1dOlSSdKUKVPU2dmpRx55RKtXr+7zvWgYvqv9LoyLiwvKWRFpBJ8ZiY6OVkZGhurq6vxjPp9PdXV1ysrKGnBNVlZWn/mSVFtbe9X5GF6fJWnTpk1av3699u/fr+nTpztRasiz2+u77rpLx48fV1NTk//Pt7/9bc2ZM0dNTU1KSUlxsvyQMZzX9MyZM3X69Gl/2JOkjz76SMnJyQSRqxhOn7u6uvoFjt4AaPE1awFj5Hdh0G6N/QqoqqqyXC6XtWPHDuvDDz+0HnnkEWv06NFWa2urZVmWtXjxYmvVqlX++e+//74VGRlpbd682Tpx4oRVUlLCW3uHwG6fN27caEVHR1tvvPGGdf78ef+fjo4OU08hZNjt9Z/j3TRDY7fPLS0t1o033mgVFRVZp06dsn7xi19YCQkJ1tNPP23qKYQEu30uKSmxbrzxRuu1116zmpubrZqaGmvixInWd7/7XVNPISR0dHRYx44ds44dO2ZJsrZu3WodO3bM+uSTTyzLsqxVq1ZZixcv9s/vfWvvD37wA+vEiRNWRUUFb+39sp5//nnrtttus6Kjo60ZM2ZYv/71r/0/u//++638/Pw+819//XVr0qRJVnR0tPX1r3/d2rdvn8MVhyY7ff7a175mSer3p6SkxPnCQ5Dd1/T/RxgZOrt9PnTokJWZmWm5XC5rwoQJ1jPPPGNduXLF4apDj50+e71ea926ddbEiROtmJgYKyUlxVq+fLn1v//7v84XHkLefffdAf/P7e1tfn6+df/99/dbM23aNCs6OtqaMGGC9corrwS1xjDL4twWAAAwZ8TeMwIAAEIDYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBR/wfWP+KWs4PFjgAAAABJRU5ErkJggg==", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "jigsaw2_train['toxic'].hist(bins=10);" + ] + }, + { + "cell_type": "code", + "execution_count": 215, + "id": "fa9e08bc-666a-4490-8ae9-eb8c3c8952cb", + "metadata": {}, + "outputs": [], + "source": [ + "jigsaw2_train['toxic_binary'] = jigsaw2_train['toxic'].gt(0.5).astype(int)" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "4a02f61b-e6e4-4cff-b40c-63b11d9c80c0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcomment_textlangtoxic
75027502Sayın hocam sizden bir ricam daha olacak. Şu f...tr0
36073607> Occhio tu piuttosto... con il rollback hai ...it0
29022902È inutile che cercate di buttarla sul ridicolo...it1
\n", + "
" + ], + "text/plain": [ + " id comment_text lang toxic\n", + "7502 7502 Sayın hocam sizden bir ricam daha olacak. Şu f... tr 0\n", + "3607 3607 > Occhio tu piuttosto... con il rollback hai ... it 0\n", + "2902 2902 È inutile che cercate di buttarla sul ridicolo... it 1" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw3_valid = pd.read_csv(f\"{DATA_DIR}/validation.csv\")\n", + "jigsaw3_valid.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "cbc7a0be-9aef-48e4-b954-bcafa9a64c03", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcontentlang
68536853Valla hiç bir fikrim yok, bu akşam bir iki ker...tr
1630716307merhaba, Zeynel Limoncu maddesine yetersiz ve ...tr
4844848448Ненейтрально: Кабумба, Мбеки — министр Пропага...ru
\n", + "
" + ], + "text/plain": [ + " id content lang\n", + "6853 6853 Valla hiç bir fikrim yok, bu akşam bir iki ker... tr\n", + "16307 16307 merhaba, Zeynel Limoncu maddesine yetersiz ve ... tr\n", + "48448 48448 Ненейтрально: Кабумба, Мбеки — министр Пропага... ru" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw3_test = pd.read_csv(f\"{DATA_DIR}/test.csv\")\n", + "jigsaw3_test.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "af357be3-7e56-4cb8-b94e-6c34672f46fa", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idtoxic
34608346080
36778367780
632963290
\n", + "
" + ], + "text/plain": [ + " id toxic\n", + "34608 34608 0\n", + "36778 36778 0\n", + "6329 6329 0" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw3_test_labels = pd.read_csv(f\"{DATA_DIR}/test_labels.csv\")\n", + "jigsaw3_test_labels.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "d9d243b0-7385-4eb2-ab7d-b8c2ccf5c199", + "metadata": {}, + "outputs": [], + "source": [ + "assert (jigsaw3_test['id'] == jigsaw3_test_labels['id']).all()\n", + "jigsaw3_test['toxic'] = jigsaw3_test_labels['toxic']" + ] + }, + { + "cell_type": "markdown", + "id": "f2016942-1230-49c2-875b-5640475e1848", + "metadata": {}, + "source": [ + "## Merge the datasets into one" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "0ed7e70b-1aed-48c3-996d-bb3a13a3114a", + "metadata": {}, + "outputs": [], + "source": [ + "jigsaw1_train['dataset'] = 'jigsaw1'\n", + "jigsaw2_train['dataset'] = 'jigsaw2'\n", + "jigsaw3_valid['dataset'] = 'jigsaw3'\n", + "jigsaw3_test['dataset'] = 'jigsaw3'\n", + "\n", + "jigsaw1_train['split'] = 'train'\n", + "jigsaw2_train['split'] = 'train'\n", + "jigsaw3_valid['split'] = 'valid'\n", + "jigsaw3_test['split'] = 'test'\n", + "\n", + "jigsaw1_train['lang'] = 'en'\n", + "jigsaw2_train['lang'] = 'en'" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "3640b96f-4d23-4282-b35d-95c2543794c5", + "metadata": {}, + "outputs": [], + "source": [ + "mutox_raw['dataset'] = 'mutox'\n", + "mutox_raw" + ] + }, + { + "cell_type": "code", + "execution_count": 229, + "id": "2e7ae6c1-39e7-4224-97b0-23a1c7a9f2f5", + "metadata": {}, + "outputs": [], + "source": [ + "data_pooled = pd.concat([\n", + " jigsaw1_train.rename({'comment_text': 'text'}, axis=1), \n", + " jigsaw2_train[\n", + " # removing the labels with low agreement between annotators\n", + " jigsaw2_train['toxic'].lt(0.3) | jigsaw2_train['toxic'].gt(0.6)\n", + " ].rename({'comment_text': 
'text', 'toxic_binary': 'toxic', 'toxic': 'toxic_fraction'}, axis=1), \n", + " jigsaw3_valid.rename({'comment_text': 'text'}, axis=1), \n", + " jigsaw3_test.rename({'content': 'text'}, axis=1),\n", + " mutox_cleaned.rename({'audio_file_transcript': 'text', 'partition': 'split'}, axis=1),\n", + " \n", + "])[['text', 'toxic', 'lang', 'split', 'dataset']].dropna()" + ] + }, + { + "cell_type": "code", + "execution_count": 230, + "id": "3d2e098a-a757-487e-a563-9444038d3290", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
texttoxiclangsplitdataset
124242Good idea and done. - ✉0entrainjigsaw1
216946\", 4 September 2011 (UTC) \\n :::I'm adding to ...0entrainjigsaw1
126160tell your girl friend of your User:Nancy stop ...0entrainjigsaw1
1575462The numbers are stunning. This part of the co...0entrainjigsaw2
1469311So you agree that if two people in the househo...0entrainjigsaw2
1537882They are fighting racism, especially state-spo...0entrainjigsaw2
39046Дело не в этом, а в том, что простая подборка...0rutestjigsaw3
5881745x45px|left Spam bağlantılar eklemeyi ...0trtestjigsaw3
22587Pardon pardon gerek kalmadı, zaten örneği verm...0trtestjigsaw3
7085and the destruction of his dynasty.0engtrainmutox
88252Üllatus-üllatus, mitte mõne massibrandi, vai ...0esttrainmutox
40817来自上帝和模西这就是许多失被称为大位的失片的原因有机墨室的观点0cmntrainmutox
\n", + "
" + ], + "text/plain": [ + " text toxic lang split \\\n", + "124242 Good idea and done. - ✉ 0 en train \n", + "216946 \", 4 September 2011 (UTC) \\n :::I'm adding to ... 0 en train \n", + "126160 tell your girl friend of your User:Nancy stop ... 0 en train \n", + "1575462 The numbers are stunning. This part of the co... 0 en train \n", + "1469311 So you agree that if two people in the househo... 0 en train \n", + "1537882 They are fighting racism, especially state-spo... 0 en train \n", + "39046 Дело не в этом, а в том, что простая подборка... 0 ru test \n", + "58817 45x45px|left Spam bağlantılar eklemeyi ... 0 tr test \n", + "22587 Pardon pardon gerek kalmadı, zaten örneği verm... 0 tr test \n", + "7085 and the destruction of his dynasty. 0 eng train \n", + "88252 Üllatus-üllatus, mitte mõne massibrandi, vai ... 0 est train \n", + "40817 来自上帝和模西这就是许多失被称为大位的失片的原因有机墨室的观点 0 cmn train \n", + "\n", + " dataset \n", + "124242 jigsaw1 \n", + "216946 jigsaw1 \n", + "126160 jigsaw1 \n", + "1575462 jigsaw2 \n", + "1469311 jigsaw2 \n", + "1537882 jigsaw2 \n", + "39046 jigsaw3 \n", + "58817 jigsaw3 \n", + "22587 jigsaw3 \n", + "7085 mutox \n", + "88252 mutox \n", + "40817 mutox " + ] + }, + "execution_count": 230, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_pooled.groupby(['dataset']).sample(3)" + ] + }, + { + "cell_type": "markdown", + "id": "c5ad74b0-93af-4559-9418-cb598a5d0d01", + "metadata": {}, + "source": [ + "In each of the datasets, we have about 5-20% of toxic labels. " + ] + }, + { + "cell_type": "code", + "execution_count": 231, + "id": "1a51bb81-17e8-4feb-b16f-566c78b9623e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
countmean
datasetsplit
jigsaw1train2235490.095657
jigsaw2train16993100.045574
jigsaw3test638120.225820
valid80000.153750
mutoxdev220320.108932
devtest103080.135914
train651580.135532
\n", + "
" + ], + "text/plain": [ + " count mean\n", + "dataset split \n", + "jigsaw1 train 223549 0.095657\n", + "jigsaw2 train 1699310 0.045574\n", + "jigsaw3 test 63812 0.225820\n", + " valid 8000 0.153750\n", + "mutox dev 22032 0.108932\n", + " devtest 10308 0.135914\n", + " train 65158 0.135532" + ] + }, + "execution_count": 231, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_pooled.groupby(['dataset', 'split'])['toxic'].aggregate(['count', 'mean'])" + ] + }, + { + "cell_type": "markdown", + "id": "e173c6bc-d23e-48fb-882a-d546be1a8083", + "metadata": {}, + "source": [ + "## Map the language codes" + ] + }, + { + "cell_type": "markdown", + "id": "ab405158-e4a6-4b99-a5e6-82638ff60fc7", + "metadata": {}, + "source": [ + "SONAR model requires adding a special token indicating the language to the text being encoded.\n", + "\n", + "So we need to re-format the language codes to agree with those of the SONAR tokenizer." + ] + }, + { + "cell_type": "markdown", + "id": "21ca886c-a352-4666-9195-b50a8ba540ea", + "metadata": {}, + "source": [ + "Here is an example of how the tokenizer works:" + ] + }, + { + "cell_type": "code", + "execution_count": 176, + "id": "96c85b1d-7218-46f9-8420-ce42598be8b2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[256047, 94124, 15697, 2]\n", + "['eng_Latn', '▁Hello', '▁world', '']\n" + ] + } + ], + "source": [ + "tokenizer.src_lang = \"eng_Latn\" # setting the language that will be added to the text by default\n", + "tokens_ids = tokenizer('Hello world').input_ids\n", + "print(tokens_ids) # [256047, 94124, 15697, 2]\n", + "tokens = tokenizer.convert_ids_to_tokens(tokens_ids)\n", + "print(tokens) # ['eng_Latn', '▁Hello', '▁world', '']" + ] + }, + { + "cell_type": "code", + "execution_count": 112, + "id": "13ff7970-badf-46b1-8327-d03231cdc3c0", + "metadata": {}, + "outputs": [], + "source": [ + "lang_codes_map ={\n", + " 'arb': 'arb_Arab',\n", + " 'ben': 
'ben_Beng',\n", + " 'bul': 'bul_Cyrl',\n", + " 'cat': 'cat_Latn',\n", + " 'ces': 'ces_Latn',\n", + " 'cmn': 'zho_Hans',\n", + " 'dan': 'dan_Latn',\n", + " 'deu': 'deu_Latn',\n", + " 'ell': 'ell_Grek',\n", + " 'en': 'eng_Latn',\n", + " 'eng': 'eng_Latn',\n", + " 'es': 'spa_Latn',\n", + " 'est': 'est_Latn',\n", + " 'fin': 'fin_Latn',\n", + " 'fr': 'fra_Latn',\n", + " 'fra': 'fra_Latn',\n", + " 'heb': 'heb_Hebr',\n", + " 'hin': 'hin_Deva',\n", + " 'hun': 'hun_Latn',\n", + " 'ind': 'ind_Latn',\n", + " 'it': 'ita_Latn',\n", + " 'ita': 'ita_Latn',\n", + " 'nld': 'nld_Latn',\n", + " 'pes': 'pes_Arab',\n", + " 'pol': 'pol_Latn',\n", + " 'por': 'por_Latn',\n", + " 'pt': 'por_Latn',\n", + " 'ru': 'rus_Cyrl',\n", + " 'rus': 'rus_Cyrl',\n", + " 'slk': 'slk_Latn',\n", + " 'spa': 'spa_Latn',\n", + " 'swh': 'swh_Latn',\n", + " 'tgl': 'tgl_Latn',\n", + " 'tr': 'tur_Latn',\n", + " 'tur': 'tur_Latn',\n", + " 'urd': 'urd_Arab',\n", + " 'vie': 'vie_Latn'\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "f9fc0070-57b4-47b8-bcf7-5aa7ce73d80b", + "metadata": {}, + "source": [ + "Check that we have covered all the languages in the dataset, and the codes are indeed part of the SONAR tokenizer:" + ] + }, + { + "cell_type": "code", + "execution_count": 239, + "id": "28dd3775-6519-462d-89a4-2c3a6d6f314b", + "metadata": {}, + "outputs": [], + "source": [ + "data_pooled['lang_code'] = data_pooled['lang'].apply(lang_codes_map.get) \n", + "assert data_pooled['lang_code'].isnull().sum() == 0\n", + "\n", + "for lang_code in lang_codes_map.values():\n", + " assert lang_code in tokenizer.special_tokens_map['additional_special_tokens']" + ] + }, + { + "cell_type": "markdown", + "id": "bd022830-ba03-4cd2-8f2d-66f2d945b429", + "metadata": {}, + "source": [ + "# 2. 
Setting up the model architecture"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "50f81140-8d36-4245-a9b3-a7ba10ff8eac",
+   "metadata": {},
+   "source": [
+    "We start by downloading the encoder from Hugging Face."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "c122a013-2338-48b6-acbb-527f4346330f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "encoder = M2M100Encoder.from_pretrained(BASE_MODEL_NAME)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "30b46023-76eb-4b29-b601-0a9434ef8ec0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "M2M100Encoder(\n",
+       "  (embed_tokens): M2M100ScaledWordEmbedding(256206, 1024, padding_idx=1)\n",
+       "  (embed_positions): M2M100SinusoidalPositionalEmbedding()\n",
+       "  (layers): ModuleList(\n",
+       "    (0-23): 24 x M2M100EncoderLayer(\n",
+       "      (self_attn): M2M100SdpaAttention(\n",
+       "        (k_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "        (v_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "        (q_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "        (out_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "      )\n",
+       "      (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       "      (activation_fn): ReLU()\n",
+       "      (fc1): Linear(in_features=1024, out_features=8192, bias=True)\n",
+       "      (fc2): Linear(in_features=8192, out_features=1024, bias=True)\n",
+       "      (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       "    )\n",
+       "  )\n",
+       "  (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       ")"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "encoder"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5556da4e-3694-408d-9fde-df0fbd7a9ac9",
+   "metadata": {},
+   "source": [
+    "In the spirit of typical Hugging Face models, we will add a classification head on top (just like, for example, 
[BertForSequenceClassification](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py#L1472))." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e87b52d2-8c68-406b-bc4e-2c031a6c076d", + "metadata": {}, + "outputs": [], + "source": [ + "class SonarForSequenceClassification(M2M100PreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + " self.num_labels = config.num_labels\n", + " self.config = config\n", + " \n", + " self.sonar_encoder = M2M100Encoder(config)\n", + " \n", + " # the classifier will consist of a wide linear layer, non-linearity, dropout, and a final linear layer\n", + " self.fc1 = nn.Linear(config.d_model, config.encoder_ffn_dim)\n", + " self.activation = nn.Tanh()\n", + " self.dropout = nn.Dropout(config.dropout)\n", + " self.classifier = nn.Linear(config.encoder_ffn_dim, config.num_labels)\n", + "\n", + " # Initialize weights and apply final processing\n", + " self.post_init()\n", + " \n", + " def forward(\n", + " self,\n", + " input_ids: Optional[torch.Tensor] = None,\n", + " attention_mask: Optional[torch.Tensor] = None,\n", + " head_mask: Optional[torch.Tensor] = None,\n", + " inputs_embeds: Optional[torch.Tensor] = None,\n", + " labels: Optional[torch.Tensor] = None,\n", + " output_attentions: Optional[bool] = None,\n", + " output_hidden_states: Optional[bool] = None,\n", + " return_dict: Optional[bool] = None,\n", + " ) -> Union[tuple[torch.Tensor], SequenceClassifierOutput]:\n", + " r\"\"\"\n", + " labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):\n", + " Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,\n", + " config.num_labels - 1]`. 
If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If\n",
+    "            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n",
+    "        \"\"\"\n",
+    "        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n",
+    "\n",
+    "        outputs = self.sonar_encoder(\n",
+    "            input_ids,\n",
+    "            attention_mask=attention_mask,\n",
+    "            head_mask=head_mask,\n",
+    "            inputs_embeds=inputs_embeds,\n",
+    "            output_attentions=output_attentions,\n",
+    "            output_hidden_states=True,\n",
+    "            return_dict=return_dict,\n",
+    "        )\n",
+    "        token_embs = outputs.last_hidden_state\n",
+    "        # mean-pool the token embeddings over the non-padding positions to get sentence embeddings\n",
+    "        sentence_embs = (token_embs * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.unsqueeze(-1).sum(1)\n",
+    "\n",
+    "        hidden = self.fc1(sentence_embs)\n",
+    "        hidden = self.activation(hidden)  # apply the non-linearity declared in __init__\n",
+    "        hidden = self.dropout(hidden)\n",
+    "        logits = self.classifier(hidden)\n",
+    "\n",
+    "        loss = None\n",
+    "        if labels is not None:\n",
+    "            if self.config.problem_type is None:\n",
+    "                if self.num_labels == 1:\n",
+    "                    self.config.problem_type = \"regression\"\n",
+    "                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):\n",
+    "                    self.config.problem_type = \"single_label_classification\"\n",
+    "                else:\n",
+    "                    self.config.problem_type = \"multi_label_classification\"\n",
+    "\n",
+    "            if self.config.problem_type == \"regression\":\n",
+    "                loss_fct = MSELoss()\n",
+    "                if self.num_labels == 1:\n",
+    "                    loss = loss_fct(logits.squeeze(), labels.squeeze())\n",
+    "                else:\n",
+    "                    loss = loss_fct(logits, labels)\n",
+    "            elif self.config.problem_type == \"single_label_classification\":\n",
+    "                loss_fct = CrossEntropyLoss()\n",
+    "                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n",
+    "            elif self.config.problem_type == \"multi_label_classification\":\n",
+    "                loss_fct = BCEWithLogitsLoss()\n",
+    "                loss = loss_fct(logits, labels)\n",
+    "        if not return_dict:\n",
+    "            output = (logits,) + outputs[2:]\n",
+    "            return ((loss,) + output) if loss is not None else 
output\n", + "\n", + " return SequenceClassifierOutput(\n", + " loss=loss,\n", + " logits=logits,\n", + " hidden_states=sentence_embs,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "f78e63d9-3f81-49c7-9668-38023a7c1011", + "metadata": {}, + "source": [ + "Create the model! And initialize its encoder by the actual SONAR encoder parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "62881559-57bc-4a71-9fb2-1c10b9009259", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "encoder.config.num_labels = 2\n", + "encoder.config.problem_type = \"single_label_classification\"\n", + "\n", + "model = SonarForSequenceClassification(encoder.config)\n", + "\n", + "model.sonar_encoder.load_state_dict(encoder.state_dict())" + ] + }, + { + "cell_type": "markdown", + "id": "bd75a385-6be7-4175-90d6-3b4fd0c5f9da", + "metadata": {}, + "source": [ + "Here is how our model is supposed to classify:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "7420543f-d913-4ef7-9701-1d4983c06940", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "tensor([[0.5013, 0.4987],\n", + " [0.5074, 0.4926],\n", + " [0.4992, 0.5008]])\n" + ] + } + ], + "source": [ + "sample_texts = ['Fuck you bitch!', 'Do you like dogs?', 'One more neutral sentence.']\n", + "tokenizer.src_lang = \"eng_Latn\"\n", + "\n", + "with torch.inference_mode():\n", + " out = model(**tokenizer(sample_texts, padding=True, return_tensors='pt').to(model.device))\n", + " probabilities = torch.softmax(out.logits, dim=-1)\n", + "print(probabilities)" + ] + }, + { + "cell_type": "markdown", + "id": "d68937af-76cd-4493-85c7-9ba407288836", + "metadata": {}, + "source": [ + "The outputs are a list of two-dimensional arrays, each of them containing probabilities of the labels. 
\n",
+    "\n",
+    "This is a binary classification problem with the labels 0 (non-toxic) and 1 (toxic). Thus, the last column of the output (column 1, counting from 0) contains the predicted probability that the text is toxic. Before training, this probability is close to 0.5 for every text: the model does not yet know what to predict."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00193864-7dc8-4ad3-bf70-7fd4b80485dd",
+   "metadata": {},
+   "source": [
+    "# 3. Preparing the datasets for training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1830348-b36b-4c46-9105-7fb76095b710",
+   "metadata": {},
+   "source": [
+    "We want to use the Hugging Face `Trainer`, which expects the data to be pre-tokenized in advance. So we will do just that.\n",
+    "\n",
+    "We will use all `train` and `dev` data for training, and `test` data for testing. We don't really need a separate validation dataset here, because we plan to train the models just once, without much hyperparameter tuning, so overfitting to the `test` split is not a problem."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 251,
+   "id": "a6ae4b71-758b-46de-8ff9-6a3b80d6534f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "DatasetDict({\n",
+       "    train: Dataset({\n",
+       "        features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+       "        num_rows: 2018049\n",
+       "    })\n",
+       "    test: Dataset({\n",
+       "        features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+       "        num_rows: 74120\n",
+       "    })\n",
+       "})"
+      ]
+     },
+     "execution_count": 251,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "ds_mix = datasets.DatasetDict({\n",
+    "    'train': datasets.Dataset.from_pandas(data_pooled[~data_pooled['split'].isin({'test', 'devtest'})], preserve_index=False),\n",
+    "    'test': datasets.Dataset.from_pandas(data_pooled[data_pooled['split'].isin({'test', 'devtest'})], preserve_index=False),\n",
+    "})\n",
+    "ds_mix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8942ae7e-06cd-41bf-ae5b-48e45b6ab0e7",
+   "metadata": {},
+   "source": [
+    "The test split is too large; we downsample it, taking at most 100 examples from each dataset-language-label combination.
" + ] + }, + { + "cell_type": "code", + "execution_count": 258, + "id": "38914623-f1db-4154-8c20-8c3d38f7b1f4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DatasetDict({\n", + " train: Dataset({\n", + " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n", + " num_rows: 2018049\n", + " })\n", + " test: Dataset({\n", + " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n", + " num_rows: 74120\n", + " })\n", + " test_small: Dataset({\n", + " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n", + " num_rows: 5101\n", + " })\n", + "})" + ] + }, + "execution_count": 258, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_small = data_pooled[\n", + " data_pooled['split'].isin({'test', 'devtest'})\n", + "].groupby(['dataset', 'lang_code', 'toxic']).apply(\n", + " lambda x: x.sample(min(len(x), 100), random_state=1)\n", + ")\n", + "ds_mix['test_small'] = datasets.Dataset.from_pandas(test_small, preserve_index=False)\n", + "ds_mix" + ] + }, + { + "cell_type": "markdown", + "id": "98c91eb2-77a5-45aa-83f2-1735706014da", + "metadata": {}, + "source": [ + "Now we can define the tokenization function (language-dependent!) 
and pre-tokenize the datasets.\n", + "\n", + "In addition, we will rename the target to \"labels\"" + ] + }, + { + "cell_type": "code", + "execution_count": 247, + "id": "d4ad18ba-d59f-460b-854f-92e8c7ae96c2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'input_ids': [[256047, 2867, 248203, 2], [256057, 163119, 248203, 2], [256147, 12700, 9272, 248203, 2]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1]]}\n" + ] + } + ], + "source": [ + "def tokenize_function(examples, max_length=512):\n", + " \"\"\" Tokenize the examples and prepend the right language tag.\"\"\"\n", + " result = tokenizer(\n", + " text=list(examples[\"text\"]),\n", + " truncation=True,\n", + " max_length=max_length,\n", + " )\n", + " lang_ids = tokenizer.convert_tokens_to_ids(list(examples[\"lang_code\"]))\n", + " for input_ids, lang_id in zip(result[\"input_ids\"], lang_ids):\n", + " input_ids[0] = lang_id\n", + "\n", + " if 'toxic' in examples:\n", + " result['labels'] = examples['toxic']\n", + " return result\n", + "\n", + "print(tokenize_function({\"text\": [\"Hi!\", \"Salut!\", \"Привет!\"], \"lang_code\": [\"eng_Latn\", \"fra_Latn\", \"rus_Cyrl\"]}))" + ] + }, + { + "cell_type": "code", + "execution_count": 259, + "id": "4cbcc19c-4da2-4629-a07c-3e35f68175f0", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "9187b87d21ee4da18552ee5a9c07806a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Map: 0%| | 0/2018049 [00:00\n", + " \n", + " \n", + " [3942/3942 2:19:11, Epoch 1/1]\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
StepTraining Loss
5000.354900
10000.146400
15000.113300
20000.110100
25000.109200
30000.108100
35000.109500

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n", + "Non-default generation parameters: {'max_length': 200}\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=3942, training_loss=0.14545092686731037, metrics={'train_runtime': 8355.0605, 'train_samples_per_second': 241.536, 'train_steps_per_second': 0.472, 'total_flos': 3.004701338322985e+18, 'train_loss': 0.14545092686731037, 'epoch': 1.0})" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "id": "a9e4ed6b-9315-420e-9c83-77aace8e7edc", + "metadata": {}, + "source": [ + "We can see that by the end of the epoch, the loss stagnates at the level about 0.1. Apparently, it cannot go much lower than that without unfreezing the model parameters. 
" + ] + }, + { + "cell_type": "markdown", + "id": "42540d92-2fcc-44a7-b0f3-dc99d6b566f6", + "metadata": {}, + "source": [ + "## Evaluating it" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "ef36fd6a-9e90-4c21-bb3e-1e1da71a61ae", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "device(type='cuda', index=0)" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.eval()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "780ff8f6-691c-412b-9075-865ab996a5ab", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "preds1_small = trainer.predict(ds_tokenized['test_small'])" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "092a6e2a-285c-4357-8c77-a802f9546dba", + "metadata": {}, + "outputs": [], + "source": [ + "with torch.inference_mode():\n", + " p1_bad = torch.softmax(torch.tensor(preds1_small.predictions[0]), dim=-1)[:, 1].cpu().numpy()" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "02e74f42-9e9f-4b65-8f67-4eb914abf490", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.8337834961877266\n" + ] + } + ], + "source": [ + "print(roc_auc_score(preds1_small.label_ids, p1_bad))" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "08805de6-7011-4d12-bdb6-003b0e81f0d8", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": 
"074f8f829b7f4fa5a08c0b9c7db509c4", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/5101 [00:00\n", + " \n", + " \n", + " [15768/15768 12:13:52, Epoch 2/2]\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Step	Training Loss
500	0.095400
1000	0.083500
1500	0.075400
2000	0.069800
2500	0.065700
3000	0.064500
3500	0.062000
4000	0.059500
4500	0.060500
5000	0.059000
5500	0.058600
6000	0.058700
6500	0.057000
7000	0.059200
7500	0.056600
8000	0.052500
8500	0.047500
9000	0.047600
9500	0.047300
10000	0.047200
10500	0.047300
11000	0.047100
11500	0.045700
12000	0.046100
12500	0.045700
13000	0.045600
13500	0.045600
14000	0.044500
14500	0.045800
15000	0.044400
15500	0.044600

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "IOPub message rate exceeded.\n", + "The Jupyter server will temporarily stop sending output\n", + "to the client in order to avoid crashing it.\n", + "To change this limit, set the config variable\n", + "`--ServerApp.iopub_msg_rate_limit`.\n", + "\n", + "Current values:\n", + "ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n", + "ServerApp.rate_limit_window=3.0 (secs)\n", + "\n", + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n", + "Non-default generation parameters: {'max_length': 200}\n", + " warnings.warn(\n", + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", + " warnings.warn(\n", + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. 
These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n", + "Non-default generation parameters: {'max_length': 200}\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=15768, training_loss=0.05560727533140139, metrics={'train_runtime': 44037.4041, 'train_samples_per_second': 91.652, 'train_steps_per_second': 0.358, 'total_flos': 5.37354496011832e+18, 'train_loss': 0.05560727533140139, 'epoch': 2.0})" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "id": "6a2081ee-4ad7-4bb4-b506-3c9224392cda", + "metadata": {}, + "source": [ + "## Evaluating" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "b55d26e8-a05b-4368-9cf1-890c95b7f8fd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.eval()" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "aff27f8b-20c2-4467-9910-6f857f381f75", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "preds2_small = trainer.predict(ds_tokenized['test_small'])" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + 
"id": "43889710-ff29-42b8-918d-7f377fdca926", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.8961500666222518\n" + ] + } + ], + "source": [ + "with torch.inference_mode():\n", + " p2_bad = torch.softmax(torch.tensor(preds2_small.predictions[0]), dim=-1)[:, 1].cpu().numpy()\n", + "print(roc_auc_score(preds2_small.label_ids, p2_bad))" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "9a98f19f-d5d4-46a5-9fab-0570ef4b2dc5", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c1dc507386f041e4ad2a8638569e94df", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/5101 [00:00 Date: Mon, 28 Jul 2025 16:40:05 +0000 Subject: [PATCH 2/2] a note on dependencies --- examples/finetune_sonar_as_toxicity_classifier.ipynb | 2 ++ 1 file changed, 2 insertions(+) diff --git a/examples/finetune_sonar_as_toxicity_classifier.ipynb b/examples/finetune_sonar_as_toxicity_classifier.ipynb index b2bc852..b7097b4 100644 --- a/examples/finetune_sonar_as_toxicity_classifier.ipynb +++ b/examples/finetune_sonar_as_toxicity_classifier.ipynb @@ -24,6 +24,8 @@ "\n", "Note: this notebook trains the models using 8 A100 GPUs. If you have other hardware configuration, consider adjusting the training parameters accordingly (e.g. reducing the batch size if it does not fit into GPU memory, or even downsampling the dataset). Please note that some kind of hardware accelerator is indispensable for running this training within reasonable time. If you do not have any, consider subscribing e.g. to Google Colab. 
\n", "\n", + "The notebook does not depend on the [SONAR](https://github.com/facebookresearch/SONAR) codebase; instead, it depends only on HF [transformers](https://github.com/huggingface/transformers), HF [datasets](https://github.com/huggingface/datasets), and a bit of custom code.\n", + "\n", "**WARNING**: this notebook might contain examples of offensive or otherwise distubing texts in various languages. \n", "\n" ]