From 40bdf2081ce70941a4bce75df526eae1859276fc Mon Sep 17 00:00:00 2001 From: David Dale Date: Mon, 28 Jul 2025 16:31:44 +0000 Subject: [PATCH 1/2] add a notebook with finetuning example --- ...inetune_sonar_as_toxicity_classifier.ipynb | 3119 +++++++++++++++++ 1 file changed, 3119 insertions(+) create mode 100644 examples/finetune_sonar_as_toxicity_classifier.ipynb diff --git a/examples/finetune_sonar_as_toxicity_classifier.ipynb b/examples/finetune_sonar_as_toxicity_classifier.ipynb new file mode 100644 index 0000000..b2bc852 --- /dev/null +++ b/examples/finetune_sonar_as_toxicity_classifier.ipynb @@ -0,0 +1,3119 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0ee0665d-ab0b-4ebf-a6a5-613ac1072f93", + "metadata": {}, + "source": [ + "In this notebook, we train two SONAR-based toxicity classifiers, to illustrate two things:\n", + "- How a classification head on top of SONAR can be trained;\n", + "- How a SONAR encoder can be finetuned on a downstream task. \n", + "\n", + "The first model adds a classifier layer on top of the frozen SONAR text encoder (like the [MuTox](https://aclanthology.org/2024.findings-acl.340/) model). The second also unfreezes the encoder, allowing it to modify the underlying sentence representation to potentially better fit the task. \n", + "\n", + "In both cases, we will be using the unofficial HuggingFace port [cointegrated/SONAR_200_text_encoder](https://huggingface.co/cointegrated/SONAR_200_text_encoder) of the encoder, simply to avoid messing with fairseq2, in which the original SONAR is implemented. 
\n", + "\n", + "We will be using two datasets for training and evaluation:\n", + "- MuTox (text transcriptions; the original dataset is speech-based)\n", + " - see https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/cli/toxicity/mutox/README.md\n", + "- Jigsaw toxic comments from three competitions:\n", + " - https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data\n", + " - https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data\n", + " - https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data\n", + " \n", + "\n", + "Note: this notebook trains the models using 8 A100 GPUs. If you have a different hardware configuration, consider adjusting the training parameters accordingly (e.g. reducing the batch size if it does not fit into GPU memory, or even downsampling the dataset). Please note that some kind of hardware accelerator is indispensable for running this training within a reasonable time. If you do not have one, consider using a service such as Google Colab. \n", + "\n", + "**WARNING**: this notebook might contain examples of offensive or otherwise disturbing texts in various languages. 
\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "7bfdd63f-069c-4401-9581-3049ffbc8d34", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import torch\n", + "import datasets\n", + "\n", + "from typing import Optional, Union\n", + "from torch import nn\n", + "from torch.nn import CrossEntropyLoss\n", + "from transformers import AutoTokenizer\n", + "from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder, M2M100PreTrainedModel\n", + "from transformers.modeling_outputs import SequenceClassifierOutput\n", + "\n", + "from transformers import TrainingArguments, Trainer, DataCollatorWithPadding\n", + "from sklearn.metrics import roc_auc_score\n", + "from tqdm.auto import tqdm\n", + "import gc" + ] + }, + { + "cell_type": "markdown", + "id": "87e7e751-ba58-426e-97bd-aa7e10e579fc", + "metadata": {}, + "source": [ + "Setting up some directories" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "0f97515b-fe3b-480e-aebb-f0ca37ee8e86", + "metadata": {}, + "outputs": [], + "source": [ + "DATA_DIR = \"data\" # you will need to download the Jigsaw datasets there\n", + "MODELS_DIR = \"models\" # the trained models will be saved there\n", + "\n", + "# the two models will be saved here:\n", + "MODEL1_DIR = f\"{MODELS_DIR}/SONAR_toxicity_classifier_frozen\"\n", + "MODEL2_DIR = f\"{MODELS_DIR}/SONAR_toxicity_classifier_unfrozen\"" + ] + }, + { + "cell_type": "markdown", + "id": "6eb01105-4288-4161-8983-925e239a391f", + "metadata": {}, + "source": [ + "Point to the model that we will fine-tune: https://huggingface.co/cointegrated/SONAR_200_text_encoder" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "fcc9aa22-5ea7-4f93-b0bc-9bcbb40586b2", + "metadata": {}, + "outputs": [], + "source": [ + "BASE_MODEL_NAME = \"cointegrated/SONAR_200_text_encoder\"" + ] + }, + { + "cell_type": "markdown", + "id": "2b39a8d8-2a5d-429e-8d18-df8824af5d67", + "metadata": {}, + "source": [ + "Load 
the tokenizer in advance (we will need it while preparing the data). " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "9f31a2cd-f68d-41a8-bd2c-12858e982d05", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)" + ] + }, + { + "cell_type": "markdown", + "id": "9ca989c3-7743-4c43-be4b-64ee07fba110", + "metadata": {}, + "source": [ + "# 1. Compiling the datasets" + ] + }, + { + "cell_type": "markdown", + "id": "9cf437e5-16ac-4976-bd8f-acfb6845ce9f", + "metadata": {}, + "source": [ + "## Mutox" + ] + }, + { + "cell_type": "markdown", + "id": "b722089d-87b5-437a-95b9-da646cbb786a", + "metadata": {}, + "source": [ + "The Mutox text dataset can be downloaded directly:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "98a30e19-4393-4bfd-a18a-09d269ea1219", + "metadata": {}, + "outputs": [], + "source": [ + "mutox_raw = pd.read_csv(\"https://dl.fbaipublicfiles.com/seamless/datasets/mutox.tsv\", sep=\"\\t\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "6c80449d-b166-4af1-897c-2daf4d674ad6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idlangpartitionpublic_url_segmentaudio_file_transcriptcontains_toxicitytoxicity_typesperlocutionary_effectslabeletox_resultdetoxify_scoremutox_speech_scoremutox_text_scoremutox_zero_shot_speech_scoremutox_zero_shot_text_score
46427fra_1181276fratrainhttp://www.archive.org/download/notredameparis...élas, mon frère, c'est que vous aviez bien rai...NoNaNNaN0NaN0.001685-24.469976-28.979309-12.300150-18.465229
8575341677.0elltrainhttps://bauernordic-pods.sharp-stream.com/gr/1...Και φτάνεις σε ένα σημείο και λες δεν μπορώ ν...NoNaNNaN0NaN0.768185-8.917057-18.83962812.581656-13.409579
57571por_370751pordevtesthttps://feeds.soundcloud.com/stream/1121363227...so para quando algm sangra ou cae no chãoae pa...YesprofanitiesNone of the above1merdas0.976207-0.57349231.779409-12.019632-1.681559
\n", + "
" + ], + "text/plain": [ + " id lang partition \\\n", + "46427 fra_1181276 fra train \n", + "85753 41677.0 ell train \n", + "57571 por_370751 por devtest \n", + "\n", + " public_url_segment \\\n", + "46427 http://www.archive.org/download/notredameparis... \n", + "85753 https://bauernordic-pods.sharp-stream.com/gr/1... \n", + "57571 https://feeds.soundcloud.com/stream/1121363227... \n", + "\n", + " audio_file_transcript contains_toxicity \\\n", + "46427 élas, mon frère, c'est que vous aviez bien rai... No \n", + "85753 Και φτάνεις σε ένα σημείο και λες δεν μπορώ ν... No \n", + "57571 so para quando algm sangra ou cae no chãoae pa... Yes \n", + "\n", + " toxicity_types perlocutionary_effects label etox_result \\\n", + "46427 NaN NaN 0 NaN \n", + "85753 NaN NaN 0 NaN \n", + "57571 profanities None of the above 1 merdas \n", + "\n", + " detoxify_score mutox_speech_score mutox_text_score \\\n", + "46427 0.001685 -24.469976 -28.979309 \n", + "85753 0.768185 -8.917057 -18.839628 \n", + "57571 0.976207 -0.573492 31.779409 \n", + "\n", + " mutox_zero_shot_speech_score mutox_zero_shot_text_score \n", + "46427 -12.300150 -18.465229 \n", + "85753 12.581656 -13.409579 \n", + "57571 -12.019632 -1.681559 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw.sample(3)" + ] + }, + { + "cell_type": "markdown", + "id": "2017e345-f799-4a9c-aad0-830ee16d7978", + "metadata": {}, + "source": [ + "Note that Mutox contains a certain portion of the texts that are missing or inadequately short; we would like to filter them out. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "2c70bb38-87ee-488d-b5aa-f74ad0dcebf7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3441\n", + "122\n" + ] + } + ], + "source": [ + "print(mutox_raw['audio_file_transcript'].isnull().sum())\n", + "print((mutox_raw['audio_file_transcript'].str.len() < 5).sum())" + ] + }, + { + "cell_type": "markdown", + "id": "42766ecf-c253-4717-816b-bdbe7894e731", + "metadata": {}, + "source": [ + "The dataset has predefined train/dev/devtest partitions, and it covers plenty of languages: " + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "ddfb39e0-1445-40ca-9179-ecaa3b77b978", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "partition\n", + "train 67971\n", + "dev 22273\n", + "devtest 10691\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw[\"partition\"].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "e1ec505b-8f15-4d74-8b5b-0e355c282167", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "lang\n", + "eng 17082\n", + "spa 17059\n", + "cat 2787\n", + "bul 2650\n", + "est 2631\n", + "ita 2615\n", + "hin 2601\n", + "arb 2560\n", + "por 2551\n", + "heb 2550\n", + "swh 2542\n", + "nld 2542\n", + "tgl 2527\n", + "ben 2526\n", + "ell 2526\n", + "dan 2525\n", + "fin 2524\n", + "vie 2523\n", + "fra 2523\n", + "tur 2520\n", + "rus 2519\n", + "slk 2518\n", + "urd 2517\n", + "pol 2515\n", + "deu 2514\n", + "cmn 2513\n", + "pes 2513\n", + "ces 2512\n", + "ind 2510\n", + "hun 2501\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw[\"lang\"].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "ca9681f6-5f47-420a-8457-b84468255fcd", + "metadata": {}, + "outputs": [ + { + "data": + "text/plain": [ + "contains_toxicity\n", + "No 87887\n", + "Yes 13048\n", + "Cannot say 2623\n", + "Cannot Say 1927\n", + "yes 11\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutox_raw['contains_toxicity'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "id": "3ccdbdb0-dfdf-41e8-ae50-8b7a1039ba7a", + "metadata": {}, + "source": [ + "Cleaning the dataset and binarizing the label" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "ad3c46a3-aad1-4e04-92d4-689127895bbd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(105496, 16)\n", + "(97509, 17)\n" + ] + } + ], + "source": [ + "mutox_cleaned = mutox_raw[\n", + " mutox_raw['audio_file_transcript'].notnull()\n", + " & mutox_raw['audio_file_transcript'].str.len().ge(5)\n", + " & mutox_raw['contains_toxicity'].isin({'Yes', 'yes', 'No'})\n", + "].copy()\n", + "mutox_cleaned['toxic'] = mutox_cleaned['contains_toxicity'].apply({'Yes': 1, 'yes': 1, 'No': 0}.get)\n", + "print(mutox_raw.shape)\n", + "print(mutox_cleaned.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "f8cebda7-47e9-4f32-8af0-49170ec5b933", + "metadata": {}, + "source": [ + "## Jigsaw" + ] + }, + { + "cell_type": "markdown", + "id": "43734bae-35cb-4ab2-8183-0c286ad557f9", + "metadata": {}, + "source": [ + "For Jigsaw, you need to access the datasets manually. \n", + "\n", + "1. Please log into Kaggle, go to https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data, and \"download all\" the data into the `$DATA_DIR` directory (it is \"data\" for me, but you can use a different directory if you want). \n", + "2. Unzip the data. 
\n", + "\n", + "On Unix-like operating systems, the command to unzip them would go like this:\n", + "\n", + "```\n", + "!cd $DATA_DIR && unzip jigsaw-multilingual-toxic-comment-classification.zip\n", + "```\n", + "\n", + "The result will include the datasets from all three competitions. The directory contents could look as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "7363170e-ab23-48dc-ae44-067bd2ecb10f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "jigsaw-multilingual-toxic-comment-classification.zip\n", + "jigsaw-toxic-comment-train-processed-seqlen128.csv\n", + "jigsaw-toxic-comment-train.csv\n", + "jigsaw-unintended-bias-train-processed-seqlen128.csv\n", + "jigsaw-unintended-bias-train.csv\n", + "sample_submission.csv\n", + "test-processed-seqlen128.csv\n", + "test.csv\n", + "test_labels.csv\n", + "validation-processed-seqlen128.csv\n", + "validation.csv\n" + ] + } + ], + "source": [ + "!ls $DATA_DIR" + ] + }, + { + "cell_type": "markdown", + "id": "c7f3dfe1-dd80-4066-a411-1cafac6b9627", + "metadata": {}, + "source": [ + "Now we can read the datasets and add some missing columns:" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "74575c4e-5ed4-47c2-a9e9-dabe0093c54a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcomment_texttoxicsevere_toxicobscenethreatinsultidentity_hate
1707682cb5cd297da37b3a, but I don't know what DRV is, I'm afraid000000
123481948a7ca2686cccdeBy the way I will not appeal this block despit...000000
1987119c0a4a0364eb1792In my opinion, it is impossible to explain wha...000000
\n", + "
" + ], + "text/plain": [ + " id comment_text \\\n", + "170768 2cb5cd297da37b3a , but I don't know what DRV is, I'm afraid \n", + "123481 948a7ca2686cccde By the way I will not appeal this block despit... \n", + "198711 9c0a4a0364eb1792 In my opinion, it is impossible to explain wha... \n", + "\n", + " toxic severe_toxic obscene threat insult identity_hate \n", + "170768 0 0 0 0 0 0 \n", + "123481 0 0 0 0 0 0 \n", + "198711 0 0 0 0 0 0 " + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw1_train = pd.read_csv(f\"{DATA_DIR}/jigsaw-toxic-comment-train.csv\")\n", + "jigsaw1_train.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "88a13789-5e6e-4545-8806-d597f831fab1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcomment_texttoxicsevere_toxicityobsceneidentity_attackinsultthreatasianatheist...article_idratingfunnywowsadlikesdisagreesexual_explicitidentity_annotator_counttoxicity_annotator_count
12256155612740Why?0.0000000.0000000.00.00.00.0NaNNaN...356357approved000000.004
6960784973759On the contrary, I believe EVERY new condo in ...0.1666670.1666670.00.00.00.0NaNNaN...317444approved000000.006
11364015504345Money has ruined politics in this country. \\n...0.0000000.0000000.00.00.00.0NaNNaN...350258approved000500.004
\n", + "

3 rows × 45 columns

\n", + "
" + ], + "text/plain": [ + " id comment_text toxic \\\n", + "1225615 5612740 Why? 0.000000 \n", + "696078 4973759 On the contrary, I believe EVERY new condo in ... 0.166667 \n", + "1136401 5504345 Money has ruined politics in this country. \\n... 0.000000 \n", + "\n", + " severe_toxicity obscene identity_attack insult threat asian \\\n", + "1225615 0.000000 0.0 0.0 0.0 0.0 NaN \n", + "696078 0.166667 0.0 0.0 0.0 0.0 NaN \n", + "1136401 0.000000 0.0 0.0 0.0 0.0 NaN \n", + "\n", + " atheist ... article_id rating funny wow sad likes disagree \\\n", + "1225615 NaN ... 356357 approved 0 0 0 0 0 \n", + "696078 NaN ... 317444 approved 0 0 0 0 0 \n", + "1136401 NaN ... 350258 approved 0 0 0 5 0 \n", + "\n", + " sexual_explicit identity_annotator_count toxicity_annotator_count \n", + "1225615 0.0 0 4 \n", + "696078 0.0 0 6 \n", + "1136401 0.0 0 4 \n", + "\n", + "[3 rows x 45 columns]" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw2_train = pd.read_csv(f\"{DATA_DIR}/jigsaw-unintended-bias-train.csv\")\n", + "jigsaw2_train.sample(3)" + ] + }, + { + "cell_type": "markdown", + "id": "252f618e-8f94-4c13-abb6-4144af8e32a3", + "metadata": {}, + "source": [ + "The unintended bias dataset has toxicity labels that are non-binary. We will convert them to binary by applying a threshold of 0.5." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 209, + "id": "3bc054e7-69fb-43c0-879a-a9cba7879124", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAGsCAYAAAAPJKchAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAJ8BJREFUeJzt3X9Q1Pedx/EXPxe5hBhD+KFHipqYH1WR4skR40R7IDUOPeemVy/mlOOiuVS4se6kicQoUhMxRj06OVImJsY6jcGYSUxbGYWScDaRniPKnGnU1KChZwX1vAQC7bKy3/sjw/YoiHzJ7veTxedjxj/2w+fz+b737Sqv+X6/uxtmWZYlAAAAQ8JNFwAAAK5vhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgVEiFkYMHDyovL09jx45VWFiY9u7da3sPy7K0efNmTZo0SS6XS+PGjdMzzzwT+GIBAMCQRJouwI7Ozk6lpaXpn//5n/V3f/d3w9pjxYoVqqmp0ebNmzVlyhRdvnxZly9fDnClAABgqMJC9YvywsLC9NZbb2nBggX+MY/Ho9WrV+u1117Tp59+qsmTJ+vZZ5/V7NmzJUknTpzQ1KlT9cEHH+jOO+80UzgAAOgjpC7TXEtRUZEaGhpUVVWl//qv/9Lf//3f61vf+pZ++9vfSpJ+/vOfa8KECfrFL36h8ePHKzU1VUuXLuXMCAAABo2YMNLS0qJXXnlFe/bs0axZszRx4kQ99thjuu+++/TKK69Ikpqbm/XJJ59oz5492rlzp3bs2KHGxkZ95zvfMVw9AADXr5C6Z2Qwx48fV09PjyZNmtRn3OPx6JZbbpEk+Xw+eTwe7dy50z/v5ZdfVkZGhk6dOsWlGwAADBgxYeTzzz9XRESEGhsbFRER0ednN9xwgyQpOTlZkZGRfQLL3XffLemLMyuEEQAAnDdiwkh6erp6enp04cIFzZo1a8A5M2fO1JUrV/Txxx9r4sSJkqSPPvpIkvS1r33NsVoBAMCfhNS7aT7//HOdPn1a0hfhY+vWrZozZ47GjBmj2267Tf/4j/+o999/X1u2bFF6erouXryouro6TZ06VfPnz5fP59Nf/dVf6YYbblB5ebl8Pp8KCwsVFxenmpoaw88OAIDrU0iFkfr6es2ZM6ffeH5+vnbs2CGv16unn35aO3fu1Llz5xQfH6+//uu/VmlpqaZMmSJJ+v3vf69//dd/VU1Njf7iL/5C8+bN05YtWzRmzBinnw4AANAwwsjBgwf13HPPqbGxUefPn+/3WR+Def/993X//fdr8uTJampqGka5AABgpLH91t7eT0GtqKiwte7TTz/VkiVL9Dd/8zd2DwkAAEawL3WZZqBPQb2af/iHf9Add9yhiIgI7d27lzMjAABAkkPvpnnllVfU3Nysn/70p3r66aevOd/j8cjj8fgf+3w+Xb58WbfccovCwsKCWSoAAAgQy7LU0dGhsWPHKjz86hdjgh5Gfvvb32rVqlX61a9+pcjIoR2urKxMpaWlQa4MAAA44Xe/+53+8i//8qo/D2oY6enp0aJFi1RaWtrvk1EHU1xcLLfb7X/82Wef6bbbbtOZM2d04403Bqw+r9erd999V3PmzFFUVFTA9kVf9Nk59NoZ9NkZ9NkZwexzR0eHxo8ff83f3UENIx0dHTpy5IiOHTumoqIiSV9ccrEsS5GRkaqpqdE3v/nNfutcLpdcLle/8TFjxiguLi5g9Xm9XsXGxuqWW27hhR5E9Nk59NoZ9NkZ9NkZwexz737XusUiqGEkLi5Ox
48f7zP2wgsv6J133tEbb7yh8ePHB/PwAAAgBNgOI///U1Al6cyZM2pqavJ/CmpxcbHOnTunnTt3Kjw8XJMnT+6zPiEhQTExMf3GAQDA9cl2GDly5EifT0Htvbej91NQz58/r5aWlsBVCAAARjTbYWT27Nka7KNJduzYMej6devWad26dXYPCwAARijbn8AKAAAQSIQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYFRQv7U3VExed0CensG/3vir5OzG+aZLAAAgYDgzAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAo22Hk4MGDysvL09ixYxUWFqa9e/cOOv/NN99UTk6Obr31VsXFxSkrK0sHDhwYbr0AAGCEsR1GOjs7lZaWpoqKiiHNP3jwoHJyclRdXa3GxkbNmTNHeXl5OnbsmO1iAQDAyBNpd8G8efM0b968Ic8vLy/v83jDhg16++239fOf/1zp6el2Dw8AAEYY22Hky/L5fOro6NCYMWOuOsfj8cjj8fgft7e3S5K8Xq+8Xm/AaundyxVuBWxPJwSyB07orTfU6g5F9NoZ9NkZ9NkZwezzUPcMsyxr2L+Jw8LC9NZbb2nBggVDXrNp0yZt3LhRJ0+eVEJCwoBz1q1bp9LS0n7ju3btUmxs7HDLBQAADurq6tKiRYv02WefKS4u7qrzHA0ju3bt0rJly/T2228rOzv7qvMGOjOSkpKiS5cuDfpk7PJ6vaqtrdWaI+Hy+MICtm+wfbAu13QJtvT2OScnR1FRUabLGdHotTPoszPoszOC2ef29nbFx8dfM4w4dpmmqqpKS5cu1Z49ewYNIpLkcrnkcrn6jUdFRQXlBenxhcnTEzphJFT/UQbr7w/90Wtn0Gdn0GdnBKPPQ93Pkc8Zee2111RQUKDXXntN8+fPd+KQAAAgRNg+M/L555/r9OnT/sdnzpxRU1OTxowZo9tuu03FxcU6d+6cdu7cKemLSzP5+fn60Y9+pMzMTLW2tkqSRo0apZtuuilATwMAAIQq22dGjhw5ovT0dP/bct1ut9LT07V27VpJ0vnz59XS0uKf/+KLL+rKlSsqLCxUcnKy/8+KFSsC9BQAAEAos31mZPbs2RrsntcdO3b0eVxfX2/3EAAA4DrCd9MAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjbYeTgwYPKy8vT2LFjFRYWpr17915zTX19vb7xjW/I5XLp9ttv144dO4ZRKgAAGIlsh5HOzk6lpaWpoqJiSPPPnDmj+fPna86cOWpqatL3v/99LV26VAcOHLBdLAAAGHki7S6YN2+e5s2bN+T5lZWVGj9+vLZs2SJJuvvuu/Xee+/p3/7t35Sbm2v38AAAYISxHUbsamhoUHZ2dp+x3Nxcff/737/qGo/HI4/H43/c3t4uSfJ6vfJ6vQGrrXcvV7gVsD2dEMgeOKG33lCrOxTRa2fQZ2fQZ2cEs89D3TPoYaS1t
VWJiYl9xhITE9Xe3q4//OEPGjVqVL81ZWVlKi0t7TdeU1Oj2NjYgNe4frov4HsGU3V1tekShqW2ttZ0CdcNeu0M+uwM+uyMYPS5q6trSPOCHkaGo7i4WG632/+4vb1dKSkpmjt3ruLi4gJ2HK/Xq9raWq05Ei6PLyxg+wbbB+tC6/JWb59zcnIUFRVlupwRjV47gz47gz47I5h97r2ycS1BDyNJSUlqa2vrM9bW1qa4uLgBz4pIksvlksvl6jceFRUVlBekxxcmT0/ohJFQ/UcZrL8/9EevnUGfnUGfnRGMPg91v6B/zkhWVpbq6ur6jNXW1iorKyvYhwYAACHAdhj5/PPP1dTUpKamJklfvHW3qalJLS0tkr64xLJkyRL//EcffVTNzc16/PHHdfLkSb3wwgt6/fXXtXLlysA8AwAAENJsh5EjR44oPT1d6enpkiS326309HStXbtWknT+/Hl/MJGk8ePHa9++faqtrVVaWpq2bNmil156ibf1AgAAScO4Z2T27NmyrKu/FXagT1edPXu2jh07ZvdQAADgOsB30wAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAo4YVRioqKpSamqqYmBhlZmbq8OHDg84vLy/XnXfeqVGjRiklJUUrV67UH//4x2EVDAAARhbbYWT37t1yu90qKSnR0aNHlZaWptzcXF24cGHA+bt27dKqVatUUlKiEydO6OWXX9bu3bv15JNPfuniAQBA6LMdRrZu3aply5apoKBA99xzjyorKxUbG6vt27cPOP/QoUOaOXOmFi1apNTUVM2dO1cPPvjgNc+mAACA60Okncnd3d1qbGxUcXGxfyw8PFzZ2dlqaGgYcM29996rn/70pzp8+LBmzJih5uZmVVdXa/HixVc9jsfjkcfj8T9ub2+XJHm9Xnm9XjslD6p3L1e4FbA9nRDIHjiht95QqzsU0Wtn0Gdn0GdnBLPPQ93TVhi5dOmSenp6lJiY2Gc8MTFRJ0+eHHDNokWLdOnSJd13332yLEtXrlzRo48+OuhlmrKyMpWWlvYbr6mpUWxsrJ2Sh2T9dF/A9wym6upq0yUMS21trekSrhv02hn02Rn02RnB6HNXV9eQ5tkKI8NRX1+vDRs26IUXXlBmZqZOnz6tFStWaP369VqzZs2Aa4qLi+V2u/2P29vblZKSorlz5youLi5gtXm9XtXW1mrNkXB5fGEB2zfYPliXa7oEW3r7nJOTo6ioKNPljGj02hn02Rn02RnB7HPvlY1rsRVG4uPjFRERoba2tj7jbW1tSkpKGnDNmjVrtHjxYi1dulSSNGXKFHV2duqRRx7R6tWrFR7e/7YVl8sll8vVbzwqKiooL0iPL0yentAJI6H6jzJYf3/oj147gz47gz47Ixh9Hup+tm5gjY6OVkZGhurq6vxjPp9PdXV1ysrKGnBNV1dXv8AREREhSbKs0LpXAwAABJ7tyzRut1v5+fmaPn26ZsyYofLycnV2dqqgoECStGTJEo0bN05lZWWSpLy8PG3dulXp6en+yzRr1qxRXl6eP5QAAIDrl+0wsnDhQl28eFFr165Va2urpk2bpv379/tvam1paelzJuSpp55SWFiYnnrqKZ07d0633nqr8vLy9MwzzwTuWQAAgJA1rBtYi4qKVFRUNODP6uvr+x4gMlIlJSUqKSkZzqEAAMAIx3fTAAAAowgjAADAKMIIAAAwijACAACMI
owAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMCoYYWRiooKpaamKiYmRpmZmTp8+PCg8z/99FMVFhYqOTlZLpdLkyZNUnV19bAKBgAAI0uk3QW7d++W2+1WZWWlMjMzVV5ertzcXJ06dUoJCQn95nd3dysnJ0cJCQl64403NG7cOH3yyScaPXp0IOoHAAAhznYY2bp1q5YtW6aCggJJUmVlpfbt26ft27dr1apV/eZv375dly9f1qFDhxQVFSVJSk1N/XJVAwCAEcNWGOnu7lZjY6OKi4v9Y+Hh4crOzlZDQ8OAa372s58pKytLhYWFevvtt3Xrrbdq0aJFeuKJJxQRETHgGo/HI4/H43/c3t4uSfJ6vfJ6vXZKHlTvXq5wK2B7OiGQPXBCb72hVncootfOoM/OoM/OCGafh7qnrTBy6dIl9fT0KDExsc94YmKiTp48OeCa5uZmvfPOO3rooYdUXV2t06dPa/ny5fJ6vSopKRlwTVlZmUpLS/uN19TUKDY21k7JQ7J+ui/gewZTqN5vU1tba7qE6wa9dgZ9dgZ9dkYw+tzV1TWkebYv09jl8/mUkJCgF198UREREcrIyNC5c+f03HPPXTWMFBcXy+12+x+3t7crJSVFc+fOVVxcXMBq83q9qq2t1Zoj4fL4wgK2b7B9sC7XdAm29PY5JyfHf6kOwUGvnUGfnUGfnRHMPvde2bgWW2EkPj5eERERamtr6zPe1tampKSkAdckJycrKiqqzyWZu+++W62treru7lZ0dHS/NS6XSy6Xq994VFRUUF6QHl+YPD2hE0ZC9R9lsP7+0B+9dgZ9dgZ9dkYw+jzU/Wy9tTc6OloZGRmqq6vzj/l8PtXV1SkrK2vANTNnztTp06fl8/3pUshHH32k5OTkAYMIAAC4vtj+nBG3261t27bpJz/5iU6cOKHvfe976uzs9L+7ZsmSJX1ucP3e976ny5cva8WKFfroo4+0b98+bdiwQYWFhYF7FgAAIGTZvmdk4cKFunjxotauXavW1lZNmzZN+/fv99/U2tLSovDwP2WclJQUHThwQCtXrtTUqVM1btw4rVixQk888UTgngUAAAhZw7qBtaioSEVFRQP+rL6+vt9YVlaWfv3rXw/nUAAAYITju2kAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYNSwwkhFRYVSU1MVExOjzMxMHT58eEjrqqqqFBYWpgULFgznsAAAYASyHUZ2794tt9utkpISHT16VGlpacrNzdWFCxcGXXf27Fk99thjmjVr1rCLBQAAI4/tMLJ161YtW7ZMBQUFuueee1RZWanY2Fht3779qmt6enr00EMPqbS0VBMmTPhSBQMAgJEl0s7k7u5uNTY2qri42D8WHh6u7OxsNTQ0XHXdD3/4QyUkJOjhhx/Wr371q2sex+PxyOPx+B+3t7dLkrxer7xer52SB9W7lyvcCtieT
ghkD5zQW2+o1R2K6LUz6LMz6LMzgtnnoe5pK4xcunRJPT09SkxM7DOemJiokydPDrjmvffe08svv6ympqYhH6esrEylpaX9xmtqahQbG2un5CFZP90X8D2Dqbq62nQJw1JbW2u6hOsGvXYGfXYGfXZGMPrc1dU1pHm2wohdHR0dWrx4sbZt26b4+PghrysuLpbb7fY/bm9vV0pKiubOnau4uLiA1ef1elVbW6s1R8Ll8YUFbN9g+2BdrukSbOntc05OjqKiokyXM6LRa2fQZ2fQZ2cEs8+9VzauxVYYiY+PV0REhNra2vqMt7W1KSkpqd/8jz/+WGfPnlVeXp5/zOf74ixEZGSkTp06pYkTJ/Zb53K55HK5+o1HRUUF5QXp8YXJ0xM6YSRU/1EG6+8P/dFrZ9BnZ9BnZwSjz0Pdz9YNrNHR0crIyFBdXZ1/zOfzqa6uTllZWf3m33XXXTp+/Liampr8f7797W9rzpw5ampqUkpKip3DAwCAEcj2ZRq32638/HxNnz5dM2bMUHl5uTo7O1VQUCBJWrJkicaNG6eysjLFxMRo8uTJfdaPHj1akvqNAwCA65PtMLJw4UJdvHhRa9euVWtrq6ZNm6b9+/f7b2ptaWlReDgf7AoAAIZmWDewFhUVqaioaMCf1dfXD7p2x44dwzkkAAAYoTiFAQAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjCCMAAMAowggAADCKMAIAAIwijAAAAKMIIwAAwCjCCAAAMIowAgAAjCKMAAAAowgjAADAKMIIAAAwijACAACMIowAAACjhhVGKioqlJqaqpiYGGVmZurw4cNXnbtt2zbNmjVLN998s26++WZlZ2cPOh8AAFxfbIeR3bt3y+12q6SkREePHlVaWppyc3N14cKFAefX19frwQcf1LvvvquGhgalpKRo7ty5Onfu3JcuHgAAhD7bYWTr1q1atmyZCgoKdM8996iyslKxsbHavn37gPNfffVVLV++XNOmTdNdd92ll156ST6fT3V1dV+6eAAAEPoi7Uzu7u5WY2OjiouL/WPh4eHKzs5WQ0PDkPbo6uqS1+vVmDFjrjrH4/HI4/H4H7e3t0uSvF6vvF6vnZIH1buXK9wK2J5OCGQPnNBbb6jVHYrotTPoszPoszOC2eeh7hlmWdaQfxP//ve/17hx43To0CFlZWX5xx9//HH9x3/8h/7zP//zmnssX75cBw4c0G9+8xvFxMQMOGfdunUqLS3tN75r1y7FxsYOtVwAAGBQV1eXFi1apM8++0xxcXFXnWfrzMiXtXHjRlVVVam+vv6qQUSSiouL5Xa7/Y/b29v995oM9mTs8nq9qq2t1Zoj4fL4wgK2b7B9sC7XdAm29PY5JydHUVFRpssZ0ei1M+izM+izM4LZ594rG9diK4zEx8crIiJCbW1tfcbb2tqUlJQ06NrNmzdr48aN+uUvf6mpU6cOOtflcsnlcvUbj4qKCsoL0uMLk6cndMJIqP6jDNbfH/qj186gz86gz84IRp+Hup+tG1ijo6OVkZHR5+bT3ptR//9lmz+3adMmrV+/Xvv379f06dPtHBIAAIxwti/TuN1u5efna/r06ZoxY4bKy8vV2dmpgoICSdKSJUs0btw4lZWVSZKeffZZrV27Vrt27VJqaqpaW1slSTfccINuuOGGAD4VAAAQimyHkYULF+rixYtau3atWltbNW3aNO3fv1+JiYmSpJaWFoWH/+mEy49//GN1d3frO9/5Tp99SkpKtG7dui9XPULK5HUHQupymCSd3TjfdAkAMOIN6wbWoqIiF
RUVDfiz+vr6Po/Pnj07nEMAAIDrBN9NAwAAjHL0rb0IjNRV+0yXYIsrwtKmGaarAAB8VXFmBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRhBEAAGAUYQQAABhFGAEAAEYRRgAAgFGEEQAAYFSk6QIAIHXVPtMl2OKKsLRphukqgJGDMyMAAMAowggAADCKyzTACDR53QF5esJMlwEAQ8KZEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBRfOgZAAxTqH243NmN802XAAyIMAIA+Eoj9I18XKYBAABGDevMSEVFhZ577jm1trYqLS1Nzz//vGbMuPr3ae/Zs0dr1qzR2bNndccdd+jZZ5/VAw88MOyiAQD2pa7aZ7oEW1wRljZd/VcLRhDbZ0Z2794tt9utkpISHT16VGlpacrNzdWFCxcGnH/o0CE9+OCDevjhh3Xs2DEtWLBACxYs0AcffPCliwcAAKHP9pmRrVu3atmyZSooKJAkVVZWat++fdq+fbtWrVrVb/6PfvQjfetb39IPfvADSdL69etVW1urf//3f1dlZeWXLB8AgK8WzkDZZyuMdHd3q7GxUcXFxf6x8PBwZWdnq6GhYcA1DQ0NcrvdfcZyc3O1d+/eqx7H4/HI4/H4H3/22WeSpMuXL8vr9dopeVBer1ddXV2K9Iarxxc6N0eFmkifpa4uX0j2+fbHXjddgi2ucEtPpYdmr0NJKL+mQwl9dkZvn//nf/5HUVFRAd27o6NDkmRZ1uA12Nn00qVL6unpUWJiYp/xxMREnTx5csA1ra2tA85vbW296nHKyspUWlrab3z8+PF2ysVXyCLTBVxH6LUz6LMz6LMzgt3njo4O3XTTTVf9+Vfyrb3FxcV9zqb4fD5dvnxZt9xyi8LCApeO29vblZKSot/97neKi4sL2L7oiz47h147gz47gz47I5h9tixLHR0dGjt27KDzbIWR+Ph4RUREqK2trc94W1ubkpKSBlyTlJRka74kuVwuuVyuPmOjR4+2U6otcXFxvNAdQJ+dQ6+dQZ+dQZ+dEaw+D3ZGpJetd9NER0crIyNDdXV1/jGfz6e6ujplZWUNuCYrK6vPfEmqra296nwAAHB9sX2Zxu12Kz8/X9OnT9eMGTNUXl6uzs5O/7trlixZonHjxqmsrEyStGLFCt1///3asmWL5s+fr6qqKh05ckQvvvhiYJ8JAAAISbbDyMKFC3Xx4kWtXbtWra2tmjZtmvbv3++/SbWlpUXh4X864XLvvfdq165deuqpp/Tkk0/qjjvu0N69ezV58uTAPYthcrlcKikp6XdJCIFFn51Dr51Bn51Bn53xVehzmHWt99sAAAAEEd9NAwAAjCKMAAAAowgjAADAKMIIAAAwasSHkYqKCqWmpiomJkaZmZk6fPjwoPP37Nmju+66SzExMZoyZYqqq6sdqjS02enztm3bNGvWLN188826+eablZ2dfc2/F/yJ3dd0r6qqKoWFhWnBggXBLXCEsNvnTz/9VIWFhUpOTpbL5dKkSZP4/2MI7Pa5vLxcd955p0aNGqWUlBStXLlSf/zjHx2qNjQdPHhQeXl5Gjt2rMLCwgb9brhe9fX1+sY3viGXy6Xbb79dO3bsCG6R1ghWVVVlRUdHW9u3b7d+85vfWMuWLbNGjx5ttbW1DTj//ffftyIiIqxNmzZZH374ofXUU09ZUVFR1vHjxx2uPLTY7fOiRYusiooK69ixY9aJEyesf/qnf7Juuukm67//+78drjz02O11rzNnzljjxo2zZs2aZf3t3/6tM8WGMLt99ng81vTp060HHnjAeu+996wzZ85Y9fX1VlNTk8OVhxa7fX711Vctl8tlvfrqq9aZM2esA
wcOWMnJydbKlSsdrjy0VFdXW6tXr7befPNNS5L11ltvDTq/ubnZio2Ntdxut/Xhhx9azz//vBUREWHt378/aDWO6DAyY8YMq7Cw0P+4p6fHGjt2rFVWVjbg/O9+97vW/Pnz+4xlZmZa//Iv/xLUOkOd3T7/uStXrlg33nij9ZOf/CRYJY4Yw+n1lStXrHvvvdd66aWXrPz8fMLIENjt849//GNrwoQJVnd3t1Mljgh2+1xYWGh985vf7DPmdrutmTNnBrXOkWQoYeTxxx+3vv71r/cZW7hwoZWbmxu0ukbsZZru7m41NjYqOzvbPxYeHq7s7Gw1NDQMuKahoaHPfEnKzc296nwMr89/rqurS16vV2PGjAlWmSPCcHv9wx/+UAkJCXr44YedKDPkDafPP/vZz5SVlaXCwkIlJiZq8uTJ2rBhg3p6epwqO+QMp8/33nuvGhsb/ZdympubVV1drQceeMCRmq8XJn4XfiW/tTcQLl26pJ6eHv8nw/ZKTEzUyZMnB1zT2to64PzW1tag1RnqhtPnP/fEE09o7Nix/V786Gs4vX7vvff08ssvq6mpyYEKR4bh9Lm5uVnvvPOOHnroIVVXV+v06dNavny5vF6vSkpKnCg75Aynz4sWLdKlS5d03333ybIsXblyRY8++qiefPJJJ0q+blztd2F7e7v+8Ic/aNSoUQE/5og9M4LQsHHjRlVVVemtt95STEyM6XJGlI6ODi1evFjbtm1TfHy86XJGNJ/Pp4SEBL344ovKyMjQwoULtXr1alVWVpoubUSpr6/Xhg0b9MILL+jo0aN68803tW/fPq1fv950afiSRuyZkfj4eEVERKitra3PeFtbm5KSkgZck5SUZGs+htfnXps3b9bGjRv1y1/+UlOnTg1mmSOC3V5//PHHOnv2rPLy8vxjPp9PkhQZGalTp05p4sSJwS06BA3nNZ2cnKyoqChFRET4x+6++261traqu7tb0dHRQa05FA2nz2vWrNHixYu1dOlSSdKUKVPU2dmpRx55RKtXr+7zvWgYvqv9LoyLiwvKWRFpBJ8ZiY6OVkZGhurq6vxjPp9PdXV1ysrKGnBNVlZWn/mSVFtbe9X5GF6fJWnTpk1av3699u/fr+nTpztRasiz2+u77rpLx48fV1NTk//Pt7/9bc2ZM0dNTU1KSUlxsvyQMZzX9MyZM3X69Gl/2JOkjz76SMnJyQSRqxhOn7u6uvoFjt4AaPE1awFj5Hdh0G6N/QqoqqqyXC6XtWPHDuvDDz+0HnnkEWv06NFWa2urZVmWtXjxYmvVqlX++e+//74VGRlpbd682Tpx4oRVUlLCW3uHwG6fN27caEVHR1tvvPGGdf78ef+fjo4OU08hZNjt9Z/j3TRDY7fPLS0t1o033mgVFRVZp06dsn7xi19YCQkJ1tNPP23qKYQEu30uKSmxbrzxRuu1116zmpubrZqaGmvixInWd7/7XVNPISR0dHRYx44ds44dO2ZJsrZu3WodO3bM+uSTTyzLsqxVq1ZZixcv9s/vfWvvD37wA+vEiRNWRUUFb+39sp5//nnrtttus6Kjo60ZM2ZYv/71r/0/u//++638/Pw+819//XVr0qRJVnR0tPX1r3/d2rdvn8MVhyY7ff7a175mSer3p6SkxPnCQ5Dd1/T/RxgZOrt9PnTokJWZmWm5XC5rwoQJ1jPPPGNduXLF4apDj50+e71ea926ddbEiROtmJgYKyUlxVq+fLn1v//7v84XHkLefffdAf/P7e1tfn6+df/99/dbM23aNCs6OtqaMGGC9corrwS1xjDL4twWAAAwZ8TeMwIAAEIDYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUYQRAABgFGEEAAAYRRgBAABGEUYAAIBR/wfWP+KWs4PFjgAAAABJRU5ErkJggg==", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "jigsaw2_train['toxic'].hist(bins=10);" + ] + }, + { + "cell_type": "code", + "execution_count": 215, + "id": "fa9e08bc-666a-4490-8ae9-eb8c3c8952cb", + "metadata": {}, + "outputs": [], + "source": [ + "jigsaw2_train['toxic_binary'] = jigsaw2_train['toxic'].gt(0.5).astype(int)" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "4a02f61b-e6e4-4cff-b40c-63b11d9c80c0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcomment_textlangtoxic
75027502Sayın hocam sizden bir ricam daha olacak. Şu f...tr0
36073607> Occhio tu piuttosto... con il rollback hai ...it0
29022902È inutile che cercate di buttarla sul ridicolo...it1
\n", + "
" + ], + "text/plain": [ + " id comment_text lang toxic\n", + "7502 7502 Sayın hocam sizden bir ricam daha olacak. Şu f... tr 0\n", + "3607 3607 > Occhio tu piuttosto... con il rollback hai ... it 0\n", + "2902 2902 È inutile che cercate di buttarla sul ridicolo... it 1" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw3_valid = pd.read_csv(f\"{DATA_DIR}/validation.csv\")\n", + "jigsaw3_valid.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "cbc7a0be-9aef-48e4-b954-bcafa9a64c03", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idcontentlang
68536853Valla hiç bir fikrim yok, bu akşam bir iki ker...tr
1630716307merhaba, Zeynel Limoncu maddesine yetersiz ve ...tr
4844848448Ненейтрально: Кабумба, Мбеки — министр Пропага...ru
\n", + "
" + ], + "text/plain": [ + " id content lang\n", + "6853 6853 Valla hiç bir fikrim yok, bu akşam bir iki ker... tr\n", + "16307 16307 merhaba, Zeynel Limoncu maddesine yetersiz ve ... tr\n", + "48448 48448 Ненейтрально: Кабумба, Мбеки — министр Пропага... ru" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw3_test = pd.read_csv(f\"{DATA_DIR}/test.csv\")\n", + "jigsaw3_test.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "af357be3-7e56-4cb8-b94e-6c34672f46fa", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idtoxic
34608346080
36778367780
632963290
\n", + "
" + ], + "text/plain": [ + " id toxic\n", + "34608 34608 0\n", + "36778 36778 0\n", + "6329 6329 0" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jigsaw3_test_labels = pd.read_csv(f\"{DATA_DIR}/test_labels.csv\")\n", + "jigsaw3_test_labels.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "d9d243b0-7385-4eb2-ab7d-b8c2ccf5c199", + "metadata": {}, + "outputs": [], + "source": [ + "assert (jigsaw3_test['id'] == jigsaw3_test_labels['id']).all()\n", + "jigsaw3_test['toxic'] = jigsaw3_test_labels['toxic']" + ] + }, + { + "cell_type": "markdown", + "id": "f2016942-1230-49c2-875b-5640475e1848", + "metadata": {}, + "source": [ + "## Merge the datasets into one" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "0ed7e70b-1aed-48c3-996d-bb3a13a3114a", + "metadata": {}, + "outputs": [], + "source": [ + "jigsaw1_train['dataset'] = 'jigsaw1'\n", + "jigsaw2_train['dataset'] = 'jigsaw2'\n", + "jigsaw3_valid['dataset'] = 'jigsaw3'\n", + "jigsaw3_test['dataset'] = 'jigsaw3'\n", + "\n", + "jigsaw1_train['split'] = 'train'\n", + "jigsaw2_train['split'] = 'train'\n", + "jigsaw3_valid['split'] = 'valid'\n", + "jigsaw3_test['split'] = 'test'\n", + "\n", + "jigsaw1_train['lang'] = 'en'\n", + "jigsaw2_train['lang'] = 'en'" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "3640b96f-4d23-4282-b35d-95c2543794c5", + "metadata": {}, + "outputs": [], + "source": [ + "mutox_raw['dataset'] = 'mutox'\n", + "mutox_raw" + ] + }, + { + "cell_type": "code", + "execution_count": 229, + "id": "2e7ae6c1-39e7-4224-97b0-23a1c7a9f2f5", + "metadata": {}, + "outputs": [], + "source": [ + "data_pooled = pd.concat([\n", + " jigsaw1_train.rename({'comment_text': 'text'}, axis=1), \n", + " jigsaw2_train[\n", + " # removing the labels with low agreement between annotators\n", + " jigsaw2_train['toxic'].lt(0.3) | jigsaw2_train['toxic'].gt(0.6)\n", + " ].rename({'comment_text': 
'text', 'toxic_binary': 'toxic', 'toxic': 'toxic_fraction'}, axis=1), \n", + " jigsaw3_valid.rename({'comment_text': 'text'}, axis=1), \n", + " jigsaw3_test.rename({'content': 'text'}, axis=1),\n", + " mutox_cleaned.rename({'audio_file_transcript': 'text', 'partition': 'split'}, axis=1),\n", + " \n", + "])[['text', 'toxic', 'lang', 'split', 'dataset']].dropna()" + ] + }, + { + "cell_type": "code", + "execution_count": 230, + "id": "3d2e098a-a757-487e-a563-9444038d3290", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
texttoxiclangsplitdataset
124242Good idea and done. - ✉0entrainjigsaw1
216946\", 4 September 2011 (UTC) \\n :::I'm adding to ...0entrainjigsaw1
126160tell your girl friend of your User:Nancy stop ...0entrainjigsaw1
1575462The numbers are stunning. This part of the co...0entrainjigsaw2
1469311So you agree that if two people in the househo...0entrainjigsaw2
1537882They are fighting racism, especially state-spo...0entrainjigsaw2
39046Дело не в этом, а в том, что простая подборка...0rutestjigsaw3
5881745x45px|left Spam bağlantılar eklemeyi ...0trtestjigsaw3
22587Pardon pardon gerek kalmadı, zaten örneği verm...0trtestjigsaw3
7085and the destruction of his dynasty.0engtrainmutox
88252Üllatus-üllatus, mitte mõne massibrandi, vai ...0esttrainmutox
40817来自上帝和模西这就是许多失被称为大位的失片的原因有机墨室的观点0cmntrainmutox
\n", + "
" + ], + "text/plain": [ + " text toxic lang split \\\n", + "124242 Good idea and done. - ✉ 0 en train \n", + "216946 \", 4 September 2011 (UTC) \\n :::I'm adding to ... 0 en train \n", + "126160 tell your girl friend of your User:Nancy stop ... 0 en train \n", + "1575462 The numbers are stunning. This part of the co... 0 en train \n", + "1469311 So you agree that if two people in the househo... 0 en train \n", + "1537882 They are fighting racism, especially state-spo... 0 en train \n", + "39046 Дело не в этом, а в том, что простая подборка... 0 ru test \n", + "58817 45x45px|left Spam bağlantılar eklemeyi ... 0 tr test \n", + "22587 Pardon pardon gerek kalmadı, zaten örneği verm... 0 tr test \n", + "7085 and the destruction of his dynasty. 0 eng train \n", + "88252 Üllatus-üllatus, mitte mõne massibrandi, vai ... 0 est train \n", + "40817 来自上帝和模西这就是许多失被称为大位的失片的原因有机墨室的观点 0 cmn train \n", + "\n", + " dataset \n", + "124242 jigsaw1 \n", + "216946 jigsaw1 \n", + "126160 jigsaw1 \n", + "1575462 jigsaw2 \n", + "1469311 jigsaw2 \n", + "1537882 jigsaw2 \n", + "39046 jigsaw3 \n", + "58817 jigsaw3 \n", + "22587 jigsaw3 \n", + "7085 mutox \n", + "88252 mutox \n", + "40817 mutox " + ] + }, + "execution_count": 230, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_pooled.groupby(['dataset']).sample(3)" + ] + }, + { + "cell_type": "markdown", + "id": "c5ad74b0-93af-4559-9418-cb598a5d0d01", + "metadata": {}, + "source": [ + "In each of the datasets, we have about 5-20% of toxic labels. " + ] + }, + { + "cell_type": "code", + "execution_count": 231, + "id": "1a51bb81-17e8-4feb-b16f-566c78b9623e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
countmean
datasetsplit
jigsaw1train2235490.095657
jigsaw2train16993100.045574
jigsaw3test638120.225820
valid80000.153750
mutoxdev220320.108932
devtest103080.135914
train651580.135532
\n", + "
" + ], + "text/plain": [ + " count mean\n", + "dataset split \n", + "jigsaw1 train 223549 0.095657\n", + "jigsaw2 train 1699310 0.045574\n", + "jigsaw3 test 63812 0.225820\n", + " valid 8000 0.153750\n", + "mutox dev 22032 0.108932\n", + " devtest 10308 0.135914\n", + " train 65158 0.135532" + ] + }, + "execution_count": 231, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_pooled.groupby(['dataset', 'split'])['toxic'].aggregate(['count', 'mean'])" + ] + }, + { + "cell_type": "markdown", + "id": "e173c6bc-d23e-48fb-882a-d546be1a8083", + "metadata": {}, + "source": [ + "## Map the language codes" + ] + }, + { + "cell_type": "markdown", + "id": "ab405158-e4a6-4b99-a5e6-82638ff60fc7", + "metadata": {}, + "source": [ + "SONAR model requires adding a special token indicating the language to the text being encoded.\n", + "\n", + "So we need to re-format the language codes to agree with those of the SONAR tokenizer." + ] + }, + { + "cell_type": "markdown", + "id": "21ca886c-a352-4666-9195-b50a8ba540ea", + "metadata": {}, + "source": [ + "Here is an example of how the tokenizer works:" + ] + }, + { + "cell_type": "code", + "execution_count": 176, + "id": "96c85b1d-7218-46f9-8420-ce42598be8b2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[256047, 94124, 15697, 2]\n", + "['eng_Latn', '▁Hello', '▁world', '']\n" + ] + } + ], + "source": [ + "tokenizer.src_lang = \"eng_Latn\" # setting the language that will be added to the text by default\n", + "tokens_ids = tokenizer('Hello world').input_ids\n", + "print(tokens_ids) # [256047, 94124, 15697, 2]\n", + "tokens = tokenizer.convert_ids_to_tokens(tokens_ids)\n", + "print(tokens) # ['eng_Latn', '▁Hello', '▁world', '']" + ] + }, + { + "cell_type": "code", + "execution_count": 112, + "id": "13ff7970-badf-46b1-8327-d03231cdc3c0", + "metadata": {}, + "outputs": [], + "source": [ + "lang_codes_map ={\n", + " 'arb': 'arb_Arab',\n", + " 'ben': 
'ben_Beng',\n", + " 'bul': 'bul_Cyrl',\n", + " 'cat': 'cat_Latn',\n", + " 'ces': 'ces_Latn',\n", + " 'cmn': 'zho_Hans',\n", + " 'dan': 'dan_Latn',\n", + " 'deu': 'deu_Latn',\n", + " 'ell': 'ell_Grek',\n", + " 'en': 'eng_Latn',\n", + " 'eng': 'eng_Latn',\n", + " 'es': 'spa_Latn',\n", + " 'est': 'est_Latn',\n", + " 'fin': 'fin_Latn',\n", + " 'fr': 'fra_Latn',\n", + " 'fra': 'fra_Latn',\n", + " 'heb': 'heb_Hebr',\n", + " 'hin': 'hin_Deva',\n", + " 'hun': 'hun_Latn',\n", + " 'ind': 'ind_Latn',\n", + " 'it': 'ita_Latn',\n", + " 'ita': 'ita_Latn',\n", + " 'nld': 'nld_Latn',\n", + " 'pes': 'pes_Arab',\n", + " 'pol': 'pol_Latn',\n", + " 'por': 'por_Latn',\n", + " 'pt': 'por_Latn',\n", + " 'ru': 'rus_Cyrl',\n", + " 'rus': 'rus_Cyrl',\n", + " 'slk': 'slk_Latn',\n", + " 'spa': 'spa_Latn',\n", + " 'swh': 'swh_Latn',\n", + " 'tgl': 'tgl_Latn',\n", + " 'tr': 'tur_Latn',\n", + " 'tur': 'tur_Latn',\n", + " 'urd': 'urd_Arab',\n", + " 'vie': 'vie_Latn'\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "f9fc0070-57b4-47b8-bcf7-5aa7ce73d80b", + "metadata": {}, + "source": [ + "Check that we have covered all the languages in the dataset, and the codes are indeed part of the SONAR tokenizer:" + ] + }, + { + "cell_type": "code", + "execution_count": 239, + "id": "28dd3775-6519-462d-89a4-2c3a6d6f314b", + "metadata": {}, + "outputs": [], + "source": [ + "data_pooled['lang_code'] = data_pooled['lang'].apply(lang_codes_map.get) \n", + "assert data_pooled['lang_code'].isnull().sum() == 0\n", + "\n", + "for lang_code in lang_codes_map.values():\n", + " assert lang_code in tokenizer.special_tokens_map['additional_special_tokens']" + ] + }, + { + "cell_type": "markdown", + "id": "bd022830-ba03-4cd2-8f2d-66f2d945b429", + "metadata": {}, + "source": [ + "# 2. 
Setting up the model architecture"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "50f81140-8d36-4245-a9b3-a7ba10ff8eac",
+   "metadata": {},
+   "source": [
+    "We start by downloading the encoder from Hugging Face."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "c122a013-2338-48b6-acbb-527f4346330f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "encoder = M2M100Encoder.from_pretrained(BASE_MODEL_NAME)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "30b46023-76eb-4b29-b601-0a9434ef8ec0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "M2M100Encoder(\n",
+       "  (embed_tokens): M2M100ScaledWordEmbedding(256206, 1024, padding_idx=1)\n",
+       "  (embed_positions): M2M100SinusoidalPositionalEmbedding()\n",
+       "  (layers): ModuleList(\n",
+       "    (0-23): 24 x M2M100EncoderLayer(\n",
+       "      (self_attn): M2M100SdpaAttention(\n",
+       "        (k_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "        (v_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "        (q_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "        (out_proj): Linear(in_features=1024, out_features=1024, bias=True)\n",
+       "      )\n",
+       "      (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       "      (activation_fn): ReLU()\n",
+       "      (fc1): Linear(in_features=1024, out_features=8192, bias=True)\n",
+       "      (fc2): Linear(in_features=8192, out_features=1024, bias=True)\n",
+       "      (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       "    )\n",
+       "  )\n",
+       "  (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n",
+       ")"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "encoder"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5556da4e-3694-408d-9fde-df0fbd7a9ac9",
+   "metadata": {},
+   "source": [
+    "In the spirit of typical Hugging Face models, we will add a classification head on top (just like, for example, 
[BertForSequenceClassification](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py#L1472))." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e87b52d2-8c68-406b-bc4e-2c031a6c076d", + "metadata": {}, + "outputs": [], + "source": [ + "class SonarForSequenceClassification(M2M100PreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + " self.num_labels = config.num_labels\n", + " self.config = config\n", + " \n", + " self.sonar_encoder = M2M100Encoder(config)\n", + " \n", + " # the classifier will consist of a wide linear layer, non-linearity, dropout, and a final linear layer\n", + " self.fc1 = nn.Linear(config.d_model, config.encoder_ffn_dim)\n", + " self.activation = nn.Tanh()\n", + " self.dropout = nn.Dropout(config.dropout)\n", + " self.classifier = nn.Linear(config.encoder_ffn_dim, config.num_labels)\n", + "\n", + " # Initialize weights and apply final processing\n", + " self.post_init()\n", + " \n", + " def forward(\n", + " self,\n", + " input_ids: Optional[torch.Tensor] = None,\n", + " attention_mask: Optional[torch.Tensor] = None,\n", + " head_mask: Optional[torch.Tensor] = None,\n", + " inputs_embeds: Optional[torch.Tensor] = None,\n", + " labels: Optional[torch.Tensor] = None,\n", + " output_attentions: Optional[bool] = None,\n", + " output_hidden_states: Optional[bool] = None,\n", + " return_dict: Optional[bool] = None,\n", + " ) -> Union[tuple[torch.Tensor], SequenceClassifierOutput]:\n", + " r\"\"\"\n", + " labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):\n", + " Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,\n", + " config.num_labels - 1]`. 
If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If\n",
+    "            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n",
+    "        \"\"\"\n",
+    "        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n",
+    "\n",
+    "        outputs = self.sonar_encoder(\n",
+    "            input_ids,\n",
+    "            attention_mask=attention_mask,\n",
+    "            head_mask=head_mask,\n",
+    "            inputs_embeds=inputs_embeds,\n",
+    "            output_attentions=output_attentions,\n",
+    "            output_hidden_states=True,\n",
+    "            return_dict=return_dict,\n",
+    "        )\n",
+    "        token_embs = outputs.last_hidden_state\n",
+    "        # mean-pool the token embeddings over the non-padding positions to get sentence embeddings\n",
+    "        sentence_embs = (token_embs * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.unsqueeze(-1).sum(1)\n",
+    "\n",
+    "        hidden = self.fc1(sentence_embs)\n",
+    "        hidden = self.activation(hidden)  # apply the non-linearity declared in __init__\n",
+    "        hidden = self.dropout(hidden)\n",
+    "        logits = self.classifier(hidden)\n",
+    "\n",
+    "        loss = None\n",
+    "        if labels is not None:\n",
+    "            if self.config.problem_type is None:\n",
+    "                if self.num_labels == 1:\n",
+    "                    self.config.problem_type = \"regression\"\n",
+    "                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):\n",
+    "                    self.config.problem_type = \"single_label_classification\"\n",
+    "                else:\n",
+    "                    self.config.problem_type = \"multi_label_classification\"\n",
+    "\n",
+    "            if self.config.problem_type == \"regression\":\n",
+    "                loss_fct = MSELoss()\n",
+    "                if self.num_labels == 1:\n",
+    "                    loss = loss_fct(logits.squeeze(), labels.squeeze())\n",
+    "                else:\n",
+    "                    loss = loss_fct(logits, labels)\n",
+    "            elif self.config.problem_type == \"single_label_classification\":\n",
+    "                loss_fct = CrossEntropyLoss()\n",
+    "                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n",
+    "            elif self.config.problem_type == \"multi_label_classification\":\n",
+    "                loss_fct = BCEWithLogitsLoss()\n",
+    "                loss = loss_fct(logits, labels)\n",
+    "        if not return_dict:\n",
+    "            output = (logits,) + outputs[2:]\n",
+    "            return ((loss,) + output) if loss is not None else 
output\n", + "\n", + " return SequenceClassifierOutput(\n", + " loss=loss,\n", + " logits=logits,\n", + " hidden_states=sentence_embs,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "f78e63d9-3f81-49c7-9668-38023a7c1011", + "metadata": {}, + "source": [ + "Create the model! And initialize its encoder by the actual SONAR encoder parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "62881559-57bc-4a71-9fb2-1c10b9009259", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "encoder.config.num_labels = 2\n", + "encoder.config.problem_type = \"single_label_classification\"\n", + "\n", + "model = SonarForSequenceClassification(encoder.config)\n", + "\n", + "model.sonar_encoder.load_state_dict(encoder.state_dict())" + ] + }, + { + "cell_type": "markdown", + "id": "bd75a385-6be7-4175-90d6-3b4fd0c5f9da", + "metadata": {}, + "source": [ + "Here is how our model is supposed to classify:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "7420543f-d913-4ef7-9701-1d4983c06940", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "tensor([[0.5013, 0.4987],\n", + " [0.5074, 0.4926],\n", + " [0.4992, 0.5008]])\n" + ] + } + ], + "source": [ + "sample_texts = ['Fuck you bitch!', 'Do you like dogs?', 'One more neutral sentence.']\n", + "tokenizer.src_lang = \"eng_Latn\"\n", + "\n", + "with torch.inference_mode():\n", + " out = model(**tokenizer(sample_texts, padding=True, return_tensors='pt').to(model.device))\n", + " probabilities = torch.softmax(out.logits, dim=-1)\n", + "print(probabilities)" + ] + }, + { + "cell_type": "markdown", + "id": "d68937af-76cd-4493-85c7-9ba407288836", + "metadata": {}, + "source": [ + "The outputs are a list of two-dimensional arrays, each of them containing probabilities of the labels. 
\n",
+    "\n",
+    "This is a binary classification problem with the labels 0 (non-toxic) and 1 (toxic). Thus, the last column of the output (column 1, counting from 0) contains the predicted probability that the text is toxic. Before training, this probability is close to 0.5 for every text: the model does not yet know what to predict."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00193864-7dc8-4ad3-bf70-7fd4b80485dd",
+   "metadata": {},
+   "source": [
+    "# 3. Preparing the datasets for training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1830348-b36b-4c46-9105-7fb76095b710",
+   "metadata": {},
+   "source": [
+    "We want to use the Hugging Face `Trainer`, which expects the data to be pre-tokenized in advance. So we will do just that.\n",
+    "\n",
+    "We will use all `train` and `dev` data for training, and `test` data for testing. We don't really need a separate validation dataset here, because we plan to train the models just once, without much hyperparameter tuning, so overfitting to the `test` split is not a problem."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 251,
+   "id": "a6ae4b71-758b-46de-8ff9-6a3b80d6534f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "DatasetDict({\n",
+       "    train: Dataset({\n",
+       "        features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+       "        num_rows: 2018049\n",
+       "    })\n",
+       "    test: Dataset({\n",
+       "        features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+       "        num_rows: 74120\n",
+       "    })\n",
+       "})"
+      ]
+     },
+     "execution_count": 251,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "ds_mix = datasets.DatasetDict({\n",
+    "    'train': datasets.Dataset.from_pandas(data_pooled[~data_pooled['split'].isin({'test', 'devtest'})], preserve_index=False),\n",
+    "    'test': datasets.Dataset.from_pandas(data_pooled[data_pooled['split'].isin({'test', 'devtest'})], preserve_index=False),\n",
+    "})\n",
+    "ds_mix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8942ae7e-06cd-41bf-ae5b-48e45b6ab0e7",
+   "metadata": {},
+   "source": [
+    "The test split is too large; we downsample it, taking at most 100 examples from each dataset-language-label combination.
" + ] + }, + { + "cell_type": "code", + "execution_count": 258, + "id": "38914623-f1db-4154-8c20-8c3d38f7b1f4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DatasetDict({\n", + " train: Dataset({\n", + " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n", + " num_rows: 2018049\n", + " })\n", + " test: Dataset({\n", + " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n", + " num_rows: 74120\n", + " })\n", + " test_small: Dataset({\n", + " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n", + " num_rows: 5101\n", + " })\n", + "})" + ] + }, + "execution_count": 258, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_small = data_pooled[\n", + " data_pooled['split'].isin({'test', 'devtest'})\n", + "].groupby(['dataset', 'lang_code', 'toxic']).apply(\n", + " lambda x: x.sample(min(len(x), 100), random_state=1)\n", + ")\n", + "ds_mix['test_small'] = datasets.Dataset.from_pandas(test_small, preserve_index=False)\n", + "ds_mix" + ] + }, + { + "cell_type": "markdown", + "id": "98c91eb2-77a5-45aa-83f2-1735706014da", + "metadata": {}, + "source": [ + "Now we can define the tokenization function (language-dependent!) 
and pre-tokenize the datasets.\n", + "\n", + "In addition, we will rename the target to \"labels\"" + ] + }, + { + "cell_type": "code", + "execution_count": 247, + "id": "d4ad18ba-d59f-460b-854f-92e8c7ae96c2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'input_ids': [[256047, 2867, 248203, 2], [256057, 163119, 248203, 2], [256147, 12700, 9272, 248203, 2]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1]]}\n" + ] + } + ], + "source": [ + "def tokenize_function(examples, max_length=512):\n", + " \"\"\" Tokenize the examples and prepend the right language tag.\"\"\"\n", + " result = tokenizer(\n", + " text=list(examples[\"text\"]),\n", + " truncation=True,\n", + " max_length=max_length,\n", + " )\n", + " lang_ids = tokenizer.convert_tokens_to_ids(list(examples[\"lang_code\"]))\n", + " for input_ids, lang_id in zip(result[\"input_ids\"], lang_ids):\n", + " input_ids[0] = lang_id\n", + "\n", + " if 'toxic' in examples:\n", + " result['labels'] = examples['toxic']\n", + " return result\n", + "\n", + "print(tokenize_function({\"text\": [\"Hi!\", \"Salut!\", \"Привет!\"], \"lang_code\": [\"eng_Latn\", \"fra_Latn\", \"rus_Cyrl\"]}))" + ] + }, + { + "cell_type": "code", + "execution_count": 259, + "id": "4cbcc19c-4da2-4629-a07c-3e35f68175f0", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "9187b87d21ee4da18552ee5a9c07806a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Map: 0%| | 0/2018049 [00:00\n", + " \n", + " \n", + " [3942/3942 2:19:11, Epoch 1/1]\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
StepTraining Loss
5000.354900
10000.146400
15000.113300
20000.110100
25000.109200
30000.108100
35000.109500

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n", + "Non-default generation parameters: {'max_length': 200}\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=3942, training_loss=0.14545092686731037, metrics={'train_runtime': 8355.0605, 'train_samples_per_second': 241.536, 'train_steps_per_second': 0.472, 'total_flos': 3.004701338322985e+18, 'train_loss': 0.14545092686731037, 'epoch': 1.0})" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "id": "a9e4ed6b-9315-420e-9c83-77aace8e7edc", + "metadata": {}, + "source": [ + "We can see that by the end of the epoch, the loss stagnates at the level about 0.1. Apparently, it cannot go much lower than that without unfreezing the model parameters. 
" + ] + }, + { + "cell_type": "markdown", + "id": "42540d92-2fcc-44a7-b0f3-dc99d6b566f6", + "metadata": {}, + "source": [ + "## Evaluating it" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "ef36fd6a-9e90-4c21-bb3e-1e1da71a61ae", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "device(type='cuda', index=0)" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.eval()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "780ff8f6-691c-412b-9075-865ab996a5ab", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "preds1_small = trainer.predict(ds_tokenized['test_small'])" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "092a6e2a-285c-4357-8c77-a802f9546dba", + "metadata": {}, + "outputs": [], + "source": [ + "with torch.inference_mode():\n", + " p1_bad = torch.softmax(torch.tensor(preds1_small.predictions[0]), dim=-1)[:, 1].cpu().numpy()" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "02e74f42-9e9f-4b65-8f67-4eb914abf490", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.8337834961877266\n" + ] + } + ], + "source": [ + "print(roc_auc_score(preds1_small.label_ids, p1_bad))" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "08805de6-7011-4d12-bdb6-003b0e81f0d8", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": 
"074f8f829b7f4fa5a08c0b9c7db509c4", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/5101 [00:00\n", + " \n", + " \n", + " [15768/15768 12:13:52, Epoch 2/2]\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Step	Training Loss
500	0.095400
1000	0.083500
1500	0.075400
2000	0.069800
2500	0.065700
3000	0.064500
3500	0.062000
4000	0.059500
4500	0.060500
5000	0.059000
5500	0.058600
6000	0.058700
6500	0.057000
7000	0.059200
7500	0.056600
8000	0.052500
8500	0.047500
9000	0.047600
9500	0.047300
10000	0.047200
10500	0.047300
11000	0.047100
11500	0.045700
12000	0.046100
12500	0.045700
13000	0.045600
13500	0.045600
14000	0.044500
14500	0.045800
15000	0.044400
15500	0.044600

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "IOPub message rate exceeded.\n", + "The Jupyter server will temporarily stop sending output\n", + "to the client in order to avoid crashing it.\n", + "To change this limit, set the config variable\n", + "`--ServerApp.iopub_msg_rate_limit`.\n", + "\n", + "Current values:\n", + "ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n", + "ServerApp.rate_limit_window=3.0 (secs)\n", + "\n", + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n", + "Non-default generation parameters: {'max_length': 200}\n", + " warnings.warn(\n", + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", + " warnings.warn(\n", + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. 
These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n", + "Non-default generation parameters: {'max_length': 200}\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=15768, training_loss=0.05560727533140139, metrics={'train_runtime': 44037.4041, 'train_samples_per_second': 91.652, 'train_steps_per_second': 0.358, 'total_flos': 5.37354496011832e+18, 'train_loss': 0.05560727533140139, 'epoch': 2.0})" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "id": "6a2081ee-4ad7-4bb4-b506-3c9224392cda", + "metadata": {}, + "source": [ + "## Evaluating" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "b55d26e8-a05b-4368-9cf1-890c95b7f8fd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.eval()" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "aff27f8b-20c2-4467-9910-6f857f381f75", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "preds2_small = trainer.predict(ds_tokenized['test_small'])" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + 
"id": "43889710-ff29-42b8-918d-7f377fdca926", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.8961500666222518\n" + ] + } + ], + "source": [ + "with torch.inference_mode():\n", + " p2_bad = torch.softmax(torch.tensor(preds2_small.predictions[0]), dim=-1)[:, 1].cpu().numpy()\n", + "print(roc_auc_score(preds2_small.label_ids, p2_bad))" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "9a98f19f-d5d4-46a5-9fab-0570ef4b2dc5", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c1dc507386f041e4ad2a8638569e94df", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/5101 [00:00 Date: Mon, 28 Jul 2025 16:40:05 +0000 Subject: [PATCH 2/2] a note on dependencies --- examples/finetune_sonar_as_toxicity_classifier.ipynb | 2 ++ 1 file changed, 2 insertions(+) diff --git a/examples/finetune_sonar_as_toxicity_classifier.ipynb b/examples/finetune_sonar_as_toxicity_classifier.ipynb index b2bc852..b7097b4 100644 --- a/examples/finetune_sonar_as_toxicity_classifier.ipynb +++ b/examples/finetune_sonar_as_toxicity_classifier.ipynb @@ -24,6 +24,8 @@ "\n", "Note: this notebook trains the models using 8 A100 GPUs. If you have other hardware configuration, consider adjusting the training parameters accordingly (e.g. reducing the batch size if it does not fit into GPU memory, or even downsampling the dataset). Please note that some kind of hardware accelerator is indispensable for running this training within reasonable time. If you do not have any, consider subscribing e.g. to Google Colab. 
\n", "\n", + "The notebook does not depend on the [SONAR](https://github.com/facebookresearch/SONAR) codebase; instead, it depends only on HF [transformers](https://github.com/huggingface/transformers), HF [datasets](https://github.com/huggingface/datasets), and a bit of custom code.\n", + "\n", "**WARNING**: this notebook might contain examples of offensive or otherwise distubing texts in various languages. \n", "\n" ]