"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "encoder.config.num_labels = 2\n",
+ "encoder.config.problem_type = \"single_label_classification\"\n",
+ "\n",
+ "model = SonarForSequenceClassification(encoder.config)\n",
+ "\n",
+ "model.sonar_encoder.load_state_dict(encoder.state_dict())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd75a385-6be7-4175-90d6-3b4fd0c5f9da",
+ "metadata": {},
+ "source": [
+ "Here is how our model is supposed to classify:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "7420543f-d913-4ef7-9701-1d4983c06940",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "tensor([[0.5013, 0.4987],\n",
+ " [0.5074, 0.4926],\n",
+ " [0.4992, 0.5008]])\n"
+ ]
+ }
+ ],
+ "source": [
+ "sample_texts = ['Fuck you bitch!', 'Do you like dogs?', 'One more neutral sentence.']\n",
+ "tokenizer.src_lang = \"eng_Latn\"\n",
+ "\n",
+ "with torch.inference_mode():\n",
+ " out = model(**tokenizer(sample_texts, padding=True, return_tensors='pt').to(model.device))\n",
+ " probabilities = torch.softmax(out.logits, dim=-1)\n",
+ "print(probabilities)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d68937af-76cd-4493-85c7-9ba407288836",
+ "metadata": {},
+ "source": [
+ "The outputs are a list of two-dimensional arrays, each of them containing probabilities of the labels. \n",
+ "\n",
+ "We have two-label classification as a problem, with the labels 0 (non-toxic) or 1 (toxic). Thus, the last column of the output (number 1, if we count from 0) contains the predicted probability of the texts being toxic. Which is by default close to 0.5 for all texts: before the training, the model doesn't know what to predict. "
+ ]
+ },
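+ {
+ "cell_type": "markdown",
+ "id": "f3a9c1e2-7b4d-4e8a-9c21-0d5f6a7b8c9d",
+ "metadata": {},
+ "source": [
+ "For example, a toxicity score and a hard prediction can be read off such an output as follows (a sketch; the `probabilities` tensor is the one printed above):\n",
+ "\n",
+ "```python\n",
+ "import torch\n",
+ "\n",
+ "# one row per text, columns = [non-toxic, toxic]\n",
+ "probabilities = torch.tensor([[0.5013, 0.4987],\n",
+ "                              [0.5074, 0.4926],\n",
+ "                              [0.4992, 0.5008]])\n",
+ "\n",
+ "toxicity_scores = probabilities[:, 1]       # probability of label 1 (toxic)\n",
+ "hard_labels = probabilities.argmax(dim=-1)  # 0 = non-toxic, 1 = toxic\n",
+ "```"
+ ]
+ },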
+ {
+ "cell_type": "markdown",
+ "id": "00193864-7dc8-4ad3-bf70-7fd4b80485dd",
+ "metadata": {},
+ "source": [
+ "# 3. Preparing the datasets for training"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b1830348-b36b-4c46-9105-7fb76095b710",
+ "metadata": {},
+ "source": [
+ "We want to use the Hugginface trainer, which expects the data to be pre-tokenized in advance. So we will do just that.\n",
+ "\n",
+ "We will use all `train` and `dev` data for training, and `test` data for testing. We don't really need a validation dataset here, because we plan to train the models just once, without much hyperparameter tuning, so overfitting to the `test` split is not a problem."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 251,
+ "id": "a6ae4b71-758b-46de-8ff9-6a3b80d6534f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DatasetDict({\n",
+ " train: Dataset({\n",
+ " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+ " num_rows: 2018049\n",
+ " })\n",
+ " test: Dataset({\n",
+ " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+ " num_rows: 74120\n",
+ " })\n",
+ "})"
+ ]
+ },
+ "execution_count": 251,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ds_mix = datasets.DatasetDict({\n",
+ " 'train': datasets.Dataset.from_pandas(data_pooled[~data_pooled['split'].isin({'test', 'devtest'})], preserve_index=False),\n",
+ " 'test': datasets.Dataset.from_pandas(data_pooled[data_pooled['split'].isin({'test', 'devtest'})], preserve_index=False),\n",
+ "})\n",
+ "ds_mix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8942ae7e-06cd-41bf-ae5b-48e45b6ab0e7",
+ "metadata": {},
+ "source": [
+ "The test split looks too big; we downsample it, taking at most 100 labels from each dataset-language-label combination. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 258,
+ "id": "38914623-f1db-4154-8c20-8c3d38f7b1f4",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DatasetDict({\n",
+ " train: Dataset({\n",
+ " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+ " num_rows: 2018049\n",
+ " })\n",
+ " test: Dataset({\n",
+ " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+ " num_rows: 74120\n",
+ " })\n",
+ " test_small: Dataset({\n",
+ " features: ['text', 'toxic', 'lang', 'split', 'dataset', 'lang_code'],\n",
+ " num_rows: 5101\n",
+ " })\n",
+ "})"
+ ]
+ },
+ "execution_count": 258,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test_small = data_pooled[\n",
+ " data_pooled['split'].isin({'test', 'devtest'})\n",
+ "].groupby(['dataset', 'lang_code', 'toxic']).apply(\n",
+ " lambda x: x.sample(min(len(x), 100), random_state=1)\n",
+ ")\n",
+ "ds_mix['test_small'] = datasets.Dataset.from_pandas(test_small, preserve_index=False)\n",
+ "ds_mix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "98c91eb2-77a5-45aa-83f2-1735706014da",
+ "metadata": {},
+ "source": [
+ "Now we can define the tokenization function (language-dependent!) and pre-tokenize the datasets.\n",
+ "\n",
+ "In addition, we will rename the target to \"labels\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 247,
+ "id": "d4ad18ba-d59f-460b-854f-92e8c7ae96c2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'input_ids': [[256047, 2867, 248203, 2], [256057, 163119, 248203, 2], [256147, 12700, 9272, 248203, 2]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1]]}\n"
+ ]
+ }
+ ],
+ "source": [
+ "def tokenize_function(examples, max_length=512):\n",
+ " \"\"\" Tokenize the examples and prepend the right language tag.\"\"\"\n",
+ " result = tokenizer(\n",
+ " text=list(examples[\"text\"]),\n",
+ " truncation=True,\n",
+ " max_length=max_length,\n",
+ " )\n",
+ " lang_ids = tokenizer.convert_tokens_to_ids(list(examples[\"lang_code\"]))\n",
+ " for input_ids, lang_id in zip(result[\"input_ids\"], lang_ids):\n",
+ " input_ids[0] = lang_id\n",
+ "\n",
+ " if 'toxic' in examples:\n",
+ " result['labels'] = examples['toxic']\n",
+ " return result\n",
+ "\n",
+ "print(tokenize_function({\"text\": [\"Hi!\", \"Salut!\", \"Привет!\"], \"lang_code\": [\"eng_Latn\", \"fra_Latn\", \"rus_Cyrl\"]}))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 259,
+ "id": "4cbcc19c-4da2-4629-a07c-3e35f68175f0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "9187b87d21ee4da18552ee5a9c07806a",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Map: 0%| | 0/2018049 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "e6963c6313594fd983429d4b0b97d472",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Map: 0%| | 0/74120 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "21c7f1396fdb442fab173663d4839c22",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Map: 0%| | 0/5101 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "ds_tokenized = ds_mix.map(tokenize_function, batched=True, batch_size=10_000, remove_columns=ds_mix['train'].column_names)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 263,
+ "id": "14cdb759-9972-47b7-a84e-dbdb71104e73",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f5ee2e1bd9a447f6a099c76e62fa6764",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Saving the dataset (0/2 shards): 0%| | 0/2018049 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "ea70ff44bc4341088d83593abd632e01",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Saving the dataset (0/1 shards): 0%| | 0/74120 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "3c51e5323b6f4ccf938a705fbd2e7aa3",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Saving the dataset (0/1 shards): 0%| | 0/5101 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# save the dataset for reproducibility in the future\n",
+ "ds_tokenized.save_to_disk(f\"{DATA_DIR}/tokenized_dataset\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6af1f82a-1690-46e3-81e9-423f6a45db2b",
+ "metadata": {},
+ "source": [
+ "# 4. Training a classifier with the frozen encoder"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "5966c784-3c93-44bf-913e-e40aa0499a2a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ds_tokenized = datasets.load_from_disk(f\"{DATA_DIR}/tokenized_dataset\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "62387d45-fa69-4515-ad4e-69bf23a1b52f",
+ "metadata": {},
+ "source": [
+ "To freeze the encoder, we will disable the gradient computation for all its parameters:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "9912078c-c633-49e1-b72f-295be01cde3b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for param in model.sonar_encoder.parameters():\n",
+ " param.requires_grad = False"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f90f8-66e6-4a60-8290-8611962c56a2",
+ "metadata": {},
+ "source": [
+ "Actually, it would be much faster to tokenize all the data in advance, and then to train only the head, feeding the embeddings directly to it. \n",
+ "\n",
+ "But I tried to make the code more generic, so I am treating a SONAR-based classifier as any other HF classifier. "
+ ]
+ },
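+ {
+ "cell_type": "markdown",
+ "id": "2b7d9e4f-1c3a-4f5b-8a6d-7e9f0a1b2c3d",
+ "metadata": {},
+ "source": [
+ "For illustration, here is a minimal sketch of that faster alternative: cache the sentence embeddings once, then train only a linear head on top of them. Random tensors stand in for the real SONAR embeddings and toxicity labels, to keep the snippet self-contained.\n",
+ "\n",
+ "```python\n",
+ "import torch\n",
+ "from torch import nn\n",
+ "\n",
+ "hidden, n_samples = 1024, 256\n",
+ "embeddings = torch.randn(n_samples, hidden)  # stand-in for precomputed SONAR embeddings\n",
+ "labels = torch.randint(0, 2, (n_samples,))   # stand-in for binary toxicity labels\n",
+ "\n",
+ "head = nn.Linear(hidden, 2)                  # the classification head\n",
+ "optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)\n",
+ "loss_fn = nn.CrossEntropyLoss()\n",
+ "\n",
+ "for _ in range(10):  # a few passes over the cached embeddings\n",
+ "    optimizer.zero_grad()\n",
+ "    loss = loss_fn(head(embeddings), labels)\n",
+ "    loss.backward()\n",
+ "    optimizer.step()\n",
+ "```"
+ ]
+ },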
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "a545b686-5e12-4086-8c38-322399dea483",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cac3d483-e4ad-4ba7-86f3-729abaf52cf0",
+ "metadata": {},
+ "source": [
+ "I will train the model for one epoch over the dataset (this should be sufficient to train the classifier head)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "5f842caf-41f0-4740-87ff-e6af1a6111d4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "training_args = TrainingArguments(\n",
+ " output_dir=MODEL1_DIR,\n",
+ " num_train_epochs=1,\n",
+ " logging_steps=500,\n",
+ " save_steps=10_000,\n",
+ " save_total_limit=50,\n",
+ " per_device_train_batch_size=64, # for full-model training, this should be reduced to 16 or something like that\n",
+ " warmup_steps=1_000, # Adjust the number of the warmup steps to take about 10%-20% of the total number of steps per epoch.\n",
+ " weight_decay=1e-3,\n",
+ " optim=\"adamw_torch\", # adafactor, if even with smaller batches you are out of memory \n",
+ " gradient_accumulation_steps=1,\n",
+ " run_name='sonar-tox-trained-v1',\n",
+ " learning_rate=1e-4,\n",
+ " report_to=\"wandb\", # turn this off if you don't have wandb\n",
+ " # max_steps=100_000,\n",
+ " bf16=True, # turn this off if your GPU doesn't support it\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "8df2b28d-24fe-4599-84a0-2a25a06972a5",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "steps per epoch 3941.501953125\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Here is how many steps per epoch would be made with my setting (8 gpus):\n",
+ "print(\"steps per epoch\", len(ds_tokenized['train']) / training_args.per_device_train_batch_size / 8)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "f79cdbf9-3b02-4da9-8b0c-6e24e8aa86e7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer = Trainer(\n",
+ " model=model,\n",
+ " args=training_args,\n",
+ " train_dataset=ds_tokenized[\"train\"],\n",
+ " data_collator=data_collator,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "6eae929a-4a3c-4ce5-a02e-812f8731cc77",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " \n",
+ "
\n",
+ " [3942/3942 2:19:11, Epoch 1/1]\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | Step | \n",
+ " Training Loss | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 500 | \n",
+ " 0.354900 | \n",
+ "
\n",
+ " \n",
+ " | 1000 | \n",
+ " 0.146400 | \n",
+ "
\n",
+ " \n",
+ " | 1500 | \n",
+ " 0.113300 | \n",
+ "
\n",
+ " \n",
+ " | 2000 | \n",
+ " 0.110100 | \n",
+ "
\n",
+ " \n",
+ " | 2500 | \n",
+ " 0.109200 | \n",
+ "
\n",
+ " \n",
+ " | 3000 | \n",
+ " 0.108100 | \n",
+ "
\n",
+ " \n",
+ " | 3500 | \n",
+ " 0.109500 | \n",
+ "
\n",
+ " \n",
+ "
"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n",
+ "Non-default generation parameters: {'max_length': 200}\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "TrainOutput(global_step=3942, training_loss=0.14545092686731037, metrics={'train_runtime': 8355.0605, 'train_samples_per_second': 241.536, 'train_steps_per_second': 0.472, 'total_flos': 3.004701338322985e+18, 'train_loss': 0.14545092686731037, 'epoch': 1.0})"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trainer.train()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a9e4ed6b-9315-420e-9c83-77aace8e7edc",
+ "metadata": {},
+ "source": [
+ "We can see that by the end of the epoch, the loss stagnates at the level about 0.1. Apparently, it cannot go much lower than that without unfreezing the model parameters. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "42540d92-2fcc-44a7-b0f3-dc99d6b566f6",
+ "metadata": {},
+ "source": [
+ "## Evaluating it"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "ef36fd6a-9e90-4c21-bb3e-1e1da71a61ae",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "device(type='cuda', index=0)"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.eval()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "id": "780ff8f6-691c-412b-9075-865ab996a5ab",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "preds1_small = trainer.predict(ds_tokenized['test_small'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "id": "092a6e2a-285c-4357-8c77-a802f9546dba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ " p1_bad = torch.softmax(torch.tensor(preds1_small.predictions[0]), dim=-1)[:, 1].cpu().numpy()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "id": "02e74f42-9e9f-4b65-8f67-4eb914abf490",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.8337834961877266\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(roc_auc_score(preds1_small.label_ids, p1_bad))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "id": "08805de6-7011-4d12-bdb6-003b0e81f0d8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "074f8f829b7f4fa5a08c0b9c7db509c4",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " 0%| | 0/5101 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "lang = [tokenizer.convert_ids_to_tokens(x['input_ids'][0]) for x in tqdm(ds_tokenized['test_small'])]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "id": "7c426de1-17f2-426b-b20d-ced21b1bba85",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "e89e7531981448ed8b70e91fe680389f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " 0%| | 0/5101 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "test_pred_df = pd.DataFrame({\n",
+ " 'label': preds1_small.label_ids,\n",
+ " 'pred': p1_bad,\n",
+ " 'lang': [tokenizer.convert_ids_to_tokens(x['input_ids'][0]) for x in tqdm(ds_tokenized['test_small'])],\n",
+ "})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "id": "1d0a7727-805f-4fcb-a018-e79cc437e057",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "lang\n",
+ "arb_Arab 0.887143\n",
+ "ben_Beng 0.820000\n",
+ "bul_Cyrl 0.820000\n",
+ "cat_Latn 0.911429\n",
+ "ces_Latn 0.880714\n",
+ "dan_Latn 0.858889\n",
+ "deu_Latn 0.806167\n",
+ "ell_Grek 0.842500\n",
+ "eng_Latn 0.676300\n",
+ "est_Latn 0.916552\n",
+ "fin_Latn 0.931034\n",
+ "fra_Latn 0.839628\n",
+ "heb_Hebr 0.858000\n",
+ "hin_Deva 0.780357\n",
+ "hun_Latn 0.884333\n",
+ "ind_Latn 0.740000\n",
+ "ita_Latn 0.801917\n",
+ "nld_Latn 0.703793\n",
+ "pes_Arab 0.911429\n",
+ "pol_Latn 0.874545\n",
+ "por_Latn 0.866679\n",
+ "rus_Cyrl 0.820039\n",
+ "slk_Latn 0.881333\n",
+ "spa_Latn 0.816425\n",
+ "swh_Latn 0.625667\n",
+ "tgl_Latn 0.688667\n",
+ "tur_Latn 0.885000\n",
+ "urd_Arab 0.845435\n",
+ "vie_Latn 0.838065\n",
+ "zho_Hans 0.732609\n",
+ "dtype: float64"
+ ]
+ },
+ "execution_count": 59,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test_pred_df.groupby('lang').apply(lambda x: roc_auc_score(x['label'], x['pred']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "68254f12-423b-4d55-95f8-c90a8937ee58",
+ "metadata": {},
+ "source": [
+ "# 5. Now unfeeze the model and train again"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "099b3672-829b-426c-bb0c-95b2c36d17e5",
+ "metadata": {},
+ "source": [
+ "For unfrozen fine-tuning, we start with the current model - the one that has its classification head well trained. \n",
+ "\n",
+ "This way, we avoid the problem that the loss propagated from the randomly initialized head distorts the parameters of the underlying sentence encoder. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 71,
+ "id": "6b5721e1-7cef-40ee-af66-4735c4229cf1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# clean up some memory\n",
+ "trainer = None\n",
+ "gc.collect()\n",
+ "torch.cuda.empty_cache()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "id": "4d4ecf91-b185-426b-87da-bbafc4f77aeb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for param in model.sonar_encoder.parameters():\n",
+ " param.requires_grad = True"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 75,
+ "id": "b5c313e1-6f65-4464-8cbe-bb4882dbb42c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "training_args = TrainingArguments(\n",
+ " output_dir=MODEL2_DIR,\n",
+ " num_train_epochs=2, # now with more parameters to train, we increase the number of epochs\n",
+ " logging_steps=500,\n",
+ " save_steps=5_000,\n",
+ " save_total_limit=50,\n",
+ " per_device_train_batch_size=32,\n",
+ " warmup_steps=1_000, # Adjust the number of the warmup steps to take about 10%-20% of the total number of steps per epoch.\n",
+ " weight_decay=1e-3,\n",
+ " optim=\"adafactor\", # adafactor, if even with smaller batches you are out of memory \n",
+ " gradient_accumulation_steps=1,\n",
+ " run_name='sonar-tox-trained-v2',\n",
+ " learning_rate=1e-4,\n",
+ " report_to=\"wandb\", # turn this off if you don't have wandb\n",
+ " # max_steps=100_000,\n",
+ " bf16=True, # turn this off if your GPU doesn't support it\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 76,
+ "id": "6b2c6f90-83ca-4397-8477-3586d4d66880",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "steps per epoch 7883.00390625\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Here is how many steps per epoch would be made with my setting (8 gpus):\n",
+ "print(\"steps per epoch\", len(ds_tokenized['train']) / training_args.per_device_train_batch_size / 8)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 77,
+ "id": "b5371975-9eee-484c-92e0-6bb8f75b9284",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer = Trainer(\n",
+ " model=model,\n",
+ " args=training_args,\n",
+ " train_dataset=ds_tokenized[\"train\"],\n",
+ " data_collator=data_collator,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "beff2afa-1d8a-4950-9b92-bfe6b907bfe1",
+ "metadata": {},
+ "source": [
+ "We can see that the loss starts at the 0.095 level - slightly "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 78,
+ "id": "25de2ef6-1926-4a00-94bf-57b49c3c39a7",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " \n",
+ "
\n",
+ " [15768/15768 12:13:52, Epoch 2/2]\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | Step | \n",
+ " Training Loss | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 500 | \n",
+ " 0.095400 | \n",
+ "
\n",
+ " \n",
+ " | 1000 | \n",
+ " 0.083500 | \n",
+ "
\n",
+ " \n",
+ " | 1500 | \n",
+ " 0.075400 | \n",
+ "
\n",
+ " \n",
+ " | 2000 | \n",
+ " 0.069800 | \n",
+ "
\n",
+ " \n",
+ " | 2500 | \n",
+ " 0.065700 | \n",
+ "
\n",
+ " \n",
+ " | 3000 | \n",
+ " 0.064500 | \n",
+ "
\n",
+ " \n",
+ " | 3500 | \n",
+ " 0.062000 | \n",
+ "
\n",
+ " \n",
+ " | 4000 | \n",
+ " 0.059500 | \n",
+ "
\n",
+ " \n",
+ " | 4500 | \n",
+ " 0.060500 | \n",
+ "
\n",
+ " \n",
+ " | 5000 | \n",
+ " 0.059000 | \n",
+ "
\n",
+ " \n",
+ " | 5500 | \n",
+ " 0.058600 | \n",
+ "
\n",
+ " \n",
+ " | 6000 | \n",
+ " 0.058700 | \n",
+ "
\n",
+ " \n",
+ " | 6500 | \n",
+ " 0.057000 | \n",
+ "
\n",
+ " \n",
+ " | 7000 | \n",
+ " 0.059200 | \n",
+ "
\n",
+ " \n",
+ " | 7500 | \n",
+ " 0.056600 | \n",
+ "
\n",
+ " \n",
+ " | 8000 | \n",
+ " 0.052500 | \n",
+ "
\n",
+ " \n",
+ " | 8500 | \n",
+ " 0.047500 | \n",
+ "
\n",
+ " \n",
+ " | 9000 | \n",
+ " 0.047600 | \n",
+ "
\n",
+ " \n",
+ " | 9500 | \n",
+ " 0.047300 | \n",
+ "
\n",
+ " \n",
+ " | 10000 | \n",
+ " 0.047200 | \n",
+ "
\n",
+ " \n",
+ " | 10500 | \n",
+ " 0.047300 | \n",
+ "
\n",
+ " \n",
+ " | 11000 | \n",
+ " 0.047100 | \n",
+ "
\n",
+ " \n",
+ " | 11500 | \n",
+ " 0.045700 | \n",
+ "
\n",
+ " \n",
+ " | 12000 | \n",
+ " 0.046100 | \n",
+ "
\n",
+ " \n",
+ " | 12500 | \n",
+ " 0.045700 | \n",
+ "
\n",
+ " \n",
+ " | 13000 | \n",
+ " 0.045600 | \n",
+ "
\n",
+ " \n",
+ " | 13500 | \n",
+ " 0.045600 | \n",
+ "
\n",
+ " \n",
+ " | 14000 | \n",
+ " 0.044500 | \n",
+ "
\n",
+ " \n",
+ " | 14500 | \n",
+ " 0.045800 | \n",
+ "
\n",
+ " \n",
+ " | 15000 | \n",
+ " 0.044400 | \n",
+ "
\n",
+ " \n",
+ " | 15500 | \n",
+ " 0.044600 | \n",
+ "
\n",
+ " \n",
+ "
"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IOPub message rate exceeded.\n",
+ "The Jupyter server will temporarily stop sending output\n",
+ "to the client in order to avoid crashing it.\n",
+ "To change this limit, set the config variable\n",
+ "`--ServerApp.iopub_msg_rate_limit`.\n",
+ "\n",
+ "Current values:\n",
+ "ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+ "ServerApp.rate_limit_window=3.0 (secs)\n",
+ "\n",
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n",
+ "Non-default generation parameters: {'max_length': 200}\n",
+ " warnings.warn(\n",
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
+ " warnings.warn(\n",
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/transformers/configuration_utils.py:397: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.\n",
+ "Non-default generation parameters: {'max_length': 200}\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "TrainOutput(global_step=15768, training_loss=0.05560727533140139, metrics={'train_runtime': 44037.4041, 'train_samples_per_second': 91.652, 'train_steps_per_second': 0.358, 'total_flos': 5.37354496011832e+18, 'train_loss': 0.05560727533140139, 'epoch': 2.0})"
+ ]
+ },
+ "execution_count": 78,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trainer.train()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6a2081ee-4ad7-4bb4-b506-3c9224392cda",
+ "metadata": {},
+ "source": [
+ "## Evaluating"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 79,
+ "id": "b55d26e8-a05b-4368-9cf1-890c95b7f8fd",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 79,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.eval()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 80,
+ "id": "aff27f8b-20c2-4467-9910-6f857f381f75",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/daviddale/.conda/envs/fs2v04/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:71: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "preds2_small = trainer.predict(ds_tokenized['test_small'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 81,
+ "id": "43889710-ff29-42b8-918d-7f377fdca926",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.8961500666222518\n"
+ ]
+ }
+ ],
+ "source": [
+ "with torch.inference_mode():\n",
+ " p2_bad = torch.softmax(torch.tensor(preds2_small.predictions[0]), dim=-1)[:, 1].cpu().numpy()\n",
+ "print(roc_auc_score(preds2_small.label_ids, p2_bad))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 83,
+ "id": "9a98f19f-d5d4-46a5-9fab-0570ef4b2dc5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "c1dc507386f041e4ad2a8638569e94df",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " 0%| | 0/5101 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "test_pred2_df = pd.DataFrame({\n",
+ " 'label': preds2_small.label_ids,\n",
+ " 'pred': p2_bad,\n",
+ " 'lang': [tokenizer.convert_ids_to_tokens(x['input_ids'][0]) for x in tqdm(ds_tokenized['test_small'])],\n",
+ "})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 84,
+ "id": "56d3dd7b-124d-4c62-a58a-46a7136c136a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "lang\n",
+ "arb_Arab 0.887143\n",
+ "ben_Beng 0.820000\n",
+ "bul_Cyrl 0.820000\n",
+ "cat_Latn 0.911429\n",
+ "ces_Latn 0.880714\n",
+ "dan_Latn 0.858889\n",
+ "deu_Latn 0.806167\n",
+ "ell_Grek 0.842500\n",
+ "eng_Latn 0.676300\n",
+ "est_Latn 0.916552\n",
+ "fin_Latn 0.931034\n",
+ "fra_Latn 0.839628\n",
+ "heb_Hebr 0.858000\n",
+ "hin_Deva 0.780357\n",
+ "hun_Latn 0.884333\n",
+ "ind_Latn 0.740000\n",
+ "ita_Latn 0.801917\n",
+ "nld_Latn 0.703793\n",
+ "pes_Arab 0.911429\n",
+ "pol_Latn 0.874545\n",
+ "por_Latn 0.866679\n",
+ "rus_Cyrl 0.820039\n",
+ "slk_Latn 0.881333\n",
+ "spa_Latn 0.816425\n",
+ "swh_Latn 0.625667\n",
+ "tgl_Latn 0.688667\n",
+ "tur_Latn 0.885000\n",
+ "urd_Arab 0.845435\n",
+ "vie_Latn 0.838065\n",
+ "zho_Hans 0.732609\n",
+ "dtype: float64"
+ ]
+ },
+ "execution_count": 84,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test_pred_df.groupby('lang').apply(lambda x: roc_auc_score(x['label'], x['pred']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f1f64a0-a5d6-4aaa-a40f-d22744e830ba",
+ "metadata": {},
+ "source": [
+ "How does it work with our toy sentences?\n",
+ "\n",
+ "We can see that the model is rather confidently predicting the first one as toxic, and the other two as neutral. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 87,
+ "id": "0148ec0e-ac74-49ff-874e-999610fc3106",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[[2.0675792e-04 9.9979323e-01]\n",
+ " [9.9986959e-01 1.3042087e-04]\n",
+ " [9.9996281e-01 3.7189438e-05]]\n"
+ ]
+ }
+ ],
+ "source": [
+ "sample_texts = ['Fuck you bitch!', 'Do you like dogs?', 'One more neutral sentence.']\n",
+ "tokenizer.src_lang = \"eng_Latn\"\n",
+ "\n",
+ "with torch.inference_mode():\n",
+ " out = model(**tokenizer(sample_texts, padding=True, return_tensors='pt').to(model.device))\n",
+ " probabilities = torch.softmax(out.logits, dim=-1).cpu().numpy()\n",
+ "print(probabilities)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f089f5b-8489-4ed2-998d-f9d2f8a5e634",
+ "metadata": {},
+ "source": [
+ "# Packaging"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bf80372c-e981-402c-bf41-9f3690589723",
+ "metadata": {},
+ "source": [
+ "Here we show how to load the model from a directory. \n",
+ "\n",
+ "It has been already saved there by the trainer, but we'll also save the tokenizer, just for convenience."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 89,
+ "id": "d989bbcb-39f7-4cca-9d53-5aa236634fcc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gc.collect()\n",
+ "torch.cuda.empty_cache()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 90,
+ "id": "6e52b390-0a64-4188-aa81-1047a0b960a3",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "('models/SONAR_toxicity_classifier_unfozen/checkpoint-15768/tokenizer_config.json',\n",
+ " 'models/SONAR_toxicity_classifier_unfozen/checkpoint-15768/special_tokens_map.json',\n",
+ " 'models/SONAR_toxicity_classifier_unfozen/checkpoint-15768/sentencepiece.bpe.model',\n",
+ " 'models/SONAR_toxicity_classifier_unfozen/checkpoint-15768/added_tokens.json',\n",
+ " 'models/SONAR_toxicity_classifier_unfozen/checkpoint-15768/tokenizer.json')"
+ ]
+ },
+ "execution_count": 90,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokenizer.save_pretrained(f\"{MODEL1_DIR}/checkpoint-3942\")\n",
+ "tokenizer.save_pretrained(f\"{MODEL2_DIR}/checkpoint-15768\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 91,
+ "id": "40bccba0-1922-4b45-9f8f-c4b818dc10c1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+ "To disable this warning, you can either:\n",
+ "\t- Avoid using `tokenizers` before the fork if possible\n",
+ "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "total 3.0G\n",
+ " 0 drwxrwxr-x 2 daviddale daviddale 0 Jul 28 14:26 .\n",
+ " 0 drwxrwxr-x 6 daviddale daviddale 0 Jul 28 11:14 ..\n",
+ "1.0K -rw-rw-r-- 1 daviddale daviddale 910 Jul 28 11:14 config.json\n",
+ "2.9G -rw-rw-r-- 1 daviddale daviddale 2.9G Jul 28 11:14 model.safetensors\n",
+ "5.4M -rw-rw-r-- 1 daviddale daviddale 5.4M Jul 28 11:14 optimizer.pt\n",
+ " 14K -rw-rw-r-- 1 daviddale daviddale 14K Jul 28 11:14 rng_state.pth\n",
+ "1.5K -rw-rw-r-- 1 daviddale daviddale 1.1K Jul 28 11:14 scheduler.pt\n",
+ "4.7M -rw-rw-r-- 1 daviddale daviddale 4.7M Jul 28 14:26 sentencepiece.bpe.model\n",
+ "4.5K -rw-rw-r-- 1 daviddale daviddale 4.2K Jul 28 14:26 special_tokens_map.json\n",
+ " 39K -rw-rw-r-- 1 daviddale daviddale 39K Jul 28 14:26 tokenizer_config.json\n",
+ " 31M -rw-rw-r-- 1 daviddale daviddale 31M Jul 28 14:26 tokenizer.json\n",
+ "6.0K -rw-rw-r-- 1 daviddale daviddale 6.0K Jul 28 11:14 trainer_state.json\n",
+ "5.5K -rw-rw-r-- 1 daviddale daviddale 5.3K Jul 28 11:14 training_args.bin\n"
+ ]
+ }
+ ],
+ "source": [
+ "!ls -alsh $MODEL2_DIR/checkpoint-15768"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a9f479ee-08fa-4217-bf45-23135b9cd631",
+ "metadata": {},
+ "source": [
+ "Now, let's load the new model and tokenizer!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 92,
+ "id": "d3f4d3d1-8c08-47de-8b5d-a7001b115313",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "new_model_dir = f\"{MODEL2_DIR}/checkpoint-15768\"\n",
+ "\n",
+ "model2 = SonarForSequenceClassification.from_pretrained(new_model_dir)\n",
+ "tokenizer2 = AutoTokenizer.from_pretrained(new_model_dir)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "40e86fe6-5c2f-488d-bab2-fed7ea7a113a",
+ "metadata": {},
+ "source": [
+ "Checking that the model gets exactly the same predictions as before:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 93,
+ "id": "32f805fb-9d8b-4e46-b4ae-c6965670d9fe",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[[2.0675754e-04 9.9979323e-01]\n",
+ " [9.9986959e-01 1.3042087e-04]\n",
+ " [9.9996281e-01 3.7189402e-05]]\n"
+ ]
+ }
+ ],
+ "source": [
+ "sample_texts = ['Fuck you bitch!', 'Do you like dogs?', 'One more neutral sentence.']\n",
+ "tokenizer2.src_lang = \"eng_Latn\"\n",
+ "\n",
+ "with torch.inference_mode():\n",
+ " out = model2(**tokenizer2(sample_texts, padding=True, return_tensors='pt').to(model2.device))\n",
+ " probabilities = torch.softmax(out.logits, dim=-1).cpu().numpy()\n",
+ "print(probabilities)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}