This repository contains the code and resources for detecting offensive language in multilingual text, specifically focusing on Malayalam and Tamil. The project leverages the MuRIL (Multilingual Representations for Indian Languages) model and explores various fine-tuning strategies, including standard fine-tuning, LoRA (Low-Rank Adaptation), and QLoRA (Quantized LoRA), to achieve efficient and accurate classification.
The dataset consists of offensive language data for Malayalam and Tamil, split into training, development (validation), and testing sets.
- Malayalam: `mal_full_offensive_train.csv`, `mal_full_offensive_dev.csv`, `mal_full_offensive_test.csv`
- Tamil: `tamil_offensive_full_train.csv`, `tamil_offensive_full_dev.csv`, `tamil_offensive_full_test.csv`
- `0`: Not Offensive
- `1`: Offensive
The text data undergoes several preprocessing steps to ensure quality and consistency before being fed into the model:
- **Label Normalization**: Converts string labels to binary integers (`0` for Not Offensive, `1` for Offensive).
- **Unicode Normalization**: Applies NFC normalization to handle complex characters in Indian languages.
- **Character Repeat Normalization**: Reduces excessive character repetitions (e.g., "hellooo" -> "helloo").
- **Special Token Replacement**: Replaces URLs, user mentions, and numbers with special tokens (`<URL>`, `<USER>`, `<NUM>`).
- **Language Tagging**: Prepends a language-specific tag (`<ml>` for Malayalam, `<ta>` for Tamil) to each text sequence.
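Taken together, the steps above can be sketched as a single cleaning function. This is a minimal sketch: the regex patterns, their ordering, and the function name are assumptions based on the list above, not the notebook's exact implementation.

```python
import re
import unicodedata

def preprocess(text, lang_tag="<ml>"):
    """Clean one raw comment (illustrative patterns, not the notebook's exact code)."""
    # Unicode NFC normalization for complex Indic character sequences
    text = unicodedata.normalize("NFC", text)
    # Replace URLs, user mentions, and numbers with special tokens first,
    # so patterns like "www" are not mangled by the repeat squeeze below
    text = re.sub(r"https?://\S+|www\.\S+", "<URL>", text)
    text = re.sub(r"@\w+", "<USER>", text)
    text = re.sub(r"\d+", "<NUM>", text)
    # Squeeze runs of 3+ identical characters down to 2 ("hellooo" -> "helloo")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Prepend the language tag (<ml> for Malayalam, <ta> for Tamil)
    return f"{lang_tag} {text.strip()}"
```

For example, `preprocess("hellooo https://x.co @bob 123")` yields `"<ml> helloo <URL> <USER> <NUM>"`.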
The core model used is `google/muril-base-cased`. A custom classification head is added on top of the `[CLS]` token representation, consisting of a linear layer, ReLU activation, dropout, and a final linear layer for binary classification.
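A minimal sketch of such a head in PyTorch (the class name, 256-unit intermediate size, and dropout rate are illustrative assumptions; MuRIL's hidden size is 768):

```python
import torch
import torch.nn as nn

class MurilClassifier(nn.Module):
    """Encoder plus a small classification head on the [CLS] representation."""

    def __init__(self, encoder, hidden_size=768, mid_size=256, dropout=0.3):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(hidden_size, mid_size),  # linear layer
            nn.ReLU(),                          # ReLU activation
            nn.Dropout(dropout),                # dropout
            nn.Linear(mid_size, 1),             # final layer: one logit for binary classification
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(cls).squeeze(-1)      # shape: (batch,)
```

The single output logit pairs with `BCEWithLogitsLoss`; a two-logit head with cross-entropy would work equally well for binary classification.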
The notebook (`MuRIL Selective Fine Tuning.ipynb`) implements three distinct fine-tuning approaches:
**Selective Fine-Tuning**
- Freezes the lower layers of the MuRIL encoder.
- Unfreezes only the top layers (layers 9, 10, and 11) and the classification head.
- Uses `BCEWithLogitsLoss` (or `BCELoss` with Sigmoid) and the `AdamW` optimizer.
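The layer freezing amounts to a few lines of PyTorch. Here a small, locally constructed BERT-style encoder stands in for `google/muril-base-cased` (same 12-layer naming scheme, so the logic carries over unchanged):

```python
from transformers import BertConfig, BertModel

# Stand-in encoder; loading the real MuRIL checkpoint is handled identically.
encoder = BertModel(BertConfig(hidden_size=48, num_hidden_layers=12,
                               num_attention_heads=4, intermediate_size=96))

# Freeze every encoder parameter first ...
for param in encoder.parameters():
    param.requires_grad = False

# ... then unfreeze only the top layers 9, 10, and 11.
# (The classification head, created separately, stays trainable by default.)
for name, param in encoder.named_parameters():
    if any(name.startswith(f"encoder.layer.{i}.") for i in (9, 10, 11)):
        param.requires_grad = True
```

Only the unfrozen parameters then need to be passed to the optimizer, e.g. `AdamW(p for p in model.parameters() if p.requires_grad)`.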
**LoRA (Low-Rank Adaptation)**
- Implements Low-Rank Adaptation (LoRA) using the `peft` library.
- Targets the `query` and `value` attention modules.
- Significantly reduces the number of trainable parameters while maintaining performance.
**QLoRA (Quantized LoRA)**
- Loads the base MuRIL model in 4-bit precision using `bitsandbytes` (`nf4` quantization type, double quantization, and `bfloat16`/`float16` compute dtype).
- Applies LoRA adapters on top of the quantized model.
- Explores both selective target modules (`query`, `value`) and full-model target modules (`query`, `key`, `value`, `dense`) for maximum efficiency and performance on limited hardware.
To run the notebook, you will need the following libraries:
- `torch`
- `pandas`
- `numpy`
- `scikit-learn`
- `transformers`
- `peft`
- `accelerate`
- `bitsandbytes`
- `matplotlib`
You can install the required PEFT and quantization libraries using:

```bash
pip install transformers peft accelerate bitsandbytes
```

- Clone the repository.
- Ensure the dataset files are located in the `offensive_dataset/` directory (or update the paths in the notebook accordingly).
- Open `MuRIL Selective Fine Tuning.ipynb` in Jupyter Notebook, Google Colab, or VS Code.
- Run the cells sequentially to preprocess the data, train the models, and evaluate their performance.
- The best models are saved as `.pt` files (e.g., `best_muril_model.pt`, `LoRA_muril_model1.pt`, `qLoRA_muril_model1.pt`) during the training loop.
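Checkpointing follows the usual PyTorch `state_dict` pattern. In this sketch a small linear layer stands in for the full fine-tuned classifier, and the file is written to a temporary directory:

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 2)  # stand-in for the fine-tuned MuRIL classifier

# Inside the training loop: save whenever the validation metric improves.
ckpt_path = os.path.join(tempfile.mkdtemp(), "best_muril_model.pt")
torch.save(model.state_dict(), ckpt_path)

# Later: rebuild the same architecture and restore the weights for evaluation.
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(torch.load(ckpt_path))
```

Note that for the LoRA/QLoRA variants only the adapter weights change, so `peft`'s `save_pretrained` on the adapter is an alternative, more compact option.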
The models are evaluated using:
- Binary Accuracy
- Classification Report (Precision, Recall, F1-Score)
- Confusion Matrix
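All three metrics come directly from scikit-learn; the labels below are a toy example, not results from the notebook:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = [0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # toy predictions

acc = accuracy_score(y_true, y_pred)                     # binary accuracy
report = classification_report(                          # precision/recall/F1
    y_true, y_pred, target_names=["Not Offensive", "Offensive"]
)
cm = confusion_matrix(y_true, y_pred)                    # rows: true, cols: predicted
```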
Class weights are computed and applied to the loss function to handle class imbalance in the dataset.
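One common way to do this (a sketch; the notebook's exact weighting scheme may differ) is to derive a `pos_weight` for `BCEWithLogitsLoss` from scikit-learn's balanced class weights:

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0, 0, 0, 0, 1])  # toy imbalanced training labels

# "balanced" weights are inversely proportional to class frequency.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)

# BCEWithLogitsLoss takes a single positive-class weight relative to the negative class.
pos_weight = torch.tensor(weights[1] / weights[0])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

With four negatives for every positive above, the positive class is up-weighted 4x, so misclassified offensive examples contribute proportionally more to the loss.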
Contributors:
- Sajeev Senthil
- Hari Varthan
- Dennis Jerome Richard
- Joseph Binu George
