Skip to content

pheonix-19/bert-finetune-spam

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Fine-Tuning BERT for Spam Classification

Overview

This project demonstrates the fine-tuning of a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model for spam classification. The objective is to classify text messages as either "spam" or "ham" using a labeled dataset of SMS messages.

Key Features:

  • Data preprocessing including tokenization and padding
  • Fine-tuning BERT for text classification
  • Evaluation using precision, recall, and f1-score
  • Confusion matrix for performance visualization

Prerequisites

The following libraries are required:

  • transformers
  • torch
  • pandas
  • scikit-learn
  • numpy

You can install the dependencies using a package manager like pip.

Data Preprocessing

The dataset consists of SMS messages categorized as either "spam" or "ham." To prepare the data for BERT, the following steps are performed:

  1. Data loading: The dataset is loaded into a DataFrame for easier manipulation.
  2. Label encoding: The text labels ("spam" or "ham") are encoded as numerical values (0 for ham and 1 for spam).
  3. Tokenization: BERT requires the input text to be tokenized into subword units. This is done using BertTokenizer.
  4. Padding and Attention Masks: The tokenized text is padded to a fixed length, and attention masks are used to indicate which tokens should be attended to.

Model Fine-Tuning

The BERT model is obtained from the Huggingface model hub and is fine-tuned for binary classification. The final layer of BERT is modified for two output classes (spam and ham). The training involves:

  1. DataLoader: The preprocessed data is divided into training and test sets and loaded into batches.
  2. Training Loop: The model is trained for several epochs, with the optimizer updating the weights to minimize loss.
  3. Evaluation: The trained model is tested on unseen data to evaluate its performance.

Performance Evaluation

The performance is measured using metrics such as precision, recall, and f1-score. These metrics help to understand how well the model distinguishes between spam and ham messages.

The key evaluation metrics are:

  • Precision: The percentage of correctly predicted positive samples (spam) out of all positive predictions.
  • Recall: The percentage of actual positive samples (spam) that were correctly predicted.
  • F1-Score: The harmonic mean of precision and recall, providing a balanced metric.

Confusion Matrix

A confusion matrix is used to visualize the model’s performance by comparing the predicted classifications with the actual labels. It shows:

  • True Positives (TP): Correct spam classifications.
  • True Negatives (TN): Correct ham classifications.
  • False Positives (FP): Incorrect spam predictions (ham messages classified as spam).
  • False Negatives (FN): Incorrect ham predictions (spam messages classified as ham).

Conclusion

This project illustrates how to fine-tune a pre-trained BERT model for binary classification tasks like spam detection. With a few epochs of training, the model achieves high accuracy in identifying spam messages. The use of confusion matrices and evaluation metrics like precision and recall provides insight into the model’s effectiveness in distinguishing between classes.


This documentation outlines the process clearly without diving into the code, making it easy for readers to understand the workflow and results of the project. Let me know if you need further modifications! torch pandas scikit-learn numpy You can install the dependencies using a package manager like pip.

Data Preprocessing The dataset consists of SMS messages categorized as either "spam" or "ham." To prepare the data for BERT, the following steps are performed:

Data loading: The dataset is loaded into a DataFrame for easier manipulation. Label encoding: The text labels ("spam" or "ham") are encoded as numerical values (0 for ham and 1 for spam). Tokenization: BERT requires the input text to be tokenized into subword units. This is done using BertTokenizer. Padding and Attention Masks: The tokenized text is padded to a fixed length, and attention masks are used to indicate which tokens should be attended to. Model Fine-Tuning The BERT model is obtained from the Huggingface model hub and is fine-tuned for binary classification. The final layer of BERT is modified for two output classes (spam and ham). The training involves:

DataLoader: The preprocessed data is divided into training and test sets and loaded into batches. Training Loop: The model is trained for several epochs, with the optimizer updating the weights to minimize loss. Evaluation: The trained model is tested on unseen data to evaluate its performance. Performance Evaluation The performance is measured using metrics such as precision, recall, and f1-score. These metrics help to understand how well the model distinguishes between spam and ham messages.

The key evaluation metrics are:

Precision: The percentage of correctly predicted positive samples (spam) out of all positive predictions. Recall: The percentage of actual positive samples (spam) that were correctly predicted. F1-Score: The harmonic mean of precision and recall, providing a balanced metric. Confusion Matrix A confusion matrix is used to visualize the model’s performance by comparing the predicted classifications with the actual labels. It shows:

True Positives (TP): Correct spam classifications. True Negatives (TN): Correct ham classifications. False Positives (FP): Incorrect spam predictions (ham messages classified as spam). False Negatives (FN): Incorrect ham predictions (spam messages classified as ham). Conclusion This project illustrates how to fine-tune a pre-trained BERT model for binary classification tasks like spam detection. With a few epochs of training, the model achieves high accuracy in identifying spam messages. The use of confusion matrices and evaluation metrics like precision and recall provides insight into the model’s effectiveness in distinguishing between classes.

About

This project demonstrates the fine-tuning of a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model for spam classification. The objective is to classify text messages as either "spam" or "ham" using a labeled dataset of SMS messages.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors