MuhammadHelmyOmar/SANAD_Text_Classification

SANAD Text Classification

SANAD is a Single-label Arabic News Articles Dataset for automatic text categorization.
This repository implements an NLP pipeline for classifying its articles.

Key Actions

  • Consolidated and organized data from multiple directories using pandas and os to streamline preprocessing and analysis workflows.
  • Cleaned and preprocessed a large Arabic text dataset using regex (re) and NLTK, including stopword removal, text normalization, and missing value handling.
  • Performed exploratory text analysis by computing features such as word count, character count, average characters per word, and stopword frequency.
  • Engineered statistical text features such as TF-IDF with scikit-learn to improve the input representation for downstream machine learning models.
  • Trained and evaluated several machine learning models using scikit-learn and Keras, including Logistic Regression (94% accuracy), Naive Bayes (92.4% accuracy), and Random Forest (89.5% accuracy).
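
The cleaning and exploratory steps listed above could look roughly like the sketch below. It is not the repository's code: the function names, the normalization rules, and the five-word stopword list are assumptions made for illustration (the project itself uses NLTK's Arabic stopword list), and only the standard-library `re` module is used so the example runs on its own.

```python
import re

# Arabic diacritics (U+064B to U+0652) plus the tatweel elongation mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is neither an Arabic letter nor whitespace (digits, punctuation, Latin).
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def normalize(text: str) -> str:
    """Strip diacritics, unify common letter variants, drop non-Arabic characters."""
    text = DIACRITICS.sub("", text)
    text = re.sub(r"[أإآ]", "ا", text)  # alef variants -> bare alef
    text = text.replace("ى", "ي")       # alef maqsura -> ya
    text = text.replace("ة", "ه")       # ta marbuta -> ha
    text = NON_ARABIC.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical mini stopword list, normalized the same way as the text
# so that lookups still match after letter unification.
STOPWORDS = {normalize(w) for w in ["في", "من", "على", "إلى", "عن"]}


def text_stats(text: str) -> dict:
    """Compute the descriptive features named in the list above."""
    words = text.split()
    word_count = len(words)
    char_count = sum(len(w) for w in words)
    return {
        "word_count": word_count,
        "char_count": char_count,
        "avg_chars_per_word": char_count / word_count if word_count else 0.0,
        "stopword_count": sum(w in STOPWORDS for w in words),
    }


def remove_stopwords(text: str) -> str:
    """Drop stopwords from an already normalized string."""
    return " ".join(w for w in text.split() if w not in STOPWORDS)


sample = "ذَهَبَ الولدُ إلى المدرسةِ في 2024!"
cleaned = normalize(sample)
stats = text_stats(cleaned)
content_only = remove_stopwords(cleaned)
```

Computing the statistics before removing stopwords keeps the stopword frequency meaningful; the stopword-free text is what would normally be passed on to feature extraction.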

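As a minimal sketch of the TF-IDF and Logistic Regression steps, the example below chains scikit-learn's `TfidfVectorizer` and `LogisticRegression` in a pipeline. The four-sentence corpus and the label names are made up for illustration and stand in for the SANAD articles; the hyperparameters are library defaults apart from `max_iter`, not the settings used in this repository, so the accuracies reported above will not be reproduced by it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); real SANAD articles are much longer.
texts = [
    "مباراة كرة القدم اليوم",
    "فاز الفريق في المباراة",
    "ارتفاع أسعار النفط",
    "البنك يرفع أسعار الفائدة",
]
# Illustrative label names, one per text above.
labels = ["sports", "sports", "finance", "finance"]

# TfidfVectorizer turns each text into a sparse TF-IDF vector;
# LogisticRegression is then fitted on those vectors.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Classify an unseen headline with the fitted pipeline.
predictions = model.predict(["أسعار النفط اليوم"])
```

Wrapping both steps in one pipeline means the vocabulary and IDF weights are learned only from the training texts and reused unchanged at prediction time. Swapping `LogisticRegression` for `MultinomialNB` or `RandomForestClassifier` would cover the other two scikit-learn models mentioned above.
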