SANAD (Single-label Arabic News Articles Dataset) is a corpus of Arabic news articles for automatic text categorization.
NLP pipeline:
- Consolidated and organized data from multiple directories using pandas and os to streamline preprocessing and analysis workflows.
- Cleaned and preprocessed a large Arabic text dataset using regex (re) and NLTK, including stopword removal, text normalization, and missing value handling.
- Performed exploratory text analysis by computing features such as word count, character count, average characters per word, and stopword frequency.
- Engineered statistical text features such as TF-IDF with scikit-learn to improve the input representation for downstream machine learning tasks.
- Trained and evaluated different machine learning models using scikit-learn and Keras, including Logistic Regression (94% accuracy), Naive Bayes (92.4% accuracy), and Random Forest (89.5% accuracy).
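
The cleaning and exploratory-feature steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`normalize`, `remove_stopwords`, `text_features`), the specific normalization rules (stripping diacritics and tatweel, unifying alef variants, dropping non-Arabic characters), and the tiny `STOPWORDS` set are all assumptions made for the example. The pipeline itself uses NLTK's Arabic stopword list, which is replaced here by a hard-coded subset so the snippet runs without downloads.

```python
import re

# Hypothetical, tiny stopword subset for illustration only; the pipeline
# described above relies on NLTK's Arabic stopword list instead.
STOPWORDS = {"في", "من", "على", "إلى", "عن", "أن", "هذا"}

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")       # digits, Latin, punctuation


def normalize(text):
    """Apply a few common (assumed) Arabic normalization rules."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # map variants to bare alef
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())             # collapse whitespace


# Stopwords must go through the same normalization as the text,
# otherwise forms such as "إلى" would no longer match after cleaning.
NORMALIZED_STOPWORDS = {normalize(word) for word in STOPWORDS}


def remove_stopwords(tokens):
    """Drop tokens that appear in the normalized stopword set."""
    return [t for t in tokens if t not in NORMALIZED_STOPWORDS]


def text_features(text):
    """Compute the exploratory features listed above for one document."""
    tokens = normalize(text).split()
    char_count = sum(len(t) for t in tokens)  # characters, excluding spaces
    return {
        "word_count": len(tokens),
        "char_count": char_count,
        "avg_chars_per_word": char_count / len(tokens) if tokens else 0.0,
        "stopword_count": sum(t in NORMALIZED_STOPWORDS for t in tokens),
    }


if __name__ == "__main__":
    sample = "ذهب الولد إلى المدرسة في الصباح."
    print(normalize(sample))
    print(remove_stopwords(normalize(sample).split()))
    print(text_features(sample))
```

In a pandas workflow, functions like these would typically be applied column-wise (for example with `Series.apply`) before the cleaned text is passed to a TF-IDF vectorizer.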