This project replicates and discusses the findings from the paper "Exploring the Generalisability of Fake News Detection Models" by Nathaniel Hoy and Theodora Koulouri (2022). It evaluates the generalization of six traditional machine learning models across different preprocessing techniques and datasets.
The project is assessed on the following criteria:
- Adherence to guidelines and report structure, quality of writing: 5 points
- Relevance of data analysis: 3 points
- Relevance of state-of-the-art analysis: 3 points
- Relevance of the proposed model: 3 points
- Implementation of the model: 3 points
- Analysis of results: 3 points
The goal is to evaluate how well six machine learning models (listed below) generalize to unseen data. We compare five preprocessing methods: Bag-of-Words (BoW), TF-IDF, Word2Vec, BERT, and Linguistic Cues (LC).
- ISOT Fake News Dataset: A benchmark dataset with 44,898 articles, including 23,481 fake and 21,417 real news articles. It is used for training the models.
- Fake or Real News (FoR) Dataset: An external dataset of 6,296 articles, used to test how well the models generalize (a loading sketch follows below).
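Something like the following could load the two corpora. The file names (`True.csv`/`Fake.csv` for ISOT, a single CSV with a textual `label` column for FoR) are assumptions based on the common distributions of these datasets; adjust paths and column names to the files actually shipped with this repo.

```python
import pandas as pd

# Assumed layout: ISOT ships as two CSVs (True.csv / Fake.csv); the FoR
# dataset is assumed to be one CSV with a textual 'label' column.
def load_isot(true_path="data/True.csv", fake_path="data/Fake.csv"):
    real = pd.read_csv(true_path)
    fake = pd.read_csv(fake_path)
    real["label"] = 0  # 0 = real
    fake["label"] = 1  # 1 = fake
    return pd.concat([real, fake], ignore_index=True)

def load_for(path="data/fake_or_real_news.csv"):
    df = pd.read_csv(path)
    df["label"] = (df["label"].str.upper() == "FAKE").astype(int)
    return df

isot = load_isot()
for_news = load_for()
print(len(isot), len(for_news))  # expected per the sizes above: 44898 and 6296
```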
- Logistic Regression
- Support Vector Machines (SVM)
- Random Forest
- Gradient Boosting
- AdaBoost
- Neural Network (NN)
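As a rough illustration, the six models could be instantiated with scikit-learn as below. The hyperparameters shown are placeholder defaults, not the settings used in the original paper or in this replication.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
)
from sklearn.neural_network import MLPClassifier

# Illustrative defaults only; tune to match the replication's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300),
}
```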
- BoW and TF-IDF: full text normalization.
- Word2Vec, BERT & Linguistic Cues (LC): lighter preprocessing suited to embedding-based models (see the sketch below).
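A minimal sketch of the two regimes, assuming scikit-learn vectorizers for BoW/TF-IDF; the exact normalization steps used in the replication may differ.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Full normalization for BoW/TF-IDF: lowercase and strip non-letters.
# (The actual pipeline may also include stemming; this is an approximation.)
def normalize(text: str) -> str:
    return re.sub(r"[^a-z\s]", " ", text.lower())

bow = CountVectorizer(preprocessor=normalize, stop_words="english", max_features=10000)
tfidf = TfidfVectorizer(preprocessor=normalize, stop_words="english", max_features=10000)

# Lighter preprocessing for embedding-based features: BERT tokenizers
# expect near-raw text, so only whitespace cleanup is applied here.
def light_clean(text: str) -> str:
    return " ".join(text.split())
```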
- Train and test the models on the ISOT and FoR datasets.
- Evaluate generalization through cross-dataset evaluation: train on one dataset, then test on the other (sketched below).
- Compare model performance using accuracy, precision, recall, F1-score, and AUC.
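A hedged sketch of the cross-dataset loop, reusing the placeholder `models`, `tfidf`, `isot`, and `for_news` names from the sketches above; the actual evaluation code in this repo may differ.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
)

# Fit on one corpus, score on the other. `text_col` is assumed to be the
# column holding the article body in both datasets.
def cross_evaluate(model, vectorizer, train_df, test_df, text_col="text"):
    X_train = vectorizer.fit_transform(train_df[text_col])
    X_test = vectorizer.transform(test_df[text_col])
    model.fit(X_train, train_df["label"])
    pred = model.predict(X_test)
    scores = {
        "accuracy": accuracy_score(test_df["label"], pred),
        "precision": precision_score(test_df["label"], pred),
        "recall": recall_score(test_df["label"], pred),
        "f1": f1_score(test_df["label"], pred),
    }
    # AUC needs a continuous score; fall back to decision_function for
    # models without predict_proba (e.g. LinearSVC).
    if hasattr(model, "predict_proba"):
        score = model.predict_proba(X_test)[:, 1]
    else:
        score = model.decision_function(X_test)
    scores["auc"] = roc_auc_score(test_df["label"], score)
    return scores

# Train on ISOT, test on FoR (and vice versa) to measure generalization:
# print(cross_evaluate(models["Logistic Regression"], tfidf, isot, for_news))
```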
Models trained on ISOT achieved near-perfect in-dataset accuracy but showed clear performance drops on the external FoR dataset (and vice versa), highlighting the challenge of generalization in fake news detection.
- Clone this repository:

  ```bash
  git clone https://github.com/marcderoo/fake-news-detection.git
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Chappuis Maxime & Deroo Marc