This project aims to build a supervised learning model to classify news headlines as fake or real. A Natural Language Processing (NLP) pipeline is implemented to preprocess the data, apply vectorization techniques, and train classification models.
Data Loading and Exploration:
- Load the dataset containing two main columns:
  - `Headline`: The headline text.
  - `Veracity`: Binary label indicating whether the headline is fake (0) or real (1).
- Explore the distribution of the labels to understand class balance.
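A minimal sketch of the loading and exploration step. The actual file path is project-specific, so a small in-memory stand-in is used here to keep the example self-contained:

```python
import pandas as pd

# The project loads a CSV with Headline and Veracity columns;
# a tiny in-memory stand-in keeps this sketch runnable on its own.
df = pd.DataFrame({
    "Headline": ["Aliens land in Ohio", "Senate passes budget bill",
                 "Miracle cure found", "Local team wins championship"],
    "Veracity": [0, 1, 0, 1],  # 0 = fake, 1 = real
})

print(df.head())
# Class balance: proportion of fake (0) vs. real (1) headlines
print(df["Veracity"].value_counts(normalize=True))
```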
Text Preprocessing:
- Convert headlines to lowercase.
- Remove special characters and punctuation.
- (Optional) Remove stopwords and apply lemmatization.
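The cleaning steps above can be sketched as a single helper (the exact regex is an assumption; the project may clean slightly differently, and the optional stopword/lemmatization step via `nltk` is omitted here for brevity):

```python
import re

def preprocess(text: str) -> str:
    """Lowercase a headline and strip special characters/punctuation."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep letters, digits, spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(preprocess("BREAKING: Stocks Soar 5%!!!"))  # -> "breaking stocks soar 5"
```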
Text Vectorization:
- Use `CountVectorizer` to convert the headlines into a numerical representation based on word frequency.
- Experiment with bigrams (`ngram_range=(1, 2)`) to capture word relationships.
Training and Evaluating Base Models:
- Test multiple supervised classification models:
  - Naive Bayes
  - Logistic Regression
  - Random Forest
  - SVM
- Evaluate each model on the test set using metrics such as accuracy and classification reports.
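The comparison loop can be sketched as follows. The corpus here is a tiny synthetic stand-in, and `LinearSVC` is assumed for the SVM (a common choice for sparse text features); the real project runs this on the full vectorized dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Toy stand-in corpus; the real project uses the full headline dataset.
fake = ["miracle cure found", "aliens land in ohio", "shocking secret revealed",
        "celebrity scandal exposed", "unbelievable trick revealed"] * 4
real = ["senate passes budget bill", "local team wins championship",
        "markets close slightly higher", "city council approves plan",
        "weather service issues warning"] * 4
texts, labels = fake + real, [0] * len(fake) + [1] * len(real)

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": LinearSVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.4f}")
```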
Selecting the Best Model:
- The best-performing initial model was Logistic Regression, with an accuracy of ~92.92%.
Hyperparameter Tuning:
- Perform hyperparameter tuning on the winning model (Logistic Regression) using `GridSearchCV`.
- Explore combinations of `C`, `solver`, and `penalty`.
Evaluating the Optimized Model:
- Recalculate performance metrics and visualize the confusion matrix.
- The optimized model maintained similar performance to the default model.
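The confusion-matrix visualization can be sketched with `seaborn` as below. The labels and predictions here are placeholders; in the project they come from the tuned model on the test set:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Stand-in labels/predictions; the project uses the tuned model's test output.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Fake (0)", "Real (1)"],
            yticklabels=["Fake (0)", "Real (1)"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.savefig("confusion_matrix.png")
```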
- Language: Python 3
- Libraries:
  - `pandas`: For handling tabular data.
  - `numpy`: For mathematical operations.
  - `scikit-learn`: For preprocessing, vectorization, and model training.
  - `seaborn` and `matplotlib`: For metric visualization.
  - `nltk`: For text cleaning and tokenization.
Naive Bayes:
- A fast and efficient baseline model for sparse data.
Logistic Regression:
- A robust linear model with regularization.
- Initial winner with an accuracy of 92.92%.
Random Forest:
- Tree-based model.
- Competitive performance but did not surpass Logistic Regression.
SVM:
- Maximizes the margin around the decision boundary.
- Requires more training time.
Hyperparameter Grid:
- `C`: [0.01, 0.1, 1, 10, 100]
- `penalty`: ['l1', 'l2']
- `solver`: ['liblinear', 'lbfgs']
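A sketch of the grid search over these values on stand-in data. Note that `lbfgs` only supports the `l2` penalty, so the grid is split into two dicts to skip the invalid `lbfgs`+`l1` combination:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in data; the project runs this on the vectorized training set.
texts = ["miracle cure found", "senate passes bill", "aliens land today",
         "markets close higher", "shocking secret revealed",
         "council approves plan"] * 5
labels = [0, 1, 0, 1, 0, 1] * 5
X = CountVectorizer().fit_transform(texts)

# 'lbfgs' supports only l2, so the grid is split to avoid invalid combos
param_grid = [
    {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"],
     "solver": ["liblinear"]},
    {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l2"], "solver": ["lbfgs"]},
]
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
grid.fit(X, labels)
print(grid.best_params_, grid.best_score_)
```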
- Accuracy: Proportion of correct predictions.
- Classification Report: Detailed metrics (precision, recall, F1-score) for each class.
- Confusion Matrix: Visualization of true positives, true negatives, false positives, and false negatives.
- Winning Model: Logistic Regression (optimized).
- Performance:
- Accuracy: 92.92%
- Balanced precision and recall across classes.
- Observations:
- Hyperparameter tuning did not significantly improve the performance over the default model.
- The default model settings are near-optimal for this problem.
Improving Preprocessing:
- Implement TF-IDF instead of `CountVectorizer`.
- Experiment with advanced embeddings (Word2Vec, GloVe, or BERT).
Collect More Data:
- A larger dataset could help improve performance and generalization.
Try Advanced Models:
- Gradient boosting models like XGBoost or LightGBM.
- Transformer-based models (e.g., BERT, RoBERTa) for deeper semantic analysis.
If you have any questions or suggestions, feel free to reach out:
- Email: miguelchamizo10@gmail.com
- GitHub: Migueleven
- LinkedIn: Miguel Ángel Chamizo Sánchez