Lisanwud/SubredditClassifier

Arsenal Subreddit Classification

A Natural Language Processing and Machine Learning analysis that distinguishes between r/Gunners and r/ArsenalFC posts, revealing content patterns and community differences.

Problem Statement

This project aims to develop a model that accurately predicts which Arsenal FC subreddit a post belongs to based on its content and metadata, while extracting meaningful insights about community behavior and linguistic patterns.

Project Overview

  1. Data Collection: Extracted posts from both subreddits using PRAW (Python Reddit API Wrapper)
  2. Text Preprocessing: Applied cleaning, tokenization, stopword removal, and lemmatization
  3. Exploratory Data Analysis: Examined class distribution, post lengths, posting trends, and vocabulary usage
  4. Model Development: Implemented and compared six classifiers with extensive hyperparameter optimization
  5. Evaluation: Measured performance through accuracy, precision, recall, F1 score, and feature importance analysis
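
The text preprocessing step (step 2) can be sketched with the standard library alone. This is a simplified stand-in, not the project's code: the stopword set is a tiny hand-written sample rather than NLTK's list, `naive_lemma` is a crude plural-stripping rule standing in for real lemmatization, and the names `preprocess`, `naive_lemma`, and `STOPWORDS` are hypothetical.

```python
import re

# Assumption: a tiny illustrative stopword set; the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "are", "is", "in", "of", "the", "to", "this", "for", "on"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Clean, tokenize, drop stopwords, and normalize tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # cleaning: remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # cleaning: keep letters only
    tokens = text.split()                      # whitespace tokenization
    return [naive_lemma(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("Match Thread: Arsenal vs Spurs https://example.com is LIVE!")
print(tokens)  # ['match', 'thread', 'arsenal', 'vs', 'spur', 'live']
```

A real lemmatizer would handle irregular forms that this suffix rule misses; the sketch only shows the order of operations.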

Key Findings

Subreddit Activity

Figure: Class Balance

The dataset contains roughly 1,100 posts from each subreddit, so the two classes are balanced and neither dominates model training.

Content Complexity

Figure: Post Length Distribution

r/ArsenalFC posts tend to be short (50-100 characters), while r/Gunners posts vary more in length and include more long posts (over 200 characters).

Temporal Trends

Figure: Posts Over Time

Both communities showed minimal activity from 2015 to 2023, followed by steady growth through 2023-2024 and a sharp surge in early 2025. r/ArsenalFC consistently maintains the higher post volume.

Language Differences

Figure: Most Common Words

r/Gunners leans toward organized, thread-based discussion ("thread", "discussion", "league"), while r/ArsenalFC focuses on performance analysis ("season", "game", "team") and manager-related topics ("arteta").
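
A per-subreddit vocabulary comparison like the one above can be sketched with `collections.Counter`. The example titles are made up for illustration and are not taken from the dataset; the text is assumed to be already preprocessed and whitespace-separated.

```python
from collections import Counter

# Made-up, already-preprocessed example titles; not real data from the project.
posts = [
    ("Gunners", "match thread arsenal league discussion"),
    ("Gunners", "daily discussion thread"),
    ("ArsenalFC", "arteta team season game"),
    ("ArsenalFC", "great game from the team"),
]

# Build one word-frequency counter per subreddit.
counts = {}
for subreddit, text in posts:
    counts.setdefault(subreddit, Counter()).update(text.split())

# Show the two most frequent words in each community.
for subreddit, counter in counts.items():
    print(subreddit, counter.most_common(2))
```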

Model Comparison

A total of 480 model configurations were tested across six model types (80 per type), each evaluated with 5-fold cross-validation:

Model Type            Configurations Tested
Logistic Regression   80
Linear SVM            80
SVM (with kernels)    80
Random Forest         80
Extra Trees           80
XGBoost               80
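
The search budget can be reproduced by enumerating a parameter grid. The grid below is purely hypothetical: the source does not list the actual hyperparameters, nor whether the search was exhaustive or randomized. It only shows one way a grid yields 80 combinations, and how 5-fold cross-validation multiplies the number of model fits.

```python
from itertools import product

# Hypothetical grid for a tree ensemble: 5 * 4 * 4 = 80 combinations.
# The real search space is not given in the source.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10, 20],
}

# Expand the grid into one dict per configuration.
configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_models = 6  # six model types, 80 configurations each
n_folds = 5   # 5-fold cross-validation

total_configs = len(configs) * n_models
total_fits = total_configs * n_folds

print(len(configs), total_configs, total_fits)  # 80 480 2400
```

Under these assumptions the full comparison amounts to 2,400 individual model fits, which is worth knowing before rerunning the notebook.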

Model Performance

The Extra Trees classifier performed best, with 93.10% accuracy and a 93.08% F1 score on the test set. Test metrics for all models:

Model                 Accuracy   Precision   Recall   F1 Score
Logistic Regression   80.71%     80.75%      80.71%   80.71%
Linear SVM            86.90%     86.91%      86.90%   86.90%
SVM (with kernels)    84.29%     84.34%      84.29%   84.28%
Random Forest         90.00%     90.03%      90.00%   90.00%
Extra Trees           93.10%     93.38%      93.10%   93.08%
XGBoost               89.05%     89.41%      89.05%   89.02%

Best Model: Extra Trees

  • Confusion Matrix:
    • True Negatives (Gunners correctly predicted): 187
    • False Positives (Gunners as ArsenalFC): 23
    • False Negatives (ArsenalFC as Gunners): 6
    • True Positives (ArsenalFC correctly predicted): 204
  • Accuracy: 93.10% - The subreddit is correctly identified for 391 of the 420 test posts
  • Precision: 93.38% (class average) - 89.9% of ArsenalFC predictions and 96.9% of Gunners predictions are correct
  • Recall: 93.10% (class average) - 97.1% of ArsenalFC posts and 89.0% of Gunners posts are captured
  • F1 Score: 93.08% - Precision and recall are closely balanced
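
The headline metrics follow directly from the four confusion-matrix counts. The sketch below recomputes them in plain Python, treating r/ArsenalFC as the positive class. It assumes the reported precision, recall, and F1 are per-class values averaged across the two classes; the source does not name the averaging scheme, but this choice reproduces the reported figures, and with 210 test posts per class the macro and weighted averages coincide.

```python
# Confusion-matrix counts reported above (positive class = r/ArsenalFC).
tn, fp, fn, tp = 187, 23, 6, 204


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * p * r, p + r)


def macro(per_class):
    """Unweighted mean over classes."""
    return sum(per_class.values()) / len(per_class)


accuracy = safe_div(tp + tn, tp + tn + fp + fn)
precision = {"ArsenalFC": safe_div(tp, tp + fp), "Gunners": safe_div(tn, tn + fn)}
recall = {"ArsenalFC": safe_div(tp, tp + fn), "Gunners": safe_div(tn, tn + fp)}
f1_scores = {c: f1(precision[c], recall[c]) for c in precision}

print(f"accuracy  {accuracy:.2%}")          # 93.10%
print(f"precision {macro(precision):.2%}")  # 93.38%
print(f"recall    {macro(recall):.2%}")     # 93.10%
print(f"f1        {macro(f1_scores):.2%}")  # 93.08%
```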

Figure: Confusion Matrix - Extra Trees

Decision Tree Visualization

A representative tree from the Extra Trees ensemble illustrates key decision rules.

Figure: Extra Trees Decision Tree

Feature Importance (Extra Trees)

Variable         Importance (%)
score_log        23.212
score            13.568
post_year        4.288
period_recent    4.280
has_question     4.214
period_current   3.563
past_month       3.167
period_early     2.288
period_mid       1.317
title_length     1.197
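
Several of these metadata features can be derived from a single post's fields. The sketch below is an illustration under stated assumptions, not the notebook's code: the feature definitions are guessed from their names, `engineer_features` is a hypothetical helper, and `score_log` is assumed to be `log1p` of the score clipped at zero (Reddit scores can be negative). The `period_*` and `past_month` features are omitted because their boundaries are not given in the source.

```python
import math
from datetime import datetime, timezone


def engineer_features(title, score, created_utc):
    """Derive a few metadata features from one post (definitions are assumed)."""
    return {
        "score": score,
        # Assumption: log1p of the score, clipped at 0 to avoid log of negatives.
        "score_log": math.log1p(max(score, 0)),
        # Year of posting, taken from the Unix timestamp in UTC.
        "post_year": datetime.fromtimestamp(created_utc, tz=timezone.utc).year,
        # 1 if the title contains a question mark, else 0.
        "has_question": int("?" in title),
        # Title length in characters.
        "title_length": len(title),
    }


# Made-up example post; 1735689600 is 2025-01-01 00:00:00 UTC.
features = engineer_features("Who starts at left back?", score=41, created_utc=1735689600)
print(features)
```

The log transform compresses the heavy right tail typical of post scores, which is one plausible reason for keeping both `score` and `score_log` as features.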

Additional model visualizations: confusion matrices for Logistic Regression, Linear SVM, SVM, Random Forest, and XGBoost.

Technologies Used

  • Language: Python
  • Data Collection: PRAW (Python Reddit API Wrapper)
  • Data Analysis: Pandas, NumPy
  • NLP Processing: NLTK, Scikit-learn
  • Visualization: Matplotlib, Seaborn
  • Machine Learning: Scikit-learn, XGBoost

Repository Structure

  • subreddit_data_collection.ipynb: Reddit data harvesting scripts
  • subreddit_classifier.ipynb: Main notebook containing preprocessing, EDA, and modeling
  • data/: Collected dataset files
  • images/: Generated visualizations
  • arsenal_subreddit_classifier.pkl: Saved Extra Trees model

Conclusions

The Extra Trees classifier distinguishes between the two Arsenal subreddits with 93.1% test accuracy. Post engagement metrics (score_log and score, together about 36.8% of feature importance) and temporal features (post_year, period_recent) emerged as the strongest predictors. The analysis also reveals distinct community characteristics: r/Gunners favors structured discussion formats, while r/ArsenalFC tends toward performance-centric content. Both communities experienced a notable activity surge in early 2025.

Future Work

Future development opportunities include:

  • Implementing real-time classification of new posts
  • Investigating the causes behind the 2025 activity spike
  • Enhancing the model with additional features such as:
    • Comment sentiment analysis
    • Emoji usage patterns
    • User interaction networks
