A Natural Language Processing and Machine Learning analysis that distinguishes between r/Gunners and r/ArsenalFC posts, revealing content patterns and community differences.
This project aims to develop a model that accurately predicts which Arsenal FC subreddit a post belongs to based on its content and metadata, while extracting meaningful insights about community behavior and linguistic patterns.
- Data Collection: Extracted posts from both subreddits using PRAW (Python Reddit API Wrapper)
- Text Preprocessing: Applied cleaning, tokenization, stopword removal, and lemmatization
- Exploratory Data Analysis: Examined class distribution, post lengths, posting trends, and vocabulary usage
- Model Development: Implemented and compared six classifiers with extensive hyperparameter optimization
- Evaluation: Measured performance through accuracy, precision, recall, F1 score, and feature importance analysis
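
The text preprocessing step above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the notebook's actual code: the `preprocess` function, the regular expressions, and the short `STOPWORDS` set are all assumptions made for this example. The project reports using NLTK, whose stopword list is far longer, and lemmatization is left out here because it needs NLTK's WordNet data.

```python
import re

# Tiny illustrative stopword list (assumption); NLTK's English list is much longer.
STOPWORDS = {
    "a", "an", "and", "the", "is", "are", "to", "of",
    "in", "for", "on", "it", "this", "that", "was", "be",
}

URL_RE = re.compile(r"https?://\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(text):
    """Clean, tokenize, and remove stopwords from a post (hypothetical helper)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = NON_LETTER_RE.sub(" ", text)  # drop digits, punctuation, emoji
    tokens = text.split()                # whitespace tokenization
    # Keep tokens that are not stopwords and are longer than one character.
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


print(preprocess("Match Thread: Arsenal vs Spurs is on! https://example.com/live"))
# ['match', 'thread', 'arsenal', 'vs', 'spurs']
```

The resulting token lists could then be passed to a vectorizer or to a lemmatizer as a further step.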
The dataset contains an equal number of posts (approximately 1,100) from each subreddit, so the classes are balanced and accuracy is not inflated by a majority class.
r/ArsenalFC tends toward shorter, more concise posts (50-100 characters), while r/Gunners features a more diverse length distribution with more substantial content (>200 characters).
Both communities showed minimal engagement from 2015-2023, followed by steady growth through 2023-2024, culminating in a significant activity surge in early 2025. r/ArsenalFC consistently maintains higher post volume.
r/Gunners emphasizes organized discourse ("thread", "discussion", "league") with formal structure, while r/ArsenalFC focuses on performance analysis ("season", "game", "team") and manager-related topics ("arteta").
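
A vocabulary comparison like the one described above could be approximated by counting the most frequent tokens per subreddit. The sketch below is only an illustration: `top_terms` is a hypothetical helper, and the token lists are made-up toy data, not posts from the collected dataset.

```python
from collections import Counter


def top_terms(token_lists, n=2):
    """Return the n most frequent tokens across a list of tokenized posts."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [term for term, _ in counts.most_common(n)]


# Toy, made-up tokenized titles (assumption, for illustration only).
gunners_posts = [
    ["match", "thread", "discussion"],
    ["daily", "discussion", "thread"],
    ["thread", "league"],
]
arsenalfc_posts = [
    ["season", "game"],
    ["game", "arteta"],
    ["game", "season", "team"],
]

print(top_terms(gunners_posts))    # ['thread', 'discussion']
print(top_terms(arsenalfc_posts))  # ['game', 'season']
```

Raw counts favor words that are common everywhere, so comparing relative frequencies between the two subreddits would likely give a sharper picture of what is distinctive to each.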
A total of 480 model configurations (80 per model type) were tested across six model types, with each configuration evaluated using 5-fold cross-validation:
| Model Type | Configurations Tested |
|---|---|
| Logistic Regression | 80 |
| Linear SVM | 80 |
| SVM (with kernels) | 80 |
| Random Forest | 80 |
| Extra Trees | 80 |
| XGBoost | 80 |
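
To make the counts above concrete, here is one way a grid of 80 configurations could arise for a single model. The parameter names are real `ExtraTreesClassifier` arguments in scikit-learn, but the specific values and the `expand_grid` helper are assumptions chosen so that the grid has 80 entries; the notebook's actual search space is not stated here.

```python
from itertools import product

# Hypothetical Extra Trees grid (assumption): 5 * 4 * 2 * 2 = 80 combinations.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
}


def expand_grid(grid):
    """Expand a dict of value lists into a list of configuration dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


configs = expand_grid(param_grid)
n_models = 6  # model types in the table above
n_folds = 5   # 5-fold cross-validation

print(len(configs))                       # 80 configurations per model
print(n_models * len(configs))            # 480 configurations in total
print(n_models * len(configs) * n_folds)  # 2400 cross-validation fits
```

With 5 folds per configuration, the full search implies 2,400 cross-validation fits, not counting any final refit of the best model.
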
The Extra Trees classifier achieved the best test performance, with 93.10% accuracy and a 93.08% F1 score. Complete test metrics:
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 80.71% | 80.75% | 80.71% | 80.71% |
| Linear SVM | 86.90% | 86.91% | 86.90% | 86.90% |
| SVM (with kernels) | 84.29% | 84.34% | 84.29% | 84.28% |
| Random Forest | 90.00% | 90.03% | 90.00% | 90.00% |
| Extra Trees | 93.10% | 93.38% | 93.10% | 93.08% |
| XGBoost | 89.05% | 89.41% | 89.05% | 89.02% |
- Confusion Matrix (ArsenalFC is the positive class):
  - True Negatives (Gunners correctly predicted): 187
  - False Positives (Gunners predicted as ArsenalFC): 23
  - False Negatives (ArsenalFC predicted as Gunners): 6
  - True Positives (ArsenalFC correctly predicted): 204
- Accuracy: 93.10% - The subreddit is correctly identified for 391 of the 420 test posts
- Precision: 93.38% (averaged over both classes) - 89.9% of ArsenalFC predictions and 96.9% of Gunners predictions are correct
- Recall: 93.10% (averaged over both classes) - The model captures 97.1% of ArsenalFC posts and 89.0% of Gunners posts
- F1 Score: 93.08% - Strong balance between precision and recall
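
The reported metrics can be re-derived from the four confusion matrix counts. The sketch below uses only those counts; the `prf` helper is a hypothetical name. Because each class has 210 test posts, a plain average over the two classes equals a support-weighted average.

```python
# Confusion matrix counts from the list above (ArsenalFC = positive class).
tn, fp, fn, tp = 187, 23, 6, 204


def prf(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


arsenalfc = prf(tp, fp, fn)
# For Gunners the roles swap: its hits are the TNs, its false alarms are the FNs.
gunners = prf(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)
# Average each metric over the two classes.
avg_precision, avg_recall, avg_f1 = [(a + g) / 2 for a, g in zip(arsenalfc, gunners)]

print(f"accuracy  {accuracy:.2%}")       # accuracy  93.10%
print(f"precision {avg_precision:.2%}")  # precision 93.38%
print(f"recall    {avg_recall:.2%}")     # recall    93.10%
print(f"f1        {avg_f1:.2%}")         # f1        93.08%
```

The per-class values (`arsenalfc` and `gunners`) reproduce the 89.9% / 96.9% precision and 97.1% / 89.0% recall figures quoted above.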
A representative tree from the Extra Trees ensemble illustrates key decision rules. The variables with the highest feature importance are listed below:

| Variable | Importance (%) |
|---|---|
| score_log | 23.212 |
| score | 13.568 |
| post_year | 4.288 |
| period_recent | 4.280 |
| has_question | 4.214 |
| period_current | 3.563 |
| past_month | 3.167 |
| period_early | 2.288 |
| period_mid | 1.317 |
| title_length | 1.197 |
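
Several of the variables in the table are metadata features derived from each post. The sketch below shows one plausible way to compute a few of them; it is an assumption, not the notebook's code. In particular, `engineer_features` is a hypothetical helper, `score_log` is assumed to be `log(1 + score)` with negative scores clipped to zero, and the `period_*` and `past_month` flags are omitted because their cutoffs are not stated here.

```python
import math
from datetime import datetime, timezone


def engineer_features(title, score, created_utc):
    """Derive metadata features from one post (hypothetical helper)."""
    year = datetime.fromtimestamp(created_utc, tz=timezone.utc).year
    return {
        "score": score,
        # Assumption: log(1 + score), clipping negative scores to zero.
        "score_log": math.log1p(max(score, 0)),
        "post_year": year,
        "has_question": int("?" in title),
        "title_length": len(title),
    }


# created_utc is a Unix timestamp, as returned by the Reddit API.
feats = engineer_features("Who starts against Spurs?", 99, 1736000000)
print(feats["post_year"], feats["has_question"], feats["title_length"])
# 2025 1 25
```

Since `score` and `score_log` carry the same underlying signal, their combined importance (roughly 37%) is probably best read as the weight of post score overall.
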
Additional model visualizations are available in the images/ directory.

- Language: Python
- Data Collection: PRAW (Python Reddit API Wrapper)
- Data Analysis: Pandas, NumPy
- NLP Processing: NLTK, Scikit-learn
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn, XGBoost
- subreddit_data_collection.ipynb: Reddit data harvesting scripts
- subreddit_classifier.ipynb: Main notebook containing preprocessing, EDA, and modeling
- data/: Collected dataset files
- images/: Generated visualizations
- arsenal_subreddit_classifier.pkl: Saved Extra Trees model
The Extra Trees classifier successfully distinguishes between the two Arsenal subreddits with 93.1% accuracy. Post engagement metrics (score_log, score) and temporal features (post_year, period_recent) emerged as the strongest predictors. The analysis reveals distinct community characteristics: r/Gunners favors structured discussion formats, while r/ArsenalFC tends toward performance-centric content. Both communities experienced a notable engagement surge in early 2025.
Future development opportunities include:
- Implementing real-time classification of new posts
- Investigating the causes behind the 2025 activity spike
- Enhancing the model with additional features such as:
  - Comment sentiment analysis
  - Emoji usage patterns
  - User interaction networks
