Notion Reddit Analyzer

A data science tool that analyzes Reddit posts to identify product improvement opportunities for Notion. Scrapes real user feedback via Reddit API, categorizes complaints, and designs A/B tests with statistical rigor.

Interactive Dashboard

Live monitoring of Notion complaints from Reddit

streamlit run dashboard.py

Features:

Real-time filtering by date, subreddit, and category
Complaint volume trends and category breakdowns
Top posts by engagement (upvotes, comments)
Time series visualizations with Plotly
SQL-backed data with 266 highly relevant posts

Overview

Current Dataset:

266 Notion-relevant posts from the past 30 days (76.7% of 347 scraped)
81 generic/off-topic posts filtered out (23.3%)
Scraper can be configured for larger datasets (different time ranges, more subreddits)

Key Findings:

Mobile issues: 32.6% of complaints
Performance issues: 11.8% of complaints
Designed A/B tests with power analysis and sample size calculations

Features

SQL Database

Relational schema with posts, comments, categories, subreddits
Indexed for fast queries
Demonstrates JOIN operations and SQL proficiency
0.81 MB SQLite database

Create Database:

python analysis/create_database.py

Time Series Analysis

Trend detection using linear regression
Week-over-week growth rates
Category-specific trend analysis
Anomaly detection for complaint spikes

Run Analysis:

python analysis/time_series_analysis.py

Project Structure

├── analysis/
│   ├── reddit_scraper.py            # PRAW API scraper with relevance filtering
│   ├── create_database.py           # SQL database creation
│   ├── time_series_analysis.py      # Trend and seasonality analysis
│   ├── test_relevance_filter.py     # Filtering analysis tool
│   ├── complaint_analysis.ipynb     # Jupyter analysis notebook
│   └── statistical_analysis.py      # A/B test utilities
├── data/
│   ├── reddit_posts_raw.json          # 347 scraped posts
│   ├── reddit_posts_categorized.json  # 323 relevant posts by category
│   ├── reddit_analysis_report.md      # Formatted report
│   └── notion_complaints.db           # SQLite database
├── dashboard.py                       # Streamlit dashboard
├── requirements.txt
├── run_scraper.sh
├── REDDIT_SETUP.md
└── DATA_SUMMARY.md

Setup

Prerequisites

Python 3.9+
Reddit API credentials (free - see REDDIT_SETUP.md)

Quick Start

# Clone repository
git clone https://github.com/3yit/Notion_Reddit_Analyzer.git
cd Notion_Reddit_Analyzer

# Set up credentials
cp .env.example .env
# Edit .env with your Reddit API credentials

# Run scraper
./run_scraper.sh

Manual Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run scraper
cd analysis
python reddit_scraper.py

Data Collection

Uses PRAW (Python Reddit API Wrapper) to collect and filter Reddit posts:

Filtering Logic:

Only includes posts actually about Notion (not just mentioning it in a list)
Posts from r/Notion and r/NotionSo are always included
Posts from other subreddits must have "Notion" in the title OR mention it 3+ times
Filters out generic productivity posts that briefly mention Notion alongside 10 other tools

Data collected:

Full post content (word-for-word)
Metadata (score, comments, date)
Top 5 comments per post
Clickable links to original posts

Subreddits searched: r/Notion, r/productivity, r/studytips, r/digitalplanner, r/NotionSo, r/PKM

Analysis Results

Complaint Distribution

Category	Count	Percentage
Mobile	113	32.6%
Onboarding	46	13.3%
Performance	41	11.8%
Pricing	32	9.2%
Collaboration	26	7.5%
Features	23	6.6%

A/B Test Designs

For top complaints, designed experiments with:

Hypothesis statements
Sample size calculations (80% power, 5% significance)
Success metrics
Expected impact estimates

See complaint_analysis.ipynb for full analysis.

Methodology

Data Collection

Source: Reddit API (PRAW)
Time Period: Last 30 days
Sample Size: 347 posts
Authentication: Official Reddit API

Categorization

Method: Two-stage filtering process
1. Relevance Filter: Removes posts that only mention Notion in passing (e.g., "I use Notion, Todoist, and 10 other apps")
2. Keyword Classification: Categorizes relevant posts into complaint types
Categories: Performance, mobile, onboarding, pricing, collaboration, features
Validation: Manual review of sample posts

Relevance Criteria

A post is considered "about Notion" if it meets any of:

Posted in r/Notion or r/NotionSo
Has "Notion" in the title
Mentions "Notion" 3+ times (indicates substantial discussion)
Posts mentioning Notion once in a list with other tools are excluded

Statistical Analysis

Frequency Analysis: Distribution across categories
Engagement Metrics: Upvotes and comments as importance proxies
A/B Test Design: Two-proportion z-test
- Power: 0.80
- Significance: α = 0.05
- Effect sizes from industry benchmarks

Limitations

Keyword matching may miss nuanced complaints
Reddit users may not represent all Notion users (skews toward power users)
Self-selection bias (users with strong opinions more likely to post)
Cannot establish causality without experimental data
Strict relevance filtering excludes posts mentioning Notion only 1-2 times
Dataset size (266 posts) limits statistical power for some analyses

Output Files

After running the scraper:

reddit_posts_raw.json - All 347 scraped posts (before filtering)
reddit_posts_categorized.json - 266 Notion-relevant posts organized by type
reddit_analysis_report.md - Formatted report with top 50 posts
notion_complaints.db - SQLite database with relational schema

All posts include clickable links to original sources for verification.

Technologies

Python 3.9+
Data Collection: PRAW (Reddit API)
Database: SQLite with relational schema
Analysis: pandas, scipy, numpy
Visualization: Plotly, Streamlit
Notebooks: Jupyter

Future Enhancements

Analyze positive feedback alongside complaints to determine if issues affect all users or specific segments
Real-time complaint tracking dashboard
Automated weekly reports
Integration with actual product analytics
Multi-platform analysis (Twitter, Discord, support tickets)
Predictive modeling for churn risk
Causal inference analysis

Data Quality

100% real data from Reddit API
Full source citations
Reproducible analysis
Official API with authentication

License

MIT

Note: Research project using publicly available Reddit data. No proprietary Notion data included.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Notion Reddit Analyzer

Interactive Dashboard

Overview

Features

SQL Database

Time Series Analysis

Project Structure

Setup

Prerequisites

Quick Start

Manual Setup

Data Collection

Analysis Results

Complaint Distribution

A/B Test Designs

Methodology

Data Collection

Categorization

Relevance Criteria

Statistical Analysis

Limitations

Output Files

Technologies

Future Enhancements

Data Quality

License

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
analysis		analysis
data		data
.env.example		.env.example
.gitignore		.gitignore
DATA_SUMMARY.md		DATA_SUMMARY.md
README.md		README.md
REDDIT_SETUP.md		REDDIT_SETUP.md
dashboard.py		dashboard.py
requirements.txt		requirements.txt
run_scraper.sh		run_scraper.sh

3yit/Notion_Reddit_Analyzer

Folders and files

Latest commit

History

Repository files navigation

Notion Reddit Analyzer

Interactive Dashboard

Overview

Features

SQL Database

Time Series Analysis

Project Structure

Setup

Prerequisites

Quick Start

Manual Setup

Data Collection

Analysis Results

Complaint Distribution

A/B Test Designs

Methodology

Data Collection

Categorization

Relevance Criteria

Statistical Analysis

Limitations

Output Files

Technologies

Future Enhancements

Data Quality

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages