A data science tool that analyzes Reddit posts to identify product improvement opportunities for Notion. Scrapes real user feedback via Reddit API, categorizes complaints, and designs A/B tests with statistical rigor.
Live monitoring of Notion complaints from Reddit
streamlit run dashboard.pyFeatures:
- Real-time filtering by date, subreddit, and category
- Complaint volume trends and category breakdowns
- Top posts by engagement (upvotes, comments)
- Time series visualizations with Plotly
- SQL-backed data with 266 highly relevant posts
Current Dataset:
- 266 Notion-relevant posts from the past 30 days (76.7% of 347 scraped)
- 81 generic/off-topic posts filtered out (23.3%)
- Scraper can be configured for larger datasets (different time ranges, more subreddits)
Key Findings:
- Mobile issues: 32.6% of complaints
- Performance issues: 11.8% of complaints
- Designed A/B tests with power analysis and sample size calculations
- Relational schema with posts, comments, categories, subreddits
- Indexed for fast queries
- Demonstrates JOIN operations and SQL proficiency
- 0.81 MB SQLite database
Create Database:
python analysis/create_database.py- Trend detection using linear regression
- Week-over-week growth rates
- Category-specific trend analysis
- Anomaly detection for complaint spikes
Run Analysis:
python analysis/time_series_analysis.py├── analysis/
│ ├── reddit_scraper.py # PRAW API scraper with relevance filtering
│ ├── create_database.py # SQL database creation
│ ├── time_series_analysis.py # Trend and seasonality analysis
│ ├── test_relevance_filter.py # Filtering analysis tool
│ ├── complaint_analysis.ipynb # Jupyter analysis notebook
│ └── statistical_analysis.py # A/B test utilities
├── data/
│ ├── reddit_posts_raw.json # 347 scraped posts
│ ├── reddit_posts_categorized.json # 323 relevant posts by category
│ ├── reddit_analysis_report.md # Formatted report
│ └── notion_complaints.db # SQLite database
├── dashboard.py # Streamlit dashboard
├── requirements.txt
├── run_scraper.sh
├── REDDIT_SETUP.md
└── DATA_SUMMARY.md
- Python 3.9+
- Reddit API credentials (free - see REDDIT_SETUP.md)
# Clone repository
git clone https://github.com/3yit/Notion_Reddit_Analyzer.git
cd Notion_Reddit_Analyzer
# Set up credentials
cp .env.example .env
# Edit .env with your Reddit API credentials
# Run scraper
./run_scraper.sh# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run scraper
cd analysis
python reddit_scraper.pyUses PRAW (Python Reddit API Wrapper) to collect and filter Reddit posts:
Filtering Logic:
- Only includes posts actually about Notion (not just mentioning it in a list)
- Posts from r/Notion and r/NotionSo are always included
- Posts from other subreddits must have "Notion" in the title OR mention it 3+ times
- Filters out generic productivity posts that briefly mention Notion alongside 10 other tools
Data collected:
- Full post content (word-for-word)
- Metadata (score, comments, date)
- Top 5 comments per post
- Clickable links to original posts
Subreddits searched: r/Notion, r/productivity, r/studytips, r/digitalplanner, r/NotionSo, r/PKM
| Category | Count | Percentage |
|---|---|---|
| Mobile | 113 | 32.6% |
| Onboarding | 46 | 13.3% |
| Performance | 41 | 11.8% |
| Pricing | 32 | 9.2% |
| Collaboration | 26 | 7.5% |
| Features | 23 | 6.6% |
For top complaints, designed experiments with:
- Hypothesis statements
- Sample size calculations (80% power, 5% significance)
- Success metrics
- Expected impact estimates
See complaint_analysis.ipynb for full analysis.
- Source: Reddit API (PRAW)
- Time Period: Last 30 days
- Sample Size: 347 posts
- Authentication: Official Reddit API
- Method: Two-stage filtering process
- Relevance Filter: Removes posts that only mention Notion in passing (e.g., "I use Notion, Todoist, and 10 other apps")
- Keyword Classification: Categorizes relevant posts into complaint types
- Categories: Performance, mobile, onboarding, pricing, collaboration, features
- Validation: Manual review of sample posts
A post is considered "about Notion" if it meets any of:
- Posted in r/Notion or r/NotionSo
- Has "Notion" in the title
- Mentions "Notion" 3+ times (indicates substantial discussion)
- Posts mentioning Notion once in a list with other tools are excluded
- Frequency Analysis: Distribution across categories
- Engagement Metrics: Upvotes and comments as importance proxies
- A/B Test Design: Two-proportion z-test
- Power: 0.80
- Significance: α = 0.05
- Effect sizes from industry benchmarks
- Keyword matching may miss nuanced complaints
- Reddit users may not represent all Notion users (skews toward power users)
- Self-selection bias (users with strong opinions more likely to post)
- Cannot establish causality without experimental data
- Strict relevance filtering excludes posts mentioning Notion only 1-2 times
- Dataset size (266 posts) limits statistical power for some analyses
After running the scraper:
- reddit_posts_raw.json - All 347 scraped posts (before filtering)
- reddit_posts_categorized.json - 266 Notion-relevant posts organized by type
- reddit_analysis_report.md - Formatted report with top 50 posts
- notion_complaints.db - SQLite database with relational schema
All posts include clickable links to original sources for verification.
- Python 3.9+
- Data Collection: PRAW (Reddit API)
- Database: SQLite with relational schema
- Analysis: pandas, scipy, numpy
- Visualization: Plotly, Streamlit
- Notebooks: Jupyter
- Analyze positive feedback alongside complaints to determine if issues affect all users or specific segments
- Real-time complaint tracking dashboard
- Automated weekly reports
- Integration with actual product analytics
- Multi-platform analysis (Twitter, Discord, support tickets)
- Predictive modeling for churn risk
- Causal inference analysis
- 100% real data from Reddit API
- Full source citations
- Reproducible analysis
- Official API with authentication
MIT
Note: Research project using publicly available Reddit data. No proprietary Notion data included.