- 221,000+ GitHub repositories
- 3,000,000+ comments processed
- 105,000+ OSS repositories studied
- 8,000 repositories used for sentiment analysis
- Data collected using the official GitHub API at scale
- Positive communities do NOT make projects more popular
  - Correlation between sentiment and stars is effectively zero
- Using a popular programming language does NOT increase success
  - No meaningful relationship between language popularity and star count
- Comments are overwhelmingly positive
  - Average sentiment skews positive across OSS communities
This repository contains the full data collection and analysis pipeline used for the AP Research paper:
"Exploring the Impact of Community Sentiment and Programming Language Popularity on Open Source Software Success"
This project investigates what drives the success of open-source software (OSS) on GitHub.
Two primary factors are analyzed:
- Community sentiment (via GitHub comments)
- Programming language popularity
Success is defined using GitHub star count, a widely accepted proxy for repository popularity.
- Sentiment has no practical impact on success
- Programming language popularity does not drive success
- Comments are generally positive
- Conclusion: neither sentiment nor language popularity explains OSS success
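
A "no relationship" finding like the ones above is usually backed by a correlation coefficient. This README does not say which coefficient the paper used, so the following is only a minimal sketch that assumes Pearson's r; the function name `pearson` is illustrative and not taken from the repository.

```javascript
// Pearson correlation coefficient between two equal-length numeric arrays.
// ASSUMPTION: the paper's actual statistic is not stated in this README.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("pearson() needs two arrays of equal length >= 2");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  // The coefficient is undefined when either series is constant.
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}
```

With per-repository rows in hand, a call might look like `pearson(rows.map((r) => r.meanSentiment), rows.map((r) => r.stars))`, where `meanSentiment` and `stars` are hypothetical field names. A value near 0 would match the "effectively zero" result reported here.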
Average Stars per Programming Language on GitHub. Apr. 2025

Programming Language Popularity and GitHub Statistics Comparison. Apr. 2025

- 221,672 repositories collected
- Filtered to ~105,000 OSS repositories
- Criteria:
  - ≥ 200 stars
  - Topic-based filtering to remove non-OSS repositories (e.g. "awesome lists")
- 13,000 repositories sampled
- 8,055 repositories used
- ~3.3 million comments collected
- ~3.05 million human comments analyzed
- Uses GitHub Search API
- Avoids API limits via star-range partitioning
- Collects:
  - Name, description, stars, forks
  - Language, topics, metadata
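
Star-range partitioning can be sketched as follows. The Search API returns at most 1,000 results per query, so one way to stay under the cap is to bisect a star range until every sub-range matches no more than 1,000 repositories. This is a sketch under assumptions, not the code in `src/repos.js`: the names `starQuery`, `partitionStarRanges`, and `countFn` are hypothetical, and bisection is only one possible splitting rule.

```javascript
const RESULT_CAP = 1000; // the Search API returns at most 1000 results per query

// Build the search qualifier for one inclusive star range.
function starQuery(lo, hi) {
  return `stars:${lo}..${hi}`;
}

// Recursively bisect [lo, hi] until each range matches at most `cap` repositories.
// `countFn(lo, hi)` is a HYPOTHETICAL async helper that returns the match count,
// for example by reading `total_count` from a search response.
async function partitionStarRanges(countFn, lo, hi, cap = RESULT_CAP) {
  const total = await countFn(lo, hi);
  if (total === 0) return [];
  // A single star value cannot be split further, so it is returned as-is
  // even if it exceeds the cap.
  if (total <= cap || lo === hi) return [{ lo, hi, total }];
  const mid = Math.floor((lo + hi) / 2);
  const left = await partitionStarRanges(countFn, lo, mid, cap);
  const right = await partitionStarRanges(countFn, mid + 1, hi, cap);
  return [...left, ...right];
}
```

Each returned range can then be paged through with a query such as `starQuery(200, 299)`, which yields `stars:200..299`.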
- Removes non-OSS repositories via topic blacklist
- Filters to English repositories using Unicode regex detection
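
The two filters above could look roughly like this. The blacklist entries and the set of Unicode scripts are assumptions chosen for illustration; the project's real topic list and regular expression are not shown in this README.

```javascript
// HYPOTHETICAL blacklist; the topics actually used by the project are not listed here.
const TOPIC_BLACKLIST = new Set(["awesome", "awesome-list"]);

// ASSUMED heuristic: any character from these scripts marks the text as non-English.
const NON_LATIN =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}\p{Script=Cyrillic}\p{Script=Arabic}]/u;

// True when the text contains none of the scripts listed above.
function isLikelyEnglish(text) {
  return !NON_LATIN.test(text || "");
}

// Keep a repository only if it has no blacklisted topic and looks English.
function keepRepository(repo) {
  const topics = repo.topics || [];
  if (topics.some((t) => TOPIC_BLACKLIST.has(t.toLowerCase()))) return false;
  return isLikelyEnglish(repo.description);
}
```

A script-based check like this cannot tell English apart from other Latin-script languages such as French or Spanish, so it is a coarse filter.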
- Randomly samples repositories
- Collects up to 500 comments per repo
- Uses:
  - Full fetch for small repos
  - Random sampling for large repos
- Stores:
  - Comment text, author type, metadata
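
One way to implement the "full fetch for small repos, random sampling for large repos" rule is to decide up front which result pages to request. The sketch below assumes sampling happens at the page level with 100 comments per page; the real `src/comments.js` may sample differently, and `pagesToFetch` and `shuffled` are hypothetical names.

```javascript
const MAX_COMMENTS = 500; // per-repository cap stated above
const PER_PAGE = 100; // assumed page size for each request

// Fisher-Yates shuffle on a copy; `rng` returns a float in [0, 1).
function shuffled(items, rng = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Decide which 1-based pages to request for a repo with `totalComments` comments.
function pagesToFetch(totalComments, rng = Math.random) {
  const pageCount = Math.ceil(totalComments / PER_PAGE);
  const allPages = Array.from({ length: pageCount }, (_, i) => i + 1);
  // Small repo: fetch every page.
  if (totalComments <= MAX_COMMENTS) return allPages;
  // Large repo: pick a random subset of pages, returned in ascending order.
  const wanted = MAX_COMMENTS / PER_PAGE;
  return shuffled(allPages, rng)
    .slice(0, wanted)
    .sort((a, b) => a - b);
}
```

For a repository with 250 comments this returns pages 1 to 3. For a repository with 10,000 comments it returns five distinct random pages out of 100, and the sample can fall slightly short of 500 if the partial last page is drawn.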
- Tool: SentiStrength-SE
- Outputs:
  - Positive score: 1 to 5
  - Negative score: -1 to -5
- Optimized for software engineering text
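
Because each comment gets two scores, some rule is needed to turn a pair into a single number before averaging. This README does not say which rule the paper used. The sketch below assumes the simplest one, adding the two scores, purely for illustration; `netSentiment` and `meanNetSentiment` are hypothetical names.

```javascript
// Validate one (positive, negative) score pair and return the net score.
// ASSUMPTION: summing the two scores is illustrative, not the paper's stated method.
function netSentiment(positive, negative) {
  const okPos = Number.isInteger(positive) && positive >= 1 && positive <= 5;
  const okNeg = Number.isInteger(negative) && negative >= -5 && negative <= -1;
  if (!okPos || !okNeg) {
    throw new RangeError("expected positive in 1..5 and negative in -5..-1");
  }
  return positive + negative; // ranges from -4 to 4
}

// Mean net score over many [positive, negative] pairs; NaN for an empty list.
function meanNetSentiment(pairs) {
  if (pairs.length === 0) return NaN;
  const total = pairs.reduce((sum, [pos, neg]) => sum + netSentiment(pos, neg), 0);
  return total / pairs.length;
}
```

Under this rule a neutral comment scored (1, -1) nets to 0, and a repository mean above 0 would count as skewing positive.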
- Source: Stack Overflow Developer Survey 2024
- Popularity measured by % of developers using a language
- Handles GitHub API limits (1000-result cap) via range queries
- Efficient large-scale scraping (~221k repos in ~4 hours)
- Comment sampling strategy ensures statistical validity
- Processes millions of comments with batching
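
Batching, as mentioned in the last point, can be as simple as slicing the comment list into fixed-size chunks so that only one chunk is processed at a time. The generator below is a minimal sketch; the name `batches` and the batch size in the usage note are assumptions, not values from the repository.

```javascript
// Yield consecutive batches of at most `size` items from an array.
function* batches(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("batch size must be a positive integer");
  }
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}
```

A caller would loop with `for (const batch of batches(comments, 1000)) { ... }` and hand each batch to the sentiment step.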
- Sentiment analysis is lexicon-based (not context-aware)
- Topic filtering may exclude some valid OSS projects
- Language detection is heuristic (regex-based)
- Focus limited to GitHub (no GitLab/Bitbucket)
- Developers should not rely on sentiment optimization to grow projects
- Choosing a popular language does not guarantee success
- Other factors (not studied here) likely dominate:
  - Utility
  - Marketing
  - Network effects
  - Timing
To reproduce the data:
- Create a GitHub personal access token
- Add it to `.env`: `GITHUB_TOKEN=your_token_here`
- Run the scripts:
  - `node src/repos.js`
  - `node src/comments.js`
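
For orientation, a script that calls the GitHub REST API typically turns the token into request headers along these lines. How the project's scripts load `.env` is not shown in this README, so the sketch assumes `GITHUB_TOKEN` is already present in the environment; `githubHeaders` is a hypothetical helper name.

```javascript
// Build request headers for the GitHub REST API from an environment object.
// ASSUMPTION: GITHUB_TOKEN has already been loaded into the environment.
function githubHeaders(env = process.env) {
  const token = env.GITHUB_TOKEN;
  if (!token) {
    throw new Error("GITHUB_TOKEN is not set; add it to .env first");
  }
  return {
    Accept: "application/vnd.github+json",
    Authorization: `Bearer ${token}`,
    "X-GitHub-Api-Version": "2022-11-28",
  };
}
```

Failing fast on a missing token avoids running a long collection job unauthenticated, where the rate limits are much lower.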