AP Research: OSS Success Analysis on GitHub

Key Insights at a Glance

📊 221,000+ GitHub repositories
💬 3,000,000+ comments processed
📁 105,000+ OSS repositories studied
🧪 8,000+ repositories used for sentiment analysis
⚙️ Data collected using the official GitHub API at scale

Surprising Results

Positive communities do NOT make projects more popular
- Correlation between sentiment and stars is effectively zero
Using a popular programming language does NOT increase success
- No meaningful relationship between language popularity and star count
⚖️ Comments are overwhelmingly positive
- Average sentiment skews positive across OSS communities

This repository contains the full data collection and analysis pipeline used for the AP Research paper:

"Exploring the Impact of Community Sentiment and Programming Language Popularity on Open Source Software Success"

Overview

This project investigates what drives the success of open-source software (OSS) on GitHub.

Two primary factors are analyzed:

Community sentiment (via GitHub comments)
Programming language popularity

Success is measured by GitHub star count, a widely used proxy for repository popularity.

Key Findings

Sentiment has no practical impact on success
- Weak negative correlations:
  - Positive sentiment vs stars: r ≈ -0.0557
  - Negative sentiment vs stars: r ≈ -0.03
- Statistically significant but not meaningful in practice
Programming language popularity does not drive success
- Correlation: r ≈ -0.00012
- No statistically significant relationship
Comments are generally positive
- Average positive score: ~1.39 (scale 1 to 5)
- Average negative score: ~-1.23 (scale -1 to -5)
Conclusion
- Neither sentiment nor language popularity explains OSS success
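
The findings above rest on correlation coefficients that are statistically significant yet tiny. A minimal sketch of how a Pearson r and its t statistic could be computed illustrates why both can be true at this sample size. The function names `pearson`, `tStatistic`, and `varianceExplained` are illustrative and are not taken from this repository, and the sketch assumes the correlations are computed per repository.

```javascript
// Pearson correlation coefficient between two equal-length numeric arrays.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("pearson: need two arrays of equal length >= 2");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// t statistic for testing r against zero with n observations (n - 2 degrees of freedom).
function tStatistic(r, n) {
  return r * Math.sqrt((n - 2) / (1 - r * r));
}

// Share of variance in one variable explained by the other (r squared).
function varianceExplained(r) {
  return r * r;
}
```

If r ≈ -0.0557 is taken over the 8,055 repositories (an assumption about the unit of analysis), `tStatistic(-0.0557, 8055)` is about -5.0, far past conventional significance thresholds, while `varianceExplained(-0.0557)` is about 0.003, or roughly 0.3% of the variance in stars.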

Other Graphs and Tables

Average Stars per Programming Language on GitHub. Apr. 2025

Programming Language Popularity and GitHub Statistics Comparison. Apr. 2025

Dataset

Repository Data

221,672 repositories collected
Filtered to ~105,000 OSS repositories
Criteria:
- ≥ 200 stars
- Topic-based filtering to remove non-OSS repositories (e.g. "awesome lists")

Comment Data

13,000 repositories sampled
8,055 repositories used
~3.3 million comments collected
~3.05 million human comments analyzed

Methodology

1. Repository Collection (`src/repos.js`)

Uses GitHub Search API
Works around the Search API's 1,000-result cap per query by partitioning searches into star ranges
Collects:
- Name, description, stars, forks
- Language, topics, metadata
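
Star-range partitioning can be sketched as a recursive split: a range is halved until each piece returns no more results than the Search API will page through. This is a sketch under stated assumptions, not the code in `src/repos.js`. The names `partitionByStars`, `starQuery`, and `searchUrl` are hypothetical, and `countFn` is synchronous only to keep the example short. In practice the count would come from the `total_count` field of a search response and the function would be asynchronous.

```javascript
// The GitHub Search API returns at most 1,000 results per query.
const SEARCH_CAP = 1000;

// Build the search qualifier for an inclusive star range, e.g. "stars:200..299".
function starQuery(lo, hi) {
  return `stars:${lo}..${hi}`;
}

// Build a repository search URL for one page of one star range.
function searchUrl(lo, hi, page = 1) {
  const params = new URLSearchParams({
    q: starQuery(lo, hi),
    sort: "stars",
    order: "desc",
    per_page: "100",
    page: String(page),
  });
  return `https://api.github.com/search/repositories?${params}`;
}

// Split [lo, hi] until every range holds at most SEARCH_CAP results.
// countFn(lo, hi) is caller-supplied and returns the number of matching repositories.
function partitionByStars(lo, hi, countFn) {
  const total = countFn(lo, hi);
  if (total <= SEARCH_CAP || lo === hi) {
    return [{ lo, hi, total }];
  }
  const mid = Math.floor((lo + hi) / 2);
  return partitionByStars(lo, mid, countFn).concat(
    partitionByStars(mid + 1, hi, countFn)
  );
}
```

Each resulting range can then be paged through in full (10 pages of 100 results). A range that contains a single star value and still exceeds the cap cannot be split further by stars alone, so the sketch returns it as-is.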

2. Filtering

Removes non-OSS repositories via topic blacklist
Filters to English repositories using Unicode regex detection
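
A minimal sketch of how these two filters might look follows. The blacklist entries, the choice of scripts in the regex, and the decision to test the repository name and description are all assumptions made for illustration. The project's actual lists and fields are not shown in this README. `topics` and `stargazers_count` are fields of the GitHub API repository object.

```javascript
// Hypothetical blacklist; the project's real topic list is not reproduced here.
const TOPIC_BLACKLIST = new Set(["awesome", "awesome-list"]);

// Matches any character from a few non-Latin scripts (an illustrative selection).
const NON_LATIN =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}\p{Script=Cyrillic}\p{Script=Arabic}]/u;

// True when none of the repository's topics is blacklisted.
function isOssCandidate(repo) {
  const topics = repo.topics || [];
  return !topics.some((t) => TOPIC_BLACKLIST.has(t.toLowerCase()));
}

// Heuristic: treat a repository as English when its name and description
// contain no characters from the scripts listed above.
function looksEnglish(repo) {
  const text = `${repo.name || ""} ${repo.description || ""}`;
  return !NON_LATIN.test(text);
}

// Apply the star threshold, the topic blacklist, and the language heuristic.
function filterRepos(repos) {
  return repos.filter(
    (r) => r.stargazers_count >= 200 && isOssCandidate(r) && looksEnglish(r)
  );
}
```

A heuristic like this cannot distinguish English from other Latin-script languages, which is one reason language detection appears under Limitations.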

3. Comment Collection (`src/comments.js`)

Randomly samples repositories
Collects up to 500 comments per repo
Uses:
- Full fetch for small repos
- Random sampling for large repos
Stores:
- Comment text, author type, metadata
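
The full-fetch versus random-sampling rule can be sketched as below. This is not the code in `src/comments.js`. `pickCommentIndices` and `humanComments` are hypothetical names, and treating comments as addressable by index is an assumption made for the example. The `user.type` field, with values such as "User" and "Bot", comes from the GitHub API comment object.

```javascript
// Upper bound on comments collected per repository.
const MAX_COMMENTS = 500;

// Choose which comment indices to fetch for a repository with `total` comments.
// Small repos: every index. Large repos: `max` distinct random indices.
function pickCommentIndices(total, max = MAX_COMMENTS, rng = Math.random) {
  const indices = Array.from({ length: total }, (_, i) => i);
  if (total <= max) {
    return indices;
  }
  // Partial Fisher-Yates shuffle: only the first `max` positions are finalized.
  for (let i = 0; i < max; i++) {
    const j = i + Math.floor(rng() * (total - i));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  return indices.slice(0, max).sort((a, b) => a - b);
}

// Keep only comments written by human accounts.
function humanComments(comments) {
  return comments.filter((c) => c.user && c.user.type === "User");
}
```

Sampling without replacement gives every comment in a large repository the same chance of selection, and the author-type filter corresponds to the step that separates human comments from the rest of the collected set.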

4. Sentiment Analysis

Tool: SentiStrength-SE
Outputs:
- Positive score: 1 → 5
- Negative score: -1 → -5
Optimized for software engineering text
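
Because the tool emits one positive and one negative score per comment, some aggregation step is needed before comparing sentiment with stars. One possible aggregation is a per-repository average, sketched below. The record shape `{ repo, pos, neg }` and the function name `averageSentimentByRepo` are assumptions for illustration. Whether the study averages per repository in exactly this way is not stated in this README.

```javascript
// Average SentiStrength-SE style scores per repository.
// Each input record: { repo: "owner/name", pos: 1..5, neg: -1..-5 }.
function averageSentimentByRepo(scored) {
  const sums = new Map();
  for (const { repo, pos, neg } of scored) {
    const s = sums.get(repo) || { pos: 0, neg: 0, n: 0 };
    s.pos += pos;
    s.neg += neg;
    s.n += 1;
    sums.set(repo, s);
  }
  const result = {};
  for (const [repo, s] of sums) {
    result[repo] = {
      avgPos: s.pos / s.n,
      avgNeg: s.neg / s.n,
      comments: s.n,
    };
  }
  return result;
}
```

The resulting `avgPos` and `avgNeg` values stay on the tool's original scales, so they can be read directly against the reported averages.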

5. Language Popularity

Source: Stack Overflow Developer Survey 2024
Popularity measured by % of developers using a language

Technical Highlights

Works around the GitHub Search API's 1,000-result cap per query with star-range queries
Collects ~221k repositories in about 4 hours
Caps comment collection at 500 per repository, sampling randomly from larger repositories
Processes millions of comments in batches

Limitations

Sentiment analysis is lexicon-based (not context-aware)
Topic filtering may exclude some valid OSS projects
Language detection is heuristic (regex-based)
Focus limited to GitHub (no GitLab/Bitbucket)

Implications

Maintainers should not expect a more positive comment tone, on its own, to grow a project's star count
Choosing a popular language does not guarantee success
Other factors (not studied here) likely dominate:
- Utility
- Marketing
- Network effects
- Timing

Reproducibility

To reproduce the data:

Create a GitHub personal access token
Add to .env: GITHUB_TOKEN=your_token_here
Run scripts:

node src/repos.js
node src/comments.js
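
For reference, a token from the environment is typically turned into request headers along these lines. This is a sketch, not the scripts' actual code. `githubHeaders` is a hypothetical helper, and how the scripts load `.env` (for example through a package such as dotenv) is an assumption. The header names and values follow the GitHub REST API documentation.

```javascript
// Build request headers for the GitHub REST API from an environment object.
function githubHeaders(env = process.env) {
  const token = env.GITHUB_TOKEN;
  if (!token) {
    throw new Error(
      "GITHUB_TOKEN is not set; add it to .env or export it in the shell"
    );
  }
  return {
    Accept: "application/vnd.github+json",
    Authorization: `Bearer ${token}`,
    "X-GitHub-Api-Version": "2022-11-28",
  };
}
```

Failing fast on a missing token avoids silently falling back to unauthenticated requests, which are subject to much lower rate limits.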
