BoogeyMan24/ap-research


AP Research: OSS Success Analysis on GitHub

Key Insights at a Glance

  • 📊 221,000+ GitHub repositories
  • 💬 3,000,000+ comments processed
  • 105,000+ OSS repositories studied
  • 🧪 8,000 repositories used for sentiment analysis
  • ⚙️ Data collected using the official GitHub API at scale

Surprising Results

  • Positive communities do NOT make projects more popular

    • Correlation between sentiment and stars is effectively zero
  • Using a popular programming language does NOT increase success

    • No meaningful relationship between language popularity and star count
  • ⚖️ Comments are overwhelmingly positive

    • Average sentiment skews positive across OSS communities

This repository contains the full data collection and analysis pipeline used for the AP Research paper:

"Exploring the Impact of Community Sentiment and Programming Language Popularity on Open Source Software Success"


Overview

This project investigates what drives the success of open-source software (OSS) on GitHub.

Two primary factors are analyzed:

  1. Community sentiment (via GitHub comments)
  2. Programming language popularity

Success is defined using GitHub star count, a widely accepted proxy for repository popularity.


Key Findings

  • Sentiment has no practical impact on success

    • Weak negative correlations:
      • Positive sentiment vs stars: r ≈ -0.0557
      • Negative sentiment vs stars: r ≈ -0.03
    • Statistically significant but not meaningful in practice
  • Programming language popularity does not drive success

    • Correlation: r ≈ -0.00012
    • No statistically significant relationship

  • Comments are generally positive

    • Avg positive: ~1.39
    • Avg negative: ~-1.23
  • Conclusion

    • Neither sentiment nor language popularity explains OSS success
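
The r values above are correlation coefficients. The README does not name the coefficient, so the sketch below assumes Pearson's r; the function name `pearson` and its array inputs are illustrative, not taken from the project's code.

```javascript
// Pearson correlation coefficient between two equal-length numeric arrays.
// Assumption: the reported r values are Pearson's r (not confirmed by the README).
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("pearson: need two arrays of equal length >= 2");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  // The coefficient is undefined when either variable is constant.
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}
```

A value near zero, such as the ones reported here, means star count barely moves with the other variable, which is why a result can be statistically significant on a large sample and still not matter in practice.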

Other Graphs and Tables

Average Stars per Programming Language on GitHub. Apr. 2025

Programming Language Popularity and GitHub Statistics Comparison. Apr. 2025


Dataset

Repository Data

  • 221,672 repositories collected
  • Filtered to ~105,000 OSS repositories
  • Criteria:
    • ≥ 200 stars
    • Topic-based filtering to remove non-OSS (e.g. "awesome lists")

Comment Data

  • 13,000 repositories sampled
  • 8,055 repositories used
  • ~3.3 million comments collected
  • ~3.05 million human comments analyzed

Methodology

1. Repository Collection (src/repos.js)

  • Uses GitHub Search API
  • Avoids API limits via star-range partitioning
  • Collects:
    • Name, description, stars, forks
    • Language, topics, metadata
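
Star-range partitioning can be sketched as follows. This is a minimal illustration rather than the code in src/repos.js: `partitionByStars` and `starRangeQuery` are hypothetical names, and `countFn` stands in for whatever call returns the number of results for a star range.

```javascript
// The GitHub Search API returns at most 1000 results for any single query.
const SEARCH_RESULT_CAP = 1000;

// Build a search qualifier restricted to a star range, e.g. "stars:200..299".
function starRangeQuery(lo, hi) {
  return `stars:${lo}..${hi}`;
}

// Recursively halve [lo, hi] until every range holds at most SEARCH_RESULT_CAP
// repositories. countFn(lo, hi) is a hypothetical async helper that returns the
// number of search results for that star range.
async function partitionByStars(lo, hi, countFn) {
  const total = await countFn(lo, hi);
  // A single-star range cannot be split further, so it is returned as-is.
  if (total <= SEARCH_RESULT_CAP || lo === hi) {
    return [{ lo, hi, total }];
  }
  const mid = Math.floor((lo + hi) / 2);
  const left = await partitionByStars(lo, mid, countFn);
  const right = await partitionByStars(mid + 1, hi, countFn);
  return [...left, ...right];
}
```

Each returned range can then be paged through in full, because no single query exceeds the cap. Ranges near the 200-star floor are the densest, so they end up much narrower than ranges at the top.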

2. Filtering

  • Removes non-OSS repositories via topic blacklist
  • Filters to English repositories using Unicode regex detection
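
A rough sketch of the two filters, under stated assumptions: the blacklist entries, the repository shape (`stars`, `topics`, `description`), and the exact regex are all hypothetical, since the README does not list them.

```javascript
const MIN_STARS = 200;

// Hypothetical blacklist entries; the project's actual topic list is not shown.
const TOPIC_BLACKLIST = new Set(["awesome", "awesome-list"]);

// Matches any letter that is not in the Latin script (e.g. CJK, Cyrillic).
// Assumption: the project's Unicode regex detection works along these lines.
const NON_LATIN_LETTER = /[^\p{Script=Latin}\P{L}]/u;

// Heuristic only: other Latin-script languages (French, Spanish, ...) also pass.
function looksEnglish(text) {
  return !NON_LATIN_LETTER.test(text ?? "");
}

// Keep a repository only if it passes the star, topic, and language checks.
function keepRepository(repo) {
  if (repo.stars < MIN_STARS) return false;
  const topics = repo.topics ?? [];
  if (topics.some((t) => TOPIC_BLACKLIST.has(t.toLowerCase()))) return false;
  return looksEnglish(repo.description);
}
```

For example, `keepRepository({ stars: 500, topics: ["cli"], description: "Fast JSON parser" })` passes all three checks, while the same repository tagged `awesome-list` is dropped.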

3. Comment Collection (src/comments.js)

  • Randomly samples repositories
  • Collects up to 500 comments per repo
  • Uses:
    • Full fetch for small repos
    • Random sampling for large repos
  • Stores:
    • Comment text, author type, metadata
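
One way to realize the full-fetch versus random-sampling split is to decide which result pages to request. The sketch below assumes page-level sampling with 100 comments per page; the function name `pagesToFetch` and this strategy are illustrative and may differ from what src/comments.js does.

```javascript
const MAX_COMMENTS = 500; // per-repository cap stated above
const PER_PAGE = 100;     // maximum page size of the GitHub REST API

// Returns the 1-based page numbers to request for a repo with `total` comments.
// rng is injectable so the sampling can be made deterministic in tests.
function pagesToFetch(total, rng = Math.random) {
  const pageCount = Math.ceil(total / PER_PAGE);
  const pages = Array.from({ length: pageCount }, (_, i) => i + 1);
  if (total <= MAX_COMMENTS) return pages; // small repo: fetch everything

  // Large repo: partial Fisher-Yates shuffle, keeping the first `wanted` pages.
  const wanted = MAX_COMMENTS / PER_PAGE;
  for (let i = 0; i < wanted; i++) {
    const j = i + Math.floor(rng() * (pageCount - i));
    [pages[i], pages[j]] = [pages[j], pages[i]];
  }
  return pages.slice(0, wanted).sort((a, b) => a - b);
}
```

Sampling whole pages keeps the request count at five per large repository. The trade-off is that comments on the same page tend to be close in time, so this is cluster sampling rather than a simple random sample of individual comments.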

4. Sentiment Analysis

  • Tool: SentiStrength-SE
  • Outputs:
    • Positive score: 1 → 5
    • Negative score: -1 → -5
  • Optimized for software engineering text
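
Once each comment has a positive and a negative score, the averages can be computed as below. This is a sketch only: the `{ positive, negative }` record shape and the name `meanSentiment` are assumptions, and the code does not call SentiStrength-SE itself.

```javascript
// Averages already-computed sentiment scores.
// Each entry is a hypothetical { positive, negative } pair, with positive in
// 1..5 and negative in -5..-1, matching the score ranges listed above.
function meanSentiment(scores) {
  if (scores.length === 0) return null; // no comments, no average

  let positiveSum = 0;
  let negativeSum = 0;
  for (const { positive, negative } of scores) {
    if (positive < 1 || positive > 5 || negative > -1 || negative < -5) {
      throw new RangeError(`score out of range: ${positive}, ${negative}`);
    }
    positiveSum += positive;
    negativeSum += negative;
  }
  return {
    positive: positiveSum / scores.length,
    negative: negativeSum / scores.length,
  };
}
```

A score of 1 or -1 means no detected sentiment on that axis, so averages close to those values indicate mostly neutral text.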

5. Language Popularity

  • Source: Stack Overflow Developer Survey 2024
  • Popularity measured by % of developers using a language

Technical Highlights

  • Handles GitHub API limits (1000-result cap) via range queries
  • Efficient large-scale scraping (~221k repos in ~4 hours)
  • Random comment sampling (up to 500 per repository) keeps very active repositories from dominating the comment set
  • Processes millions of comments with batching
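
Batching can be as simple as a generator that yields fixed-size slices. The helper name `batches` and the batch size are illustrative; the README does not say how the project batches its comments.

```javascript
// Yields consecutive slices of `items`, each at most `size` elements long,
// so a large comment set can be processed one chunk at a time.
function* batches(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("batch size must be a positive integer");
  }
  for (let start = 0; start < items.length; start += size) {
    yield items.slice(start, start + size);
  }
}
```

Iterating with `for (const batch of batches(comments, 10000))` keeps only one slice in flight per step instead of transforming the whole array at once.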

Limitations

  • Sentiment analysis is lexicon-based (not context-aware)
  • Topic filtering may exclude some valid OSS projects
  • English-language detection is heuristic (Unicode regex-based)
  • Focus limited to GitHub (no GitLab/Bitbucket)

Implications

  • Developers should not rely on sentiment optimization to grow projects
  • Choosing a popular language does not guarantee success
  • Other factors (not studied here) likely dominate:
    • Utility
    • Marketing
    • Network effects
    • Timing

Reproducibility

To reproduce the data:

  1. Create a GitHub personal access token
  2. Add to .env: GITHUB_TOKEN=your_token_here
  3. Run scripts:
     node src/repos.js
     node src/comments.js
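
For orientation, the token from step 2 is typically turned into request headers along these lines. This is a sketch, not the project's code: `githubHeaders` is a hypothetical helper, and it assumes `GITHUB_TOKEN` is already present in the environment (for example via a dotenv-style loader, or `node --env-file=.env` on Node 20.6+).

```javascript
// Builds the headers for authenticated GitHub REST API requests.
// env defaults to process.env and can be overridden for testing.
function githubHeaders(env = process.env) {
  const token = env.GITHUB_TOKEN;
  if (!token) {
    throw new Error("GITHUB_TOKEN is not set; add it to .env first");
  }
  return {
    Accept: "application/vnd.github+json",
    Authorization: `Bearer ${token}`,
    "X-GitHub-Api-Version": "2022-11-28",
  };
}
```

Failing fast on a missing token is worthwhile here, because unauthenticated requests have much lower rate limits and a long collection run would stall early.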
