A research-oriented tool to extract static code metrics from trending open-source Python repositories.
Ideal for academic studies, benchmarking, or enriching datasets for training machine learning models.
Software Metric Extractor automates the following pipeline:
- Fetching trending Python repositories from GitHub.
- Cloning and storing them locally.
- Extracting static code metrics using Radon (e.g., Cyclomatic Complexity, Maintainability Index).
- Storing metrics into a MySQL database using SQLAlchemy.
This tool is perfect for researchers and developers studying software quality, complexity, and maintainability patterns in open-source projects.
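As a rough illustration of the final storage step, the sketch below writes per-file metrics to a table and reads back an average. It is not the project's code: the project uses SQLAlchemy with MySQL, whereas this example swaps in the standard-library `sqlite3` module with an in-memory database so that it runs without a server. The table name `file_metrics`, its columns, and the functions `save_metrics` and `average_complexity` are all hypothetical.

```python
import sqlite3

# Hypothetical schema; the real models live in core/db.py and may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metrics (
    id INTEGER PRIMARY KEY,
    repo TEXT NOT NULL,
    path TEXT NOT NULL,
    cyclomatic_complexity REAL,
    maintainability_index REAL,
    loc INTEGER,
    UNIQUE (repo, path)
)
"""


def save_metrics(conn: sqlite3.Connection, repo: str, rows: list[dict]) -> None:
    """Insert one row per analyzed file; re-running replaces existing rows."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO file_metrics "
            "(repo, path, cyclomatic_complexity, maintainability_index, loc) "
            "VALUES (?, ?, ?, ?, ?)",
            [(repo, r["path"], r["cc"], r["mi"], r["loc"]) for r in rows],
        )


def average_complexity(conn: sqlite3.Connection, repo: str) -> float:
    """Return the mean cyclomatic complexity across a repository's files."""
    (avg,) = conn.execute(
        "SELECT AVG(cyclomatic_complexity) FROM file_metrics WHERE repo = ?",
        (repo,),
    ).fetchone()
    return avg
```

The `UNIQUE (repo, path)` constraint combined with `INSERT OR REPLACE` is one way to keep repeated analysis runs from duplicating rows; an ORM-based version would need an equivalent upsert strategy.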
.
├── cli/
│ └── main.py # CLI entry point
├── core/
│ ├── analyze_metrics.py # Radon-based metrics extraction
│ ├── db.py # DB models & session manager
│ └── fetch_repos.py # GitHub scraping logic
├── docker-compose.yml # MySQL container setup
├── projects/ # Local clones of repositories
├── requirements.txt # Python dependencies
├── run.py # CLI launcher
└── .env # Environment variables
- Python 3.11+
- Docker & Docker Compose
- A valid GitHub Personal Access Token
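To show where the token is used, here is a minimal sketch of an authenticated request to GitHub's repository search endpoint, built with the standard library only. It is an assumption about how fetching could work, not the project's actual logic: GitHub has no official "trending" endpoint, so this example approximates it by sorting on star count. The function names `build_search_request` and `clone_urls` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/search/repositories"


def build_search_request(
    token: str, language: str = "Python", limit: int = 50
) -> urllib.request.Request:
    """Build an authenticated search request for the most-starred repositories."""
    query = urllib.parse.urlencode(
        {
            "q": f"language:{language}",
            "sort": "stars",
            "order": "desc",
            "per_page": min(limit, 100),  # the API caps a page at 100 results
        }
    )
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def clone_urls(response_body: str) -> list[str]:
    """Extract the clone URL of every repository in a search response."""
    return [item["clone_url"] for item in json.loads(response_body)["items"]]


# Usage (performs a network call, so it is left commented out):
# request = build_search_request(token="ghp_...")
# with urllib.request.urlopen(request) as response:
#     urls = clone_urls(response.read().decode("utf-8"))
```

Keeping request construction and response parsing in separate functions makes both testable without network access.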
git clone https://github.com/yourusername/software-metric-extractor.git
cd software-metric-extractor
# .env
# GitHub API Token (for higher rate limits)
GITHUB_TOKEN=ghp_...
# MySQL Database URL (used by SQLAlchemy)
DATABASE_URL=mysql+pymysql://metrics_user:your_database_password@localhost/software_metrics
MYSQL_ROOT_PASSWORD=your_root_password
MYSQL_DATABASE=software_metrics
MYSQL_USER=metrics_user
MYSQL_PASSWORD=your_database_password
# Optional defaults
REPO_LIMIT=100
REPO_LANGUAGE=Python
docker-compose up -d
This will spin up a MySQL 8.0 container with a database named software_metrics.
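The environment variables listed in `.env` could be read into a typed settings object along these lines. This is a sketch under assumptions: it expects the variables to be present in the process environment already (how the project loads `.env` is not stated here), and the names `Settings`, `load_settings`, and `database_name` are hypothetical. The defaults mirror the optional values shown in the `.env` example.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Settings:
    github_token: str
    database_url: str
    repo_limit: int = 100
    repo_language: str = "Python"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    missing = [key for key in ("GITHUB_TOKEN", "DATABASE_URL") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required variables: {', '.join(missing)}")
    return Settings(
        github_token=env["GITHUB_TOKEN"],
        database_url=env["DATABASE_URL"],
        repo_limit=int(env.get("REPO_LIMIT", "100")),
        repo_language=env.get("REPO_LANGUAGE", "Python"),
    )


def database_name(database_url: str) -> str:
    """Return the database name, i.e. the path component of the URL."""
    return urlsplit(database_url).path.lstrip("/")
```

Failing early on a missing `GITHUB_TOKEN` or `DATABASE_URL` gives a clearer error than a failed API call or database connection later in the run.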
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python run.py --help
python run.py fetch-repos --limit 50 --language Python
python run.py analyze
python run.py reset-db
Each file and project is analyzed for:
- Cyclomatic Complexity
- Maintainability Index
- Lines of Code (LOC)
- Number of Functions
- Comment Lines
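To make these metrics concrete, the sketch below computes rough file-level versions of several of them using only the standard library (`ast` and `tokenize`). It is deliberately not Radon: the project relies on Radon for the real values, and Radon reports complexity per function or class, while this stand-in counts decision points across the whole file. Maintainability Index is omitted because it depends on additional measures. The function name `file_metrics` and the returned keys are hypothetical.

```python
import ast
import io
import tokenize

# Node types that each add one decision point to the complexity count.
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.AsyncFor,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
    ast.comprehension,
)


def file_metrics(source: str) -> dict[str, int]:
    """Return approximate static metrics for one Python source string."""
    nodes = list(ast.walk(ast.parse(source)))

    functions = sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) for node in nodes
    )

    branches = sum(isinstance(node, _BRANCH_NODES) for node in nodes)
    # Every extra operand of an and/or chain adds one more path.
    branches += sum(
        len(node.values) - 1 for node in nodes if isinstance(node, ast.BoolOp)
    )

    # A line holds at most one comment token, so this counts commented lines.
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    comment_lines = sum(token.type == tokenize.COMMENT for token in tokens)

    return {
        "loc": len(source.splitlines()),
        "functions": functions,
        "comment_lines": comment_lines,
        "complexity": 1 + branches,
    }
```

For example, a five-line file containing one function with a single `if a and b` condition yields a complexity of 3: one base path, one for the `if`, and one for the `and`.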
This project was originally designed for a research study exploring the relationship between code structure and performance in large language models. The resulting dataset can be used for:
- Code complexity prediction
- Model training for software quality estimations
- Empirical software engineering research
- Python 🐍
- SQLAlchemy ORM
- MySQL 8
- Docker 🐳
- Radon (code analysis)
- GitHub API & GitPython
- Selenium (for trending repo scraping)
MIT License © 2025
Crafted with ❤️ for software engineering research.