GitWatch directs potential contributors to projects on GitHub by predicting the future number of pushes and watches for a given repository. Repositories are ranked by predicted number of pushes, and the user can filter the results to show only certain languages and/or only repositories predicted to become popular.
1. Scrape data from the GitHub Archive. This process is very slow.
- Create MySQL tables in the GitWatch database with:
```sql
CREATE TABLE repo (id INT UNSIGNED, name TINYTEXT, private BOOL, created_at DATETIME, description TEXT, language TINYTEXT, watchers MEDIUMINT UNSIGNED);
CREATE TABLE event (id INT UNSIGNED, type TINYINT UNSIGNED, timestamp DATETIME);
CREATE INDEX id_index ON event (id);
```
(The index improves lookup performance a lot!)
- `runExtractor.sh` downloads the JSON that contains all the events for a given hour.
- `extractor.py` processes the JSON and records the relevant information in the GitWatch MySQL database (a rough sketch of this fetch-and-parse step follows this list).
- `extractor_csv.py` outputs to a file rather than SQL, for running remotely. The results are moved into SQL locally with `csv_to_{repo,event}_sql.py`. Use `sort <filename> | uniq` to deduplicate and reduce the size of the `repo.csv` file!
- `populateDB.py` populates the rest of the database info using the GitHub API. The rate limit is 5k requests/hour with authentication (a sketch of this step also follows).
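As a rough illustration of what `runExtractor.sh` and `extractor.py` do together, the sketch below fetches one hour of GH Archive events and pulls out fields like the ones GitWatch stores. The URL pattern and the event field names are assumptions based on the public GH Archive format, not taken from the project's scripts.

```python
# Hedged sketch: download one hour of GitHub Archive events and extract
# repo/event fields. URL pattern and field names are assumptions based on
# the public GH Archive format, not taken from the project's scripts.
import gzip
import json
import urllib.request

def fetch_hour(year, month, day, hour):
    # One gzipped file per hour; each line is a single JSON event.
    # Note: the hour component is not zero-padded in GH Archive URLs.
    url = f"https://data.gharchive.org/{year}-{month:02d}-{day:02d}-{hour}.json.gz"
    with urllib.request.urlopen(url) as resp:
        raw = gzip.decompress(resp.read())
    for line in raw.splitlines():
        yield json.loads(line)

for event in fetch_hour(2013, 6, 1, 15):
    repo = event.get("repo", {})
    print(event.get("type"), repo.get("name"), event.get("created_at"))
```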
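The kind of call `populateDB.py` makes might look like the following, using token authentication and GitHub's rate-limit headers; the token placeholder and the choice of fields are assumptions.

```python
# Hedged sketch: fetch repo metadata from the GitHub API with token auth
# (5,000 requests/hour) and back off when the rate limit is exhausted.
import json
import time
import urllib.request

TOKEN = "ghp_..."  # hypothetical placeholder for a personal access token

def get_repo(full_name):
    req = urllib.request.Request(
        f"https://api.github.com/repos/{full_name}",
        headers={"Authorization": f"token {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
        reset = int(resp.headers.get("X-RateLimit-Reset", "0"))
        data = json.loads(resp.read())
    if remaining == 0:
        # Sleep until GitHub's rate-limit window resets.
        time.sleep(max(0, reset - time.time()) + 1)
    return data

info = get_repo("octocat/Hello-World")
print(info["language"], info["watchers_count"], info["created_at"])
```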
2. Process the data
- `process_training.py` queries SQL and creates a CSV file for training, or for applying the model produced in step 3 (a sketch of this step follows).
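A hedged sketch of the query-to-CSV step: the driver (PyMySQL), the credentials, and the query are all assumptions, and the real script builds a much wider feature table.

```python
# Hedged sketch of the query-to-CSV step. The driver (PyMySQL), credentials,
# and the query itself are assumptions; process_training.py builds a much
# wider feature table (n(repos) x 60).
import csv
import pymysql

conn = pymysql.connect(host="localhost", user="gitwatch",
                       password="secret", database="GitWatch")
cur = conn.cursor()
# Example feature query: per-repo event counts joined with repo metadata.
cur.execute("""
    SELECT r.id, r.watchers, COUNT(e.id) AS n_events
    FROM repo r JOIN event e ON e.id = r.id
    GROUP BY r.id, r.watchers
""")
with open("training.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["repo_id", "watchers", "n_events"])
    writer.writerows(cur.fetchall())
conn.close()
```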
3. Training
Training was done with the IPython notebook in the `training` directory. Check it out!
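The notebook itself is not reproduced here, but as a minimal sketch the training step might look like this, assuming a scikit-learn regressor and a fully numeric feature CSV with the prediction target in the last column (both assumptions):

```python
# Hedged sketch of the training step: fit a regressor on the repo feature
# matrix and predict future activity. Model choice and column layout are
# assumptions; the real work lives in the training notebook.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

data = np.loadtxt("training.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]   # last column assumed to be future #pushes

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
predictions = model.predict(X)     # later written back to the repo table
```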
4. Populate the training results into the database. First create three new columns in the `repo` table for `pred1`, `pred2`, and `hot`.
- `process_training.py` creates a table of dimension n(repos) x 60 onto which training can be applied.
- `populateDB_withpred.py` fills in the values (a sketch of this write-back follows this list).
- `maskrepos.py` imposes quality constraints on June and July when applying to October.
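A hedged sketch of the write-back, assuming PyMySQL, hypothetical credentials, and a hypothetical `HOT_THRESHOLD` for the `hot` flag (only the column names come from this README):

```python
# Hedged sketch of what populateDB_withpred.py does: add the prediction
# columns once, then write per-repo predictions back into MySQL.
import pymysql

predictions = [(123, 42.0, 7.5)]  # hypothetical (repo_id, pred1, pred2) rows
HOT_THRESHOLD = 10.0              # hypothetical cutoff for the hot flag

conn = pymysql.connect(host="localhost", user="gitwatch",
                       password="secret", database="GitWatch")
cur = conn.cursor()
# One-time schema change for the three new columns.
cur.execute("ALTER TABLE repo ADD COLUMN pred1 FLOAT, "
            "ADD COLUMN pred2 FLOAT, ADD COLUMN hot BOOL")
for repo_id, p1, p2 in predictions:
    cur.execute("UPDATE repo SET pred1=%s, pred2=%s, hot=%s WHERE id=%s",
                (p1, p2, p1 > HOT_THRESHOLD, repo_id))
conn.commit()
conn.close()
```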
5. Run the web app
- `run.py` runs the web app on the local machine.
- `sudo supervisord -c simple.conf` runs the web app on AWS.
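For orientation, a minimal sketch of what `run.py` amounts to, assuming the app is Flask-based (the framework is an assumption; the README does not name it):

```python
# Hedged sketch of run.py, assuming a Flask app; the route and response
# are placeholders.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "GitWatch"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```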
This project started with a different mission: to assign a probability that a given repository stored on GitHub contains a bug. The algorithm would have used previous commit messages and NLP to assign that probability. The scripts for this process are in the directory `old_NLP_stuff`.
Potentially useful scripts that are no longer used in the baseline project are in `oldstuff`.