Bruno C. da Silva edited this page Jun 14, 2019 · 33 revisions

Welcome to the code-style-mining wiki!

Motivation and Goals

Context:

Code style is a set of rules or guidelines used when writing the source code for a computer program. It is often claimed that following a particular programming style helps programmers read and understand source code that conforms to the style, and helps avoid introducing errors. For programmers, following conventional code styles, whether from internal or external guidelines, can make future development easier and the code more maintainable as more programmers work on the project over time.

Goal in this project:

Software development has become an increasingly collaborative process. Employees from one team may collaborate with each other, another team, or independent developers on the other side of the globe. As more and more files are written by more and more coders, the importance of program comprehension and code readability has also grown. Yet little is known about how developers around the world actually use different coding styles.

To address this gap, the goal of the project is to mine a large dataset of thousands of software repositories from GitHub to provide a broad view of developers’ coding style choices in various projects and programming languages. GitHub is a web-based hosting service for version control. Programmers around the world use it to track work and issues on a project so that small or large teams can collaborate more efficiently.

Furthermore, the project will also aim to analyze whether developers’ coding styles comply with well-known coding style sources, such as Google’s and Oracle’s style guidelines. By the end of the year, the project aims to be the first to mine developers’ coding style over an expansive dataset of public projects and showcase that data to enable future analysis work.

Implementation Details

Repository Analyzers

For this project we have two separate repository analyzers for Java and Python. While written in different languages, they share a similar internal architecture and accomplish the same result. The list below covers the details of our implementation, software features, and some distinctions between the two analyzers.

Common Features:

  • Individual repos are analyzed for errors
  • Errors are summarized and stored in JSON objects
  • These JSON objects are saved to a MongoDB cloud instance
  • Uses the GitHub API as an authenticated user to submit GET requests for repo files
  • Repos are selected from keyword searches and from the top 500 repos in each language
  • Semi-autonomous text file system for keeping record of already analyzed repos
  • Locally executed
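
The text-file record of already analyzed repos could work along the lines of the sketch below. This is an illustration only: the function names, the one-name-per-line file layout, and the `owner/repo` naming are assumptions, not the project's actual format.

```python
from pathlib import Path


def load_analyzed(record_path):
    """Return the set of repo names already listed in the record file."""
    path = Path(record_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_analyzed(record_path, repo_full_name):
    """Append one repo name (assumed 'owner/repo' form) to the record file."""
    with open(record_path, "a", encoding="utf-8") as handle:
        handle.write(repo_full_name + "\n")


def pending_repos(candidates, record_path):
    """Drop candidates that are already recorded, preserving their order."""
    done = load_analyzed(record_path)
    return [name for name in candidates if name not in done]
```

An analyzer run would call `pending_repos` before fetching anything and `mark_analyzed` after each repo finishes, so an interrupted run can resume without repeating work.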

Python:

  • Command line based application execution
  • Pymongo for making the MongoDB connection
  • Uses Python’s native AST library to find error occurrences
  • Library takes care of finding error occurrences automatically
  • Pipenv dependency installation
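
As a rough idea of how Python's `ast` module can locate style issues, the sketch below walks a parsed file and reports function and class names that break PEP 8's naming conventions. The rule set, the regular expressions, and the `find_naming_issues` helper are hypothetical; the analyzer's real checks may differ.

```python
import ast
import re

# Assumed rules: snake_case functions and CapWords classes, as PEP 8 recommends.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")
CAP_WORDS = re.compile(r"^_*[A-Z][A-Za-z0-9]*$")


def find_naming_issues(source):
    """Return sorted (line, kind, name) tuples for names that break the rules."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                issues.append((node.lineno, "function", node.name))
        elif isinstance(node, ast.ClassDef):
            if not CAP_WORDS.match(node.name):
                issues.append((node.lineno, "class", node.name))
    return sorted(issues)
```

Because the tree carries line numbers, each occurrence can be counted or reported per file without any manual text scanning.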

Java:

  • Maven based setup
  • MongoDB for Java database connector
  • Manually iterates through repo files during analysis - there is no AST library like the one used for Python
  • No formal error guide comparable to Python's PEP
  • Custom defined errors based on Google’s Java Style Guide.
  • Errors are summarized into repo summary JSON objects

Architecture

Python and Java Analyzers

The general flow of data through the software can be divided into three stages: querying the GitHub API for data, analyzing the fetched data and outputting the results in JSON format, and storing this JSON output in a database where it can later be used for visualization. The Python and Java Analyzers have the GitHub API and a MongoDB cloud instance as dependencies.

Fetching the Repo:

  • Use GitHub API to make GET requests in Python/Java code
  • Result is returned as a JSON object
  • Repo names are derived from querying GitHub’s API for a set of keywords
  • The top 500 repositories in each language are analyzed as well
  • Python analyzes multiple branches
  • Java analyzes a single branch, typically master
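
The fetching step might look roughly like the sketch below, which uses GitHub's repository search endpoint with results sorted by stars. The helper names, the token handling, and the exact query string are assumptions for illustration; the analyzers may build their requests differently.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_search_request(keyword, language, token, page=1, per_page=100):
    """Build an authenticated GET request for the repository search endpoint.

    An empty keyword searches by language only, which combined with the
    star ordering gives the most-starred repos in that language.
    """
    terms = [keyword] if keyword else []
    terms.append(f"language:{language}")
    query = urllib.parse.urlencode({
        "q": " ".join(terms),
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    })
    request = urllib.request.Request(f"{API_ROOT}/search/repositories?{query}")
    request.add_header("Authorization", f"token {token}")
    request.add_header("Accept", "application/vnd.github+json")
    return request


def fetch_repo_names(request):
    """Send the request and return the 'full_name' of each repo in the JSON."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [item["full_name"] for item in payload["items"]]
```

With 100 results per page, the top 500 repositories in a language would take five pages, for example `build_search_request("", "python", token, page=n)` for `n` from 1 to 5.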

Analyzing the Repositories:

  • Python uses the AST library to analyze the repository JSON objects
  • Python’s results are stored in a dictionary
  • AST library handles all the error calculations
  • Java uses a set of custom classes to analyze different repository components
    • Curly braces, tabs vs spaces, naming, line length, etc
    • Java results are written to a JSON object
    • Only raw data, no calculations done
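
The Java analyzer's custom classes are not reproduced here. As a language-neutral illustration, written in Python to match the other examples on this page, the sketch below collects raw counts for a few of the components listed above. The check names and the `count_raw_style_data` helper are hypothetical; the 100-column limit comes from Google's Java Style Guide.

```python
MAX_COLUMNS = 100  # column limit stated in Google's Java Style Guide


def count_raw_style_data(java_source):
    """Count raw occurrences per check for one Java file; no ratios or scores."""
    counts = {
        "lines": 0,
        "long_lines": 0,
        "tab_indented": 0,
        "space_indented": 0,
        "brace_on_own_line": 0,
    }
    for line in java_source.splitlines():
        counts["lines"] += 1
        if len(line) > MAX_COLUMNS:
            counts["long_lines"] += 1
        # Only the first indentation character is inspected in this sketch.
        if line.startswith("\t"):
            counts["tab_indented"] += 1
        elif line.startswith(" "):
            counts["space_indented"] += 1
        # An opening brace alone on a line suggests a non-K&R brace style.
        if line.strip() == "{":
            counts["brace_on_own_line"] += 1
    return counts
```

Keeping the output as plain counts mirrors the "raw data only" approach: any percentages or verdicts can be computed later from the stored numbers.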

Formatting and Saving Analysis Results:

  • Python saves a Python dictionary directly into the Mongo database
    • No formatting work needed, the dictionary data structure is BSON compatible
  • Java produces a JSON object containing style results for each file in the repository
    • Summarizes all error violations for each file into one final JSON object
    • More computational overhead than Python
  • Final object with summary + individual analysis results saved
  • Both analyzers maintain an in-code connection to our MongoDB cloud instance
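
A minimal sketch of the summarize-and-save step follows, assuming per-file results arrive as a list of dictionaries with `path` and `violations` keys. Those field names, and the `summarize_repo` and `save_analysis` helpers, are illustrative and not the project's actual schema.

```python
from collections import Counter


def summarize_repo(repo_name, file_results):
    """Combine per-file violation counts into one document with a summary.

    File paths are stored as values rather than keys, since dots in field
    names are awkward to query in MongoDB.
    """
    totals = Counter()
    for entry in file_results:
        totals.update(entry["violations"])
    return {
        "repo": repo_name,
        "summary": dict(totals),
        "files": file_results,
    }


def save_analysis(collection, document):
    """Store the document; `collection` is expected to be a pymongo Collection."""
    return collection.insert_one(document)
```

With Pymongo, a collection can be obtained as `MongoClient(uri)[database_name][collection_name]`, where the URI and both names depend on the deployment. The resulting dictionary can be passed to `insert_one` as-is because it contains only BSON-compatible types.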

Deployment Instructions:
Refer to the respective Python and Java repository pages

Deployment Diagram
[Image: deployment diagram]

Webapp

Frontend:

  • React.js (v16.7.0) - a JavaScript library for building user interfaces
  • Node.js (v10.13.0) - JavaScript runtime built on Chrome's V8 JavaScript engine
  • Material UI (v3.8.3) - React components that implement Google's Material Design
  • Victory (v31.1.0) - React.js components for modular charting and data visualization

Backend:

  • Express.js (v4.16.4) - Fast, unopinionated, minimalist web framework
  • Mongoose (v5.4.6) - MongoDB object modeling (ODM) for Node.js
  • dotenv (v6.2.0) - Loads environment variables from .env file
  • Node.js (v10.13.0) - JavaScript runtime built on Chrome's V8 JavaScript engine

Hosting service:

  • Heroku - a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud

The backend uses Mongoose to connect to the project's MongoDB database and fetch the JSON documents that the Java and Python analyzers saved as analysis output. Express provides the REST API endpoints that serve this JSON data to the frontend. The frontend's React web client, in turn, takes the JSON data and presents the code style analysis with charts and other statistics about each analyzed repository.

Development/Build instructions: https://github.com/tonyc856/code-style-data-vis/blob/master/README.md

What we analyzed so far

How we selected a list of projects

All repositories chosen for analysis are open source. For each language, we queried GitHub’s API for repositories matching a set of keywords. We also included the top 500 repos by number of stars.

How many projects we have analyzed so far and made available in the database

So far, we have analyzed 1914 repositories for the Python component of the analysis. The Java analyzer lagged behind in development but was completed this quarter. We have a list of over two thousand repositories (by keyword and by popularity) that we intend to analyze in Java as well. These results will be displayed in a similar manner to the Python analysis results.
