A repository for all my data-related projects.
Data analysis of the 2019 Stack Overflow Developer survey results in four different tools:
- Microsoft Excel: data transformation and creation of two dashboards
- Microsoft Power BI: usage of the data transformed in Excel to recreate the same dashboards
- Jupyter Notebook (Python): complete analysis from scratch with NumPy, pandas and Plotly
- R Notebook: the Python analysis replicated in R code
The source dataset can be downloaded here (the file is too large to include in the repository).
You can view the Excel version of the analysis here.
The Power BI report is available as a PDF in this folder of the repository, or as a downloadable .pbix file in this folder.
You can view the Python version of the analysis as an HTML rendering of the Jupyter Notebook here.
The R notebook is available here too.
Data analysis of an anime dataset found on Kaggle. The dataset contains anime listings, the studios responsible for the animation, genres, warnings, etc. The project was divided into two parts:
- Microsoft Power BI data analysis: a straightforward data analysis in Microsoft Power BI
- anime_db: creation of a PostgreSQL database and data analysis in Python, using the tables created in the Power BI analysis (exported from Power BI as CSV files). The psycopg2 library was used to perform database operations, and pandas, NumPy and Plotly were used for the data analysis.
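
As a rough illustration of the CSV-to-database step described above (not the code from the notebooks), the sketch below loads one exported CSV file into a table through Python's DB-API. It assumes a hypothetical `studios` table and made-up sample rows, and it uses the standard-library `sqlite3` module with an in-memory database as a stand-in so that it runs without a PostgreSQL server. psycopg2 follows the same connect/cursor/execute pattern, but expects `%s` placeholders instead of `?`.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for one CSV file exported from Power BI.
STUDIOS_CSV = """studio_id,studio_name
1,Studio A
2,Studio B
"""


def load_csv_into_table(conn, csv_text, table, columns):
    """Insert every CSV row into `table`; return the number of rows inserted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [tuple(record[col] for col in columns) for record in reader]
    # sqlite3 uses "?" placeholders; with psycopg2 this would be "%s".
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names are interpolated, so they must come from
    # trusted code, never from user input.
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    cur = conn.cursor()
    cur.executemany(sql, rows)
    conn.commit()
    return len(rows)


# In-memory SQLite database used only so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE studios (studio_id INTEGER, studio_name TEXT NOT NULL)")

inserted = load_csv_into_table(conn, STUDIOS_CSV, "studios", ["studio_id", "studio_name"])
studio_names = [
    name for (name,) in conn.execute("SELECT studio_name FROM studios ORDER BY studio_id")
]
```

Passing the values as parameters, rather than formatting them into the SQL string, leaves quoting and escaping to the driver.
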
The original dataset is available on Kaggle here.
The results of the Power BI analysis are available here (a PDF of the report).
The results of the anime_db work are available here (two Jupyter Notebooks and the database ERD). You can read the first notebook here and the second here. The first covers the data engineering part, that is, writing the data to the database. The second notebook is the actual data analysis.
Data analysis of the 2020 Stack Overflow Developer survey results, following the same approach as the 2019 project. In my opinion, it is the better of the two, given the knowledge and skills I've acquired since the first analysis.
So far, the data analysis of the 2020 Stack Overflow Developer survey has been done in:
- Microsoft Power BI: data transformations in Power Query and creation of four dashboards (general data, technology-related, professional status, and other data)
- Microsoft Excel: an analysis similar to the Power BI one, but with only two dashboards that present mostly the same information in a different visual arrangement
- Python/PostgreSQL: a two-phase project. The first is data engineering oriented, where I pre-processed the dataset, created a PostgreSQL database using the psycopg2 driver and then inserted the data. The second phase is the proper data analysis, using Plotly for the visualization, similar to what I did in Excel and Power BI
- R: the Python data analysis replicated in R, with the data extracted from the same database I had created
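
As a rough illustration of the two Python/PostgreSQL phases listed above (not the code from the notebooks), the sketch below pre-processes a multi-select survey answer into one row per answer, inserts the rows into a table, and then runs an aggregate query of the kind a chart could be built on. The table name, column names and sample rows are hypothetical, the semicolon separator is an assumption about how multi-select answers are stored, and `sqlite3` with an in-memory database again stands in for PostgreSQL and psycopg2 so that the example runs on its own.

```python
import sqlite3

# Hypothetical sample shaped like survey data: one row per respondent, with
# multi-select answers stored as a single semicolon-separated string.
RAW_RESPONSES = [
    (1, "Python;SQL"),
    (2, "R;SQL"),
    (3, None),  # respondent who skipped the question
    (4, "Python"),
]


def split_multiselect(responses):
    """Yield one (respondent, answer) pair per selected answer."""
    for respondent, answer in responses:
        if not answer:
            continue  # skip missing answers
        for item in answer.split(";"):
            item = item.strip()
            if item:
                yield respondent, item


# Phase 1 (data engineering): create the table and insert the tidy rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE respondent_language ("
    "respondent INTEGER NOT NULL, language TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO respondent_language (respondent, language) VALUES (?, ?)",
    split_multiselect(RAW_RESPONSES),
)
conn.commit()

# Phase 2 (analysis): count respondents per language, most common first.
language_counts = conn.execute(
    "SELECT language, COUNT(*) AS n "
    "FROM respondent_language "
    "GROUP BY language "
    "ORDER BY n DESC, language"
).fetchall()
```

Storing one row per answer keeps the counting in SQL, so the same query can be reused from another client such as R.
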
The source dataset can be downloaded here (the file is too large to include in the repository).
You can find the results of the Power BI report as a PDF here and as a downloadable .pbix file here.
You can find the Excel file version of the data analysis, as well as screenshots of the resulting dashboards, here.
For the Python (and PostgreSQL) part, I divided the work into two Jupyter Notebooks. The first notebook covers the data engineering part, and the second covers the data analysis part. You can read the first notebook here and the second notebook here. Both notebooks are also available in this repository.
For the R part, you can read the notebook online here, and find all the code in this repository.
Smaller demos created for specific purposes, such as how to perform a certain data transformation in Python or a data analysis in Power BI, as well as the resulting files of tutorials I've completed. The link to each original dataset can be found in the respective "source.txt" file. Some examples of the demos available so far: