This documentation covers four projects that Cristiana Gewerc and I developed for the Data Wrangling unit of Monash MDS. The main topics covered in those projects were:
- parse data in the required format;
- assess the quality of data for problem identification;
- resolve data quality issues so the data is ready for the analysis process;
- integrate data sources for data enrichment;
- document the wrangling process for professional reporting;
- write program scripts for data wrangling processes.
In a nutshell, the projects were developed in Python 3 using Jupyter Notebook and cover:
parsing-data
Extracts data from semi-structured text files using only the re and pandas libraries. Takes a TXT file as input and generates a JSON file and a CSV file.
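
As a rough illustration of the idea, the sketch below parses a few semi-structured lines with `re` and serialises the result with pandas. The input layout, the field names (`id`, `name`, `score`) and the helper `parse_records` are hypothetical and are not taken from the project's notebooks.

```python
import re

import pandas as pd

# Hypothetical semi-structured input; the real TXT layout is not shown here.
RAW_TEXT = """\
ID: 001 | Name: Alice | Score: 91.5
ID: 002 | Name: Bob | Score: 78
garbage line that does not match
ID: 003 | Name: Carol | Score: 88.25
"""

# Named groups become the column names of the resulting table.
RECORD_PATTERN = re.compile(
    r"ID:\s*(?P<id>\d+)\s*\|\s*"
    r"Name:\s*(?P<name>[^|]+?)\s*\|\s*"
    r"Score:\s*(?P<score>\d+(?:\.\d+)?)"
)


def parse_records(text: str) -> pd.DataFrame:
    """Collect every matching record; lines that do not match are skipped."""
    rows = [match.groupdict() for match in RECORD_PATTERN.finditer(text)]
    frame = pd.DataFrame(rows, columns=["id", "name", "score"])
    frame["score"] = frame["score"].astype(float)  # captured groups are strings
    return frame


records = parse_records(RAW_TEXT)
csv_text = records.to_csv(index=False)          # CSV output as a string
json_text = records.to_json(orient="records")   # JSON output as a string
```

Passing a file path to `to_csv` and `to_json` instead of leaving it out would write the two output files directly.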
text-preprocessing
Extraction of a set of published papers from an unstructured format, followed by preprocessing and conversion into numerical representations.
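
One simple numerical representation is a bag-of-words count vector. The sketch below uses only the standard library and assumes a toy corpus, a tiny illustrative stopword set, and hypothetical helpers (`tokenize`, `build_vocabulary`, `vectorize`); the representation actually used in the project may differ.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the extracted paper texts.
DOCUMENTS = [
    "Data wrangling cleans the data.",
    "Text preprocessing turns text into numbers.",
]
STOPWORDS = {"the", "into"}  # illustrative only, not the project's list


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOPWORDS]


def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    """Return the token counts of one document as a fixed-length list."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for tok, count in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


vocabulary = build_vocabulary(DOCUMENTS)
vectors = [vectorize(doc, vocabulary) for doc in DOCUMENTS]
```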
cleansing-raw-data
Outlier analysis and removal, missing-data imputation, and fixing of data anomalies.
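
A minimal sketch of two of these steps, assuming the common 1.5 × IQR rule for outliers and median imputation for missing values. The `price` column, its values and the helper `iqr_bounds` are hypothetical; the project may have used other criteria.

```python
import pandas as pd

# Hypothetical column with one missing value and one obvious outlier.
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, None, 13.0, 250.0, 9.0]})


def iqr_bounds(series, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = iqr_bounds(df["price"])

# Comparisons with NaN are False, so missing rows are kept for imputation.
is_outlier = (df["price"] < low) | (df["price"] > high)
cleaned = df.loc[~is_outlier].copy()

# Impute after outlier removal so the outlier does not skew the median.
cleaned["price"] = cleaned["price"].fillna(cleaned["price"].median())
```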
data-integration-reshaping
Integration of multiple data sources, including web-scraped data, XML files, Shapefiles, TXT, GTFS data, CSV and XLSX.
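
The core of such an integration is usually a keyed join. The sketch below joins two small made-up tables with `DataFrame.merge` and flags rows that found no match; the column names and values are hypothetical, and the other formats (XML, Shapefiles, GTFS, XLSX) would first need to be loaded into DataFrames with their own readers.

```python
import io

import pandas as pd

# Hypothetical CSV source, inlined so the example is self-contained.
properties_csv = io.StringIO(
    "property_id,suburb,price\n"
    "1,Clayton,800000\n"
    "2,Caulfield,1200000\n"
    "3,Unknown,500000\n"
)
properties = pd.read_csv(properties_csv)

# Hypothetical second source, e.g. derived from another file format.
suburbs = pd.DataFrame(
    {"suburb": ["Clayton", "Caulfield"], "station_count": [1, 2]}
)

# A left join keeps every property; indicator=True adds a "_merge" column
# that records whether each row found a match in the second source.
merged = properties.merge(suburbs, on="suburb", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "property_id"].tolist()
```

Inspecting `unmatched` after each join is a cheap way to spot key mismatches before they propagate as missing values.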
