This repository contains the final project for the Data Analytics Bootcamp at Ironhack.
The main purpose of the program is to build a clean dataset from data published on the US Department of Transportation website. The raw data is messy and is split into several categories by type of fluid handled and by time interval. The final dataset, with more than 18k incidents, is then explored and analysed to extract the main insights.
Check out the exploration dashboard at https://public.tableau.com/profile/eloy.gomez.caro.moreno#!/vizhome/ExplorationofPipelineIncidents/Dashboard1
The program also includes a machine learning model, trained on part of the original data, that predicts the total cost of a new pipeline incident whose details are supplied by the user.
- Python Programming
- Handling Pandas library
- BeautifulSoup
- Data Visualization using Tableau
- SciPy
- Scikit-Learn
Use the conda or pip package manager to install the libraries in the environment where you run the program. The required libraries are listed below:
conda install pandas
conda install numpy
conda install -c conda-forge requests
conda install beautifulsoup4
conda install scipy
conda install -c conda-forge scikit-learn
pip install lightgbm
Using Python, run main.py --path './data/raw/' from a terminal or an IDE (e.g. PyCharm).
The program will process the files located in '/data/raw' and then redirect you to my Tableau Public profile to check out the results of the exploration and analysis.
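For readers curious how a `--path` option like the one above is typically handled, here is a minimal sketch using the standard `argparse` module. It is not the actual code of main.py: the function names `parse_args` and `list_raw_files` are hypothetical, and only the `--path` flag itself comes from this README.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line. Only the --path flag is taken from the README."""
    parser = argparse.ArgumentParser(
        description="Process raw pipeline incident files."
    )
    parser.add_argument(
        "--path",
        required=True,
        help="Folder containing the raw files, e.g. './data/raw/'.",
    )
    return parser.parse_args(argv)


def list_raw_files(folder):
    """Return the regular files directly inside `folder`, sorted by name.

    Hypothetical helper: how main.py actually discovers its inputs
    is an assumption.
    """
    return sorted(p for p in Path(folder).iterdir() if p.is_file())


# Passing an explicit argument list mimics: main.py --path './data/raw/'
args = parse_args(["--path", "./data/raw/"])
print(args.path)
```

Making `--path` required avoids guessing a default folder; a real script would call `parse_args()` with no arguments so that the values come from the terminal.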
The ML model is trained automatically when the program runs, but the user must supply the new data for prediction. This data must be added to the file 'incidents_predict.csv', located in /data/results/. For example:

| id | FATAL | INJURE | UNINTENTIONAL_RELEASE_BBLS | ACCIDENT_PSIG | MOP_PSIG |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 236.0 | 2.0 | 100.0 |
| 1 | 0.0 | 0.0 | 42.0 | 65.0 | 10.0 |

| RECOVERED_BBLS | PIPE_DIAMETER | PIPE_SMYS | EX_HYDROTEST_PRESSURE |
|---|---|---|---|
| 25.0 | 6.0 | 25000.0 | 200.0 |
| 7.0 | 8.0 | 25000.0 | 120.0 |

| MANUFACTURED_YEAR | NORMAL_PSIG | ACCOMPANYING_LIQUID | SIGNIFICANT | SERIOUS |
|---|---|---|---|---|
| 2000.0 | 12.0 | 27.0 | YES | YES |
| 1989.0 | 78.0 | 12.0 | YES | NO |

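
The example rows above can also be written to the input file programmatically. The sketch below uses only the standard `csv` module. It assumes that the file is comma-separated, that `id` is stored as an ordinary first column, and that the columns appear in the order shown in the tables; none of this is confirmed by the program itself, and `write_prediction_input` is a hypothetical helper name.

```python
import csv
from pathlib import Path

# Column names copied from the example tables; their order in the real
# file is an assumption.
COLUMNS = [
    "id", "FATAL", "INJURE", "UNINTENTIONAL_RELEASE_BBLS", "ACCIDENT_PSIG",
    "MOP_PSIG", "RECOVERED_BBLS", "PIPE_DIAMETER", "PIPE_SMYS",
    "EX_HYDROTEST_PRESSURE", "MANUFACTURED_YEAR", "NORMAL_PSIG",
    "ACCOMPANYING_LIQUID", "SIGNIFICANT", "SERIOUS",
]

# The two example incidents from the tables, one list per row.
ROWS = [
    [0, 1.0, 0.0, 236.0, 2.0, 100.0, 25.0, 6.0, 25000.0, 200.0,
     2000.0, 12.0, 27.0, "YES", "YES"],
    [1, 0.0, 0.0, 42.0, 65.0, 10.0, 7.0, 8.0, 25000.0, 120.0,
     1989.0, 78.0, 12.0, "YES", "NO"],
]


def write_prediction_input(rows, target):
    """Write a header line plus `rows` to `target` as comma-separated values."""
    target = Path(target)
    # Create the destination folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(rows)
    return target
```

Calling `write_prediction_input(ROWS, "./data/results/incidents_predict.csv")` from the repository root would then produce the file in the folder mentioned above.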
As a result, the program will create a new file 'cost_prediction.csv' in the same folder.
