Submission:
- Please submit your project via GitHub and send a private message on Slack to both Dan and Ivan with a link to it.
In this project, you will implement the exploratory data analysis plan developed in Project 1. This will lay the groundwork for our modeling exercise in Project 3.
Before completing an analysis, it is critical to understand your data. You will need to identify all the biases of the variables in your model in order to accurately assess the strengths and limitations of your analysis and predictions.
Following these steps will help you better understand your dataset.
Objective: A Jupyter notebook writeup that provides a dataset overview with visualizations and statistical analysis.
Requirements:
- Read in your dataset, determine how many samples are present, and identify any missing data.
- Create a table of descriptive statistics for each of the variables (count, mean, standard deviation, ...).
- Describe the distributions of your data.
- Plot boxplots for each variable.
- Create a covariance matrix.
- Determine any issues or limitations based on your exploratory analysis.
- Outline exploratory analysis methods.
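The first few requirements above can be sketched with a handful of pandas calls. This is only an illustration under stated assumptions: the inline CSV text and its column names (`height_cm`, `weight_kg`, `age`) are hypothetical stand-ins, not the project dataset, and in your notebook you would pass your own file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in data; the real dataset and its columns will differ.
csv_text = """height_cm,weight_kg,age
170,65,30
182,,41
165,58,25
,72,38
175,80,
"""

# Read the dataset (swap the StringIO object for your own file path).
df = pd.read_csv(io.StringIO(csv_text))

# How many samples (rows) are present?
n_samples = len(df)

# How many values are missing in each column?
missing_per_column = df.isnull().sum()

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = df.describe()

# Covariance matrix of the numeric columns (missing values are excluded pairwise).
covariance = df.cov()

print(f"samples: {n_samples}")
print(missing_per_column)
print(summary)
print(covariance)
```

Comparing the `count` row of `summary` against `n_samples` is a quick way to spot columns with missing data, which is worth noting when you write up issues and limitations.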
The dataset is available here.
For this project we will be using a Jupyter notebook. The notebook uses matplotlib to plot and visualize the data. This type of visualization is handy for prototyping and quick data analysis. We will discuss more advanced data visualizations for disseminating your work.
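As a rough sketch of the kind of quick matplotlib plotting meant here, the example below draws a histogram and a boxplot for each column of a small DataFrame. The data and column names are hypothetical, and the layout (histograms on the top row, boxplots on the bottom) is just one reasonable choice, not something the starter code prescribes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data; replace with the DataFrame read from your dataset.
df = pd.DataFrame(
    {
        "height_cm": [170.0, 182.0, 165.0, 175.0, 168.0],
        "weight_kg": [65.0, 90.0, 58.0, 80.0, 72.0],
        "age": [30.0, 41.0, 25.0, 38.0, 33.0],
    }
)

# One column of plots per variable: histogram on top, boxplot underneath.
fig, axes = plt.subplots(nrows=2, ncols=len(df.columns), figsize=(9, 5))
for i, column in enumerate(df.columns):
    values = df[column].dropna()  # drop missing values before plotting
    axes[0, i].hist(values, bins=5)
    axes[0, i].set_title(column)
    axes[1, i].boxplot(values)
    axes[1, i].set_xticks([])  # the single box needs no x tick label
fig.tight_layout()
```

In a Jupyter notebook the figure is typically rendered below the cell once the cell finishes running.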
- Open the starter code notebook in Anaconda.
- Read in your dataset.
- Try out a few `pandas` commands for describing your data: `df.describe()`, `df['columnName'].sum()`, `df['columnName'].mean()`, `df['columnName'].count()`, `df.corr()`
- Read the documentation for `pandas`. Most of the time, there is a tutorial that you can follow; learning to read documentation is crucial to your success as a data scientist.
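To see what the commands listed above return, here is a small sketch run on a hypothetical two-column DataFrame. The column names `hours_studied` and `exam_score` are invented for illustration; substitute the column names from your own dataset for `'columnName'`.

```python
import pandas as pd

# Hypothetical stand-in data with one missing value in hours_studied.
df = pd.DataFrame(
    {
        "hours_studied": [1.0, 2.0, 3.0, 4.0, None],
        "exam_score": [52.0, 61.0, 70.0, 79.0, 90.0],
    }
)

# Table of summary statistics for every numeric column.
print(df.describe())

# Column-level aggregates; each skips missing values by default.
total_score = df["exam_score"].sum()        # 352.0
mean_score = df["exam_score"].mean()
hours_present = df["hours_studied"].count()  # 4 (the missing value is not counted)

# Pairwise Pearson correlation between numeric columns.
correlations = df.corr()

print(total_score, mean_score, hours_present)
print(correlations)
```

Note that `count()` reports non-missing values only, so it can differ from the number of rows returned by `len(df)`.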
Look at some sample notebooks for examples of the types of visualizations you can use in your notebook.
The rubric is available here.
