In this project, I carried out an exploratory analysis of a diamonds dataset with pandas, matplotlib, seaborn and Tableau. The main goal was to analyse how the different variables affect the price of a diamond and to build a model that predicts the price for a second dataset of diamonds for which we did not have the price.
The dataset has 40,455 diamonds and ten columns with their characteristics:
- Carat: weight of the diamond
- Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- Color: diamond colour, from J (worst) to D (best)
- Clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- Table: width of top of diamond relative to widest point (43--95)
- Price: price in USD
- x: length in mm
- y: width in mm
- z: depth in mm
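
The ordered categories and the depth formula above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the rows are made-up values, the `*_rank` and `depth_calc` column names are hypothetical, and the factor of 100 is an assumption based on depth being described as a percentage in the 43--79 range.

```python
import pandas as pd

# Illustrative rows only: made-up values, not taken from the project's data.
df = pd.DataFrame({
    "carat": [0.30, 0.90, 1.50],
    "cut": ["Ideal", "Good", "Premium"],
    "color": ["E", "J", "D"],
    "clarity": ["SI2", "I1", "IF"],
    "x": [4.00, 6.00, 7.00],
    "y": [4.00, 6.20, 7.00],
    "z": [2.48, 3.66, 4.41],
})

# Category orders from worst to best, as listed in the feature descriptions.
ORDERS = {
    "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    "color": list("JIHGFED"),
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
}

for col, order in ORDERS.items():
    # Ordered categoricals keep the quality ranking explicit.
    df[col] = pd.Categorical(df[col], categories=order, ordered=True)
    # Integer rank per category (0 = worst); hypothetical column name.
    df[f"{col}_rank"] = df[col].cat.codes

# Depth percentage recomputed from the dimensions: 100 * 2z / (x + y).
# The factor 100 is an assumption (see the lead-in).
df["depth_calc"] = 100 * 2 * df["z"] / (df["x"] + df["y"])

print(df[["cut_rank", "color_rank", "clarity_rank", "depth_calc"]])
```

Encoding the categories as ordered values lets them be sorted and compared by quality, and recomputing depth from `x`, `y` and `z` is one way to spot rows where the recorded depth and the dimensions disagree.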

The analysis followed these steps:

- Explore the dataset with pandas.
- Analyse the statistical values.
- Plots with seaborn and matplotlib for categorical and numerical variables.
- Hypothesis testing on grouped variables.
- Report with Tableau.
- Create new features.
- Create and try ML models and see which one performs better.
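
The last two steps (new features and model comparison) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic, the `volume` feature, the two scikit-learn models and the RMSE metric are stand-ins, and none of them is confirmed to be what the notebooks actually use.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the diamonds data (not the real dataset).
length = rng.uniform(4, 8, n)
df = pd.DataFrame({
    "x": length,
    "y": length + rng.normal(0, 0.05, n),
    "z": 0.62 * length + rng.normal(0, 0.05, n),
})
# Hypothetical price that grows with the size of the stone, plus noise.
df["price"] = 20 * df["x"] * df["y"] * df["z"] + rng.normal(0, 100, n)

# New feature (assumption): approximate volume from the three dimensions.
df["volume"] = df["x"] * df["y"] * df["z"]

features = df[["x", "y", "z", "volume"]]
target = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Candidate models to compare; the choice is illustrative only.
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and score it on the held-out rows with RMSE.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

Scoring every candidate on the same held-out split keeps the comparison fair; cross-validation would be a natural next step if the ranking between models turns out to be close.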

Project structure:

    └── ih_datamadpt1120_project_m2
        ├── .gitignore
        ├── README.md
        └── notebooks
            ├── data_analysis_report.ipynb
            └── ML-models.iypnb