We tackled the Kaggle challenge 'House Prices - Advanced Regression Techniques'. The goal is to predict the sale prices of suburban houses in Ames, Iowa (USA) using machine learning regression models.
The dataset is provided by the Kaggle competition. It contains 43 categorical and 36 numerical variables describing the characteristics of residential homes in Ames, Iowa (USA).
The data description file can be found here
- Dataset exploration
- Selection of features (for numerical and categorical variables)
- Data cleaning
- Feature engineering
- Training and testing the model
- Improving Predictions
- Final testing
- Either clone the repository or download the files
- Install requirements (requirements.txt)
- Download the dataset from Kaggle
- Open the notebook: House-prices-Advanced-Regression/main.ipynb
- Run the notebook
- Data visualization: correlation matrix, histograms, scatter plots, bar charts - [Matplotlib, Seaborn]
- Feature tweaking: masked variables, one-hot encoding, grouping, and new feature creation (neighborhood mean prices)
- Standardization: StandardScaler
- PCA (principal component analysis)
- PyCaret
- Random Forest regressor
- Hyperparameter tuning: grid search
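
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the tiny DataFrame is made up, and the column choices (`Neighborhood`, `GrLivArea`, `OverallQual`, `SalePrice`) and the new column name `NeighborhoodMeanPrice` are assumptions made for the example.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample; the real notebook would load the Kaggle data instead.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr", "OldTown", "OldTown"],
    "GrLivArea": [1200, 1400, 1800, 2000, 900, 1100],
    "OverallQual": [5, 6, 7, 8, 4, 5],
    "SalePrice": [140000, 160000, 210000, 230000, 100000, 120000],
})

# New feature: mean sale price of each neighborhood.
# On real data, compute this on the training split only to avoid target leakage.
df["NeighborhoodMeanPrice"] = df.groupby("Neighborhood")["SalePrice"].transform("mean")

# One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["Neighborhood"], dtype=float)

# Standardize the features, then reduce dimensionality with PCA.
X = encoded.drop(columns="SalePrice")
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

print(encoded.columns.tolist())
print(X_pca.shape)
```

The number of PCA components (2) is arbitrary here; on the full dataset it would normally be chosen from the explained variance.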
The model we selected, after running PyCaret, was RandomForestRegressor.
These were the hyperparameters:
- n_estimators=100
- max_leaf_nodes=40
- max_depth=10
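
As a rough sketch of how such a model could be tuned and evaluated with scikit-learn, the example below runs a small grid search and then fits a `RandomForestRegressor` with the hyperparameters listed above. The synthetic data from `make_regression`, the grid values, the 80/20 split, and the use of R² as the score are all assumptions made for illustration; they are not taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared feature matrix and sale prices.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid built around the reported values.
param_grid = {
    "n_estimators": [50, 100],
    "max_leaf_nodes": [20, 40],
    "max_depth": [5, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)
print(search.best_params_)

# Fit with the hyperparameters listed above and compare train and test R^2.
model = RandomForestRegressor(
    n_estimators=100, max_leaf_nodes=40, max_depth=10, random_state=0
)
model.fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.2f}, test R^2: {test_r2:.2f}")
```

Limiting `max_leaf_nodes` and `max_depth` constrains how far each tree can grow, which tends to keep the gap between train and test scores small.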
| Split | Score |
|---|---|
| Test | 0.85 |
| Train | 0.88 |
Ironbuddies
Felipe de Ávila Granja Linkedin Kaggle
Luc fley My other projects
