IronRegression: King County Housing Price Prediction

Overview

This project is part of the Ironhack Data Analysis Course (August 2024 - February 2025). In the project, we analyze a dataset of house sale prices for King County, including Seattle. Our main goal is to analyze which features best explain the prices of houses. We do this with a regression analysis.

Data

Here, you can download the dataset which can also be found in this repository. The dataset contains 21 columns and 21,613 rows and the data points span over one year from May 2014 to May 2015.

Repository Structure

.
├── images                                      # data folder: project image/graph files
├── all-code_house-price_IronRegression.ipynb   # all code, EDA, regression models
├── king_ country_ houses_aa.csv                # original dataset
├── presentation                                # pdf with the project presentation
└── README.md

Approach

Out approach to first clean the data, then do some basic and more advanced EDA to understand the structures and relations more deeply. After that, we aimed to build a regression model which explains the target "price" the best.

EDA

General:

There were no null values, no "real" duplicates and no empty spaces in our data. All our data types were numerical, apart from the date column which we changed to datetime.
The dataset contains a mix of numerical and categorial data (all encoded in numbers). Continuous and discrete numerical as well as ordinal and nominal categorial.

Distribution

Most of our data showed little signs of normal distribution.
Many columns proved to have many outliers.
Out target price: The distribution is very positively skewed, there are indications of many outliers.

Correlation:

A simple Pearson correlation on all features (excluding date and id) gave us five features with a higher correlation than 0.5 and two with very high correlation.
Visualizations and more correlation calculations supported the first findings but also gave some contradicting results.

Multicollinearity:

The important features are almost all highly correlated with one another.

Regression models

Because of our findings in the EDA, we decided to keep all features (apart from date and id) for a baseline model. Strictly interpreted, based on correlation and multicollinearity, only one feature would be left as train for the regression model. This seemed unuseful for us, so we decided to got with everything first and then reduce/ engineer.

For the baseline model, we used for different apporaches: Linear, Ridge, Random Forest and XGBoost Regression.

To try to improve the model, we then (among others) took the following steps:

Testing different sets of features based on different conditions.
Normalising the scales of the features.
Oversampling one feature.
Checking Feature Importance and on those grounds defining two new sets of features.
Eliminating Outliers from the Train (before modeling)

Summary

Finally we found that by eliminating the outliers before modeling, we could improve the metrics dramatically. Yet, this has the disadvantage of not having analyzed exactely these. For the best prediction model including the most data, we identified using XGBoost, all features and normalized scales.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
Images		Images
.gitignore		.gitignore
README.md		README.md
all-code_house-price_IronRegression.ipynb		all-code_house-price_IronRegression.ipynb
king_ country_ houses_aa.csv		king_ country_ houses_aa.csv
presentation.pdf		presentation.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IronRegression: King County Housing Price Prediction

Overview

Data

Repository Structure

Approach

EDA

General:

Distribution

Correlation:

Multicollinearity:

Regression models

Summary

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

IronRegression: King County Housing Price Prediction

Overview

Data

Repository Structure

Approach

EDA

General:

Distribution

Correlation:

Multicollinearity:

Regression models

Summary

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages