This project aims to predict house prices using a variety of machine learning models. The dataset consists of housing features, including numerical and categorical variables, and a target variable representing house prices.
• The dataset is loaded from CSV files (train.csv and test.csv).
• The train and test datasets are combined for preprocessing.
• Basic data overview: shape, data types, missing values, and quantile statistics.
• Identification of categorical, numerical, and high-cardinality categorical features.
• Summary statistics and visualizations for both categorical and numerical variables.
• Correlation analysis and feature relationships.
• Handling outliers by identifying and replacing extreme values.
• Managing missing values through imputation strategies (e.g., filling missing categorical values with "No").
• Encoding categorical variables for machine learning models.
• Multiple regression models are trained, including Linear Regression, Decision Tree, Random Forest, Gradient Boosting, K-Nearest Neighbors (KNN), XGBoost, and LightGBM.
• Performance evaluation is done using Mean Squared Error (MSE).
• Hyperparameter tuning is performed using GridSearchCV to optimize model performance.
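The steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual code: it uses a small synthetic dataset in place of train.csv and test.csv, and the column names (`LotArea`, `GarageType`, `SalePrice`) are hypothetical. Clipping outliers to IQR-based bounds, one-hot encoding, and the 75/25 validation split are assumptions chosen for the example, and only two scikit-learn models are trained so that the block runs without XGBoost or LightGBM installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n_train, n_test = 200, 50


def make_frame(n, with_target):
    """Build a synthetic stand-in for one CSV file (hypothetical columns)."""
    garage = pd.Series(rng.choice(["Attchd", "Detchd"], n))
    df = pd.DataFrame({
        "LotArea": rng.normal(10_000, 2_000, n),
        # Blank out roughly 20% of the categorical values.
        "GarageType": garage.mask(rng.random(n) < 0.2),
    })
    if with_target:
        df["SalePrice"] = 50_000 + 12 * df["LotArea"] + rng.normal(0, 5_000, n)
    return df


train = make_frame(n_train, with_target=True)
test = make_frame(n_test, with_target=False)
train.loc[0, "LotArea"] = 200_000  # inject one extreme value

# Combine train and test so both receive identical preprocessing.
combined = pd.concat([train, test], keys=["train", "test"])

# Outliers: clip to IQR-based bounds (one possible replacement rule).
q1, q3 = combined["LotArea"].quantile([0.25, 0.75])
iqr = q3 - q1
combined["LotArea"] = combined["LotArea"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Missing values: fill missing categorical entries with "No".
combined["GarageType"] = combined["GarageType"].fillna("No")

# Encoding: one-hot encode the categorical column.
encoded = pd.get_dummies(combined, columns=["GarageType"], dtype=float)

# Split back into train and test, then hold out a validation set.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"].drop(columns="SalePrice")
X = train_enc.drop(columns="SalePrice")
y = train_enc["SalePrice"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Train several regressors and compare validation MSE.
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_squared_error(y_val, model.predict(X_val))
print(scores)

# Tune one model with GridSearchCV (scikit-learn maximizes, hence negative MSE).
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, mean_squared_error(y_val, grid.predict(X_val)))

# Predict prices for the unlabeled test rows.
test_pred = grid.predict(test_enc)
```

Because the two frames are concatenated before encoding, the train and test matrices end up with the same columns in the same order. Note that statistics computed on the combined frame (here, the quantiles used for clipping) also draw on test rows, which is worth keeping in mind when interpreting validation scores.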
The goal is to develop a robust machine learning model that accurately predicts house prices from these features. To that end, the project compares different algorithms, feature engineering techniques, and evaluation metrics.