The code provided is a Python script designed to predict house prices using a machine learning model, specifically a Random Forest Regressor. Here’s a detailed breakdown of how the code works, leading to the final output, which is the predicted price for a new house based on user input.
-
Loading and Preparing Data:
• data = pd.read_csv('House Data.csv'): The house dataset is loaded from a CSV file named House Data.csv. • Converting numeric columns: The columns 'total_sqft', 'bath', 'balcony', and 'price' are converted to numeric data types using pd.to_numeric(errors='coerce'), meaning any invalid (non-numeric) data is replaced with NaN. • Dropping missing values: Any rows with NaN values are removed using data.dropna(), ensuring clean data for model training.
-
Separating Features and Target:
• X = data.drop('price', axis=1): All columns except 'price' are treated as features (X). • y = data['price']: The column 'price' is treated as the target variable (y), which the model will predict.
-
Splitting the Data:
• train_test_split(X, y, test_size=0.2, random_state=42): The dataset is split into training and testing sets, with 80% used for training and 20% for testing. The random_state=42 ensures reproducibility of the split.
-
Preprocessing (OneHot Encoding Categorical Variables):
• Categorical Features: Columns like 'area_type', 'availability', 'location', 'size', and 'society' are categorical (i.e., they contain text values). These need to be transformed into numeric values using one-hot encoding before being fed into the machine learning model. • Numeric Features: Columns such as 'total_sqft', 'bath', and 'balcony' are numeric and are passed through without any transformations. • ColumnTransformer: The preprocessing steps (numeric passthrough and categorical one-hot encoding) are handled by ColumnTransformer.
-
Model Setup:
• Pipeline: A pipeline is set up to chain two steps together:
- Preprocessing: The numeric features are passed through unchanged, and the categorical features are one-hot encoded.
- Random Forest Regressor: A Random Forest model with 100 trees (n_estimators=100) is used as the regressor to predict house prices.
-
Model Training:
• model.fit(X_train, y_train): The pipeline is trained on the training data (X_train and y_train).
-
Getting User Input:
• get_user_input(): This function prompts the user to input details about a new house they want to predict the price for. The input includes: • Area Type (e.g., Super built-up area, Plot area) • Availability (e.g., Ready to Move, Under Construction) • Location (e.g., Whitefield, Koramangala) • Size (e.g., 2BHK, 3BHK) • Society (Name of the housing society) • Total Square Feet • Number of Bathrooms • Number of Balconies
These inputs are collected into a DataFrame to be used for making predictions.
-
Making Predictions:
• Preprocessing User Input: The user-provided house data is preprocessed using the same steps as the training data. This ensures that the categorical variables are one-hot encoded, and numeric variables are handled correctly. • Prediction: The preprocessed data is fed into the trained Random Forest Regressor, which outputs a predicted price.
-
Output:
• The predicted house price is printed with two decimal places. For example:
Predicted Price: 95.75
Details about the output:
• The output represents the predicted price of the house based on the model’s understanding of the relationships between house features (like square footage, number of bathrooms, location, etc.) and price, learned from the training data.
• The predicted price is a continuous value, meaning the model predicts a specific numerical value (e.g., 95.75) for the house’s price based on the input features.
• The final printed value (predicted_price[0]: .2f) ensures the price is shown to two decimal places for clarity.
Example Walkthrough:
If the user enters the following information:
Area Type (e.g., Super build-up Area): Super built-up Area Availability (e.g., Ready to Move): Ready to Move Location (e.g., Electronic city Phase II): Whitefield Size (e.g., 2BHK): 3BHK Society: Prestige Lakeside Habitat Total Square Feet : 1500 Number of Bathrooms : 3 Number of Balconies : 2
The model will use these inputs to predict the house price based on patterns learned during training. For example, the output might look like:
Predicted Price: 120.50
This would indicate that the model predicts a price of 120.50 (in lakhs or another currency unit depending on the dataset) for a house with the given characteristics.