Harsh1574/Car_Price_Predictor
🚗 Car Price Predictor — Project Report

Goal: Build a machine learning regression model to predict the resale price of used cars based on features like year, mileage, fuel type, transmission, seller type, ownership history, and car model.

Dataset: Car Dekho listings — 4,340 rows × 8 columns
Target Variable: selling_price (in INR)
Tools & Libraries: Python, Pandas, NumPy, Scikit-learn, XGBoost, Matplotlib


📋 Table of Contents

  1. Feature Engineering
  2. Model Training — Baseline Comparison
  3. Hyperparameter Tuning
  4. Final Results
  5. Key Learnings

🔧 Feature Engineering

Dataset Overview

The raw dataset contained 4,340 car listings with the following columns:

| Column | Type | Description |
|---|---|---|
| name | String | Full car name (brand + model + variant) |
| year | Integer | Year of manufacture |
| selling_price | Integer | Target: resale price in INR |
| km_driven | Integer | Total kilometres driven |
| fuel | Categorical | Petrol / Diesel / CNG / LPG / Electric |
| seller_type | Categorical | Individual / Dealer / Trustmark Dealer |
| transmission | Categorical | Manual / Automatic |
| owner | Categorical | First / Second / Third / Fourth & Above / Test Drive |

No missing values were found across all 8 columns.


Step 1 — Feature Extraction from Car Name

The name column contained the full car name (e.g. "Maruti Wagon R LXI Minor"). Two new structured features were extracted from it:

  • Company — the first word of the name (brand), e.g. Maruti, Hyundai, Honda
  • model — the second word of the name (model), e.g. Wagon, Verna, Amaze

The Company column was explored (29 unique brands found) but ultimately dropped to reduce dimensionality, keeping only model (185 unique values). The original name column was then dropped entirely.
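The extraction step above can be sketched with pandas string accessors; the two example names below are illustrative stand-ins for the real dataset:

```python
import pandas as pd

# Illustrative rows; the real dataset has 4,340 listings.
df = pd.DataFrame({"name": ["Maruti Wagon R LXI Minor", "Hyundai Verna 1.6 SX"]})

df["Company"] = df["name"].str.split().str[0]  # first word -> brand
df["model"] = df["name"].str.split().str[1]    # second word -> model

# As in the report: keep only model, drop Company and the raw name.
df = df.drop(columns=["name", "Company"])
print(df["model"].tolist())  # ['Wagon', 'Verna']
```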


Step 2 — Categorical Consolidation

The seller_type column had three categories: Individual, Dealer, and Trustmark Dealer. Since Trustmark Dealer had only 102 entries (vs 994 for Dealer), the two were merged into a single Dealer category — reducing noise and simplifying encoding.

Before: Individual (3244), Dealer (994), Trustmark Dealer (102)
After: Individual (3244), Dealer (1096)
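The merge is a one-line category replacement; the counts below reconstruct the distribution from the report:

```python
import pandas as pd

# Rebuild the seller_type distribution reported above.
s = pd.Series(["Individual"] * 3244 + ["Dealer"] * 994 + ["Trustmark Dealer"] * 102)

# Fold the rare Trustmark Dealer category into Dealer.
s = s.replace({"Trustmark Dealer": "Dealer"})
print(s.value_counts().to_dict())  # {'Individual': 3244, 'Dealer': 1096}
```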


Step 3 — Feature Type Classification

Features were classified into three groups to apply the most appropriate transformation to each:

| Type | Features | Transformation |
|---|---|---|
| Discrete / high-cardinality categorical | year, model | OrdinalEncoder |
| Low-cardinality categorical | fuel, seller_type, transmission, owner | OneHotEncoder (drop='first') |
| Continuous numerical | km_driven | StandardScaler |

Why this split?

  • year and model were ordinally encoded because their high cardinality would create too many columns with OHE
  • Low-cardinality columns (≤ 9 unique values) were one-hot encoded to avoid implying false ordinality
  • km_driven was scaled because it had a very wide range (1 to 806,599 km) with extreme outliers

Step 4 — Outlier Analysis

Descriptive statistics and box plots were generated for the numerical features:

| Feature | Min | Mean | Max | Std Dev |
|---|---|---|---|---|
| year | 1992 | 2013 | 2020 | 4.2 |
| km_driven | 1 | 66,216 | 806,599 | 46,644 |

km_driven showed significant positive skew with extreme outliers (max ~806K km). These were not removed: StandardScaler rescales the feature without discarding data, and tree-based models are inherently robust to outliers.


Step 5 — Preprocessing Pipeline

All transformations were chained into a single ColumnTransformer for clean, reproducible preprocessing:

```
ColumnTransformer
├── OrdinalEncoder      →  year, model
├── OneHotEncoder       →  fuel, seller_type, transmission, owner
└── StandardScaler      →  km_driven
```

The final preprocessed feature matrix had shape (4340, 13) — a compact representation ready for model training.
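The pipeline above can be sketched with scikit-learn's ColumnTransformer. The tiny DataFrame below uses illustrative values (the category levels are assumptions), so the output shape here is (3, 7) rather than the full dataset's (4340, 13):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Illustrative stand-in for the real dataset; column names follow the report.
df = pd.DataFrame({
    "year": [2014, 2018, 2010],
    "model": ["Wagon", "Verna", "Amaze"],
    "km_driven": [70000, 30000, 120000],
    "fuel": ["Petrol", "Diesel", "Petrol"],
    "seller_type": ["Individual", "Dealer", "Individual"],
    "transmission": ["Manual", "Automatic", "Manual"],
    "owner": ["First Owner", "First Owner", "Second Owner"],
})

preprocess = ColumnTransformer([
    ("ord", OrdinalEncoder(), ["year", "model"]),
    ("ohe", OneHotEncoder(drop="first"),
     ["fuel", "seller_type", "transmission", "owner"]),
    ("scale", StandardScaler(), ["km_driven"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (3, 7) on this toy frame
```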


🤖 Model Training — Baseline Comparison

The data was split 80% training / 20% testing (random_state=42), yielding:

  • Training set: 3,472 samples
  • Test set: 868 samples
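The split can be reproduced with train_test_split; placeholder arrays stand in for the preprocessed feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(4340 * 13).reshape(4340, 13)  # placeholder features
y = np.arange(4340)                         # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 3472 868
```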

Ten regression algorithms were trained on default (untuned) settings and evaluated using three metrics:

| Metric | What it measures |
|---|---|
| R² Score | Proportion of variance explained (higher = better, max 1.0) |
| MAE | Average absolute prediction error in INR (lower = better) |
| MSE | Mean squared error; penalises large errors heavily |
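All three metrics are available in scikit-learn; the prices below are made-up values just to show the calls:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true vs predicted selling prices in INR.
y_true = np.array([300000, 450000, 150000, 600000])
y_pred = np.array([320000, 430000, 180000, 550000])

print(round(r2_score(y_true, y_pred), 3))
print(mean_absolute_error(y_true, y_pred))  # average error in INR: 30000.0
print(mean_squared_error(y_true, y_pred))
```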

Baseline Results

| Model | Train R² | Test R² | Test MAE (INR) | Notes |
|---|---|---|---|---|
| Linear Regression | 0.470 | 0.401 | 222,782 | Underfits; a linear model is too simple |
| Ridge Regression | 0.470 | 0.401 | 222,714 | Marginal improvement over Linear |
| Lasso Regression | 0.470 | 0.401 | 222,781 | Similar to Linear |
| Decision Tree | 0.999 | 0.427 | 131,133 | Severe overfitting |
| Random Forest | 0.971 | 0.610 | 112,192 | Good generalisation |
| SVR | -0.073 | -0.064 | 304,312 | Failed; needs feature scaling and kernel tuning |
| K-Nearest Neighbors | 0.866 | 0.618 | 108,170 | Competitive test performance |
| AdaBoost | 0.551 | 0.322 | 232,567 | Weak-learner combination underperforms |
| Gradient Boosting | 0.852 | 0.587 | 146,813 | Strong candidate |
| XGBoost | 0.990 | 0.653 | 96,370 | Best baseline test performance |

Key Observations

  • Linear models (Linear, Ridge, Lasso) plateau at ~0.40 test R² — the car price relationship is non-linear and these models lack the capacity to capture it.
  • Decision Tree perfectly memorises training data (R² = 0.999) but collapses on test data (R² = 0.427) — classic overfitting.
  • SVR performed worst overall — it requires careful feature scaling and kernel tuning to work well on this data.
  • Top 3 candidates for tuning based on test R² and MAE: XGBoost (0.653), KNN (0.618), and Gradient Boosting (0.587).
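The baseline comparison boils down to one fit/score loop per model. The sketch below runs a reduced version with three scikit-learn models on synthetic regression data (a stand-in for the preprocessed matrix), not the actual listings:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 13 features, like the preprocessed matrix.
X, y = make_regression(n_samples=500, n_features=13, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (r2_score(y_test, pred), mean_absolute_error(y_test, pred))
    print(f"{name}: R2={results[name][0]:.3f}  MAE={results[name][1]:,.0f}")
```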

⚙️ Hyperparameter Tuning

Four models (XGBoost, KNN, Gradient Boosting, and AdaBoost) were tuned using two strategies:

Strategy 1 — RandomizedSearchCV (3-fold CV, 100 iterations)

A wide search across the parameter space to quickly identify promising regions.

Search Spaces:

| Model | Parameters Searched |
|---|---|
| KNN | n_neighbors, weights, algorithm |
| AdaBoost | n_estimators, learning_rate, loss |
| Gradient Boosting | n_estimators, learning_rate, max_depth, min_samples_split, min_samples_leaf |
| XGBoost | n_estimators, learning_rate, max_depth, min_child_weight, gamma |

Best Parameters Found (RandomizedSearchCV):

| Model | Best Parameters |
|---|---|
| KNN | weights=distance, n_neighbors=3, algorithm=brute |
| AdaBoost | n_estimators=50, learning_rate=0.01, loss=linear |
| Gradient Boosting | n_estimators=300, learning_rate=0.5, max_depth=3, min_samples_split=2, min_samples_leaf=1 |
| XGBoost | n_estimators=300, learning_rate=0.5, max_depth=3, min_child_weight=2, gamma=0 |
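As a runnable sketch of the strategy, the code below applies RandomizedSearchCV to the KNN space from the table (chosen here to avoid an XGBoost dependency). The data is synthetic and n_iter is reduced from the report's 100 iterations:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the preprocessed feature matrix.
X, y = make_regression(n_samples=300, n_features=13, noise=10, random_state=42)

# KNN search space from the table above.
param_dist = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

search = RandomizedSearchCV(
    KNeighborsRegressor(), param_dist,
    n_iter=20, cv=3, scoring="r2", random_state=42)
search.fit(X, y)
print(search.best_params_)
```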

Results After RandomizedSearchCV:

| Model | Train R² | Test R² | Test MAE (INR) | Δ vs Baseline |
|---|---|---|---|---|
| AdaBoost | 0.604 | 0.377 | 207,704 | +0.055 |
| Gradient Boosting | 0.980 | 0.661 | 101,511 | +0.074 ✅ |
| XGBoost | 0.974 | 0.673 | 100,179 | +0.020 ✅ |
| KNN | 0.999 | 0.665 | 84,766 | +0.047 ✅ |

Strategy 2 — GridSearchCV (5-fold CV, exhaustive search)

A thorough search across the exact same parameter grids for final confirmation.

Best Parameters Found (GridSearchCV):

| Model | Best Parameters |
|---|---|
| KNN | weights=distance, n_neighbors=3, algorithm=auto |
| AdaBoost | n_estimators=50, learning_rate=0.01, loss=linear |
| Gradient Boosting | n_estimators=300, learning_rate=0.5, max_depth=3, min_samples_split=2, min_samples_leaf=1 |
| XGBoost | n_estimators=300, learning_rate=0.5, max_depth=3, min_child_weight=2, gamma=0 |

GridSearchCV confirmed the same optimal parameters as RandomizedSearchCV for all models (only KNN's algorithm setting differed, a choice that affects speed rather than predictions), a strong sign of stability in these results.
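The exhaustive counterpart looks almost identical in code; this sketch grids a small KNN space on synthetic data, again as an assumption-laden stand-in for the report's full grids:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the preprocessed feature matrix.
X, y = make_regression(n_samples=300, n_features=13, noise=10, random_state=42)

# Reduced KNN grid; the report searched the same spaces as RandomizedSearchCV.
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "brute"],
}

grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_)
```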

Results After GridSearchCV:

| Model | Train R² | Test R² | Test MAE (INR) | Δ vs Baseline |
|---|---|---|---|---|
| AdaBoost | 0.599 | 0.371 | 207,093 | +0.049 |
| Gradient Boosting | 0.980 | 0.630 | 103,214 | +0.043 |
| XGBoost | 0.974 | 0.673 | 100,179 | +0.020 |
| KNN | 0.999 | 0.664 | 84,868 | +0.046 |

🏆 Final Results

Top 3 Models — Head to Head

| Model | Test R² | Test MAE (INR) | Train R² | Overfitting? |
|---|---|---|---|---|
| XGBoost | 0.673 | 100,179 | 0.974 | Moderate |
| KNN | 0.664 | 84,868 | 0.999 | High (memorises training data) |
| Gradient Boosting | 0.630 | 103,214 | 0.980 | Moderate |

Winner: XGBoost Regressor 🥇

Final Configuration:

```python
XGBRegressor(
    n_estimators     = 300,
    learning_rate    = 0.5,
    max_depth        = 3,
    min_child_weight = 2,
    gamma            = 0
)
```

Final Test Performance:

  • R² Score: 0.673 — explains ~67% of variance in car resale prices
  • MAE: ₹100,179 — predictions are off by ~₹1 lakh on average
  • MSE: 99,854,286,848

Why XGBoost?

  • Best balance of test R² and generalisation (train vs test gap is smaller than KNN)
  • KNN achieves a lower MAE (₹84K) but its near-perfect training R² (0.999) suggests it memorises the training set — a concern for production use
  • Gradient Boosting trails both on R² and MAE after tuning
  • XGBoost's regularisation parameters (min_child_weight, gamma) naturally control overfitting, making it the most deployable model

💡 Key Learnings

1. Feature engineering matters more than model choice.
Extracting Company and model from the raw name column, and thoughtfully choosing OrdinalEncoder vs OneHotEncoder per feature, gave tree-based models the structured information they needed to split on meaningful boundaries.

2. Tree-based ensemble models dominate tabular regression.
Linear models capped out at R² = 0.40. XGBoost and Gradient Boosting — which build hundreds of trees correcting each other's errors — nearly doubled that performance without any domain-specific feature engineering.

3. High train R² alone is not success — it's a red flag.
Decision Tree (train R²: 0.999, test R²: 0.427) and KNN (train R²: 0.999, test R²: 0.664) showed that memorising training data doesn't mean learning the underlying pattern. The gap between train and test R² is the real signal.

4. RandomizedSearchCV is an efficient first step before GridSearchCV.
Running RandomizedSearchCV first narrowed the search space with far fewer evaluations. GridSearchCV then confirmed those results exhaustively — both agreed on the same optimal parameters, validating the approach.

5. Two search strategies confirming the same result builds confidence.
When RandomizedSearchCV and GridSearchCV independently converge to identical best parameters, it strongly suggests those parameters represent a genuine optimum rather than a lucky random pick.

6. MAE is more interpretable than MSE for business problems.
MSE penalises large errors quadratically and is hard to explain to stakeholders. MAE in INR directly answers "how wrong is my prediction on average?" — making it the preferred metric for communicating model quality in a real-world pricing context.

About

🚗 Predicts used car resale prices using XGBoost (R² 0.673, MAE ₹1L). Built with feature engineering, baseline comparison of 10 regression models, and hyperparameter tuning via RandomizedSearchCV and GridSearchCV on the Car Dekho dataset.
