Goal: Build a machine learning regression model to predict the resale price of used cars based on features like year, mileage, fuel type, transmission, seller type, ownership history, and car model.
Dataset: Car Dekho listings — 4,340 rows × 8 columns
Target Variable: selling_price (in INR)
Tools & Libraries: Python, Pandas, NumPy, Scikit-learn, XGBoost, Matplotlib
- Feature Engineering
- Model Training — Baseline Comparison
- Hyperparameter Tuning
- Final Results
- Key Learnings
The raw dataset contained 4,340 car listings with the following columns:
| Column | Type | Description |
|---|---|---|
| name | String | Full car name (brand + model + variant) |
| year | Integer | Year of manufacture |
| selling_price | Integer | Target — resale price in INR |
| km_driven | Integer | Total kilometres driven |
| fuel | Categorical | Petrol / Diesel / CNG / LPG / Electric |
| seller_type | Categorical | Individual / Dealer / Trustmark Dealer |
| transmission | Categorical | Manual / Automatic |
| owner | Categorical | First / Second / Third / Fourth & Above / Test Drive |
No missing values were found across all 8 columns.
The name column contained the full car name (e.g. "Maruti Wagon R LXI Minor"). Two new structured features were extracted from it:
- Company — the first word of the name (brand), e.g. Maruti, Hyundai, Honda
- model — the second word of the name (model), e.g. Wagon, Verna, Amaze
The Company column was explored (29 unique brands found) but ultimately dropped to reduce dimensionality, keeping only model (185 unique values). The original name column was then dropped entirely.
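A minimal sketch of this extraction with pandas (the sample rows here are illustrative, mirroring the dataset's name column):

```python
import pandas as pd

# Toy frame mirroring the Car Dekho "name" column
df = pd.DataFrame({"name": ["Maruti Wagon R LXI Minor", "Hyundai Verna 1.6 SX"]})

# First word = brand (Company), second word = model
df["Company"] = df["name"].str.split().str[0]
df["model"] = df["name"].str.split().str[1]

# Company was explored but dropped; name was dropped entirely
df = df.drop(columns=["Company", "name"])
print(df)
```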
The seller_type column had three categories: Individual, Dealer, and Trustmark Dealer. Since Trustmark Dealer had only 102 entries (vs 994 for Dealer), the two were merged into a single Dealer category — reducing noise and simplifying encoding.
Before: Individual (3244), Dealer (994), Trustmark Dealer (102)
After: Individual (3244), Dealer (1096)
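The merge amounts to a single category replacement; a small sketch (toy values, real counts as reported above):

```python
import pandas as pd

# Toy seller_type column (real counts: Individual 3244, Dealer 994, Trustmark Dealer 102)
s = pd.Series(["Individual", "Dealer", "Trustmark Dealer", "Individual"])

# Fold the small Trustmark Dealer category into Dealer
s = s.replace({"Trustmark Dealer": "Dealer"})
print(s.value_counts())
```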
Features were classified into three groups to apply the most appropriate transformation to each:
| Type | Features | Transformation |
|---|---|---|
| Discrete / high-cardinality categorical | year, model | OrdinalEncoder |
| Low-cardinality categorical | fuel, seller_type, transmission, owner | OneHotEncoder (drop='first') |
| Continuous numerical | km_driven | StandardScaler |
Why this split?
- year and model were ordinally encoded because their high cardinality would create too many columns with OHE
- Low-cardinality columns (≤ 9 unique values) were one-hot encoded to avoid implying false ordinality
- km_driven was scaled because it had a very wide range (1 to 806,599 km) with extreme outliers
Descriptive statistics and box plots were generated for the numerical features:
| Feature | Min | Mean | Max | Std Dev |
|---|---|---|---|---|
| year | 1992 | 2013 | 2020 | 4.2 |
| km_driven | 1 | 66,216 | 806,599 | 46,644 |
km_driven showed significant positive skew with extreme outliers (max ~806K km). These were not removed: StandardScaler rescales the feature (though it does not clip outliers), and tree-based models split on thresholds, making them relatively robust to extreme feature values.
All transformations were chained into a single ColumnTransformer for clean, reproducible preprocessing:
ColumnTransformer
├── OrdinalEncoder → year, model
├── OneHotEncoder → fuel, seller_type, transmission, owner
└── StandardScaler → km_driven
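A sketch of that ColumnTransformer (the two-row frame is a stand-in; column names follow the dataset description):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler

# Tiny stand-in frame for illustration
X = pd.DataFrame({
    "year": [2014, 2018],
    "model": ["Wagon", "Verna"],
    "km_driven": [70000, 30000],
    "fuel": ["Petrol", "Diesel"],
    "seller_type": ["Individual", "Dealer"],
    "transmission": ["Manual", "Automatic"],
    "owner": ["First", "Second"],
})

preprocess = ColumnTransformer([
    ("ord", OrdinalEncoder(), ["year", "model"]),
    ("ohe", OneHotEncoder(drop="first"), ["fuel", "seller_type", "transmission", "owner"]),
    ("scale", StandardScaler(), ["km_driven"]),
])

Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

On the full dataset this chain produces the reported (4340, 13) matrix; on the toy frame above the one-hot columns are narrower because each category appears only once.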
The final preprocessed feature matrix had shape (4340, 13) — a compact representation ready for model training.
The data was split 80% training / 20% testing (random_state=42), yielding:
- Training set: 3,472 samples
- Test set: 868 samples
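The split itself is one call (shown here on placeholder arrays of the dataset's size):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's 4,340 rows and 13 features
X = np.zeros((4340, 13))
y = np.zeros(4340)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 3472 868
```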
Ten regression algorithms were trained on default (untuned) settings and evaluated using three metrics:
| Metric | What it measures |
|---|---|
| R² Score | Proportion of variance explained (higher = better, max 1.0) |
| MAE | Average absolute prediction error in INR (lower = better) |
| MSE | Mean squared error — penalises large errors heavily |
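All three metrics come straight from scikit-learn; a toy illustration (the prices below are made up):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Illustrative true vs predicted prices (INR)
y_true = np.array([300000, 450000, 150000, 600000])
y_pred = np.array([320000, 430000, 180000, 550000])

print("R2 :", r2_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))  # 30000.0
print("MSE:", mean_squared_error(y_true, y_pred))
```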
| Model | Train R² | Test R² | Test MAE (INR) | Notes |
|---|---|---|---|---|
| Linear Regression | 0.470 | 0.401 | 222,782 | Underfits — linear boundary too simple |
| Ridge Regression | 0.470 | 0.401 | 222,714 | Marginal improvement over Linear |
| Lasso Regression | 0.470 | 0.401 | 222,781 | Similar to Linear |
| Decision Tree | 0.999 | 0.427 | 131,133 | Severe overfitting |
| Random Forest | 0.971 | 0.610 | 112,192 | Good generalisation |
| SVR | -0.073 | -0.064 | 304,312 | Failed — needs feature scaling tuning |
| K-Nearest Neighbors | 0.866 | 0.618 | 108,170 | Competitive test performance |
| AdaBoost | 0.551 | 0.322 | 232,567 | Weak learner combination underperforms |
| Gradient Boosting | 0.852 | 0.587 | 146,813 | Strong candidate |
| XGBoost | 0.990 | 0.653 | 96,370 | Best baseline test performance |
- Linear models (Linear, Ridge, Lasso) plateau at ~0.40 test R² — the car price relationship is non-linear and these models lack the capacity to capture it.
- Decision Tree perfectly memorises training data (R² = 0.999) but collapses on test data (R² = 0.427) — classic overfitting.
- SVR performed worst overall — it requires careful feature scaling and kernel tuning to work well on this data.
- Top 3 candidates for tuning based on test R² and MAE: XGBoost (0.653), KNN (0.618), and Gradient Boosting (0.587).
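The baseline comparison boils down to a fit-and-score loop over a model dictionary; a sketch on synthetic data (a subset of the ten models, with make_regression standing in for the preprocessed matrix):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed feature matrix
X, y = make_regression(n_samples=500, n_features=13, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Tree": DecisionTreeRegressor(random_state=42),
    "Forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train R2={model.score(X_train, y_train):.3f} "
          f"test R2={model.score(X_test, y_test):.3f}")
```

Even on synthetic data the pattern from the table tends to reproduce: the unconstrained Decision Tree reaches a near-perfect train R² while generalising worse than the ensemble.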
Four models (XGBoost, KNN, Gradient Boosting, and AdaBoost, the last included despite its weak baseline) were tuned using two strategies:
RandomizedSearchCV: a wide search across the parameter space to quickly identify promising regions.
Search Spaces:
| Model | Parameters Searched |
|---|---|
| KNN | n_neighbors, weights, algorithm |
| AdaBoost | n_estimators, learning_rate, loss |
| Gradient Boosting | n_estimators, learning_rate, max_depth, min_samples_split, min_samples_leaf |
| XGBoost | n_estimators, learning_rate, max_depth, min_child_weight, gamma |
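The randomised search follows the same pattern for every model; a sketch using the Gradient Boosting row of the table (the grid values here are plausible examples, not the exact ones used):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X, y = make_regression(n_samples=300, n_features=13, noise=20, random_state=42)

# Search space mirroring the Gradient Boosting row above (illustrative values)
param_dist = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,          # sample only 10 random combinations
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```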
Best Parameters Found (RandomizedSearchCV):
| Model | Best Parameters |
|---|---|
| KNN | weights=distance, n_neighbors=3, algorithm=brute |
| AdaBoost | n_estimators=50, learning_rate=0.01, loss=linear |
| Gradient Boosting | n_estimators=300, lr=0.5, max_depth=3, min_samples_split=2, min_samples_leaf=1 |
| XGBoost | n_estimators=300, lr=0.5, max_depth=3, min_child_weight=2, gamma=0 |
Results After RandomizedSearchCV:
| Model | Train R² | Test R² | Test MAE (INR) | Δ vs Baseline |
|---|---|---|---|---|
| AdaBoost | 0.604 | 0.377 | 207,704 | +0.055 |
| Gradient Boosting | 0.980 | 0.661 | 101,511 | +0.074 ✅ |
| XGBoost | 0.974 | 0.673 | 100,179 | +0.020 ✅ |
| KNN | 0.999 | 0.665 | 84,766 | +0.047 ✅ |
GridSearchCV: an exhaustive search across the exact same parameter grids for final confirmation.
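GridSearchCV evaluates every combination in the grid rather than a random sample; a sketch using the KNN parameters listed above:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data
X, y = make_regression(n_samples=300, n_features=13, noise=20, random_state=42)

# Exhaustive grid over the KNN parameters searched (illustrative values)
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "brute"],
}

grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)  # all 3 * 2 * 2 = 12 combinations were evaluated
```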
Best Parameters Found (GridSearchCV):
| Model | Best Parameters |
|---|---|
| KNN | weights=distance, n_neighbors=3, algorithm=auto |
| AdaBoost | n_estimators=50, learning_rate=0.01, loss=linear |
| Gradient Boosting | n_estimators=300, lr=0.5, max_depth=3, min_samples_split=2, min_samples_leaf=1 |
| XGBoost | n_estimators=300, lr=0.5, max_depth=3, min_child_weight=2, gamma=0 |
GridSearchCV confirmed essentially the same optimal parameters as RandomizedSearchCV for all models (KNN differed only in algorithm, which affects neighbour-search speed, not predictions), a strong sign of stability in these results.
Results After GridSearchCV:
| Model | Train R² | Test R² | Test MAE (INR) | Δ vs Baseline |
|---|---|---|---|---|
| AdaBoost | 0.599 | 0.371 | 207,093 | +0.049 |
| Gradient Boosting | 0.980 | 0.630 | 103,214 | +0.043 |
| XGBoost | 0.974 | 0.673 | 100,179 | +0.020 |
| KNN | 0.999 | 0.664 | 84,868 | +0.046 |
| Model | Test R² | Test MAE (INR) | Train R² | Overfitting? |
|---|---|---|---|---|
| XGBoost | 0.673 | 100,179 | 0.974 | Moderate |
| KNN | 0.664 | 84,868 | 0.999 | High (memorises training data) |
| Gradient Boosting | 0.630 | 103,214 | 0.980 | Moderate |
Final Configuration:
```python
XGBRegressor(
    n_estimators=300,
    learning_rate=0.5,
    max_depth=3,
    min_child_weight=2,
    gamma=0,
)
```
Final Test Performance:
- R² Score: 0.673 — explains ~67% of variance in car resale prices
- MAE: ₹100,179 — predictions are off by ~₹1 lakh on average
- MSE: 99,854,286,848
Why XGBoost?
- Best balance of test R² and generalisation (train vs test gap is smaller than KNN)
- KNN achieves a lower MAE (₹84K) but its near-perfect training R² (0.999) suggests it memorises the training set — a concern for production use
- Gradient Boosting trails both on R² and MAE after tuning
- XGBoost's regularisation parameters (min_child_weight, gamma) naturally control overfitting, making it the most deployable model
1. Feature engineering matters more than model choice.
Extracting Company and model from the raw name column, and thoughtfully choosing OrdinalEncoder vs OneHotEncoder per feature, gave tree-based models the structured information they needed to split on meaningful boundaries.
2. Tree-based ensemble models dominate tabular regression.
Linear models capped out at R² = 0.40. XGBoost and Gradient Boosting — which build hundreds of trees correcting each other's errors — nearly doubled that performance without any domain-specific feature engineering.
3. High train R² alone is not success — it's a red flag.
Decision Tree (train R²: 0.999, test R²: 0.427) and KNN (train R²: 0.999, test R²: 0.664) showed that memorising training data doesn't mean learning the underlying pattern. The gap between train and test R² is the real signal.
4. RandomizedSearchCV is an efficient first step before GridSearchCV.
Running RandomizedSearchCV first narrowed the search space with far fewer evaluations. GridSearchCV then confirmed those results exhaustively — both agreed on the same optimal parameters, validating the approach.
5. Two search strategies confirming the same result builds confidence.
When RandomizedSearchCV and GridSearchCV independently converge to identical best parameters, it strongly suggests those parameters represent a genuine optimum rather than a lucky random pick.
6. MAE is more interpretable than MSE for business problems.
MSE penalises large errors quadratically and is hard to explain to stakeholders. MAE in INR directly answers "how wrong is my prediction on average?" — making it the preferred metric for communicating model quality in a real-world pricing context.
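The difference is easy to show numerically: a single large miss barely moves MAE but dominates MSE. A toy illustration (made-up error values in INR):

```python
import numpy as np

# Three typical errors and one outlier miss (INR)
errors = np.array([10000, 10000, 10000, 100000])

mae = np.mean(np.abs(errors))  # stays close to the typical error
mse = np.mean(errors ** 2)     # dominated by the single 100K miss

print(f"MAE: {mae:,.0f}")  # 32,500
print(f"MSE: {mse:,.0f}")  # 2,575,000,000
```

Here the outlier contributes roughly 97% of the MSE but only about three-quarters of the MAE.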