Goal: Build a machine learning regression model to predict the resale price of used cars based on features like year, mileage, fuel type, transmission, seller type, ownership history, and car model.
Dataset: Car Dekho listings — 4,340 rows × 8 columns
Target Variable: selling_price (in INR)
Tools & Libraries: Python, Pandas, NumPy, Scikit-learn, XGBoost, Matplotlib
- Feature Engineering
- Model Training — Baseline Comparison
- Hyperparameter Tuning
- Final Results
- Key Learnings
The raw dataset contained 4,340 car listings with the following columns:
| Column | Type | Description |
|---|---|---|
| name | String | Full car name (brand + model + variant) |
| year | Integer | Year of manufacture |
| selling_price | Integer | Target — resale price in INR |
| km_driven | Integer | Total kilometres driven |
| fuel | Categorical | Petrol / Diesel / CNG / LPG / Electric |
| seller_type | Categorical | Individual / Dealer / Trustmark Dealer |
| transmission | Categorical | Manual / Automatic |
| owner | Categorical | First / Second / Third / Fourth & Above / Test Drive |
No missing values were found across all 8 columns.
The name column contained the full car name (e.g. "Maruti Wagon R LXI Minor"). Two new structured features were extracted from it:
- Company — the first word of the name (brand), e.g. Maruti, Hyundai, Honda
- model — the second word of the name (model), e.g. Wagon, Verna, Amaze
The Company column was explored (29 unique brands found) but ultimately dropped to reduce dimensionality, keeping only model (185 unique values). The original name column was then dropped entirely.
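A minimal sketch of this extraction with pandas (the sample rows here are illustrative, mirroring the dataset's name column):

```python
import pandas as pd

# Toy frame mirroring the Car Dekho "name" column
df = pd.DataFrame({"name": ["Maruti Wagon R LXI Minor", "Hyundai Verna 1.6 SX"]})

# First word = brand (Company), second word = model
df["Company"] = df["name"].str.split().str[0]
df["model"] = df["name"].str.split().str[1]

# Company was explored but dropped; name was dropped entirely
df = df.drop(columns=["Company", "name"])
print(df)
```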
The seller_type column had three categories: Individual, Dealer, and Trustmark Dealer. Since Trustmark Dealer had only 102 entries (vs 994 for Dealer), the two were merged into a single Dealer category — reducing noise and simplifying encoding.
Before: Individual (3244), Dealer (994), Trustmark Dealer (102)
After: Individual (3244), Dealer (1096)
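The merge amounts to a single category replacement; a small sketch (toy values, real counts as reported above):

```python
import pandas as pd

# Toy seller_type column (real counts: Individual 3244, Dealer 994, Trustmark Dealer 102)
s = pd.Series(["Individual", "Dealer", "Trustmark Dealer", "Individual"])

# Fold the small Trustmark Dealer category into Dealer
s = s.replace({"Trustmark Dealer": "Dealer"})
print(s.value_counts())
```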
Features were classified into three groups to apply the most appropriate transformation to each:
| Type | Features | Transformation |
|---|---|---|
| Discrete / high-cardinality categorical | year, model | OrdinalEncoder |
| Low-cardinality categorical | fuel, seller_type, transmission, owner | OneHotEncoder (drop='first') |
| Continuous numerical | km_driven | StandardScaler |
Why this split?
- year and model were ordinally encoded because their high cardinality would create too many columns with OHE
- Low-cardinality columns (≤ 9 unique values) were one-hot encoded to avoid implying false ordinality
- km_driven was scaled because it had a very wide range (1 to 806,599 km) with extreme outliers
Descriptive statistics and box plots were generated for the numerical features:
| Feature | Min | Mean | Max | Std Dev |
|---|---|---|---|---|
| year | 1992 | 2013 | 2020 | 4.2 |
| km_driven | 1 | 66,216 | 806,599 | 46,644 |
km_driven showed significant positive skew with extreme outliers (max ~806K km). These were not removed: StandardScaler rescales the feature (though it does not clip outliers), and tree-based models split on thresholds, making them relatively robust to extreme feature values.
All transformations were chained into a single ColumnTransformer for clean, reproducible preprocessing:
ColumnTransformer
├── OrdinalEncoder → year, model
├── OneHotEncoder → fuel, seller_type, transmission, owner
└── StandardScaler → km_driven
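A sketch of that ColumnTransformer (the two-row frame is a stand-in; column names follow the dataset description):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler

# Tiny stand-in frame for illustration
X = pd.DataFrame({
    "year": [2014, 2018],
    "model": ["Wagon", "Verna"],
    "km_driven": [70000, 30000],
    "fuel": ["Petrol", "Diesel"],
    "seller_type": ["Individual", "Dealer"],
    "transmission": ["Manual", "Automatic"],
    "owner": ["First", "Second"],
})

preprocess = ColumnTransformer([
    ("ord", OrdinalEncoder(), ["year", "model"]),
    ("ohe", OneHotEncoder(drop="first"), ["fuel", "seller_type", "transmission", "owner"]),
    ("scale", StandardScaler(), ["km_driven"]),
])

Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

On the full dataset this chain produces the reported (4340, 13) matrix; on the toy frame above the one-hot columns are narrower because each category appears only once.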
The final preprocessed feature matrix had shape (4340, 13) — a compact representation ready for model training.
The data was split 80% training / 20% testing (random_state=42), yielding:
- Training set: 3,472 samples
- Test set: 868 samples
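The split itself is one call (shown here on placeholder arrays of the dataset's size):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's 4,340 rows and 13 features
X = np.zeros((4340, 13))
y = np.zeros(4340)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 3472 868
```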
Ten regression algorithms were trained on default (untuned) settings and evaluated using three metrics:
| Metric | What it measures |
|---|---|
| R² Score | Proportion of variance explained (higher = better, max 1.0) |
| MAE | Average absolute prediction error in INR (lower = better) |
| MSE | Mean squared error — penalises large errors heavily |
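All three metrics come straight from scikit-learn; a toy illustration (the prices below are made up):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Illustrative true vs predicted prices (INR)
y_true = np.array([300000, 450000, 150000, 600000])
y_pred = np.array([320000, 430000, 180000, 550000])

print("R2 :", r2_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))  # 30000.0
print("MSE:", mean_squared_error(y_true, y_pred))
```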
| Model | Train R² | Test R² | Test MAE (INR) | Notes |
|---|---|---|---|---|
| Linear Regression | 0.470 | 0.401 | 222,782 | Underfits — linear boundary too simple |
| Ridge Regression | 0.470 | 0.401 | 222,714 | Marginal improvement over Linear |
| Lasso Regression | 0.470 | 0.401 | 222,781 | Similar to Linear |
| Decision Tree | 0.999 | 0.427 | 131,133 | Severe overfitting |
| Random Forest | 0.971 | 0.610 | 112,192 | Good generalisation |
| SVR | -0.073 | -0.064 | 304,312 | Failed — needs feature scaling tuning |
| K-Nearest Neighbors | 0.866 | 0.618 | 108,170 | Competitive test performance |
| AdaBoost | 0.551 | 0.322 | 232,567 | Weak learner combination underperforms |
| Gradient Boosting | 0.852 | 0.587 | 146,813 | Strong candidate |
| XGBoost | 0.990 | 0.653 | 96,370 | Best baseline test performance |
- Linear models (Linear, Ridge, Lasso) plateau at ~0.40 test R² — the car price relationship is non-linear and these models lack the capacity to capture it.
- Decision Tree perfectly memorises training data (R² = 0.999) but collapses on test data (R² = 0.427) — classic overfitting.
- SVR performed worst overall — it requires careful feature scaling and kernel tuning to work well on this data.
- Top 3 candidates for tuning based on test R² and MAE: XGBoost (0.653), KNN (0.618), and Gradient Boosting (0.587).
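The baseline comparison boils down to a fit-and-score loop over a model dictionary; a sketch on synthetic data (a subset of the ten models, with make_regression standing in for the preprocessed matrix):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed feature matrix
X, y = make_regression(n_samples=500, n_features=13, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Tree": DecisionTreeRegressor(random_state=42),
    "Forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train R2={model.score(X_train, y_train):.3f} "
          f"test R2={model.score(X_test, y_test):.3f}")
```

Even on synthetic data the pattern from the table tends to reproduce: the unconstrained Decision Tree reaches a near-perfect train R² while generalising worse than the ensemble.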
Four models (XGBoost, KNN, Gradient Boosting, and AdaBoost, the last included despite its weak baseline) were tuned using two strategies:
RandomizedSearchCV: a wide search across the parameter space to quickly identify promising regions.
Search Spaces:
| Model | Parameters Searched |
|---|---|
| KNN | n_neighbors, weights, algorithm |
| AdaBoost | n_estimators, learning_rate, loss |
| Gradient Boosting | n_estimators, learning_rate, max_depth, min_samples_split, min_samples_leaf |
| XGBoost | n_estimators, learning_rate, max_depth, min_child_weight, gamma |
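The randomised search follows the same pattern for every model; a sketch using the Gradient Boosting row of the table (the grid values here are plausible examples, not the exact ones used):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X, y = make_regression(n_samples=300, n_features=13, noise=20, random_state=42)

# Search space mirroring the Gradient Boosting row above (illustrative values)
param_dist = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,          # sample only 10 random combinations
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```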
Best Parameters Found (RandomizedSearchCV):
| Model | Best Parameters |
|---|---|
| KNN | weights=distance, n_neighbors=3, algorithm=brute |
| AdaBoost | n_estimators=50, learning_rate=0.01, loss=linear |
| Gradient Boosting | n_estimators=300, lr=0.5, max_depth=3, min_samples_split=2, min_samples_leaf=1 |
| XGBoost | n_estimators=300, lr=0.5, max_depth=3, min_child_weight=2, gamma=0 |
Results After RandomizedSearchCV:
| Model | Train R² | Test R² | Test MAE (INR) | Δ vs Baseline |
|---|---|---|---|---|
| AdaBoost | 0.604 | 0.377 | 207,704 | +0.055 |
| Gradient Boosting | 0.980 | 0.661 | 101,511 | +0.074 ✅ |
| XGBoost | 0.974 | 0.673 | 100,179 | +0.020 ✅ |
| KNN | 0.999 | 0.665 | 84,766 | +0.047 ✅ |
GridSearchCV: an exhaustive search across the exact same parameter grids for final confirmation.
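GridSearchCV evaluates every combination in the grid rather than a random sample; a sketch using the KNN parameters listed above:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data
X, y = make_regression(n_samples=300, n_features=13, noise=20, random_state=42)

# Exhaustive grid over the KNN parameters searched (illustrative values)
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "brute"],
}

grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)  # all 3 * 2 * 2 = 12 combinations were evaluated
```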
Best Parameters Found (GridSearchCV):
| Model | Best Parameters |
|---|---|
| KNN | weights=distance, n_neighbors=3, algorithm=auto |
| AdaBoost | n_estimators=50, learning_rate=0.01, loss=linear |
| Gradient Boosting | n_estimators=300, lr=0.5, max_depth=3, min_samples_split=2, min_samples_leaf=1 |
| XGBoost | n_estimators=300, lr=0.5, max_depth=3, min_child_weight=2, gamma=0 |
GridSearchCV confirmed essentially the same optimal parameters as RandomizedSearchCV for all models (KNN differed only in algorithm, which affects neighbour-search speed, not predictions), a strong sign of stability in these results.
Results After GridSearchCV:
| Model | Train R² | Test R² | Test MAE (INR) | Δ vs Baseline |
|---|---|---|---|---|
| AdaBoost | 0.599 | 0.371 | 207,093 | +0.049 |
| Gradient Boosting | 0.980 | 0.630 | 103,214 | +0.043 |
| XGBoost | 0.974 | 0.673 | 100,179 | +0.020 |
| KNN | 0.999 | 0.664 | 84,868 | +0.046 |
| Model | Test R² | Test MAE (INR) | Train R² | Overfitting? |
|---|---|---|---|---|
| XGBoost | 0.673 | 100,179 | 0.974 | Moderate |
| KNN | 0.664 | 84,868 | 0.999 | High (memorises training data) |
| Gradient Boosting | 0.630 | 103,214 | 0.980 | Moderate |
Final Configuration:
```python
XGBRegressor(
    n_estimators=300,
    learning_rate=0.5,
    max_depth=3,
    min_child_weight=2,
    gamma=0,
)
```
Final Test Performance:
- R² Score: 0.673 — explains ~67% of variance in car resale prices
- MAE: ₹100,179 — predictions are off by ~₹1 lakh on average
- MSE: 99,854,286,848
Why XGBoost?
- Best balance of test R² and generalisation (train vs test gap is smaller than KNN)
- KNN achieves a lower MAE (₹84K) but its near-perfect training R² (0.999) suggests it memorises the training set — a concern for production use
- Gradient Boosting trails both on R² and MAE after tuning
- XGBoost's regularisation parameters (min_child_weight, gamma) naturally control overfitting, making it the most deployable model
1. Feature engineering matters more than model choice.
Extracting Company and model from the raw name column, and thoughtfully choosing OrdinalEncoder vs OneHotEncoder per feature, gave tree-based models the structured information they needed to split on meaningful boundaries.
2. Tree-based ensemble models dominate tabular regression.
Linear models capped out at R² = 0.40. XGBoost and Gradient Boosting — which build hundreds of trees correcting each other's errors — nearly doubled that performance without any domain-specific feature engineering.
3. High train R² alone is not success — it's a red flag.
Decision Tree (train R²: 0.999, test R²: 0.427) and KNN (train R²: 0.999, test R²: 0.664) showed that memorising training data doesn't mean learning the underlying pattern. The gap between train and test R² is the real signal.
4. RandomizedSearchCV is an efficient first step before GridSearchCV.
Running RandomizedSearchCV first narrowed the search space with far fewer evaluations. GridSearchCV then confirmed those results exhaustively — both agreed on the same optimal parameters, validating the approach.
5. Two search strategies confirming the same result builds confidence.
When RandomizedSearchCV and GridSearchCV independently converge to identical best parameters, it strongly suggests those parameters represent a genuine optimum rather than a lucky random pick.
6. MAE is more interpretable than MSE for business problems.
MSE penalises large errors quadratically and is hard to explain to stakeholders. MAE in INR directly answers "how wrong is my prediction on average?" — making it the preferred metric for communicating model quality in a real-world pricing context.
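The difference is easy to show numerically: a single large miss barely moves MAE but dominates MSE. A toy illustration (made-up error values in INR):

```python
import numpy as np

# Three typical errors and one outlier miss (INR)
errors = np.array([10000, 10000, 10000, 100000])

mae = np.mean(np.abs(errors))  # stays close to the typical error
mse = np.mean(errors ** 2)     # dominated by the single 100K miss

print(f"MAE: {mae:,.0f}")  # 32,500
print(f"MSE: {mse:,.0f}")  # 2,575,000,000
```

Here the outlier contributes roughly 97% of the MSE but only about three-quarters of the MAE.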